How I managed to scrape over 100k properties from a Spanish real estate website

When I started this project, I had no idea how to scrape the web. I knew what was needed to process the data obtained from a response and store it, say, in a pandas DataFrame, but I had no clue how to make requests consistently without getting banned or blacklisted. Fortunately, I ended up figuring out a method that made the process really simple. Following a few simple steps, I managed to get information on over one hundred thousand properties within the Spanish territory!

First of all
There are three important aspects to consider when scraping the web in order to avoid getting blacklisted:

Installation process
To achieve the things I mentioned above, I used:

Libraries
We just need a few Python libraries. Some of them ship with Python; others have to be installed. I installed these from the command line:

pip install pysocks
pip install requests

(Note that we are only writing code to acquire the HTML files, saved as .txt, where the information we are looking for is stored. How I processed those files to obtain the data points I was looking for is covered in this post.)

To install tor:

After installing tor, let it run in the background with "tor &"

Note the line where tor reports opening a "Socks listener on 127.0.0.1:9050".
This is the information of the proxy we are going to use.
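Before going further, it can be worth confirming that something is actually accepting connections on that address and port. The following is a minimal sketch using only the standard library; the function name `socks_listener_up` is my own, and the check only shows that a TCP listener is there, not that it is really Tor.

```python
import socket


def socks_listener_up(host="127.0.0.1", port=9050, timeout=3.0):
    """Return True if something accepts TCP connections on host:port."""
    try:
        # create_connection raises OSError if nothing is listening
        # or the connection attempt times out.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


# Example (requires tor to be running locally):
# print(socks_listener_up())
```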

Now we are all set to start writing some code :)

Making a request to check if the proxy is working

Connect to the address and port mentioned before.

If everything is working fine, the IP displayed here should differ from the one you’d get by just clicking the link above.
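A check like this can be sketched with requests and PySocks as follows. This is an illustration under assumptions rather than the exact code I ran: the helper names are made up for the example, and `IP_ECHO_URL` stands for any service that returns your public IP as plain text (api.ipify.org is used here as an assumed stand-in).

```python
TOR_HOST = "127.0.0.1"
TOR_PORT = 9050

# Assumption: any plain-text "what is my IP" service would do here.
IP_ECHO_URL = "https://api.ipify.org"


def tor_proxies(host=TOR_HOST, port=TOR_PORT):
    """Build a requests-style proxies mapping for a local SOCKS listener."""
    # The socks5h scheme asks the proxy to resolve hostnames as well,
    # so DNS lookups also go through tor.
    address = f"socks5h://{host}:{port}"
    return {"http": address, "https": address}


def visible_ip(url=IP_ECHO_URL, proxies=None, timeout=30):
    """Return the IP address the remote service sees for this request."""
    # Imported here so the proxy mapping can be built without requests installed.
    import requests

    response = requests.get(url, proxies=proxies, timeout=timeout)
    response.raise_for_status()
    return response.text.strip()


# Example (needs network access and a running tor):
# print("direct: ", visible_ip())
# print("via tor:", visible_ip(proxies=tor_proxies()))
```

If the two printed addresses are the same, the request is probably not going through the proxy.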

Rotating random user-agents

Once I had the rotating IPs, I could start writing a random user-agent function. This turned out to be really simple:

There are plenty of sites where you can get user-agent strings. Writing a function that returns one of them at random will do the trick.
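A function along these lines could look like the sketch below. The three strings are only illustrative examples of the format; in practice you would paste in a longer, up-to-date list, and the names `USER_AGENTS`, `random_user_agent` and `random_headers` are chosen for this example.

```python
import random

# Illustrative examples only; replace with a longer, current list.
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 "
    "(KHTML, like Gecko) Version/17.1 Safari/605.1.15",
    "Mozilla/5.0 (X11; Linux x86_64; rv:121.0) Gecko/20100101 Firefox/121.0",
]


def random_user_agent(agents=USER_AGENTS):
    """Pick one user-agent string at random."""
    return random.choice(agents)


def random_headers():
    """Build a headers dict that can be passed to requests.get(..., headers=...)."""
    return {"User-Agent": random_user_agent()}
```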

Create a function and iterate

The website I was working with showed a maximum of 30 properties on each page. I first tried to scrape all of Spain’s properties at once, but it wouldn’t let me get past the 999th page. So I decided it would be a better idea to iterate a fixed number of times for every autonomous community in the country. I calculated how many pages were needed to get information on 100k properties. Since there are 17 autonomous communities in Spain, I needed around 6,000 houses listed in each one of them to get past 100k. With 30 listings per page, that’s a total of 200 pages for each community. Since the website ended each URL with the page number, it was really simple to iterate the requests with a for loop.
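The arithmetic and the URL loop can be sketched like this. The exact minimum works out to 197 pages per community; I rounded up to 200 to have some margin, which gives 17 × 200 × 30 = 102,000 listings. The base URL and the path layout below are placeholders, not the real site's; the only thing taken from the site is that the page number comes last.

```python
import math


def pages_per_region(target_listings, regions, listings_per_page):
    """Smallest number of pages per region that reaches the overall target."""
    listings_per_region = math.ceil(target_listings / regions)
    return math.ceil(listings_per_region / listings_per_page)


def page_urls(base_url, region, pages):
    """Build the URLs for one region, with the page number as the last segment."""
    return [f"{base_url}/{region}/{page}" for page in range(1, pages + 1)]


print(pages_per_region(100_000, 17, 30))  # exact minimum, before rounding up

# Hypothetical base URL and region slug, for illustration only.
for url in page_urls("https://example.com/listings", "madrid", 3):
    print(url)
```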

So what the code does is make a request and then wait a random number of seconds picked from a list. It also keeps track of how many requests have been made from the same IP address. After ten requests, the code sleeps for a minute and then continues. This slows down the process a lot, but it certainly won’t congest the server. If the IP address changes before the ten-request limit is reached, the count goes back to zero and the user-agent is changed.

This is how the harvester function looked.
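What follows is not the original function but a minimal sketch of the logic described above, under several assumptions. The network pieces are passed in as callables (`fetch`, `current_ip`, `pick_user_agent` are hypothetical names), the values in `DELAYS` are made up since only "a list" of waiting times is given, and the `page_N.txt` file naming is likewise an assumption.

```python
import random
import time
from pathlib import Path

DELAYS = [2, 3, 5, 8]        # assumed waiting times, in seconds
MAX_REQUESTS_PER_IP = 10     # requests allowed before the long pause
COOLDOWN_SECONDS = 60        # the one-minute pause


def harvest(urls, fetch, current_ip, pick_user_agent, out_dir, sleep=time.sleep):
    """Request every URL politely and save each 200 response as a .txt file.

    fetch(url, user_agent) must return (status_code, html_text);
    current_ip() must return the IP address currently seen by the outside world.
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    last_ip, user_agent = None, None
    count, saved = 0, 0
    for n, url in enumerate(urls, start=1):
        ip = current_ip()
        if ip != last_ip:
            # New IP address: reset the counter and switch user-agent.
            last_ip, count = ip, 0
            user_agent = pick_user_agent()
        elif count >= MAX_REQUESTS_PER_IP:
            # Same IP for ten requests: pause for a minute, then carry on.
            sleep(COOLDOWN_SECONDS)
            count = 0
        status, html = fetch(url, user_agent)
        count += 1
        print(f"[{n}] {url} status={status} ip={ip} requests_on_ip={count}")
        if status == 200:
            (out_dir / f"page_{n}.txt").write_text(html, encoding="utf-8")
            saved += 1
        # Random short wait between consecutive requests.
        sleep(random.choice(DELAYS))
    return saved
```

Passing the request and IP lookup in as functions keeps the pacing logic separate from the proxy setup, so the loop can be tried out with dummy functions before pointing it at a real site.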

The console would give me information on how the process went:

And every response was saved in the same folder as a .txt:

I left it running all night. The next morning I had a total of 3,476 files, adding up to 100,745 rows of data:

As you can see, this method is easy to configure and use. One thing you should be aware of is that some websites can detect tor IPs and prevent you from getting a 200 response, so it’s not infallible.

Hope you enjoyed the post! See you next time.