Anatomy Of A Web-Scraping Robot


What does a web scraper do?

The most famous web scraping robot we all know of is from Google. The Googlebot scrapes the internet to organize the world's websites so that we can easily search and access them.

Of course, not every website is indexed on Google. You've probably heard about the Dark Web, which lives on the part of the internet that isn't indexed by search engines like Google or Bing.

But that's another topic for another day.


A web scraping bot can be used to automate pretty much anything a human could do on the internet.

You can get really creative.

How does a robot visit a website?

The first step is to send an HTTP Request to the server to request the web page.

A robot makes the request with a programming language.

I do most of my scraping with Python, so I will use that language to demo in this post - Python 3, to be specific!

What is an HTTP request?

  • HTTP stands for Hypertext Transfer Protocol.
  • It is a protocol that allows computers to communicate over the Internet to request and send data. Typically the client sends an HTTP request and the server sends back an HTTP response with the requested resources.
  • Usually, the client is your web browser, and the server is the computer that holds the files that make up the website you're visiting.

A robot can send the same HTTP request to a server, asking for the website resources like HTML, CSS, JavaScript, and other types of files that make up a web page or application, such as images.
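
Under the hood, the request itself is just a small block of text sent to the server. A GET request for the homepage of this site looks roughly like this (some headers omitted).

GET / HTTP/1.1
Host: lvngd.com
Accept: */*
Connection: keep-alive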

HTTP request methods

When you send an HTTP request as a robot, you need to consider what your goal is, and what kind of request you're making.

For example, your goal might just be to view the homepage of the website.

HTTP methods are sometimes called HTTP verbs because the method or verb you use indicates to the server that you want it to perform a certain action.

If you want to visit the homepage of a website, you just want the server to return the resources that make up the home page so that you can view it.

Most common HTTP request methods

  1. GET

    A GET request simply requests a resource from the server. It is one of the most common types of requests. When you visit a web page in your browser, a GET request has likely been used to fetch the page resources.

  2. POST

    A POST request is used to send data to the server. If you fill out a form - say you sign up for an account on a website - that data will usually be sent as a POST request.

Less Common HTTP request methods
  • HEAD - just requests the headers, or metadata, that would be returned if a GET request were used, but without the actual resource. More on headers later.
  • PUT - similar to a POST request, but it is usually used to update a resource.
  • PATCH - used to partially modify a web resource.
  • DELETE - deletes a web resource.

These are not as commonly used for our purposes in web scraping, so don't worry too much if it's not obvious how they work right now.

You can read more about HTTP request methods here.


How to write a basic GET request in Python

We will use the python-requests library to fetch the homepage of this website, https://lvngd.com.

First create a virtualenv for the project and install python-requests.

mkvirtualenv robotenv

Install python-requests.

pip install requests

Okay, now we're ready.

A basic GET request for https://lvngd.com would look like this.

import requests
response = requests.get('https://lvngd.com')

Notice I imported requests at the top.

Check the status code to make sure the request was successful:

response.status_code 
200

The status code is 200, which means that it was successful.

Other status codes you may be familiar with are:

  • 404 - page not found.
  • 403 - forbidden - for example, if you need to be logged in to access a page but you aren't and try to request it anyway, you might get this error.
  • 500 - internal server error.
  • 503 - service unavailable - you might come across this one if your robot misbehaves and someone decides to block you.
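
If you don't want to check the number yourself, python-requests can also raise an exception when the server returns one of these error codes. A minimal sketch.

import requests

response = requests.get('https://lvngd.com')
response.raise_for_status()  # raises requests.exceptions.HTTPError for 4xx/5xx responses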

Request parameters

On some websites you can search for information by typing a query into a search bar and hitting enter, which will usually perform a GET request and send your search query to the server for processing.

Sometimes you will see the request URL with a query string in your browser bar.

If the URL doesn't change after you perform the search, the website is probably using some form of AJAX to send the request. In this case, the URL with query parameters still exists somewhere, just not in your browser URL bar.


An example would be if I had a website with a search bar, and typed "books" into the search bar and pressed enter.

The resulting URL might look like

http://example.com/page?q=books

The query parameter is here:

?q=books

My query for "books" is then received by the server, where the web application would use it in some way.

You can send the query parameters with python-requests either by simply requesting the formatted and encoded URL with the parameters, or by adding the parameters as key-value pairs.

url = 'http://example.com'
query_parameters = {'q': 'books'}
response = requests.get(url,params=query_parameters)

Note: this example is just a demo and will not work.
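
Either way, python-requests builds and encodes the query string for you. Assuming the request completes, response.url shows the final URL that was actually requested.

response = requests.get('http://example.com/page', params={'q': 'books'})
response.url
'http://example.com/page?q=books'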


In a POST request, the parameters are sent in the request body.

Request body

Not all requests have a body. If you send a GET or a HEAD request where you are just requesting resources from the server, there will usually not be a need for anything in the body.

A POST request will send the data to the server in the request body.

If you fill out a form to register on a website, the data you entered into the form fields will be sent in the request body.
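
With python-requests, form-style data goes in the data argument, which sends it in the request body. This is just a sketch - the URL and field names here are made up.

import requests

form_data = {'username': 'lvngdbot', 'email': 'bot@example.com'}
response = requests.post('http://example.com/register', data=form_data)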


HTTP headers

HTTP headers are sent along with a request or response as a set of key-value pairs.

They provide metadata about requests and responses.

Request headers

Let's look at the headers from the request I made earlier with Python.

import requests
response = requests.get('https://lvngd.com')

You can view the request headers like this.

response.request.headers

Which gives us this output.

{'User-Agent': 'python-requests/2.20.0', 'Accept-Encoding': 'gzip, deflate', 'Accept': '*/*', 'Connection': 'keep-alive'}
  • The User-Agent header identifies that I am using python-requests. More on this in a bit!
  • Accept-Encoding tells the server which content-encoding it can understand - gzip and deflate are compression algorithms, and if the server decided to compress the response, it could use one of these methods.
  • Accept tells the server which content types it can understand, which could be text, images, etc. Expressed as MIME types. Here it accepts any MIME type.
  • Connection tells the server whether or not to close the TCP connection. keep-alive keeps it open for subsequent requests.

I usually customize the User-Agent header when web scraping.

The User Agent lets the website know information about you, such as your browser, operating system or application you're using, like python-requests, Mozilla, Chrome, etc.

You can see your User Agent at this website.

Changing the User Agent in a Python request

The User Agent header is pretty easy to manipulate.

I'm a bot, so I will identify myself as LVNGDBot 1.0 for my User Agent.

I will perform the same request as I did earlier, only this time adding my User Agent header to the request in a headers dictionary.

import requests
headers = {'User-Agent': 'LVNGDBot 1.0'}
response = requests.get('https://lvngd.com', headers=headers)

If you want to add or manipulate other headers, they would be added to this dictionary as well as key-value pairs.

Let's look at the request headers that were sent to the server this time.

response.request.headers

Notice the updated User-Agent header.

{'User-Agent': 'LVNGDBot 1.0', 'Accept-Encoding': 'gzip, deflate', 'Accept': '*/*', 'Connection': 'keep-alive'}

Another useful header for bots is the From header, which is where you can put a contact email address so website administrators can reach you if they need to.
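
Adding it works just like the User-Agent - another key-value pair in the headers dictionary. The email address here is a placeholder.

import requests
headers = {'User-Agent': 'LVNGDBot 1.0', 'From': 'youremail@example.com'}
response = requests.get('https://lvngd.com', headers=headers)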


Response headers

Let's look at the response headers from the server for this same request.

response.headers

Output:

{'Date': 'Mon, 23 Dec 2019 15:18:36 GMT', 
'Content-Type': 'text/html; charset=utf-8', 
'Transfer-Encoding': 'chunked', 
'Connection': 'keep-alive', 
'Set-Cookie': '__cfduid=dd527aa35f621c9a5f5c31d85bc4142351577114316; expires=Wed, 22-Jan-20 15:18:36 GMT; path=/; domain=.lvngd.com; HttpOnly; SameSite=Lax, csrftoken=Sg0jagwLC5nyhcwHWACuvsoXEHefFpZNmUgftOAg9joLuwW0BhYUzBNRc9jA90F3; expires=Mon, 21-Dec-2020 15:18:36 GMT; Max-Age=31449600; Path=/',
'X-Frame-Options': 'SAMEORIGIN',
'Vary': 'Cookie',
'CF-Cache-Status': 'DYNAMIC',
'Expect-CT': 'max-age=604800, report-uri="https://report-uri.cloudflare.com/cdn-cgi/beacon/expect-ct"',
'Server': 'cloudflare',
'CF-RAY': '549b4f1cf807e6d4-EWR', 
'Content-Encoding': 'gzip'}

Some interesting ones here.

  • Content-Type - this is used to indicate the media type of the requested resource, which is text/html in this case.
  • Server - tells which software the server is using to respond to the request.
  • Set-Cookie - sends cookies from the server to the client.
  • X-Frame-Options - indicates whether or not a browser should be able to render a page in a frame or embedded object. read more here.

The headers CF-Cache-Status and CF-RAY are from Cloudflare.
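
response.headers behaves like a case-insensitive dictionary in python-requests, so you can pull out individual headers directly.

response.headers['Content-Type']
'text/html; charset=utf-8'

response.headers.get('Server')
'cloudflare'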


One non-standard header of note for someone who is writing web scrapers and might be using a proxy IP address is X-Forwarded-For. It is a request header that certain types of proxies add on your behalf, and it contains your real IP address - I will talk about proxy types next.

A full discussion of HTTP headers could easily take up an entire post or series of posts, so I will leave it at that for now.


IP Addresses

Your IP address can be thought of as your computer's address.

You might not want to use your real IP address for privacy concerns or other reasons.

Proxy IP addresses

If you use a proxy IP address, your request will be routed through the proxy IP address, to the website server, and then the response will return back through the proxy IP before it gets to you.

It is a gateway between you and the rest of the Internet and can offer anonymity, so you might prefer to use a proxy IP address and not have your real IP address tied to your robot.

In this post I'm only talking about HTTP and HTTPS proxy IPs. Proxy IP addresses can be used for other types of connections such as FTP as well.

Not all proxy IP addresses are created equal

There are different types of proxy IP addresses, and with some the website will be able to see your real IP address.

Transparent proxy
  • A transparent proxy IP tells websites that it is a proxy IP address.
  • Your real IP address is sent in the 'x-forwarded-for' HTTP header.
Anonymous proxy
  • An anonymous proxy IP identifies itself as a proxy IP.
  • It does not reveal your real IP address.
Elite proxy
  • An elite proxy IP does not identify itself as a proxy IP.
  • It does not reveal your real IP address.

Even if you are trying to be a good robot, you could still do something to trigger a block, and you might not want your real IP address getting blocked, especially if you are scraping a website that you also use as a human.

With a quick search online, you will find endless sources for free proxy IP addresses as well as paid services.

Some websites will maintain lists of known proxy IP addresses and block them, so if you are using free proxy IP services that a lot of other people are likely using as well, you might find that you need to iterate through a pool of them if some have already been blacklisted.

Iterating through a pool of proxy IP addresses can be a good idea in general.

Using a proxy IP address in a Python request

Adding proxy IP addresses to your request is similar to adding the HTTP headers that we added above.

When you use a proxy IP address you need both the IP address and the port.

This page has a list of free proxy IP addresses and you can see the addresses with ports.

I'm using 123.456.78.9:8080 as a fake proxy IP address in this example - the port is 8080.

import requests
proxies = {
    'http': 'http://123.456.78.9:8080',
    'https': 'http://123.456.78.9:8080',
}
response = requests.get('https://lvngd.com', proxies=proxies)

In the proxies dictionary I've identified a proxy IP address to use for http and https requests. In this case it is the same one.

If you need to authenticate your proxy IP address, you can use HTTP Basic Auth with this syntax.

proxies = {'http': 'http://user:pass@123.456.78.9:8080/'}

From the python-requests docs.
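
As mentioned earlier, you may want to iterate through a pool of proxy IP addresses. A rough sketch of that idea - the addresses here are made up, and you would swap in your own pool.

import random
import requests

# hypothetical pool of proxy addresses - replace with your own
proxy_pool = ['http://123.456.78.9:8080', 'http://98.76.54.32:3128']

for proxy in random.sample(proxy_pool, len(proxy_pool)):
    proxies = {'http': proxy, 'https': proxy}
    try:
        response = requests.get('https://lvngd.com', proxies=proxies, timeout=10)
        response.raise_for_status()
        break  # this proxy worked, stop trying others
    except requests.exceptions.RequestException:
        continue  # blocked or unreachable, try the next proxy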


That's it!

I've outlined the basics of what goes into client-server communications with HTTP requests and responses, and what is actually happening under the hood when you visit websites and interact with them.

You can perform these tasks as a robot using Python or any number of other programming languages.

Just keep in mind
  1. Which request method do you need to use, GET or POST?
  2. Do you need to send any parameters with the request, like search query terms?
  3. Which request headers do you need to specify, if any, like your User Agent?
  4. Do you want to use a proxy IP address?

And you're all set!

Let me know if you have any questions or comments either by writing below or on Twitter @LVNGD.

Thanks for reading!
