What is web scraping?


Web scraping can mean a lot of things, but it usually refers to writing a program to visit websites and extract information from them.

You may have heard of bots - short for robots - and a web scraper is a type of bot.

I will explain a bit more and then have a demo of a web scraper written in Python later in this post.


Extracting information

The scraping part refers to extracting the information, usually by finding what you want in the page HTML and writing a program to extract it. This might sound complicated, but there are many tools out there to assist in doing this, and I will demonstrate one later.

Other Internet tasks

Web scraping can also refer simply to automating any task you might perform on a web page, such as filling out and submitting a form.

That's where it can get really interesting - picture yourself on the Internet, visiting web pages.

Pretty much anything that you're doing as a human can also be done with a programming language. It can likely be done faster, and on a larger scale, especially if it's some kind of repetitive task.

If you have a big research project, for example, it can get time consuming to visit lots of websites and copy and paste the information you want by hand.

The data you need or want might not be available in a nice and neat package.

Often the data you want to use is not available in exactly the form that you need it, so you have to take matters into your own hands to create the dataset of your dreams.

In the best case, you might be able to get the data you need directly from an API or by downloading a file.

Say you want to analyze Twitter data and perform sentiment analysis on tweets using a certain hashtag. Twitter has a convenient API to access their data in a nice and neat JSON format.

Unfortunately, you won't always have it so easy if you want data from some other source that does not offer an API or other way to easily download the data, or maybe you want to compile a dataset from multiple sources.

Scraping and compiling the data, and then cleaning it, are the kinds of tasks that computers excel at - which is why web scraping can be such a great tool.

Web Scraping Demo

We will look at articles from the New York Times fashion section and scrape the article text.

This will be a 3-part project that ultimately ends up with a network graph of names and other entities taken from the articles.

In this post you can see how the graph was created.

The three parts:
  1. Scraping the data - today's post.
  2. Processing the text data to extract named entities - in this post.
  3. Visualizing the processed data with D3.js - in this post.

The first step is to scrape the NY Times fashion section and extract text data from articles, which is what we will be doing today.

The Scraper

  • Written in Python 3.
  • Using the requests-html library, which takes care of most of the under-the-hood details that go into requesting website resources from the servers that host them.

A 2-step process to scrape the article text data:
  1. Compile a list of links to fashion-related articles from the NY Times fashion section homepage.
  2. Visit each article link and extract the article text data.

Setup

First create a virtualenv for the project.

mkvirtualenv scraperenv

Install requests-html.

pip install requests_html

Compiling the article links

First we scrape the homepage of the NY Times fashion section and gather the links to fashion articles we are interested in.

Import HTMLSession from requests_html.

from requests_html import HTMLSession

url = 'https://www.nytimes.com/section/fashion'
headers = {'User-Agent': 'LVNGDBot 1.0'}

I've also initialized a couple of variables.

  • The NY Times fashion section url, which we will be scraping.
  • A request header to set my User Agent. I typically set this when web scraping so that the sites I visit will know who I am. Your User Agent usually shows information about your browser and operating system - there are plenty of sites that will show you what yours is.

Now initialize an HTMLSession() instance.

session = HTMLSession()

And request the url - this is where we add the headers to the request as well.

r = session.get(url, headers=headers)

I'm using requests-html in this example to simplify things a bit. It offers functionality to parse HTML and easily extract the links from the page.

Otherwise I would use a library like BeautifulSoup to parse the HTML and extract the data.

With requests-html I'm able to get the links on the page with one line of code.

links = r.html.links

Now we have a list of links - actually a set of links, but regardless, it contains every link from the page, so we will need to filter out the ones we don't want.

The set includes links to all of the sections of the New York Times website, along with other links that are not fashion articles.

The full output from links is pretty long, so below is just a summary.

{...
 '/2019/12/16/business/fashion-nova-underpaid-workers.html',
 '/2019/12/16/style/International-best-dressed-list-goodbye.html',
 '/2019/12/17/fashion/isabel-rueben-toledo-memorial-love-story.html',
 '/2019/12/17/style/best-leotards-for-dancers.html',
 '/2019/12/17/style/fast-fashion-gen-z.html',
 '/2019/12/18/fashion/hms-supply-chain-transparency.html',
 '/2019/10/16/style/self-care/cbd-oil-benefits.html',
 '/2019/10/16/style/self-care/celery-juice-benefits.html',
 '/slideshow/2019/10/01/fashion/runway-womens/louis-vuittonlouis-vuitton.html',
 '/slideshow/2019/10/01/fashion/runway-womens/miu-miu-spring-2020.html',
 'https://www.nytimes.com/spotlight/arts-listings',
 'https://www.nytimes.com/spotlight/pop-culture',
 'https://www.nytimes.com/subscription/cooking.html',
 'https://www.nytimes.com/subscription/crosswords',
 'https://www.nytimes.com/subscription?campaignId=37WXW',
 'https://www.nytimes.com/times-journeys',
 'https://www.nytimes.com/video',
 'https://www.nytimes.com/video/arts',
 'https://www.nytimes.com/video/opinion',
 'https://www.nytimes.com/watching',
 'http://www.nytimes.com/services/mobile/index.html'}

For my purposes I just want to keep the fashion-related articles.

Web scraping is all about finding patterns to use in a program to extract the data you want.

We can see that the articles all end with .html, so that's a start.

fashion_links = []
for link in links:
    if link.endswith('html'):
        if not link.startswith('/slideshow/') and not link.startswith('http') and not '/self-care/' in link:
            if link not in fashion_links:
                fashion_links.append(link)

However, some of the links are for articles about things like celery juice benefits, which is not exactly what I want. The celery juice type of articles are in a sub-section /self-care/ so I've filtered those out.

I've also filtered the slideshow articles - they start with /slideshow/ - because they are just images without much text.

Lastly, I'm filtering out any links starting with http, because none of those are relevant to my interests either.
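
As an aside, the same filtering can be written more compactly as a set comprehension - a minimal equivalent sketch, not the code used above. It builds a set rather than a list, which makes the duplicate check unnecessary and matches how the filtered links are displayed below.

fashion_links = {
    link for link in links
    if link.endswith('html')
    and not link.startswith(('/slideshow/', 'http'))
    and '/self-care/' not in link
}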

Check out the filtered links in fashion_links.

{'/2017/05/03/fashion/style-questions-newsletter-open-thread.html',
 '/2019/02/06/style/9-reasons-to-not-hate-february.html',
 '/2019/02/13/style/how-to-marie-kondo-your-wardrobe.html',
 '/2019/03/13/style/5-great-new-shoe-lines-one-common-denominator.html',
 '/2019/03/20/style/spring-trends.html',
 '/2019/03/27/style/how-to-wear-fall-2019-trends-now.html',
 '/2019/08/03/style/polyamory-nonmonogamy-relationships.html',
 '/2019/08/17/style/swimming-holes-california.html',
 '/2019/08/31/style/summer-streets.html',
 '/2019/10/19/style/the-nightlife-outlaws-of-east-los-angeles.html',
 '/2019/10/24/style/rebag-clair-handbag-stock-market.html',
 '/2019/10/29/style/29china-ban-black-clothing-hong-kong-protests.html',
 '/2019/11/04/style/zac-posen-barneys-brand-closed.html',
 '/2019/11/14/style/Queen-Elizabeth-II-fashion-royal-dresser.html',
 '/2019/11/14/style/jennifer-nettles-cms-equal-play.html',
 '/2019/11/21/style/tulsi-gabbard-democratic-debate-white-pantsuit.html',
 '/2019/11/23/style/usc-ucla-marching-band.html',
 '/2019/11/27/style/black-friday-has-no-meaning.html',
 '/2019/11/27/style/skin-care-beauty-what-the-marvelous-mrs-maisel-wears-offscreen.html',
 '/2019/12/02/style/melania-trump-white-house-christmas-decorations.html',
 '/2019/12/04/style/beauty-holiday-makeup-when-your-makeup-is-the-party.html',
 '/2019/12/05/style/nancy-pelosi-pantsuit-impeachment.html',
 '/2019/12/11/fashion/terry-de-havilland-dead.html',
 '/2019/12/12/style/bergdorf-goodman-darcy-penick.html',
 '/2019/12/12/style/do-i-need-a-wallet.html',
 '/2019/12/12/style/the-long-claims-behind-longer-eyelashes.html',
 '/2019/12/16/business/fashion-nova-underpaid-workers.html',
 '/2019/12/16/style/International-best-dressed-list-goodbye.html',
 '/2019/12/17/fashion/isabel-rueben-toledo-memorial-love-story.html',
 '/2019/12/17/style/best-leotards-for-dancers.html',
 '/2019/12/17/style/fast-fashion-gen-z.html',
 '/2019/12/18/fashion/hms-supply-chain-transparency.html'}

Looks mostly good - it will do for our purposes here!

When you're writing a scraper for whatever purpose, the filtering will be customized based on what you want to achieve.


Parsing the Articles

So now we have a collection of relevant links to extract articles from.

The process is very similar to when we compiled the links earlier.

We will loop through fashion_links, visit each article page and finally extract the article text.

for link in fashion_links:

Initialize an HTMLSession() instance.

session = HTMLSession()

Now we need to make sure we are requesting a valid URL. You can see in our collection of URLs in fashion_links that each link contains only the URL path, without the base domain.

So we need to fix that by concatenating the base NY Times domain, 'https://www.nytimes.com', to the URL path.

full_link = ''.join(['https://www.nytimes.com',link])
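
As a side note - not what the code above does - the standard library's urljoin handles this kind of joining as well, and requests-html also offers r.html.absolute_links if you'd rather get fully qualified URLs from the start. A minimal equivalent using urljoin:

from urllib.parse import urljoin

# joins the base domain and the URL path into a full URL
full_link = urljoin('https://www.nytimes.com', link)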

With our valid URL, we can now make the request.

r = session.get(full_link, headers=headers)

I'm using the same headers from before to identify myself in these requests as well.

Now it's time to find and extract the article text.

if r:
    article_texts = r.html.find('div.StoryBodyCompanionColumn')
    all_text = []
    for text in article_texts:
        section = text.text
        all_text.append(section)
    content = ' '.join(all_text)
    content = content.replace("\n"," ")

I've looked through the HTML of the article pages - you can do this with the developer tools in any browser - and found that each piece of the article text is within a div with class StoryBodyCompanionColumn, so I first extracted all of those divs.

For one of the articles, the output of article_texts looks like this:

[<Element 'div' class=('css-1fanzo5', 'StoryBodyCompanionColumn')>,
 <Element 'div' class=('css-1fanzo5', 'StoryBodyCompanionColumn')>,
 <Element 'div' class=('css-1fanzo5', 'StoryBodyCompanionColumn')>,
 <Element 'div' class=('css-1fanzo5', 'StoryBodyCompanionColumn')>,
 <Element 'div' class=('css-1fanzo5', 'StoryBodyCompanionColumn')>]

So I looped through those divs and extracted the text, which I appended to a list, all_text, and then concatenated the text chunks back together into one chunk.

The last line, content.replace("\n"," "), replaces the newline characters in the text chunk with spaces.


Here is the loop all together to visit the article pages and extract the text.

for link in fashion_links:
    session = HTMLSession()
    full_link = ''.join(['https://www.nytimes.com',link])
    r = session.get(full_link, headers=headers)
    all_text = []
    if r:
        article_texts = r.html.find('div.StoryBodyCompanionColumn')
        for text in article_texts:
            section = text.text
            all_text.append(section)
        content = ' '.join(all_text)
        content = content.replace("\n"," ")
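
One small addition I'd consider - it's not part of the loop above - is a short pause between requests, as a courtesy so we don't hammer the NY Times servers.

import time

# at the end of each loop iteration, wait a second before requesting the next article
time.sleep(1)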

Note that I have not done anything to save the data that I extracted with this code.

Saving the data

After collecting the data, there are many ways you could save it - writing it out to a file, such as a CSV or JSON file, is often the simplest.
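
For example, here is a minimal sketch of writing the results to a CSV file with Python's built-in csv module. It assumes the loop above was modified to append each article's link and text to a list called scraped - a hypothetical variable, not something the code above actually creates.

import csv

# scraped is assumed to be a list of (link, content) tuples built inside the scraping loop
with open('fashion_articles.csv', 'w', newline='') as f:
    writer = csv.writer(f)
    writer.writerow(['link', 'text'])
    writer.writerows(scraped)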

Or the scraper might be part of a web application, in which case you might save the data straight to a database.

The possibilities are endless!

Next Steps

The data has been scraped, and the next thing to do is to perform any cleaning, processing, or analysis to extract any insights you might find.


Processing the data

  • The next step for this project is to extract entities from the article text.
  • Entities in this case are names, brands, locations, and that sort of thing.
  • I will have a separate post detailing how I extracted named entities from the articles using Natural Language Processing (NLP) - a rough sketch of the general idea is below.
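
To give a rough idea of what that looks like - this is just a sketch with one possible library, not necessarily the exact approach from that post - spaCy can pull named entities out of the scraped article text:

import spacy

nlp = spacy.load('en_core_web_sm')  # small English model, installed separately
doc = nlp(content)  # content holds one article's text from the scraping loop
entities = [(ent.text, ent.label_) for ent in doc.ents]  # e.g. names, brands, locations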

Visualizing the data with D3.js

I created a network graph of the data using D3.js, which I will discuss in another post.


Thanks for visiting!

If you have any questions/comments/concerns feel free to leave a comment below or reach out to me on Twitter @LVNGD.

