
How to Build a Web Scraping Pipeline in Python Using BeautifulSoup

March 28, 2022



Use Case: Extracting Information about Products from an Online Store

In this tutorial, you will learn how to:

  • Create a web scraping pipeline in Python
  • Navigate and parse HTML code
  • Use Beautiful Soup and Requests to fetch and extract data from websites
  • Go through multiple pages and avoid crashes by handling exceptions
  • Clean and store extracted data in a meaningful way
  • Build a mindset to mentally prepare for web scraping

Web scraping is the process of automating data extraction from websites. You can build a web scraper to take something out of a web page, such as gathering reviews of books from a third-party platform, downloading all the lyrics of your favorite songs, or just for the fun of it.

Several popular tools are available for web scraping, such as Beautiful Soup, Scrapy, and Selenium. Beautiful Soup and Scrapy are both excellent starting points. While Selenium is powerful for web automation, such as clicking a button or selecting elements from a menu, it is a little trickier to use.

This tutorial focuses on Beautiful Soup and will build a web scraper step-by-step to extract information about books listed on Book to Scrape website. Book to Scrape is a demo website for web scraping purposes, with a typical and well-presented structure of retail websites. In this tutorial, we assume that you’re new to web scraping, so using a static and durable website will be a good choice for your learning and practicing. Enjoy the journey!

We will go through three phases involving six steps: 

  • Phase 1 – Setup: i.e., identifying your scraping goal, exploring and inspecting the website, installing or importing necessary packages. 
  • Phase 2 – Acquisition: i.e., accessing the website and parsing its HTML. 
  • Phase 3 – Extraction and processing: i.e., extracting, cleaning, and storing data of interest, and saving the final result. 

Let’s start!

STEP 1: Identify Your Goal and Explore the Website of Interest

Yes, the initial step is not to open a Jupyter Notebook or your favorite IDE. You will start web scraping by clarifying your scraping goal and how the target website presents the information you want. In this tutorial, our goal is to extract information about products listed on a book store website, which may include category, book title, price, rating, availability, etc.

Book Store Website

By visiting the home page, you will find that your web scraper may be able to do a lot of things, such as: 

  • Filtering books by category
  • Getting the information you’re interested in, like book title and price
  • Going to the next page to load more books
  • Going to a single product page to get more detailed information about a book

STEP 2: Inspect Web Page’s HTML

HTML (Hypertext Markup Language) is the standard markup language for web pages. With HTML knowledge, you can create your own website or scrape existing ones.

It is fine to follow the tutorial without an HTML background, as we will introduce the basic structure of HTML to make your work easier. You can also refer to Wikipedia or other resources for more information about HTML.

After exploring the website, it’s time to switch your identity from customer to scraper and familiarize yourself with the HTML structure of the target website. By right-clicking on an item of interest and selecting ‘Inspect’, you open the Developer Tools. You can see the HTML text and how it can be expanded, collapsed, and even edited.

Inspect Web Page’s HTML

Don’t be intimidated by the whole HTML page if this is your first time inspecting a website. All you need is patience, and you will find what you want! In this example, all the books are listed under <ol class="row">. For each book, the HTML structure is the same: under <li class="col-xs-6 col-sm-4 col-md-3 col-lg-3"> and then under <article class="product_pod">, there are several sub-sections containing the book image, rating, title, price, and so on. Here, <ol>, <li>, and <article> are HTML tags; they represent the ordered list, list item, and article, respectively. Note that tags generally come in pairs.

Let’s take a closer look at the <h3> tag.

...
<ol class="row">
  <li class="col-xs-6 col-sm-4 col-md-3 col-lg-3">
    <article class="product_pod">
      <div class="image_container">...</div>
      <p class="star-rating Three">...</p>
      <h3>
        <a href="../../../the-secret-garden_413/index.html" title="The Secret Garden">The Secret Garden</a>
      </h3>
      <div class="product_price">...</div>
    </article>
  </li>
  <li class="col-xs-6 col-sm-4 col-md-3 col-lg-3">...</li>
  <li class="col-xs-6 col-sm-4 col-md-3 col-lg-3">...</li>
...

<h1> to <h6> define HTML headings. The <a> tag under <h3> defines a hyperlink, i.e., the URL of the product page. We can also find the book title within <a>.

By right-clicking on the price and inspecting it, you will find the price information is located as follows:

...
<div class="product_price">
<p class="price_color">£15.08</p>
<p class="instock availability">
<i class="icon-ok"></i>
In stock
</p>
...

In addition to price, the in-stock status is also available here.

Right-clicking what you’re interested in and then inspecting its HTML is a must-have skill for fetching information from a website, and you will become familiar with a website’s HTML structure by using this skill again and again. The more you explore and inspect the website, the more familiar you become with its HTML, and the better your web scraper will work.

STEP 3: Install and Import Libraries

It’s coding time!

Your web scraping goal and the target website determine what libraries will be installed or imported. Generally, you need to get the following tools to be ready:

  • Requests: allows you to send HTTP requests to websites easily
  • Beautiful Soup: pulls data out of HTML files
  • Pandas: handles extracted data (The alternatives include csv, json, or other libraries you prefer) 

Optional:

  • re: part of Python’s standard library; can be used to check whether a string contains a specified search pattern. 
  • datetime: part of Python’s standard library; supplies classes for manipulating dates and times

Install the beautifulsoup4, requests, and pandas packages first, if you haven’t done so (Code Snippet 1). Then import these modules (Code Snippet 2). re and datetime are part of Python’s standard library and can be imported directly as needed. 

pip install beautifulsoup4
pip install requests
pip install pandas

Code Snippet 1

from bs4 import BeautifulSoup as soup
import requests 
import pandas as pd 
import re
import datetime

Code Snippet 2

The next two steps will help you construct the main body of your script, that is, retrieving the website and parsing its HTML, as well as extracting and storing data of interest.

STEP 4: Retrieve Website and Parse HTML

This step is extremely important but pretty easy to achieve; only a few lines of code are needed (Code Snippet 3):

# Identify the target website's address, i.e., URL
books_url = 'https://books.toscrape.com/index.html'
# Create a response object to get the web page's HTML content
get_url = requests.get(books_url)
# Create a beautiful soup object to parse HTML text with the help of the html.parser
books_soup = soup(get_url.text, 'html.parser')
# Check website's response
print(get_url)

Code Snippet 3

If you get 200 by running print(get_url), your target website has replied that it’s OK to connect. A 403 means the server understood your request but refuses to fulfill it. A 404 means the requested page could not be found, although it may be available again in the future. 
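
If you want your script to check the response explicitly instead of just printing it, a minimal sketch (reusing the names from Code Snippet 3) could look like this:

# Optional check before parsing; reuses books_url from Code Snippet 3
get_url = requests.get(books_url)
if get_url.status_code == 200:
    books_soup = soup(get_url.text, 'html.parser')
else:
    # raise_for_status() turns a 4xx/5xx response into an exception
    get_url.raise_for_status()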

# Get some intuition by printing out HTML
# This step is not required to build a web scraper
print(get_url.text)
print(books_soup)
# Use the prettify() method to format the HTML nicely
print(books_soup.prettify())

Code Snippet 4

If you print out the HTML text of the website (Code Snippet 4), you will notice that the output of print(get_url.text) looks like the HTML text you inspected in Step 2. Beautiful Soup makes the HTML content easy to access, and the prettify() method can be used to present the output nicely.

Now, you have successfully retrieved the target website and created a beautiful soup object to parse HTML of the website.

STEP 5: Extract, Clean, and Store Data

Step 5 is a bit complicated; it involves several sub-steps.

STEP 5.1: Get Ready By Answering Questions (Mental Preparation)

Before extracting data, you have to clarify some problems, such as:

  • What data will you scrape?
  • How do you store and save the extracted data?
  • What is your workflow?
  • What are some problems you may encounter?

It is extremely important to take time to think about such questions before scraping. But where to find the answers to these questions? Here are some hints:

  • Answer the first question based on your purpose and the content of your target website
  • Answer the second question based on the types of your extracted data (Number? Text? Image?) or your preference (Store data in a list? A dictionary? A list of dictionaries? Save data as a CSV file? JSON? Excel?)
  • Answer the third question based on the structure of your target website
  • Answer the fourth question as you try to answer the above three questions

Let’s go through the questions one by one. 

Q1: What data will you scrape?

A1: Suppose we will use the data extracted from the website for some particular purposes, including analyzing price volatility, tracking in-stock status, and comparing books of different categories. Based on these purposes, we consider scraping the following information: 

  • scraping date: Information changes over time, like price, rating, in-stock status, etc. You may want to do a time series analysis in the future, but first you need to record the date of scraping. How do you leave a timestamp? (Note: Book to Scrape is a demo website; prices and ratings were randomly assigned and remain unchanged. That is not the case for real-world websites.)
  • book id: The unique id identifying a book (generally, books need an ISBN and non-book products need a UPC as an identifier, but in this case only the UPC is available ¹). It can also be a unique product id created by the website itself.

You can’t find a book’s UPC on the products list page. It’s on the single product page, within a table. How do you get information from a table?


¹ “ISBN and UPC product code information – ECPA.” https://www.ecpa.org/page/Codes/ISBN-and-UPC-product-code-information.htm?page=businesssolutions. Accessed 15 Feb. 2022.

  • book title: No worries? Yes worries!

Book Title

Not all the books show the complete book title. If you inspect such a book, you will find the text part of <a> is truncated, but the value of the title attribute looks complete. How do you fetch exactly what you want?

<a href="../../../and-then-there-were-none_119/index.html" title="And Then There Were None">And Then There Were ...</a>

  • category: When you fetch categories from the website, you may find ‘Books’ is also listed in the extracted data. You definitely don’t want ‘Books’ as a category, but how do you drop exactly what you don’t want?
  • price: There are different prices listed in the product information table above. Which price should be fetched? It’s up to you. In this tutorial, we will not extract the price from the table but use the single price located under the book title. Can we keep only the numeric part by removing the currency symbol? In this case, the currency symbol is £.
  • in stock: In-stock status may be either in stock or out of stock. You can also use Y and N to represent the status through some data cleaning. Is it necessary to take data cleaning into account when building a web scraper?
  • availability: As you can see in the product information table, availability is combined with in-stock status, like ‘In stock (5 available)’. But what if it is ‘Out of stock (0 available)’? How do you extract content that follows the same pattern but with different contents?
  • rating: The rating on the Book to Scrape website is odd. Generally, data is not embedded in the class name. For example:

On Amazon, the rating is 4.8 out of 5 stars instead of a-icon-alt

<span class="a-icon-alt">4.8 out of 5 stars</span></i>

On Walmart, the rating is (4.6) instead of f7 rating-number

<span class="f7 rating-number">(4.6)</span>

But on Book to Scrape: The rating is Two

<p class="star-rating Two">

So, is there a silver bullet that handles all forms of websites?

  • link: The unique web address, i.e., URL. 

Note: The URL is the string assigned to href in the <a> element, but the problem is that it looks incomplete. For example: 

<h3>
        <a href="../../../the-secret-garden_413/index.html" title="The Secret Garden">The Secret Garden</a>
</h3>

The unique web address

You can go to the specific book page by clicking the link directly during inspection, but if you copy the HTML text (right-click, then Copy element) and paste the string into your notebook, you will find it does not work. That is because the link in the <a> element is a relative site location, i.e., the path to the book’s resource. The complete URL is as follows: 

https://books.toscrape.com/catalogue/the-secret-garden_413/index.html

How do you make your scraped URL (the incomplete one) work well?

After deciding what information to scrape, and of course, raising more questions, let’s go to the next one.

Q2: How to store and save the extracted data?

A2: Choose a meaningful way to store data. As you can imagine, in this case, the extracted data will be numbers or text, unless you want to get the image of a book. In the following work, we will build a list of dictionaries to store all the books. Why? Because it’s easy to append a dictionary, which contains the information of a book, to a list, just like how you put a real book into a shopping cart. And after extracting all the data, you can simply convert the list of dictionaries to a dataframe and save it as a CSV file. That is one of the meaningful ways we stated before. It is based on the types of your extracted data, your purpose, and sometimes, your preference.
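
As a rough sketch of this idea (the book values below are made up for illustration, and pandas is assumed to be imported as pd per Code Snippet 2), the path from a list of dictionaries to a CSV file looks like this:

# Illustration only: a made-up book stored as a dictionary
books_all = []
book = {'book_title': 'A Hypothetical Book', 'price': '£10.00', 'rating': 3}
# Append the dictionary to the list, like putting a book into a shopping cart
books_all.append(book)
# Convert the list of dictionaries to a dataframe and save it as a CSV file
books_df = pd.DataFrame(books_all)
books_df.to_csv('books_demo.csv', index=False)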

Q3: What is your workflow?

A3: You are visiting the website’s home page, exploring it, and intending to get the information of all the books by category; you are not satisfied with the information listed on the products list page and want to get more. So we guess that your workflow would be: 

Find all the categories (from the first to the last); go to the products list page under a category; scrape all the books listed on the current page, including the detailed information on each single product page; and move on to the next page if the current page is not the last one. Done!

We can build several functions to go through the workflow.

def find_categories():
    # Find all categories
    ...

def fetch_books_by_category():
    # Fetch all the books under a category, page by page
    ...

def fetch_current_page_books():
    # Fetch all the books listed on the current page
    # Build a dictionary to store extracted data
    # Append the dictionary to a list
    # Go to the next page if the current page is not the last one
    ...

def fetch_more_info():
    # Get detailed info about a book
    ...

def fetch_all_books():
    # Fetch all the books of all the categories
    # Return the list of dictionaries that contains all the extracted data
    return books_all

Code Snippet 5

Q4: What problems may you encounter?

A4: The problems highlighted in bold above and more! You don’t need to (and can’t) resolve all the problems before you start to extract data, but it is definitely good practice to consider them. You will explore the website, inspect and search HTML elements, get the answers (and perhaps raise more questions), and write and test your code, back and forth.

If this is your first web scraper, we advise you to begin with a small chunk, such as fetching categories only, to get some experience, which is also a good way to see if your script works well.

STEP 5.2 Start by Fetching a Single Variable


Let’s take a look at the HTML elements of categories.

...
<ul class="nav nav-list">
  <li>
    <a href="catalogue/category/books_1/index.html">Books</a>
    <ul>
      <li>
        <a href="catalogue/category/books/travel_2/index.html">Travel</a>
      </li>
      <li>...</li>
      <li>...</li>
      <li>...</li>
    </ul>
  </li>
</ul>
...

Books is under <ul class="nav nav-list"> and then under <li>. If we consider <ul class="nav nav-list"> a grandparent element and <li> a parent element, the <ul> (without a class name) under <li> can be treated as a child element, which has many grandchildren elements, also named <li>, each of which represents a category. 

Our purpose is to fetch the categories, excluding Books. So the most important thing here is to find the right element in books_soup, the Beautiful Soup object created in Code Snippet 3. A Beautiful Soup object contains all the HTML elements, which can be accessed with the find() or find_all() method.

# Find all the categories listed on the web page
# This step is used for testing and practicing, which can be skipped for the final scraper
categories = books_soup.find('ul', {'class': 'nav nav-list'}).find('li').find('ul').find_all('li')

Code Snippet 6

The find() method finds the first element matching your query in the soup object, and the find_all() method captures all matching elements on the page. Here, we first find the correct <ul> element with the specific class name 'nav nav-list' (enclosed in a dictionary), then find <li> and <ul> successively, under which we find all the <li> elements.

Print out categories and len(categories) to see what you have found. If we change the code to books_soup.find('ul', {'class': 'nav nav-list'}).find_all('li'), what will happen? Try it!

For a single category, the text within the pair of <a> tags is the category name, which can be acquired through the text attribute. The value of 'href' is the category’s URL, which can be extracted with the get() method. We will loop through categories to fetch the name and URL of each one.

# Loop through categories
for category in categories:
    # Get category name by extracting the text part of <a> element
    # Strip the spaces before and after the name
    category_name = category.find('a').text.strip()
    # Get the URL, which leads to the products list page under the category
    category_url_relative = category.find('a').get('href')
    # Complete category's URL by adding the base URL
    category_url = base_url_for_category + category_url_relative
    print(f"{category_name}'s URL is: {category_url}")

Code Snippet 7

Did you notice base_url_for_category? What’s it for? The URL you get is "catalogue/category/books/travel_2/index.html", which is a relative URL and does not lead to a valid web page on its own. The absolute URL is "https://books.toscrape.com/catalogue/category/books/travel_2/index.html". So here, we complete the link by assigning "https://books.toscrape.com/" to base_url_for_category and prepending the variable to category_url_relative.
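
As an alternative to manual concatenation, Python's standard library also provides urllib.parse.urljoin, which resolves a relative URL against a base URL; a small sketch (not used in the rest of this tutorial):

from urllib.parse import urljoin

# urljoin resolves a relative URL against a base URL
base = 'https://books.toscrape.com/'
relative = 'catalogue/category/books/travel_2/index.html'
print(urljoin(base, relative))
# https://books.toscrape.com/catalogue/category/books/travel_2/index.html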

The partial result of running the for loop is:

Travel's URL is: https://books.toscrape.com/catalogue/category/books/travel_2/index.html
Mystery's URL is: https://books.toscrape.com/catalogue/category/books/mystery_3/index.html
Historical Fiction's URL is: https://books.toscrape.com/catalogue/category/books/historical-fiction_4/index.html

Up to now, you have successfully fetched all the categories, each with its name and the URL to its products list page. Next, we are going to fetch all the items.

STEP 5.3 Fetch All the Items by Searching HTML, Extracting Information, and Cleaning and Storing Data

Be prepared by getting some variables ready.

# Identify base URLs
base_url_for_category = 'https://books.toscrape.com/'
base_url_for_book = 'https://books.toscrape.com/catalogue'
# Get the date of scraping
scraping_date = datetime.date.today()
# Create a dictionary to convert words to digits
# We will use it when fetching rating
w_to_d = {'One': 1,
          'Two': 2,
          'Three': 3,
          'Four': 4,
          'Five': 5}
# Create a list to store all the extracted items
books_all = []

Code Snippet 8

Can you recall the workflow and the functions we mentioned above (Code Snippet 5)? Please look back if you can’t.

Let’s start by building the last function, i.e., fetch_all_books(), to integrate the workflow. To run this function, we need to pass in the books_soup object created before, which contains all the website’s HTML.

def fetch_all_books(soup):
    # Fetch all the books information
    # Return books_all, a list of dictionary that contains all the extracted data
    
    # Find all the categories by running find_categories() function
    categories = find_categories(soup)
    # Loop through categories
    for category in categories:
        # Fetch product by category
        # Within the fetch_books_by_category function, we will scrape products page by page        
        category_name = category.find('a').text.strip()
        fetch_books_by_category(category_name, category)
        
    return books_all    

Code Snippet 9

It’s easy to find categories as we have done before.

def find_categories(soup):
    # Find all the categories
    categories = soup.find('ul', {'class': 'nav nav-list'}).find('li').find('ul').find_all('li')
    return categories

Code Snippet 10

Next, we will fetch books under a single category page by page. Sometimes, it’s a bit tricky to get all the books under a category since there may be one or more pages. We can scrape the next page only if it exists! 

But how can we figure out if the current page is the last page or not? Inspect the next button!


We take the Fiction category as an example. There are a total of 4 pages. When you inspect the next button on pages 1 through 3, you can find the next page’s URL, which is under <li class="next">.

<li class="next">
    <a href="page-4.html" style="">next</a>
</li>

But on the last page, i.e., page 4, there is no next button. So we can let our web scraper try to find <li class="next"> and <a>. If it succeeds, go to the next page and fetch products; if it fails, break off the fetching work and switch to the next category.

One thing we want to highlight here is that the following three lines of code will be written every time you intend to retrieve a web page. It might be the website’s home page, the first/last/any products list page under a category, or a single product page. You will identify the URL of your target web page, create a response object to get the page’s HTML content, and create a Beautiful Soup object to parse the HTML text, repeatedly.

# Identify the target web page's address
web_page_url = 'https...'
# Create a response object to get the web page's HTML content
get_url = requests.get(web_page_url)
# Create a beautiful soup object to parse HTML text with the help of the html.parser
page_soup = soup(get_url.text, 'html.parser')

Now, we can create a function (Code Snippet 11) to fetch all the books under a category, page by page. Note: this function runs within the for loop of the fetch_all_books() function (Code Snippet 9).

def fetch_books_by_category(category_name, category):
    # Fetch books by category
    # Scrape all the books listed on one page
    # Go to the next page if the current page is not the last page
    # Break the loop at the last page

    # Get category URL, i.e., the link to the first page of books under the category
    books_page_url = base_url_for_category + category.find('a').get('href')
    # Scrape books page by page, only while a next page is available
    while True:
        # Retrieve the products list page's HTML
        get_current_page = requests.get(books_page_url)
        # Create a beautiful soup object for the current page
        current_page_soup = soup(get_current_page.text, 'html.parser')
        # Run fetch_current_page_books function to get all the products listed on the current page
        fetch_current_page_books(category_name, current_page_soup)
        # Search for the next page's URL
        # Get the next page's URL if the current page is not the last page
        try:
            find_next_page_url = current_page_soup.find('li', {'class': 'next'}).find('a').get('href')
            # Find the index of the last '/'
            index = books_page_url.rfind('/')
            # Skip the string after the last '/' and add the next page URL
            books_page_url = books_page_url[:index + 1].strip() + find_next_page_url
        except AttributeError:
            break

Code Snippet 11

When the web scraper reaches the last page, it’s impossible to find <li class="next">. As a result, current_page_soup.find('li', {'class':'next'}).find('a') will raise an AttributeError, that is, ‘NoneType’ object has no attribute ‘find’. 

We definitely don’t want the web scraper to crash just because of an avoidable error, so Exception Handling is very important here. We can catch the error raised in the try block and do whatever is defined under except. In the case above, once the missing element raises the error, the web scraper breaks out of the while loop. You can actually do anything you need in the except block. 
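
An alternative to catching the exception is to check whether find() returned None before going further; a hypothetical helper sketching that pattern (not part of Code Snippet 11):

def find_next_page_url(current_page_soup):
    # Hypothetical helper: return the next page's relative URL, or None on the last page
    next_li = current_page_soup.find('li', {'class': 'next'})
    if next_li is None:
        return None
    return next_li.find('a').get('href')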

Besides, within the try block, we acquire the next page’s URL by modifying the current page’s URL. You can get some hints by comparing them. For example: 

The URL of a landing page, i.e., the first page under a category,  is

https://books.toscrape.com/catalogue/category/books/fiction_10/index.html

The second page’s absolute URL is

https://books.toscrape.com/catalogue/category/books/fiction_10/page-2.html

The second page’s relative URL, i.e., the incomplete one, is

page-2.html

What we do above is change the last part of the URL. 
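
Here is the same string manipulation applied to the Fiction URLs above, just as a standalone illustration:

# Illustration of the URL update used in Code Snippet 11
books_page_url = 'https://books.toscrape.com/catalogue/category/books/fiction_10/index.html'
find_next_page_url = 'page-2.html'  # relative URL found under <li class="next">
# Keep everything up to and including the last '/', then append the relative URL
index = books_page_url.rfind('/')
books_page_url = books_page_url[:index + 1] + find_next_page_url
print(books_page_url)
# https://books.toscrape.com/catalogue/category/books/fiction_10/page-2.html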

The biggest section of the scraper is a function that fetches the books on one page (Code Snippet 12). We get most of the information about a book through it. In this function, every single piece of code has been discussed before.

def fetch_current_page_books(category_name, current_page_soup):
    # Fetch all the books listed on the current page
    # Build a dictionary to store extracted data
    # Append book information to the books_all list

    # Find all products listed on the current page
    # Here, we don't need to identify the class name of <li> (Do you know why?)
    current_page_books = current_page_soup.find('ol', {'class': 'row'}).find_all('li')

    # Loop through the products
    for book in current_page_books:
        # Extract book info of interest

        # Get book title
        # Replace get('title') with text to see what will happen
        title = book.find('h3').find('a').get('title').strip()

        # Get book price
        price = book.find('p', {'class': 'price_color'}).text.strip()

        # Get in-stock info
        instock = book.find('p', {'class': 'instock availability'}).text.strip()

        # Get rating
        # get('class') returns a list, e.g., ['star-rating', 'Two'], so we slice it to keep the rating only
        rating_in_words = book.find('p').get('class')[1]
        rating = w_to_d[rating_in_words]

        # Get link
        link = book.find('h3').find('a').get('href').strip()
        link = base_url_for_book + link.replace('../../..', '')

        # Get more info about a book by running fetch_more_info function
        product_info = fetch_more_info(link)

        # Create a book dictionary to store the book's info
        book = {
            'scraping_date': scraping_date,
            'book_title': title,
            'category': category_name,
            'price': price,
            'rating': rating,
            'instock': instock,
            # Suppose we're only interested in availability and UPC
            'availability': product_info['Availability'],
            'UPC': product_info['UPC'],
            'link': link
        }
        # Append book dictionary to books_all list
        books_all.append(book)

Code Snippet 12

We do some data cleaning in this function. For example, we use the strip() method to remove spaces at the beginning and end of a string, and the w_to_d dictionary created earlier converts the rating from words, i.e., One, Two, Three, Four, and Five, to digits, i.e., 1, 2, 3, 4, and 5. This simple data cleaning work makes your scraped data ready to use. 

You may be wondering why we did not process the price by removing the currency symbol £. Actually, we can, and it’s better to do so. But before that, we need to confirm all the prices are in GBP. If that is not the case, the extracted digit-only price will be misleading.
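
If you do decide to keep only the numeric part, a possible extra cleaning step (assuming every price really is in GBP) would be:

# Possible extra cleaning step, assuming all prices are in GBP
price = '£15.08'  # as extracted in Code Snippet 12
price_numeric = float(price.replace('£', ''))
print(price_numeric)  # 15.08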

Last but not least, the fetch_more_info() function. The main purpose of this function is to fetch the product information table on the single product page. It is common for a web scraper to go deeper to fetch more information as needed.

def fetch_more_info(link):
    # Go to the single product page to get more info

    # Retrieve the web page
    get_url = requests.get(link)
    # Create a beautiful soup object for the book
    book_soup = soup(get_url.text, 'html.parser')

    # Find the rows of the product information table
    book_table = book_soup.find('table', {'class': 'table table-striped'}).find_all('tr')
    # Build a dictionary to store the information in the table
    product_info = {}
    # Loop through the table rows
    for info in book_table:
        # Use the header cell as key
        key = info.find('th').text.strip()
        # Use the data cell as value
        value = info.find('td').text.strip()
        product_info[key] = value

    # Extract the number from availability using Regular Expressions
    text = product_info['Availability']
    # Reassign the number to availability
    product_info['Availability'] = re.findall(r'(\d+)', text)[0]

    return product_info

We use Regular Expressions to extract the number of available books. Regular Expressions can be used to find strings matching a particular pattern in text. Here, r'(\d+)' matches one or more digits. We can find 12 in ‘In stock (12 available)’, and we can also find 0 in ‘Out of stock (0 available)’. 
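
You can verify the pattern on its own, outside the scraper (re was already imported in Code Snippet 2):

# Standalone check of the availability pattern
print(re.findall(r'(\d+)', 'In stock (12 available)'))    # ['12']
print(re.findall(r'(\d+)', 'Out of stock (0 available)'))  # ['0']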

We have got all we want by running the five functions. And now, we can save the result with joy.

STEP 6: Save File

At the end of scraping, all the extracted data is stored in a list of dictionaries, which can be saved as a CSV, JSON, or Excel file as you prefer (Code Snippet 13). It’s better to remove duplicates at the same time.

def output(books_list):
    # Convert the list with scraped data to a data frame, drop the duplicates, and save the output as a CSV file

    # Convert the list to a data frame, drop the duplicates
    books_df = pd.DataFrame(books_list).drop_duplicates()
    print(f'There are {len(books_df)} books in total.')
    # Save the output as a CSV file
    books_df.to_csv(f'books_scraper_{scraping_date}.csv', index=False)

Code Snippet 13
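
With all five functions plus output() defined, one possible way to run the whole pipeline end to end (assuming the books_soup object from Code Snippet 3) is:

# Run the whole pipeline (assumes books_soup and the functions above are defined)
books_all = fetch_all_books(books_soup)
output(books_all)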

Take a look at the final result:

Result

Summary:

In this tutorial, we built a web scraper in Python using Beautiful Soup and requests. The steps are as follows:

  • Step 1: Identify your goal and explore the website of interest
  • Step 2: Inspect web page’s HTML
  • Step 3: Install and import libraries
  • Step 4: Retrieve website and parse HTML
  • Step 5: Extract, clean, and store data
    • Step 5.1: Get ready by answering questions
    • Step 5.2: Start by fetching a single variable
    • Step 5.3: Fetch all the items through searching for HTML, extracting information, cleaning and storing data
  • Step 6: Save File

In addition to the pipeline and fundamental skills, we also emphasize the way of thinking, that is:

  • Ask yourself a bunch of questions before scraping

Some tips for web scraping:

  • Real-world websites change constantly. After practicing your scraping skills on a durable demo website, it’s time to move on to real-world websites. Keep in mind that it’s normal to find that your web scraper worked well yesterday but crashes today because the website changed. Fortunately, most changes are small, and you just need to modify your scraper a little.
  • Each website is unique. If you want to scrape products from different retail websites, you have to build a customized web scraper for each retailer. 
  • Modern websites are usually dynamic websites, customized to the client’s browser. When interacting with such websites, web scrapers also need to be smart enough. For example, a Selenium-equipped web scraper can automatically accept or decline cookies, maximize or scroll down the window to display all the content, and so on. 

Learn by doing and good luck! 

This article was written by Yuan Yin and Mulugheta T. SOLOMON.
