Summary and Schedule

This is a new lesson built with The Carpentries Workbench.

Download files required for the lesson

00h 00m

What’s behind a website, and how can I extract information from it?
What ethical and legal considerations should I keep in mind before scraping a website?

00h 50m

2. Scraping a real website

How can I get the data and information from a real website?
How can I start automating my web scraping tasks?

02h 05m

3. Dynamic websites

What are the differences between static and dynamic websites?
Why is it important to understand these differences when doing web scraping?
How can I start my own web scraping project?

02h 40m

Finish

The actual schedule may vary slightly depending on the topics and exercises chosen by the instructor.

In this workshop, you’ll learn web scraping: how to extract data from websites using Python.

Episode 1 begins with an introduction to how websites are structured using HTML. You’ll learn how to explore this structure using your browser and how to extract information from it using the BeautifulSoup package.
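As a rough preview of what Episode 1 covers, the sketch below parses a small HTML string with BeautifulSoup and pulls out a few elements. The HTML content, the `workshop` class name, and the variable names are made up for illustration and are not taken from the lesson materials.

```python
from bs4 import BeautifulSoup

# A small, invented HTML document used only for illustration.
html = """
<html>
  <head><title>Sample page</title></head>
  <body>
    <h1>Workshop list</h1>
    <ul>
      <li class="workshop"><a href="/python">Python basics</a></li>
      <li class="workshop"><a href="/scraping">Web scraping</a></li>
    </ul>
  </body>
</html>
"""

# Parse the string with Python's built-in HTML parser.
soup = BeautifulSoup(html, "html.parser")

# Text of the <title> tag and of the first <h1> tag.
page_title = soup.title.get_text()
heading = soup.find("h1").get_text()

# Collect the link text and target from each list item with class "workshop".
workshops = []
for item in soup.find_all("li", class_="workshop"):
    link = item.find("a")
    workshops.append({"name": link.get_text(), "link": link["href"]})

print(page_title)
print(heading)
print(workshops)
```

The tag names and classes you search for depend on the page, which is why the episode starts by exploring a page's structure in the browser.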

In Episode 2, you’ll learn how to retrieve the HTML of a webpage using the requests package and continue practicing how to parse and extract specific content with BeautifulSoup.
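The workflow in Episode 2 might look something like the sketch below: one function downloads a page with requests, another parses the result with BeautifulSoup. The function names `fetch_html` and `extract_links`, the 10-second timeout, and the sample HTML are illustrative assumptions, not part of the lesson.

```python
import requests
from bs4 import BeautifulSoup


def fetch_html(url, timeout=10):
    """Download a page and return its HTML as text (hypothetical helper)."""
    response = requests.get(url, timeout=timeout)
    response.raise_for_status()  # raise an exception for 4xx/5xx responses
    return response.text


def extract_links(html):
    """Return (text, href) pairs for every <a> tag that has an href attribute."""
    soup = BeautifulSoup(html, "html.parser")
    return [
        (a.get_text(strip=True), a["href"])
        for a in soup.find_all("a", href=True)
    ]


# Demonstrate the parsing step on a static string, so no network access is needed.
sample = '<p><a href="https://carpentries.org">The Carpentries</a> <a>no link</a></p>'
print(extract_links(sample))
```

Passing the return value of `fetch_html` for a real URL to `extract_links` would combine both steps. Keeping download and parsing separate also makes it easy to test the parsing on saved HTML without requesting the site again.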

Toward the end of the workshop, in Episode 3, we’ll explore the difference between static and dynamic webpages, and how to scrape dynamic content using Selenium.

This workshop is intended for learners who already have a basic understanding of Python. In particular, you should be comfortable with:

Installing and importing packages and modules
Using lists and dictionaries
Using conditional statements (if, elif, else)
Using for loops
Calling functions and understanding parameters, arguments, and return values
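As an informal self-check, the short example below combines most of the skills listed above: a function with a parameter and a return value, a list, a dictionary, a for loop, and conditional statements. The function name and the sample data are invented for illustration. If you can follow what it does, you are likely ready for the workshop.

```python
def count_by_status(status_codes):
    """Group HTTP status codes into broad categories and count each one."""
    counts = {}
    for code in status_codes:
        if code < 300:
            label = "ok"
        elif code < 400:
            label = "redirect"
        else:
            label = "error"
        # Add one to this label's count, starting from 0 if it is new.
        counts[label] = counts.get(label, 0) + 1
    return counts


codes = [200, 404, 200, 301, 500]
result = count_by_status(codes)
print(result)
```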

Software Setup

Steps:

1. If you already have Anaconda, Jupyter Lab, or Jupyter Notebook installed on your computer, skip to step 2. Otherwise, follow Miniforge’s download and installation instructions for your operating system. If you are using a Windows machine, make sure you select the option “Add Miniforge3 to my PATH environment variable”.
2. If you are using Mac or Linux, open the ‘Terminal’. If you are using Windows, open the ‘Command Prompt’ or ‘Miniforge Prompt’.
3. Activate the base conda environment by running:

conda activate

4. Install the necessary packages by running:

pip install requests beautifulsoup4 selenium webdriver-manager pandas tqdm jupyterlab

5. Start Jupyter Lab by running:

jupyter lab

6. In a new Jupyter Notebook, run the following code in a cell to check that the necessary libraries can be loaded:


from bs4 import BeautifulSoup
import requests
from selenium import webdriver
from selenium.webdriver.common.by import By
import pandas as pd
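If one of the imports fails, it can help to see which packages from the install command are actually present in the active environment. The sketch below uses the standard library's `importlib.metadata` to look up installed versions. The helper name `installed_versions` is an assumption made for this example, and the package list simply mirrors the `pip install` command in the setup steps.

```python
from importlib.metadata import PackageNotFoundError, version

# Distribution names as they appear in the pip install command.
PACKAGES = [
    "requests",
    "beautifulsoup4",
    "selenium",
    "webdriver-manager",
    "pandas",
    "tqdm",
    "jupyterlab",
]


def installed_versions(packages):
    """Map each package name to its installed version, or None if missing."""
    found = {}
    for name in packages:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = None
    return found


for name, ver in installed_versions(PACKAGES).items():
    print(f"{name}: {ver or 'NOT INSTALLED'}")
```

Any package reported as not installed can be added by rerunning the `pip install` command in the same environment that Jupyter Lab was started from.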

Additional resources

Mitchell, R. (2024). Web Scraping with Python: Data Extraction from the Modern Web (3rd ed.). O’Reilly Media.
Chapagain, A. (2023). Hands-On Web Scraping with Python: Extract Quality Data from the Web Using Effective Python Techniques (2nd ed.). Packt Publishing.