Summary and Schedule
This is a new lesson built with The Carpentries Workbench.
| Setup Instructions | Download files required for the lesson | |
| Duration: 00h 00m | 1. Hello-Scraping | What’s behind a website, and how can I extract information from it? What ethical and legal considerations should I keep in mind before scraping a website? |
| Duration: 00h 50m | 2. Scraping a real website | How can I get the data and information from a real website? How can I start automating my web scraping tasks? |
| Duration: 02h 05m | 3. Dynamic websites | What are the differences between static and dynamic websites? Why is it important to understand these differences when doing web scraping? How can I start my own web scraping project? |
| Duration: 02h 40m | Finish | |
The actual schedule may vary slightly depending on the topics and exercises chosen by the instructor.
In this workshop, you’ll learn how to extract data from websites using Python — a process known as web scraping.
Episode 1 begins with an introduction to how websites are structured
using HTML. You’ll learn how to explore this structure using your
browser and how to extract information from it using the
BeautifulSoup package.
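As a small preview of that idea, the sketch below parses a short, made-up HTML snippet with BeautifulSoup and pulls out the page title and the links. The HTML content and the variable names are illustrative assumptions, not material from the episode itself.

```python
from bs4 import BeautifulSoup

# A tiny, hypothetical HTML document used only for illustration.
html = """
<html>
  <head><title>Sample page</title></head>
  <body>
    <h1>Workshops</h1>
    <ul>
      <li><a href="/python">Python basics</a></li>
      <li><a href="/scraping">Web scraping</a></li>
    </ul>
  </body>
</html>
"""

# Parse the text into a navigable tree using Python's built-in HTML parser.
soup = BeautifulSoup(html, "html.parser")

# Extract the text of the <title> tag.
title = soup.title.get_text()

# Collect the visible text and the href attribute of every <a> tag.
links = [(a.get_text(), a["href"]) for a in soup.find_all("a")]

print(title)
print(links)
```

Here `find_all("a")` returns every link element in document order, and indexing a tag with `["href"]` reads one of its attributes.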
In Episode 2, you’ll learn how to retrieve the HTML of a webpage
using the requests package and continue practicing how to
parse and extract specific content with BeautifulSoup.
Toward the end of the workshop, in Episode 3, we’ll explore the
difference between static and dynamic webpages, and how to scrape
dynamic content using Selenium.
This workshop is intended for learners who already have a basic understanding of Python. In particular, you should be comfortable with:
- Installing and importing packages and modules
- Using lists and dictionaries
- Using conditional statements (if, else, elif)
- Using for loops
- Calling functions, and understanding parameters/arguments and return values
Software Setup
To run the code in this workshop, you will need to install:
- The following Python libraries: requests, beautifulsoup4, selenium, webdriver-manager, pandas, tqdm, jupyterlab.
- Google Chrome: Please install the latest version of the Google Chrome web browser, as we’ll use its web developer tools. If you already have it, please check for updates by visiting chrome://settings/help in Chrome.
If you already have a preferred workflow for managing Python
environments (e.g., Conda or venv), you may proceed as you normally do.
However, if you are new to this or want a hassle-free setup, we highly
recommend following the pixi instructions below.
Setting up your environment with pixi
As described on their website, pixi is a cross-platform,
multi-language (including Python and R) package manager and workflow
tool built on the foundation of the conda ecosystem. In short, it is a
tool that simplifies installing software and managing libraries
(packages).
Steps to configure your workshop environment:
- Install pixi: Follow the instructions for your operating system here https://pixi.prefix.dev/latest/installation/.
  - Note: Once the installation finishes, restart your Terminal (close it and open it again) to make sure the pixi command is recognized.
- Navigate to your folder: In your Terminal, use the cd command to move to the folder where you want to keep your workshop files (e.g., cd Desktop or cd Documents).
- Initialize the project: Run the following command to create a new folder named webscraping with the necessary configuration files
- Enter the folder: Move into the newly created project folder
- Install libraries: Run this command to install Python and all the required tools (this may take a minute)
- Start JupyterLab: Launch the notebook interface by running
- Verify your setup: Inside JupyterLab, create a new Notebook (File > New > Notebook), copy the code below into a cell, and run it by pressing Shift+Enter
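One possible check is sketched below. It is not necessarily the workshop's official verification cell: it assumes the package list from the Software Setup section, and `REQUIRED` and `check_packages` are illustrative names. It looks up the installed version of each library using the standard library's `importlib.metadata`, without importing the libraries themselves.

```python
from importlib.metadata import PackageNotFoundError, version

# Package names taken from the Software Setup list (assumed to match
# the names under which the packages are installed).
REQUIRED = [
    "requests",
    "beautifulsoup4",
    "selenium",
    "webdriver-manager",
    "pandas",
    "tqdm",
    "jupyterlab",
]


def check_packages(names):
    """Return a dict mapping each name to its installed version, or None."""
    found = {}
    for name in names:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            # The package is not installed in the current environment.
            found[name] = None
    return found


# Print one line per package so missing ones are easy to spot.
for name, ver in check_packages(REQUIRED).items():
    print(f"{name}: {ver if ver else 'NOT INSTALLED'}")
```

If any line reports NOT INSTALLED, the notebook is probably not running inside the environment you just created.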
You are now ready for the workshop! Learn more about pixi by reading their documentation.
Additional resources
- Mitchell, R. E. (2024). Web Scraping with Python: Data Extraction from the Modern Web (3rd ed.). O’Reilly Media.
- Chapagain, A. (2023). Hands-On Web Scraping with Python: Extract Quality Data from the Web Using Effective Python Techniques (2nd ed.). Packt Publishing.