Summary and Schedule
This is a new lesson built with The Carpentries Workbench.
Setup Instructions | Download files required for the lesson
Duration: 00h 00m | 1. Hello-Scraping | What’s behind a website, and how can I extract information from it? What ethical and legal considerations should I keep in mind before scraping a website?
Duration: 00h 50m | 2. Scraping a real website | How can I get the data and information from a real website? How can I start automating my web scraping tasks?
Duration: 02h 05m | 3. Dynamic websites | What are the differences between static and dynamic websites? Why is it important to understand these differences when doing web scraping? How can I start my own web scraping project?
Duration: 02h 40m | Finish
The actual schedule may vary slightly depending on the topics and exercises chosen by the instructor.
In this workshop, you’ll learn how to extract data from websites using Python — a process known as web scraping.
Episode 1 begins with an introduction to how websites are structured using HTML. You’ll learn how to explore this structure using your browser and how to extract information from it using the BeautifulSoup package.
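To give a flavor of what Episode 1 covers, here is a minimal sketch of parsing HTML with BeautifulSoup. The HTML string, its title, and its links are made up for illustration and are not taken from the lesson; the example assumes the `beautifulsoup4` package is installed and uses Python's built-in `html.parser`.

```python
from bs4 import BeautifulSoup

# A tiny, made-up HTML document standing in for a real page (hypothetical content).
html = """
<html>
  <head><title>Example page</title></head>
  <body>
    <h1>Workshops</h1>
    <ul>
      <li><a href="/python">Python basics</a></li>
      <li><a href="/scraping">Web scraping</a></li>
    </ul>
  </body>
</html>
"""

# Parse the string into a tree that can be searched by tag name.
soup = BeautifulSoup(html, "html.parser")

# Text of the <title> element.
title = soup.title.get_text()

# Link text and target for every <a> element, in document order.
links = [(a.get_text(), a["href"]) for a in soup.find_all("a")]

print(title)
print(links)
```

The same two ideas, finding elements by tag and reading their text or attributes, carry over to pages retrieved from the web.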
In Episode 2, you’ll learn how to retrieve the HTML of a webpage using the requests package and continue practicing how to parse and extract specific content with BeautifulSoup.
Toward the end of the workshop, in Episode 3, we’ll explore the difference between static and dynamic webpages, and how to scrape dynamic content using Selenium.
This workshop is intended for learners who already have a basic understanding of Python. In particular, you should be comfortable with:
- Installing and importing packages and modules
- Using lists and dictionaries
- Using conditional statements (if, else, elif)
- Using for loops
- Calling functions and understanding parameters/arguments and return values
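As a rough self-check, the short sketch below touches each of these skills. If you can read it and predict what it prints, you are likely ready for the workshop. The function name `label_length` and the 45 and 60 minute thresholds are hypothetical choices for illustration; the episode lengths are derived from the start times in the schedule.

```python
import math  # importing a module from the standard library


def label_length(minutes):
    """Return 'short', 'medium', or 'long' for an episode length in minutes."""
    # The thresholds are arbitrary, chosen only for this example.
    if minutes < 45:
        return "short"
    elif minutes <= 60:
        return "medium"
    else:
        return "long"


# A list of dictionaries, one per episode.
episodes = [
    {"title": "Hello-Scraping", "minutes": 50},
    {"title": "Scraping a real website", "minutes": 75},
    {"title": "Dynamic websites", "minutes": 35},
]

labels = []
total_minutes = 0
for episode in episodes:
    # Call the function with an argument and store its return value.
    labels.append(label_length(episode["minutes"]))
    total_minutes += episode["minutes"]

# Round the total up to whole hours.
total_hours = math.ceil(total_minutes / 60)

print(labels)
print(total_hours)
```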
Software Setup
Steps:
1. If you already have Anaconda, JupyterLab, or Jupyter Notebook installed on your computer, skip to step 2. Otherwise, follow Miniforge’s download and installation instructions for your operating system. If you are using a Windows machine, make sure you select the option “Add Miniforge3 to my PATH environment variable”.
2. If you are using Mac or Linux, open the ‘Terminal’. If you are using Windows, open the ‘Command Prompt’ or ‘Miniforge Prompt’.
3. Activate the base conda environment by running:
    conda activate
4. Install the necessary packages by running:
    pip install requests beautifulsoup4 selenium webdriver-manager pandas tqdm jupyterlab
5. Start JupyterLab by running:
    jupyter lab
6. In a new Jupyter Notebook, run the following code in a cell to check that the necessary libraries can be loaded:
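The lesson's own check cell is not reproduced here. As a minimal sketch of one way to do this check, the cell below looks up each package without fully importing it and reports any that are missing. The helper name `find_missing` is hypothetical. Note that two packages are imported under different names than they are installed with: `beautifulsoup4` is imported as `bs4`, and `webdriver-manager` as `webdriver_manager`.

```python
import importlib.util

# Import names for the packages installed in the previous step.
# beautifulsoup4 is imported as bs4, webdriver-manager as webdriver_manager.
REQUIRED_MODULES = [
    "requests",
    "bs4",
    "selenium",
    "webdriver_manager",
    "pandas",
    "tqdm",
]


def find_missing(modules):
    """Return the module names that cannot be located in this environment."""
    return [name for name in modules if importlib.util.find_spec(name) is None]


missing = find_missing(REQUIRED_MODULES)
if missing:
    print("Missing packages:", ", ".join(missing))
else:
    print("All required packages are available.")
```

If any package is reported as missing, rerunning the `pip install` command from step 4 in the same environment is the usual first thing to try.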
Additional resources
- Mitchell, R. (2024). Web Scraping with Python: Data Extraction from the Modern Web (3rd ed.). O’Reilly Media, Inc.
- Chapagain, A. (2023). Hands-On Web Scraping with Python: Extract Quality Data from the Web Using Effective Python Techniques (2nd ed.). Packt Publishing.