Hello-Scraping


  • Every website is built on an HTML document that structures its content.
  • An HTML document is composed of elements, usually defined by an opening <tag> and a closing </tag>.
  • Elements can have attributes that define their properties, written as <tag attribute_name="value">.
  • We can parse an HTML document using BeautifulSoup() and search for elements with the .find() and .find_all() methods (see the sketch after this list).
    • We can extract the text inside an element with .get_text() and access attribute values using .get("attribute_name").
  • Always review and respect a website’s Terms of Service (TOS) before scraping its content.
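
For instance, a minimal sketch of these methods on an invented HTML snippet:

```python
from bs4 import BeautifulSoup

# A small, made-up HTML snippet to parse
html = """
<html>
  <body>
    <h1>Hello-Scraping</h1>
    <a href="https://example.org" class="link">First link</a>
    <a href="https://example.org/about" class="link">Second link</a>
  </body>
</html>
"""

soup = BeautifulSoup(html, "html.parser")

first_link = soup.find("a")        # first matching element
all_links = soup.find_all("a")     # list of all matching elements

print(first_link.get_text())       # "First link"
print(first_link.get("href"))      # "https://example.org"
print(len(all_links))              # 2
```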

Scraping a real website


  • Use the requests package with requests.get('website_url').text to retrieve the HTML content of a website (combined with the navigation and delay steps in the sketch after this list).
  • In your web browser, you can explore the HTML structure and identify elements of interest using the “View Page Source” and “Inspect” tools.
  • An HTML document is a nested tree of elements; navigate it by accessing an element’s children (.contents), parent (.parent), and siblings (.next_sibling, .previous_sibling).
  • To avoid overwhelming a website’s server, add delays between requests using the sleep() function from Python’s built-in time module.
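
A minimal sketch combining these steps; the URLs are placeholders for pages you are permitted to scrape:

```python
import time

import requests
from bs4 import BeautifulSoup

# Placeholder URLs; substitute pages whose ToS permits scraping
urls = ["https://example.org/page1", "https://example.org/page2"]

for url in urls:
    html = requests.get(url).text              # raw HTML of the page
    soup = BeautifulSoup(html, "html.parser")

    # Navigate the element tree: children, parent, and siblings
    body = soup.find("body")
    if body is not None:
        print(body.contents)                   # direct children of <body>
        print(body.parent.name)                # parent element: "html"
        print(body.next_sibling)               # node following <body>, if any

    time.sleep(1)  # pause between requests so we don't overwhelm the server
```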

Dynamic websites


  • Dynamic websites load content using JavaScript, so the data may not be present in the initial HTML. It’s important to distinguish between static and dynamic content when planning your scraping approach.
  • The Selenium package and its webdriver module simulate a real browser, allowing you to execute JavaScript and interact with the page as a user would: clicking, scrolling, or filling out forms.
  • Key Selenium commands (demonstrated in the sketch at the end of this section):
    • webdriver.Chrome(): Launch an automated Chrome browser
    • .get("website_url"): Visit a given website
    • .find_element(by, value) and .find_elements(by, value): Locate one or multiple elements
    • .click(): Click a selected element
    • .page_source: Retrieve the full HTML after JavaScript execution
    • .quit(): Close the browser
  • The browser’s “Inspect” tool allows users to view the HTML document after dynamic content has loaded. This is useful for identifying which elements contain the data you want to scrape.
  • A typical web scraping pipeline:
    1. Understand the website structure.
    2. Determine whether the content is static or dynamic.
    3. Choose the right tools (requests + BeautifulSoup for static pages, Selenium for dynamic ones).
    4. Extract and clean the data.
    5. Store the data in a structured format.
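
A minimal sketch of the Selenium commands listed above, in sequence; the URL and the CSS selectors are hypothetical, and it assumes Chrome and a matching ChromeDriver are available:

```python
from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Chrome()        # launch an automated Chrome browser
driver.get("https://example.org")  # placeholder URL: visit the page
driver.implicitly_wait(5)          # give dynamic content time to load

# Hypothetical "load more" button; replace the selector with one from the real page
button = driver.find_element(By.CSS_SELECTOR, "button.load-more")
button.click()

# Hypothetical result elements rendered by JavaScript after the click
rows = driver.find_elements(By.CSS_SELECTOR, "div.result")
print(len(rows))

html = driver.page_source          # full HTML after JavaScript execution
driver.quit()                      # always close the browser when finished
```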
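To illustrate the last two pipeline steps (extract and clean, then store) on a static page, here is a sketch that collects link text and URLs and writes them to CSV; the target tags and the output file name are choices made for this example:

```python
import csv

import requests
from bs4 import BeautifulSoup

url = "https://example.org"  # placeholder; use a page whose ToS permits scraping
soup = BeautifulSoup(requests.get(url).text, "html.parser")

# Extract and lightly clean the data; <a> tags are just the example target here
records = []
for link in soup.find_all("a"):
    records.append({
        "text": link.get_text(strip=True),  # strip=True trims surrounding whitespace
        "url": link.get("href") or "",      # fall back to "" if href is missing
    })

# Store the results in a structured format (CSV)
with open("links.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.DictWriter(f, fieldnames=["text", "url"])
    writer.writeheader()
    writer.writerows(records)
```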