Skip to content

Latest commit

 

History

History
261 lines (215 loc) · 8.72 KB

03-IntroductionToCSS.md

File metadata and controls

261 lines (215 loc) · 8.72 KB

Introduction to CSS

CSS (Cascading Style Sheets) is used to control the presentation of web pages, including layout, colors, fonts, and other visual aspects. Understanding CSS is important for web scraping because it helps you identify and select the elements you need to scrape, especially when the structure is complex.

CSS Basics: Selectors, Properties, and Values

Selectors

Selectors are patterns used to select the elements you want to style. There are various types of selectors, each with its specific use case.

  1. Type Selector

    • Selects all elements of a given type.
    p {
        color: blue;
    }
    • This will turn all <p> elements blue.
  2. Class Selector

    • Selects all elements with a specific class attribute.
    .highlight {
        background-color: yellow;
    }
    • This will apply a yellow background to all elements with the class highlight.
  3. ID Selector

    • Selects the element with a specific id attribute. Note that IDs should be unique within a page.
    #header {
        font-size: 24px;
    }
    • This will set the font size of the element with the id header to 24px.
  4. Attribute Selector

    • Selects elements based on an attribute or attribute value.
    [type="text"] {
        border: 1px solid black;
    }
    • This will apply a border to all input elements of type text.
  5. Descendant Selector

    • Selects all elements that are descendants of a specified element.
    div p {
        margin: 10px;
    }
    • This will apply a margin to all <p> elements that are within a <div>.
  6. Child Selector

    • Selects all elements that are direct children of a specified element.
    ul > li {
        list-style-type: none;
    }
    • This will remove the bullet points from list items that are direct children of an unordered list.

Properties and Values

Properties are the aspects of the elements you want to change, and values are the settings you apply to these properties.

  1. Color and Background

    h1 {
        color: red;
        background-color: lightgray;
    }
    • This sets the text color of all <h1> elements to red and their background to light gray.
  2. Font

    p {
        font-family: Arial, sans-serif;
        font-size: 16px;
    }
    • This sets the font family of all <p> elements to Arial (with sans-serif as a fallback) and the font size to 16px.
  3. Margin and Padding

    div {
        margin: 20px;
        padding: 10px;
    }
    • This sets the margin around all <div> elements to 20px and the padding inside them to 10px.
  4. Border

    img {
        border: 2px solid black;
    }
    • This applies a 2px solid black border to all <img> elements.
  5. Display and Visibility

    .hidden {
        display: none;
    }
    .invisible {
        visibility: hidden;
    }
    • This hides elements with the class hidden completely (they don't take up space on the page), while elements with the class invisible are hidden but still take up space.

How CSS Affects Web Scraping

While CSS primarily affects the visual presentation of web pages, it can also impact web scraping in several ways:

Element Visibility

  • Elements styled with display: none or visibility: hidden may still be present in the HTML but are not visible to the user. When scraping, you might need to decide whether to include or ignore these elements.
  • Example:
    <div class="hidden">This content is hidden</div>
    .hidden {
        display: none;
    }
    • The <div> with class hidden will not be visible on the webpage, but it will still be present in the HTML source.

Dynamic Content

  • CSS can be used in combination with JavaScript to dynamically show or hide content. For example, a menu might be hidden by default and shown when a user clicks a button. Scraping such dynamic content may require using tools like Selenium to interact with the page.
  • Example:
    <button onclick="showMenu()">Show Menu</button>
    <div id="menu" style="display: none;">Menu content</div>
    #menu {
        display: none;
    }

Element Styling for Identification

  • Understanding CSS can help you write more precise selectors for scraping. For example, if you know an element has a unique class or id, you can use that in your scraping script to accurately target the data.

  • Example:

    <p class="price">$19.99</p>
    • You can use the class price to select this paragraph when scraping.

Example: Scraping Blog Post Data

Imagine you want to scrape information from a blog to gather the titles, authors, and publication dates of blog posts. Each blog post on the webpage has the following HTML structure:

<div class="blog-post" id="post-001">
    <h2 class="post-title">How to Learn Python</h2>
    <p class="post-author">By Jane Doe</p>
    <time class="post-date" datetime="2023-06-01">June 1, 2023</time>
</div>
<div class="blog-post" id="post-002">
    <h2 class="post-title">Web Scraping with BeautifulSoup</h2>
    <p class="post-author">By John Smith</p>
    <time class="post-date" datetime="2023-05-20">May 20, 2023</time>
</div>
<!-- More blog posts -->

You need to extract the following data for each blog post:

  • Title
  • Author
  • Publication Date

Here's a Python script using BeautifulSoup to scrape this data:

from bs4 import BeautifulSoup

# Sample HTML content (in a real scenario, you'd fetch this with the requests library)
html_content = '''
<div class="blog-post" id="post-001">
    <h2 class="post-title">How to Learn Python</h2>
    <p class="post-author">By Behnam</p>
    <time class="post-date" datetime="2024-06-01">June 1, 2023</time>
</div>
<div class="blog-post" id="post-002">
    <h2 class="post-title">Web Scraping with BeautifulSoup</h2>
    <p class="post-author">By Amir</p>
    <time class="post-date" datetime="2024-05-20">May 20, 2023</time>
</div>
<!-- More blog posts -->
'''

# Parse the HTML content
soup = BeautifulSoup(html_content, 'html.parser')

# Find all blog post elements
blog_posts = soup.find_all('div', class_='blog-post')

# Extract and print blog post details
for post in blog_posts:
    title = post.find('h2', class_='post-title').text
    author = post.find('p', class_='post-author').text.replace('By ', '')
    date = post.find('time', class_='post-date')['datetime']

    print(f'Title: {title}')
    print(f'Author: {author}')
    print(f'Date: {date}')

Explanation:

  1. Parse HTML Content: Using BeautifulSoup to parse the HTML content.

    soup = BeautifulSoup(html_content, 'html.parser')
  2. Find Blog Post Elements: Locate all <div> elements with the class blog-post.

    blog_posts = soup.find_all('div', class_='blog-post')
  3. Extract and Print Blog Post Details:

    • Title: Extract the text inside the <h2 class='post-title'> element.
      title = post.find('h2', class_='post-title').text
    • Author: Extract the text inside the <p class='post-author'> element and remove the "By " prefix.
      author = post.find('p', class_='post-author').text.replace('By ', '')
    • Date: Extract the datetime attribute from the <time class='post-date'> element.
      date = post.find('time', class_='post-date')['datetime']

The output of this script will be:

Title: How to Learn Python
Author: Behnam
Date: 2024-06-01

Title: Web Scraping with BeautifulSoup
Author: Amir
Date: 2024-05-20

Understanding the HTML and CSS

  • HTML Structure: The structure of each blog post is defined by the div elements with the class blog-post. Inside each div, there are h2 elements for titles, p elements for authors, and time elements for dates.
  • CSS Selectors: The script uses CSS selectors (class_='post-title', class_='post-author', and class_='post-date') to target specific elements within each blog post.

You have successfully scraped the title, author, and publication date for each blog post using BeautifulSoup. This practical example demonstrates how to navigate and extract data from a structured HTML document using CSS selectors.


By understanding CSS, you can better navigate the structure of web pages and write more effective web scraping scripts. You can accurately select the elements you need, handle dynamic content, and interpret the presence or absence of elements based on their styles.