IT-INFO

Where IT meets innovation, IT-INFO lights the path to technological brilliance

Crawlee for Python: Quick Start Guide

Learn how to start web scraping with Crawlee and Python

Introduction

Crawlee is a versatile and modern web scraping framework for Python, designed to make crawling and scraping websites easy and efficient. In this guide, we'll explore the basics of using Crawlee, starting from installation and moving to a practical example.

Prerequisites

Before you get started, make sure you have the following installed:

  • Python 3.9 or newer (Crawlee for Python requires at least 3.9)
  • Basic knowledge of Python programming
  • Familiarity with web scraping (optional)

To install Crawlee together with Playwright support (needed for the PlaywrightCrawler used below), run the following commands. The second one downloads the browser binaries Playwright drives:

pip install 'crawlee[playwright]'
playwright install

Getting Started with Crawlee

Let's walk through a simple example where we will scrape the title and meta description from a webpage.

import asyncio

# Import the crawler and its context type from Crawlee
from crawlee.playwright_crawler import PlaywrightCrawler, PlaywrightCrawlingContext


async def main() -> None:
    # Create the crawler
    crawler = PlaywrightCrawler()

    # Register an asynchronous handler that runs for every crawled page
    @crawler.router.default_handler
    async def handle_request(context: PlaywrightCrawlingContext) -> None:
        page = context.page
        title = await page.title()
        description = await page.query_selector("meta[name='description']")
        description_content = (
            await description.get_attribute('content') if description else 'No description'
        )

        print(f"Page Title: {title}")
        print(f"Meta Description: {description_content}")

    # Run the crawler, passing the list of start URLs
    await crawler.run(['https://example.com'])


asyncio.run(main())

Explanation of the Code

  • PlaywrightCrawler: Crawlee integrates with Playwright, making it easy to handle modern web pages, even those that rely on JavaScript.
  • @crawler.router.default_handler: registers handle_request as the default request handler, an asynchronous function that runs for each crawled page. In this example, we retrieve the page's title and meta description.
  • crawler.run(['https://example.com']): starts the crawler with a list of start URLs; further URLs can be queued with crawler.add_requests().
  • asyncio.run(main()): Crawlee is fully asynchronous, so the whole program runs inside an asyncio event loop.
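A handler can also queue the links it discovers, which is how a crawl grows beyond a single page. The following is a minimal sketch, assuming Crawlee's context.enqueue_links() helper (which queues links found on the current page) and the max_requests_per_crawl option as a safety cap; check the official docs for your installed version:

import asyncio

from crawlee.playwright_crawler import PlaywrightCrawler, PlaywrightCrawlingContext


async def main() -> None:
    # Cap the crawl so this example terminates quickly.
    crawler = PlaywrightCrawler(max_requests_per_crawl=10)

    @crawler.router.default_handler
    async def handle_request(context: PlaywrightCrawlingContext) -> None:
        print(f"Visiting: {context.request.url}")
        # Queue the links found on the current page for crawling.
        await context.enqueue_links()

    await crawler.run(['https://example.com'])


asyncio.run(main())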

Further Customization

Crawlee provides multiple options to enhance your web scraping experience, such as:

  • Middleware support: Intercept and modify requests and responses.
  • Concurrency controls: Set the number of concurrent requests to avoid overwhelming servers.
  • Custom request headers: Modify headers to bypass restrictions.

You can explore more advanced features in the official guides.
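As a taste of the concurrency controls mentioned above, here is a minimal sketch, assuming Crawlee's ConcurrencySettings helper and the max_requests_per_crawl constructor option (verify both against the docs for your installed version):

from crawlee import ConcurrencySettings
from crawlee.playwright_crawler import PlaywrightCrawler

# A sketch: cap parallelism so the target server is not overwhelmed.
crawler = PlaywrightCrawler(
    concurrency_settings=ConcurrencySettings(
        min_concurrency=1,   # keep at least one request in flight
        max_concurrency=5,   # never fetch more than five pages at once
    ),
    max_requests_per_crawl=50,  # stop the whole crawl after 50 pages
)

Keeping max_concurrency low is a polite default; raise it only once you know the target site can handle the load.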

Conclusion

Crawlee offers a robust and flexible solution for web scraping in Python, making it easy to handle dynamic websites. In this quick-start guide, we covered how to scrape a webpage's title and meta description. For more advanced use cases, such as handling multiple pages, setting up concurrency, or customizing requests, be sure to dive deeper into Crawlee’s documentation.

Created on Sept. 19, 2024, 7:10 p.m.