
How to economize on RAM when starting a crawl with a large list of URLs? #816





Closed
matecsaj opened this issue Dec 15, 2024 · 5 comments

@matecsaj
Contributor

matecsaj commented Dec 15, 2024

A very long list of starting URLs consumes a significant amount of RAM throughout the crawler's runtime. I tried converting the get_urls() function into a generator, but the crawler.run() method did not accept it. What is the recommended approach?

import asyncio
from crawlee.playwright_crawler import PlaywrightCrawler, PlaywrightCrawlingContext


def get_urls():
    urls = []
    ids = range(1, 100000)
    for id in ids:
        url = f"https://example.com/product/{pinside_id}"
        urls.append(url)
    return urls

async def crawl_example() -> None:
    # PlaywrightCrawler crawls the web using a headless browser controlled by the Playwright library.
    crawler = PlaywrightCrawler()

    # Define a request handler to process each crawled page and attach it to the crawler using a decorator.
    @crawler.router.default_handler
    async def request_handler(context: PlaywrightCrawlingContext) -> None:
        context.log.info(f'Processing {context.request.url} ...')
        # Extract relevant data from the page context.
        data = {
            'url': context.request.url,
            'title': await context.page.title(),
        }
        # Store the extracted data.
        await context.push_data(data)

        # Extract links from the current page and add them to the crawling queue.
        # await context.enqueue_links()

    # Add initial URLs to the queue and start the crawl.
    await crawler.run(get_urls())

asyncio.run(crawl_example())
@github-actions github-actions bot added the t-tooling Issues with this label are in the ownership of the tooling team. label Dec 15, 2024
@matecsaj
Contributor Author

In the example code, the variable pinside_id should just be id.

@monk3yd

monk3yd commented Dec 15, 2024

I have exactly the same question. Adding to it: does anyone know of a better way to requeue/retry failed requests when the dataset is large? I've been requeuing a 1.4M-request dataset for almost 3 days; retrying seems quite expensive, or the concurrency doesn't apply, because it is much slower than the original crawling or scraping. I've been thinking of letting the requests fail and then retrying them manually for more speed, but that isn't ideal in my opinion.

Also, I just want to thank the main devs for such a nice framework; it should become the Python standard once it matures 😁

Regards,

@janbuchar
Collaborator

Hello, and thank you for your interest in Crawlee! Unfortunately, the current local implementation of RequestQueue does not handle large lists of URLs very well - see #354 for a similar problem.

We are planning to fix this by only keeping a part of the pending requests in memory and keeping the rest on the filesystem - this is tracked in #433.

Also, after #777 is released (version 0.5), you will be able to implement a custom RequestSource that can feed the crawler with start_urls gradually via RequestSourceTandem.

@janbuchar
Collaborator

With #777 merged, it should be possible to make a RequestList that loads URLs from a generator (regular or async). Then you can pass that to your crawler like this:

crawler = PlaywrightCrawler(request_manager=await RequestList(your_generator()).to_tandem())

Feel free to test this with a beta release such as https://pypi.org/project/crawlee/0.5.0b25/
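For reference, here is a minimal end-to-end sketch of that approach, assembled from the snippets in this thread. The generate_urls generator and the example.com URL pattern are placeholders mirroring the original code, the RequestList import path is the one noted in the follow-up comment below, and exact import paths may differ between crawlee versions:

import asyncio

from crawlee.playwright_crawler import PlaywrightCrawler, PlaywrightCrawlingContext
from crawlee.request_loaders import RequestList


def generate_urls():
    # Yield URLs lazily instead of building the full list in memory up front.
    for product_id in range(1, 100000):
        yield f'https://example.com/product/{product_id}'


async def crawl_example() -> None:
    # Wrap the generator in a RequestList and combine it with the default
    # request queue so start URLs are fed to the crawler gradually.
    request_manager = await RequestList(generate_urls()).to_tandem()
    crawler = PlaywrightCrawler(request_manager=request_manager)

    @crawler.router.default_handler
    async def request_handler(context: PlaywrightCrawlingContext) -> None:
        context.log.info(f'Processing {context.request.url} ...')
        # Store the extracted data.
        await context.push_data({
            'url': context.request.url,
            'title': await context.page.title(),
        })

    # No URLs are passed to run(); they are streamed in from the RequestList.
    await crawler.run()


asyncio.run(crawl_example())

The key difference from the original snippet is that run() is called without arguments: the start URLs come from the RequestList instead of being materialized as a Python list.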

@matecsaj
Contributor Author

matecsaj commented Jan 6, 2025

Thank you. I tried this on the recent v0.5.0 release and it worked perfectly.

A note to anyone else who tries it: you will need this import:

from crawlee.request_loaders import RequestList
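Since the earlier comment notes that RequestList also accepts an async generator, an asynchronous variant of the same placeholder generator would look like this (again only a sketch; useful when the IDs come from an async source such as a database):

from collections.abc import AsyncIterator


async def generate_urls() -> AsyncIterator[str]:
    # Yield URLs lazily from an asynchronous source instead of building a list.
    for product_id in range(1, 100000):
        yield f'https://example.com/product/{product_id}'

# Used the same way as the synchronous version:
# request_manager = await RequestList(generate_urls()).to_tandem()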

@matecsaj matecsaj closed this as completed Jan 6, 2025
@vdusek vdusek added this to the 105th sprint - Tooling team milestone Jan 15, 2025