
How to economize on RAM when starting a crawl with a large list of URLs? #816





Closed
matecsaj opened this issue Dec 15, 2024 · 5 comments

@matecsaj
Contributor

matecsaj commented Dec 15, 2024

A very long list of starting URLs consumes a significant amount of RAM throughout the crawler's runtime. I tried converting the get_urls() function into a generator, but the crawler.run() method did not accept it. What is the recommended approach?

import asyncio
from crawlee.playwright_crawler import PlaywrightCrawler, PlaywrightCrawlingContext


def get_urls():
    urls = []
    ids = range(1, 100000)
    for id in ids:
        url = f"https://example.com/product/{pinside_id}"
        urls.append(url)
    return urls

async def crawl_example() -> None:
    # PlaywrightCrawler crawls the web using a headless browser controlled by the Playwright library.
    crawler = PlaywrightCrawler()

    # Define a request handler to process each crawled page and attach it to the crawler using a decorator.
    @crawler.router.default_handler
    async def request_handler(context: PlaywrightCrawlingContext) -> None:
        context.log.info(f'Processing {context.request.url} ...')
        # Extract relevant data from the page context.
        data = {
            'url': context.request.url,
            'title': await context.page.title(),
        }
        # Store the extracted data.
        await context.push_data(data)

        # Extract links from the current page and add them to the crawling queue.
        # await context.enqueue_links()

    # Add initial URLs to the queue and start the crawl.
    await crawler.run(get_urls())

asyncio.run(crawl_example())
@github-actions github-actions bot added the t-tooling Issues with this label are in the ownership of the tooling team. label Dec 15, 2024
@matecsaj
Contributor Author

In the example code, the variable pinside_id should just be id.

@monk3yd

monk3yd commented Dec 15, 2024

I have exactly the same question. Adding to it: does anyone know of a better way to requeue/retry failed requests when the dataset is large? I've been requeuing a 1.4M-request dataset for almost 3 days; retrying seems quite expensive, or the concurrency doesn't apply, because it is much slower than the original crawling or scraping. I've been thinking of letting the requests fail and then retrying them manually for more speed, but that isn't ideal in my opinion.

Also, I just want to thank the main devs for such a nice framework; it should become the Python standard once it matures 😁

Regards,

@janbuchar
Collaborator

Hello, and thank you for your interest in Crawlee! Unfortunately, the current local implementation of RequestQueue does not handle large lists of URLs very well - see #354 for a similar problem.

We are planning to fix this by only keeping a part of the pending requests in memory and keeping the rest on the filesystem - this is tracked in #433.

Also, after #777 is released (version 0.5), you will be able to implement a custom RequestSource that can feed the crawler with start_urls gradually via RequestSourceTandem.

@janbuchar
Collaborator

With #777 merged, it should be possible to make a RequestList that loads URLs from a generator (regular or async). Then you can pass that to your crawler like this:

crawler = PlaywrightCrawler(request_manager=await RequestList(your_generator()).to_tandem())

Feel free to test this with a beta release such as https://pypi.org/project/crawlee/0.5.0b25/
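For reference, here is a minimal end-to-end sketch of that approach, assembled from the snippets in this thread. The generate_urls generator and the example.com URL pattern are placeholders mirroring the original code, the RequestList import path is the one noted in the follow-up comment below, and exact import paths may differ between crawlee versions:

import asyncio

from crawlee.playwright_crawler import PlaywrightCrawler, PlaywrightCrawlingContext
from crawlee.request_loaders import RequestList


def generate_urls():
    # Yield URLs lazily instead of building the full list in memory up front.
    for product_id in range(1, 100000):
        yield f'https://example.com/product/{product_id}'


async def crawl_example() -> None:
    # Wrap the generator in a RequestList and combine it with the default
    # request queue so start URLs are fed to the crawler gradually.
    request_manager = await RequestList(generate_urls()).to_tandem()
    crawler = PlaywrightCrawler(request_manager=request_manager)

    @crawler.router.default_handler
    async def request_handler(context: PlaywrightCrawlingContext) -> None:
        context.log.info(f'Processing {context.request.url} ...')
        # Store the extracted data.
        await context.push_data({
            'url': context.request.url,
            'title': await context.page.title(),
        })

    # No URLs are passed to run(); they are streamed in from the RequestList.
    await crawler.run()


asyncio.run(crawl_example())

The key difference from the original snippet is that run() is called without arguments: the start URLs come from the RequestList instead of being materialized as a Python list.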

@matecsaj
Contributor Author

matecsaj commented Jan 6, 2025

Thank you. I tried this on the recent v0.5.0 release and it worked perfectly.

A note to anyone else who tries it: you will need this import:

from crawlee.request_loaders import RequestList
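Since the earlier comment notes that RequestList also accepts an async generator, an asynchronous variant of the same placeholder generator would look like this (again only a sketch; useful when the IDs come from an async source such as a database):

from collections.abc import AsyncIterator


async def generate_urls() -> AsyncIterator[str]:
    # Yield URLs lazily from an asynchronous source instead of building a list.
    for product_id in range(1, 100000):
        yield f'https://example.com/product/{product_id}'

# Used the same way as the synchronous version:
# request_manager = await RequestList(generate_urls()).to_tandem()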

@matecsaj matecsaj closed this as completed Jan 6, 2025
@vdusek vdusek added this to the 105th sprint - Tooling team milestone Jan 15, 2025