Crawling very slow and timeout error #354
Comments
It also uses a lot of RAM (after 4-6 hours of crawling):
[crawlee.autoscaling.autoscaled_pool] INFO current_concurrency = 183; desired_concurrency = 173; cpu = 0.0; mem = 0.0; event_loop = 0.252; client_info = 0.0
[crawlee.autoscaling.snapshotter] WARN Memory is critically overloaded. Using 7.04 GB of 7.81 GB (90%). Consider increasing available memory.
[crawlee.statistics.statistics] INFO crawlee.beautifulsoup_crawler.beautifulsoup_crawler request statistics {
"requests_finished": 30381,
"requests_failed": 7,
"retry_histogram": [
30374,
7,
7
],
"request_avg_failed_duration": 1.340926,
"request_avg_finished_duration": 120.59418,
"requests_finished_per_minute": 87,
"requests_failed_per_minute": 0,
"request_total_duration": 3663781.171706,
"requests_total": 30388,
"crawler_runtime": 20939.883378
}
Is it up to the user to limit the number of URLs added to the queue, or does the library manage that (a hard limit, etc.)?
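For reference, a minimal sketch of capping the crawl size from user code, assuming crawlee's `max_requests_per_crawl` option and the import path matching the `crawlee.beautifulsoup_crawler` module seen in the logs above (verify both against the installed version):

```python
import asyncio

from crawlee.beautifulsoup_crawler import BeautifulSoupCrawler, BeautifulSoupCrawlingContext


async def main() -> None:
    # Assumed option: stop after ~10k requests so the queue cannot grow without bound.
    crawler = BeautifulSoupCrawler(max_requests_per_crawl=10_000)

    @crawler.router.default_handler
    async def handler(context: BeautifulSoupCrawlingContext) -> None:
        context.log.info(f'Processing {context.request.url}')
        # Enqueue only the links that are actually needed, to keep the queue small.
        await context.enqueue_links()

    await crawler.run(['https://example.com'])


if __name__ == '__main__':
    asyncio.run(main())
```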
Interesting. What is your total available memory?
32 GB is available on my system.
I exported my storage to Google Drive so you can test it: https://drive.google.com/file/d/1P8AgbgbVLmujiceYRtMIKK91zn9GVjen/view?usp=sharing
Run it with: CRAWLEE_PURGE_ON_START=0 python test.py
When there are a lot of pending requests, Crawlee is very, very slow.
I'm seeing slow scraping too, about 200 requests per minute, even though I self-host the webpage being scraped. There are numerous times when the scraper literally does nothing and just waits for something.
@marisancans Would you mind sharing your scraper code as well? It might help us debug.
I also get this warning.
Unless you're limiting the memory usage knowingly, no, there isn't, at least without digging deep into Crawlee's internals. Of course, if you're working with a cloud platform such as Apify, you can configure the available memory there.
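A hedged sketch of raising the memory ceiling locally follows. Both the environment-variable name (CRAWLEE_MEMORY_MBYTES, guessed by analogy with the CRAWLEE_PURGE_ON_START variable used earlier in this thread) and the Configuration field name (memory_mbytes) are assumptions that should be verified against the installed crawlee version:

```python
import os

from crawlee.beautifulsoup_crawler import BeautifulSoupCrawler
from crawlee.configuration import Configuration

# Option 1 (assumed env var): set before the crawler/configuration is created.
os.environ['CRAWLEE_MEMORY_MBYTES'] = '16384'  # allow roughly 16 GB

# Option 2 (assumed field name): pass an explicit Configuration to the crawler.
crawler = BeautifulSoupCrawler(
    configuration=Configuration(memory_mbytes=16_384),
)
```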
Thank you for your response :D
What is your PC setup (RAM, CPU)? I'm trying to increase speed and can't get over 25 req/min. Maybe you can advise some parameters to increase speed. I use these concurrency settings: ConcurrencySettings(min_concurrency=10, max_concurrency=200, max_tasks_per_minute=200, desired_concurrency=110)
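For context, a sketch of how those settings are typically wired into a crawler; the import paths are assumed from the crawlee version whose log appears above and may differ in newer releases. Note that max_tasks_per_minute is, as its name suggests, an upper bound on throughput, so it must be set above whatever request rate you are aiming for.

```python
import asyncio

from crawlee import ConcurrencySettings
from crawlee.beautifulsoup_crawler import BeautifulSoupCrawler


async def main() -> None:
    crawler = BeautifulSoupCrawler(
        # The settings quoted in the comment above.
        concurrency_settings=ConcurrencySettings(
            min_concurrency=10,
            max_concurrency=200,
            max_tasks_per_minute=200,
            desired_concurrency=110,
        ),
    )
    await crawler.run(['https://example.com'])


if __name__ == '__main__':
    asyncio.run(main())
```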
Hello, I'm experiencing performance issues with my web crawler after approximately 1.5 to 2 hours of runtime. The crawling speed significantly decreases to about one site per minute or less, and I'm encountering numerous timeout errors.
Questions:
Here is the code I use:
The logs and errors: