
feat: add respect_robots_txt_file option #1162

Merged: 21 commits from respect_robots_txt into apify:master on Apr 24, 2025

Conversation

Mantisus (Collaborator)

Description

  • This PR implements automatic skipping of requests that are disallowed by a site's robots.txt file, controlled by a new boolean crawler option called respect_robots_txt_file.
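
For reference, a minimal sketch of enabling the new option on one of the crawlers (modeled on the docs example added in this PR; the start URL and handler body are illustrative):

```python
import asyncio

from crawlee.crawlers import BeautifulSoupCrawler, BeautifulSoupCrawlingContext


async def main() -> None:
    # Enable the new option so that URLs disallowed by robots.txt are skipped.
    crawler = BeautifulSoupCrawler(respect_robots_txt_file=True)

    @crawler.router.default_handler
    async def handler(context: BeautifulSoupCrawlingContext) -> None:
        context.log.info(f'Processing {context.request.url}')
        # Extracted links disallowed by robots.txt are filtered out as well.
        await context.enqueue_links()

    await crawler.run(['https://crawlee.dev/'])


if __name__ == '__main__':
    asyncio.run(main())
```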

Issues

Testing

  • Add tests verifying that respect_robots_txt_file is honored by EnqueueLinksFunction in the crawlers
  • Add tests for RobotsTxtFile

@Mantisus Mantisus requested a review from Copilot April 17, 2025 23:41
@Copilot Copilot AI (Contributor) left a comment

Pull Request Overview

This PR introduces a new boolean flag, respect_robots_txt_file, to automatically skip crawling disallowed URLs based on a site's robots.txt rules. Key changes include the addition of tests for robots.txt handling across multiple crawler implementations, integration of robots.txt checking in the crawling pipeline, and the implementation of a RobotsTxtFile utility.

Reviewed Changes

Copilot reviewed 12 out of 12 changed files in this pull request and generated 3 comments.

| File | Description |
| --- | --- |
| tests/unit/server_endpoints.py | Added a static ROBOTS_TXT response to simulate a robots.txt file. |
| tests/unit/server.py | Introduced a new endpoint to serve robots.txt and updated routing logic. |
| tests/unit/crawlers/_playwright/test_playwright_crawler.py | Added tests verifying that the PlaywrightCrawler correctly respects robots.txt. |
| tests/unit/crawlers/_parsel/test_parsel_crawler.py | Introduced tests for the ParselCrawler to validate robots.txt respect. |
| tests/unit/crawlers/_beautifulsoup/test_beautifulsoup_crawler.py | Added tests to ensure BeautifulSoupCrawler adheres to robots.txt rules. |
| tests/unit/_utils/test_robots.py | New tests for generating, parsing, and validating robots.txt file behavior. |
| src/crawlee/crawlers/_playwright/_playwright_crawler.py | Integrated robots.txt enforcement in the link extraction logic. |
| src/crawlee/crawlers/_basic/_basic_crawler.py | Updated request adding and session handling to respect robots.txt directives. |
| src/crawlee/crawlers/_abstract_http/_abstract_http_crawler.py | Added robots.txt checking in link extraction for HTTP-based crawling. |
| src/crawlee/_utils/robots.py | Implemented the RobotsTxtFile class for parsing and handling robots.txt data. |
| pyproject.toml | Added the protego dependency to support robots.txt parsing. |

```diff
@@ -40,6 +40,7 @@ dependencies = [
     "eval-type-backport>=0.2.0",
     "httpx[brotli,http2,zstd]>=0.27.0",
     "more-itertools>=10.2.0",
+    "protego>=0.4.0",
```
Collaborator

It's fun to see another scrapy project here, but I guess that it guarantees some stability, so... all good.

Collaborator (Author)

Yes, I was planning to use RobotFileParser, but it doesn't support Google's specification. 😞
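
For context, Protego comes from the Scrapy ecosystem and implements Google's robots.txt specification, including longest-match precedence between Allow and Disallow rules. A small illustrative sketch of its API (the robots.txt content here is made up, not taken from this PR):

```python
from protego import Protego

robots_txt = """
User-agent: *
Disallow: /private/
Allow: /private/public-page.html

Sitemap: https://example.com/sitemap.xml
"""

rp = Protego.parse(robots_txt)

# The longer (more specific) Allow rule wins over the shorter Disallow rule.
print(rp.can_fetch('https://example.com/private/public-page.html', '*'))  # True
print(rp.can_fetch('https://example.com/private/secret.html', '*'))       # False
print(list(rp.sitemaps))  # ['https://example.com/sitemap.xml']
```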

```python
        self._robots = robots
        self._original_url = URL(url).origin()

    @staticmethod
```
Collaborator

I'd prefer using @classmethod and the Self return type annotation
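
For illustration, a minimal sketch of the suggested pattern; the class and method names mirror the PR, but the exact signatures are assumed rather than copied from it:

```python
from __future__ import annotations

from protego import Protego
from typing_extensions import Self  # typing.Self on Python 3.11+


class RobotsTxtFile:
    def __init__(self, url: str, robots: Protego) -> None:
        self._robots = robots
        self._original_url = url

    @classmethod
    def from_content(cls, url: str, content: str) -> Self:
        # A classmethod returning Self keeps this alternative constructor
        # usable by subclasses without re-annotating the return type.
        return cls(url, Protego.parse(content))
```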

@vdusek vdusek (Collaborator) left a comment

Nice! I have a few details... And also, could you please write a new guide/example regarding this feature?

@Mantisus Mantisus requested a review from Copilot April 24, 2025 00:52
@Copilot Copilot AI (Contributor) left a comment

Pull Request Overview

This pull request adds support for automatically skipping requests disallowed by robots.txt files. Key changes include:

  • Introducing a new boolean option (respect_robots_txt_file) across multiple crawler implementations.
  • Adding caching and locking in the BasicCrawler to optimize fetching of robots.txt files (see the sketch after this list).
  • Adding new tests and examples for verifying correct respect of robots.txt rules.
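
The caching and locking mentioned above might look roughly like the following hedged sketch; the names (_robots_cache, _robots_lock, get_robots_for_url) are assumptions for illustration, not the actual BasicCrawler attributes:

```python
import asyncio

from protego import Protego
from yarl import URL

_robots_cache: dict[str, Protego] = {}  # parsed robots.txt per origin
_robots_lock = asyncio.Lock()


async def _fetch_robots_txt(origin: str) -> str:
    """Placeholder for an HTTP GET of f'{origin}/robots.txt'."""
    return 'User-agent: *\nDisallow: /private/\n'


async def get_robots_for_url(url: str) -> Protego:
    origin = str(URL(url).origin())

    # Fast path: reuse the parsed robots.txt for this origin.
    if (cached := _robots_cache.get(origin)) is not None:
        return cached

    # Slow path: the lock ensures concurrent callers trigger only one fetch per origin.
    async with _robots_lock:
        if (cached := _robots_cache.get(origin)) is not None:
            return cached
        parsed = Protego.parse(await _fetch_robots_txt(origin))
        _robots_cache[origin] = parsed
        return parsed
```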

Reviewed Changes

Copilot reviewed 15 out of 16 changed files in this pull request and generated 1 comment.

| File | Description |
| --- | --- |
| tests/unit/server_endpoints.py | Added a ROBOTS_TXT constant as binary content for testing robots.txt responses |
| tests/unit/server.py | Added an endpoint and handler function for serving robots.txt |
| tests/unit/crawlers/_playwright/test_playwright_crawler.py | Added a test verifying that crawling respects robots.txt rules in the PlaywrightCrawler |
| tests/unit/crawlers/_parsel/test_parsel_crawler.py | Added a test verifying that crawling respects robots.txt rules in the ParselCrawler |
| tests/unit/crawlers/_beautifulsoup/test_beautifulsoup_crawler.py | Added a test verifying that crawling respects robots.txt rules in the BeautifulSoupCrawler |
| tests/unit/crawlers/_basic/test_basic_crawler.py | Added a test ensuring the robots.txt fetching lock is acquired only once |
| tests/unit/_utils/test_robots.py | Introduced tests for the RobotsTxtFile class functionality |
| src/crawlee/storage_clients/_memory/_request_queue_client.py | Removed an extraneous type-ignore comment from the sortedcollections import |
| src/crawlee/crawlers/_playwright/_playwright_crawler.py | Integrated the robots.txt check into the link extraction logic |
| src/crawlee/crawlers/_basic/_basic_crawler.py | Extended BasicCrawler with respect_robots_txt_file support and caching/locking mechanisms |
| src/crawlee/crawlers/_abstract_http/_abstract_http_crawler.py | Integrated the robots.txt check into the abstract HTTP crawler's link extraction method |
| src/crawlee/_utils/robots.py | Added a new RobotsTxtFile class that leverages Protego for parsing and evaluating rules |
| pyproject.toml | Updated dependencies to include protego and sortedcollections |
| docs/examples/code_examples/respect_robots_txt_file.py | Provided an example demonstrating usage of the respect_robots_txt_file option |
Files not reviewed (1)
  • docs/examples/respect_robots_txt_file.mdx: Language not supported

@Mantisus Mantisus force-pushed the respect_robots_txt branch from 48a93b1 to 41b803d on April 24, 2025 01:07
@vdusek vdusek (Collaborator) left a comment

Nice! LGTM

@janbuchar janbuchar merged commit c23f365 into apify:master Apr 24, 2025
23 checks passed