
Crawler doesn't respect configuration argument #539


Closed
tlinhart opened this issue Sep 23, 2024 · 1 comment · Fixed by #691
Labels
bug (Something isn't working.), t-tooling (Issues with this label are in the ownership of the tooling team.)

Comments

@tlinhart commented Sep 23, 2024

Consider this sample program:

import asyncio

from crawlee.configuration import Configuration
from crawlee.parsel_crawler import ParselCrawler, ParselCrawlingContext


async def default_handler(context: ParselCrawlingContext) -> None:
    for category in context.selector.xpath(
        '//div[@class="side_categories"]//ul/li/ul/li/a'
    ):
        await context.push_data({"category": category.xpath("normalize-space()").get()})


async def main() -> None:
    config = Configuration(persist_storage=False, write_metadata=False)
    crawler = ParselCrawler(request_handler=default_handler, configuration=config)
    await crawler.run(["https://books.toscrape.com"])
    data = await crawler.get_data()
    print(data.items)


if __name__ == "__main__":
    asyncio.run(main())

The configuration argument given to ParselCrawler is not respected; during the run it creates the ./storage directory and persists all the (meta)data. I have to work around it by overriding the global configuration like this:

import asyncio

from crawlee.configuration import Configuration
from crawlee.parsel_crawler import ParselCrawler, ParselCrawlingContext


async def default_handler(context: ParselCrawlingContext) -> None:
    for category in context.selector.xpath(
        '//div[@class="side_categories"]//ul/li/ul/li/a'
    ):
        await context.push_data({"category": category.xpath("normalize-space()").get()})


async def main() -> None:
    config = Configuration.get_global_configuration()
    config.persist_storage = False
    config.write_metadata = False
    crawler = ParselCrawler(request_handler=default_handler)
    await crawler.run(["https://books.toscrape.com"])
    data = await crawler.get_data()
    print(data.items)


if __name__ == "__main__":
    asyncio.run(main())
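
For completeness, one way to verify the behavior is to check whether anything was persisted after a run. The following is a minimal sketch (not part of the original report), assuming the first snippet above was executed from a clean working directory:

from pathlib import Path

# After running the first snippet from a clean working directory, check whether
# anything was persisted. If persist_storage=False were respected, no ./storage
# directory should exist.
storage_dir = Path("storage")
if storage_dir.exists():
    print("persist_storage=False was ignored: ./storage was created")
    for path in sorted(storage_dir.rglob("*")):
        print(" ", path)
else:
    print("persist_storage=False was respected: nothing was persisted")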
@github-actions bot added the t-tooling label Sep 23, 2024
@janbuchar added the bug label Sep 23, 2024
@janbuchar (Collaborator) commented Sep 23, 2024

Hello, and thanks for the reproduction! It seems that the problem is here:

https://github.com/apify/crawlee-python/blob/master/src/crawlee/storages/_creation_management.py#L122-L132

It looks like service_container.get_storage_client does not consider the adjusted configuration.

Also, we have a test for this - https://github.com/apify/crawlee-python/blob/master/tests/unit/basic_crawler/test_basic_crawler.py#L630-L639 - which probably fails because we're looking inside a different storage directory than the global one.
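
To make the suspected failure mode concrete, here is a minimal, self-contained toy model; the classes below are illustrative only and are not crawlee's actual API. The point is that a service container which hands out a storage client built from the global configuration will silently ignore a configuration passed to an individual crawler.

from dataclasses import dataclass
from typing import Optional


@dataclass
class ToyConfiguration:
    persist_storage: bool = True


GLOBAL_CONFIG = ToyConfiguration()


@dataclass
class ToyStorageClient:
    config: ToyConfiguration


class ToyServiceContainer:
    def get_storage_client(self) -> ToyStorageClient:
        # Always constructed from the global configuration; any per-crawler
        # configuration is never consulted here.
        return ToyStorageClient(GLOBAL_CONFIG)


def open_storage(configuration: Optional[ToyConfiguration] = None) -> ToyStorageClient:
    # The crawler-level configuration is accepted but effectively ignored,
    # mirroring the behavior described in the comment above.
    return ToyServiceContainer().get_storage_client()


client = open_storage(ToyConfiguration(persist_storage=False))
print(client.config.persist_storage)  # prints True: the per-crawler override is lost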

@vdusek added this to the 99th sprint - Tooling team milestone Sep 23, 2024
vdusek added a commit that referenced this issue Oct 1, 2024
vdusek added a commit that referenced this issue Oct 1, 2024
vdusek added a commit that referenced this issue Oct 8, 2024
vdusek added a commit that referenced this issue Oct 24, 2024
vdusek added a commit that referenced this issue Nov 7, 2024
@vdusek removed this from the 102nd sprint - Tooling team milestone Nov 18, 2024
@vdusek added this to the 103rd sprint - Tooling team milestone Nov 26, 2024
@vdusek closed this as completed in 1d31c6c Dec 13, 2024