
How can I disable cache completely? #369


Closed
1hachem opened this issue Jul 27, 2024 · 7 comments · Fixed by #691
Assignees
Labels
bug Something isn't working. t-tooling Issues with this label are in the ownership of the tooling team.

Comments

@1hachem

1hachem commented Jul 27, 2024

I am trying to write a simple function to crawl a website and I don't want crawlee to cache anything (each time I call this function it will do everything from scratch).

Here is my attempt so far. I tried persist_storage=False and purge_on_start=True in the configuration, and also removing the storage directory entirely, but I keep getting either a concatenated result of all previous requests, or an empty result when I delete the storage directory.

from crawlee import Glob
from crawlee.beautifulsoup_crawler import (
    BeautifulSoupCrawler,
    BeautifulSoupCrawlingContext,
)
from crawlee.configuration import Configuration
from crawlee.storages import Dataset


async def main(
    website: str,
    include_links: list[str],
    exclude_links: list[str],
    depth: int = 5,
) -> str:
    crawler = BeautifulSoupCrawler(
        # Limit the crawl to max requests. Remove or increase it for crawling all links.
        max_requests_per_crawl=depth,
    )
    dataset = await Dataset.open(
        configuration=Configuration(
            persist_storage=False,
            purge_on_start=True,
        ),
    )

    # Define the default request handler, which will be called for every request.
    @crawler.router.default_handler
    async def request_handler(context: BeautifulSoupCrawlingContext) -> None:  # type: ignore
        # Extract data from the page.
        text = context.soup.get_text()

        await dataset.push_data({"content": text})

        # Enqueue all links found on the page.
        await context.enqueue_links(
            include=[Glob(url) for url in include_links],
            exclude=[Glob(url) for url in exclude_links],
        )

    # Run the crawler with the initial list of URLs.
    await crawler.run([website])
    data = await dataset.get_data()

    content = "\n".join([item["content"] for item in data.items])  # type: ignore

    return content

Also, is there a way to simply get the result of the crawl as a string, without using Dataset?

Any help is appreciated 🤗 thank you in advance!
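One way to get the crawl result as a string without a Dataset is to have the request handler append extracted text to a local list and join it after the run. The sketch below shows the pattern with a self-contained stand-in for the router and handler registration (the `FakeRouter` class and `handle` function are illustrative stubs, not the real crawlee API):

```python
# Illustrative sketch: skip the Dataset by accumulating page text in a local
# list inside the handler, then joining it into one string after the run.


class FakeRouter:
    """Stand-in for crawler.router: records the registered default handler."""

    def __init__(self) -> None:
        self.handler = None

    def default_handler(self, func):
        self.handler = func
        return func


router = FakeRouter()
pages: list[str] = []  # local accumulator instead of a Dataset


@router.default_handler
def handle(page_text: str) -> None:
    # In a real crawler this would be context.soup.get_text().
    pages.append(page_text)


# Simulate the crawler invoking the handler once per crawled page.
for text in ["first page", "second page"]:
    router.handler(text)

content = "\n".join(pages)  # the whole crawl result as one string
```

With the real BeautifulSoupCrawler, the same idea applies: close over a list in the handler instead of calling dataset.push_data, and join after crawler.run returns.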

@janbuchar
Collaborator

Hello and thank you for your interest in Crawlee! This seems closely related to #351. Could you please re-check that you get an empty string if you run this after removing the storage directory? I can imagine getting an empty string on a second run without deleting the storage (because of both persist_storage=False and purge_on_start functioning incorrectly), but what you're describing sounds strange.

@fnesveda fnesveda added the t-tooling Issues with this label are in the ownership of the tooling team. label Jul 31, 2024
@tlinhart

After some debugging I found a workaround to avoid re-using the cache. Basically we have to ensure that each time the crawler runs it uses a different request queue, e.g. like this:

import uuid

...
config = Configuration.get_global_configuration()
config.default_request_queue_id = uuid.uuid4().hex
...

It would be great if we could actually disable caching at all but this works for now.
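The mechanism behind this workaround can be illustrated without crawlee: if storages are cached by id and every run asks for the same default id, run 2 gets run 1's queue back, whereas a fresh uuid per run always yields an empty queue. The `_queues` dict and `open_queue` function below are an assumed stand-in for the library's storage cache, not its actual implementation:

```python
import uuid

# Stand-in for a storage cache keyed by request-queue id (illustrative only).
_queues: dict[str, list[str]] = {}


def open_queue(queue_id: str) -> list[str]:
    # First call for an id creates the queue; later calls return the same one.
    return _queues.setdefault(queue_id, [])


# Same id on both runs: the second run sees the first run's requests.
shared = open_queue("default")
shared.append("https://example.com")

# Workaround: a unique id per run always starts from an empty queue.
fresh = open_queue(uuid.uuid4().hex)
```

This is why setting default_request_queue_id to uuid.uuid4().hex before each run sidesteps the cache: the lookup key never repeats.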

@vdusek
Collaborator

vdusek commented Sep 25, 2024

@tlinhart Thanks. I will also link #541 here as it provides additional context.

@vdusek vdusek self-assigned this Sep 25, 2024
@vdusek vdusek added the bug Something isn't working. label Sep 25, 2024
@vdusek vdusek added this to the 99th sprint - Tooling team milestone Sep 25, 2024
@tlinhart

Thanks. If it helps, I found out during debugging that the problem seems to be that the same instance of MemoryStorageClient is used across runs. There must be some reference left over after the first run.

@janbuchar
Collaborator

Thanks. If it helps, I found out during debugging that the problem seems to be that the same instance of MemoryStorageClient is used across runs. There must be some reference left over after the first run.

Yes, that is the case. We're carrying a lot of historical baggage here, and maybe this mechanism won't even be necessary in the end. Until then, I'm happy that you found a workaround.
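The lingering-reference problem described above can be sketched in a few lines. The `MemoryStorageClient` class and `get_client` function here are an assumed minimal illustration of a process-wide singleton, not crawlee's actual code:

```python
# Minimal illustration of a storage-client singleton surviving between runs:
# the second "run" in the same process gets the first run's instance back,
# complete with its old data.


class MemoryStorageClient:
    def __init__(self) -> None:
        self.datasets: dict[str, list[dict]] = {}


_default_client = None  # module-level reference that outlives a single run


def get_client() -> MemoryStorageClient:
    global _default_client
    if _default_client is None:
        _default_client = MemoryStorageClient()
    return _default_client


# Run 1 pushes some data.
get_client().datasets.setdefault("default", []).append({"content": "run 1"})

# Run 2, same process: the very same instance, old data still present.
second_run_client = get_client()
leftover = second_run_client.datasets["default"]
```

Purging on start or disabling persistence does not help here, because the stale state lives on the cached instance itself rather than on disk.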

@vdusek vdusek removed this from the 100th sprint - Tooling team milestone Oct 21, 2024
@vdusek vdusek removed their assignment Nov 1, 2024
@vdusek vdusek changed the title how can I disable cache completely ? How can I disable cache completely? Nov 14, 2024
@vdusek vdusek self-assigned this Nov 26, 2024
@vdusek vdusek closed this as completed in 1d31c6c Dec 13, 2024
@amindadgar

Has anyone found any solutions instead of just changing the request_queue_id?

@janbuchar
Collaborator

Has anyone found any solutions instead of just changing the request_queue_id?

I don't know of anything, but we're refactoring the storage code so that this can work as expected out of the box. #1107 is the first step towards this.

6 participants