Add partial read feature for open_async #967

Open
wants to merge 1 commit into main

Conversation

Bing1996

Feature

This feature adds partial-read support to the open_async function in the S3FileSystem class, enabling efficient async reading of S3 objects without blocking the event loop. It is particularly valuable for applications that need to process large S3 files asynchronously or implement streaming data-processing pipelines. A corresponding test for async partial reads is also included.

Benefits

  • Flexibility: Support for both full and partial file access

Example

A partial read example for open_async, based on #871:

import asyncio

from s3fs import S3FileSystem

# test_bucket_name and endpoint_uri come from the surrounding test environment
fn = test_bucket_name + "/target"
data = b"hello world" * 1000
out = []

async def read_stream():
    fs = S3FileSystem(
        anon=False,
        client_kwargs={"endpoint_url": endpoint_uri},
        skip_instance_cache=True,
    )
    await fs._mkdir(test_bucket_name)
    await fs._pipe(fn, data)
    # open only the first half of the object: start at byte `loc`, read `size` bytes
    f = await fs.open_async(fn, mode="rb", loc=0, size=len(data) // 2)

    while True:
        got = await f.read(1000)
        assert f.size == len(data) // 2
        assert f.tell()
        if not got:
            break
        out.append(got)

asyncio.run(read_stream())
assert len(b"".join(out)) == len(data) // 2
assert b"".join(out) == data[: len(data) // 2]

@martindurant
Member

Sorry for being slow to respond.
I am trying to wrap my head around this - it feels like halfway between the normal random-access file and the streaming file. So you do get a streaming response (with whatever internal chunking, or none), but over only part of the data.

@Bing1996
Author

Bing1996 commented Jun 5, 2025

Yes, it achieves exactly that purpose. For example, with a large object that has a byte-location index file, we only need to asynchronously fetch the parts we need from that large file based on the index, and then collect them together at the end. This greatly accelerates the reading step, and is especially suited to targeted extraction (like reading a few variables from some earth-science datasets) from big data files where full retrieval is unnecessary.
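
For illustration, a rough sketch of that index-driven workflow (the index layout and the names byte_index / fetch_variables are invented for this example; only open_async with loc/size comes from this PR):

import asyncio

from s3fs import S3FileSystem

# assumed index: variable name -> (byte offset, length) inside one large object
byte_index = {"temperature": (0, 4096), "salinity": (1 << 20, 8192)}

async def fetch_variables(path, wanted):
    fs = S3FileSystem(anon=False)
    results = {}
    for name in wanted:
        loc, size = byte_index[name]
        # open only the slice of the object that holds this variable
        f = await fs.open_async(path, mode="rb", loc=loc, size=size)
        chunks = []
        while True:
            got = await f.read(4096)
            if not got:
                break
            chunks.append(got)
        results[name] = b"".join(chunks)
    return results

# asyncio.run(fetch_variables("my-bucket/big-dataset.bin", ["temperature"]))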

@martindurant
Member

martindurant commented Jun 5, 2025

I think my question should have been: can you not achieve everything you want using the async fs._cat_file(path, start, end)?

Even better, if you are reading many pieces, you can use _cat_ranges, which also does some smart collation of nearby ranges to reduce the number of requests.
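
For reference, a minimal sketch of those existing async calls (the bucket name, paths, and offsets are invented; the method signatures follow the fsspec AsyncFileSystem API):

import asyncio

from s3fs import S3FileSystem

async def read_ranges():
    fs = S3FileSystem(anon=False)

    # one contiguous byte range of a single object
    first_half = await fs._cat_file("my-bucket/target", start=0, end=5500)

    # many ranges at once; nearby ranges can be collated into fewer requests
    paths = ["my-bucket/target"] * 3
    starts = [0, 1000, 2000]
    ends = [500, 1500, 2500]
    pieces = await fs._cat_ranges(paths, starts, ends)
    return first_half, pieces

# asyncio.run(read_ranges())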
