Add partial read feature for open_async #967

Open
wants to merge 1 commit into main

Conversation

Bing1996

Feature

This feature adds partial-read support to the open_async function in the S3FileSystem class, enabling efficient async reading of S3 objects without blocking the event loop. It is particularly valuable for applications that need to process large S3 files asynchronously or implement streaming data-processing pipelines. A corresponding test for async partial reads is also included.

Benefits

  • Flexibility: Support for both full and partial file access

Example

A partial read example for open_async, based on #871:

import asyncio

from s3fs import S3FileSystem

# test_bucket_name and endpoint_uri come from the surrounding test environment
fn = test_bucket_name + "/target"
data = b"hello world" * 1000
out = []

async def read_stream():
    fs = S3FileSystem(
        anon=False,
        client_kwargs={"endpoint_url": endpoint_uri},
        skip_instance_cache=True,
    )
    await fs._mkdir(test_bucket_name)
    await fs._pipe(fn, data)
    # open only the first half of the object: start at byte `loc`, read `size` bytes
    f = await fs.open_async(fn, mode="rb", loc=0, size=len(data) // 2)

    while True:
        got = await f.read(1000)
        assert f.size == len(data) // 2
        assert f.tell()
        if not got:
            break
        out.append(got)

asyncio.run(read_stream())
assert len(b"".join(out)) == len(data) // 2
assert b"".join(out) == data[: len(data) // 2]

@martindurant
Member

Sorry for being slow to respond.
I am trying to wrap my head around this - it feels like halfway between the normal random-access file and the streaming file. So you do get a streaming response (with whatever internal chunking, or none), but over only part of the data.

@Bing1996
Author

Bing1996 commented Jun 5, 2025

Yes, it achieves exactly that purpose. For example, with a large object that has a byte-location index file, we only need to asynchronously fetch the parts we need from that large file based on the index, and then collect them together at the end. This greatly accelerates the reading step, and is especially suited to targeted extraction (like reading a few variables from some earth-science datasets) from big data files where full retrieval is unnecessary.
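
For illustration, a rough sketch of that index-driven workflow (the index layout and the names byte_index / fetch_variables are invented for this example; only open_async with loc/size comes from this PR):

import asyncio

from s3fs import S3FileSystem

# assumed index: variable name -> (byte offset, length) inside one large object
byte_index = {"temperature": (0, 4096), "salinity": (1 << 20, 8192)}

async def fetch_variables(path, wanted):
    fs = S3FileSystem(anon=False)
    results = {}
    for name in wanted:
        loc, size = byte_index[name]
        # open only the slice of the object that holds this variable
        f = await fs.open_async(path, mode="rb", loc=loc, size=size)
        chunks = []
        while True:
            got = await f.read(4096)
            if not got:
                break
            chunks.append(got)
        results[name] = b"".join(chunks)
    return results

# asyncio.run(fetch_variables("my-bucket/big-dataset.bin", ["temperature"]))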

@martindurant
Member

martindurant commented Jun 5, 2025

I think my question should have been: can you not achieve everything you want using the async fs._cat_file(path, start, end)?

Even better, if you are reading many pieces, you can use _cat_ranges, which also does some smart collation of nearby ranges to reduce the number of requests.
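
For reference, a minimal sketch of those existing async calls (the bucket name, paths, and offsets are invented; the method signatures follow the fsspec AsyncFileSystem API):

import asyncio

from s3fs import S3FileSystem

async def read_ranges():
    fs = S3FileSystem(anon=False)

    # one contiguous byte range of a single object
    first_half = await fs._cat_file("my-bucket/target", start=0, end=5500)

    # many ranges at once; nearby ranges can be collated into fewer requests
    paths = ["my-bucket/target"] * 3
    starts = [0, 1000, 2000]
    ends = [500, 1500, 2500]
    pieces = await fs._cat_ranges(paths, starts, ends)
    return first_half, pieces

# asyncio.run(read_ranges())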
