Not getting partial reads of large parquet file in AWS S3 opened from pre-signed URL #1367

@dude0001

Description

What happens?

I am trying to read a large Parquet file from AWS S3 using pre-signed URLs. The expectation is that DuckDB-WASM will do partial reads of the large file. The observed behavior is that the entire file is read. I can reproduce this both by registering the file with registerFileURL and writing a SELECT against the file alias, and by putting the pre-signed URL inline in a SELECT.

To Reproduce

  1. When using await this.db.registerFileURL('data.parquet', 'https://the-bucket.s3.amazonaws.com/data.parquet?X-Amz-Algorithm=AWS4-HMAC-SHA256&X-Amz-Credential=ASI....', DuckDBDataProtocol.HTTP, false);

Request 1: HEAD w/ Range: bytes=0- header results in 403 w/ Access-Control-Expose-Headers: Access-Control-Allow-Origin, Accept-Ranges, Content-Range, ETag header, and the console logs "falling back to full HTTP read for: https://the-bucket.s3.amazonaws.com/data.parquet?X-Amz-Algorithm=AWS4-HMAC-SHA256&X-Amz-Credential=ASI..."
This is expected since the pre-signed URL only allows GET

Request 2: GET w/ no relevant header results in 200 reading the entire file

  2. When using select column_1 from 'https://the-bucket.s3.amazonaws.com/data.parquet?X-Amz-Algorithm=AWS4-HMAC-SHA256&X-Amz-Credential=ASI...' limit 1;

Request 1: HEAD w/ no relevant header results in 403 w/ Access-Control-Expose-Headers: Access-Control-Allow-Origin, Accept-Ranges, Content-Range, ETag header
Again, this is expected since the pre-signed URL only allows GET

Request 2: GET w/ Range: bytes=0-0 header results in 206 w/ Access-Control-Expose-Headers: Access-Control-Allow-Origin, Accept-Ranges, Content-Range, ETag header and a single byte P returned in the Response body.
Seems to prove Range header can be used to fetch a partial file.
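
To make the point concrete, here is a minimal sketch of a GET-based range probe, modeled on what Request 2 does. It is not DuckDB-WASM's actual implementation; `probeRangeSupport`, `parseTotalSize`, and the `FetchLike` type are hypothetical names introduced for illustration. It assumes the server exposes Content-Range via CORS, as the Access-Control-Expose-Headers value above suggests.

```typescript
// Minimal fetch-like types so the sketch does not depend on browser or Node globals.
interface ProbeResponse {
  status: number;
  headers: { get(name: string): string | null };
}
type FetchLike = (
  url: string,
  init: { method: string; headers: Record<string, string> },
) => Promise<ProbeResponse>;

interface RangeProbe {
  supportsRanges: boolean;
  totalSize: number | null;
}

// Parse the total size out of a Content-Range value such as "bytes 0-0/1048576".
// Returns null when the header is missing, malformed, or reports an unknown total ("*").
function parseTotalSize(contentRange: string | null): number | null {
  if (contentRange === null) return null;
  const match = /^bytes\s+\d+-\d+\/(\d+)$/.exec(contentRange.trim());
  return match ? Number(match[1]) : null;
}

// Probe with GET + Range instead of HEAD, since the pre-signed URL only permits GET.
// A 206 response means the server honored the range request.
async function probeRangeSupport(url: string, fetchFn: FetchLike): Promise<RangeProbe> {
  const res = await fetchFn(url, { method: "GET", headers: { Range: "bytes=0-0" } });
  if (res.status !== 206) {
    return { supportsRanges: false, totalSize: null };
  }
  return {
    supportsRanges: true,
    totalSize: parseTotalSize(res.headers.get("Content-Range")),
  };
}
```

In a browser, `fetchFn` could be a thin wrapper around the global `fetch`. The one-byte request costs almost nothing and yields both the range capability and the file size, which is the information the failing HEAD request was presumably meant to provide.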

But then... instead of starting to partially read the data needed to serve the query, what I observed when using registerFileURL seems to repeat.

Request 3: HEAD w/ Range: bytes=0- header results in 403 w/ Access-Control-Expose-Headers: Access-Control-Allow-Origin, Accept-Ranges, Content-Range, ETag header
This is expected since the pre-signed URL only allows GET. But why is it doing a HEAD here, when Request 2 showed the server can do partial reads using GET w/ Range header?

Request 4: GET w/ no relevant header results in 200 reading the entire file instead of partial reads.
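
For comparison, this is roughly the sequence of ranged GETs I would expect for the start of a partial read, sketched under the assumption of a standard unencrypted Parquet layout: the file ends with a 4-byte little-endian footer length followed by the magic bytes PAR1 (the single byte P returned by Request 2 is the first byte of the same magic at the start of the file). The helper names are hypothetical and this is not how DuckDB-WASM is actually implemented.

```typescript
const PARQUET_MAGIC = "PAR1";
const TAIL_BYTES = 8; // 4-byte little-endian footer length + 4-byte magic

// Build an HTTP Range header value for an inclusive byte span.
function rangeHeader(start: number, endInclusive: number): string {
  return `bytes=${start}-${endInclusive}`;
}

// First ranged GET: the last 8 bytes of the file.
function tailRange(fileSize: number): string {
  if (fileSize < TAIL_BYTES) throw new Error("file too small to be Parquet");
  return rangeHeader(fileSize - TAIL_BYTES, fileSize - 1);
}

// Decode the footer length from those 8 bytes, checking the trailing magic.
function footerLengthFromTail(tail: Uint8Array): number {
  if (tail.length !== TAIL_BYTES) throw new Error("expected exactly 8 tail bytes");
  const magic = Array.from(tail.subarray(4, 8), (b) => String.fromCharCode(b)).join("");
  if (magic !== PARQUET_MAGIC) throw new Error("missing PAR1 magic");
  return new DataView(tail.buffer, tail.byteOffset, tail.byteLength).getUint32(0, true);
}

// Second ranged GET: the footer metadata that sits just before the 8 tail bytes.
function footerRange(fileSize: number, footerLength: number): string {
  const end = fileSize - TAIL_BYTES - 1;
  const start = end - footerLength + 1;
  if (start < 4) throw new Error("footer length exceeds file size"); // 4 = leading magic
  return rangeHeader(start, end);
}
```

With the footer metadata in hand, a reader can locate the column chunks for column_1 and request only those byte ranges, so a query with limit 1 should transfer a small fraction of the file.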

Browser/Environment:

Chrome 115.0.5790.171 (Official Build) (64-bit)

Device:

Lenovo ThinkPad P16

DuckDB-Wasm Version:

1.27.0

DuckDB-Wasm Deployment:

shell.duckdb.org

Full Name:

[email protected]

Affiliation:

Episode-VII Solutions
