Description
What happens?
I am trying to read a large Parquet file from AWS S3 using pre-signed URLs. The expectation is that DuckDB-WASM will be able to do partial reads of the large file. The observed behavior is that the entire file is read. I can reproduce this in two ways: by registering the file with registerFileURL and then writing a SELECT against the file alias, and by putting the pre-signed URL inline in a SELECT.
To Reproduce
- When using registerFileURL:
await this.db.registerFileURL('data.parquet', 'https://the-bucket.s3.amazonaws.com/data.parquet?X-Amz-Algorithm=AWS4-HMAC-SHA256&X-Amz-Credential=ASI....', DuckDBDataProtocol.HTTP, false);
Request 1: HEAD w/ Range: bytes=0- header results in 403 w/ Access-Control-Expose-Headers: Access-Control-Allow-Origin, Accept-Ranges, Content-Range, ETag header, and the console logs: falling back to full HTTP read for: https://the-bucket.s3.amazonaws.com/data.parquet?X-Amz-Algorithm=AWS4-HMAC-SHA256&X-Amz-Credential=ASI...
This is expected since the pre-signed URL only allows GET
Request 2: GET w/ no relevant header results in 200 reading the entire file
- When using the pre-signed URL inline in a SELECT:
select column_1 from 'https://the-bucket.s3.amazonaws.com/data.parquet?X-Amz-Algorithm=AWS4-HMAC-SHA256&X-Amz-Credential=ASI...' limit 1;
Request 1: HEAD w/ no relevant header results in 403 w/ Access-Control-Expose-Headers: Access-Control-Allow-Origin, Accept-Ranges, Content-Range, ETag header
Again, this is expected since the pre-signed URL only allows GET
Request 2: GET w/ Range: bytes=0-0 header results in 206 w/ Access-Control-Expose-Headers: Access-Control-Allow-Origin, Accept-Ranges, Content-Range, ETag header and a single byte P returned in the Response body.
This seems to prove that the Range header can be used to fetch part of the file.
But then, instead of partially reading only the data needed to serve the query, the sequence I observed when using registerFileURL repeats.
Request 3: HEAD w/ Range: bytes=0- header results in 403 w/ Access-Control-Expose-Headers: Access-Control-Allow-Origin, Accept-Ranges, Content-Range, ETag header
This is expected since the pre-signed URL only allows GET. But why is it doing a HEAD here, when Request 2 showed the server can do partial reads using GET w/ Range header?
Request 4: GET w/ no relevant header results in 200 reading the entire file instead of partial reads.
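For reference, the check that Request 2 performs (a GET with Range: bytes=0-0 answered by a 206) can be reproduced outside DuckDB-WASM with a plain fetch. The following is only a minimal sketch of that idea, not DuckDB-WASM's actual code: `FetchLike`, `parseContentRangeTotal`, and `probeRangeSupport` are hypothetical names introduced here for illustration. It assumes the server exposes Content-Range to the page, which the Access-Control-Expose-Headers value above suggests.

```typescript
// Hypothetical helpers for illustration; not part of the DuckDB-WASM API.

// Minimal shape of the fetch function this sketch needs, so it can be
// called with the browser's fetch or with a stub.
type FetchLike = (
  url: string,
  init: { method: string; headers: Record<string, string> },
) => Promise<{ status: number; headers: { get(name: string): string | null } }>;

// Extract the total size from a Content-Range value such as "bytes 0-0/12345".
// Returns null when the header is missing, malformed, or the size is "*".
function parseContentRangeTotal(header: string | null): number | null {
  if (header === null) return null;
  const match = /^bytes\s+\d+-\d+\/(\d+)$/.exec(header.trim());
  return match ? Number(match[1]) : null;
}

// Probe range support with GET instead of HEAD, because the pre-signed URL
// only authorizes GET. Only one byte is requested to keep the transfer tiny.
async function probeRangeSupport(
  url: string,
  fetchFn: FetchLike,
): Promise<{ supportsRange: boolean; totalSize: number | null }> {
  const response = await fetchFn(url, {
    method: "GET",
    headers: { Range: "bytes=0-0" },
  });
  // 206 Partial Content means the server honored the Range header.
  const supportsRange = response.status === 206;
  const totalSize = supportsRange
    ? parseContentRangeTotal(response.headers.get("Content-Range"))
    : null;
  return { supportsRange, totalSize };
}
```

In a browser this could be called as `probeRangeSupport(presignedUrl, fetch)`, where `presignedUrl` is the same pre-signed URL used above. A probe like this learns both range support and file size from a single GET, which is the information I would expect the HEAD request to be after.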
Browser/Environment:
Chrome 115.0.5790.171 (Official Build) (64-bit)
Device:
Lenovo ThinkPad P16
DuckDB-Wasm Version:
1.27.0
DuckDB-Wasm Deployment:
shell.duckdb.org
Full Name:
Affiliation:
Episode-VII Solutions