
@zilto
Collaborator

@zilto zilto commented Sep 29, 2025

Attempt to fix #3145

Changes

  • the main change is a single line: yield batch.to_pylist() -> yield batch if use_pyarrow else batch.to_pylist()
  • add basic tests for all readers (previously untested)
  • fixed typing for arg items from the incorrect Iterator to Iterable

Future work

  • unify reader signatures
  • implement CSV reader with pyarrow instead of pandas given the former is a required dependency of dlt
  • set default read_parquet(use_pyarrow=True) instead of False because it should be faster and pyarrow is a required dependency.
  • add a read_parquet_duckdb() reader OR automatically use duckdb inside read_parquet() if duckdb is available

@zilto zilto self-assigned this Sep 29, 2025

@zilto zilto added the enhancement New feature or request label Sep 29, 2025
rudolfix
rudolfix previously approved these changes Sep 30, 2025
Collaborator

@rudolfix rudolfix left a comment

LGTM! common tests are not passing on pendulum 2.0 - there's no Timezone object there.

@zilto zilto added the release-highlight Changes to highlight in release notes label Sep 30, 2025
@zilto
Collaborator Author

zilto commented Sep 30, 2025

@rudolfix any objection to setting use_pyarrow = True as default? User reported a 20x performance improvement from 10min to <1min loading big parquet files

@zilto zilto force-pushed the feat/parquet-reader-optional-conversion branch from 93280d9 to 22b8645 Compare September 30, 2025 16:01
@zilto
Collaborator Author

zilto commented Sep 30, 2025

> @rudolfix any objection to setting use_pyarrow = True as default? User reported a 20x performance improvement from 10min to <1min loading big parquet files

After discussion, we decided to keep use_pyarrow=False as default for full backwards compatibility.

There could be edge cases that differ between Python and Pyarrow normalization and type inference. This could evolve the schema or break pipelines.

TODO

Once we evaluate some of these edge cases, we should add a deprecation warning akin to:

In dlt==X.Y.Z, we introduced use_pyarrow: bool with default False. Set it to True for better performance. In dlt==X.Y.Z, the default will be use_pyarrow=True. This should cause no errors, but you can try use_pyarrow=True right now to validate it.
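One possible shape for that warning, as a minimal sketch: use None as a sentinel default so the warning fires only when the caller has not chosen a value. The helper name resolve_use_pyarrow and the use of the standard DeprecationWarning are assumptions; dlt may prefer its own warning class, and the X.Y.Z versions are left as placeholders.

```python
import warnings
from typing import Optional


def resolve_use_pyarrow(use_pyarrow: Optional[bool] = None) -> bool:
    """Hypothetical helper: warn only when the caller relies on the default."""
    if use_pyarrow is None:
        warnings.warn(
            "In dlt==X.Y.Z, we introduced use_pyarrow: bool with default False. "
            "Set it to True for better performance. In dlt==X.Y.Z, the default "
            "will be use_pyarrow=True.",
            DeprecationWarning,
            stacklevel=2,  # point the warning at the caller, not this helper
        )
        return False  # current default, kept for backwards compatibility
    return use_pyarrow
```

Callers who pass use_pyarrow explicitly, whether True or False, never see the warning, so pipelines that have already opted in or out stay quiet.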

@zilto zilto merged commit 6fd20d1 into devel Sep 30, 2025
68 checks passed
@zilto zilto deleted the feat/parquet-reader-optional-conversion branch September 30, 2025 20:38

Labels

enhancement New feature or request release-highlight Changes to highlight in release notes

Development

Successfully merging this pull request may close these issues.

Slow parquet reading

3 participants