-
Notifications
You must be signed in to change notification settings - Fork 44
feat: data source options #535
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Spark Test ReportCommit Information
Test Summary
Test DetailsError Counts
Passed Tests Diff(empty) Failed Tests
|
Gold Data ReportNotes
Commit Information
Summary
DetailsGold Data Metrics
|
Spark 3.5.5 Test ReportCommit Information
Test Summary
Test DetailsError Counts
Passed Tests Diff(empty) Failed Tests
|
Spark 4.0.0 Test ReportCommit Information
Test Summary
Test DetailsError Counts
(truncated) Passed Tests Diff(empty) Failed Tests
(truncated) |
@@ -163,6 +163,7 @@ mod retry_strategy { | |||
#[serde(deny_unknown_fields)] | |||
pub struct ExecutionConfig { | |||
pub batch_size: usize, | |||
pub collect_statistics: bool, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
DataFusion 48.0.0
introduces a significant performance regression. Instead of waiting for the next release which changes the default value of this config to true, we can just expose it and control it on our end.
@@ -206,12 +206,20 @@ | |||
description: The batch size for physical plan execution. | |||
experimental: true | |||
|
|||
- key: execution.collect_statistics |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Great work! 🚀
The DataSourceOptions
trait looks really nice. And thanks for having all the tests!
@@ -225,13 +233,62 @@ def test_sql_temp_view(spark, df, df_view): | |||
assert_frame_equal(spark.sql(f"SELECT * FROM {df_view}").toPandas(), df.toPandas()) # noqa: S608 | |||
|
|||
|
|||
def test_write(spark, df, tmpdir): | |||
# CHECK HERE: DO NOT MERGE IF THIS COMMENT IS HERE!!! |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
😆
Well done!!! I wonder if we can open an issue to track the options that are not yet supported, or perhaps this issue already exists and I overlooked it. |
Future work is needed to support Spark’s specific Parquet global options, as well as DataFusion’s
column_specific_options
andkey_value_metadata
.