arrow-select: add support for merging primitive dictionary values #7519
Conversation
Force-pushed from 4a2e763 to 62c4028
EDIT: After fixing the values slice length (see #7520 (comment)), the regression is more manageable. This is how it stacks up against #7517 with varying batch sizes and batch counts (I had to increase the sample size to 10k since there was a lot of noise).
So to summarize: this approach is slower (in the general case) for concatenating primitive dictionaries, in exchange for reduced memory. In our case, the memory reduction improves execution speed downstream, so I would say this regression in microbenchmarks is worth it, but I'd like to hear other opinions.
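For context, a minimal sketch of the operation under discussion: concatenating two primitive dictionary arrays with arrow_select's concat kernel. The array contents and key type here are illustrative, not from the benchmarks.

```rust
use std::sync::Arc;

use arrow_array::{types::Int32Type, ArrayRef, DictionaryArray, Int32Array, Int64Array};
use arrow_select::concat::concat;

fn main() {
    // Two Int64 dictionaries sharing a values array; with value merging,
    // concat deduplicates values instead of naively appending them.
    let values: ArrayRef = Arc::new(Int64Array::from(vec![10i64, 20, 30]));
    let a = DictionaryArray::<Int32Type>::try_new(Int32Array::from(vec![0, 1, 0]), values.clone())
        .unwrap();
    let b = DictionaryArray::<Int32Type>::try_new(Int32Array::from(vec![2, 2, 1]), values).unwrap();

    let merged = concat(&[&a, &b]).unwrap();
    println!("{merged:?}");
}
```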
See also #7468 (although I suspect this formulation avoids the issue there). Edit: as an aside, I am curious why you are using primitive dictionaries; the performance hit is huge and in most cases the memory savings are marginal. Curious what I am missing, I've always viewed them as supported but esoteric...
Thanks @tustvold, I wasn't aware of that PR. In our case we don't do any computations on these columns but care about the memory savings, since we run in memory-constrained environments. Our data for these columns specifically has very few unique values (0.03% is a recent number). Additionally, our schema is deeply nested and these columns are usually found in the leaves (within lists of structs), so the memory savings per batch are magnified. Granted, our idea was to have these be REE, but that wasn't working for some reason (I can't remember why, but I should experiment when I have some time). Let me know how you'd like to proceed. I guess one point in favor of this PR is that if someone is using primitive dictionary batches, it is likely that they value memory over perf.
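To make the savings concrete with illustrative numbers (not taken from the actual workload): with 1M Int64 rows at 0.03% unique values, a plain array costs ~8 MB, while a dictionary with Int32 keys costs ~4 MB of keys plus ~300 values × 8 B ≈ 2.4 KB, roughly halving memory; a narrower key type such as Int16 would shrink it further.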
Thanks @asubiotto
Since multiple users have this use case (cc @davidhewitt) I think this is a valuable feature. I'll run the concat benchmarks on this PR to gather some more data, but overall I think it is good.
🤖: Benchmark completed
@asubiotto is there any chance you can merge this PR up from main so I can run the newly added concat benchmarks too?
🤖: Benchmark completed
Force-pushed from 62c4028 to a1aad72
Addressed comment and rebased.
Force-pushed from a1aad72 to 2002512
Seems like
This approach seems better than my PR, thank you.
We ended up with primitive dictionaries because we have a hive partition with a Date32 column type, and (as far as I could tell) DataFusion wanted us to define this as a dictionary column.
I restarted the job
This makes sense -- DataFusion likely uses that encoding for partition columns
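For reference, a hedged sketch of the schema shape being described. The UInt16 key type mirrors what DataFusion uses for partition columns to the best of my knowledge, and the field name is illustrative.

```rust
use arrow_schema::{DataType, Field, Schema};

fn main() {
    // A hive-partition date column declared as Dictionary(UInt16, Date32).
    let partition_col = Field::new(
        "date",
        DataType::Dictionary(Box::new(DataType::UInt16), Box::new(DataType::Date32)),
        false,
    );
    let schema = Schema::new(vec![partition_col]);
    println!("{schema:?}");
}
```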
🤖: Benchmark completed
Looks like there is some wild variability in the benchmarks -- trying again
Force-pushed from 2002512 to 1664f95
🤖: Benchmark completed
// Verify pointer equality check succeeds, and therefore the
// dictionaries are not merged. A single values buffer should be reused
// in this case.
assert!(dict.values().to_data().ptr_eq(
👍
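Filling out the truncated snippet above as a self-contained sketch (the variable names and array contents are assumptions, not the test's actual code): when every input shares the same values array, the quoted test asserts that concat reuses a single values buffer, so pointer equality with the original values holds.

```rust
use std::sync::Arc;

use arrow_array::{cast::AsArray, types::Int32Type, ArrayRef, DictionaryArray, Int32Array};
use arrow_select::concat::concat;

fn main() {
    let values: ArrayRef = Arc::new(Int32Array::from(vec![1, 2, 3]));
    let a = DictionaryArray::<Int32Type>::try_new(Int32Array::from(vec![0, 1]), values.clone())
        .unwrap();
    let b = DictionaryArray::<Int32Type>::try_new(Int32Array::from(vec![2, 2]), values.clone())
        .unwrap();

    let combined = concat(&[&a, &b]).unwrap();
    let dict = combined.as_dictionary::<Int32Type>();

    // Verify pointer equality check succeeds, and therefore the
    // dictionaries are not merged. A single values buffer should be reused
    // in this case.
    assert!(dict.values().to_data().ptr_eq(&values.to_data()));
}
```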
🤖: Benchmark completed
Looks good to me -- I'll plan to merge this tomorrow unless anyone else would like time to comment
Great, thanks for the review!
🚀
Thanks again @asubiotto and @davidhewitt
Previously, should_merge_dictionaries would always return false in the ptr_eq closure creation match arm for types that were not {Large}{Utf8,Binary}. This could lead to excessive memory usage.
Which issue does this PR close?
What changes are included in this PR?
Update the match arm in should_merge_dictionary_values to not short-circuit on primitive types. Also use primitive byte representations to reuse the Interner pipeline used for the byte types.
Are there any user-facing changes?
No
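To illustrate the "primitive byte representations" idea, here is a minimal sketch of interning primitives through a byte-keyed table. The hash map and intern helper are hypothetical stand-ins for illustration, not arrow-select's actual Interner.

```rust
use std::collections::HashMap;

// Hypothetical stand-in for a byte-keyed interner: maps a value's byte
// representation to a dense dictionary key, deduplicating repeats.
fn intern(table: &mut HashMap<Vec<u8>, usize>, bytes: &[u8]) -> usize {
    let next = table.len();
    *table.entry(bytes.to_vec()).or_insert(next)
}

fn main() {
    let mut table = HashMap::new();
    // Viewing each primitive as its native-endian byte slice lets the same
    // byte-oriented path used for Utf8/Binary dictionary values handle it.
    for v in [10i64, 20, 10] {
        let key = intern(&mut table, &v.to_ne_bytes());
        println!("value {v} -> key {key}"); // 10 -> 0, 20 -> 1, 10 -> 0
    }
}
```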