Description
Is your feature request related to a problem or challenge?
Introduction
This ticket is my weekly-ish summary of interesting things happening in DataFusion. Note this is not a complete list (it is what I remember / can find). Please leave comments on this ticket about things that I may have missed or you think should get wider attention by the community.
Community Highlights
- DF 45 Blog post https://datafusion.apache.org/blog/2025/02/20/datafusion-45.0.0/
- @oznur-synnada updated the events page: Update Community Events in concepts-readings-events.md #14629
- We are participating in Google Summer of Code -- thanks again @oznur-synnada for driving this
Releases!
- DataFusion 46 Release candidate is available. Huge thank you to @xudong963 for running this release. This one contains a massive refactor of DataSource from @ozankabak and @mertak-synnada
- Also huge shout out to @blaginin for his help chasing down issues blocking the release: Set projection before configuring the source #14685
- Another huge shout out to @shehabgamin for his help testing and identifying issues pre-release
- Check out the DataFusion 46 Upgrade Guide to help
Performance
DataFusion's core value proposition is great performance without having to re-implement it yourself
- @Omega359 's improvement: Dataframe with_column and with_column_renamed performance improvements #14653
- @berkaysynnada improved the sort tracking code further: Window Functions Order Conservation -- Follow-up On Set Monotonicity #14813
- @zjregee made repeat 50% faster: optimize performance of the repeat function (up to 50% faster) #14697
- @simonvandel made `to_hex` 2x faster: Speedup `to_hex` (~2x faster) #14686
- @simonvandel also made `uuid` 40x faster: Speed up `uuid` UDF (40x faster) #14675 (no string copies for the win!)
- And @simonvandel also updated `date_trunc` to take ~20% less time: Speedup `date_trunc` (~20% time reduction) #14593
- @Kev1n8 made `substr` faster: Always use `StringViewArray` as output of `substr` #14498
Quality
Testing
Bug Fixes
DataFusion is in the "we are finding all the corner case bugs now" phase of its life and people are now bashing them down
- @joroKr21 's fix for grouping exprs: Preserve the name of grouping sets in SimplifyExpressions #14888
- @anlinc helped fix: fix(substrait): Do not add implicit groupBy expressions in `LogicalPlanBuilder` or when building logical plans from Substrait #14860
- test: change test_function macro to use `return_type_from_args` instead of `return_type` #14852 @rluvaton 🙏
- @xudong963 Fix: limit is missing after removing SPM #14569
Docs
Build time
Cleanups 🧹
- Moving physical-optimizer into its own crate (finally!): thanks to @logan-keede, @berkaysynnada and @buraksenn.
- Breaking the datafusion core crate apart (finally!): thanks to @logan-keede and @AdamGS
- @onlyjackfrost @niebayes @irenjj @goldmedal and others have been migrating all our functions to use `invoke_args` etc
- @jayzhan211 has been fixing up wildcard handling
Features
Features under way
- Statistics work: StatisticsV2: initial statistics framework redesign #14699
Better Out of Core Support
In general, DataFusion is getting better at handling datasets that are larger than can fit in memory.
- @davidhewitt's improvement here: Use arrow IPC Stream format for spill files #14868
- @2010YOUY01 's work to improve spilling for StringView: Fix: External sort failing on `StringView` due to shared buffers #14823
- @zhuqi-lucas improved datafusion-cli: feat: Improve datafusion-cli memory usage and considering reserve mem… #14766
- @Kontinuation improved docs (docs: Add additional info about memory reservation to the doc of MemoryPool #14789), implementation (bug: Fix memory reservation and allocation problems for SortExec #14644), and testing (feat: Add support for --mem-pool-type and --memory-limit options to multiple benchmarks #14642)
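For readers new to the idea, "spilling" during a sort can be sketched roughly as follows. This is a toy illustration in Python, not DataFusion's implementation: the function name `external_sort`, the `max_in_memory` parameter, and the one-value-per-line text spill format are all assumptions made for the example (DataFusion itself tracks memory in bytes via its memory pool and, per the PR above, writes Arrow IPC spill files).

```python
import heapq
import os
import tempfile


def external_sort(values, max_in_memory):
    """Yield ints from `values` in sorted order, buffering at most
    `max_in_memory` items in memory at a time (hypothetical toy API)."""
    spill_paths = []
    buffer = []

    def spill():
        # Sort the in-memory run and write it to a temporary "spill file".
        buffer.sort()
        fd, path = tempfile.mkstemp(suffix=".spill")
        with os.fdopen(fd, "w") as f:
            f.writelines(f"{v}\n" for v in buffer)
        spill_paths.append(path)
        buffer.clear()

    for v in values:
        buffer.append(v)
        if len(buffer) >= max_in_memory:
            spill()
    if buffer:
        spill()

    files = [open(p) for p in spill_paths]
    try:
        # Each spill file is already sorted, so it can be read back lazily.
        runs = [(int(line) for line in f) for f in files]
        # heapq.merge streams the runs, holding one pending item per run.
        yield from heapq.merge(*runs)
    finally:
        for f in files:
            f.close()
        for p in spill_paths:
            os.remove(p)
```

For example, `list(external_sort([5, 3, 9, 1, 7, 2, 8], max_in_memory=3))` spills three sorted runs to disk and then merges them, never holding more than three input values in the buffer.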
We can have nice things! (Explain plans)
- @irenjj took the first step towards feat: Add `tree` / pretty explain mode #14677. I'll give you a teaser below. Come help with the follow on work on [EPIC] Complete `SQL EXPLAIN` Tree Rendering #14914
> explain select * from t1 inner join t2 on t1.i=t2.i;
+---------------+------------------------------------------------------------+
| plan_type | plan |
+---------------+------------------------------------------------------------+
| logical_plan | Inner Join: t1.i = t2.i |
| | TableScan: t1 projection=[i] |
| | TableScan: t2 projection=[i] |
| physical_plan | ┌───────────────────────────┐ |
| | │ CoalesceBatchesExec │ |
| | └─────────────┬─────────────┘ |
| | ┌─────────────┴─────────────┐ |
| | │ HashJoinExec ├──────────────┐ |
| | └─────────────┬─────────────┘ │ |
| | ┌─────────────┴─────────────┐┌─────────────┴─────────────┐ |
| | │ DataSourceExec ││ DataSourceExec │ |
| | │ -------------------- ││ -------------------- │ |
| | │ partition_sizes: [0] ││ partitions: 1 │ |
| | │ partitions: 1 ││ partition_sizes: [0] │ |
| | └───────────────────────────┘└───────────────────────────┘ |
| | |
+---------------+------------------------------------------------------------+
2 row(s) fetched.
Better Error Messages
@eliaperantoni is working with various contributors to make the error messages better. This work is tracked in
- [EPIC] Attach `Diagnostic` to more errors #14429
- Add `DataFusionError::Collection` to return multiple `DataFusionError`s #14439
- @onlyjackfrost chore: Attach Diagnostic to "function x does not exist" error #14849
Misc
- @simonvandel added a `range` table function: Add `range` table function #14830
- @Lordworms made expression access nicer: Map access supports constant-resolvable expressions #14712
- @rkrishn7 did `UNION ALL BY NAME`: feat: Implement UNION ALL BY NAME #14538
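As a rough picture of what `UNION ALL BY NAME` means, here is a small Python sketch of the semantics as I understand them: columns are aligned by name rather than by position, and a column missing from one side is filled with NULL. This assumes DuckDB-style behavior; the helper name `union_all_by_name` and the list-of-tuples table representation are made up for the example, so check the PR for what DataFusion actually does.

```python
def union_all_by_name(left_cols, left_rows, right_cols, right_rows):
    """Concatenate two tables, matching columns by name (hypothetical helper).

    Output columns are the left columns followed by any right-only columns.
    Values for columns a side does not have become None (i.e. NULL).
    Duplicates are kept, as with UNION ALL.
    """
    out_cols = list(left_cols) + [c for c in right_cols if c not in left_cols]

    def align(cols, rows):
        # Map each column name to its position in this side's rows.
        index = {name: i for i, name in enumerate(cols)}
        return [
            tuple(row[index[c]] if c in index else None for c in out_cols)
            for row in rows
        ]

    return out_cols, align(left_cols, left_rows) + align(right_cols, right_rows)


cols, rows = union_all_by_name(["a", "b"], [(1, 2)], ["b", "c"], [(3, 4)])
# cols == ["a", "b", "c"]
# rows == [(1, 2, None), (None, 3, 4)]
```

Note that a plain positional `UNION ALL` would instead pair `a` with `b` and `b` with `c`, which is usually not what you want when schemas have drifted.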
Looking to get more involved? Please help review code! 🎣
DataFusion has a long history of community members contributing in all aspects of the project. Reviewing PRs is an especially great way to get introduced to the project, help the community and grow your own knowledge -- researching and understanding the code enough to review PRs also often inspires additional ideas for improvements.
We have docs about reviews. TLDR: check that the change has test coverage, that it is understandable and well documented, and whether the code can be improved. When you think the PR looks good to merge, try @-mentioning one of the committers.
Help wanted
- I would love to see the community offer additional help with performance testing and triaging bugs, helping to make DataFusion a more stable foundation for building systems
Please feel free to leave your own comments on this ticket if you are looking for help
Community
- Weekly Call
- Slack/Discord: info links
Upcoming meetups:
- Help schedule some!