Datafusion binary size has been getting bigger #13816
Comments
binaries can be moved out from the main release |
Some good experiments are https://github.com/johnthagen/min-sized-rust?tab=readme-ov-file#optimize-libstd-with-xargo with this profile
The cli size went from |
That is a very cool page 🤔 |
Expanding on @comphead's idea, adding opt-level = "s" reduced the size of the cli from 52MB on my machine (with his changes) to 37MB. |
Thanks @Omega359. opt-level is 3 by default for the release profile: https://doc.rust-lang.org/cargo/reference/profiles.html#release |
FWIW I don't think the size of the datafusion-cli binary is all that critical per se (maybe we can adjust / optimize the size of what is distributed on homebrew). What I was hoping to review / improve with this PR is the size of the code in general (and review if there were places that were unnecessarily causing code bloat that weren't also adding value). |
At a high level, I think this ticket has 2 parts:
I think the most valuable (and hardest) part is 1 (figuring out what to do). To do so I recommend doing an "Ablation Study".
This is a fancy way of saying "remove parts of the system and see how much impact it makes on binary size".
Suggested things to try
I suggest initially simply trying with different datafusion crate features and see how much extra code each contributes to the binary size. A follow on idea would be to comment out some of the features that require lots of generic code such as
Example of Ablation Study for the
| type | size in MB | size in KB |
|---|---|---|
| default | 58 | 58440 |
| without parquet feature | 53 | 54248 |
So I conclude that the parquet feature adds approximately 5mb to the binary size
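The subtraction behind that estimate can be sketched in a few lines of Rust. The names `feature_cost_kb` and `kb_to_mb` are made up for illustration and are not part of DataFusion or any tool; the only inputs are the two KB readings from the table.

```rust
/// Size attributed to a feature: the baseline build minus the build with
/// the feature removed. Both inputs are sizes in KB, as reported by `du`.
/// (Hypothetical helper for illustration, not part of DataFusion.)
pub fn feature_cost_kb(baseline_kb: u64, ablated_kb: u64) -> u64 {
    // saturating_sub avoids underflow if the ablated build is somehow larger.
    baseline_kb.saturating_sub(ablated_kb)
}

/// Convert KB to MB, assuming 1 MB = 1024 KB.
pub fn kb_to_mb(kb: u64) -> f64 {
    kb as f64 / 1024.0
}
```

With the numbers from the table, `feature_cost_kb(58440, 54248)` gives 4192 KB, a bit over 4 MB, which is in line with the roughly 5 MB difference between the rounded values (58 vs 53). Repeating the same subtraction for each feature that is removed gives one row of the ablation study per feature.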
To remove parquet support I hacked out the dependency on parquet. Since this was just to test the impact as long as it compiles that is good enough. No need to be pretty. Here is the
That is really awesome write up @alamb |
LOL like most great things I think I first heard this term used in a paper by Viktor Leis and Thomas Neumann |
btw after changes in 45.0.0 the image size is 49M 🎉 |
Nice! Do you know what changed? Indeed I checked on my mac after doing
andrewlamb@Andrews-MacBook-Pro-2:~/Software/datafusion$ du -s -h target/release/datafusion-cli
58M target/release/datafusion-cli |
Tbh I don't have an accurate answer for this. I found it when playing with different feature sets on the latest DF, but I remember some packages were moved out of the core or similar. Experimenting with different feature sets and --no-default-features keeps the binary size the same. One more optimization is to strip symbols and debug info, which take some space considering our code size; it saves 20% more, going down to 50M |
Stripped binary by inner segments
|
So in the data above (ARM macOS) the biggest parts are
@alamb WDYT should we dig deeper? |
I checked: the biggest methods are std panic methods, so removing unwind can save even more
|
I don't think so. It is fascinating how much binary size we can save without unwinding. |
Optimizing binary size #13816
Are we still considering this for GSoC proposals as it is already closed, with binary size currently being 62M (on linux)?
[nix-shell:~/dev/datafusion]$ du -s -h target/release/datafusion-cli
62M target/release/datafusion-cli
We can probably still do the ablative study, in which case maybe this should be reopened. |
Hey @logan-keede I would think this ticket is a good fit for GSoC #14510 |
Thanks for the information, I would take a look at that. |
I personally think reducing the datafusion crate compile time somehow would be far more impactful than the binary size
In other words it would make you a hero! |
Yeah, compile time would be closer to the original |
Thanks again, I believe this proposal might suit me even better considering my efforts over the last month or two. |
Thank you! An ablative study would also be sweet! |
Hi @comphead, I would really appreciate if you can give me some feedback for my GSoC proposal. PS: I dropped a draft of the same on your discord. However, it seems you are not active there. |
Hey @logan-keede please ping me in ASF slack, I'm not using discord now |
@comphead I pinged you on slack. |
Is your feature request related to a problem or challenge?
The size of datafusion's binary has grown significantly in the last few releases
This likely leads to higher compile times as well as larger overall binary size
datafusion-cli binary
main at 57d1309
43.0.0
42.0.0
41.0.0
40.0.0
39.0.0
The sizes are measured like this:
git checkout version
cd datafusion-cli
cargo build --release
du -h target/release/datafusion-cli
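For scripting the comparison across versions, the size check can also be done from Rust with the standard library. This is a minimal sketch that assumes the binary has already been built; `binary_size_bytes` and `to_mib` are hypothetical helper names. It reads the file length, whereas `du` reports disk usage, so the two numbers can differ slightly.

```rust
use std::fs;
use std::io;
use std::path::Path;

/// Return the length in bytes of the file at `path`.
/// (Hypothetical helper for illustration.)
pub fn binary_size_bytes(path: &Path) -> io::Result<u64> {
    Ok(fs::metadata(path)?.len())
}

/// Convert bytes to MiB (1 MiB = 1024 * 1024 bytes).
pub fn to_mib(bytes: u64) -> f64 {
    bytes as f64 / (1024.0 * 1024.0)
}
```

Calling `binary_size_bytes(Path::new("target/release/datafusion-cli"))` after each `cargo build --release` and passing the result to `to_mib` gives a figure comparable to the `du -h` output.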
Also, people such as @g3blv have noticed that the WASM build has increased 50%:
#9834 (comment)
Describe the solution you'd like
I would like to reduce the binary size of DataFusion if possible
At least I would like to understand where the code size comes from and offer hints about how to reduce the size if needed
Describe alternatives you've considered
A common source of code size is templated functions (as that generates multiple copies of the same function(s)).
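In Rust terms this is monomorphization: each concrete type used with a generic function gets its own compiled copy of the body. One common mitigation is to keep the generic wrapper thin and move the real work into a non-generic inner function. The function below is a made-up illustration of that pattern, not code from DataFusion.

```rust
use std::path::Path;

/// Generic entry point: one copy of this wrapper is compiled per concrete `P`,
/// but the wrapper only converts the argument and forwards it.
pub fn file_stem_len<P: AsRef<Path>>(path: P) -> usize {
    // Non-generic inner function: its body does not depend on `P`,
    // so it is not duplicated for each type the wrapper is called with.
    fn inner(path: &Path) -> usize {
        path.file_stem().map(|stem| stem.len()).unwrap_or(0)
    }
    inner(path.as_ref())
}
```

Calling `file_stem_len` with a `&str`, a `String`, and a `PathBuf` instantiates three thin wrappers around a single `inner`. The optimizer may still inline `inner`, so whether this helps in a particular place is something a tool such as `cargo bloat` would have to confirm.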
Here is some fascinating information from running
cargo bloat -p datafusion
Additional context
No response