Datafusion binary size has been getting bigger #13816

Closed
Tracked by #13813
alamb opened this issue Dec 17, 2024 · 26 comments · Fixed by #14843
Labels: enhancement (New feature or request)

@alamb
Contributor

alamb commented Dec 17, 2024

Is your feature request related to a problem or challenge?

The size of DataFusion's binary has grown significantly in the last few releases.

This likely leads to higher compile times as well as a larger overall binary size.

version           size of datafusion-cli binary
main at 57d1309   92M
43.0.0            87M
42.0.0            83M
41.0.0            72M
40.0.0            69M
39.0.0            68M
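
To make the trend in the table easier to read, here is a minimal sketch that computes the release-over-release growth from those numbers. The function names are made up for illustration, and the sizes are the rounded `du -h` values from the table, so the percentages are only approximate. On these figures the largest single jump is between 41.0.0 and 42.0.0 (+11M), and the total growth from 39.0.0 to main is roughly 35%.

```python
# Sizes in MB, copied from the table above (oldest release first).
sizes_mb = {
    "39.0.0": 68,
    "40.0.0": 69,
    "41.0.0": 72,
    "42.0.0": 83,
    "43.0.0": 87,
    "main at 57d1309": 92,
}


def growth_report(sizes):
    """Return (version, size, delta, percent) for each release vs. the previous one."""
    items = list(sizes.items())
    rows = []
    for (_, prev), (version, size) in zip(items, items[1:]):
        delta = size - prev
        rows.append((version, size, delta, round(100 * delta / prev, 1)))
    return rows


def total_growth_percent(sizes):
    """Percent growth from the first entry to the last entry."""
    values = list(sizes.values())
    return round(100 * (values[-1] - values[0]) / values[0], 1)


if __name__ == "__main__":
    for version, size, delta, pct in growth_report(sizes_mb):
        print(f"{version:<16} {size:>3}M  +{delta}M ({pct}%)")
    print(f"total growth: {total_growth_percent(sizes_mb)}%")
```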

The sizes are measured like this:

git checkout <version>   # e.g. 43.0.0
cd datafusion-cli
cargo build --release
du -h target/release/datafusion-cli
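
Since `du -h` rounds to whole megabytes, small differences between builds can get lost. As a hedged alternative, the sketch below reads the exact byte count of one or more binaries. The helper names are hypothetical, and `os.path.getsize` reports the apparent file size while `du` reports disk usage, so the two may differ slightly.

```python
import os
import sys


def bytes_to_mib(num_bytes):
    """Convert a byte count to MiB (1 MiB = 1024 * 1024 bytes)."""
    return num_bytes / (1024 * 1024)


def binary_size_mib(path):
    """Apparent size of the file at `path`, in MiB."""
    return bytes_to_mib(os.path.getsize(path))


if __name__ == "__main__":
    # Usage (hypothetical script name):
    #   python binsize.py target/release/datafusion-cli
    for path in sys.argv[1:]:
        print(f"{binary_size_mib(path):8.2f} MiB  {path}")
```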

Also, people such as @g3blv have noticed that the WASM build has grown by 50%:
#9834 (comment)

Describe the solution you'd like

I would like to reduce the binary size of DataFusion if possible.

At a minimum, I would like to understand where the code size comes from and offer hints about how to reduce it if needed.

Describe alternatives you've considered

A common source of code size is generic (templated) functions, since each distinct set of type arguments generates another copy of the same function(s).

Here is some fascinating information from running cargo bloat -p datafusion:

 File  .text    Size                          Crate Name
 0.1%   0.3% 79.7KiB                         blake2 blake2::Blake2bVarCore::compress
 0.1%   0.2% 70.7KiB                         blake2 blake2::Blake2sVarCore::compress
 0.1%   0.2% 67.1KiB                      sqlparser <sqlparser::ast::Statement as core::fmt::Display>::fmt
 0.1%   0.2% 61.4KiB                         blake3 _blake3_hash4_neon
 0.1%   0.2% 56.4KiB                      chrono_tz <chrono_tz::timezones::Tz as chrono_tz::timezone_impl::TimeSpans>::timespans
 0.1%   0.2% 44.7KiB                     arrow_cast <i64 as lexical_write_integer::api::ToLexical>::to_lexical
 0.1%   0.1% 42.8KiB                     arrow_cast arrow_cast::cast::cast_with_options
 0.0%   0.1% 35.9KiB                           rand <rand_chacha::chacha::ChaCha12Core as rand_core::block::BlockRngCore>::generate
 0.0%   0.1% 34.9KiB                     arrow_cast lexical_parse_float::slow::parse_mantissa
 0.0%   0.1% 33.1KiB                     arrow_cast lexical_parse_float::parse::parse_complete
 0.0%   0.1% 33.1KiB                     arrow_cast lexical_parse_float::parse::parse_complete
 0.0%   0.1% 29.0KiB                 regex_automata regex_automata::hybrid::search::find_fwd
 0.0%   0.1% 27.6KiB                         blake3 blake3::portable::compress_in_place
 0.0%   0.1% 27.1KiB                   aho_corasick aho_corasick::automaton::try_find_fwd
 0.0%   0.1% 25.2KiB                      sqlparser <sqlparser::ast::Expr as core::fmt::Display>::fmt
 0.0%   0.1% 23.8KiB              datafusion_common datafusion_common::scalar::ScalarValue::iter_to_array
 0.0%   0.1% 23.7KiB              datafusion_common datafusion_common::scalar::ScalarValue::iter_to_array
 0.0%   0.1% 23.7KiB       datafusion_physical_expr datafusion_common::scalar::ScalarValue::iter_to_array
 0.0%   0.1% 23.7KiB datafusion_functions_aggregate datafusion_common::scalar::ScalarValue::iter_to_array
 0.0%   0.1% 22.0KiB                     arrow_cast <u64 as lexical_write_integer::api::ToLexical>::to_lexical
36.7%  97.4% 27.7MiB                                And 139272 smaller methods. Use -n N to show more.
37.7% 100.0% 28.4MiB                                .text section size, the file size is 75.4MiB
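
Several function names appear more than once in the output above (for example `ScalarValue::iter_to_array` is listed four times), which is what multiple instantiations of a generic function would look like. The sketch below groups rows by function name to surface such duplicates. It is a rough illustration only: the parser assumes the five-column layout shown above with sizes in KiB, the helper names are made up, and reading the duplicates as separate monomorphized copies is an assumption rather than something `cargo bloat` states.

```python
from collections import defaultdict

# A few rows copied from the `cargo bloat` output above.
BLOAT_ROWS = """\
 0.1%   0.3% 79.7KiB                         blake2 blake2::Blake2bVarCore::compress
 0.1%   0.2% 67.1KiB                      sqlparser <sqlparser::ast::Statement as core::fmt::Display>::fmt
 0.0%   0.1% 33.1KiB                     arrow_cast lexical_parse_float::parse::parse_complete
 0.0%   0.1% 33.1KiB                     arrow_cast lexical_parse_float::parse::parse_complete
 0.0%   0.1% 23.8KiB              datafusion_common datafusion_common::scalar::ScalarValue::iter_to_array
 0.0%   0.1% 23.7KiB              datafusion_common datafusion_common::scalar::ScalarValue::iter_to_array
 0.0%   0.1% 23.7KiB       datafusion_physical_expr datafusion_common::scalar::ScalarValue::iter_to_array
 0.0%   0.1% 23.7KiB datafusion_functions_aggregate datafusion_common::scalar::ScalarValue::iter_to_array
"""


def parse_row(line):
    """Split one row into (crate, function name, size in KiB)."""
    # Columns: File%, .text%, Size, Crate, Name. The name may contain spaces,
    # so split at most four times and keep the remainder as the name.
    _file_pct, _text_pct, size, crate, name = line.split(None, 4)
    if not size.endswith("KiB"):
        raise ValueError(f"this sketch only handles KiB sizes, got {size!r}")
    return crate, name.strip(), float(size[: -len("KiB")])


def duplicated_functions(text):
    """Map each function name listed more than once to its (crate, KiB) rows."""
    copies = defaultdict(list)
    for line in text.splitlines():
        if line.strip():
            crate, name, kib = parse_row(line)
            copies[name].append((crate, kib))
    return {name: rows for name, rows in copies.items() if len(rows) > 1}


if __name__ == "__main__":
    for name, rows in duplicated_functions(BLOAT_ROWS).items():
        total = sum(kib for _, kib in rows)
        print(f"{len(rows)} copies, {total:.1f} KiB total: {name}")
```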

Additional context

No response

@comphead
Contributor

The print_functions_docs and print_functions_config binaries can be moved out of the main release.

@comphead
Contributor

Some good experiments are described at https://github.com/johnthagen/min-sized-rust?tab=readme-ov-file#optimize-libstd-with-xargo

With this profile:

[profile.release]
codegen-units = 1
strip = true
panic = "abort"

The cli size went from 94.3MiB down to 48.6MiB 🤔

@alamb
Contributor Author

alamb commented Dec 22, 2024

That is a very cool page 🤔

@Omega359
Contributor

[profile.release]
codegen-units = 1
strip = true
panic = "abort"
opt-level = "s"

Expanding on @comphead's idea, adding opt-level = "s" reduced the size of the cli from 52MB on my machine (with his changes) to 37MB.

@comphead
Contributor

Thanks @Omega359. opt-level is 3 by default for the release profile (https://doc.rust-lang.org/cargo/reference/profiles.html#release),
which focuses on maximum runtime speed, and I think that is important. However, we can strip things to make the executable smaller. I'll create a PR soon.

@alamb
Contributor Author

alamb commented Dec 24, 2024

FWIW I don't think the size of the datafusion-cli binary is all that critical per se (maybe we can adjust / optimize the size of what is distributed on homebrew).

What I was hoping to review / improve with this PR is the size of the code in general (and review whether there were places that were unnecessarily causing code bloat without also adding value).

@alamb
Contributor Author

alamb commented Feb 14, 2025

At a high level, I think this ticket has 2 parts:

  1. Figure out what is contributing to code size increase
  2. Then perhaps figure out how to make it better

I think the most valuable (and hardest) part is 1 (figuring out what to do)

To do so I recommend doing an "Ablation Study"

> An ablation study aims to determine the contribution of a component to an AI system by removing the component, and then analyzing the resultant performance of the system.

This is a fancy way of saying "remove parts of the system and see how much impact it makes on binary size"

Suggested things to try

I suggest initially just building with different datafusion crate features and seeing how much extra code each contributes to the binary size.

A follow on idea would be to comment out some of the features that require lots of generic code such as

Example of Ablation Study for the parquet feature

For example, to test the impact of the parquet feature, I tested the size of the binary with and without parquet support

cargo build --release -p datafusion-cli
# get size of datafusion-cli in kb
du -k target/release/datafusion-cli

Here is what I got

type                      size in MB   size in KB
default                   58           58440
without parquet feature   53           54248

So I conclude that the parquet feature adds approximately 5 MB to the binary size.
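
The same bookkeeping can be done for several features at once. The sketch below is a minimal illustration with made-up helper names: it subtracts each ablated size from the baseline, using the `du -k` figures from the table. On those kilobyte figures the parquet difference is 4192 KB (about 4.1 MB, or roughly 7% of the default binary), slightly less than the difference between the rounded megabyte values.

```python
# Sizes in KB as reported by `du -k`, copied from the table above.
BASELINE_KB = 58440               # default features
ABLATED_KB = {"parquet": 54248}   # binary size with the named feature removed


def feature_cost_kb(baseline_kb, ablated_kb):
    """KB attributed to each feature: baseline size minus the size without it."""
    return {feature: baseline_kb - size for feature, size in ablated_kb.items()}


def as_percent_of_baseline(cost_kb, baseline_kb):
    """Express each feature's cost as a percentage of the baseline size."""
    return {f: round(100 * kb / baseline_kb, 1) for f, kb in cost_kb.items()}


if __name__ == "__main__":
    costs = feature_cost_kb(BASELINE_KB, ABLATED_KB)
    shares = as_percent_of_baseline(costs, BASELINE_KB)
    # Print the largest contributors first.
    for feature, kb in sorted(costs.items(), key=lambda kv: -kv[1]):
        print(f"{feature:<12} {kb:>6} KB  ({kb / 1024:.1f} MB, {shares[feature]}%)")
```

Further ablations would simply add entries to `ABLATED_KB`, one per feature that was removed and rebuilt.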

To remove parquet support I hacked out the dependency on parquet. Since this was just to test the impact as long as it compiles that is good enough. No need to be pretty. Here is the

@comphead
Contributor

That is a really awesome write-up, @alamb.
Tbh I wanted to do exactly the same, but I had no idea this approach had such a smart name: Ablative Study.

@alamb
Contributor Author

alamb commented Feb 14, 2025

> That is a really awesome write-up, @alamb. Tbh I wanted to do exactly the same, but I had no idea this approach had such a smart name: Ablative Study.

LOL, like most great things, I think I first heard this term used in a paper by Viktor Leis and Thomas Neumann.

@comphead
Contributor

btw after changes in 45.0.0 the image size is 49M 🎉

@alamb
Contributor Author

alamb commented Feb 23, 2025

> btw after changes in 45.0.0 the image size is 49M 🎉

Nice! Do you know what changed?

Indeed I checked on my mac after doing cargo build --release and the size is 58MB:

andrewlamb@Andrews-MacBook-Pro-2:~/Software/datafusion$ du -s -h target/release/datafusion-cli
 58M	target/release/datafusion-cli

@comphead
Contributor

Tbh I don't have an accurate answer for this. I found it when playing with different feature sets on the latest DF, but I remember some packages were moved out of the core or similar.

Experimenting with different feature sets and --no-default-features keeps the binary size the same.

One more optimization is to strip symbols and debug info, which take up some space given our code size. That saves 20% more, going down to 50M.

@comphead
Contributor

Stripped binary, broken down by segment and section:

size -m -l target/release/datafusion-cli
Segment __PAGEZERO: 4294967296 (zero fill)  (vmaddr 0x0 fileoff 0)
Segment __TEXT: 50282496 (vmaddr 0x100000000 fileoff 0)
        Section __text: 41381608 (addr 0x100001a40 offset 6720)
        Section __stubs: 2724 (addr 0x102778928 offset 41388328)
        Section __stub_helper: 2748 (addr 0x1027793cc offset 41391052)
        Section __gcc_except_tab: 2199924 (addr 0x102779e88 offset 41393800)
        Section __const: 2538360 (addr 0x102993000 offset 43593728)
        Section __cstring: 7536 (addr 0x102bfeb78 offset 46132088)
        Section __unwind_info: 594168 (addr 0x102c008e8 offset 46139624)
        Section __eh_frame: 3544616 (addr 0x102c919e0 offset 46733792)
        total 50271684
Segment __DATA_CONST: 2293760 (vmaddr 0x102ff4000 fileoff 50282496)
        Section __got: 72 (addr 0x102ff4000 offset 50282496)
        Section __mod_init_func: 8 (addr 0x102ff4048 offset 50282568)
        Section __const: 2287856 (addr 0x102ff4080 offset 50282624)
        total 2287936
Segment __DATA: 262144 (vmaddr 0x103224000 fileoff 52576256)
        Section __la_symbol_ptr: 1816 (addr 0x103224000 offset 52576256)
        Section __data: 41296 (addr 0x103224720 offset 52578080)
        Section __thread_vars: 720 (addr 0x10322e870 offset 52619376)
        Section __thread_data: 152 (addr 0x10322eb40 offset 52620096)
        Section __thread_bss: 529 (addr 0x10322ebd8 zerofill)
        Section __bss: 214208 (addr 0x10322ee00 zerofill)
        Section __common: 704 (addr 0x1032632c0 zerofill)
        total 259425
Segment __LINKEDIT: 557056 (vmaddr 0x103264000 fileoff 52625408)
total 4348362752
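
To rank the sections in output like this, a small parser is enough. The sketch below is a hedged illustration with made-up names: it parses the `Section` lines of the `__TEXT` segment copied from above and totals the sections usually associated with stack unwinding (`__gcc_except_tab`, `__unwind_info`, `__eh_frame`). Treating exactly those three as "unwinding data" is an assumption for illustration. On these numbers `__text` is about 82% of the segment and the three unwinding sections add up to roughly 6.3 MB.

```python
import re

# Section lines for the __TEXT segment, copied from the `size -m -l` output above.
SIZE_OUTPUT = """\
Segment __TEXT: 50282496 (vmaddr 0x100000000 fileoff 0)
        Section __text: 41381608 (addr 0x100001a40 offset 6720)
        Section __stubs: 2724 (addr 0x102778928 offset 41388328)
        Section __stub_helper: 2748 (addr 0x1027793cc offset 41391052)
        Section __gcc_except_tab: 2199924 (addr 0x102779e88 offset 41393800)
        Section __const: 2538360 (addr 0x102993000 offset 43593728)
        Section __cstring: 7536 (addr 0x102bfeb78 offset 46132088)
        Section __unwind_info: 594168 (addr 0x102c008e8 offset 46139624)
        Section __eh_frame: 3544616 (addr 0x102c919e0 offset 46733792)
        total 50271684
"""

# Matches lines such as "Section __text: 41381608 (addr ...)".
SECTION_RE = re.compile(r"^\s*Section (\S+): (\d+)")

# Assumption: these sections hold exception/unwind tables.
UNWIND_SECTIONS = ("__gcc_except_tab", "__unwind_info", "__eh_frame")


def section_sizes(text):
    """Return {section name: size in bytes} for every Section line in `text`."""
    sizes = {}
    for line in text.splitlines():
        match = SECTION_RE.match(line)
        if match:
            sizes[match.group(1)] = int(match.group(2))
    return sizes


if __name__ == "__main__":
    sizes = section_sizes(SIZE_OUTPUT)
    total = sum(sizes.values())
    # Print the sections from largest to smallest with their share of the total.
    for name, size in sorted(sizes.items(), key=lambda kv: -kv[1]):
        print(f"{name:<18} {size:>10}  {100 * size / total:5.1f}%")
    unwind = sum(sizes[name] for name in UNWIND_SECTIONS)
    print(f"unwinding-related sections: {unwind} bytes")
```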

@comphead
Contributor

So in the data above (ARM macOS) the biggest parts are:

  • code (compiled instructions): 41MB
  • consts (2-3MB)

@alamb WDYT should we dig deeper?

@comphead
Contributor

I checked, and the biggest methods are std panic methods. Removing unwinding can save even more:

panic = "abort"
 du -s -h target/release/datafusion-cli
 40M    target/release/datafusion-cli

@alamb
Contributor Author

alamb commented Feb 24, 2025

> @alamb WDYT should we dig deeper?

I don't think so.

It is fascinating how much binary size we can save without unwinding.

@logan-keede
Contributor

Optimizing binary size #13816

Optimizing DataFusion Binary Size Core/Build Medium @comphead and @alamb Software Engineering, Refactoring, Dependency Management, Compilers 175 to 350 hours*

Are we still considering this for GSoC proposals, given that it is already closed and the binary size is currently 62M (on Linux)?

[nix-shell:~/dev/datafusion]$ du -s -h target/release/datafusion-cli
62M     target/release/datafusion-cli

We can probably still do the Ablative study, in which case maybe this should be reopened.

@comphead
Contributor

Hey @logan-keede I would think this ticket is a good fit for GSoC #14510

@logan-keede
Contributor

> Hey @logan-keede I would think this ticket is a good fit for GSoC #14510

Thanks for the information, I will take a look at that.

@alamb
Contributor Author

alamb commented Mar 10, 2025

>> Hey @logan-keede I would think this ticket is a good fit for GSoC #14510

> Thanks for the information, I will take a look at that.

I personally think reducing the datafusion crate compile time somehow would be far more impactful than the binary size

In other words it would make you a hero!

@comphead
Contributor

Yeah, compile time would be closer to the original Software Engineering, Refactoring, Dependency Management, Compilers title

@logan-keede
Contributor

>>> Hey @logan-keede I would think this ticket is a good fit for GSoC #14510

>> Thanks for the information, I will take a look at that.

> I personally think reducing the datafusion crate compile time somehow would be far more impactful than the binary size

> In other words it would make you a hero!

Thanks again, I believe this proposal might suit me even better considering my efforts over the last month or two.
You can look forward to a proposal draft soon! : )

@alamb
Contributor Author

alamb commented Mar 11, 2025

> Thanks again, I believe this proposal might suit me even better considering my efforts over the last month or two.
> You can look forward to a proposal draft soon! : )

Thank you!

An ablative study would also be sweet!

@logan-keede
Contributor

Hi @comphead, I would really appreciate if you can give me some feedback for my GSoC proposal.
Let me know if that is feasible or if there is anything else that I can do to make it easier for you.

PS: I dropped a draft of the same on your discord. However, it seems you are not active there.

@comphead
Contributor

Hey @logan-keede please ping me in ASF slack, I'm not using discord now

@logan-keede
Contributor

> Hey @logan-keede please ping me in ASF slack, I'm not using discord now

@comphead I pinged you on slack.
