This repository was archived by the owner on Feb 5, 2019. It is now read-only.

remove inlined functions for faster compilation #92

Closed

wants to merge 1 commit into from

Conversation

andjo403

After inlining, remove functions that are no longer called.
This reduces compile times because the removed functions are not further optimized.

part of solution for rust-lang/rust#44655

@alexcrichton
Member

Wow, those are some awesome wins you're seeing! Would you be willing to try to send this upstream to LLVM itself as well? Unfortunately I'm less familiar with their contribution process, but it'd be great to ensure that their experienced eyes also took a look at this!

@andjo403
Author

I'm waiting for a Bugzilla account for LLVM so I can file a ticket upstream.

@andjo403
Author

llvm ticket

@alexcrichton
Member

Thanks @andjo403! Let's wait a few days and see what happens upstream there.

@dcci

dcci commented Sep 18, 2017

LLVM dev here [with a hidden passion for rust] :)
I see the rationale behind your commit. I think it makes sense, and you're probably seeing a big benefit because you have a bunch of small functions (even better if marked as static) with a single caller.
If you look at the inliner code in LLVM you'll see it gives a ridiculous bonus (grep for LastCallToStaticBonus, or something along these lines).

I wonder whether it's better to run GDCE after the inliner or to have the inliner itself remove dead functions after they're inlined (admittedly, I haven't looked at whether the latter is feasible).

FWIW, what's your benchmark? Can you try with big cases (like building rust, or building clang itself, maybe with LTO [this last one generally should leave a bunch of unused functions around])?
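The idea behind running GDCE after the inliner can be illustrated with a toy sketch (this is not LLVM's actual implementation; names and types here are made up for illustration): mark every function reachable from an externally visible root, then discard the internal functions nothing reaches any more. Each function and edge is visited once, which is also why the pass is roughly linear in module size.

```cpp
#include <cassert>
#include <functional>
#include <map>
#include <set>
#include <string>
#include <vector>

// Toy "module": each function has a linkage flag and a list of callees.
struct Function {
    bool isInternal;                  // static / private linkage
    std::vector<std::string> callees; // direct calls
};

// GlobalDCE-style sweep: mark everything reachable from externally
// visible functions, then report the unreachable internal ones as dead.
std::set<std::string> deadFunctions(const std::map<std::string, Function>& mod) {
    std::set<std::string> alive;
    std::function<void(const std::string&)> mark = [&](const std::string& name) {
        if (!alive.insert(name).second) return; // already visited
        for (const auto& callee : mod.at(name).callees) mark(callee);
    };
    for (const auto& [name, fn] : mod)
        if (!fn.isInternal) mark(name); // external symbols are roots
    std::set<std::string> dead;
    for (const auto& [name, fn] : mod)
        if (!alive.count(name)) dead.insert(name);
    return dead;
}
```

After the inliner has copied a small static function's body into its only caller and dropped the call, that function becomes exactly the kind of unreachable internal definition this sweep removes, so later passes never waste time optimizing it.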

@alexcrichton
Member

@dcci hello again! FWIW on rust-lang/rust#44655 it mentions a 23% improvement for the racer project which I think is relatively large, and it also mentions a 15% improvement for rustc itself which I know for a fact is quite large! I haven't tried this out myself yet, though, to verify.

@dcci dcci left a comment

See my comment.

@@ -478,6 +478,8 @@ void PassManagerBuilder::populateModulePassManager(
if (Inliner) {
MPM.add(Inliner);
Inliner = nullptr;
// Removes functions that is not used due to inlining
nit: that "are" not used.

Author

Thanks, will fix this.

@dcci

dcci commented Sep 18, 2017

OK, thanks for your datapoints.
FWIW, GDCE is O(n), so I don't expect it to take a lot of time, but still, we should do a more careful analysis before changing the standard -O2 pipeline. If you're willing to run some tests on rust, I'm going to run something equivalent on C++ (as I mentioned earlier, I'm particularly interested in LTO).
I'm a little swamped at the moment, so it may take a week or two before we get the patch in; how does that sound as a timeline?

Thanks!

@alexcrichton
Member

Sure yeah! @andjo403 if you want to poke around as well I can try to reproduce the rustc build times and also test some projects like Cargo and some simple-ish LTO cases.

@andjo403
Author

@alexcrichton sounds good. Will try some more projects. I was thinking of running the https://github.com/rust-lang-nursery/rustc-benchmarks compare.py, but I will have to install VirtualBox or something because it uses make and I'm on Windows.

@alexcrichton
Member

Some stats:

These numbers are for whole compilations, basically how long it took cargo build --release to finish, which means it's timing a whole lot more than just LLVM:

  • rustc itself 505s -> 423s
  • cargo 215s -> 163s
  • small async I/O server 57s -> 48s

Some other numbers

  • Compiling Cargo with LTO, just the final rustc invocation 106s -> 95s

Definitely seems like some possible big improvements to be had! Apparently not helping LTO a whole lot though :(

@andjo403
Author

@dcci and @alexcrichton maybe I do not understand how LTO works, but I do not see how functions can be removed from another crate. The linkage must not be private for LTO to do anything, and it feels like GDCE can only remove functions with private linkage.
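That intuition matches why a GlobalDCE-style pass has to be conservative about linkage: an externally visible symbol might be referenced from another translation unit (or crate) that the pass cannot see, so only definitions with local linkage are safe to discard even when nothing in the current module uses them. A toy sketch of that rule (simplified; real LLVM has more linkage kinds, some of which, like linkonce, are also discardable):

```cpp
#include <cassert>

// Simplified linkage kinds, loosely modeled on LLVM's.
enum class Linkage { External, Internal, Private };

// A definition with no uses in the current module may still be referenced
// from another module unless its linkage is local, so only internal/private
// definitions are safe for a GlobalDCE-style pass to remove.
bool discardableIfUnused(Linkage l) {
    return l == Linkage::Internal || l == Linkage::Private;
}
```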

@alexcrichton
Member

@andjo403 oh, I think it's just that with LTO, LLVM typically works with much larger modules (lots more code in memory all at once). In theory there's also tons of inlining and in theory greater wins to be had, but the one test I had didn't look too promising unfortunately :(

@dcci

dcci commented Sep 18, 2017

@alexcrichton Alex, I'm a little surprised by these numbers as, FWIW, LTO should be pretty much the best case, though not always in practice.
Please note that if you do LTO, the inliner runs twice: first during the per-TU pipeline (i.e. -O2 and -O3), and again on the whole module.
I'm not sure how rust actually sets up the pipeline, but if it uses the stock one from LLVM, you might consider adding the pass also to populateLTOPassManager (or something similar, I don't remember the name off the top of my head); otherwise, GDCE won't run after the inliner at link time.
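To make the suggestion concrete, the link-time change would mirror the per-TU one in this PR, but placed in the LTO pipeline builder. A rough sketch (the exact surrounding code and placement in PassManagerBuilder.cpp are not verified here, and createFunctionInliningPass is just a stand-in for whichever inliner the LTO pipeline installs):

```cpp
// In lib/Transforms/IPO/PassManagerBuilder.cpp (sketch, not the actual patch):
void PassManagerBuilder::populateLTOPassManager(legacy::PassManagerBase &PM) {
  // ... existing LTO pipeline setup ...
  PM.add(createFunctionInliningPass());
  // Clean up functions left fully inlined at link time, mirroring the
  // per-TU change; without this, GDCE never runs after the LTO inliner.
  PM.add(createGlobalDCEPass());
  // ... rest of the LTO pipeline ...
}
```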

@alexcrichton
Member

Aha indeed! Looks like there's a spot this PR as-is doesn't handle. I've updated that and will report back the LTO build for Cargo.

@dcci

dcci commented Sep 18, 2017

Yes, that's exactly it. You were only getting the benefit from the per-TU pipeline.

@alexcrichton
Member

Ah unfortunately that ended up having no impact at all :(

@andjo403
Author

andjo403 commented Sep 18, 2017

@alexcrichton
Member

Ah that'd explain the lack of improvements!

@dcci

dcci commented Sep 21, 2017

@alexcrichton @andjo403
So, I've been looking at this more closely. I'm not able to find any obvious regression, so I think I'm going to send a slightly modified version of this patch for review (probably we should also run another pass, GlobalOpt, in tandem with GlobalDCE). Once that is done, I guess you guys can backport this thingy? Of course, I'll give the correct attribution for the idea =)

@alexcrichton
Member

@dcci sounds great, and thanks again for helping out with this! If you've got a patch I can run some tests for it for sure. And yeah we'll take care of backporting for our own "fork", no need to worry about that!

@andjo403
Author

andjo403 commented Sep 21, 2017

@alexcrichton shall I close this PR then?
@dcci sounds amazing thanks for the help

@dcci

dcci commented Oct 2, 2017

FWIW, I'm working on this, but the patch exposes a series of misoptimizations in LLVM that need to be fixed. I'm working on these now and I hope to enable the patch once they're all addressed.

@dcci

dcci commented Oct 2, 2017

OK, I think the patch was placed in a slightly incorrect point in the pipeline.
Can you please re-run with the following patch applied?

Thanks!

diff --git a/lib/Transforms/IPO/PassManagerBuilder.cpp b/lib/Transforms/IPO/PassManagerBuilder.cpp
index b384629..9155d1a 100644
--- a/lib/Transforms/IPO/PassManagerBuilder.cpp
+++ b/lib/Transforms/IPO/PassManagerBuilder.cpp
@@ -499,10 +499,14 @@ void PassManagerBuilder::populateModulePassManager(
   // Start of CallGraph SCC passes.
   if (!DisableUnitAtATime)
     MPM.add(createPruneEHPass()); // Remove dead EH info
+  bool RunInliner = false;
   if (Inliner) {
     MPM.add(Inliner);
     Inliner = nullptr;
+    RunInliner = true;
   }
+
+
   if (!DisableUnitAtATime)
     MPM.add(createPostOrderFunctionAttrsLegacyPass());
   if (OptLevel > 2)
@@ -515,6 +519,18 @@ void PassManagerBuilder::populateModulePassManager(
   // pass manager that we are specifically trying to avoid. To prevent this
   // we must insert a no-op module pass to reset the pass manager.
   MPM.add(createBarrierNoopPass());
+
+  // The inliner performs some kind of dead code elimination as it goes,
+  // but there are cases that are not really caught by it. We might
+  // at some point consider teaching the inliner about them, but it
+  // is OK for now to run GlobalOpt + GlobalDCE in tandem as their
+  // benefits generally outweigh the cost, making the whole pipeline
+  // faster. See PR34652.
+  if (RunInliner) {
+    MPM.add(createGlobalOptimizerPass());
+    MPM.add(createGlobalDCEPass());
+  }
+
   if (RunPartialInlining)
     MPM.add(createPartialInliningPass());

After inlining, remove functions that are no longer called.
This reduces compile times because the removed functions are not further optimized.
@andjo403
Author

andjo403 commented Oct 2, 2017

I have updated this PR with the patch from @dcci so it will be easier for others to test.
I have started the compilation; not sure if I will wait for the results tonight.

@andjo403
Author

andjo403 commented Oct 2, 2017

@dcci I have only tested compiling racer once with standard rustc and once with the patch, and I get:
LLVM pass execution timing report for the racer lib compilation:
standard rustc Total Execution Time: 64.7500 seconds (68.0480 wall clock)
patch Total Execution Time: 35.9531 seconds (37.2969 wall clock)

and for all crates in racer the total time goes from 405.71 secs to 315.5 secs.
Will run some more tests tomorrow.

@alexcrichton can you also run some tests?

@dcci

dcci commented Oct 2, 2017

OK, I think this is fine.

@dcci

dcci commented Oct 5, 2017

Upstream rL314997 and rL314999.
Cheers.

@andjo403
Author

andjo403 commented Oct 5, 2017

@dcci thanks for all the help

@alexcrichton
Member

Thanks so much @dcci!

@andjo403 oh, I actually just pushed this to the relevant branch, after ensuring that it builds locally first. @andjo403 want to send a PR to rust-lang/rust?

@andjo403
Author

andjo403 commented Oct 5, 2017

@alexcrichton yes I want to send a PR to rust-lang/rust

@andjo403 andjo403 closed this Oct 5, 2017
@cuviper
Member

cuviper commented Oct 5, 2017

@alexcrichton looks like you only pushed 314997 -- did you intend to backport 314999 too?

@alexcrichton
Member

Oh oops, thanks @cuviper! Apparently when I fetched, 314999 wasn't in the mirror yet...

@andjo403
Author

andjo403 commented Oct 5, 2017

@alexcrichton I wanted to test the commit rust-lang/rust@8fd3c8f
before making a PR but I get:

failed to run: /home/ubuntu/rust/build/x86_64-unknown-linux-gnu/stage0/bin/cargo build --manifest-path /home/ubuntu/rust/src/bootstrap/Cargo.toml

@alexcrichton
Member

@andjo403 sounds suspicious! Mind gisting the full error you saw?

@andjo403
Author

andjo403 commented Oct 5, 2017

@alexcrichton
Member

Ah try doing git submodule update --init --recursive maybe?

@andjo403
Author

andjo403 commented Oct 5, 2017

Yes, that solved it, thanks!

@andjo403 andjo403 deleted the remove_dead_code_after_inline branch October 5, 2017 19:26
4 participants