Rewrite inlining pass #1935

Merged · 7 commits merged into master from inlining on May 21, 2025

Conversation

@vouillon (Member)

No description provided.

@vouillon force-pushed the inlining branch 3 times, most recently from 840420d to 7b64a79 (April 14, 2025 23:08)
@vouillon force-pushed the inlining branch 3 times, most recently from 79446f9 to ba1a622 (April 16, 2025 15:48)
@vouillon force-pushed the inlining branch 4 times, most recently from b62b39e to 5cb6652 (April 24, 2025 17:49)
@vouillon marked this pull request as ready for review April 24, 2025 17:50
@hhugo (Member) commented Apr 25, 2025

I've pushed a fixup to the testsuite.
We should check how this PR affects functor-heavy programs (something using core, maybe).
@TyOverby, could you test this PR on your side?

@hhugo (Member) commented Apr 25, 2025

We need a changelog entry

@hhugo (Member) commented Apr 25, 2025

I'm not certain I read the benchmark correctly.
It seems that partial render table sees a 10% code size increase, a ~50% memory increase, and a 30% compilation time increase for no runtime improvement.

@hhugo (Member) commented May 6, 2025

Maybe we can wait for #1962 to get better measurements.

@hhugo (Member) commented May 7, 2025

We don't have the latest benchmark.
The last result we have shows a runtime regression for ocamlc (maybe some noise?).

With this PR, we seem to double the time spent in the inline pass. We can probably live with that.

@hhugo force-pushed the inlining branch 2 times, most recently from 10a1ba8 to 6aaf9ad (May 7, 2025 21:59)
@vouillon (Member, Author)

> out of curiosity, what was the osx / node-24 issue? consuming too much memory?

@TyOverby See my comment above.

@vouillon (Member, Author)

> I'm not certain I read the benchmark correctly. It seems that partial render table sees a 10% code size increase, a ~50% memory increase, and a 30% compilation time increase for no runtime improvement.

Right, the aggressive inlining of functors does not really seem to result in any runtime improvement with js_of_ocaml. So it is now enabled only with wasm_of_ocaml.
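
For illustration, here is a minimal sketch (hypothetical names, not code from this PR) of why inlining a functor application can pay off: once the functor body is inlined at the application site, calls through the argument module become direct and can be optimized further.

  (* Hypothetical example. Inlining the application of [MakeMax]
     exposes the concrete [compare] at the call site, so the
     indirect call can become a direct integer comparison. *)
  module type ORD = sig
    type t
    val compare : t -> t -> int
  end

  module MakeMax (O : ORD) = struct
    (* Without inlining, [O.compare] is an opaque indirect call. *)
    let max a b = if O.compare a b >= 0 then a else b
  end

  module IntMax = MakeMax (struct
    type t = int
    let compare = Int.compare
  end)

  let () = assert (IntMax.max 3 7 = 7)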

@hhugo (Member) commented May 16, 2025

I've pushed commits to only inline (small) functors in o3 with jsoo. Let's wait for the benchmarks

@hhugo force-pushed the inlining branch 2 times, most recently from 13f7293 to 4ac6713 (May 16, 2025 10:58)
@hhugo (Member) commented May 16, 2025

fannkuch_redux and fft seem to take longer now. Can you take a look? Compilation time increases everywhere, but I guess we can live with that given the recent improvements everywhere else.

@vouillon (Member, Author)

For fft, it's because a function no longer gets inlined: I have reduced the inlining limit from 200 down to 150.
For fannkuch_redux, the function fannkuch is no longer inlined at toplevel, so it is not optimized under the assumption that n = 10.

  let n = 10 in
  let _maxflips, _checksum = fannkuch n in

Inlining small functions makes a significant difference for raytrace.
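
For illustration, a rough sketch of the specialization at stake, with a stand-in body rather than the real benchmark: when the call is inlined at toplevel, n becomes the literal 10, so loop bounds are known constants; when it is not, n remains an opaque argument.

  (* Stand-in body, not the real benchmark. If [fannkuch 10] is
     inlined at toplevel, [n] below becomes the constant 10, so the
     loop bound is known and further simplification is possible. *)
  let fannkuch n =
    let flips = ref 0 in
    for i = 1 to n do
      flips := !flips + i
    done;
    (!flips, !flips * n)

  let () =
    let n = 10 in
    let _maxflips, _checksum = fannkuch n in
    ()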

@hhugo (Member) commented May 16, 2025

> For fft, it's because a function no longer gets inlined: I have reduced the inlining limit from 200 down to 150. For fannkuch_redux, the function fannkuch is no longer inlined at toplevel, so it is not optimized under the assumption that n = 10.
>
>   let n = 10 in
>   let _maxflips, _checksum = fannkuch n in
>
> Inlining small functions makes a significant difference for raytrace.

Are you OK to merge in the current state?

@hhugo (Member) commented May 16, 2025

> Apologies for the delay; I didn't see this thread for a while. We should have some test and benchmark results ready for you next week.

@TyOverby, any update on this?

@TyOverby (Collaborator)

We've been trying to import these changes (well, really the base revision, so that we have a good point to compare benchmarks with) and have hit a very large number of conflicts with our internal patches due to the recent PRs that have been merged. I think we're close to being ready to test this PR; my guess is next week.

@vouillon (Member, Author)

> Are you OK to merge in the current state?

I would prefer to wait for some feedback from Ty.

@rickyvetter (Contributor)

We were able to pull this in internally, and performance looks very good! Substantially faster and more consistent on PRT and our other internal benchmarks. For Bonsai benchmarks we are seeing a 50%-80% reduction in benchmarking times. Binary size shows a <1% increase with separate compilation and 0-2% for whole-program compilation. There are a couple of outlier programs that increase in the 10-16% range.

We've reached out about a miscompilation issue on Slack; initially we believed it was unrelated to this PR directly, but it looks like applying this patch actually causes a very similar miscompilation in a program that didn't have it before. This one new case is the only test we have failing, and I suspect that if we resolve the minimal repro for the original issue, we might also see how to resolve this new instance in this PR.

@TyOverby (Collaborator)

For the Bonsai benchmarks, I suspect that the large improvements are due to the inlining-related memory leak being resolved by this PR.

@hhugo (Member) commented May 21, 2025

> We were able to pull this in internally, and performance looks very good! Substantially faster and more consistent on PRT and our other internal benchmarks. For Bonsai benchmarks we are seeing a 50%-80% reduction in benchmarking times. Binary size shows a <1% increase with separate compilation and 0-2% for whole-program compilation. There are a couple of outlier programs that increase in the 10-16% range.
>
> We've reached out about a miscompilation issue on Slack; initially we believed it was unrelated to this PR directly, but it looks like applying this patch actually causes a very similar miscompilation in a program that didn't have it before. This one new case is the only test we have failing, and I suspect that if we resolve the minimal repro for the original issue, we might also see how to resolve this new instance in this PR.

Anything to say on compilation times?

vouillon and others added 7 commits May 21, 2025 09:34

- We are a lot more aggressive at inlining functor-like functions in wasm_of_ocaml, since this may enable further optimizations
- We are more cautious at inlining nested functions, since this can result in memory leaks (see the sketch below)
- We inline a larger class of small functions
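
On the nested-functions point, a hedged sketch of one way such inlining can leak (illustrative names, not code from this PR): sibling closures created in the same scope may share a single environment record in the generated JavaScript, so a long-lived closure can retain values that only a short-lived sibling captured, and inlining nested functions enlarges that shared scope.

  (* Illustrative only. [report] captures the large buffer [big];
     the escaping closure below does not. If inlining merges their
     scopes in the generated JavaScript, the shared environment can
     keep [big] reachable for the whole lifetime of the handler. *)
  let make_handler () =
    let big = Bytes.create (10 * 1024 * 1024) in
    let report () = print_int (Bytes.length big) in
    report ();
    (* long-lived closure that escapes [make_handler] *)
    fun () -> print_endline "tick"

  let _handler : unit -> unit = make_handler ()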
@hhugo merged commit 3695d26 into master on May 21, 2025
25 of 26 checks passed
@hhugo deleted the inlining branch (May 21, 2025 07:37)
@hhugo (Member) commented May 21, 2025

Let's merge and move on from there.
