Skip to content

[PERF] Dominance frontier calculation appears to be O(n^2) #1300

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
TyOverby opened this issue Sep 2, 2022 · 14 comments · Fixed by #1319 or ocaml/opam-repository#22503
Closed

[PERF] Dominance frontier calculation appears to be O(n^2) #1300

TyOverby opened this issue Sep 2, 2022 · 14 comments · Fixed by #1319 or ocaml/opam-repository#22503
Labels

Comments

@TyOverby
Copy link
Collaborator

TyOverby commented Sep 2, 2022

A large jsoo application I've encountered at work spends about 9 minutes compiling with js_of_ocaml using direct-compilation (separate compilation is much faster). Profiling the compiler shows that almost all of this time is spent in generate.ml, specifically computing dominance frontiers.

I believe that the issue is that although dominance frontier calculation can be done for the whole program in one pass, the implementation in generate.ml will compute the frontier for each node individually, resulting in no opportunity to share results.

@TyOverby TyOverby added the bug label Sep 2, 2022
@TyOverby
Copy link
Collaborator Author

TyOverby commented Sep 2, 2022

I suspect that there’s something super linear going on here because a from-scratch separate-compilation build of the same project spends ~20x less time doing dominance frontier calculation

@hhugo
Copy link
Member

hhugo commented Sep 2, 2022

Can you share some code reproducing the issue ? What's the size of the bytecode file ? Have you tried --disable inline ?

@TyOverby
Copy link
Collaborator Author

TyOverby commented Sep 2, 2022

Can you share some code reproducing the issue?

Sadly I can not

What's the size of the bytecode file?

112MB

Have you tried --disable inline?

I have not, though I can try it out and see what happens. I should point out that it's using direct-compilation, but doesn't use any of the other "optimization" flags.

@TyOverby
Copy link
Collaborator Author

TyOverby commented Sep 2, 2022

From debugging this a few weeks ago, I remember that the primary issue was that a region of the dominator tree is effectively just a singly-linked list (a node in the dominator tree has a single child, which has a single child, which has a single child, etc...) that it spends most of its time traversing. Is it possible that this is the list of constants/globals or maybe the toplevel statements?

@TyOverby
Copy link
Collaborator Author

TyOverby commented Sep 2, 2022

Have you tried --disable inline?

I have not, though I can try it out and see what happens. I should point out that it's using direct-compilation, but doesn't use any of the other "optimization" flags.

It appears to make no difference

@TyOverby
Copy link
Collaborator Author

TyOverby commented Sep 2, 2022

remember that the primary issue was that a region of the dominator tree is effectively just a singly-linked list

Because of this, I think the dominator tree depth was somewhere on the order of ~100,000

@TyOverby
Copy link
Collaborator Author

TyOverby commented Oct 12, 2022

THE PLOT THICKENS

Passing --sourcemap makes basically every pass in the compiler dramatically slower, including codegen.

With --sourcemap

  parsing js: 0.19
Start parsing...
  parsing: 13.50
Start Optimizing...
  tail calls: 0.13
    flow analysis 1: 1.52
    flow analysis 2: 1.07
    flow analysis 3: 0.25
    flow analysis 4: 1.23
    flow analysis 5: 4.42
  flow analysis: 8.49
  dead code elim.: 1.48
  inlining: 0.73
  dead code elim.: 1.61
  tail calls: 0.08
    phi-simpl. 1: 0.48
    phi-simpl. 2: 0.56
  phi-simpl.: 1.24
    flow analysis 1: 0.30
    flow analysis 2: 0.33
    flow analysis 3: 0.21
    flow analysis 4: 0.38
    flow analysis 5: 0.85
  flow analysis: 2.06
  dead code elim.: 1.07
  inlining: 0.31
  dead code elim.: 0.79
    flow analysis 1: 0.14
    flow analysis 2: 0.20
    flow analysis 3: 0.18
    flow analysis 4: 0.27
    flow analysis 5: 0.53
  flow analysis: 1.32
  dead code elim.: 1.14
  inlining: 0.34
  dead code elim.: 0.85
    phi-simpl. 1: 0.14
    phi-simpl. 2: 0.20
  phi-simpl.: 0.53
    flow analysis 1: 0.39
    flow analysis 2: 0.39
    flow analysis 3: 0.39
    flow analysis 4: 0.42
    flow analysis 5: 0.47
  flow analysis: 2.07
  free vars: 0.45
  dead code elim.: 0.81
 optimizations : 27.86
Start Generation...
// 10 minutes pass and then it stack overflows

Without --sourcemap

  parsing js: 0.19
Start parsing...
  parsing: 0.84
Start Optimizing...
  tail calls: 0.07
    flow analysis 1: 0.43
    flow analysis 2: 0.60
    flow analysis 3: 0.08
    flow analysis 4: 0.40
    flow analysis 5: 0.43
  flow analysis: 1.94
  dead code elim.: 0.71
  inlining: 0.41
  dead code elim.: 0.81
  tail calls: 0.08
    phi-simpl. 1: 0.46
    phi-simpl. 2: 0.39
  phi-simpl.: 1.00
    flow analysis 1: 0.24
    flow analysis 2: 0.29
    flow analysis 3: 0.10
    flow analysis 4: 0.19
    flow analysis 5: 0.34
  flow analysis: 1.16
  dead code elim.: 0.54
  inlining: 0.29
  dead code elim.: 0.57
    flow analysis 1: 0.35
    flow analysis 2: 0.25
    flow analysis 3: 0.08
    flow analysis 4: 0.11
    flow analysis 5: 0.16
  flow analysis: 0.94
  dead code elim.: 0.50
  inlining: 0.25
  dead code elim.: 0.58
    phi-simpl. 1: 0.18
    phi-simpl. 2: 0.12
  phi-simpl.: 0.65
    flow analysis 1: 0.26
    flow analysis 2: 0.15
    flow analysis 3: 0.08
    flow analysis 4: 0.10
    flow analysis 5: 0.18
  flow analysis: 0.78
  free vars: 0.30
  dead code elim.: 0.52
 optimizations : 13.46
Start Generation...
  code gen.: 10.80
Start Linking...
  linking: 0.53
Start Optimizing js...
    share constant: 0.74
    simpl: 0.31
    clean: 0.26
    shortten vars: 1.28
  optimizing: 3.24
Start Coloring...
  coloring: 5.79
Start Checks...
  checks: 1.21
Start Writing file...
  write: 0.28
compilation: 36.35

@TyOverby
Copy link
Collaborator Author

TyOverby commented Oct 12, 2022

The same is true for jsoo link

with --sourcemap: 5.1s
without --sourcemap: 0.5s
without --sourcemap and all input .js files don't have a sourcemap comment: 0.08s

@TyOverby
Copy link
Collaborator Author

This one doesn't surprise me as much; most of what I want jsoo-link to do is stitch-up sourcemaps correctly, but I wasn't expecting a full order of magnitude

@hhugo
Copy link
Member

hhugo commented Oct 13, 2022

without --sourcemap: 0.5s
without --sourcemap and all input .js files don't have a sourcemap comment: 0.08s

#1313 should reduce this difference.

with --sourcemap: 5.1s
without --sourcemap: 0.5s

It would be interesting to get numbers about the size of the debug info.
Can you check the impact of #1313 on these timings ? I suspect it won't change much.

@hhugo
Copy link
Member

hhugo commented Oct 13, 2022

Passing --sourcemap makes basically every pass in the compiler dramatically slower, including codegen.

passing --sourcemap forces the debug info to be read and used. You can achieve the same thing with --debug-info or --pretty.

Could you report timings when using --debug-info and --pretty separately ?

I suspect that

if Debug.enabled debug_data then debug_data else Debug.create ~toplevel:false false
is what cost you the most (triggered by --debug-info and --source-map, but not --pretty). It creates lots of "artificial" blocks for the purpose of holding more debug-info.
The change was Introduced in 08124c05888dc6cd33d4fa261dcd9803e6864b77https://github.com/ocsigen/js_of_ocaml/commit/08124c05888dc6cd33d4fa261dcd9803e6864b77.

translate_instrs st.ctx queue (source_location st.ctx pc) block.body
is where we map the pc of a block to its location info.

If my understanding is correct, we should be able to disconnect how we track debug infos from how we split blocks. Maybe @vouillon has more to say about this.

@TyOverby
Copy link
Collaborator Author

Can you check the impact of #1313 on these timings ? I suspect it won't change much.

I'll try it out

@TyOverby
Copy link
Collaborator Author

For what it's worth, I think that my original diagnosis that dominance frontier calculation is quadratic is still correct. This recent development with debuginfo shows that every pass gets ~2x slower, but codegen gets slower quadratically with respect to code size.

@hhugo
Copy link
Member

hhugo commented Oct 18, 2022

Here is a minimal repro case.

500 consecutive function calls

$ ocaml <(echo "for i = 0 to 500  do print_endline \"print_char '.';\"; done") > a.ml && ocamlc -g a.ml -o a.byte && js_of_ocaml --debuginfo --debug times a.byte
...
Start Generation...
  code gen.: 0.03
...
compilation: 0.17

5_000 consecutive function calls

$ ocaml <(echo "for i = 0 to 5000  do print_endline \"print_char '.';\"; done") > a.ml && ocamlc -g a.ml -o a.byte && js_of_ocaml --debuginfo --debug times a.byte
...
Start Generation...
  code gen.: 3.14
...
compilation: 3.56

10_000 consecutive function calls

$ ocaml <(echo "for i = 0 to 10000  do print_endline \"print_char '.';\"; done") > a.ml && ocamlc -g a.ml -o a.byte && js_of_ocaml --debuginfo --debug times a.byte
...
Start Generation...
  code gen.: 14.13
...
compilation: 14.97

hhugo pushed a commit to hhugo/opam-repository that referenced this issue Nov 15, 2022
…s_of_ocaml-ppx_deriving_json, js_of_ocaml-ppx, js_of_ocaml-lwt and js_of_ocaml-compiler (4.1.0)

CHANGES:

## Features/Changes
* Compiler: initial support for OCaml 5 (ocsigen/js_of_ocaml#1265,ocsigen/js_of_ocaml#1303)
* Compiler: bump magic number to match the 5.0.0~alpha0 release (ocsigen/js_of_ocaml#1288)
* Compiler: complain when runtime and compiler built-in primitives disagree (ocsigen/js_of_ocaml#1312)
* Compiler: more efficient implementation of Js_traverse.freevar
* Compiler: more efficient implementation of Js_traverse.rename_variable
* Compiler: --linkall now export all compilation units in addition to primitives (ocsigen/js_of_ocaml#1324)
* Compiler: improve --dynlink, one no longer need to pass --toplevel to use Dynlink (ocsigen/js_of_ocaml#1324)
* Compiler: toplevel runtime files "+toplevel.js" and "+dynlink.js" are added automatically (ocsigen/js_of_ocaml#1324)
* Misc: switch to cmdliner.1.1.0
* Misc: remove old binaries jsoo_link, jsoo_fs
* Misc: remove uchar dep
* Misc: use 4.14 in the CI
* Misc: switch to dune 3
* Lib: add missing options for Intl.DateTimeFormat
* Lib: add missing options for Intl.NumberFormat
* Lib: wheel event binding
* Lib: add normalize in js_string (ES6)
* Lib: more complete transition event bindings
* Lib: remove support for old browser-specific transition events
* Runtime: Implement weak semantic for weak and ephemeron
* Runtime: Implement Gc.finalise_last
* Runtime: Implement buffer for in_channels
* Runtime: add support for unix_opendir, unix_readdir, unix_closedir, win_findfirst, win_findnext, win_findclose
* Runtime: Dont use require when target-env is browser
* Runtime: Implements Parsing.set_trace (ocsigen/js_of_ocaml#1308)
* Test: track external used in the stdlib and unix

## Bug fixes
* Compiler: fix quadratic behavior of dominance frontier (fix ocsigen/js_of_ocaml#1300)
* Compiler: fix rewriter bug in share_constant (fix ocsigen/js_of_ocaml#1247)
* Compiler: fix miscompilation of mutually recursive functions in loop (ocsigen/js_of_ocaml#1321)
* Compiler: fix bug while minifying/renaming try-catch blocks
* Compiler: no dead code elimination for caml_js_get
* Runtime: fix ocamlyacc parse engine (ocsigen/js_of_ocaml#1307)
* Runtime: fix Out_channel.is_buffered, set_buffered
* Runtime: fix format wrt alternative
* Runtime: fix Digest.channel
* Runtime: sync channel seek / pos with the native runtime
* Misc: fix installation with dune 3 without opam
* Node: Only write small chunks to stdout/stderr so they flush
* Deriving: fix for nested polymorphic variants
@hhugo hhugo mentioned this issue Dec 8, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment