Description
Epic for improving how the jit produces and consumes profile data, with an emphasis on the "dynamic" scenario where everything happens in-process.
Much of the work is also applicable to AOT PGO scenarios.
All non-stretch items are completed for .NET 6. We'll open a follow-on issue to capture the stretch items below and new work envisioned for .NET 7.
Link to related GitHub project
Overview document: Dynamic PGO
(intro from that doc)
Profile based optimization relies heavily on the principle that past behavior is a good predictor of future behavior. Thus observations about past program behavior can steer optimization decisions in profitable directions, so that future program execution is more efficient.
These observations may come from the recent past, perhaps even from the current execution of a program, or from the distant past. Observations can be from the same version of the program or from different versions.
Observations are most often block counts, but can cover many different aspects of behavior; some of these are sketched below.
A number of important optimizations are really only practical when profile feedback is available. Key among these is aggressive inlining, but many other speculative, time-consuming, or size-expanding optimizations fall in this category.
Profile feedback is especially crucial in JIT-based environments, where compile time is at a premium. Indeed, one can argue that the performance of modern Java and JavaScript implementations hinges crucially on effective leverage of profile feedback.
Profile guided optimization benefits both JIT and AOT compilation. While this document focuses largely on the benefits to JIT compilation, much of what follows is also applicable to AOT. The big distinction is ease of use -- in a jitted environment profile based optimization can be done automatically, and so can be offered as a platform feature without requiring any changes to applications.
.NET currently has a somewhat arm's-length approach to profile guided optimization, and does not obtain much benefit from it. Significant opportunity awaits us if we can tap into this technology.
.NET 6 Scenarios
- [Dynamic PGO] Users can opt into a dynamic PGO mode where their application is automatically profiled and optimized as methods are tiered up. Full benefit may require enabling other non-default behaviors (e.g. enabling QuickJitForLoops, disabling ReadyToRun).
- [Static PGO] Users can collect profile data created by jit applied instrumentation and apply it to future runs of their application, both for jitted and prejitted code. This mode also replaces the current IBC technology and is used internally in our own builds.
- [Sample based PGO] Users can provide sample data from performance profilers that the jit can ultimately leverage to improve code generation.
Work items
(stretch) indicates things that are not going to make it into .NET 6.0.
Representation of Profile Data
- use floating point for block weights (Float weights #44983)
- (stretch) remove edge weight and min/max in favor of successor likelihood (in progress; JIT: refactor edge weight representation #46885)
- (stretch) normalize counts from the start (JIT: use normalized counts throughout #46883)
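
As a rough illustration of the two representation changes above (successor likelihoods instead of raw edge weights, and counts normalized from the start), here is a minimal Python sketch. It is not the jit's code: the function names, the dictionary-based flow graph, and the even-split fallback for unexecuted blocks are all assumptions made for the example.

```python
from typing import Dict, Tuple

Edge = Tuple[str, str]  # (source block, destination block); hypothetical representation


def successor_likelihoods(edge_counts: Dict[Edge, float]) -> Dict[Edge, float]:
    """Turn raw edge counts into per-source likelihoods that sum to 1.0."""
    totals: Dict[str, float] = {}
    fanout: Dict[str, int] = {}
    for (src, _dst), count in edge_counts.items():
        totals[src] = totals.get(src, 0.0) + count
        fanout[src] = fanout.get(src, 0) + 1

    likelihoods: Dict[Edge, float] = {}
    for (src, dst), count in edge_counts.items():
        if totals[src] == 0:
            # Assumption: with no observations, split evenly among successors.
            likelihoods[(src, dst)] = 1.0 / fanout[src]
        else:
            likelihoods[(src, dst)] = count / totals[src]
    return likelihoods


def normalize_block_counts(block_counts: Dict[str, float], entry: str) -> Dict[str, float]:
    """Express each block count as executions per method invocation."""
    entry_count = block_counts[entry]
    if entry_count == 0:
        return {block: 0.0 for block in block_counts}
    return {block: count / entry_count for block, count in block_counts.items()}


# A diamond: entry branches to then/else, which rejoin at exit.
edges = {("entry", "then"): 30, ("entry", "else"): 70,
         ("then", "exit"): 30, ("else", "exit"): 70}
blocks = {"entry": 100, "then": 30, "else": 70, "exit": 100}

likelihoods = successor_likelihoods(edges)
normalized = normalize_block_counts(blocks, "entry")
```

One reason a likelihood form is attractive: a single value per edge stays meaningful when block weights are later rescaled (for example after inlining), whereas raw edge weights would need to be rescaled along with them.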
Incorporation of profile data
- ensure that 'fgComputeBlockAndEdgeWeights' is not overly pessimistic, or remove it altogether. (JIT: tolerate edge profile inconsistencies better. #50213, JIT: don't allow negative edge weights #52884)
- (stretch) support for user annotations (RyuJIT: Allow developers to provide Branch Prediction Information #6225)
- (stretch) move building of pred lists earlier
- (stretch) implement profile synthesis
- (stretch) implement blended/hedged profiles
- make sure the jit does reasonable things when the caller has no PGO data but the callee does (JIT: fixes for mixed PGO/nonPGO compiles #50633) (see this note). This is now common even with dynamic PGO: with minimal probing enabled, simple "wrapper" methods have no profile data.
Heuristics and Optimization
- create a more profile-aware inline policy (PGO Inlining Policy #43914, [JIT] Improve inliner: new heuristics, rely on PGO data #52708, Inliner: new observations (don't impact inlining for now) #53670, JIT: don't report profile for methods with insufficient samples during prejitting #55096)
- enable guarded devirtualization by class profile (Initial version of class profiling for PGO #45133)
- remove struct limitation from guarded devirtualization (Update guarded devirtualization to handle struct returns #51138)
- ensure we have proper cleanup opts for redundant class tests (JIT: jump threading #46257, JIT: optimize redundant type tests created by PGO #46887, JIT: chained guarded devirtualization #51890)
- enable hot/cold splitting based on block profile (Disable HANDLER_ENTRY_MUST_BE_IN_HOT_SECTION #71273)
- allow testing hot/cold splitting in the jit without direct runtime support (Implement fake hot/cold splitting and corresponding stress mode #69763, Add hot/cold splitting test job to jit-runtime-experimental #69922, Enable fake hot/cold splitting on ARM64 #70708)
- (stretch) evaluate and adjust heuristics in other phases as needed
- allow CSE of method table and vtable lookups (partially done, see JIT: consistently mark method table accesses as invariant #45854. Also JIT: allow CSE a hoisting of vtable lookups #46884; in progress via Expand Virtual Call targets earlier in Morph and allow CSE of the indirections #47808)
- peel dominant switch case during switch lowering (JIT: peel off dominant switch case under PGO #52827)
- (stretch) when we lower a switch to compares, order compares using profile data
- consider ignoring profile data for a method when all counts are zero (done) or total number of counts is small (JIT: don't report profile for methods with insufficient samples during prejitting #55096)
- (stretch) globally order functions using PGO during crossgen
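
To make the guarded devirtualization items above more concrete, the following Python sketch shows the general shape of the idea: consult a class profile, pick a dominant class if one is likely enough, and guard a direct call with an exact type test. The function names and the 0.5 likelihood threshold are hypothetical and chosen only for illustration; the jit's actual thresholds and data structures are not described in this issue.

```python
from collections import Counter
from typing import Any, Callable, Optional, Sequence, Tuple


def dominant_class(observed: Sequence[str],
                   min_likelihood: float = 0.5) -> Optional[Tuple[str, float]]:
    """Return (class name, likelihood) for the most common observed class,
    or None if there is no data or no class is likely enough to guess."""
    if not observed:
        return None
    name, hits = Counter(observed).most_common(1)[0]
    likelihood = hits / len(observed)
    return (name, likelihood) if likelihood >= min_likelihood else None


def guarded_call(obj: Any, guessed_type: type,
                 fast_path: Callable[[Any], Any],
                 slow_path: Callable[[Any], Any]) -> Any:
    """Exact type test guarding a direct (inlinable) call, with the
    ordinary virtual dispatch as the fallback."""
    if type(obj) is guessed_type:
        return fast_path(obj)
    return slow_path(obj)


# Class probe data for one call site: 9 of 10 receivers were "Circle".
profile = ["Circle"] * 9 + ["Square"]
guess = dominant_class(profile)  # ("Circle", 0.9)
```

Because the guard is an exact type test, repeated calls on the same receiver produce repeated tests, which is why the cleanup optimizations for redundant class tests listed above matter.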
Instrumentation
- (stretch) handle loops to offset 0 better (JIT: use normalized counts throughout #46883)
- minimum weight spanning tree instrumentation (JIT: efficient profiling schemes #46882, via JIT: refactor instrumentation code #47509, JIT: run instrumentation phase just after importing #47476, JIT: split up some parts of flowgraph.cpp #47072, JIT: let instrumentor decide which blocks to process #47597, JIT: fix issues with profile incorporation phase #47723, JIT: fix interaction of PGO and jitstress #47876 and finally Spanning tree instrumentation #47959)
- omit probes for single-block methods (JIT: update jit config defaults for PGO #49267)
- (stretch) better tier0 codegen for probe sequences
- probes for class types (Initial version of class profiling for PGO #45133)
- (stretch) Fine tuning of class probes (PGO: class profile details we need to get right #48549)
- (stretch) consider devirtualizing in tier0 to reduce class probe overhead (need to revisit this)
- (stretch) Complications with instrumentation and OSR (JIT: resolve issues with OSR and PGO #47942)
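
The spanning tree item above relies on a classic observation: if edges on a spanning tree of the flow graph are left unprobed, their counts can be recovered from the probed edges by flow conservation. The Python sketch below illustrates that general technique on a small diamond-shaped graph. It is not the jit's implementation; the pseudo edge from exit back to entry, the greedy tree choice (edge order stands in for weights), and all names are assumptions for the example.

```python
from typing import Dict, List, Set, Tuple

Edge = Tuple[str, str]


def spanning_tree(edges: List[Edge]) -> Set[Edge]:
    """Greedy spanning tree over the undirected view of the graph.
    Edges listed first are preferred, so list likely-hot edges first."""
    parent: Dict[str, str] = {}

    def find(x: str) -> str:
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    tree: Set[Edge] = set()
    for src, dst in edges:
        a, b = find(src), find(dst)
        if a != b:
            parent[a] = b
            tree.add((src, dst))
    return tree


def reconstruct(edges: List[Edge], probed: Dict[Edge, int]) -> Dict[Edge, int]:
    """Recover unprobed edge counts using 'flow in == flow out' at each node."""
    known = dict(probed)
    nodes = {n for e in edges for n in e}
    progress = True
    while len(known) < len(edges) and progress:
        progress = False
        for node in nodes:
            incident = [e for e in edges if node in e]
            unknown = [e for e in incident if e not in known]
            if len(unknown) != 1:
                continue
            edge = unknown[0]
            inflow = sum(known[e] for e in incident if e in known and e[1] == node)
            outflow = sum(known[e] for e in incident if e in known and e[0] == node)
            # The single unknown edge must balance the node.
            known[edge] = outflow - inflow if edge[1] == node else inflow - outflow
            progress = True
    return known


# Diamond flow graph plus a pseudo edge exit -> entry (one traversal per call).
edges = [("entry", "then"), ("entry", "else"), ("then", "exit"),
         ("else", "exit"), ("exit", "entry")]
tree = spanning_tree(edges)
probed_edges = [e for e in edges if e not in tree]  # only these need probes

true_counts = {("entry", "then"): 30, ("entry", "else"): 70, ("then", "exit"): 30,
               ("else", "exit"): 70, ("exit", "entry"): 100}
recovered = reconstruct(edges, {e: true_counts[e] for e in probed_edges})
```

In this example only two of the five edges carry probes, and the remaining three counts are derived. Placing likely-hot edges on the tree keeps probes off the hot paths, which is the motivation for weighting the tree.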
Sample Based PGO
- Prototype the ability for the jit to consume sampled profile data (Add SPGO support and MIBC comparison in dotnet-pgo #52765)
- Validate accuracy of sampled profiles on JIT benchmarks
- Validate accuracy of sampled profiles on training scenarios
- (stretch) Consider strategies for incorporating stale profile data
- (stretch) Store debug annotations on the side
- Track IL offsets when inlining (Start tracking debug info for inlined statements #61220)
- Improve accuracy of IL offsets in optimized code (Some more precise debug info improvements #61419)
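
The last two items matter because sampled profiles have to be mapped back onto the method's IL before the jit can use them. As a hedged sketch of that attribution step, assume each sample has already been translated from a native address to an IL offset (via debug info), and that basic blocks are identified by their starting IL offsets. Both the data layout and the function name are hypothetical.

```python
import bisect
from typing import Dict, Iterable, List


def attribute_samples(block_starts: List[int],
                      sample_il_offsets: Iterable[int]) -> Dict[int, int]:
    """Count samples per basic block, where a block is keyed by its starting
    IL offset and owns every offset up to the next block's start."""
    starts = sorted(block_starts)
    counts = {start: 0 for start in starts}
    for offset in sample_il_offsets:
        index = bisect.bisect_right(starts, offset) - 1
        if index < 0:
            continue  # offset precedes the first block; discard the sample
        counts[starts[index]] += 1
    return counts


# Three blocks starting at IL offsets 0, 10 and 24.
counts = attribute_samples([0, 10, 24], [0, 3, 10, 11, 12, 30, 25])
```

Any imprecision in the native-to-IL mapping shows up directly as samples credited to the wrong block, so the accuracy of IL offsets in optimized and inlined code bounds the accuracy of the resulting profile.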
Runtime
- enable reading of prejit profile data when jitting (Provide block counts in tiered compilation from R2R images #13672. Support via crossgen2 R2R is in Pgo phase3 #47558; data flowing into assembly prejitting is in Enable the latest managed pgo data #49793)
- (stretch) time and space efficient type probes
- (stretch) other heuristics to drive jit speculation on types
Maintenance
- Add profile data consistency checker (JIT: initial version of a profile checker #42481)
- (stretch) Implement profile reconstruction scheme
- Uncover and fix significant maintenance issues (JIT: make profile data available to inlinees #42277, JIT: some small profile related fixes #43408, JIT: change loop inversion edge weight updates and add phase #48364, PGO considerations for finally cloning #48925, loop cloning and pgo #48850, JIT: profile updates for finally optimizations #49139)
- Consider allowing inlinees to "scale up" their counts if call site count is greater than inlinee entry count (JIT: allow inlinee profile scale-up #48280)
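
Two of the maintenance items above lend themselves to a small illustration: checking that a profile is self-consistent, and scaling an inlinee's counts to match its call site. The Python sketch below assumes a simple dictionary-based flow graph; the function names, the tolerance, and the choice to leave an inlinee with a zero entry count unscaled are assumptions of the example, not a description of the jit's checker.

```python
from typing import Dict, List, Set, Tuple

Edge = Tuple[str, str]


def check_consistency(block_weights: Dict[str, float],
                      edge_weights: Dict[Edge, float],
                      entry: str, exits: Set[str],
                      tolerance: float = 1e-6) -> List[str]:
    """Report blocks whose weight differs from the sum of incoming edge
    weights (except the entry) or outgoing edge weights (except exits)."""
    problems: List[str] = []
    for block, weight in block_weights.items():
        incoming = sum(w for (_s, d), w in edge_weights.items() if d == block)
        outgoing = sum(w for (s, _d), w in edge_weights.items() if s == block)
        if block != entry and abs(incoming - weight) > tolerance:
            problems.append(f"{block}: incoming {incoming} != weight {weight}")
        if block not in exits and abs(outgoing - weight) > tolerance:
            problems.append(f"{block}: outgoing {outgoing} != weight {weight}")
    return problems


def scale_inlinee(inlinee_weights: Dict[str, float], inlinee_entry: str,
                  call_site_weight: float) -> Dict[str, float]:
    """Rescale inlinee block weights so the entry matches the call site.
    The scale may exceed 1.0 ("scale up") when the call site is hotter
    than the inlinee's own entry count suggests."""
    entry_weight = inlinee_weights[inlinee_entry]
    if entry_weight == 0:
        return dict(inlinee_weights)  # no data to scale from
    scale = call_site_weight / entry_weight
    return {block: weight * scale for block, weight in inlinee_weights.items()}


blocks = {"entry": 100.0, "then": 30.0, "else": 70.0, "exit": 100.0}
edges = {("entry", "then"): 30.0, ("entry", "else"): 70.0,
         ("then", "exit"): 30.0, ("else", "exit"): 70.0}
problems = check_consistency(blocks, edges, "entry", {"exit"})

scaled = scale_inlinee({"entry": 50.0, "loop": 200.0}, "entry", 100.0)
```

A checker like this is most useful when run after each phase that modifies the flow graph, so an inconsistency can be traced to the phase that introduced it.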
Debugging and Diagnostics
- Create modes where PGO data is readily available
- Create tools to analyze profile data
- Ensure Dynamic PGO works properly with SPMI (in progress; Fix SPMI to handle replays of BBINSTR jit method contexts #41386, SPMI: make method identity dependent on jit flags and isa flags #48082, SPMI: tolerate null pResolvedToken #48208, SPMI: adjust near differ offset compare logic #48245, Add MCS verb to dump jit flags histogram #48281)
- Graphical dump of flow graph with profile data (JIT: show profile data for dot flowgraph dumps #42657)
- Tooling for verifying PGO is working as expected (as in Fix weight computation in jit #47470)
- (stretch) Methodology for asm diffs with (dynamic) PGO
Testing and CI
- Add suitable testing to inner and outerloop CI (Add pgo testing to outerloop #53301)
- Add SPMI collections that use dynamic PGO (not yet automated)
- Cross-verify that devirtualization implies 100% likely class profiles (JIT: pgo/devirt diagnostic improvements #53247)
Performance
- Look at performance on realistic apps (PGO measurements on TE)
- Look at overhead of instrumentation
Related issues:
- Profiling: Profiling, profile-guided optimization, and deoptimization #7235, RyuJIT: Allow developers to provide Branch Prediction Information #6225, Set profile weights correctly for the Internal blocks in JIT/Directed/UnrollLoop/loop3_il_d.exe #7727, JIT: Support consuming profile guided optimization data for optimization #6522
- Devirtualization: Unclear logic in Compiler::impDevirtualizeCall #43607, JIT: see if guarded devirtualization for EqualityComparer methods pays off #9028, Performance decrease when using interfaces #7291, JIT: devirtualization next steps #7541, [Question] Can Virtual Stub Dispatch be "inlined"? #7198, Compute precise generic context after de-virtualization #38477
- Inlining: JIT: have inlining heuristics look for cases where inlining might enable devirtualization #10303, RyuJIT: Don't inline calls inside rarely used basic-blocks #41923, Inliner: look at block weight allocation in inlinees #6096, Inliner: methodology for code quality measurements #5673, #RyuJIT call optimization and aggressive inlining with known generic types #4489, Ensure inlining in cryptographic functions #4591, RyuJIT: Fasta benchmark: hot method random() is not in-lined by legacy policy into SelectRandom() #7311
Also of note:
- Dynamic PGO should provide one route to enable Guarded Devirtualization
- The analysis of the experiment to develop inlining heuristics via machine learning suggests it would have been more successful if there was profile feedback.
category:planning
theme:planning
skill-level:expert
cost:large