This repository was archived by the owner on Jan 23, 2023. It is now read-only.

Commit 1aca073

Document describing upcoming object stack allocation work.
1 parent ffe1cd6 commit 1aca073

2 files changed

+199
-0
lines changed

@@ -0,0 +1,199 @@
1+
# Object Stack Allocation
2+
3+
This document describes work to enable object stack allocation in .NET Core.
4+
5+
## Motivation
6+
7+
In .NET, instances of reference types are allocated on the garbage-collected heap.
8+
Such allocations have performance overhead at garbage collection time. The allocator also has to ensure that the memory is fully zero-initialized.
9+
If the lifetime of an object is bounded by the lifetime of the allocating method, the allocation
10+
may be moved to the stack. The benefits of this optimization:
11+
12+
* The pressure on the garbage collector is reduced because the GC heap becomes smaller. The garbage collector doesn't have to be involved in allocating or deallocating these objects.
13+
* Object field accesses may become cheaper if the compiler is able to do scalar replacement of the fields of the stack-allocated object
14+
(i.e., if the fields can be promoted).
15+
* Some field zero-initializations may be elided by the compiler.
16+
17+
Object stack allocation is implemented in
18+
various Java runtimes. This optimization is more important for Java since it doesn't have value types.
19+
20+
## GitHub issues
21+
22+
[roslyn #2104](https://github.com/dotnet/roslyn/issues/2104) Compiler should optimize "alloc temporary small object" to "alloc on stack"
23+
24+
[coreclr #1784](https://github.com/dotnet/coreclr/issues/1784) CLR/JIT should optimize "alloc temporary small object" to "alloc on stack" automatically
25+
26+
## Escape Analysis
27+
28+
An object is said to escape a method if it can be accessed after the method's execution has finished.
29+
An object allocation can be moved to the stack safely only if the object doesn't escape the allocating method.
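
The definition above can be made concrete with a small model. The sketch below is not the jit's algorithm; it is a minimal, flow-insensitive intra-procedural analysis over a hypothetical toy instruction format (the tuple shapes and the name `escaping_allocations` are assumptions made for illustration). It treats returns, static stores, field stores, and all call arguments as escape points, which mirrors the most conservative assumptions discussed in this document.

```python
# Hypothetical toy IR, one tuple per instruction:
#   ("new", dst)               dst = new object (the site is named after dst)
#   ("copy", dst, src)         dst = src
#   ("store_static", src)      SomeClass.field = src
#   ("store_field", obj, src)  obj.field = src
#   ("call", [args])           non-inlined call
#   ("return", src)            return src

def escaping_allocations(instructions):
    """Return the set of allocation sites that may escape the method."""
    points_to = {}  # local name -> set of allocation sites it may refer to
    changed = True
    while changed:  # iterate to a fixpoint so copy order does not matter
        changed = False
        for ins in instructions:
            if ins[0] == "new":
                dst, sites = ins[1], {ins[1]}
            elif ins[0] == "copy":
                dst, sites = ins[1], points_to.get(ins[2], set())
            else:
                continue
            current = points_to.setdefault(dst, set())
            if not sites <= current:
                current |= sites
                changed = True

    escaped = set()
    for ins in instructions:
        op = ins[0]
        if op in ("store_static", "return"):
            escaped |= points_to.get(ins[1], set())
        elif op == "store_field":
            # Field targets are not tracked, so the stored value is assumed to escape.
            escaped |= points_to.get(ins[2], set())
        elif op == "call":
            for arg in ins[1]:
                escaped |= points_to.get(arg, set())
    return escaped


method = [
    ("new", "a"),        # never leaves the method
    ("new", "b"),
    ("copy", "t", "b"),
    ("call", ["t"]),     # b escapes through its alias t
    ("new", "c"),
    ("return", "c"),     # c escapes through the return value
]
escaped = escaping_allocations(method)
stack_candidates = {"a", "b", "c"} - escaped
print(sorted(escaped), sorted(stack_candidates))
```

In this model only `a` remains a stack allocation candidate. A more precise analysis would track field edges instead of treating every field store as an escape.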
30+
31+
Several escape analysis algorithms have been implemented in different Java implementations. Of the 3 algorithms listed in [references](#references),
32+
[[1]](http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.73.4799&rep=rep1&type=pdf)
33+
is the most precise and most expensive (it is based on connection graphs) and was used in the context of a static Java compiler;
34+
[[3]](https://pdfs.semanticscholar.org/1b33/dff471644f309392049c2791bca9a7f3b19c.pdf)
35+
is the least precise and cheapest (it doesn't track references through assignments of fields) and was used in MSR's Marmot implementation.
36+
[[2]](https://www.usenix.org/legacy/events/vee05/full_papers/p111-kotzmann.pdf)
37+
is between
38+
[[1]](http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.73.4799&rep=rep1&type=pdf) and
39+
[[3]](https://pdfs.semanticscholar.org/1b33/dff471644f309392049c2791bca9a7f3b19c.pdf)
40+
both in analysis precision and cost. It was used in Java HotSpot.
41+
42+
The effectiveness of object stack allocation depends in large part on whether escape analysis is done inter-procedurally.
43+
With intra-procedural analysis only, the compiler has to assume that arguments escape at all non-inlined call sites,
44+
which blocks many stack allocations. In particular, assuming that the 'this' argument always escapes hurts the optimization.
45+
[[4]](http://www.ssw.uni-linz.ac.at/Research/Papers/Stadler14/Stadler2014-CGO-PEA.pdf) describes an approach that
46+
handles objects that only escape on some paths by promoting them to the heap "just in time" as control reaches those paths.
47+
48+
There are several choices for where escape analysis can be performed:
49+
50+
### Analysis in the jit
51+
**Pros:**
52+
* The jit can analyze a callee's code (subject to some restrictions, e.g., when running under a profiler)
53+
since there are no versioning considerations at jit time.
54+
* The optimization will apply to any msil code regardless of the language compiler or msil post-processing tools.
55+
* The jit already has IR that's suitable for escape analysis.
56+
57+
**Cons:**
58+
* The jit analyzes methods top-down, i.e., callers before callees (when inlining), which doesn't fit well with the stack allocation optimization.
59+
* Full interprocedural analysis is expensive for the jit, even at high tiering levels. Background on-demand/full interprocedural analysis might be feasible
60+
if we have the ability to memoize method properties with (in)validation.
61+
62+
Possible approaches to interprocedural analysis in the jit:
63+
* We can run escape analysis concurrently with inlining and analyze the callee's parameters for escaping while inspecting
64+
inline candidates. The results of such analysis can be cached.
65+
* We can adjust inlining heuristics to give more weight to candidates whose parameters have references to potentially
66+
stack-allocated objects. Inlining such methods may result in additional benefits if the jit can promote fields of the
67+
stack-allocated objects.
68+
* For a higher-tier jit, the order of method processing may be closer to bottom-up, i.e., callees before callers. That may
69+
help with running stack allocation optimization.
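
The first approach, caching per-callee parameter escape results, can be sketched as follows. This is a toy model under stated assumptions: the `METHODS` table, the instruction tuples, and `escaping_params` are hypothetical names, parameters are tracked only by direct use (no copies), and unknown callees are treated as if every argument escapes.

```python
# Hypothetical method table: name -> (parameter names, toy instructions).
METHODS = {
    "Length":  (["s"], [("use", "s")]),
    "Publish": (["x"], [("store_static", "x")]),
    "Forward": (["y"], [("call", "Publish", ["y"])]),
    "Log":     (["m"], [("call", "Unknown", ["m"])]),
}

_summary_cache = {}  # method name -> frozenset of escaping parameter indices


def escaping_params(name):
    """Return indices of parameters that may escape, or None for an unknown callee."""
    if name in _summary_cache:
        return _summary_cache[name]
    if name not in METHODS:
        return None  # no body available: callers must assume everything escapes
    params, body = METHODS[name]
    # Seed the cache with the conservative answer so recursion terminates.
    _summary_cache[name] = frozenset(range(len(params)))
    escaping = set()
    for ins in body:
        if ins[0] == "store_static" and ins[1] in params:
            escaping.add(params.index(ins[1]))
        elif ins[0] == "call":
            callee_summary = escaping_params(ins[1])
            for position, arg in enumerate(ins[2]):
                if arg in params and (callee_summary is None or position in callee_summary):
                    escaping.add(params.index(arg))
    _summary_cache[name] = frozenset(escaping)
    return _summary_cache[name]


for method_name in METHODS:
    print(method_name, sorted(escaping_params(method_name)))
```

With a summary like this, a caller that passes a fresh object to `Length` can keep it on the stack without inlining `Length`, while an object passed to `Publish`, `Forward`, or `Log` has to be treated as escaping.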
70+
71+
### Analysis in ngen/crossgen
72+
**Pros:**
73+
* ngen/crossgen can afford to spend more time on escape analysis.
74+
* The jit already has IR that's suitable for escape analysis.
75+
76+
**Cons:**
77+
* crossgen in Ready-To-Run mode does not run on generic methods that cross assembly boundaries.
78+
* crossgen in Ready-To-Run mode is not allowed to analyze code of methods from other assemblies.
79+
* newobj in Ready-To-Run mode is more abstract and introducing a hard dependence on ref class size and layout may interfere
80+
with version resiliency.
81+
82+
### Analysis in ILLink
83+
**Pros:**
84+
* ILLink can afford to spend more time on escape analysis.
85+
* For self-contained apps, ILLink has access to all of the application's code and can do full interprocedural analysis.
86+
* ILLink is already part of the System.Private.CoreLib and CoreFX build toolchains so the assemblies built there can benefit
87+
from this.
88+
89+
**Cons:**
90+
* The implementation will only benefit customers that have ILLink in their toolchain.
91+
* ILLink operates on a view of metadata and raw msil instructions; it currently doesn't have a call graph or IR representation
92+
suitable for escape analysis.
93+
94+
The results of escape analysis in the linker may be communicated to the jit by injecting an intrinsic call right before or after
95+
newobj for the object that was determined to be non-escaping. Note that assemblies may lose verifiability with this approach.
96+
An alternative is to annotate parameters with escape information so that the annotations can be verified by the jit with
97+
local analysis.
98+
99+
If the methods whose info was used for interprocedural escape analysis are allowed to change after the analysis, the jit either needs
100+
to inline those methods or there should be a mechanism to immediately revoke methods with stack allocated objects that relied on
101+
that analysis.
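
One way to picture the revocation mechanism is a dependency table from each analyzed callee to the compiled methods that relied on its escape information. The sketch below is a hypothetical bookkeeping model, not an existing runtime facility; `EscapeDependencyTracker` and its methods are names invented for illustration.

```python
from collections import defaultdict


class EscapeDependencyTracker:
    """Tracks which compiled methods relied on escape info from which callees."""

    def __init__(self):
        self._dependents = defaultdict(set)  # callee -> methods that used its info
        self.valid = set()                   # methods whose compiled code may still run

    def record_compilation(self, method, callees_consulted):
        # callees_consulted should list every method whose body influenced the
        # stack allocation decisions, including those consulted transitively.
        self.valid.add(method)
        for callee in callees_consulted:
            self._dependents[callee].add(method)

    def on_method_changed(self, callee):
        """Revoke and return every method that relied on the changed callee."""
        revoked = self._dependents.pop(callee, set())
        self.valid -= revoked
        return revoked


tracker = EscapeDependencyTracker()
tracker.record_compilation("Caller", {"Helper"})
tracker.record_compilation("Other", set())
revoked = tracker.on_method_changed("Helper")
print(sorted(revoked), sorted(tracker.valid))
```

Inlining the consulted callees avoids this bookkeeping, because the caller's compiled code then no longer depends on a separately replaceable method body.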
102+
103+
## Other restrictions on stack allocations
104+
105+
* Objects with finalizers can't be stack-allocated since they always escape to the finalizer queue.
106+
* Objects allocated in a loop can be stack allocated only if the allocation doesn't escape the iteration of the loop in which it is
107+
allocated. Such analysis is complicated and is beyond the scope of at least the initial implementation.
108+
* Conditional object allocations (i.e., allocations that don't dominate the exit) need to be restricted to avoid growing the stack
109+
unnecessarily. A possible approach is turning such allocations into dynamic stack allocations.
110+
* There should be a limit on the maximum size of stack allocated objects.
111+
* There may be restrictions for objects with weak GC fields (this needs to be investigated).
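
The restrictions above can be read as a single eligibility check per allocation site. The following is a sketch under stated assumptions: `AllocationSite`, `stack_allocation_blocker`, and the 64-byte `MAX_STACK_OBJECT_SIZE` are hypothetical (the document does not specify a size limit), and weak GC fields are left out because that restriction still needs investigation.

```python
from dataclasses import dataclass

MAX_STACK_OBJECT_SIZE = 64  # hypothetical limit in bytes, chosen only for illustration


@dataclass
class AllocationSite:
    size_in_bytes: int
    has_finalizer: bool
    escapes: bool          # result of escape analysis
    in_loop: bool
    dominates_exit: bool   # False for conditional allocations


def stack_allocation_blocker(site):
    """Return the reason the site stays on the heap, or None if it qualifies."""
    if site.escapes:
        return "escapes the allocating method"
    if site.has_finalizer:
        return "has a finalizer"
    if site.in_loop:
        return "allocated in a loop"
    if not site.dominates_exit:
        return "conditional allocation"
    if site.size_in_bytes > MAX_STACK_OBJECT_SIZE:
        return "exceeds the size limit"
    return None


small = AllocationSite(24, False, False, False, True)
finalizable = AllocationSite(24, True, False, False, True)
print(stack_allocation_blocker(small), stack_allocation_blocker(finalizable))
```

Rejecting every allocation in a loop is the conservative initial choice described above; a later implementation could relax it for allocations that do not outlive their loop iteration.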
112+
113+
## GC considerations
114+
115+
The jit is responsible for reporting references to heap-allocated objects to the GC. With stack-allocated objects present in a method,
116+
a reference at a particular GC-safe point may be in one of 3 states:
117+
* The reference always points to a heap object: the reference should be reported as TYPE_GC_REF.
118+
* The reference always points to a stack object: the reference should not be reported to the GC.
119+
* The reference may point to a heap object or to a stack object depending on control flow: the reference should be reported as TYPE_GC_BYREF.
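
The three states can be modeled as a small lattice in which control-flow merges take the union of the possible targets. The sketch below is illustrative only; `gc_report_kind` and `merge` are hypothetical helper names, and only the `TYPE_GC_REF` and `TYPE_GC_BYREF` strings come from the text above.

```python
HEAP, STACK = "heap", "stack"


def merge(*incoming):
    """Combine the possible targets of a reference at a control-flow join."""
    return set().union(*incoming)


def gc_report_kind(may_point_to):
    """Map the possible targets of a reference to how it is reported to the GC."""
    if may_point_to == {HEAP}:
        return "TYPE_GC_REF"
    if may_point_to == {STACK}:
        return None  # not reported
    if may_point_to == {HEAP, STACK}:
        return "TYPE_GC_BYREF"
    raise ValueError("reference has no known target")


# A reference assigned a heap object on one path and a stack object on another.
at_join = merge({HEAP}, {STACK})
print(gc_report_kind({HEAP}), gc_report_kind({STACK}), gc_report_kind(at_join))
```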
120+
121+
All GC fields of stack-allocated objects have to be reported to the GC, the same as for fields of stack-allocated value classes.
122+
123+
When a field of an object is modified, the jit may need to issue a write barrier:
124+
* The reference always points to a heap object: normal write barrier should be used
125+
* The reference always points to a stack object: no write barrier is needed
126+
* The reference may point to a heap object or to a stack object depending on control flow: checked write barrier should be used
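
To show why the checked barrier is safe for either kind of target, here is a simplified model in which the checked variant first tests whether the destination address lies inside the GC heap and does nothing otherwise. The address range, `CARD_SHIFT`, and all function names are assumptions for illustration; real write barriers do additional filtering that is omitted here.

```python
HEAP_START, HEAP_END = 0x10000, 0x20000  # hypothetical GC heap address range
CARD_SHIFT = 8                           # hypothetical card size of 256 bytes

dirty_cards = set()


def write_barrier(dst_addr):
    """Normal barrier: assumes dst_addr is in the GC heap and marks its card."""
    dirty_cards.add(dst_addr >> CARD_SHIFT)


def checked_write_barrier(dst_addr):
    """Checked barrier: only does barrier work for destinations in the GC heap."""
    if HEAP_START <= dst_addr < HEAP_END:
        write_barrier(dst_addr)


def barrier_for(may_point_to):
    """Pick the barrier for a store through a reference with the given targets."""
    if may_point_to == {"heap"}:
        return write_barrier
    if may_point_to == {"stack"}:
        return None
    return checked_write_barrier


checked_write_barrier(0x500)     # stack address: no card is marked
checked_write_barrier(0x10100)   # heap address: card 0x101 is marked
print(sorted(dirty_cards))
```

The extra range test is the cost of the checked barrier, which is why using it for every store in a method, as the prototype described below does, is a simplification rather than a goal.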
127+
128+
## Existing Prototypes
129+
130+
@echesakovMSFT implemented a [prototype](https://github.com/echesakovMSFT/coreclr/tree/StackAllocation) in 2016.
131+
132+
The goal of the prototype was to have the optimization working end-to-end with a number of simplifications:
133+
* A simple intra-procedural escape analysis based on
134+
[[1]](http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.73.4799&rep=rep1&type=pdf)
135+
but without field edges in the connection graph.
136+
* All call arguments are assumed to be escaping.
137+
* Only simple objects are stack allocated; arrays of constant size are not analyzed.
138+
* Only objects that are allocated unconditionally in the method are moved to the stack. An improvement here would
139+
be allocating other objects dynamically on the stack.
140+
* If at least one object in a method is stack allocated, all objects are conservatively reported as TYPE_GC_BYREF
141+
and a checked write barrier is used in the method.
142+
* All objects allocated on the stack also have a pre-header allocated. Pre-header is used for synchronization
143+
and hashing, so we could eliminate it if we proved the object wasn't used for synchronization or hashing.
144+
145+
We ran the prototype on System.Private.CoreLib via crossgen in 2016 and
146+
objects at 21 allocation sites were moved to the stack.
147+
148+
We also ran an experiment where we changed the algorithm to optimistically assume that no arguments escape.
149+
The goal was to get an upper bound on the number of potential stack allocations.
150+
151+
* 424 methods out of 6150 had allocations moved to the stack.
152+
* 586 allocation sites out of 7977 were moved to the stack.
153+
* One finding was that most sites were exception object allocations (5345), which almost never happen dynamically.
154+
* Excluding exception allocations we had 586 allocation sites out of 2632 that could be moved to the stack.
155+
So the upper bound from this experiment is about 22.3% (586 of 2632).
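
The figures in this list can be rechecked with a few lines of arithmetic. The counts below are copied from the bullets; the variable names and the `pct` helper are only illustrative. As in the text, the calculation assumes that none of the 586 moved sites are exception allocations.

```python
total_sites = 7977
exception_sites = 5345
moved_sites = 586
total_methods = 6150
methods_with_moves = 424


def pct(part, whole):
    """Percentage of `whole` represented by `part`, rounded to one decimal."""
    return round(100.0 * part / whole, 1)


non_exception_sites = total_sites - exception_sites  # 2632, as stated above

print("non-exception sites:", non_exception_sites)
print("moved, of all sites (%):", pct(moved_sites, total_sites))
print("moved, of non-exception sites (%):", pct(moved_sites, non_exception_sites))
print("methods with moved allocations (%):", pct(methods_with_moves, total_methods))
```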
156+
157+
@AndyAyersMS recently resurrected @echesakovMSFT's work and used it to [prototype stack allocation of a simple delegate that's
158+
directly invoked](https://github.com/dotnet/coreclr/compare/master...AndyAyersMS:NonNullPlusStackAlloc). It exposed a number of things that need to be
159+
done in the jit to generate better code for stack-allocated objects. The details are in comments of
160+
[coreclr #1784](https://github.com/dotnet/coreclr/issues/1784).
161+
162+
We did some analysis of Roslyn csc self-build to see where this optimization may be beneficial. One hot place was found in [GreenNode.WriteTo](https://github.com/dotnet/roslyn/blob/fab7134296816fc80019c60b0f5bef7400cf23ea/src/Compilers/Core/Portable/Syntax/GreenNode.cs#L647).
163+
This object allocation accounts for 8.17% of all object allocations in this scenario. The number is not as impressive as a percentage
164+
of all allocated bytes: 0.67% (6.24 MB out of 920.1 MB) but it's just a single static allocation.
165+
Below is the portion of the call graph the escape analysis will have to consider when proving this allocation is not escaping.
166+
Green arrows correspond to the call sites that are inlined and red arrows correspond to the call sites that are not inlined.
167+
168+
![Call Graph](../images/GreenNode_WriteTo_CallGraph.png)
169+
170+
## Roadmap
171+
172+
We will implement the optimization in the jit first and get as many cases as possible that don't require deep interprocedural analysis,
173+
e.g., the delegate cases from the prototype mentioned above. We will also try to take advantage of the more suitable order of method
174+
processing in higher-tier jit.
175+
176+
The jit work includes removing the restrictions in @echesakovMSFT's prototype, making escape analysis more sophisticated, making
177+
changes for producing better code for stack-allocated objects (some of which @AndyAyersMS discovered while working on his prototype),
178+
and updating inlining heuristics to help with object stack allocation.
179+
180+
To get the maximum benefit from the optimization we will likely have to augment the jit analysis with more information. The information
181+
may come from manual annotations or from a tool analysis. ILLink or the upcoming [CPAOT](https://github.com/dotnet/corert/tree/r2r)
182+
may be appropriate places for ahead-of-time escape analysis. Self-contained applications will benefit the most from ILLink analysis
183+
but framework assemblies can also be analyzed and annotated even though cross-assembly calls will have to be processed conservatively.
184+
185+
In this context an algorithm similar to [[1]](http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.73.4799&rep=rep1&type=pdf) can be used
186+
to get the most accurate escape results.
187+
188+
The cost of adding some infrastructure to ILLink (call graph, proper IR, etc.) will be amortized if we do other IL-to-IL optimizations in the future.
189+
Also, we may be able to reuse the infrastructure from other projects, e.g., [ILSpy](https://github.com/icsharpcode/ILSpy/blob/da2f0d0b9143fb082a5529f78267fa36e8bf16f9/ICSharpCode.Decompiler/IL/ILReader.cs).
190+
191+
## References
192+
193+
[[1] Jong-Deok Choi et al. Stack Allocation and Synchronization Optimizations for Java Using Escape Analysis.](http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.73.4799&rep=rep1&type=pdf)
194+
195+
[[2] Thomas Kotzmann and Hanspeter Mössenböck. Escape Analysis in the Context of Dynamic Compilation and Deoptimization.](https://www.usenix.org/legacy/events/vee05/full_papers/p111-kotzmann.pdf)
196+
197+
[[3] David Gay and Bjarne Steensgaard. Fast Escape Analysis and Stack Allocation for Object-Based Programs.](https://pdfs.semanticscholar.org/1b33/dff471644f309392049c2791bca9a7f3b19c.pdf)
198+
199+
[[4] Lukas Stadler et al. Partial Escape Analysis and Scalar Replacement for Java.](http://www.ssw.uni-linz.ac.at/Research/Papers/Stadler14/Stadler2014-CGO-PEA.pdf)