-
Notifications
You must be signed in to change notification settings - Fork 1.7k
Severe SIMD degradation in Dart VM 2.15.0-178.1.beta #47488
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Comments
Benchmark regression is caused by 5f78a23 which inhibits code motion of static field loads (for already initialized static fields) for the sake of reusability of JIT generated code between isolates within isolate group. FWIW AOT always likely had the same bad performance. There might be few strategies for addressing this in the compiler (eager initialization of fields like this, loop peeling, hoisting always executed initializations out of the loop, etc) Note however that if the benchmark is attempting to measure for (var i = 0; i < loops; ++i) {
result = result * sigmoidFloat32x4(in1F32);
result = result * sigmoidFloat32x4(in2F32);
result = result * sigmoidFloat32x4(in3F32);
result = result * sigmoidFloat32x4(in4F32);
result = result + sigmoidFloat32x4(in5F32);
} observe here that all final s1 = sigmoidFloat32x4(in1F32);
final s2 = sigmoidFloat32x4(in2F32);
final s3 = sigmoidFloat32x4(in3F32);
final s4 = sigmoidFloat32x4(in4F32);
final s5 = sigmoidFloat32x4(in5F32);
for (var i = 0; i < loops; ++i) {
result = result * s1;
result = result * s2;
result = result * s3;
result = result * s4;
result = result + s5;
} The referenced change made /cc @mkustermann |
Thanks for the answer. This benchmark was created only to show this reported degradation. Actually I have found the performance issue in a much bigger project. |
About AOT: In dart 2.14, a compiled script shows the same degradation: 'dart compile exe bin/simd_benchmark_severe.dart' |
Does it make sense to force sdk/runtime/vm/compiler/backend/il.cc Line 1087 in e995cb5
The CSE optimization was supressed at this line:
|
The comment above line 1087 explains why it has to |
I understand the purpose, but I'm not an SDK code specialist and I'm investigating that only for 15min: JIT is allowing the reset of fields: sdk/runtime/vm/compiler/jit/compiler.cc Line 97 in fbcacd7
But sdk/runtime/vm/compiler/backend/il.cc Line 1113 in e995cb5
And JIT will always get out f this method: sdk/runtime/vm/compiler/backend/il.cc Line 1081 in e995cb5
|
If fields can be reset (which is a flag that is only ever set when generating AOT or AppJIT snapshots) then compiler can't make assumptions about the state of the fields based on their values during compilation because we are going to clear them when generating snapshot. If isolate groups are enabled then compiler can't make assumptions about fields being initialised based on the state of the field in one isolate because each other isolate starts from scratch with all fields cleared. As I have said there are some optimizations we could implement to address this, but in general it is good to remember that Dart's static fields are specified to be lazily initialised and thus accessing them from hot loops directly might come with a penalty. |
Nice explanation. One question: There's only one level of JIT in the VM? Maybe a multilevel JIT can allow extra JIT optimizations after the Isolate is initialized. Looking at the SDK code, it's prepared to generate JIT at any time (before Dart 2.15). But now, for some points, the JIT is generated like AOT snapshots. Since AOT is not able to generate JIT, it makes sense to behave like that. But one important benefit of the VM is high optimization for JIT and live adaptation of it. Just an initial idea:Let's say that JITs have "levels". Each optimization of a previous JIT increases the level. A JIT of level 0 allows Isolate bootstrap and AOT. If you can trace if a JIT code was generated for level 0 (allowing Isolate bootstrap), and another JIT was generated at level 1+, you can separate what can be shared between isolates and what should be dependent on the "current" Isolate state. So, when generating JIT for level 0, you disable everything that is "bad" for AOT and Isolate bootstrap. For level 1+ yo can enable extra optimization, static field initialization detection, and full CSE. Note that JIT sharing is very important for lightweight Isolates, allowing fresh Isolates to run with JIT without the need to recompute it. From my point of view, the real issue is the lack of the VM ability to define what is a JIT that can "boot" an Isolate and what is a JIT that is state-dependent. A complementary idea:Maybe when generating a JIT tree, it should always generate a state-dependent node and also a parallel node that is a level 0 node (that can boot an Isolate). So, when sharing a JIT between Isolates, you walk the tree looking only for level 0 nodes. When using a JIT locally, you allow level 1+ nodes of the tree. Of course that this will duplicate compiled code in memory if the JIT tree needs 2 versions. But this can be treated as a volatile memory since it can be discarded if needed and recomputed when necessary. |
Our JIT is a fairly classical adaptively optimizing speculative JIT with two tiers (unoptimized compilation and optimised compilation). I am not entirely sure I understand what you describe but I think it assumes that it can split JIT generated code by isolates - this could be theoretically done, but it goes against the whole concept of lightweight in which we share the program structure between isolates (e.g. all isolates within the same group have the same Frankly speaking, JIT has not been in focus for a while for us. We are kinda keeping it running - but we are not that much occupied by its peak performance as we used to because most Dart applications are deployed in AOT mode. I think if we are going to address this problem we are going to do it in a way that works for AOT as well. |
I have detected a severe degradation with SIMD operations in Dart 2.15.x (beta).
The performance degradation is from 10-20x depending on the case.
I have isolated the issue in this code:
https://github.com/gmpassos/simd_benchmark/blob/main/bin/simd_benchmark_severe.dart
I haven't identified yet why this is happening!
Benchmark project (README with benchmark outputs):
https://github.com/gmpassos/simd_benchmark
Benchmark resume:
Device 1 (macOS x64 - Intel):
macOS Big Sur 11.4
MacBook Pro (13-inch, 2019, Two Thunderbolt 3 ports)
macOS: Dart version: 2.14.4 (stable) (Wed Oct 13 11:11:32 2021 +0200) on "macos_x64":
elapsedTime: 529 ms, hertz: 189,035,916 Hz
macOS: Dart version: 2.15.0-178.1.beta (beta) (Tue Oct 12 11:11:28 2021 +0200) on "macos_x64":
elapsedTime: 14209 ms, hertz: 7,037,792 Hz
Device 2 (macOS ARM64 - M1 - Apple Silicon):
macOS Big Sur 11.6
MacBook Pro (13-inch, M1, 2020)
macOS: Dart version: 2.14.4 (stable) (Wed Oct 13 11:11:32 2021 +0200) on "macos_arm64":
elapsedTime: 594 ms, hertz: 168,350,168 Hz
macOS: Dart version: 2.15.0-178.1.beta (beta) (Tue Oct 12 11:11:28 2021 +0200) on "macos_arm64":
elapsedTime: 11056 ms, hertz: 9,044,862 Hz
The text was updated successfully, but these errors were encountered: