Severe SIMD degradation in Dart VM 2.15.0-178.1.beta #47488

Closed
gmpassos opened this issue Oct 18, 2021 · 9 comments
Labels: area-vm, vm-regression

Comments

@gmpassos (Contributor) commented Oct 18, 2021

I have detected a severe degradation with SIMD operations in Dart 2.15.x (beta).

The performance degradation is 10-20x, depending on the case.

I have isolated the issue in this code:
https://github.com/gmpassos/simd_benchmark/blob/main/bin/simd_benchmark_severe.dart

I haven't yet identified why this is happening!

Benchmark project (README with benchmark outputs):
https://github.com/gmpassos/simd_benchmark

Benchmark summary (a sketch of how these figures relate follows the list):

  • Device 1 (macOS x64 - Intel):
    macOS Big Sur 11.4
    MacBook Pro (13-inch, 2019, Two Thunderbolt 3 ports)

    • macOS: Dart version: 2.14.4 (stable) (Wed Oct 13 11:11:32 2021 +0200) on "macos_x64":
      elapsedTime: 529 ms, hertz: 189,035,916 Hz

    • macOS: Dart version: 2.15.0-178.1.beta (beta) (Tue Oct 12 11:11:28 2021 +0200) on "macos_x64":
      elapsedTime: 14209 ms, hertz: 7,037,792 Hz

  • Device 2 (macOS ARM64 - M1 - Apple Silicon):
    macOS Big Sur 11.6
    MacBook Pro (13-inch, M1, 2020)

    • macOS: Dart version: 2.14.4 (stable) (Wed Oct 13 11:11:32 2021 +0200) on "macos_arm64":
      elapsedTime: 594 ms, hertz: 168,350,168 Hz

    • macOS: Dart version: 2.15.0-178.1.beta (beta) (Tue Oct 12 11:11:28 2021 +0200) on "macos_arm64":
      elapsedTime: 11056 ms, hertz: 9,044,862 Hz
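
As a rough, hypothetical sketch of how elapsedTime and hertz relate in figures like these (the real harness lives in the linked repository; the loop body and the iteration count below are illustrative only):

  import 'dart:typed_data';

  void main() {
    // Hypothetical iteration count: the reported figures are consistent with
    // roughly 1e8 counted operations (e.g. 189,035,916 Hz * 0.529 s ~= 1e8).
    const loops = 100 * 1000 * 1000;
    var acc = Float32x4.splat(1.0);
    final sw = Stopwatch()..start();
    for (var i = 0; i < loops; ++i) {
      acc = acc * Float32x4.splat(1.0000001);
    }
    sw.stop();
    // hertz = counted operations / elapsed time in seconds.
    final hertz = loops / (sw.elapsedMilliseconds / 1000);
    print('elapsedTime: ${sw.elapsedMilliseconds} ms, '
        'hertz: ${hertz.toStringAsFixed(0)} Hz ($acc)');
  }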

@mraleph (Member) commented Oct 18, 2021

The benchmark regression is caused by 5f78a23, which inhibits code motion of static field loads (for already initialized static fields) for the sake of reusability of JIT-generated code between isolates within an isolate group.

FWIW, AOT has likely always had the same bad performance. There might be a few strategies for addressing this in the compiler (eager initialization of fields like this, loop peeling, hoisting always-executed initializations out of the loop, etc.).

Note, however, that if the benchmark is attempting to measure sigmoidFloat{32x4,64x2} performance, then this is not the right way to do it, and the previously reported numbers were incorrect.

  for (var i = 0; i < loops; ++i) {
    result = result * sigmoidFloat32x4(in1F32);
    result = result * sigmoidFloat32x4(in2F32);
    result = result * sigmoidFloat32x4(in3F32);
    result = result * sigmoidFloat32x4(in4F32);
    result = result + sigmoidFloat32x4(in5F32);
  }

Observe here that all sigmoidFloat32x4(X) invocations are essentially loop-invariant and were previously hoisted out of the loop, turning it into:

  final s1 = sigmoidFloat32x4(in1F32);
  final s2 = sigmoidFloat32x4(in2F32);
  final s3 = sigmoidFloat32x4(in3F32);
  final s4 = sigmoidFloat32x4(in4F32);
  final s5 = sigmoidFloat32x4(in5F32);
  for (var i = 0; i < loops; ++i) {
    result = result * s1;
    result = result * s2;
    result = result * s3;
    result = result * s4;
    result = result + s5;
  }

The referenced change made the in{1,2,3,4,5}F32 loads non-invariant and consequently inhibited the hoisting, meaning that now you are actually measuring the cost of computing the sigmoid function.

/cc @mkustermann
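
As a hedged, illustrative sketch of that interaction (the field names mirror the benchmark, but the bodies below are stand-ins, not the actual benchmark code): for code where the static reads are incidental rather than the thing being measured, copying the static fields into locals before the loop avoids the per-iteration initialization check and should leave the calls loop-invariant again.

  import 'dart:typed_data';

  // Stand-ins for the benchmark's top-level inputs; like any Dart top-level
  // or static variable, they are lazily initialized.
  final Float32x4 in1F32 = Float32x4(0.1, 0.2, 0.3, 0.4);
  final Float32x4 in2F32 = Float32x4(0.5, 0.6, 0.7, 0.8);

  // Placeholder sigmoid approximation (x / (1 + |x|), rescaled to (0, 1));
  // the benchmark's real implementation may differ.
  Float32x4 sigmoidFloat32x4(Float32x4 x) =>
      (x / (Float32x4.splat(1.0) + x.abs())).scale(0.5) + Float32x4.splat(0.5);

  Float32x4 hotLoop(int loops) {
    // Reading the static fields once, outside the loop, removes the repeated
    // "is it initialized?" check and keeps the sigmoid calls loop-invariant,
    // so they can be hoisted (by the compiler or by hand).
    final a = in1F32;
    final b = in2F32;
    var result = Float32x4.splat(1.0);
    for (var i = 0; i < loops; ++i) {
      result = result * sigmoidFloat32x4(a);
      result = result + sigmoidFloat32x4(b);
    }
    return result;
  }

Whether the optimizer actually re-hoists the calls depends on inlining decisions, so treat this as a sketch of the idea rather than a guaranteed fix.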

@gmpassos (Contributor, Author) commented:

Thanks for the answer.

This benchmark was created only to demonstrate the reported degradation.

I actually found the performance issue in a much bigger project.

@gmpassos (Contributor, Author) commented:

About AOT: in Dart 2.14, a compiled executable already shows the same degradation:

'dart compile exe bin/simd_benchmark_severe.dart'

@gmpassos (Contributor, Author) commented Oct 18, 2021

Does it make sense to force IsFieldInitialized to always return false when FLAG_enable_isolate_groups is true? For example:

if (FLAG_enable_isolate_groups) return false;

The CSE optimization is suppressed at this line:

if (load->AllowsCSE() && !load->IsFieldInitialized()) {

@mraleph (Member) commented Oct 18, 2021

The comment above line 1087 explains why it has to return false if isolate groups are enabled.

@gmpassos (Contributor, Author) commented Oct 18, 2021

"The comment above line 1087 explains why it has to return false if isolate groups are enabled."

I understand the purpose, but IsFieldInitialized has a broader meaning.

I'm not an SDK code specialist, and I've only been looking into this for 15 minutes, but:

The JIT allows fields to be reset:

FLAG_fields_may_be_reset = true;

But IsFieldInitialized is not doing its job here:

if (calls_initializer() && IsFieldInitialized()) {

And the JIT will always get out of this method here:

if (FLAG_fields_may_be_reset) {

@mraleph (Member) commented Oct 18, 2021

IsFieldInitialized is supposed to return true when it is safe to assume the field is initialized and can be accessed directly without calling the initializer (and thus without guarding against the possibility of the initializer throwing).

If fields can be reset (a flag that is only ever set when generating AOT or AppJIT snapshots), then the compiler can't make assumptions about the state of the fields based on their values during compilation, because we are going to clear them when generating the snapshot.

If isolate groups are enabled, then the compiler can't assume a field is initialised based on its state in one isolate, because every other isolate starts from scratch with all fields cleared.

As I have said, there are some optimizations we could implement to address this, but in general it is good to remember that Dart's static fields are specified to be lazily initialised, and thus accessing them directly from hot loops might come with a penalty.
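
A small illustration of that lazy-initialization guarantee (hypothetical names; just a sketch of the language semantics the compiler has to preserve): the initializer of a top-level or static variable does not run until the first read, which is why every read conceptually carries an "already initialized?" check unless the compiler can prove it away.

  int _initCount = 0;

  int _expensiveInit() {
    _initCount++;
    return 42;
  }

  // Top-level (and static class) variables are lazily initialized: the
  // initializer expression runs on the first read, not at program start.
  final int lazyField = _expensiveInit();

  void main() {
    print(_initCount); // 0 -- the initializer has not run yet.
    print(lazyField);  // 42 -- the first read triggers initialization.
    print(_initCount); // 1 -- it ran exactly once.
  }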

@gmpassos (Contributor, Author) commented Oct 18, 2021

Nice explanation.

One question: is there only one level of JIT in the VM?

Maybe a multilevel JIT could allow extra optimizations after the Isolate is initialized.

Looking at the SDK code, it was prepared (before Dart 2.15) to generate JIT code at any time. But now, at some points, JIT code is generated the way AOT snapshots are. Since AOT cannot generate new code at runtime, it makes sense to behave like that there. But one important benefit of the VM is highly optimized JIT code and live adaptation of it.

Just an initial idea:

Let's say that JIT code has "levels". Each optimization of previously generated code increases the level. Level 0 code allows Isolate bootstrap and AOT.

If you can track whether code was generated at level 0 (allowing Isolate bootstrap) or at level 1+, you can separate what can be shared between isolates from what should depend on the "current" Isolate's state.

So, when generating level 0 code, you disable everything that is "bad" for AOT and Isolate bootstrap. For level 1+ you can enable extra optimizations, static field initialization detection, and full CSE.

Note that sharing JIT code is very important for lightweight Isolates, allowing fresh Isolates to run already-compiled code without recompiling it.

From my point of view, the real issue is that the VM has no way to distinguish JIT code that can "boot" an Isolate from JIT code that is state-dependent.

A complementary idea:

Maybe when generating the JIT code tree, it should always generate a state-dependent node and also a parallel level 0 node (one that can boot an Isolate). Then, when sharing JIT code between Isolates, you walk the tree looking only for level 0 nodes; when using the code locally, you allow level 1+ nodes as well.

Of course, this will duplicate compiled code in memory if the tree needs two versions, but it can be treated as volatile memory, since it can be discarded if needed and recomputed when necessary.

@mraleph (Member) commented Oct 18, 2021

Our JIT is a fairly classical adaptively optimizing speculative JIT with two tiers (unoptimized compilation and optimised compilation).

I am not entirely sure I understand what you describe, but I think it assumes that we can split JIT-generated code by isolate. This could theoretically be done, but it goes against the whole concept of lightweight isolates, in which we share the program structure between isolates (e.g. all isolates within the same group have the same Class, Function, etc. objects). This directly implies sharing of the JIT-generated code (because it is just a Code object hanging off the Function object). Splitting this by isolate complicates things a lot.

Frankly speaking, the JIT has not been a focus for us for a while. We are keeping it running, but we are not as occupied with its peak performance as we used to be, because most Dart applications are deployed in AOT mode. I think if we address this problem, we will do it in a way that works for AOT as well.

@a-siva added the area-vm and vm-regression labels Oct 18, 2021