Severe SIMD degradation in Dart VM 2.15.0-178.1.beta #47488

Closed
gmpassos opened this issue Oct 18, 2021 · 9 comments
Labels: area-vm, vm-regression

Comments

@gmpassos (Contributor) commented Oct 18, 2021

I have detected a severe degradation with SIMD operations in Dart 2.15.x (beta).

The performance degradation is 10-20x, depending on the case.

I have isolated the issue in this code:
https://github.com/gmpassos/simd_benchmark/blob/main/bin/simd_benchmark_severe.dart

I haven't yet identified why this is happening!

Benchmark project (README with benchmark outputs):
https://github.com/gmpassos/simd_benchmark

Benchmark summary (a sketch of how these figures relate follows the list):

  • Device 1 (macOS x64 - Intel):
    macOS Big Sur 11.4
    MacBook Pro (13-inch, 2019, Two Thunderbolt 3 ports)

    • macOS: Dart version: 2.14.4 (stable) (Wed Oct 13 11:11:32 2021 +0200) on "macos_x64":
      elapsedTime: 529 ms, hertz: 189,035,916 Hz

    • macOS: Dart version: 2.15.0-178.1.beta (beta) (Tue Oct 12 11:11:28 2021 +0200) on "macos_x64":
      elapsedTime: 14209 ms, hertz: 7,037,792 Hz

  • Device 2 (macOS ARM64 - M1 - Apple Silicon):
    macOS Big Sur 11.6
    MacBook Pro (13-inch, M1, 2020)

    • macOS: Dart version: 2.14.4 (stable) (Wed Oct 13 11:11:32 2021 +0200) on "macos_arm64":
      elapsedTime: 594 ms, hertz: 168,350,168 Hz

    • macOS: Dart version: 2.15.0-178.1.beta (beta) (Tue Oct 12 11:11:28 2021 +0200) on "macos_arm64":
      elapsedTime: 11056 ms, hertz: 9,044,862 Hz
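
As a rough, hypothetical sketch of how elapsedTime and hertz relate in figures like these (the real harness lives in the linked repository; the loop body and the iteration count below are illustrative only):

  import 'dart:typed_data';

  void main() {
    // Hypothetical iteration count: the reported figures are consistent with
    // roughly 1e8 counted operations (e.g. 189,035,916 Hz * 0.529 s ~= 1e8).
    const loops = 100 * 1000 * 1000;
    var acc = Float32x4.splat(1.0);
    final sw = Stopwatch()..start();
    for (var i = 0; i < loops; ++i) {
      acc = acc * Float32x4.splat(1.0000001);
    }
    sw.stop();
    // hertz = counted operations / elapsed time in seconds.
    final hertz = loops / (sw.elapsedMilliseconds / 1000);
    print('elapsedTime: ${sw.elapsedMilliseconds} ms, '
        'hertz: ${hertz.toStringAsFixed(0)} Hz ($acc)');
  }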

@mraleph (Member) commented Oct 18, 2021

The benchmark regression is caused by 5f78a23, which inhibits code motion of static field loads (for already initialized static fields) for the sake of reusability of JIT-generated code between isolates within an isolate group.

FWIW, AOT has likely always had the same bad performance. There might be a few strategies for addressing this in the compiler (eager initialization of fields like this, loop peeling, hoisting always-executed initializations out of the loop, etc.).

Note, however, that if the benchmark is attempting to measure sigmoidFloat{32x4,64x2} performance, then this is not the right way to do it, and the previously reported numbers were incorrect.

  for (var i = 0; i < loops; ++i) {
    result = result * sigmoidFloat32x4(in1F32);
    result = result * sigmoidFloat32x4(in2F32);
    result = result * sigmoidFloat32x4(in3F32);
    result = result * sigmoidFloat32x4(in4F32);
    result = result + sigmoidFloat32x4(in5F32);
  }

Observe here that all sigmoidFloat32x4(X) invocations are essentially loop-invariant and were previously hoisted out of the loop, turning it into:

  final s1 = sigmoidFloat32x4(in1F32);
  final s2 = sigmoidFloat32x4(in2F32);
  final s3 = sigmoidFloat32x4(in3F32);
  final s4 = sigmoidFloat32x4(in4F32);
  final s5 = sigmoidFloat32x4(in5F32);
  for (var i = 0; i < loops; ++i) {
    result = result * s1;
    result = result * s2;
    result = result * s3;
    result = result * s4;
    result = result + s5;
  }

The referenced change made the in{1,2,3,4,5}F32 loads non-invariant and consequently inhibited the hoisting, meaning that now you are actually measuring the cost of computing the sigmoid function.

/cc @mkustermann
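
As a hedged, illustrative sketch of that interaction (the field names mirror the benchmark, but the bodies below are stand-ins, not the actual benchmark code): for code where the static reads are incidental rather than the thing being measured, copying the static fields into locals before the loop avoids the per-iteration initialization check and should leave the calls loop-invariant again.

  import 'dart:typed_data';

  // Stand-ins for the benchmark's top-level inputs; like any Dart top-level
  // or static variable, they are lazily initialized.
  final Float32x4 in1F32 = Float32x4(0.1, 0.2, 0.3, 0.4);
  final Float32x4 in2F32 = Float32x4(0.5, 0.6, 0.7, 0.8);

  // Placeholder sigmoid approximation (x / (1 + |x|), rescaled to (0, 1));
  // the benchmark's real implementation may differ.
  Float32x4 sigmoidFloat32x4(Float32x4 x) =>
      (x / (Float32x4.splat(1.0) + x.abs())).scale(0.5) + Float32x4.splat(0.5);

  Float32x4 hotLoop(int loops) {
    // Reading the static fields once, outside the loop, removes the repeated
    // "is it initialized?" check and keeps the sigmoid calls loop-invariant,
    // so they can be hoisted (by the compiler or by hand).
    final a = in1F32;
    final b = in2F32;
    var result = Float32x4.splat(1.0);
    for (var i = 0; i < loops; ++i) {
      result = result * sigmoidFloat32x4(a);
      result = result + sigmoidFloat32x4(b);
    }
    return result;
  }

Whether the optimizer actually re-hoists the calls depends on inlining decisions, so treat this as a sketch of the idea rather than a guaranteed fix.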

@gmpassos (Contributor, Author) commented:

Thanks for the answer.

This benchmark was created only to demonstrate the reported degradation.

I actually found the performance issue in a much bigger project.

@gmpassos (Contributor, Author) commented:

About AOT: in Dart 2.14, a compiled executable already shows the same degradation:

'dart compile exe bin/simd_benchmark_severe.dart'

@gmpassos (Contributor, Author) commented Oct 18, 2021

Does it make sense to force IsFieldInitialized to always return false when FLAG_enable_isolate_groups is true? For example:

if (FLAG_enable_isolate_groups) return false;

The CSE optimization is suppressed at this line:

if (load->AllowsCSE() && !load->IsFieldInitialized()) {

@mraleph (Member) commented Oct 18, 2021

The comment above line 1087 explains why it has to return false if isolate groups are enabled.

@gmpassos (Contributor, Author) commented Oct 18, 2021

"The comment above line 1087 explains why it has to return false if isolate groups are enabled."

I understand the purpose, but IsFieldInitialized has a broader meaning.

I'm not an SDK code specialist, and I've only been looking into this for 15 minutes, but:

The JIT allows fields to be reset:

FLAG_fields_may_be_reset = true;

But IsFieldInitialized is not doing its job here:

if (calls_initializer() && IsFieldInitialized()) {

And the JIT will always get out of this method here:

if (FLAG_fields_may_be_reset) {

@mraleph (Member) commented Oct 18, 2021

IsFieldInitialized is supposed to return true when it is safe to assume the field is initialized and can be accessed directly without calling the initializer (and thus without guarding against the possibility of the initializer throwing).

If fields can be reset (a flag that is only ever set when generating AOT or AppJIT snapshots), then the compiler can't make assumptions about the state of the fields based on their values during compilation, because we are going to clear them when generating the snapshot.

If isolate groups are enabled, then the compiler can't assume a field is initialised based on its state in one isolate, because every other isolate starts from scratch with all fields cleared.

As I have said, there are some optimizations we could implement to address this, but in general it is good to remember that Dart's static fields are specified to be lazily initialised, and thus accessing them directly from hot loops might come with a penalty.
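
A small illustration of that lazy-initialization guarantee (hypothetical names; just a sketch of the language semantics the compiler has to preserve): the initializer of a top-level or static variable does not run until the first read, which is why every read conceptually carries an "already initialized?" check unless the compiler can prove it away.

  int _initCount = 0;

  int _expensiveInit() {
    _initCount++;
    return 42;
  }

  // Top-level (and static class) variables are lazily initialized: the
  // initializer expression runs on the first read, not at program start.
  final int lazyField = _expensiveInit();

  void main() {
    print(_initCount); // 0 -- the initializer has not run yet.
    print(lazyField);  // 42 -- the first read triggers initialization.
    print(_initCount); // 1 -- it ran exactly once.
  }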

@gmpassos (Contributor, Author) commented Oct 18, 2021

Nice explanation.

One question: is there only one level of JIT in the VM?

Maybe a multilevel JIT could allow extra optimizations after the Isolate is initialized.

Looking at the SDK code, it was prepared (before Dart 2.15) to generate JIT code at any time. But now, at some points, JIT code is generated the way AOT snapshots are. Since AOT cannot generate new code at runtime, it makes sense to behave like that there. But one important benefit of the VM is highly optimized JIT code and live adaptation of it.

Just an initial idea:

Let's say that JIT code has "levels". Each optimization of previously generated code increases the level. Level 0 code allows Isolate bootstrap and AOT.

If you can track whether code was generated at level 0 (allowing Isolate bootstrap) or at level 1+, you can separate what can be shared between isolates from what should depend on the "current" Isolate's state.

So, when generating level 0 code, you disable everything that is "bad" for AOT and Isolate bootstrap. For level 1+ you can enable extra optimizations, static field initialization detection, and full CSE.

Note that sharing JIT code is very important for lightweight Isolates, allowing fresh Isolates to run already-compiled code without recompiling it.

From my point of view, the real issue is that the VM has no way to distinguish JIT code that can "boot" an Isolate from JIT code that is state-dependent.

A complementary idea:

Maybe when generating the JIT code tree, it should always generate a state-dependent node and also a parallel level 0 node (one that can boot an Isolate). Then, when sharing JIT code between Isolates, you walk the tree looking only for level 0 nodes; when using the code locally, you allow level 1+ nodes as well.

Of course, this will duplicate compiled code in memory if the tree needs two versions, but it can be treated as volatile memory, since it can be discarded if needed and recomputed when necessary.

@mraleph (Member) commented Oct 18, 2021

Our JIT is a fairly classical adaptively optimizing speculative JIT with two tiers (unoptimized compilation and optimised compilation).

I am not entirely sure I understand what you describe, but I think it assumes that we can split JIT-generated code by isolate. This could theoretically be done, but it goes against the whole concept of lightweight isolates, in which we share the program structure between isolates (e.g. all isolates within the same group have the same Class, Function, etc. objects). This directly implies sharing of the JIT-generated code (because it is just a Code object hanging off the Function object). Splitting this by isolate complicates things a lot.

Frankly speaking, the JIT has not been a focus for us for a while. We are keeping it running, but we are not as occupied with its peak performance as we used to be, because most Dart applications are deployed in AOT mode. I think if we address this problem, we will do it in a way that works for AOT as well.

@a-siva added the area-vm and vm-regression labels Oct 18, 2021