Significant Performance Disparity Between Arm64 and x64 Write Barriers #106051

Open
@ebepho

Description

We observed a significant performance disparity between the Arm64 and x64 write barriers. When running a program that does not invoke the write barrier, Arm64 was about 3x slower than x64 in elapsed time. With a program that does invoke the write barrier, Arm64 was about 10x slower. This suggests that the Arm64 write barrier is less optimized than the x64 one.

Data

Performance Counter Stats without the Write Barrier

To test the performance of the write barrier, we used Crank to run a simple program 10 times on each of the two machines and averaged the results. When the program does not invoke the write barrier, it is approximately 3x slower on the Arm64 machine.

The following program, which does not invoke the write barrier, is the one we measured with Crank:

int[] foo = new int[1];
for (long i = 0; i < 100_000_000; i++)
{
    foo[0]++;
}

Table 1: Average Performance Counter Stats without the write barrier.

| Metric | x64 (100,000,000 iterations) | x64 (200,000,000 iterations) | Arm64 (100,000,000 iterations) | Arm64 (200,000,000 iterations) |
|---|---|---|---|---|
| cache-references | 7199555 | 7210098 | 266711905 | 467403412.6 |
| cache-misses | 1673444 | 1673888 | 1021946.5 | 1042045.5 |
| cycles | 812275185 | 1513438858 | 831957725 | 1517325563 |
| instructions | 656685121 | 1156933373.4 | 881350905 | 1583055913 |
| branches | 131173961 | 231219510.1 | 121014944 | 221181620.1 |
| faults | 2123.4 | 2123.2 | 3290.1 | 3290.9 |
| migrations | 50.9 | 51.7 | 71.1 | 84.8 |
| Time elapsed (seconds) | 0.26562 | 0.47812 | 0.82561 | 1.4412 |
| User (seconds) | 0.24808 | 0.46158 | 0.74556 | 1.3178 |
| Sys (seconds) | 0.00801 | 0.00946 | 0.16161 | 0.20523 |

Performance Counter Stats with the Write Barrier

When we do access the write barrier, performance degrades further, with the Arm64 machine becoming 10x slower.

The following program, which does invoke the write barrier, is the one we measured with Crank:

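long iterations = 100_000_000; // 200_000_000 for the second column of Table 2
Foo foo = new Foo();
for (long i = 0; i < iterations; i++)
{
    // Storing an object reference into a heap field goes through the GC write barrier.
    foo.x = foo;
}

internal class Foo
{
    public volatile Foo x;
}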
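
For anyone who wants a rough local comparison without Crank, the two loops can be put side by side in one console program. This is only a sketch: the class name `WriteBarrierBench`, its method names, and the use of `Stopwatch` are our own illustrative choices, not part of the measured setup, and wall-clock timing like this does not reproduce the perf counter data in the tables.

```csharp
using System;
using System.Diagnostics;

// Hypothetical local harness; the measurements in this issue were taken with Crank.
public static class WriteBarrierBench
{
    // Same shape as the Foo class in the issue.
    internal sealed class Foo
    {
        public volatile Foo x;
    }

    // Increments an int[] element: a non-reference store, so no GC write barrier is involved.
    public static int RunWithoutBarrier(long iterations)
    {
        int[] foo = new int[1];
        for (long i = 0; i < iterations; i++)
        {
            foo[0]++;
        }
        return foo[0];
    }

    // Stores an object reference into a heap object's field, which goes through the write barrier.
    public static bool RunWithBarrier(long iterations)
    {
        Foo foo = new Foo();
        for (long i = 0; i < iterations; i++)
        {
            foo.x = foo;
        }
        return ReferenceEquals(foo.x, foo);
    }

    public static void Main(string[] args)
    {
        // Iteration count defaults to the smaller of the two values used in the tables.
        long iterations = args.Length > 0 ? long.Parse(args[0]) : 100_000_000;

        var sw = Stopwatch.StartNew();
        RunWithoutBarrier(iterations);
        Console.WriteLine($"without write barrier: {sw.Elapsed.TotalSeconds:F5} s");

        sw.Restart();
        RunWithBarrier(iterations);
        Console.WriteLine($"with write barrier:    {sw.Elapsed.TotalSeconds:F5} s");
    }
}
```

Running it once on each machine with the same iteration count gives two elapsed times per architecture that can be compared in the same way as the "Time elapsed" rows.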

Table 2: Performance Counter Stats with the write barrier.

| Metric | x64 (100,000,000 iterations) | x64 (200,000,000 iterations) | Arm64 (100,000,000 iterations) | Arm64 (200,000,000 iterations) |
|---|---|---|---|---|
| cache-references | 7252140 | 7178833 | 568014397 | 1068659425 |
| cache-misses | 1697333 | 1684188 | 1025013 | 1012689 |
| cycles | 713364359 | 1313245706 | 2756710296 | 5360611600 |
| instructions | 1456194567 | 2756823577 | 1983627681 | 3785656008 |
| branches | 431088498 | 831198368 | 621239460 | 1221448774 |
| faults | 2116 | 2124 | 3291 | 3296 |
| migrations | 50.9 | 52.3 | 72.7 | 61.6 |
| Time elapsed (seconds) | 0.23283 | 0.41492 | 2.6058 | 4.2126 |
| User (seconds) | 0.21495 | 0.39656 | 2.5438 | 4.0788 |
| Sys (seconds) | 0.01169 | 0.01188 | 0.14361 | 0.1984 |
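
As a quick check of the 3x and 10x figures, the slowdown factors can be computed directly from the "Time elapsed" rows of Table 1 and Table 2. The snippet below is a small sketch with illustrative variable names. It divides the Arm64 time by the x64 time for each iteration count, which works out to roughly 3.1x and 3.0x without the write barrier and roughly 11.2x and 10.2x with it.

```csharp
using System;

// Elapsed times in seconds, copied from Table 1 and Table 2.
// Index 0 is the 100,000,000-iteration run, index 1 the 200,000,000-iteration run.
double[] x64NoBarrier   = { 0.26562, 0.47812 };
double[] arm64NoBarrier = { 0.82561, 1.4412 };
double[] x64Barrier     = { 0.23283, 0.41492 };
double[] arm64Barrier   = { 2.6058, 4.2126 };

string[] labels = { "100,000,000", "200,000,000" };
double[] slowdownNoBarrier = new double[labels.Length];
double[] slowdownBarrier = new double[labels.Length];

for (int i = 0; i < labels.Length; i++)
{
    // Slowdown factor = Arm64 elapsed time / x64 elapsed time.
    slowdownNoBarrier[i] = arm64NoBarrier[i] / x64NoBarrier[i];
    slowdownBarrier[i] = arm64Barrier[i] / x64Barrier[i];
    Console.WriteLine(
        $"{labels[i]} iterations: without barrier {slowdownNoBarrier[i]:F2}x, with barrier {slowdownBarrier[i]:F2}x");
}
```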
