Description
We observed a significant performance disparity between the Arm64 and x64 write barriers. In a test program that does not invoke the write barrier, Arm64 was about 3x slower than x64. In a test program that does invoke the write barrier, Arm64 was about 10x slower. This suggests that the Arm64 write barrier is less optimized than the x64 one.
Data
Performance Counter Stats without the Write Barrier
To test the performance of the write barrier, we used Crank to run a simple program 10 times on each of the two machines. When the program does not invoke the write barrier, it is approximately 3x slower on the Arm64 machine.
The following program, which does not invoke the write barrier, is the one we measured with Crank (shown here with the 100,000,000-iteration loop bound):
int[] foo = new int[1];
for (long i = 0; i < 100_000_000; i++)
{
    foo[0]++;
}
Table 1: Average Performance Counter Stats without the write barrier.
Architecture | x64 | x64 | Arm64 | Arm64 |
# of iterations | 100,000,000 | 200,000,000 | 100,000,000 | 200,000,000 |
cache-references | 7199555 | 7210098 | 266711905 | 467403412.6 |
cache-misses | 1673444 | 1673888 | 1021946.5 | 1042045.5 |
cycles | 812275185 | 1513438858 | 831957725 | 1517325563 |
instructions | 656685121 | 1156933373.4 | 881350905 | 1583055913 |
branches | 131173961 | 231219510.1 | 121014944 | 221181620.1 |
faults | 2123.4 | 2123.2 | 3290.1 | 3290.9 |
migrations | 50.9 | 51.7 | 71.1 | 84.8 |
Time elapsed (seconds) | 0.26562 | 0.47812 | 0.82561 | 1.4412 |
User (seconds) | 0.24808 | 0.46158 | 0.74556 | 1.3178 |
Sys (seconds) | 0.00801 | 0.00946 | 0.16161 | 0.20523 |
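The elapsed times in Table 1 include process startup, so the whole-run ratio is only a rough guide to the cost of the loop itself. One way to cross-check it is to difference the 200,000,000 and 100,000,000 iteration runs, which cancels any fixed cost the two runs share. The sketch below does that arithmetic on the Table 1 elapsed times; the class and method names are hypothetical and were not part of the original measurements.

```csharp
using System;

// Hypothetical helper, not part of the original measurements: estimates the
// marginal cost of one loop iteration from the two Table 1 runs.
static class Table1Estimates
{
    // The 200M run executes this many more iterations than the 100M run.
    const double ExtraIterations = 100_000_000;

    // Elapsed time in seconds, copied from Table 1.
    public const double X64At100M = 0.26562, X64At200M = 0.47812;
    public const double Arm64At100M = 0.82561, Arm64At200M = 1.4412;

    // Differencing the two runs cancels fixed costs (startup, JIT) common to both.
    public static double NanosPerIteration(double secondsAt100M, double secondsAt200M)
        => (secondsAt200M - secondsAt100M) / ExtraIterations * 1e9;

    public static void Main()
    {
        double x64 = NanosPerIteration(X64At100M, X64At200M);
        double arm64 = NanosPerIteration(Arm64At100M, Arm64At200M);

        Console.WriteLine($"x64:   {x64:F2} ns/iteration");
        Console.WriteLine($"Arm64: {arm64:F2} ns/iteration");
        Console.WriteLine($"Whole-run ratio at 100M: {Arm64At100M / X64At100M:F2}");
        Console.WriteLine($"Marginal ratio:          {arm64 / x64:F2}");
    }
}
```

By this arithmetic the loop costs roughly 2.1 ns per iteration on x64 and roughly 6.2 ns on Arm64, a marginal ratio of about 2.9x, which is consistent with the approximately 3x figure quoted above.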
Performance Counter Stats with the Write Barrier
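When the program does invoke the write barrier, the gap widens: the Arm64 machine is roughly 10x to 11x slower than the x64 machine (2.6058 s vs. 0.23283 s at 100,000,000 iterations, and 4.2126 s vs. 0.41492 s at 200,000,000).
The following program, which invokes the write barrier on every iteration, is the one we measured with Crank: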
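Foo foo = new Foo();
long iterations = 100_000_000; // 200_000_000 for the second column of Table 2
for (long i = 0; i < iterations; i++)
{
    foo.x = foo; // reference store into a heap object, which goes through the write barrier
}

internal class Foo
{
    public volatile Foo x;
}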
Table 2: Performance Counter Stats with the write barrier.
Architecture | x64 | x64 | Arm64 | Arm64 |
# of iterations | 100,000,000 | 200,000,000 | 100,000,000 | 200,000,000 |
cache-references | 7252140 | 7178833 | 568014397 | 1068659425 |
cache-misses | 1697333 | 1684188 | 1025013 | 1012689 |
cycles | 713364359 | 1313245706 | 2756710296 | 5360611600 |
instructions | 1456194567 | 2756823577 | 1983627681 | 3785656008 |
branches | 431088498 | 831198368 | 621239460 | 1221448774 |
faults | 2116 | 2124 | 3291 | 3296 |
migrations | 50.9 | 52.3 | 72.7 | 61.6 |
Time elapsed (seconds) | 0.23283 | 0.41492 | 2.6058 | 4.2126 |
User (seconds) | 0.21495 | 0.39656 | 2.5438 | 4.0788 |
Sys (seconds) | 0.01169 | 0.01188 | 0.14361 | 0.1984 |
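The counters in Table 2 can also be reduced to per-iteration figures by differencing the 200,000,000 and 100,000,000 iteration runs, which helps show whether the Arm64 slowdown comes from executing more instructions or from spending more cycles per instruction. The sketch below is only that arithmetic applied to the Table 2 values; the class and method names are hypothetical.

```csharp
using System;

// Hypothetical helper, not part of the original measurements: derives
// per-iteration cycles, instructions, and IPC from the two Table 2 runs.
static class Table2Estimates
{
    // The 200M run executes this many more iterations than the 100M run.
    const double ExtraIterations = 100_000_000;

    // Counter values copied from Table 2 (100M run, then 200M run).
    public const double X64Cycles100M = 713364359, X64Cycles200M = 1313245706;
    public const double Arm64Cycles100M = 2756710296, Arm64Cycles200M = 5360611600;
    public const double X64Instr100M = 1456194567, X64Instr200M = 2756823577;
    public const double Arm64Instr100M = 1983627681, Arm64Instr200M = 3785656008;

    // Differencing the two runs cancels fixed costs common to both.
    public static double PerIteration(double at100M, double at200M)
        => (at200M - at100M) / ExtraIterations;

    public static void Main()
    {
        double x64Cycles = PerIteration(X64Cycles100M, X64Cycles200M);
        double arm64Cycles = PerIteration(Arm64Cycles100M, Arm64Cycles200M);
        double x64Instr = PerIteration(X64Instr100M, X64Instr200M);
        double arm64Instr = PerIteration(Arm64Instr100M, Arm64Instr200M);

        Console.WriteLine($"x64:   {x64Cycles:F2} cycles, {x64Instr:F2} instructions, IPC {x64Instr / x64Cycles:F2}");
        Console.WriteLine($"Arm64: {arm64Cycles:F2} cycles, {arm64Instr:F2} instructions, IPC {arm64Instr / arm64Cycles:F2}");
    }
}
```

By this arithmetic, each iteration takes roughly 13 instructions and 6 cycles on x64, versus roughly 18 instructions and 26 cycles on Arm64. If those counters are comparable across the two machines, the instruction count grows by under 1.5x while the cycle count grows by more than 4x, so most of the gap would come from cycles per instruction rather than from a longer barrier sequence. This is an inference from the table, not a confirmed diagnosis.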