Skip to content

Conversation

@p-e-w
Copy link
Contributor

@p-e-w p-e-w commented Jun 11, 2020

The call unicode.In(r, unicode.Mark) introduced in 5c8a233 is a performance hog. Profiling reveals that a double-digit percentage of Micro's total CPU time is spent inside that function. The reason is that the range table unicode.Mark has more than 100 entries, including many highly exotic code points not commonly thought of as "marks".

This PR essentially reverts 5c8a233, except that instead of constructing a range table, it uses a hand-written function that only performs the necessary comparisons. This avoids a bunch of branching and iteration logic from unicode.In, which makes it even faster than the previous approach.

The benchmarks demonstrate that this is indeed a huge performance improvement (I had to increase the p-value threshold from 0.05 to 0.15 for benchstat to show this, as 3 benchmark runs are unfortunately still not sufficient to reach p-values that low in most situations):

name                           old time/op  new time/op  delta
CreateAndClose10Lines-4        41.3µs ± 1%  55.1µs ± 5%  +33.47%  (p=0.100 n=3+3)
CreateAndClose100Lines-4        122µs ± 4%   135µs ±12%     ~     (p=0.700 n=3+3)
CreateAndClose1000Lines-4       897µs ± 2%   887µs ± 1%     ~     (p=0.700 n=3+3)
CreateAndClose10000Lines-4     9.14ms ± 2%  9.06ms ± 1%     ~     (p=0.700 n=3+3)
CreateAndClose100000Lines-4    86.4ms ± 5%  82.9ms ± 0%     ~     (p=0.700 n=3+3)
CreateAndClose1000000Lines-4    825ms ± 7%   832ms ± 1%     ~     (p=0.700 n=3+3)
Read10Lines-4                  1.90µs ± 3%  1.90µs ± 1%     ~     (p=1.000 n=3+3)
Read100Lines-4                 12.5µs ± 5%  12.4µs ± 2%     ~     (p=1.000 n=3+3)
Read1000Lines-4                 122µs ± 6%   120µs ± 1%     ~     (p=1.000 n=3+3)
Read10000Lines-4               1.47ms ± 9%  1.37ms ± 1%     ~     (p=0.400 n=3+3)
Read100000Lines-4              14.3ms ± 4%  14.3ms ± 0%     ~     (p=0.700 n=3+3)
Read1000000Lines-4              148ms ± 8%   129ms ± 1%  -12.58%  (p=0.100 n=3+3)
Edit10Lines1Cursor-4            245µs ± 3%   158µs ± 0%  -35.53%  (p=0.100 n=3+3)
Edit100Lines1Cursor-4           120µs ± 4%    80µs ± 1%  -33.78%  (p=0.100 n=3+3)
Edit100Lines10Cursors-4        6.50ms ± 3%  4.65ms ± 1%  -28.41%  (p=0.100 n=3+3)
Edit1000Lines1Cursor-4         65.5µs ± 3%  41.9µs ± 1%  -36.06%  (p=0.100 n=3+3)
Edit1000Lines10Cursors-4       5.45ms ± 2%  3.96ms ± 0%  -27.39%  (p=0.100 n=3+3)
Edit1000Lines100Cursors-4       352ms ± 3%   257ms ± 0%  -27.08%  (p=0.100 n=3+3)
Edit10000Lines1Cursor-4         248µs ± 5%   181µs ± 1%  -26.97%  (p=0.100 n=3+3)
Edit10000Lines10Cursors-4      5.49ms ± 1%  4.16ms ± 0%  -24.20%  (p=0.100 n=3+3)
Edit10000Lines100Cursors-4      393ms ± 3%   289ms ± 0%  -26.51%  (p=0.100 n=3+3)
Edit10000Lines1000Cursors-4     33.4s ± 3%   24.7s ± 0%  -26.18%  (p=0.100 n=3+3)
Edit100000Lines1Cursor-4       2.30ms ±24%  1.96ms ± 0%  -14.85%  (p=0.100 n=3+3)
Edit100000Lines10Cursors-4     43.7ms ±24%  36.8ms ± 0%  -15.81%  (p=0.100 n=3+3)
Edit100000Lines100Cursors-4     707ms ± 1%   621ms ± 0%  -12.18%  (p=0.100 n=3+3)
Edit100000Lines1000Cursors-4    35.3s ± 5%   27.7s ± 0%  -21.56%  (p=0.100 n=3+3)
Edit1000000Lines1Cursor-4       433µs ± 1%   394µs ± 1%   -9.05%  (p=0.100 n=3+3)
Edit1000000Lines10Cursors-4     366ms ± 6%   342ms ± 0%   -6.71%  (p=0.100 n=3+3)
Edit1000000Lines100Cursors-4    4.17s ± 3%   3.94s ± 0%   -5.65%  (p=0.100 n=3+3)
Edit1000000Lines1000Cursors-4   74.8s ± 1%   65.2s ± 0%  -12.83%  (p=0.100 n=3+3)

Of course, this code is not perfectly equivalent to using unicode.Mark, but I don't think getting every single corner case for every exotic script correct is worth slowing Micro down by 20%.

It would be really nice if such performance regressions could be caught automatically by CI, but Travis doesn't provide a consistent runtime environment AFAIK. Do you know a better option? How are the nightly builds generated?

Also, why are the Unicode helper functions from the highlight package duplicated? Micro depends on that package, so why are they not simply imported?

@zyedidia
Copy link
Owner

zyedidia commented Jun 11, 2020

I don't think we should cut corners on the unicode mark range, but we can definitely look into ways to speed this up. One of the reasons to implement combining characters was so that rendering Thai script would be correct (#872), and this change would break that.

A good method of improving performance would be to check if the rune is a common character (ASCII or similar) that is known not to be a mark before doing an expensive check to see if it is a mark. I've looked closer at the Go documentation and see there is a unicode.IsMark function, and inspecting the source code seems to reveal that this is optimized to handle the common case of Latin characters quickly (I think), by making sure that the rune is not Latin before doing anything expensive. Let's try using that function and see if that improves performance. If not, we can try building something by hand with this idea.

@p-e-w
Copy link
Contributor Author

p-e-w commented Jun 12, 2020

Thanks, I somehow managed to completely overlook unicode.IsMark. That function indeed has a fast path, but we can do better. unicode.IsMark only skips testing against the full range table if the rune is less than 0x00FF, but the first mark only occurs at 0x0300.

So that is the fast path I ended up implementing. For Latin-1 characters, this code is actually even faster than the previous one from this PR, as it only needs to perform a single comparison instead of five.

Speed improvement is similar to before (note that ~ does not mean that there has been no improvement, but that the p-value returned by the statistical test is above the significance threshold):

name                           old time/op  new time/op  delta
CreateAndClose10Lines-4        41.3µs ± 1%  47.4µs ±33%     ~     (p=0.700 n=3+3)
CreateAndClose100Lines-4        122µs ± 4%   128µs ±13%     ~     (p=1.000 n=3+3)
CreateAndClose1000Lines-4       897µs ± 2%   877µs ± 2%     ~     (p=0.400 n=3+3)
CreateAndClose10000Lines-4     9.14ms ± 2%  8.91ms ± 0%   -2.44%  (p=0.100 n=3+3)
CreateAndClose100000Lines-4    86.4ms ± 5%  80.0ms ± 1%   -7.45%  (p=0.100 n=3+3)
CreateAndClose1000000Lines-4    825ms ± 7%   802ms ± 0%     ~     (p=0.700 n=3+3)
Read10Lines-4                  1.90µs ± 3%  1.91µs ± 1%     ~     (p=1.000 n=3+3)
Read100Lines-4                 12.5µs ± 5%  12.4µs ± 1%     ~     (p=1.000 n=3+3)
Read1000Lines-4                 122µs ± 6%   119µs ± 1%     ~     (p=1.000 n=3+3)
Read10000Lines-4               1.47ms ± 9%  1.39ms ± 1%     ~     (p=0.700 n=3+3)
Read100000Lines-4              14.3ms ± 4%  14.3ms ± 3%     ~     (p=1.000 n=3+3)
Read1000000Lines-4              148ms ± 8%   131ms ± 6%     ~     (p=0.200 n=3+3)
Edit10Lines1Cursor-4            245µs ± 3%   158µs ± 1%  -35.50%  (p=0.100 n=3+3)
Edit100Lines1Cursor-4           120µs ± 4%    80µs ± 1%  -33.63%  (p=0.100 n=3+3)
Edit100Lines10Cursors-4        6.50ms ± 3%  5.01ms ± 2%  -22.87%  (p=0.100 n=3+3)
Edit1000Lines1Cursor-4         65.5µs ± 3%  43.7µs ± 1%  -33.33%  (p=0.100 n=3+3)
Edit1000Lines10Cursors-4       5.45ms ± 2%  4.14ms ± 0%  -24.11%  (p=0.100 n=3+3)
Edit1000Lines100Cursors-4       352ms ± 3%   269ms ± 0%  -23.67%  (p=0.100 n=3+3)
Edit10000Lines1Cursor-4         248µs ± 5%   185µs ± 4%  -25.34%  (p=0.100 n=3+3)
Edit10000Lines10Cursors-4      5.49ms ± 1%  4.37ms ± 1%  -20.40%  (p=0.100 n=3+3)
Edit10000Lines100Cursors-4      393ms ± 3%   302ms ± 0%  -23.18%  (p=0.100 n=3+3)
Edit10000Lines1000Cursors-4     33.4s ± 3%   26.1s ± 4%  -22.09%  (p=0.100 n=3+3)
Edit100000Lines1Cursor-4       2.30ms ±24%  2.11ms ±12%     ~     (p=0.400 n=3+3)
Edit100000Lines10Cursors-4     43.7ms ±24%  40.2ms ±15%     ~     (p=0.400 n=3+3)
Edit100000Lines100Cursors-4     707ms ± 1%   647ms ± 2%   -8.50%  (p=0.100 n=3+3)
Edit100000Lines1000Cursors-4    35.3s ± 5%   29.7s ± 5%  -15.82%  (p=0.100 n=3+3)
Edit1000000Lines1Cursor-4       433µs ± 1%   396µs ± 2%   -8.41%  (p=0.100 n=3+3)
Edit1000000Lines10Cursors-4     366ms ± 6%   404ms ±26%     ~     (p=1.000 n=3+3)
Edit1000000Lines100Cursors-4    4.17s ± 3%   4.22s ±12%     ~     (p=0.700 n=3+3)
Edit1000000Lines1000Cursors-4   74.8s ± 1%   68.8s ± 2%   -8.05%  (p=0.100 n=3+3)

@zyedidia zyedidia merged commit 5ce26cc into zyedidia:master Jun 12, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants