[Performance] Make determining whether a code point represents a combining mark faster #1719

p-e-w · 2020-06-11T05:53:00Z

The call unicode.In(r, unicode.Mark) introduced in 5c8a233 is a performance hog. Profiling reveals that a double-digit percentage of Micro's total CPU time is spent inside that function. The reason is that the range table unicode.Mark has more than 100 entries, including many highly exotic code points not commonly thought of as "marks".

This PR essentially reverts 5c8a233, except that instead of constructing a range table, it uses a hand-written function that only performs the necessary comparisons. This avoids a bunch of branching and iteration logic from unicode.In, which makes it even faster than the previous approach.

The benchmarks demonstrate that this is indeed a huge performance improvement (I had to increase the p-value threshold from 0.05 to 0.15 for benchstat to show this, as 3 benchmark runs are unfortunately still not sufficient to reach p-values that low in most situations):

name                           old time/op  new time/op  delta
CreateAndClose10Lines-4        41.3µs ± 1%  55.1µs ± 5%  +33.47%  (p=0.100 n=3+3)
CreateAndClose100Lines-4        122µs ± 4%   135µs ±12%     ~     (p=0.700 n=3+3)
CreateAndClose1000Lines-4       897µs ± 2%   887µs ± 1%     ~     (p=0.700 n=3+3)
CreateAndClose10000Lines-4     9.14ms ± 2%  9.06ms ± 1%     ~     (p=0.700 n=3+3)
CreateAndClose100000Lines-4    86.4ms ± 5%  82.9ms ± 0%     ~     (p=0.700 n=3+3)
CreateAndClose1000000Lines-4    825ms ± 7%   832ms ± 1%     ~     (p=0.700 n=3+3)
Read10Lines-4                  1.90µs ± 3%  1.90µs ± 1%     ~     (p=1.000 n=3+3)
Read100Lines-4                 12.5µs ± 5%  12.4µs ± 2%     ~     (p=1.000 n=3+3)
Read1000Lines-4                 122µs ± 6%   120µs ± 1%     ~     (p=1.000 n=3+3)
Read10000Lines-4               1.47ms ± 9%  1.37ms ± 1%     ~     (p=0.400 n=3+3)
Read100000Lines-4              14.3ms ± 4%  14.3ms ± 0%     ~     (p=0.700 n=3+3)
Read1000000Lines-4              148ms ± 8%   129ms ± 1%  -12.58%  (p=0.100 n=3+3)
Edit10Lines1Cursor-4            245µs ± 3%   158µs ± 0%  -35.53%  (p=0.100 n=3+3)
Edit100Lines1Cursor-4           120µs ± 4%    80µs ± 1%  -33.78%  (p=0.100 n=3+3)
Edit100Lines10Cursors-4        6.50ms ± 3%  4.65ms ± 1%  -28.41%  (p=0.100 n=3+3)
Edit1000Lines1Cursor-4         65.5µs ± 3%  41.9µs ± 1%  -36.06%  (p=0.100 n=3+3)
Edit1000Lines10Cursors-4       5.45ms ± 2%  3.96ms ± 0%  -27.39%  (p=0.100 n=3+3)
Edit1000Lines100Cursors-4       352ms ± 3%   257ms ± 0%  -27.08%  (p=0.100 n=3+3)
Edit10000Lines1Cursor-4         248µs ± 5%   181µs ± 1%  -26.97%  (p=0.100 n=3+3)
Edit10000Lines10Cursors-4      5.49ms ± 1%  4.16ms ± 0%  -24.20%  (p=0.100 n=3+3)
Edit10000Lines100Cursors-4      393ms ± 3%   289ms ± 0%  -26.51%  (p=0.100 n=3+3)
Edit10000Lines1000Cursors-4     33.4s ± 3%   24.7s ± 0%  -26.18%  (p=0.100 n=3+3)
Edit100000Lines1Cursor-4       2.30ms ±24%  1.96ms ± 0%  -14.85%  (p=0.100 n=3+3)
Edit100000Lines10Cursors-4     43.7ms ±24%  36.8ms ± 0%  -15.81%  (p=0.100 n=3+3)
Edit100000Lines100Cursors-4     707ms ± 1%   621ms ± 0%  -12.18%  (p=0.100 n=3+3)
Edit100000Lines1000Cursors-4    35.3s ± 5%   27.7s ± 0%  -21.56%  (p=0.100 n=3+3)
Edit1000000Lines1Cursor-4       433µs ± 1%   394µs ± 1%   -9.05%  (p=0.100 n=3+3)
Edit1000000Lines10Cursors-4     366ms ± 6%   342ms ± 0%   -6.71%  (p=0.100 n=3+3)
Edit1000000Lines100Cursors-4    4.17s ± 3%   3.94s ± 0%   -5.65%  (p=0.100 n=3+3)
Edit1000000Lines1000Cursors-4   74.8s ± 1%   65.2s ± 0%  -12.83%  (p=0.100 n=3+3)

Of course, this code is not perfectly equivalent to using unicode.Mark, but I don't think getting every single corner case for every exotic script correct is worth slowing Micro down by 20%.

It would be really nice if such performance regressions could be caught automatically by CI, but Travis doesn't provide a consistent runtime environment AFAIK. Do you know a better option? How are the nightly builds generated?

Also, why are the Unicode helper functions from the highlight package duplicated? Micro depends on that package, so why are they not simply imported?

zyedidia · 2020-06-11T17:23:25Z

I don't think we should cut corners on the unicode mark range, but we can definitely look into ways to speed this up. One of the reasons to implement combining characters was so that rendering Thai script would be correct (#872), and this change would break that.

A good method of improving performance would be to check if the rune is a common character (ASCII or similar) that is known not to be a mark before doing an expensive check to see if it is a mark. I've looked closer at the Go documentation and see there is a unicode.IsMark function, and inspecting the source code seems to reveal that this is optimized to handle the common case of Latin characters quickly (I think), by making sure that the rune is not Latin before doing anything expensive. Let's try using that function and see if that improves performance. If not, we can try building something by hand with this idea.

p-e-w · 2020-06-12T04:01:27Z

Thanks, I somehow managed to completely overlook unicode.IsMark. That function indeed has a fast path, but we can do better. unicode.IsMark only skips testing against the full range table if the rune is less than 0x00FF, but the first mark only occurs at 0x0300.

So that is the fast path I ended up implementing. For Latin-1 characters, this code is actually even faster than the previous one from this PR, as it only needs to perform a single comparison instead of five.

Speed improvement is similar to before (note that ~ does not mean that there has been no improvement, but that the p-value returned by the statistical test is above the significance threshold):

name                           old time/op  new time/op  delta
CreateAndClose10Lines-4        41.3µs ± 1%  47.4µs ±33%     ~     (p=0.700 n=3+3)
CreateAndClose100Lines-4        122µs ± 4%   128µs ±13%     ~     (p=1.000 n=3+3)
CreateAndClose1000Lines-4       897µs ± 2%   877µs ± 2%     ~     (p=0.400 n=3+3)
CreateAndClose10000Lines-4     9.14ms ± 2%  8.91ms ± 0%   -2.44%  (p=0.100 n=3+3)
CreateAndClose100000Lines-4    86.4ms ± 5%  80.0ms ± 1%   -7.45%  (p=0.100 n=3+3)
CreateAndClose1000000Lines-4    825ms ± 7%   802ms ± 0%     ~     (p=0.700 n=3+3)
Read10Lines-4                  1.90µs ± 3%  1.91µs ± 1%     ~     (p=1.000 n=3+3)
Read100Lines-4                 12.5µs ± 5%  12.4µs ± 1%     ~     (p=1.000 n=3+3)
Read1000Lines-4                 122µs ± 6%   119µs ± 1%     ~     (p=1.000 n=3+3)
Read10000Lines-4               1.47ms ± 9%  1.39ms ± 1%     ~     (p=0.700 n=3+3)
Read100000Lines-4              14.3ms ± 4%  14.3ms ± 3%     ~     (p=1.000 n=3+3)
Read1000000Lines-4              148ms ± 8%   131ms ± 6%     ~     (p=0.200 n=3+3)
Edit10Lines1Cursor-4            245µs ± 3%   158µs ± 1%  -35.50%  (p=0.100 n=3+3)
Edit100Lines1Cursor-4           120µs ± 4%    80µs ± 1%  -33.63%  (p=0.100 n=3+3)
Edit100Lines10Cursors-4        6.50ms ± 3%  5.01ms ± 2%  -22.87%  (p=0.100 n=3+3)
Edit1000Lines1Cursor-4         65.5µs ± 3%  43.7µs ± 1%  -33.33%  (p=0.100 n=3+3)
Edit1000Lines10Cursors-4       5.45ms ± 2%  4.14ms ± 0%  -24.11%  (p=0.100 n=3+3)
Edit1000Lines100Cursors-4       352ms ± 3%   269ms ± 0%  -23.67%  (p=0.100 n=3+3)
Edit10000Lines1Cursor-4         248µs ± 5%   185µs ± 4%  -25.34%  (p=0.100 n=3+3)
Edit10000Lines10Cursors-4      5.49ms ± 1%  4.37ms ± 1%  -20.40%  (p=0.100 n=3+3)
Edit10000Lines100Cursors-4      393ms ± 3%   302ms ± 0%  -23.18%  (p=0.100 n=3+3)
Edit10000Lines1000Cursors-4     33.4s ± 3%   26.1s ± 4%  -22.09%  (p=0.100 n=3+3)
Edit100000Lines1Cursor-4       2.30ms ±24%  2.11ms ±12%     ~     (p=0.400 n=3+3)
Edit100000Lines10Cursors-4     43.7ms ±24%  40.2ms ±15%     ~     (p=0.400 n=3+3)
Edit100000Lines100Cursors-4     707ms ± 1%   647ms ± 2%   -8.50%  (p=0.100 n=3+3)
Edit100000Lines1000Cursors-4    35.3s ± 5%   29.7s ± 5%  -15.82%  (p=0.100 n=3+3)
Edit1000000Lines1Cursor-4       433µs ± 1%   396µs ± 2%   -8.41%  (p=0.100 n=3+3)
Edit1000000Lines10Cursors-4     366ms ± 6%   404ms ±26%     ~     (p=1.000 n=3+3)
Edit1000000Lines100Cursors-4    4.17s ± 3%   4.22s ±12%     ~     (p=0.700 n=3+3)
Edit1000000Lines1000Cursors-4   74.8s ± 1%   68.8s ± 2%   -8.05%  (p=0.100 n=3+3)

Make determining whether a code point represents a combining mark faster

fcd5c05

zyedidia merged commit 5ce26cc into zyedidia:master Jun 12, 2020

p-e-w mentioned this pull request Jun 12, 2020

[Performance] Make determining rune width faster #1722

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

[Performance] Make determining whether a code point represents a combining mark faster #1719

[Performance] Make determining whether a code point represents a combining mark faster #1719

Uh oh!

p-e-w commented Jun 11, 2020

Uh oh!

zyedidia commented Jun 11, 2020 •

edited

Loading

Uh oh!

p-e-w commented Jun 12, 2020

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

[Performance] Make determining whether a code point represents a combining mark faster #1719

[Performance] Make determining whether a code point represents a combining mark faster #1719

Uh oh!

Conversation

p-e-w commented Jun 11, 2020

Uh oh!

zyedidia commented Jun 11, 2020 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

p-e-w commented Jun 12, 2020

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

zyedidia commented Jun 11, 2020 •

edited

Loading