
Implements zero-copy html parsing. #60


Closed · wants to merge 6 commits

Conversation

@cgaebel cgaebel commented Nov 25, 2014

Instead of using strings, html5ever now uses Iobufs and Spans over
Iobufs to represent the raw html data. This allows us to do all of
the parsing without copying the HTML into a tree of strings: it can
be Spans and Iobufs all the way down.

Several performance hacks were needed to make this fast. Most of
them work around rustc codegen failures, as chronicled in:

http://discuss.rust-lang.org/t/the-sad-state-of-zero-on-drop/944

Here's a little list of transformations done for performance reasons:

  • State machine states are now functions. Rustc isn't smart enough
    to properly handle large-ish things on the stack in different
    match arms. This does mean that it's sane to inline the jump
    table into `feed`, which had a nice impact on performance. Jump
    tables inside loops are especially efficient because it's just
    a bigger jump table!
  • Things which get atomized anyhow (except for doctypes, which
    weren't hot enough for me to bother changing) use the old `String`
    parsing method, since it ends up being a lot faster for small
    strings and doesn't cause O(tags) allocations, thanks to truncation.
  • A custom Option called `FastOption` was added; it doesn't zero on
    `take` and can't be matched on, but still maintains safety.
  • Iobufs in the input_buffers Ringbuf are padded to 32 bytes, to
    allow indexing without a multiply. That was actually a hotspot
    that showed up in perf, which is a little scary.
  • Utf-8 decoding of chars is avoided unless absolutely necessary.
    For most parsing, we just need the utf-8 length, which is much
    easier to calculate (a branch, and a LUT on the first byte in
    the "slow" path).
  • Chars and Runs are parsed into a "shared" location every time,
    because rustc is really bad at generating code for types which
    Drop a lot in a loop. See the discuss post at the top.
  • A new `temp_buf` has been introduced, because it is no longer
    performant to just append random characters to spans. Consider
    a partially-consumed comment start: `<-`. If the next character
    is an `a`, the `<-` needs to be emitted. The second temporary
    buffer is used to handle cases like that.
  • Similar to above, but when parsing char refs: the '&' and '#'
    are saved in case of backout.
  • Dashes at the end of a comment (`----->`) need to be saved and
    shuffled as we keep reading more dashes, so that we always
    emit the "right" ones to keep the span continuous. This
    required a little 2-element "queue": `first_comment_end_dash`
    and `second_comment_end_dash`.
  • Some of the tokenizer fields were reordered for cache
    efficiency.
  • Some inliner guidance was added in `get_char` and
    `get_preprocessed_char`, to keep fast paths fast.
  • `clone_from` is used to get data out of the input buffers
    where it makes sense, preventing a bunch of bad rustc codegen.
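A `FastOption` along the lines described above could be sketched like this in today's Rust (illustrative only: the names and internals are hypothetical, and the actual patch predates `MaybeUninit`):

```rust
use std::mem::MaybeUninit;

/// Like `Option<T>`, but `take` only flips a flag; it never writes a
/// `None` bit-pattern back over the payload, avoiding zero-on-drop traffic.
pub struct FastOption<T> {
    full: bool,
    value: MaybeUninit<T>,
}

impl<T> FastOption<T> {
    pub fn new() -> Self {
        FastOption { full: false, value: MaybeUninit::uninit() }
    }

    /// Stores a value. Panics if one is already present.
    pub fn put(&mut self, v: T) {
        assert!(!self.full);
        self.value.write(v);
        self.full = true;
    }

    /// Moves the value out. Unlike `Option::take`, nothing is written
    /// back into the payload bytes.
    pub fn take(&mut self) -> T {
        assert!(self.full);
        self.full = false;
        unsafe { self.value.assume_init_read() }
    }
}

// Safety: drop the payload if it was never taken, so values aren't leaked.
impl<T> Drop for FastOption<T> {
    fn drop(&mut self) {
        if self.full {
            unsafe { self.value.assume_init_drop() }
        }
    }
}
```

The flag-plus-uninitialized-payload layout is what lets `take` skip the store that a plain `Option::take` performs; the asserts keep misuse from reading uninitialized memory.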

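The UTF-8 length calculation mentioned in the list (a branch for ASCII, then a LUT on the first byte) can be sketched as follows; this is an illustration of the technique, not the patch's actual code:

```rust
// Sequence length by the top 4 bits of the lead byte. Continuation
// bytes (10xxxxxx) are invalid as lead bytes and map to 0.
const LEN_LUT: [u8; 16] = [
    1, 1, 1, 1, 1, 1, 1, 1, // 0xxxxxxx: ASCII (handled by the fast path)
    0, 0, 0, 0,             // 10xxxxxx: continuation byte, invalid lead
    2, 2,                   // 110xxxxx: two-byte sequence
    3,                      // 1110xxxx: three-byte sequence
    4,                      // 11110xxx: four-byte sequence
];

/// Byte length of the UTF-8 sequence starting with `first_byte`,
/// without decoding the code point itself.
fn utf8_len(first_byte: u8) -> u8 {
    if first_byte < 0x80 {
        1 // fast path: ASCII, a single branch
    } else {
        LEN_LUT[(first_byte >> 4) as usize] // "slow" path: one table lookup
    }
}
```

Knowing only the length is enough to advance the span over a character, which is why full decoding can be skipped on the hot path.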
As a result of these optimizations (and zero-copy parsing in general):

zero-copy

test tokenize uncommitted/html5.html       ... bench: 124076195 ns/iter (+/- 9519897)
test tokenize uncommitted/lipsum-1M.html   ... bench:   1989708 ns/iter (+/- 405327)
test tokenize uncommitted/sina.com.cn.html ... bench:   7210262 ns/iter (+/- 1391972)
test tokenize uncommitted/strong.html      ... bench:  30002001 ns/iter (+/- 3375152)
test tokenize uncommitted/webapps.html     ... bench:  99264377 ns/iter (+/- 8138989)
test tokenize uncommitted/wikipedia.html   ... bench:   3841740 ns/iter (+/- 612645)

original

test tokenize uncommitted/html5.html       ... bench: 153991836 ns/iter (+/- 7196531)
test tokenize uncommitted/lipsum-1M.html   ... bench:   2393385 ns/iter (+/- 450953)
test tokenize uncommitted/sina.com.cn.html ... bench:   8837605 ns/iter (+/- 1238217)
test tokenize uncommitted/strong.html      ... bench:  44153393 ns/iter (+/- 5076161)
test tokenize uncommitted/webapps.html     ... bench: 136860951 ns/iter (+/- 8137049)
test tokenize uncommitted/wikipedia.html   ... bench:   4868854 ns/iter (+/- 797178)

SUMMARY

html5.html:       19%
sina.com.cn.html: 14%
strong.html:      47%
webapps.html:     27%
wikipedia.html:   21%
lipsum-1M.html:   17%

r? @kmcallister

Clark Gaebel added 6 commits November 25, 2014 13:35

cgaebel commented Mar 8, 2015

Optimization issues that I ran into, which led to a bunch of the "performance tweaks" in this patch:

http://internals.rust-lang.org/t/the-sad-state-of-zero-on-drop/944
rust-lang/rust#20219

The giant state machine function had stuff on the stack (with a destructor) in each match arm, and I believe this was the root cause of the exploding stack usage: zero-on-drop confusing llvm. To fix it, I put each state in its own function. That isn't entirely a bad thing, because it makes stack traces a lot nicer to read when there are problems.
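The states-as-functions shape described here can be sketched roughly like this (a toy two-state tokenizer with hypothetical names, not html5ever's real states):

```rust
// Each state gets its own function that consumes input and returns the
// next state; the dispatch loop in `feed` is one small jump table
// instead of a giant match whose arms each hold droppable locals.
#[derive(Clone, Copy, PartialEq, Debug)]
enum State {
    Data,
    TagOpen,
    Done,
}

struct Tokenizer {
    input: Vec<u8>,
    pos: usize,
}

impl Tokenizer {
    fn next_byte(&mut self) -> Option<u8> {
        let b = self.input.get(self.pos).copied();
        self.pos += 1;
        b
    }

    // One function per state: locals live only for the duration of a step.
    fn step_data(&mut self) -> State {
        match self.next_byte() {
            Some(b'<') => State::TagOpen,
            Some(_) => State::Data,
            None => State::Done,
        }
    }

    fn step_tag_open(&mut self) -> State {
        match self.next_byte() {
            Some(_) => State::Data,
            None => State::Done,
        }
    }

    /// The inlined jump table: a tight loop dispatching on the state enum.
    fn feed(&mut self) -> State {
        let mut state = State::Data;
        while state != State::Done {
            state = match state {
                State::Data => self.step_data(),
                State::TagOpen => self.step_tag_open(),
                State::Done => State::Done,
            };
        }
        state
    }
}
```

Because each arm is now just a call, the big function's stack frame no longer has to reserve (and re-zero) space for every arm's temporaries at once.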

@kmcallister

@cgaebel: Did you investigate using `Span` just for the fast paths like `pop_except_from`? Keeping track of every `SingleChar` may be more trouble than it's worth.


cgaebel commented Mar 21, 2015

I did. If you break up runs of text on "non-hot" states, spans move out of their "empty or one" state and into the "many" state, which is much slower. It definitely made a huge difference, and this design was only found after I tried what you just said, because you're right: keeping track of every `SingleChar` is hard.

@kmcallister

How many buffers did those spans have on average? I'm thinking a small vector optimization could save us, or maybe finger trees.

@cgaebel

cgaebel commented Mar 21, 2015

The vast majority are spans over 0 or 1 buffers. That optimization is already implemented. Making spans handle more than that inline would greatly increase the size of each span, and increase the amount of memory traffic on the stack even for simple and common cases.
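The "0 or 1 buffers inline, heap only for the rare many case" layout being discussed could look something like this (illustrative names; not the actual html5ever/Iobuf types):

```rust
// A span is usually empty or backed by a single buffer, so those cases
// are stored inline; only the rare multi-buffer case allocates a Vec.
enum Span<Buf> {
    Empty,
    One(Buf),
    Many(Vec<Buf>),
}

impl<Buf> Span<Buf> {
    /// Appends a buffer, promoting Empty -> One -> Many as needed.
    fn push(&mut self, buf: Buf) {
        match std::mem::replace(self, Span::Empty) {
            Span::Empty => *self = Span::One(buf),
            Span::One(first) => *self = Span::Many(vec![first, buf]),
            Span::Many(mut v) => {
                v.push(buf);
                *self = Span::Many(v);
            }
        }
    }

    fn num_buffers(&self) -> usize {
        match self {
            Span::Empty => 0,
            Span::One(_) => 1,
            Span::Many(v) => v.len(),
        }
    }
}
```

Widening the inline storage beyond one buffer (a small-vector layout) would grow every `Span` on the stack, which is the memory-traffic cost the comment is pointing at.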

kmcallister added a commit to kmcallister/html5ever that referenced this pull request Mar 23, 2015
kmcallister added a commit to kmcallister/html5ever that referenced this pull request Mar 23, 2015
kmcallister added a commit to kmcallister/html5ever that referenced this pull request Mar 23, 2015
kmcallister added a commit to kmcallister/html5ever that referenced this pull request Jun 10, 2015
@kmcallister

Now #141.

kmcallister added a commit that referenced this pull request Jun 16, 2015
Based on #60 and #114.

Fixes #20.
Fixes #115.
kmcallister added a commit that referenced this pull request Jun 25, 2015
Based on #60 and #114.

Fixes #20.
Fixes #115.