Implements zero-copy html parsing. #60
Instead of using strings, html5ever now uses Iobufs and Spans over Iobufs to represent the raw HTML data. This allows us to do all of the parsing without copying the HTML into a tree of strings: it can be Spans and Iobufs all the way down.

There were several performance hacks done to get this faster. Most of them work around rustc failures as described in: http://discuss.rust-lang.org/t/the-sad-state-of-zero-on-drop/944

Here's a little list of transformations done for performance reasons:

- State machine states are now functions. Rustc isn't smart enough to properly handle large-ish things on the stack in different match arms. This also means it's sane to inline the jump table into `feed`, which had a nice impact on performance. Jump tables inside loops are especially efficient because the result is just one bigger jump table!
- Things which get atomized anyway (except for doctypes, which weren't hot enough to bother changing) use the old String parsing method, since it ends up being a lot faster for small strings and, thanks to truncation, doesn't cause O(tags) allocations.
- A custom Option called `FastOption`, which doesn't zero on `take` and can't be matched on, but still maintains safety.
- Iobufs in the `input_buffers` RingBuf are padded to 32 bytes, to allow indexing without a multiply. That was actually a hotspot that showed up in perf, which is a little scary.
- UTF-8 decoding of chars is avoided unless absolutely necessary. For most parsing, we just need the UTF-8 length, which is much easier to calculate (a branch, and a LUT on the first byte in the "slow" path).
- Chars and Runs are parsed into a "shared" location every time, because rustc is really bad at generating code for types which Drop a lot in a loop. See the discuss post above.
- A new `temp_buf` has been introduced, because it is no longer performant to append arbitrary characters to spans. Consider a partially-consumed comment start, `<-`: if the next character is an `a`, the `<-` needs to be emitted. The second temporary buffer is used to handle cases like that.
- Similar to the above, but when parsing char refs: the `&` and `#` are saved in case of backout.
- Dashes at the end of a comment (`----->`) need to be saved and shuffled as we keep reading more dashes, so that we always emit the "right" ones to keep the span contiguous. This required a little 2-element "queue": `first_comment_end_dash` and `second_comment_end_dash`.
- Some of the tokenizer fields were reordered for cache efficiency.
- Some inliner guidance was added in `get_char` and `get_preprocessed_char`, to keep fast paths fast.
- `clone_from` is used to get data out of the input buffers where it makes sense, preventing a bunch of bad rustc codegen.

As a result of these optimizations (and zero-copy parsing in general):

```
zero-copy
test tokenize uncommitted/html5.html       ... bench: 124076195 ns/iter (+/- 9519897)
test tokenize uncommitted/lipsum-1M.html   ... bench:   1989708 ns/iter (+/- 405327)
test tokenize uncommitted/sina.com.cn.html ... bench:   7210262 ns/iter (+/- 1391972)
test tokenize uncommitted/strong.html      ... bench:  30002001 ns/iter (+/- 3375152)
test tokenize uncommitted/webapps.html     ... bench:  99264377 ns/iter (+/- 8138989)
test tokenize uncommitted/wikipedia.html   ... bench:   3841740 ns/iter (+/- 612645)

original
test tokenize uncommitted/html5.html       ... bench: 153991836 ns/iter (+/- 7196531)
test tokenize uncommitted/lipsum-1M.html   ... bench:   2393385 ns/iter (+/- 450953)
test tokenize uncommitted/sina.com.cn.html ... bench:   8837605 ns/iter (+/- 1238217)
test tokenize uncommitted/strong.html      ... bench:  44153393 ns/iter (+/- 5076161)
test tokenize uncommitted/webapps.html     ... bench: 136860951 ns/iter (+/- 8137049)
test tokenize uncommitted/wikipedia.html   ... bench:   4868854 ns/iter (+/- 797178)

SUMMARY
html5.html:       19% faster
sina.com.cn.html: 14% faster
strong.html:      47% faster
webapps.html:     27% faster
wikipedia.html:   21% faster
lipsum-1M.html:   17% faster
```

r? @kmcallister
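For illustration, a minimal sketch of what a `FastOption` like the one described might look like in modern Rust, using `MaybeUninit` (this is an assumption about the shape, not the PR's actual code): `take` moves the value out by clearing a flag, and nothing is ever written back into the payload slot, so there is no zero-on-drop traffic.

```rust
use std::mem::MaybeUninit;

/// Hypothetical sketch of a `FastOption`: holds at most one value.
/// The payload cannot be matched on from outside; only the flag says
/// whether a value is present.
struct FastOption<T> {
    full: bool,
    value: MaybeUninit<T>,
}

impl<T> FastOption<T> {
    fn new() -> Self {
        FastOption { full: false, value: MaybeUninit::uninit() }
    }

    fn set(&mut self, v: T) {
        if self.full {
            // Drop the old value before overwriting it.
            unsafe { self.value.assume_init_drop() }
        }
        self.value = MaybeUninit::new(v);
        self.full = true;
    }

    /// Moves the value out. Only the flag is cleared; the payload
    /// bytes are left as-is (no zeroing on take).
    fn take(&mut self) -> T {
        assert!(self.full, "take on empty FastOption");
        self.full = false;
        unsafe { self.value.assume_init_read() }
    }
}

impl<T> Drop for FastOption<T> {
    fn drop(&mut self) {
        if self.full {
            unsafe { self.value.assume_init_drop() }
        }
    }
}
```

The point of the design is that moving a value out is a flag write plus a move, rather than a move plus a store of a "dropped" sentinel, which is what the old drop-flag scheme cost in hot loops.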
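The "branch plus LUT" length calculation mentioned above can be sketched as follows. The table values are the standard UTF-8 lead-byte sequence lengths; the function name and exact table layout are illustrative, not taken from the PR:

```rust
/// Sketch: byte length of a UTF-8 sequence from its first byte,
/// without decoding the code point. ASCII takes the branch; other
/// lead bytes go through a small lookup on their top five bits.
fn utf8_len(first: u8) -> usize {
    if first < 0x80 {
        return 1; // branch: ASCII fast path
    }
    // Indexed by `first >> 3`. 0 marks continuation/invalid bytes.
    const LUT: [usize; 32] = [
        1, 1, 1, 1, 1, 1, 1, 1, // 0x00-0x3F (unreachable here)
        1, 1, 1, 1, 1, 1, 1, 1, // 0x40-0x7F (unreachable here)
        0, 0, 0, 0, 0, 0, 0, 0, // 0x80-0xBF: continuation bytes
        2, 2, 2, 2,             // 0xC0-0xDF: 2-byte sequences
        3, 3,                   // 0xE0-0xEF: 3-byte sequences
        4,                      // 0xF0-0xF7: 4-byte sequences
        0,                      // 0xF8-0xFF: invalid
    ];
    LUT[(first >> 3) as usize]
}
```

Knowing the length alone is enough to advance through the input and slice spans; the actual code point is only materialized when something downstream really needs it.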
Optimization issues that I ran into that led to a bunch of the "performance tweaks" in this patch: http://internals.rust-lang.org/t/the-sad-state-of-zero-on-drop/944
@cgaebel: Did you investigate using …
I did. If you break up runs of text on "non-hot" states, spans move out of their "empty or one" state and into the "many" state, which is much slower. It definitely made a huge difference, and this design was only found after I tried what you just said, because you're right -- keeping track of every SingleChar is hard.
How many buffers did those spans have on average? I'm thinking a small vector optimization could save us, or maybe finger trees.
The vast majority are spans over 0 or 1 buffer. That optimization is already implemented. Making spans handle more than that inline would greatly increase the size of each span, and increase the amount of memory traffic on the stack even for simple and common cases.
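A minimal sketch of that "empty / one / many" layout (the names `Span`, `Buf`, and `push` are hypothetical, and a byte range stands in for an Iobuf): the two common cases live inline, and only the rare transition into `Many` allocates.

```rust
/// `Buf` stands in for an Iobuf here; a byte range is enough for the sketch.
type Buf = std::ops::Range<usize>;

enum Span {
    Empty,           // no data yet
    One(Buf),        // the vast majority of spans
    Many(Vec<Buf>),  // rare; pays for a heap allocation
}

impl Span {
    fn len(&self) -> usize {
        match self {
            Span::Empty => 0,
            Span::One(b) => b.len(),
            Span::Many(bufs) => bufs.iter().map(|b| b.len()).sum(),
        }
    }

    /// Appending promotes Empty -> One without any allocation;
    /// only the One -> Many transition allocates.
    fn push(&mut self, b: Buf) {
        match std::mem::replace(self, Span::Empty) {
            Span::Empty => *self = Span::One(b),
            Span::One(first) => *self = Span::Many(vec![first, b]),
            Span::Many(mut bufs) => {
                bufs.push(b);
                *self = Span::Many(bufs);
            }
        }
    }
}
```

Inlining more than one buffer would grow every `Span` by another `Buf`, which is the stack-traffic cost the comment above is describing.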
Based on servo#60 by cgaebel.
Now #141.