
Implements zero-copy html parsing. #60


Closed · wants to merge 6 commits

Conversation

@cgaebel cgaebel commented Nov 25, 2014

Instead of using strings, html5ever now uses Iobufs and Spans over
Iobufs to represent the raw html data. This allows us to do all of
the parsing without copying the HTML into a tree of strings: it can
be Spans and Iobufs all the way down.

Several performance hacks were needed to make this fast. Most of
them work around rustc codegen failures, as chronicled in:

http://discuss.rust-lang.org/t/the-sad-state-of-zero-on-drop/944

Here's a little list of transformations done for performance reasons:

  • State machine states are now functions. Rustc isn't smart enough
    to properly handle large-ish things on the stack in different
    match arms. This does mean that it's sane to inline the jump
    table into `feed`, which had a nice impact on performance. Jump
    tables inside loops are especially efficient because it's just
    a bigger jump table!
  • Things which get atomized anyhow (except for doctypes, which
    weren't hot enough for me to bother changing) use the old `String`
    parsing method, since it ends up being a lot faster for small
    strings and doesn't cause O(tags) allocations, thanks to truncation.
  • A custom Option called `FastOption` was added; it doesn't zero on
    `take` and can't be matched on, but still maintains safety.
  • Iobufs in the input_buffers Ringbuf are padded to 32 bytes, to
    allow indexing without a multiply. That was actually a hotspot
    that showed up in perf, which is a little scary.
  • Utf-8 decoding of chars is avoided unless absolutely necessary.
    For most parsing, we just need the utf-8 length, which is much
    easier to calculate (a branch, and a LUT on the first byte in
    the "slow" path).
  • Chars and Runs are parsed into a "shared" location every time,
    because rustc is really bad at generating code for types which
    Drop a lot in a loop. See the discuss post at the top.
  • A new `temp_buf` has been introduced, because it is no longer
    performant to just append random characters to spans. Consider
    a partially-consumed comment start: `<-`. If the next character
    is an `a`, the `<-` needs to be emitted. The second temporary
    buffer is used to handle cases like that.
  • Similar to above, but when parsing char refs: the '&' and '#'
    are saved in case of backout.
  • Dashes at the end of a comment (`----->`) need to be saved and
    shuffled as we keep reading more dashes, so that we always
    emit the "right" ones to keep the span continuous. This
    required a little 2-element "queue": `first_comment_end_dash`
    and `second_comment_end_dash`.
  • Some of the tokenizer fields were reordered for cache
    efficiency.
  • Some inliner guidance was added in `get_char` and
    `get_preprocessed_char`, to keep fast paths fast.
  • `clone_from` is used to get data out of the input buffers
    where it makes sense, preventing a bunch of bad rustc codegen.
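A `FastOption` along the lines described above could be sketched like this in today's Rust (illustrative only: the names and internals are hypothetical, and the actual patch predates `MaybeUninit`):

```rust
use std::mem::MaybeUninit;

/// Like `Option<T>`, but `take` only flips a flag; it never writes a
/// `None` bit-pattern back over the payload, avoiding zero-on-drop traffic.
pub struct FastOption<T> {
    full: bool,
    value: MaybeUninit<T>,
}

impl<T> FastOption<T> {
    pub fn new() -> Self {
        FastOption { full: false, value: MaybeUninit::uninit() }
    }

    /// Stores a value. Panics if one is already present.
    pub fn put(&mut self, v: T) {
        assert!(!self.full);
        self.value.write(v);
        self.full = true;
    }

    /// Moves the value out. Unlike `Option::take`, nothing is written
    /// back into the payload bytes.
    pub fn take(&mut self) -> T {
        assert!(self.full);
        self.full = false;
        unsafe { self.value.assume_init_read() }
    }
}

// Safety: drop the payload if it was never taken, so values aren't leaked.
impl<T> Drop for FastOption<T> {
    fn drop(&mut self) {
        if self.full {
            unsafe { self.value.assume_init_drop() }
        }
    }
}
```

The flag-plus-uninitialized-payload layout is what lets `take` skip the store that a plain `Option::take` performs; the asserts keep misuse from reading uninitialized memory.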

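The UTF-8 length calculation mentioned in the list (a branch for ASCII, then a LUT on the first byte) can be sketched as follows; this is an illustration of the technique, not the patch's actual code:

```rust
// Sequence length by the top 4 bits of the lead byte. Continuation
// bytes (10xxxxxx) are invalid as lead bytes and map to 0.
const LEN_LUT: [u8; 16] = [
    1, 1, 1, 1, 1, 1, 1, 1, // 0xxxxxxx: ASCII (handled by the fast path)
    0, 0, 0, 0,             // 10xxxxxx: continuation byte, invalid lead
    2, 2,                   // 110xxxxx: two-byte sequence
    3,                      // 1110xxxx: three-byte sequence
    4,                      // 11110xxx: four-byte sequence
];

/// Byte length of the UTF-8 sequence starting with `first_byte`,
/// without decoding the code point itself.
fn utf8_len(first_byte: u8) -> u8 {
    if first_byte < 0x80 {
        1 // fast path: ASCII, a single branch
    } else {
        LEN_LUT[(first_byte >> 4) as usize] // "slow" path: one table lookup
    }
}
```

Knowing only the length is enough to advance the span over a character, which is why full decoding can be skipped on the hot path.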
As a result of these optimizations (and zero-copy parsing in general):

zero-copy

test tokenize uncommitted/html5.html       ... bench: 124076195 ns/iter (+/- 9519897)
test tokenize uncommitted/lipsum-1M.html   ... bench:   1989708 ns/iter (+/- 405327)
test tokenize uncommitted/sina.com.cn.html ... bench:   7210262 ns/iter (+/- 1391972)
test tokenize uncommitted/strong.html      ... bench:  30002001 ns/iter (+/- 3375152)
test tokenize uncommitted/webapps.html     ... bench:  99264377 ns/iter (+/- 8138989)
test tokenize uncommitted/wikipedia.html   ... bench:   3841740 ns/iter (+/- 612645)

original

test tokenize uncommitted/html5.html       ... bench: 153991836 ns/iter (+/- 7196531)
test tokenize uncommitted/lipsum-1M.html   ... bench:   2393385 ns/iter (+/- 450953)
test tokenize uncommitted/sina.com.cn.html ... bench:   8837605 ns/iter (+/- 1238217)
test tokenize uncommitted/strong.html      ... bench:  44153393 ns/iter (+/- 5076161)
test tokenize uncommitted/webapps.html     ... bench: 136860951 ns/iter (+/- 8137049)
test tokenize uncommitted/wikipedia.html   ... bench:   4868854 ns/iter (+/- 797178)

SUMMARY

html5.html:       19%
sina.com.cn.html: 14%
strong.html:      47%
webapps.html:     27%
wikipedia.html:   21%
lipsum-1M.html:   17%

r? @kmcallister

Clark Gaebel added 6 commits November 25, 2014 13:35

cgaebel commented Mar 8, 2015

Optimization issues that I ran into, which led to a bunch of the "performance tweaks" in this patch:

http://internals.rust-lang.org/t/the-sad-state-of-zero-on-drop/944
rust-lang/rust#20219

The giant state machine function had stuff on the stack (with a destructor) in each match arm, and I believe this was the root cause of the exploding stack usage: zero-on-drop confusing llvm. To fix it, I put each state in its own function. That isn't entirely a bad thing, because it makes stack traces a lot nicer to read when there are problems.
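The states-as-functions shape described here can be sketched roughly like this (a toy two-state tokenizer with hypothetical names, not html5ever's real states):

```rust
// Each state gets its own function that consumes input and returns the
// next state; the dispatch loop in `feed` is one small jump table
// instead of a giant match whose arms each hold droppable locals.
#[derive(Clone, Copy, PartialEq, Debug)]
enum State {
    Data,
    TagOpen,
    Done,
}

struct Tokenizer {
    input: Vec<u8>,
    pos: usize,
}

impl Tokenizer {
    fn next_byte(&mut self) -> Option<u8> {
        let b = self.input.get(self.pos).copied();
        self.pos += 1;
        b
    }

    // One function per state: locals live only for the duration of a step.
    fn step_data(&mut self) -> State {
        match self.next_byte() {
            Some(b'<') => State::TagOpen,
            Some(_) => State::Data,
            None => State::Done,
        }
    }

    fn step_tag_open(&mut self) -> State {
        match self.next_byte() {
            Some(_) => State::Data,
            None => State::Done,
        }
    }

    /// The inlined jump table: a tight loop dispatching on the state enum.
    fn feed(&mut self) -> State {
        let mut state = State::Data;
        while state != State::Done {
            state = match state {
                State::Data => self.step_data(),
                State::TagOpen => self.step_tag_open(),
                State::Done => State::Done,
            };
        }
        state
    }
}
```

Because each arm is now just a call, the big function's stack frame no longer has to reserve (and re-zero) space for every arm's temporaries at once.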

@kmcallister

@cgaebel: Did you investigate using `Span` just for the fast paths like `pop_except_from`? Keeping track of every `SingleChar` may be more trouble than it's worth.


cgaebel commented Mar 21, 2015

I did. If you break up runs of text on "non-hot" states, spans move out of their "empty or one" state and into the "many" state, which is much slower. It definitely made a huge difference, and this design was only found after I tried what you just said, because you're right: keeping track of every `SingleChar` is hard.

@kmcallister

How many buffers did those spans have on average? I'm thinking a small vector optimization could save us, or maybe finger trees.

@cgaebel

cgaebel commented Mar 21, 2015

The vast majority are spans over 0 or 1 buffers. That optimization is already implemented. Making spans handle more than that inline would greatly increase the size of each span, and increase the amount of memory traffic on the stack even for simple and common cases.
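The "0 or 1 buffers inline, heap only for the rare many case" layout being discussed could look something like this (illustrative names; not the actual html5ever/Iobuf types):

```rust
// A span is usually empty or backed by a single buffer, so those cases
// are stored inline; only the rare multi-buffer case allocates a Vec.
enum Span<Buf> {
    Empty,
    One(Buf),
    Many(Vec<Buf>),
}

impl<Buf> Span<Buf> {
    /// Appends a buffer, promoting Empty -> One -> Many as needed.
    fn push(&mut self, buf: Buf) {
        match std::mem::replace(self, Span::Empty) {
            Span::Empty => *self = Span::One(buf),
            Span::One(first) => *self = Span::Many(vec![first, buf]),
            Span::Many(mut v) => {
                v.push(buf);
                *self = Span::Many(v);
            }
        }
    }

    fn num_buffers(&self) -> usize {
        match self {
            Span::Empty => 0,
            Span::One(_) => 1,
            Span::Many(v) => v.len(),
        }
    }
}
```

Widening the inline storage beyond one buffer (a small-vector layout) would grow every `Span` on the stack, which is the memory-traffic cost the comment is pointing at.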

kmcallister added a commit to kmcallister/html5ever that referenced this pull request Mar 23, 2015
kmcallister added a commit to kmcallister/html5ever that referenced this pull request Mar 23, 2015
kmcallister added a commit to kmcallister/html5ever that referenced this pull request Mar 23, 2015
kmcallister added a commit to kmcallister/html5ever that referenced this pull request Jun 10, 2015
@kmcallister

Now #141.

kmcallister added a commit that referenced this pull request Jun 16, 2015
Based on #60 and #114.

Fixes #20.
Fixes #115.
kmcallister added a commit that referenced this pull request Jun 25, 2015
Based on #60 and #114.

Fixes #20.
Fixes #115.