Implement zero-copy parsing #114

Closed (3 commits)

kmcallister
Contributor

This is based on #60 but with substantial changes. The biggest difference is that we only use shared buffers for the character runs found by pop_except_from. The majority of the remaining spans are single ASCII characters, which have their own fast path. Everything else is a String as before.
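The three-way split described above can be pictured roughly as follows. This is only a minimal sketch under stated assumptions: `Span`, `to_text`, and `next_span` are hypothetical names invented for illustration, `Rc<str>` stands in for the shared I/O buffer, and `next_span` merely imitates the role of `pop_except_from`; none of it is the branch's actual code.

```rust
use std::rc::Rc;

/// Hypothetical span type for illustration; not html5ever's real definition.
#[derive(Debug, Clone, PartialEq)]
enum Span {
    /// A character run that points into a shared input buffer (no copy).
    Shared { buf: Rc<str>, start: usize, end: usize },
    /// A single ASCII character, stored inline (the fast path).
    Ascii(u8),
    /// Anything else falls back to an owned String.
    Owned(String),
}

impl Span {
    /// Copy the span's text out, for inspection only.
    fn to_text(&self) -> String {
        match self {
            Span::Shared { buf, start, end } => buf[*start..*end].to_string(),
            Span::Ascii(b) => (*b as char).to_string(),
            Span::Owned(s) => s.clone(),
        }
    }
}

/// Hypothetical stand-in for `pop_except_from`: starting at byte offset `pos`,
/// return either the longest run containing none of the `stop` characters
/// (as a shared span) or the single stop character found at `pos`,
/// together with the new offset.
fn next_span(buf: &Rc<str>, pos: usize, stop: &[char]) -> Option<(Span, usize)> {
    let rest: &str = &buf[pos..];
    let first = rest.chars().next()?;
    if stop.contains(&first) {
        let next = pos + first.len_utf8();
        let span = if first.is_ascii() {
            Span::Ascii(first as u8)
        } else {
            Span::Owned(first.to_string())
        };
        return Some((span, next));
    }
    // Length in bytes of the run up to the next stop character (or end of input).
    let run_len = rest.find(|c: char| stop.contains(&c)).unwrap_or(rest.len());
    let end = pos + run_len;
    // Cloning the Rc bumps a reference count; the text itself is not copied.
    Some((Span::Shared { buf: Rc::clone(buf), start: pos, end }, end))
}
```

In this sketch only the `Shared` case avoids an allocation per span; the `Ascii` case avoids it by being a plain byte.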

This branch also drops many of the micro-optimizations from #60. Unlike that PR, we leave the parser rules alone for the most part.

r? @Manishearth or @SimonSapin (general review)

r? @cgaebel (iobuf usage in tendril.rs)

Depending on the specific content and the I/O chunk size, this branch speeds up tokenization by up to a few percent. I did not see any significant performance regressions with sensible chunk sizes.

I have plans for further optimizations, including following up on the rustc bugs @cgaebel identified in #60.

The branch already achieves a significant drop in allocations and memory consumption:

(preliminary numbers)

Webapp spec, single page:

pre-zerocopy

846,520 allocs, 37,567,992 bytes allocated
maximum resident: 11,248 kB

zerocopy

95,690 allocs, 25,698,480 bytes allocated
maximum resident: 3,648 kB

Wikipedia (GotG from servo-static-suite)

pre-zerocopy

62,705 allocs, 2,299,424 bytes allocated
maximum resident: 3,948 kB

zerocopy

11,549 allocs, 2,355,984 bytes allocated
maximum resident: 3,596 kB
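
For reference, the relative changes implied by the preliminary figures above work out to roughly 89% fewer allocations, 32% fewer bytes allocated, and 68% lower maximum resident size for the Webapp spec page; for the Wikipedia page, roughly 82% fewer allocations, about 2.5% more bytes allocated, and 9% lower maximum resident size. The helper below is a hypothetical one-liner (not part of the branch) showing how those percentages are derived.

```rust
/// Percentage change from `before` to `after`; negative means a reduction.
/// Hypothetical helper for checking the figures quoted in this thread.
fn percent_change(before: u64, after: u64) -> f64 {
    (after as f64 - before as f64) / before as f64 * 100.0
}
```

For example, `percent_change(846_520, 95_690)` gives the allocation change for the Webapp spec page, and `percent_change(2_299_424, 2_355_984)` gives the small increase in bytes allocated for the Wikipedia page.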

/// The buffer.
- pub buf: String,
+ pub buf: Tendril,

Isn't pos redundant with buf now, since Iobuf has its own cursor?

Contributor Author

But the owned variant of a Tendril doesn't.

Eventually I would like to replace BufferQueue with a rope, as discussed elsewhere.
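
To make the point about the cursor concrete, here is a rough sketch, assuming a two-variant layout. `TendrilSketch` and `BufferSketch` are hypothetical names, and `Rc<str>` with a `start..end` window stands in for an Iobuf; the real types in tendril.rs may differ. The shared variant carries its own window into the buffer, but the owned variant is just a `String`, so a separate `pos` is what lets both be consumed the same way without shifting or copying the owned text.

```rust
use std::rc::Rc;

/// Hypothetical two-variant layout; not the actual definition in tendril.rs.
enum TendrilSketch {
    /// Shared: a window into a reference-counted buffer (has its own bounds).
    Shared { buf: Rc<str>, start: usize, end: usize },
    /// Owned: a plain String with no notion of a cursor.
    Owned(String),
}

impl TendrilSketch {
    fn as_str(&self) -> &str {
        match self {
            TendrilSketch::Shared { buf, start, end } => &buf[*start..*end],
            TendrilSketch::Owned(s) => s.as_str(),
        }
    }
}

/// Hypothetical queue entry that keeps `pos` alongside the tendril.
struct BufferSketch {
    /// Byte offset of the next unread character within `buf`.
    pos: usize,
    buf: TendrilSketch,
}

impl BufferSketch {
    /// Text not yet consumed, regardless of which variant holds it.
    fn remaining(&self) -> &str {
        &self.buf.as_str()[self.pos..]
    }

    /// Consume `n` bytes by moving the external cursor; nothing is copied.
    fn advance(&mut self, n: usize) {
        self.pos += n;
    }
}
```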

Good point!

kmcallister mentioned this pull request Mar 25, 2015

cgaebel commented Mar 25, 2015

Doesn't this mean that any long string of text (i.e. that spans multiple chunks) would have to fall back to a String? How big are chunks in Servo?

I guess that's a small price to pay for more efficient iteration.


#[inline(always)]
fn check_len(&self) {
    if self.len() > TENDRIL_MAX_LEN as usize {
Why not just use len32?

cgaebel commented Mar 25, 2015

Iobuf usage looks good to me (modulo comments)! I like this remix.

kmcallister added a commit to kmcallister/html5ever that referenced this pull request Jun 10, 2015
kmcallister
Contributor Author

Now #141.

kmcallister added a commit that referenced this pull request Jun 16, 2015
Based on #60 and #114.

Fixes #20.
Fixes #115.
kmcallister added a commit that referenced this pull request Jun 25, 2015
Based on #60 and #114.

Fixes #20.
Fixes #115.