
Improve zerofill in Vec::resize and Read::read_to_end #26849


Merged
merged 2 commits into rust-lang:master from read-to-end-memset on Jul 8, 2015

Conversation

bluss
Member

@bluss bluss commented Jul 7, 2015

Improve zerofill in Vec::resize and Read::read_to_end

We needed a more efficient way to zerofill the vector in read_to_end,
to reduce the memory initialization overhead to a minimum.

Use the implementation of std::vec::from_elem (used for the vec![]
macro) for Vec::resize as well. For simple element types like u8, this
compiles to memset, so it makes Vec::resize much more efficient.

@rust-highfive
Contributor

r? @nikomatsakis

(rust_highfive has picked a reviewer for you, use r? to override)

buf.set_len(len + additional);
buf[len..].set_memory(0);
}
}
Member Author

Maybe this should be a method on Vec<u8> or even Vec<T> where T: Copy in general (with element supplied). The latter should work since a plain for loop with *elt = 0 will be recognized as memset for the T=u8 case.
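
A quick sketch of the generic fill idea described above (an illustrative helper, not code from this PR), assuming a Copy element type: a plain write loop over the slice, which LLVM recognizes as a memset when T = u8 and the value is zero.

fn fill_tail<T: Copy>(buf: &mut [T], value: T) {
    // Element-wise stores; for T = u8 and value == 0 this lowers to memset.
    for elt in buf {
        *elt = value;
    }
}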

Member Author

Maybe .resize() would be fine if it used a for loop instead of extend. I'll look. Actually, I take that back: doing it right would just be another place with a complicated for loop.

@nikomatsakis
Contributor

looks good to me, worthwhile hack for now

@nikomatsakis
Contributor

but do you have any kind of measurements (or a benchmark)?

@nikomatsakis
Contributor

cc @alexcrichton

@alexcrichton
Member

I've personally not been a huge fan of small optimizations like this in the codebase in the past as it's basically impossible to keep track of "what incantation today optimizes to what we want". In the ideal world we'd fix extend + take and then add a regression test that it always turns into a memset, but I realize that may be difficult. Regardless I agree with @nikomatsakis that I'd at least like to see comparison numbers to see how much of a difference this makes.

@gmjosack
Contributor

gmjosack commented Jul 7, 2015

This bit me today in BufReader. The default size of BufReader is 64k so having suboptimal zeroing made my code slower than similar code in Python.

@AlisdairO
Contributor

This PR came out of a post here: https://users.rust-lang.org/t/reading-from-stdin-performance/2025 . The difference is pretty severe - for the shootout reverse-complement test, maybe 15-20% of the total runtime is taken up by zeroing the buffer for a file that is read in. Zero-initializing a Vec via extend is about 10x slower than memset.

If the sentiment is against adding small hacks like this, I'd favour either (if it's reasonably achievable) making extend optimise appropriately as soon as possible, or adding a simpler method to vec for Copy data that will optimise more reliably (see https://users.rust-lang.org/t/reading-from-stdin-performance/2025/13?u=alisdairo ). I realise that we probably don't want to pollute Vec with too many methods, but the inability to safely 0-init/extend a buffer with reasonable performance is a pretty bad black eye for a lot of performance sensitive work - especially in a world where we avoid use of uninitialised buffers.

@bluss
Member Author

bluss commented Jul 8, 2015

I did the benchmarks in that forum thread, but those only compare different ways to zerofill a vector (code here). I'll use @AlisdairO's test and try this new code in a simple file-reading benchmark.

(The benchmark is for zeroing 64 kB on a corei7-avx.)

running 6 tests
test fillvec_extend     ... bench:     179,061 ns/iter (+/- 1,392) = 365 MB/s
test fillvec_memset     ... bench:      17,433 ns/iter (+/- 566) = 3759 MB/s
test fillvec_push       ... bench:      64,536 ns/iter (+/- 961) = 1015 MB/s
test fillvec_resize     ... bench:     178,781 ns/iter (+/- 4,458) = 366 MB/s
test fillvec_set_memory ... bench:      17,391 ns/iter (+/- 178) = 3768 MB/s
test fillvec_setlen     ... bench:      17,398 ns/iter (+/- 443) = 3766 MB/s

After the forum post I confirmed .set_memory(0) benches the same as memset.
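
For reference, a minimal sketch of the shape of these fillvec benchmarks (assuming the nightly test crate; the fillvec_* names and the 64 kB size mirror the thread, but this is not the exact benchmark source):

#![feature(test)]
extern crate test;

use std::iter;
use test::Bencher;

const N: usize = 64 * 1024;

#[bench]
fn fillvec_extend(b: &mut Bencher) {
    // Iterator-based zerofill: at the time this did not optimize to memset.
    b.bytes = N as u64;
    b.iter(|| {
        let mut v: Vec<u8> = Vec::with_capacity(N);
        v.extend(iter::repeat(0u8).take(N));
        v
    });
}

#[bench]
fn fillvec_vec_macro(b: &mut Bencher) {
    // vec![0; N] goes through from_elem and compiles to a single memset.
    b.bytes = N as u64;
    b.iter(|| vec![0u8; N]);
}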

@bluss
Member Author

bluss commented Jul 8, 2015

@alexcrichton It's a pretty important optimization -- it's the same area where we have contemplated even using uninitialized memory. So just calling a better memset is necessary: we need to do this, we just need to find the best way to do it.

@gmjosack
Contributor

gmjosack commented Jul 8, 2015

Just to clarify, this same pattern is used directly in BufReader.

Would fixing that be in scope for this PR or should I file a separate bug?

@bluss
Member Author

bluss commented Jul 8, 2015

@gmjosack: Those two lines should just be vec![0; cap], but then the question becomes: does that compile to memset? One could hope, but experience says not to count on it.

Benchmark check... it does! Huzzah!

test fillvec_vec_macro  ... bench:      17,805 ns/iter (+/- 625) = 3680 MB/s

Edit: Added a commit for that. I think that's enough?
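
In rough form, the pattern being replaced when allocating BufReader's zeroed buffer (illustrative helpers only; this is not the actual libstd constructor):

fn make_buf_old(cap: usize) -> Vec<u8> {
    // Old idiom: reserve, then zerofill via an iterator -- not a memset.
    let mut buf = Vec::with_capacity(cap);
    buf.extend(std::iter::repeat(0u8).take(cap));
    buf
}

fn make_buf_new(cap: usize) -> Vec<u8> {
    // vec![0; cap] lowers to one allocation plus a memset.
    vec![0u8; cap]
}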

@gmjosack
Contributor

gmjosack commented Jul 8, 2015

Thanks! Looking forward to this PR making it through. This had a pretty significant impact on some code discussed on Reddit.

@bluss
Member Author

bluss commented Jul 8, 2015

@gmjosack The good news is that zeroing memory can be 10x faster; is that enough to reach parity with your Python program?

@gmjosack
Contributor

gmjosack commented Jul 8, 2015

Yeah, even dropping with_capacity to 2k (down from the default 64k) increased performance significantly.

@bluss
Member Author

bluss commented Jul 8, 2015

it's basically impossible to keep track of "what incantation today optimizes to what we want"

I think the code in grow_zerofill does what it says, and it performs well.

Instead, tweaking and trying to find safe code that happens to compile to memset would be fragile, and would fit the description of "what incantation today optimizes to what we want" much better. This code is head-on: we use write_memory, the intrinsic that's just another name for memset.
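
For context, a minimal sketch of the grow_zerofill shape under discussion (reserve the extra space, extend the length, then zero the new tail); the stable std::ptr::write_bytes call below stands in for the set_memory/write_memory wrapper in the actual diff:

unsafe fn grow_zerofill(buf: &mut Vec<u8>, additional: usize) {
    let len = buf.len();
    buf.reserve(additional);
    buf.set_len(len + additional);
    // Zero the freshly exposed tail; this is literally a memset.
    std::ptr::write_bytes(buf.as_mut_ptr().add(len), 0u8, additional);
}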

@AlisdairO
Contributor

The more I think about this the more I think it makes sense to add a method to Vec to extend/initialise to some constant. I can understand reluctance about over-polluting Vec, but I think there's a not-so-terrible argument even in the absence of a performance improvement - which is that the ergonomics of extend are not especially great for simple initialisation.

Add to that the fact that the simpler code is clearly easier for the compiler to optimise, making it possible to avoid building up hacks in the codebase to work around it.

@bluss bluss force-pushed the read-to-end-memset branch from c0d0a80 to e2d3168 on July 8, 2015 08:58
@bluss
Member Author

bluss commented Jul 8, 2015

Benchmark for simplest possible input file → vector data grab.

Using unbuffered File since we'll read directly into the Vec anyway.

// based on @AlisdairO's test
use std::env;
use std::fs;
use std::io::Read;

fn main() {
    let arg = env::args().nth(1).unwrap();
    let mut f = fs::File::open(arg).unwrap();
    let mut data = Vec::with_capacity(1024);
    f.read_to_end(&mut data).unwrap();
    println!("{}", data.len());
}

Input file size: 493 MB. Compiler options: rustc -O. Comparing against just the recent nightly should be sufficient. Average of three runs of ./readfile inputfile.

rustc 1.3.0-nightly (20f421c 2015-07-06)

        803,873272      task-clock (msec)         #    0,997 CPUs utilized            ( +-  0,18% )
                66      context-switches          #    0,082 K/sec                    ( +- 89,34% )
                 1      cpu-migrations            #    0,001 K/sec                    ( +-100,00% )
           125 873      page-faults               #    0,157 M/sec                    ( +-  0,27% )
     2 151 672 809      cycles                    #    2,677 GHz                      ( +-  0,09% )
       755 711 632      stalled-cycles-frontend   #   35,12% frontend cycles idle     ( +-  0,29% )
       528 451 567      stalled-cycles-backend    #   24,56% backend  cycles idle     ( +-  0,44% )
     4 393 447 830      instructions              #    2,04  insns per cycle        
                                                  #    0,17  stalled cycles per insn  ( +-  0,04% )
     1 168 243 252      branches                  # 1453,268 M/sec                    ( +-  0,03% )
           375 102      branch-misses             #    0,03% of all branches          ( +-  0,58% )

       0,805964240 seconds time elapsed                                          ( +-  0,32% )

With this PR:

        474,183825      task-clock (msec)         #    0,984 CPUs utilized            ( +-  0,76% )
               321      context-switches          #    0,677 K/sec                    ( +- 52,58% )
                 1      cpu-migrations            #    0,003 K/sec                    ( +- 25,00% )
           137 635      page-faults               #    0,290 M/sec                    ( +-  0,45% )
     1 260 766 605      cycles                    #    2,659 GHz                      ( +-  0,60% )
       861 361 099      stalled-cycles-frontend   #   68,32% frontend cycles idle     ( +-  0,52% )
       596 525 192      stalled-cycles-backend    #   47,31% backend  cycles idle     ( +-  0,50% )
       889 947 773      instructions              #    0,71  insns per cycle        
                                                  #    0,97  stalled cycles per insn  ( +-  0,83% )
       152 589 084      branches                  #  321,793 M/sec                    ( +-  0,88% )
           422 393      branch-misses             #    0,28% of all branches          ( +-  2,62% )

       0,482055693 seconds time elapsed                                          ( +-  1,43% )

And yes, for some reason perf localizes the decimal point..

Edit: Timings using Vec::with_capacity(500 MB) are 0,718637177 ( +- 1,28% ) and 0,340066534 ( +- 1,14% ) respectively, so that's an independent degree of improvement (that the user already controls).

@bluss
Member Author

bluss commented Jul 8, 2015

Like @Stebalien said on the forum, there is room for improvement. This is just the general Read::read_to_end implementation. In File::read_to_end, we have no user code between the File and the Read trait, so we don't need to initialize memory at all if File specializes its implementation.

@bluss
Member Author

bluss commented Jul 8, 2015

Improving .resize() instead looks like this. This benchmark verifies that resize performs a memset, and validates that vec![0; cap] still performs well. (benchmark source)

test fillvec_extend     ... bench:      92,298 ns/iter (+/- 11,857) = 710 MB/s
test fillvec_memset     ... bench:      21,047 ns/iter (+/- 629) = 3113 MB/s
test fillvec_push       ... bench:      68,248 ns/iter (+/- 1,908) = 960 MB/s
test fillvec_resize     ... bench:      21,093 ns/iter (+/- 2,601) = 3106 MB/s
test fillvec_set_memory ... bench:      21,043 ns/iter (+/- 281) = 3114 MB/s
test fillvec_setlen     ... bench:      21,052 ns/iter (+/- 635) = 3113 MB/s
test fillvec_vec_macro  ... bench:      21,128 ns/iter (+/- 2,676) = 3101 MB/s
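
To connect this back to read_to_end: with resize sharing from_elem's fill, the zero-extend-then-read loop can be written roughly like this (an illustrative shape only, not the verbatim libstd implementation; the chunk size is arbitrary).

use std::io::{self, Read};

fn read_all<R: Read>(reader: &mut R, buf: &mut Vec<u8>) -> io::Result<usize> {
    let start = buf.len();
    loop {
        let len = buf.len();
        // Zero-extend by one chunk; for u8 this is now a single memset.
        buf.resize(len + 16 * 1024, 0);
        match reader.read(&mut buf[len..]) {
            Ok(0) => { buf.truncate(len); return Ok(len - start); }
            Ok(n) => buf.truncate(len + n),
            Err(ref e) if e.kind() == io::ErrorKind::Interrupted => buf.truncate(len),
            Err(e) => { buf.truncate(len); return Err(e); }
        }
    }
}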

@bluss bluss force-pushed the read-to-end-memset branch from e2d3168 to 7e4ce38 on July 8, 2015 13:01
@bluss bluss changed the title from "io: Use a more efficient way to zerofill the vector in read_to_end" to "Improve zerofill in Vec::resize and Read::read_to_end" on Jul 8, 2015
@bluss
Member Author

bluss commented Jul 8, 2015

Pushed a new angle of attack: improving Vec::resize by having it share code with std::vec::from_elem (which we use in vec![x; n]).

The new version of the PR introduces no new unsafe code; it only tweaks from_elem to share the implementation.
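
A rough sketch of what that sharing might look like (the signature and body are illustrative; the PR's private helper ended up named extend_with_element): fill the reserved tail through a raw pointer, then bump the length, which LLVM turns into a memset for T = u8 with a zero value.

fn extend_with_element<T: Clone>(v: &mut Vec<T>, n: usize, value: T) {
    v.reserve(n);
    let len = v.len();
    unsafe {
        let mut ptr = v.as_mut_ptr().add(len);
        for _ in 0..n {
            // Write each clone directly into the reserved capacity.
            std::ptr::write(ptr, value.clone());
            ptr = ptr.add(1);
        }
        // Only now expose the newly initialized elements.
        v.set_len(len + n);
    }
}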

@bluss bluss force-pushed the read-to-end-memset branch from 7e4ce38 to adbb363 on July 8, 2015 14:33
} else {
self.truncate(new_len);
}
}

/// Extend the vector by `n` additional clones of `value`.
#[inline(always)]
Member

Can you omit this #[inline(always)] tag? This seems like a fairly significant chunk of code to always inline. I would think the tag could be omitted entirely as well (this is already generic)

Member Author

It's only used in two locations and I want it to inline into them (Vec::resize and from_elem). Do you think this is not going to work out that way? It's a private function.

Member Author

OK, since #[inline(always)] "leaks out", this is not as intended.

@alexcrichton
Member

Nice wins @bluss! I'm always a fan of re-using or improving what exists :)

r=me with a few minor comments here and there

bluss added 2 commits July 8, 2015 19:40
We needed a more efficient way to zerofill the vector in read_to_end,
to reduce the memory initialization overhead to a minimum.

Use the implementation of `std::vec::from_elem` (used for the vec![]
macro) for Vec::resize as well. For simple element types like u8, this
compiles to memset, so it makes Vec::resize much more efficient.

Use the vec![] macro directly to create a sized, zeroed vector.

This should result in a big speedup when creating BufReader, because
vec![0; cap] compiles to a memset call, while the previous extend-based
code did not.
@bluss bluss force-pushed the read-to-end-memset branch from adbb363 to a5cc17a on July 8, 2015 17:41
@bluss
Member Author

bluss commented Jul 8, 2015

Removed the inline, addressed .clone(). I took the liberty of renaming extend_elem to extend_with_element.

@alexcrichton
Member

@bors: r+ a5cc17a

@bluss bluss added the relnotes label (Marks issues that should be documented in the release notes of the next release) on Jul 8, 2015
@bluss
Member Author

bluss commented Jul 8, 2015

relnotes → better perf is good news that might be worth bragging about.

@nikomatsakis
Contributor

nice.

@bors
Collaborator

bors commented Jul 8, 2015

⌛ Testing commit a5cc17a with merge 020d201...

bors added a commit that referenced this pull request Jul 8, 2015
Improve zerofill in Vec::resize and Read::read_to_end

We needed a more efficient way to zerofill the vector in read_to_end,
to reduce the memory initialization overhead to a minimum.

Use the implementation of `std::vec::from_elem` (used for the vec![]
macro) for Vec::resize as well. For simple element types like u8, this
compiles to memset, so it makes Vec::resize much more efficient.
@bors bors merged commit a5cc17a into rust-lang:master Jul 8, 2015
@bluss bluss deleted the read-to-end-memset branch July 8, 2015 22:50
bors added a commit that referenced this pull request Jul 9, 2015
In a follow-up to PR #26849, improve one more I/O location where we can
use `Vec::resize` to ensure better performance when zeroing buffers.

Use the `vec![elt; n]` macro everywhere we can in the tree. It replaces
`repeat(elt).take(n).collect()`, which is more verbose, requires type
hints, and right now produces worse code. `vec![]` is preferable for
vector initialization.

The `vec![]` replacement touches upon one I/O path too, Stdin::read
for Windows, and that should be a small improvement.

r? @alexcrichton
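
The follow-up's mechanical replacement, in miniature (the size here is just an example):

fn main() {
    // Old idiom: verbose, needs a type hint, and optimized worse at the time.
    let old: Vec<u8> = std::iter::repeat(0u8).take(1024).collect();
    // New idiom: one allocation plus a memset.
    let new = vec![0u8; 1024];
    assert_eq!(old, new);
}
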
@brson
Contributor

brson commented Jul 12, 2015

Thanks for tagging this for relnotes @bluss.

Labels
relnotes: Marks issues that should be documented in the release notes of the next release.
9 participants