
Improve zerofill in Vec::resize and Read::read_to_end #26849


Merged
merged 2 commits into rust-lang:master from read-to-end-memset on Jul 8, 2015

Conversation

bluss
Member

@bluss bluss commented Jul 7, 2015

Improve zerofill in Vec::resize and Read::read_to_end

We needed a more efficient way to zerofill the vector in read_to_end,
to reduce the memory initialization overhead to a minimum.

Use the implementation of std::vec::from_elem (used for the vec![]
macro) for Vec::resize as well. For simple element types like u8, this
compiles to memset, so it makes Vec::resize much more efficient.

@rust-highfive
Contributor

r? @nikomatsakis

(rust_highfive has picked a reviewer for you, use r? to override)

buf.set_len(len + additional);
buf[len..].set_memory(0);
}
}
Member Author

Maybe this should be a method on Vec<u8> or even Vec<T> where T: Copy in general (with element supplied). The latter should work since a plain for loop with *elt = 0 will be recognized as memset for the T=u8 case.
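
A quick sketch of the generic fill idea described above (an illustrative helper, not code from this PR), assuming a Copy element type: a plain write loop over the slice, which LLVM recognizes as a memset when T = u8 and the value is zero.

fn fill_tail<T: Copy>(buf: &mut [T], value: T) {
    // Element-wise stores; for T = u8 and value == 0 this lowers to memset.
    for elt in buf {
        *elt = value;
    }
}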

Member Author

Maybe .resize() would be fine if it used a for loop instead of extend. I'll look. Actually, I take that back: doing it right would just be another place with a complicated for loop.

@nikomatsakis
Contributor

looks good to me, worthwhile hack for now

@nikomatsakis
Contributor

but do you have any kind of measurements (or a benchmark)?

@nikomatsakis
Contributor

cc @alexcrichton

@alexcrichton
Member

I've personally not been a huge fan of small optimizations like this in the codebase in the past as it's basically impossible to keep track of "what incantation today optimizes to what we want". In the ideal world we'd fix extend + take and then add a regression test that it always turns into a memset, but I realize that may be difficult. Regardless I agree with @nikomatsakis that I'd at least like to see comparison numbers to see how much of a difference this makes.

@gmjosack
Contributor

gmjosack commented Jul 7, 2015

This bit me today in BufReader. The default size of BufReader is 64k so having suboptimal zeroing made my code slower than similar code in Python.

@AlisdairO
Contributor

This PR came out of a post here: https://users.rust-lang.org/t/reading-from-stdin-performance/2025 . The difference is pretty severe - for the shootout reverse-complement test, maybe 15-20% of the total runtime is taken up by zeroing the buffer for a file that is read in. Zero-initializing a Vec via extend is about 10x slower than memset.

If the sentiment is against adding small hacks like this, I'd favour either (if it's reasonably achievable) making extend optimise appropriately as soon as possible, or adding a simpler method to vec for Copy data that will optimise more reliably (see https://users.rust-lang.org/t/reading-from-stdin-performance/2025/13?u=alisdairo ). I realise that we probably don't want to pollute Vec with too many methods, but the inability to safely 0-init/extend a buffer with reasonable performance is a pretty bad black eye for a lot of performance sensitive work - especially in a world where we avoid use of uninitialised buffers.

@bluss
Member Author

bluss commented Jul 8, 2015

I did the benchmarks in that forum thread, but those only compare different ways to zerofill a vector (code here). I'll use @AlisdairO's test and try this new code in a simple file-reading benchmark.

(The benchmark is for zeroing 64 kB on a corei7-avx.)

running 6 tests
test fillvec_extend     ... bench:     179,061 ns/iter (+/- 1,392) = 365 MB/s
test fillvec_memset     ... bench:      17,433 ns/iter (+/- 566) = 3759 MB/s
test fillvec_push       ... bench:      64,536 ns/iter (+/- 961) = 1015 MB/s
test fillvec_resize     ... bench:     178,781 ns/iter (+/- 4,458) = 366 MB/s
test fillvec_set_memory ... bench:      17,391 ns/iter (+/- 178) = 3768 MB/s
test fillvec_setlen     ... bench:      17,398 ns/iter (+/- 443) = 3766 MB/s

After the forum post I confirmed .set_memory(0) benches the same as memset.
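
For reference, a minimal sketch of the shape of these fillvec benchmarks (assuming the nightly test crate; the fillvec_* names and the 64 kB size mirror the thread, but this is not the exact benchmark source):

#![feature(test)]
extern crate test;

use std::iter;
use test::Bencher;

const N: usize = 64 * 1024;

#[bench]
fn fillvec_extend(b: &mut Bencher) {
    // Iterator-based zerofill: at the time this did not optimize to memset.
    b.bytes = N as u64;
    b.iter(|| {
        let mut v: Vec<u8> = Vec::with_capacity(N);
        v.extend(iter::repeat(0u8).take(N));
        v
    });
}

#[bench]
fn fillvec_vec_macro(b: &mut Bencher) {
    // vec![0; N] goes through from_elem and compiles to a single memset.
    b.bytes = N as u64;
    b.iter(|| vec![0u8; N]);
}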

@bluss
Member Author

bluss commented Jul 8, 2015

@alexcrichton It's a pretty important optimization -- it's the same area where we have contemplated even using uninitialized memory. So just calling a better memset is necessary: we need to do this, we just need to find the best way to do it.

@gmjosack
Contributor

gmjosack commented Jul 8, 2015

Just to clarify, this same pattern is used directly in BufReader.

Would fixing that be in scope for this PR or should I file a separate bug?

@bluss
Member Author

bluss commented Jul 8, 2015

@gmjosack: Those two lines should just be vec![0; cap], but then the question becomes: does that compile to memset? One could hope, but experience says not to count on it.

Benchmark check... it does! Huzzah!

test fillvec_vec_macro  ... bench:      17,805 ns/iter (+/- 625) = 3680 MB/s

Edit: Added a commit for that. I think that's enough?
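
In rough form, the pattern being replaced when allocating BufReader's zeroed buffer (illustrative helpers only; this is not the actual libstd constructor):

fn make_buf_old(cap: usize) -> Vec<u8> {
    // Old idiom: reserve, then zerofill via an iterator -- not a memset.
    let mut buf = Vec::with_capacity(cap);
    buf.extend(std::iter::repeat(0u8).take(cap));
    buf
}

fn make_buf_new(cap: usize) -> Vec<u8> {
    // vec![0; cap] lowers to one allocation plus a memset.
    vec![0u8; cap]
}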

@gmjosack
Contributor

gmjosack commented Jul 8, 2015

Thanks! Looking forward to this PR making it through. This had a pretty significant impact on some code discussed on Reddit.

@bluss
Member Author

bluss commented Jul 8, 2015

@gmjosack The good news is that zeroing memory can be 10x faster; is that enough to reach parity with your Python program?

@gmjosack
Contributor

gmjosack commented Jul 8, 2015

Yeah, even dropping with_capacity to 2k (down from the default 64k) increased performance significantly.

@bluss
Member Author

bluss commented Jul 8, 2015

it's basically impossible to keep track of "what incantation today optimizes to what we want"

I think the code in grow_zerofill does what it says, and it performs well.

Instead, tweaking and trying to find safe code that happens to compile to memset would be fragile, and would fit the description of "what incantation today optimizes to what we want" much better. This code is head-on: we use write_memory, the intrinsic that's just another name for memset.
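
For context, a minimal sketch of the grow_zerofill shape under discussion (reserve the extra space, extend the length, then zero the new tail); the stable std::ptr::write_bytes call below stands in for the set_memory/write_memory wrapper in the actual diff:

unsafe fn grow_zerofill(buf: &mut Vec<u8>, additional: usize) {
    let len = buf.len();
    buf.reserve(additional);
    buf.set_len(len + additional);
    // Zero the freshly exposed tail; this is literally a memset.
    std::ptr::write_bytes(buf.as_mut_ptr().add(len), 0u8, additional);
}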

@AlisdairO
Contributor

The more I think about this the more I think it makes sense to add a method to Vec to extend/initialise to some constant. I can understand reluctance about over-polluting Vec, but I think there's a not-so-terrible argument even in the absence of a performance improvement - which is that the ergonomics of extend are not especially great for simple initialisation.

Add to that the fact that the simpler code is clearly easier for the compiler to optimise, making it possible to avoid building up hacks in the codebase to work around it.

@bluss bluss force-pushed the read-to-end-memset branch from c0d0a80 to e2d3168 on July 8, 2015 08:58
@bluss
Member Author

bluss commented Jul 8, 2015

Benchmark for simplest possible input file → vector data grab.

Using unbuffered File since we'll read directly into the Vec anyway.

// based on @AlisdairO's test
use std::env;
use std::fs;
use std::io::Read;

fn main() {
    let arg = env::args().nth(1).unwrap();
    let mut f = fs::File::open(arg).unwrap();
    let mut data = Vec::with_capacity(1024);
    f.read_to_end(&mut data).unwrap();
    println!("{}", data.len());
}

Input file size: 493 MB. Compiler options: rustc -O. Comparing against just the recent nightly should be sufficient. Average of three runs of ./readfile inputfile.

rustc 1.3.0-nightly (20f421c 2015-07-06)

        803,873272      task-clock (msec)         #    0,997 CPUs utilized            ( +-  0,18% )
                66      context-switches          #    0,082 K/sec                    ( +- 89,34% )
                 1      cpu-migrations            #    0,001 K/sec                    ( +-100,00% )
           125 873      page-faults               #    0,157 M/sec                    ( +-  0,27% )
     2 151 672 809      cycles                    #    2,677 GHz                      ( +-  0,09% )
       755 711 632      stalled-cycles-frontend   #   35,12% frontend cycles idle     ( +-  0,29% )
       528 451 567      stalled-cycles-backend    #   24,56% backend  cycles idle     ( +-  0,44% )
     4 393 447 830      instructions              #    2,04  insns per cycle        
                                                  #    0,17  stalled cycles per insn  ( +-  0,04% )
     1 168 243 252      branches                  # 1453,268 M/sec                    ( +-  0,03% )
           375 102      branch-misses             #    0,03% of all branches          ( +-  0,58% )

       0,805964240 seconds time elapsed                                          ( +-  0,32% )

With this PR:

        474,183825      task-clock (msec)         #    0,984 CPUs utilized            ( +-  0,76% )
               321      context-switches          #    0,677 K/sec                    ( +- 52,58% )
                 1      cpu-migrations            #    0,003 K/sec                    ( +- 25,00% )
           137 635      page-faults               #    0,290 M/sec                    ( +-  0,45% )
     1 260 766 605      cycles                    #    2,659 GHz                      ( +-  0,60% )
       861 361 099      stalled-cycles-frontend   #   68,32% frontend cycles idle     ( +-  0,52% )
       596 525 192      stalled-cycles-backend    #   47,31% backend  cycles idle     ( +-  0,50% )
       889 947 773      instructions              #    0,71  insns per cycle        
                                                  #    0,97  stalled cycles per insn  ( +-  0,83% )
       152 589 084      branches                  #  321,793 M/sec                    ( +-  0,88% )
           422 393      branch-misses             #    0,28% of all branches          ( +-  2,62% )

       0,482055693 seconds time elapsed                                          ( +-  1,43% )

And yes, for some reason perf localizes the decimal point..

Edit: Timings using Vec::with_capacity(500 MB) are 0,718637177 ( +- 1,28% ) and 0,340066534 ( +- 1,14% ) respectively, so that's an independent degree of improvement (that the user already controls).

@bluss
Member Author

bluss commented Jul 8, 2015

Like @Stebalien said on the forum, there is room for improvement. This is just the general Read::read_to_end implementation. In File::read_to_end, we have no user code between the File and the Read trait, so we don't need to initialize memory at all if File specializes its implementation.

@bluss
Member Author

bluss commented Jul 8, 2015

Improving .resize() instead looks like this. This benchmark verifies that resize performs a memset, and validates that vec![0; cap] still performs well. (benchmark source)

test fillvec_extend     ... bench:      92,298 ns/iter (+/- 11,857) = 710 MB/s
test fillvec_memset     ... bench:      21,047 ns/iter (+/- 629) = 3113 MB/s
test fillvec_push       ... bench:      68,248 ns/iter (+/- 1,908) = 960 MB/s
test fillvec_resize     ... bench:      21,093 ns/iter (+/- 2,601) = 3106 MB/s
test fillvec_set_memory ... bench:      21,043 ns/iter (+/- 281) = 3114 MB/s
test fillvec_setlen     ... bench:      21,052 ns/iter (+/- 635) = 3113 MB/s
test fillvec_vec_macro  ... bench:      21,128 ns/iter (+/- 2,676) = 3101 MB/s
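
To connect this back to read_to_end: with resize sharing from_elem's fill, the zero-extend-then-read loop can be written roughly like this (an illustrative shape only, not the verbatim libstd implementation; the chunk size is arbitrary).

use std::io::{self, Read};

fn read_all<R: Read>(reader: &mut R, buf: &mut Vec<u8>) -> io::Result<usize> {
    let start = buf.len();
    loop {
        let len = buf.len();
        // Zero-extend by one chunk; for u8 this is now a single memset.
        buf.resize(len + 16 * 1024, 0);
        match reader.read(&mut buf[len..]) {
            Ok(0) => { buf.truncate(len); return Ok(len - start); }
            Ok(n) => buf.truncate(len + n),
            Err(ref e) if e.kind() == io::ErrorKind::Interrupted => buf.truncate(len),
            Err(e) => { buf.truncate(len); return Err(e); }
        }
    }
}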

@bluss bluss force-pushed the read-to-end-memset branch from e2d3168 to 7e4ce38 on July 8, 2015 13:01
@bluss bluss changed the title from "io: Use a more efficient way to zerofill the vector in read_to_end" to "Improve zerofill in Vec::resize and Read::read_to_end" on Jul 8, 2015
@bluss
Member Author

bluss commented Jul 8, 2015

Pushed a new angle of attack: improving Vec::resize by having it share code with std::vec::from_elem (which we use in vec![x; n]).

The new version of the PR introduces no new unsafe code; it only tweaks from_elem to share the implementation.
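
A rough sketch of what that sharing might look like (the signature and body are illustrative; the PR's private helper ended up named extend_with_element): fill the reserved tail through a raw pointer, then bump the length, which LLVM turns into a memset for T = u8 with a zero value.

fn extend_with_element<T: Clone>(v: &mut Vec<T>, n: usize, value: T) {
    v.reserve(n);
    let len = v.len();
    unsafe {
        let mut ptr = v.as_mut_ptr().add(len);
        for _ in 0..n {
            // Write each clone directly into the reserved capacity.
            std::ptr::write(ptr, value.clone());
            ptr = ptr.add(1);
        }
        // Only now expose the newly initialized elements.
        v.set_len(len + n);
    }
}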

@bluss bluss force-pushed the read-to-end-memset branch from 7e4ce38 to adbb363 on July 8, 2015 14:33
} else {
self.truncate(new_len);
}
}

/// Extend the vector by `n` additional clones of `value`.
#[inline(always)]
Member

Can you omit this #[inline(always)] tag? This seems like a fairly significant chunk of code to always inline. I would think the tag could be omitted entirely as well (this is already generic)

Member Author

It's only used in two locations and I want it to inline into them (Vec::resize and from_elem). Do you think this is not going to work out that way? It's a private function.

Member Author

OK, since #[inline(always)] "leaks out", this is not as intended.

@alexcrichton
Member

Nice wins @bluss! I'm always a fan of re-using or improving what exists :)

r=me with a few minor comments here and there

bluss added 2 commits July 8, 2015 19:40
We needed a more efficient way to zerofill the vector in read_to_end,
to reduce the memory initialization overhead to a minimum.

Use the implementation of `std::vec::from_elem` (used for the vec![]
macro) for Vec::resize as well. For simple element types like u8, this
compiles to memset, so it makes Vec::resize much more efficient.

Use the vec![] macro directly to create a sized, zeroed vector.

This should result in a big speedup when creating BufReader, because
vec![0; cap] compiles to a memset call, while the previous extend-based
code did not.
@bluss bluss force-pushed the read-to-end-memset branch from adbb363 to a5cc17a on July 8, 2015 17:41
@bluss
Member Author

bluss commented Jul 8, 2015

Removed the inline, addressed .clone(). I took the liberty of renaming extend_elem to extend_with_element.

@alexcrichton
Member

@bors: r+ a5cc17a

@bluss bluss added the relnotes label (Marks issues that should be documented in the release notes of the next release) on Jul 8, 2015
@bluss
Member Author

bluss commented Jul 8, 2015

relnotes → better perf is good news that might be worth bragging about.

@nikomatsakis
Contributor

nice.

@bors
Collaborator

bors commented Jul 8, 2015

⌛ Testing commit a5cc17a with merge 020d201...

bors added a commit that referenced this pull request Jul 8, 2015
Improve zerofill in Vec::resize and Read::read_to_end

We needed a more efficient way to zerofill the vector in read_to_end,
to reduce the memory initialization overhead to a minimum.

Use the implementation of `std::vec::from_elem` (used for the vec![]
macro) for Vec::resize as well. For simple element types like u8, this
compiles to memset, so it makes Vec::resize much more efficient.
@bors bors merged commit a5cc17a into rust-lang:master Jul 8, 2015
@bluss bluss deleted the read-to-end-memset branch July 8, 2015 22:50
bors added a commit that referenced this pull request Jul 9, 2015
In a follow-up to PR #26849, improve one more I/O location where we can
use `Vec::resize` to ensure better performance when zeroing buffers.

Use the `vec![elt; n]` macro everywhere we can in the tree. It replaces
`repeat(elt).take(n).collect()`, which is more verbose, requires type
hints, and right now produces worse code. `vec![]` is preferable for
vector initialization.

The `vec![]` replacement touches upon one I/O path too, Stdin::read
for Windows, and that should be a small improvement.

r? @alexcrichton
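
The follow-up's mechanical replacement, in miniature (the size here is just an example):

fn main() {
    // Old idiom: verbose, needs a type hint, and optimized worse at the time.
    let old: Vec<u8> = std::iter::repeat(0u8).take(1024).collect();
    // New idiom: one allocation plus a memset.
    let new = vec![0u8; 1024];
    assert_eq!(old, new);
}
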
@brson
Contributor

brson commented Jul 12, 2015

Thanks for tagging this for relnotes @bluss.

Labels
relnotes: Marks issues that should be documented in the release notes of the next release.
9 participants