read_to_end is very slow (>30x slower than 0.10) #15177

Closed · arjantop opened this issue Jun 25, 2014 · 10 comments

Labels: A-allocators (Area: Custom and system allocators), E-hard (Call for participation: Hard difficulty. Experience needed to fix: A lot.), I-slow (Issue: Problems and improvements with respect to performance of generated code.)

@arjantop (Contributor)

master:

    #[bench]
    fn read_to_end(b: &mut Bencher) {
        let bytes = Vec::from_elem(100, 10u8);
        b.iter(|| {
            let mut reader = BufReader::new(bytes.as_slice());
            black_box(reader.read_to_end())
        })
    }
running 1 test
test bench::read_to_end ... bench:      2534 ns/iter (+/- 88)

test result: ok. 0 passed; 0 failed; 0 ignored; 1 measured

0.10:

    #[bench]
    fn read_to_end(b: &mut BenchHarness) {
        let bytes = Vec::from_elem(100, 10u8);
        b.iter(|| {
            let mut reader = BufReader::new(bytes.as_slice());
            black_box(reader.read_to_end())
        })
    }
running 1 test
test bench::read_to_end ... bench:        80 ns/iter (+/- 1)

test result: ok. 0 passed; 0 failed; 0 ignored; 1 measured
@alexcrichton (Member)

Can you be sure that the 0.10 version isn't being optimized away, by returning the result of reader.read_to_end()?
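
(A minimal sketch of the concern, reusing the test harness from the snippets in this thread; the bench names are illustrative. If the closure's result is never observed, the optimizer is free to delete the read entirely, while black_box forces the result to be treated as used.)

    extern crate test;

    use test::{Bencher, black_box};
    use std::io::BufReader;

    // Result discarded: LLVM may prove the read has no observable
    // effect and optimize it away, yielding a misleadingly fast number.
    #[bench]
    fn read_discarded(b: &mut Bencher) {
        let bytes = Vec::from_elem(100, 10u8);
        b.iter(|| {
            let mut reader = BufReader::new(bytes.as_slice());
            let _ = reader.read_to_end();
        })
    }

    // Result passed through black_box: the value is treated as
    // escaping, so the read must actually happen on every iteration.
    #[bench]
    fn read_observed(b: &mut Bencher) {
        let bytes = Vec::from_elem(100, 10u8);
        b.iter(|| {
            let mut reader = BufReader::new(bytes.as_slice());
            black_box(reader.read_to_end())
        })
    }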

@arjantop (Contributor, Author)

Yes, it was part of a larger example when I noticed this. Wrapping the call in black_box does not change the results.
I have updated the examples above to include black_box.

@alexcrichton (Member)

I'm only able to run perf over the most current version, but it appears that essentially the entire time is spent in jemalloc, so this is perhaps just a benchmark that jemalloc does not perform well on?

@arjantop (Contributor, Author)

I have code that reads bytes one at a time with read_byte, does comparisons, and pushes them into a Vec, and it is 5x faster than a plain read_to_end.
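
(A rough sketch, not the commenter's actual code, of the byte-by-byte pattern being described, against the 2014-era std::io::Reader trait:)

    use std::io::Reader;

    fn read_all_byte_by_byte<R: Reader>(reader: &mut R) -> Vec<u8> {
        let mut out = Vec::new();
        loop {
            match reader.read_byte() {
                Ok(b) => out.push(b), // one byte per call
                Err(_) => break,      // EndOfFile (or any error) ends the loop
            }
        }
        out
    }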

@aturon added the A-io label Jun 25, 2014
@thestinger (Contributor)

Exporting JE_MALLOC_CONF=lg_dirty_mult:-1 results in a drastic performance increase by disabling dirty page purging, so jemalloc no longer spends time reducing resident memory (RES) usage. In practice, glibc's allocator never reduced peak RES anyway, because it uses sbrk for everything that's not an enormous allocation. It's a memory vs. performance trade-off, and I don't think we can do much about it other than bikeshedding the default lg_dirty_mult setting.

@brson (Contributor) commented Sep 16, 2014

@thestinger I'm not satisfied with that answer. A 30x performance regression in a common utility function is not something to sweep under the rug.

@brson reopened this Sep 16, 2014
@brson added the I-slow and A-allocators labels Sep 16, 2014
@thestinger (Contributor)

@brson: It's not a 30x performance regression in a common utility function. It's a micro-benchmark of the utility function where dirty page purging performs poorly.

    % ./foo --bench
    running 1 test
    test read_to_end ... bench:      4463 ns/iter (+/- 94)

    % JE_MALLOC_CONF=lg_dirty_mult:-1 ./foo --bench
    running 1 test
    test read_to_end ... bench:       307 ns/iter (+/- 6)

    % JE_MALLOC_CONF=lg_dirty_mult:2 ./foo --bench
    running 1 test
    test read_to_end ... bench:       267 ns/iter (+/- 23)

    % JE_MALLOC_CONF=lg_dirty_mult:3 ./foo --bench
    running 1 test
    test read_to_end ... bench:      4451 ns/iter (+/- 109)

It's a clear-cut memory vs. performance trade-off: either it consumes more memory, or it spends time purging the dirty pages. Disabling purging by default would fix the regression on some microbenchmarks, but memory usage would regress back to where it was before jemalloc was used.

Allocators need to make many performance compromises, and this happens to be one with a user-facing knob. It could be exposed from the heap module with je_mallctl. I don't really see the point of fretting over the performance trade-off when it's configurable at runtime. It's not really an actionable bug unless micro-benchmarks are going to be prioritized over memory usage.
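
(For illustration, a rough sketch of what exposing the knob via je_mallctl might look like: reading the configured opt.lg_dirty_mult through FFI. The je_-prefixed symbol name and its linkability are assumptions here, not something confirmed in this thread.)

    extern crate libc;

    use libc::{c_char, c_int, c_void, size_t, ssize_t};

    extern {
        // jemalloc's control interface (symbol name is an assumption).
        fn je_mallctl(name: *const c_char, oldp: *mut c_void,
                      oldlenp: *mut size_t, newp: *mut c_void,
                      newlen: size_t) -> c_int;
    }

    // Query the dirty page purging ratio (log base 2); -1 means
    // purging is disabled.
    fn lg_dirty_mult() -> Option<ssize_t> {
        let mut val: ssize_t = -1;
        let mut len = std::mem::size_of::<ssize_t>() as size_t;
        let name = b"opt.lg_dirty_mult\0";
        let err = unsafe {
            je_mallctl(name.as_ptr() as *const c_char,
                       &mut val as *mut ssize_t as *mut c_void,
                       &mut len as *mut size_t,
                       0 as *mut c_void,
                       0)
        };
        if err == 0 { Some(val) } else { None }
    }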

jemalloc used to have a special case specifically to make microbenchmarks perform better with dirty page purging enabled, but the special case was removed because it was quite arbitrary. I think it's better for a microbenchmark to be an honest reflection of the performance on real workloads.

@thestinger (Contributor)

Perhaps it does make sense to disable purging by default to improve performance in trivial applications, and have large applications like the Rust compiler and Servo explicitly turn purging on. It's not the choice jemalloc makes upstream, though.

@thestinger (Contributor)

A demonstration of how the dirty page purging ratio plays into this: purging kicks in when dirty pages exceed a fixed fraction of active pages, so keeping a large allocation alive raises the active page count and the benchmark's small allocations never trigger purging.

Default settings:

    extern crate test;

    use test::{Bencher, black_box};
    use std::io::BufReader;

    #[bench]
    fn read_to_end(b: &mut Bencher) {
        let bytes = Vec::from_elem(100, 10u8);
        b.iter(|| {
            let mut reader = BufReader::new(bytes.as_slice());
            black_box(reader.read_to_end())
        })
    }

test read_to_end ... bench:      4458 ns/iter (+/- 918)

With a large allocation kept alive for the duration of the benchmark:

    extern crate test;

    use test::{Bencher, black_box};
    use std::io::BufReader;

    #[bench]
    fn read_to_end(b: &mut Bencher) {
        // Keeping this large allocation alive raises the active page
        // count, so the benchmark's small allocations stay below the
        // dirty page threshold and purging is never triggered.
        let _huge = Vec::<u64>::with_capacity(100000);
        let bytes = Vec::from_elem(100, 10u8);
        b.iter(|| {
            let mut reader = BufReader::new(bytes.as_slice());
            black_box(reader.read_to_end())
        })
    }

test read_to_end ... bench:       280 ns/iter (+/- 3)

@thestinger self-assigned this Sep 27, 2014
@thestinger added the E-hard label Sep 27, 2014
@thestinger (Contributor)

Closing in favour of #18236, which is an actionable issue. There is no performance bug here, but there is a tunable performance vs. memory trade-off, and we should consider a different default.
