read_to_end is very slow (>30x slower than 0.10) #15177

Closed · arjantop opened this issue Jun 25, 2014 · 10 comments

Labels: A-allocators (Area: Custom and system allocators), E-hard (Call for participation: Hard difficulty. Experience needed to fix: A lot.), I-slow (Issue: Problems and improvements with respect to performance of generated code.)

@arjantop (Contributor)

master:

    #[bench]
    fn read_to_end(b: &mut Bencher) {
        let bytes = Vec::from_elem(100, 10u8);
        b.iter(|| {
            let mut reader = BufReader::new(bytes.as_slice());
            black_box(reader.read_to_end())
        })
    }
running 1 test
test bench::read_to_end ... bench:      2534 ns/iter (+/- 88)

test result: ok. 0 passed; 0 failed; 0 ignored; 1 measured

0.10:

    #[bench]
    fn read_to_end(b: &mut BenchHarness) {
        let bytes = Vec::from_elem(100, 10u8);
        b.iter(|| {
            let mut reader = BufReader::new(bytes.as_slice());
            black_box(reader.read_to_end())
        })
    }
running 1 test
test bench::read_to_end ... bench:        80 ns/iter (+/- 1)

test result: ok. 0 passed; 0 failed; 0 ignored; 1 measured
@alexcrichton (Member)

Can you be sure that the 0.10 version isn't being optimized away, by returning the result of reader.read_to_end()?
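
(A minimal sketch of the concern, reusing the test harness from the snippets in this thread; the bench names are illustrative. If the closure's result is never observed, the optimizer is free to delete the read entirely, while black_box forces the result to be treated as used.)

    extern crate test;

    use test::{Bencher, black_box};
    use std::io::BufReader;

    // Result discarded: LLVM may prove the read has no observable
    // effect and optimize it away, yielding a misleadingly fast number.
    #[bench]
    fn read_discarded(b: &mut Bencher) {
        let bytes = Vec::from_elem(100, 10u8);
        b.iter(|| {
            let mut reader = BufReader::new(bytes.as_slice());
            let _ = reader.read_to_end();
        })
    }

    // Result passed through black_box: the value is treated as
    // escaping, so the read must actually happen on every iteration.
    #[bench]
    fn read_observed(b: &mut Bencher) {
        let bytes = Vec::from_elem(100, 10u8);
        b.iter(|| {
            let mut reader = BufReader::new(bytes.as_slice());
            black_box(reader.read_to_end())
        })
    }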

@arjantop (Contributor, Author)

Yes, it was part of a larger example when I noticed this. Wrapping the call in black_box does not change the results.
I have updated the examples above to include black_box.

@alexcrichton (Member)

I'm only able to run perf over the most current version, but it appears that essentially the entire time is spent in jemalloc, so this is perhaps just a benchmark that jemalloc does not perform well on?

@arjantop (Contributor, Author)

I have code that reads bytes one at a time with read_byte, does comparisons, and pushes them into a Vec, and it is 5x faster than a plain read_to_end.
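
(A rough sketch, not the commenter's actual code, of the byte-by-byte pattern being described, against the 2014-era std::io::Reader trait:)

    use std::io::Reader;

    fn read_all_byte_by_byte<R: Reader>(reader: &mut R) -> Vec<u8> {
        let mut out = Vec::new();
        loop {
            match reader.read_byte() {
                Ok(b) => out.push(b), // one byte per call
                Err(_) => break,      // EndOfFile (or any error) ends the loop
            }
        }
        out
    }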

@aturon added the A-io label Jun 25, 2014
@thestinger (Contributor)

Exporting JE_MALLOC_CONF=lg_dirty_mult:-1 results in a drastic performance increase by disabling dirty page purging, so jemalloc no longer spends time reducing resident memory (RES) usage. In practice, glibc's allocator never reduced peak RES anyway, because it uses sbrk for everything that's not an enormous allocation. It's a memory vs. performance trade-off, and I don't think we can do much about it other than bikeshedding the default lg_dirty_mult setting.

@brson (Contributor) commented Sep 16, 2014

@thestinger I'm not satisfied with that answer. A 30x performance regression in a common utility function is not something to sweep under the rug.

@brson reopened this Sep 16, 2014
@brson added the I-slow and A-allocators labels Sep 16, 2014
@thestinger (Contributor)

@brson: It's not a 30x performance regression in a common utility function. It's a micro-benchmark of the utility function where dirty page purging performs poorly.

    % ./foo --bench
    running 1 test
    test read_to_end ... bench:      4463 ns/iter (+/- 94)

    % JE_MALLOC_CONF=lg_dirty_mult:-1 ./foo --bench
    running 1 test
    test read_to_end ... bench:       307 ns/iter (+/- 6)

    % JE_MALLOC_CONF=lg_dirty_mult:2 ./foo --bench
    running 1 test
    test read_to_end ... bench:       267 ns/iter (+/- 23)

    % JE_MALLOC_CONF=lg_dirty_mult:3 ./foo --bench
    running 1 test
    test read_to_end ... bench:      4451 ns/iter (+/- 109)

It's a clear-cut memory vs. performance trade-off: either it consumes more memory, or it spends time purging the dirty pages. Disabling purging by default would fix the regression on some microbenchmarks, but memory usage would regress back to where it was before jemalloc was used.

Allocators need to make many performance compromises, and this happens to be one with a user-facing knob. It could be exposed from the heap module with je_mallctl. I don't really see the point of fretting over the performance trade-off when it's configurable at runtime. It's not really an actionable bug unless micro-benchmarks are going to be prioritized over memory usage.
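
(For illustration, a rough sketch of what exposing the knob via je_mallctl might look like: reading the configured opt.lg_dirty_mult through FFI. The je_-prefixed symbol name and its linkability are assumptions here, not something confirmed in this thread.)

    extern crate libc;

    use libc::{c_char, c_int, c_void, size_t, ssize_t};

    extern {
        // jemalloc's control interface (symbol name is an assumption).
        fn je_mallctl(name: *const c_char, oldp: *mut c_void,
                      oldlenp: *mut size_t, newp: *mut c_void,
                      newlen: size_t) -> c_int;
    }

    // Query the dirty page purging ratio (log base 2); -1 means
    // purging is disabled.
    fn lg_dirty_mult() -> Option<ssize_t> {
        let mut val: ssize_t = -1;
        let mut len = std::mem::size_of::<ssize_t>() as size_t;
        let name = b"opt.lg_dirty_mult\0";
        let err = unsafe {
            je_mallctl(name.as_ptr() as *const c_char,
                       &mut val as *mut ssize_t as *mut c_void,
                       &mut len as *mut size_t,
                       0 as *mut c_void,
                       0)
        };
        if err == 0 { Some(val) } else { None }
    }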

jemalloc used to have a special case specifically to make microbenchmarks perform better with dirty page purging enabled, but the special case was removed because it was quite arbitrary. I think it's better for a microbenchmark to be an honest reflection of the performance on real workloads.

@thestinger (Contributor)

Perhaps it does make sense to disable purging by default to improve performance in trivial applications, and have large applications like the Rust compiler and Servo explicitly turn purging on. It's not the choice jemalloc makes upstream, though.

@thestinger (Contributor)

A demonstration of how the dirty page purging ratio plays into this: purging kicks in when dirty pages exceed a fixed fraction of active pages, so keeping a large allocation alive raises the active page count and the benchmark's small allocations never trigger purging.

Default settings:

    extern crate test;

    use test::{Bencher, black_box};
    use std::io::BufReader;

    #[bench]
    fn read_to_end(b: &mut Bencher) {
        let bytes = Vec::from_elem(100, 10u8);
        b.iter(|| {
            let mut reader = BufReader::new(bytes.as_slice());
            black_box(reader.read_to_end())
        })
    }

test read_to_end ... bench:      4458 ns/iter (+/- 918)

With a large allocation kept alive for the duration of the benchmark:

    extern crate test;

    use test::{Bencher, black_box};
    use std::io::BufReader;

    #[bench]
    fn read_to_end(b: &mut Bencher) {
        // Keeping this large allocation alive raises the active page
        // count, so the benchmark's small allocations stay below the
        // dirty page threshold and purging is never triggered.
        let _huge = Vec::<u64>::with_capacity(100000);
        let bytes = Vec::from_elem(100, 10u8);
        b.iter(|| {
            let mut reader = BufReader::new(bytes.as_slice());
            black_box(reader.read_to_end())
        })
    }

test read_to_end ... bench:       280 ns/iter (+/- 3)

@thestinger self-assigned this Sep 27, 2014
@thestinger added the E-hard label Sep 27, 2014
@thestinger (Contributor)

Closing in favour of #18236, which is an actionable issue. There is no performance bug here, but there is a tunable performance vs. memory trade-off, and we should consider a different default.
