Fix read_to_end to not grow an exact size buffer #89165
Conversation
Thanks for the pull request, and welcome! The Rust team is excited to review your changes, and you should hear from @Mark-Simulacrum (or someone else) soon. Please see the contribution instructions for more information.
(Force-pushed from 30cd7e6 to 6ff4c58.)
If you know how much data to expect and use `Vec::with_capacity` to pre-allocate a buffer of that capacity, `Read::read_to_end` will still double its capacity. It needs some space to perform a read, even though that read ends up returning `0`. It's a bummer to carefully pre-allocate 1GB to read a 1GB file into memory and end up using 2GB.

This fixes that behavior by special-casing a full buffer and reading into a small "probe" buffer instead. If that read returns `0` then it's confirmed that the buffer was the perfect size. If it doesn't, the probe buffer is appended to the normal buffer and the read loop continues.

Fixing this allows several workarounds in the standard library to be removed:

- `Take` no longer needs to override `Read::read_to_end`.
- The `reservation_size` callback that allowed `Take` to inhibit the previous over-allocation behavior isn't needed.
- `fs::read` doesn't need to reserve an extra byte in `initial_buffer_size`.

Curiously, there was a unit test that specifically checked that `Read::read_to_end` *does* over-allocate. I removed that test, too.
(Force-pushed from 6ff4c58 to 9b9c24e.)
@bors try @rust-timer queue

I don't expect this to turn up any surprises, but that's why they're called "surprises".
Awaiting bors try build completion. @rustbot label: +S-waiting-on-perf

⌛ Trying commit 9b9c24e with merge 4b920a40932f74a7159435b06d96cb50212514ff...
This will cause the free functions to make an extra syscall. An alternative would be to update the documentation to recommend passing some extra capacity.
☀️ Try build successful - checks-actions

Queued 4b920a40932f74a7159435b06d96cb50212514ff with parent ce45663, future comparison URL.
It shouldn't cause an extra syscall for them since they pass an exact-fit buffer. With or without this change, there will be n syscalls that return data plus a final one that returns `0`. When the buffer is not an exact fit then there might be one extra `read`.
Finished benchmarking commit (4b920a40932f74a7159435b06d96cb50212514ff): comparison url.

Summary: This change led to moderate relevant mixed results 🤷 in compiler performance.

If you disagree with this performance assessment, please file an issue in rust-lang/rustc-perf.

Benchmarking this pull request likely means that it is perf-sensitive, so we're automatically marking it as not fit for rolling up. While you can manually mark this PR as fit for rollup, we strongly recommend not doing so since this PR led to changes in compiler perf.

Next Steps: If you can justify the regressions found in this try perf run, please indicate this with `@bors rollup=never`.
I would normally think this was noise, but it looks like a lot of small regressions about the same size. Is that a coincidence, or is this causing an issue? Might be worth running the two toolchains with strace and seeing if the syscalls look noticeably different.

Will do! What's a good way to do that? I know strace well but this is my first contribution to this project and I'm not very familiar with testing it. Are you suggesting stracing on my computer or is there a CI tool I should be using? To test this PR I've been running Also, is there some way I can re-test the regressions in the report? I don't really understand what the benchmark did or how to read the results.

@jkugelman I'd suggest using https://rustc-dev-guide.rust-lang.org/compiler-debugging.html#downloading-artifacts-from-rusts-ci to install the exact before-and-after builds from the try above, the same ones used in the comparison. Then use both builds to do both a large and a small rustc compile, under `strace`.
@joshtriplett I wrote a script to count the number of `read` syscalls across 10 runs of each build:

```
old (ce45663e14dac3f0f58be698cc530bc2e6e21682): reads=4405, bytes_read=30448537 (0)
new (4b920a40932f74a7159435b06d96cb50212514ff): reads=4395, bytes_read=30435031 (0)
old (ce45663e14dac3f0f58be698cc530bc2e6e21682): reads=4404, bytes_read=30448537 (1)
new (4b920a40932f74a7159435b06d96cb50212514ff): reads=4395, bytes_read=30435031 (1)
old (ce45663e14dac3f0f58be698cc530bc2e6e21682): reads=4404, bytes_read=30448537 (2)
new (4b920a40932f74a7159435b06d96cb50212514ff): reads=4394, bytes_read=30435031 (2)
old (ce45663e14dac3f0f58be698cc530bc2e6e21682): reads=4404, bytes_read=30448537 (3)
new (4b920a40932f74a7159435b06d96cb50212514ff): reads=4394, bytes_read=30435031 (3)
old (ce45663e14dac3f0f58be698cc530bc2e6e21682): reads=4404, bytes_read=30448537 (4)
new (4b920a40932f74a7159435b06d96cb50212514ff): reads=4394, bytes_read=30435031 (4)
old (ce45663e14dac3f0f58be698cc530bc2e6e21682): reads=4404, bytes_read=30448537 (5)
new (4b920a40932f74a7159435b06d96cb50212514ff): reads=4395, bytes_read=30435031 (5)
old (ce45663e14dac3f0f58be698cc530bc2e6e21682): reads=4404, bytes_read=30448537 (6)
new (4b920a40932f74a7159435b06d96cb50212514ff): reads=4394, bytes_read=30435031 (6)
old (ce45663e14dac3f0f58be698cc530bc2e6e21682): reads=4404, bytes_read=30448537 (7)
new (4b920a40932f74a7159435b06d96cb50212514ff): reads=4394, bytes_read=30435031 (7)
old (ce45663e14dac3f0f58be698cc530bc2e6e21682): reads=4406, bytes_read=30448537 (8)
new (4b920a40932f74a7159435b06d96cb50212514ff): reads=4394, bytes_read=30435031 (8)
old (ce45663e14dac3f0f58be698cc530bc2e6e21682): reads=4405, bytes_read=30448537 (9)
new (4b920a40932f74a7159435b06d96cb50212514ff): reads=4394, bytes_read=30435031 (9)
```
Is this what you had in mind? If so I can go throw a big hunk of Rust code at it and see what happens.

I'm genuinely surprised that it's non-deterministic. I'm primarily interested in seeing the actual differences in the straces. Can you upload an strace from each toolchain for comparison? (Also, you may want to invoke the toolchain directly rather than via the rustup wrapper, if possible, to minimize noise.)
old-ce45663e14dac3f0f58be698cc530bc2e6e21682.full.strace.gz |
I found it really difficult to visually diff them. I spent about half an hour trying to clean them up and hide all the spurious differences so I could find the "real" diffs. The (apparent) non-determinism was a headache so I went with counting overall reads instead. I don't know if you know an easy trick to compare the files. If not, I've got a good supply of elbow grease. I can go do more thorough analysis. I'm sure you don't want to spend hours on my PR. (And I really appreciate your help so far!)
Hm, so locally I seem to get entirely reproducible results:

```
$ sudo perf stat -r 30 -e syscalls:sys_enter_read ~/.rustup/toolchains/ce45663e14dac3f0f58be698cc530bc2e6e21682/bin/rustc t.rs

 Performance counter stats for '/home/mark/.rustup/toolchains/ce45663e14dac3f0f58be698cc530bc2e6e21682/bin/rustc t.rs' (30 runs):

              4370      syscalls:sys_enter_read

          0.109497 +- 0.000566 seconds time elapsed  ( +- 0.52% )

$ sudo perf stat -r 30 -e syscalls:sys_enter_read ~/.rustup/toolchains/4b920a40932f74a7159435b06d96cb50212514ff/bin/rustc t.rs

 Performance counter stats for '/home/mark/.rustup/toolchains/4b920a40932f74a7159435b06d96cb50212514ff/bin/rustc t.rs' (30 runs):

              4360      syscalls:sys_enter_read

          0.108526 +- 0.000544 seconds time elapsed  ( +- 0.50% )
```

This seems like it's both deterministic (I ran the above command several times; same result) and gives us the (expected) slight decrease in syscalls.

A good number of the regressions seem to be in -doc builds, and in some of those it looks like query counts went down (presumably since this PR deletes a function from std?). That creates some amount of noise in e.g. cachegrind diffs, as we're just doing different/less work. My inclination is to not consider these as blockers for this PR.

However, I would like to get some sense why you're seeing nondeterminism with the syscalls -- what system are you running on? Can you try with the above perf commands and maybe see if those are OK?
The non-determinism was on my end. I fixed my script to invoke the toolchains directly instead of through the rustup wrapper, and now I get consistent counts.
Script attached for transparency: count-reads.gz |
Ah, great, OK. Sounds like we're on the same page then.

I'm still not sure exactly where the regressions in the perf run are coming from. Cycle count regressions are limited to -doc benchmarks entirely, which is a little weird, but may point to the regressions actually being caused by the change to the standard library (e.g., driven by different hashmap layout or something). I am inclined to move ahead -- @joshtriplett, do you have any remaining concerns with this PR? I think the regressions are fairly minimal and not readily diagnosable (cachegrind doesn't point to anything obviously responsible, for example).
I reviewed both straces carefully, filtering out spurious differences, and I don't see any issues either. This seems good to go ahead with. (I do think there are some interesting things to learn from the straces, for the purposes of future performance improvements. Thanks for collecting those!)

@bors r+
📌 Commit 9b9c24e has been approved by `joshtriplett`
It looks like almost all of the changes in read calls are in the linker process, and caused by changes in ELF binaries. The only meaningful behavior change is this:

Old:

New:

So the first call no longer reads an extra byte, and then the second call tries to read 32 bytes rather than 1. This matches the description of the "probe buffer" approach. I think there'd be some value in trying to eliminate even the "probe buffer" call, but in the meantime, this seems like an improvement.
☀️ Test successful - checks-actions
Finished benchmarking commit (d25de31): comparison url.

Summary: This change led to small relevant regressions 😿 in compiler performance.

If you disagree with this performance assessment, please file an issue in rust-lang/rustc-perf.

Next Steps: If you can justify the regressions found in this perf run, please indicate this with `@rustbot label: +perf-regression`.
I created a follow-up issue: #89516.