Specialize more consume_iters
#631
Conversation
These consumers all exist as `std::iter` adaptors too (except for `update`, but that one is very simple), so we can specialize `consume_iter` to call them, to potentially benefit from any internal iteration optimizations they have. Part of rayon-rs#465.

The following example gets 8x faster (1.6s before, 0.2s after) on a 6 core (12 thread) machine:

```rust
extern crate test;
extern crate rayon;
use rayon::prelude::*;

fn main() {
    let count = (0..std::u32::MAX).into_par_iter().map(test::black_box).count();
    println!("{}", count);
}
```

For the built-in benchmarks, there don't seem to be any true slow-downs (some benchmarks are so variable that it's very hard to accurately determine the change), and there are a lot of speed-ups. This affects the `find` and `collect` benchmarks the most:

```
name                                find-old.txt ns/iter  find-new.txt ns/iter  diff ns/iter  diff %   speedup
find::size1::parallel_find_first    19,329                20,003                674           3.49%    x 0.97
find::size1::parallel_find_common   24,328                23,183                -1,145        -4.71%   x 1.05
find::size1::parallel_find_missing  2,131,717             1,687,769             -443,948      -20.83%  x 1.26
find::size1::parallel_find_last     1,606,984             1,201,934             -405,050      -25.21%  x 1.34
find::size1::parallel_find_middle   1,242,411             819,751               -422,660      -34.02%  x 1.52
```

```
name                                                                collect-old.txt ns/iter  collect-new.txt ns/iter  diff ns/iter  diff %   speedup
map_collect::i_to_i::with_mutex_vec                                 30,813,044               31,650,037               836,993       2.72%    x 0.97
map_collect::i_to_i::with_linked_list_collect_vec_sized             12,769,089               13,077,052               307,963       2.41%    x 0.98
map_collect::i_mod_10_to_i::with_mutex_vec                          8,222,478                8,316,029                93,551        1.14%    x 0.99
map_collect::i_to_i::with_fold                                      49,522,357               49,662,761               140,404       0.28%    x 1.00
map_collect::i_to_i::with_linked_list_collect                       31,901,399               31,558,372               -343,027      -1.08%   x 1.01
map_collect::i_mod_10_to_i::with_linked_list_collect                21,974,395               21,622,942               -351,453      -1.60%   x 1.02
vec_collect::vec_i::with_fold                                       35,716,513               34,863,214               -853,299      -2.39%   x 1.02
map_collect::i_to_i::with_fold_vec                                  51,523,306               50,049,271               -1,474,035    -2.86%   x 1.03
map_collect::i_to_i::with_linked_list_collect_vec                   23,086,079               22,443,957               -642,122      -2.78%   x 1.03
map_collect::i_mod_10_to_i::with_mutex                              80,962,611               77,863,092               -3,099,519    -3.83%   x 1.04
map_collect::i_to_i::with_mutex                                     204,617,301              196,787,223              -7,830,078    -3.83%   x 1.04
vec_collect::vec_i::with_collect                                    2,535,582                2,437,613                -97,969       -3.86%   x 1.04
vec_collect::vec_i::with_collect_into_vec                           2,665,272                2,531,158                -134,114      -5.03%   x 1.05
map_collect::i_to_i::with_vec_vec_sized                             14,525,832               13,627,798               -898,034      -6.18%   x 1.07
vec_collect::vec_i::with_collect_into_vec_reused                    1,227,069                1,109,099                -117,970      -9.61%   x 1.11
vec_collect::vec_i_filtered::with_fold                              41,675,555               36,487,521               -5,188,034    -12.45%  x 1.14
map_collect::i_mod_10_to_i::with_collect                            6,068,716                5,130,790                -937,926      -15.46%  x 1.18
map_collect::i_mod_10_to_i::with_vec_vec_sized                      6,249,221                5,276,541                -972,680      -15.56%  x 1.18
map_collect::i_mod_10_to_i::with_linked_list_collect_vec            5,996,844                5,026,503                -970,341      -16.18%  x 1.19
map_collect::i_mod_10_to_i::with_linked_list_map_reduce_vec_sized   5,897,802                4,976,394                -921,408      -15.62%  x 1.19
map_collect::i_to_i::with_collect                                   14,137,451               11,812,855               -2,324,596    -16.44%  x 1.20
map_collect::i_mod_10_to_i::with_linked_list_collect_vec_sized      6,187,408                4,996,645                -1,190,763    -19.24%  x 1.24
map_collect::i_to_i::with_linked_list_map_reduce_vec_sized          14,489,709               11,024,575               -3,465,134    -23.91%  x 1.31
map_collect::i_mod_10_to_i::with_fold                               1,690,818                1,265,901                -424,917      -25.13%  x 1.34
vec_collect::vec_i::with_linked_list_collect_vec                    23,295,026               16,697,819               -6,597,207    -28.32%  x 1.40
map_collect::i_mod_10_to_i::with_fold_vec                           2,652,629                1,707,870                -944,759      -35.62%  x 1.55
vec_collect::vec_i::with_linked_list_map_reduce_vec_sized           14,930,174               9,183,037                -5,747,137    -38.49%  x 1.63
vec_collect::vec_i::with_vec_vec_sized                              15,782,752               9,705,671                -6,077,081    -38.50%  x 1.63
vec_collect::vec_i_filtered::with_linked_list_collect_vec           30,026,717               18,254,188               -11,772,529   -39.21%  x 1.64
vec_collect::vec_i::with_linked_list_collect_vec_sized              15,408,185               8,607,308                -6,800,877    -44.14%  x 1.79
vec_collect::vec_i_filtered::with_vec_vec_sized                     22,714,855               9,594,991                -13,119,864   -57.76%  x 2.37
vec_collect::vec_i_filtered::with_linked_list_map_reduce_vec_sized  23,022,071               9,483,549                -13,538,522   -58.81%  x 2.43
vec_collect::vec_i_filtered::with_linked_list_collect_vec_sized     23,143,333               9,342,023                -13,801,310   -59.63%  x 2.48
vec_collect::vec_i_filtered::with_collect                           24,671,619               6,026,401                -18,645,218   -75.57%  x 4.09
```
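To illustrate the idea behind the specialization (this is a minimal sketch with made-up function names, not rayon's actual code): a generic consumer drives the iterator one `next()` call at a time, while a specialized `consume_iter` can defer to the adaptor's own internal iteration (e.g. `Iterator::fold` or `count`), which `std` adaptors like `Map` can implement more efficiently.

```rust
// Sketch only: contrasts external iteration (one `next()` per item) with
// deferring to the adaptor's internal iteration. Both compute the same
// result; the second form is what the specialized `consume_iter`s enable.

fn consume_one_by_one<I: Iterator<Item = u64>>(iter: I) -> u64 {
    let mut sum = 0;
    for x in iter {
        // generic path: one `next()` call per item
        sum += x;
    }
    sum
}

fn consume_internally<I: Iterator<Item = u64>>(iter: I) -> u64 {
    // deferring to the adaptor's own `fold` lets it use any
    // internal-iteration optimizations it has
    iter.fold(0, |acc, x| acc + x)
}

fn main() {
    let a = consume_one_by_one((0..1000u64).map(|x| x * 2));
    let b = consume_internally((0..1000u64).map(|x| x * 2));
    assert_eq!(a, b);
    println!("{}", a);
}
```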
Part of rayon-rs#465. This makes the following program 3x faster (1.2s before, 0.4s after) on a 6 core (12 thread) machine.

```rust
extern crate test;
extern crate rayon;
use rayon::prelude::*;

fn main() {
    let count = (0..std::u32::MAX)
        .into_par_iter()
        .map(Some)
        .while_some()
        .map(test::black_box)
        .count();
    println!("{}", count);
}
```
Part of rayon-rs#465. This makes the following program 10x faster (5.7s before, 0.55s after) on a 6 core (12 thread) machine.

```rust
extern crate test;
extern crate rayon;
use rayon::prelude::*;

fn main() {
    let count = (0..std::u32::MAX)
        .into_par_iter()
        .intersperse(1)
        .map(test::black_box)
        .count();
    println!("{}", count);
}
```
Hi @cuviper and/or @nikomatsakis, I'd love to see this progress, so a review would be appreciated. Thanks!
Woah! I expected only modest gains available here, not entire multiples. Thanks for doing this!

I think we need to be a little more careful about checking fullness, though. For a tangible (though exaggerated) example of this, I tried adding this in tests/octillion.rs:

```rust
#[test]
fn find_any_octillion() {
    let x = octillion().find_any(|&x| x > OCTILLION / 2);
    assert!(x.is_some());
}
```

On the master branch, this takes only a couple milliseconds, even in debug mode, as thread creation itself is the only real work. The higher splits will instantly find a match, short-circuiting the lower splits. (Assuming at least two threads -- if we really add this test, we should guarantee that like some of the other tests in that file do.) On your branch, the lower splits continue through their entire iterator, without looking up to see that the shared flag has already been set.
That's a good point! Fixing it for […]. I don't think that concern applies to […].
That's annoying. You may be able to get by with a `take_while`, though.
Ah, yes, you're right. It does need to load its own global flag as you're doing, since the base consumer won't check that, but the base can otherwise take care of itself.
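The pattern being discussed can be sketched like this (names and signatures are illustrative, not rayon's actual internals): the consumer still defers to the iterator's internal iteration, but wraps the input in a `take_while` that polls the shared "found" flag, so a sibling split's match still short-circuits it.

```rust
use std::sync::atomic::{AtomicBool, Ordering};

// Sketch: a `find_any`-style consumer that keeps internal iteration but
// performs the fullness check manually via `take_while` on a shared flag.
fn find_any_consume<I, P>(iter: I, predicate: P, found: &AtomicBool)
where
    I: Iterator<Item = u64>,
    P: Fn(&u64) -> bool,
{
    let result = iter
        // manual "fullness" check: stop as soon as another split has found a match
        .take_while(|_| !found.load(Ordering::Relaxed))
        .find(|x| predicate(x));
    if result.is_some() {
        // tell the other splits they can stop
        found.store(true, Ordering::Relaxed);
    }
}

fn main() {
    let found = AtomicBool::new(false);
    find_any_consume(0..1_000_000u64, |&x| x == 10, &found);
    assert!(found.load(Ordering::Relaxed));
    println!("found: {}", found.load(Ordering::Relaxed));
}
```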
Yes definitely! Should I also open an issue along the lines of "…"?
Sure, although I'm not sure what that change would look like, so if you have ideas please sketch them out. You'd still have the same problem if […].
I had the same thought about using an associated type, and I think not being able to reference the folder's state is an OK limitation: most things either are never full or need to communicate with other threads to check fullness anyway. I opened #632.
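A minimal sketch of the associated-type/constant idea floated here (purely illustrative — `SketchFolder`, `MAY_BE_FULL`, and `SumFolder` are made-up names, not rayon's API): a folder declares at the type level whether it can ever report "full", so `consume_iter` can pick between unguarded internal iteration and a check-per-item path.

```rust
// Sketch: folders that can never be "full" opt into a fast path that
// defers entirely to internal iteration; folders that may short-circuit
// get a per-item fullness check.
trait SketchFolder: Sized {
    const MAY_BE_FULL: bool;
    fn consume(self, item: u64) -> Self;
    fn full(&self) -> bool;
}

struct SumFolder(u64);

impl SketchFolder for SumFolder {
    const MAY_BE_FULL: bool = false; // summing never short-circuits
    fn consume(self, item: u64) -> Self {
        SumFolder(self.0 + item)
    }
    fn full(&self) -> bool {
        false
    }
}

fn consume_iter<F: SketchFolder, I: Iterator<Item = u64>>(folder: F, iter: I) -> F {
    if F::MAY_BE_FULL {
        // conservative path: stop as soon as the folder reports full
        let mut f = folder;
        for item in iter {
            if f.full() {
                break;
            }
            f = f.consume(item);
        }
        f
    } else {
        // fast path: defer entirely to the iterator's internal iteration
        iter.fold(folder, F::consume)
    }
}

fn main() {
    let f = consume_iter(SumFolder(0), 0..100);
    assert_eq!(f.0, 4950);
    println!("{}", f.0);
}
```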
These adaptors may consume many elements before deferring to a base folder's fullness checks, so those checks need to be performed manually. For the `filter`s, there's no way to do it manually (rayon-rs#632), so the specializations just have to be removed. For `fold` and `find_any` this can be done with a `take_while`. This extends the octillion tests to confirm this behaviour.

This makes a program like the following slightly slower compared to the direct `consume_iter` without a check, but it's still faster than the non-specialized form.

```rust
extern crate test;
extern crate rayon;
use rayon::prelude::*;

fn main() {
    let count = (0..std::u32::MAX)
        .into_par_iter()
        .map(test::black_box)
        .find_any(|_| test::black_box(false));
    println!("{:?}", count);
}
```

```
$ hyperfine ./find-original ./find-no-check ./find-check
Benchmark #1: ./find-original
  Time (mean ± σ):     627.6 ms ±  25.7 ms    [User: 7.130 s, System: 0.014 s]
  Range (min … max):   588.4 ms … 656.4 ms    10 runs

Benchmark #2: ./find-no-check
  Time (mean ± σ):     481.5 ms ±  10.8 ms    [User: 5.415 s, System: 0.013 s]
  Range (min … max):   468.9 ms … 498.2 ms    10 runs

Benchmark #3: ./find-check
  Time (mean ± σ):     562.3 ms ±  11.8 ms    [User: 6.363 s, System: 0.013 s]
  Range (min … max):   542.5 ms … 578.2 ms    10 runs
```

(find-original = without specialization, find-no-check = custom `consume_iter` without `take_while`, find-check = this commit)
I added a commit that:

- performs the fullness check manually with a `take_while` for `fold` and `find_any`
- removes the specializations for the `filter`s, since there's no way to do the check manually there (rayon-rs#632)
- extends the octillion tests to confirm this behaviour

Unsurprisingly, this makes a program like the following slightly slower compared to the direct `consume_iter` without a check:

```rust
#![feature(test)]
extern crate test;
extern crate rayon;
use rayon::prelude::*;

fn main() {
    let count = (0..std::u32::MAX)
        .into_par_iter()
        .map(test::black_box)
        .find_any(|_| test::black_box(false));
    println!("{:?}", count);
}
```

(find-original = without specialization; find-no-check = custom `consume_iter` without `take_while`; find-check = this commit)
cuviper left a comment
Looks good!
Can you just run your changes through rustfmt though? I can do it too, but that will steal most of your git-blame lines... 😉
Done (formatting isn't checked on CI?). Let me know if you'd like the commits squashed, for a clean history.
No, we need to figure that out... #612
This is clean enough for me, and there are important details in your "check full" commit.

bors r+
631: Specialize more `consume_iter`s r=cuviper a=huonw

This specializes the remaining "obvious" `consume_iter`s for the second part of #465. Benchmarks are included in each commit (the first specializes the types that can just defer to a `std` API; the second two require a bit more work); summary (on a 6 core/12 thread machine):

- on the included benchmarks:
  - `find` is up to 1.5x faster (`find::size1::parallel_find_middle`)
  - `collect` is up to 4x faster (`vec_collect::vec_i_filtered::with_collect`)
- using a basic program that does something like `(0..std::u32::MAX).into_par_iter().<OPERATION>.map(test::black_box).count()`:
  - with no operation (i.e. just testing `map`/`count`): 8x faster
  - with `.map(Some).while_some()`: 3x faster
  - with `.intersperse(1)`: 10x faster

The remaining types without a specialization are:

```
$ rg --files-with-matches 'fn consume' src/iter | xargs rg --files-without-match 'fn consume_iter'
src/iter/filter_map.rs
src/iter/unzip.rs
src/iter/try_fold.rs
src/iter/try_reduce_with.rs
src/iter/try_reduce.rs
src/iter/filter.rs
src/iter/flat_map.rs
src/iter/find_first_last/mod.rs
src/iter/collect/consumer.rs
```

(I've started a WIP branch for the `try_*` types at https://github.com/huonw/rayon/tree/try_fold; diff to this PR: huonw/rayon@iter-opts...try_fold.)

Co-authored-by: Huon Wilson <[email protected]>