
Conversation

@huonw (Contributor) commented Feb 3, 2019

This specializes the remaining "obvious" `consume_iter`s for the second part of #465. Benchmarks are included in each commit (the first specializes the types that can just defer to a `std` API; the second two require a bit more work). Summary (on a 6 core/12 thread machine):

- on the included benchmarks:
  - `find` is up to 1.5x faster (`find::size1::parallel_find_middle`)
  - `collect` is up to 4x faster (`vec_collect::vec_i_filtered::with_collect`)
- using a basic program that does something like `(0..std::u32::MAX).into_par_iter().<OPERATION>.map(test::black_box).count()`:
  - with no operation (i.e. just testing `map`/`count`): 8x faster
  - with `.map(Some).while_some()`: 3x faster
  - with `.intersperse(1)`: 10x faster

The remaining types without a specialization are:

```
$ rg --files-with-matches 'fn consume' src/iter | xargs rg --files-without-match 'fn consume_iter'
src/iter/filter_map.rs
src/iter/unzip.rs
src/iter/try_fold.rs
src/iter/try_reduce_with.rs
src/iter/try_reduce.rs
src/iter/filter.rs
src/iter/flat_map.rs
src/iter/find_first_last/mod.rs
src/iter/collect/consumer.rs
```

(I've started a WIP branch for the try_* types at https://github.com/huonw/rayon/tree/try_fold; diff to this PR: huonw/rayon@iter-opts...try_fold.)
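
For readers less familiar with rayon's plumbing, here's a minimal, self-contained sketch of what "specializing `consume_iter`" means, using a toy folder trait rather than rayon's actual plumbing types (all names here are made up for illustration): the default method loops over a chunk item by item, while a specialization hands the whole chunk to a `std` adaptor such as `Iterator::count`, which can use internal-iteration fast paths.

```rust
// Toy illustration only; NOT rayon's real plumbing API.
trait Folder<T>: Sized {
    fn consume(self, item: T) -> Self;

    // Default: an external, per-item loop.
    fn consume_iter<I>(self, iter: I) -> Self
    where
        I: IntoIterator<Item = T>,
    {
        iter.into_iter().fold(self, Self::consume)
    }
}

struct CountFolder(usize);

impl<T> Folder<T> for CountFolder {
    fn consume(self, _item: T) -> Self {
        CountFolder(self.0 + 1)
    }

    // Specialized: let `Iterator::count` drive the whole chunk internally.
    fn consume_iter<I>(self, iter: I) -> Self
    where
        I: IntoIterator<Item = T>,
    {
        CountFolder(self.0 + iter.into_iter().count())
    }
}

fn main() {
    let folder = CountFolder(0).consume_iter(0u32..1_000_000);
    println!("{}", folder.0);
}
```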

huonw added 3 commits February 4, 2019 00:42
These consumers all exist as std::iter adaptors too (except for
`update`, but that one is very simple), and so we can specialize
`consume_iter` to call them, to potentially benefit from any internal
iteration optimizations they have.

Part of rayon-rs#465.

The following example gets 8x faster (1.6s before, 0.2s after) on
a 6 core (12 thread) machine:

```rust
#![feature(test)]

extern crate test;
extern crate rayon;
use rayon::prelude::*;

fn main() {
    let count = (0..std::u32::MAX).into_par_iter().map(test::black_box).count();
    println!("{}", count);
}
```

For the built-in benchmarks, there don't seem to be any true
slow-downs (some benchmarks are so variable that it's very hard to
accurately determine the change), and there are a lot of speed-ups.
This affects the `find` and `collect` benchmarks the most:

```
 name                                find-old.txt ns/iter  find-new.txt ns/iter  diff ns/iter   diff %  speedup
 find::size1::parallel_find_first    19,329                20,003                         674    3.49%   x 0.97
 find::size1::parallel_find_common   24,328                23,183                      -1,145   -4.71%   x 1.05
 find::size1::parallel_find_missing  2,131,717             1,687,769                 -443,948  -20.83%   x 1.26
 find::size1::parallel_find_last     1,606,984             1,201,934                 -405,050  -25.21%   x 1.34
 find::size1::parallel_find_middle   1,242,411             819,751                   -422,660  -34.02%   x 1.52
```

```
 name                                                                collect-old.txt ns/iter  collect-new.txt ns/iter  diff ns/iter   diff %  speedup
 map_collect::i_to_i::with_mutex_vec                                 30,813,044               31,650,037                    836,993    2.72%   x 0.97
 map_collect::i_to_i::with_linked_list_collect_vec_sized             12,769,089               13,077,052                    307,963    2.41%   x 0.98
 map_collect::i_mod_10_to_i::with_mutex_vec                          8,222,478                8,316,029                      93,551    1.14%   x 0.99
 map_collect::i_to_i::with_fold                                      49,522,357               49,662,761                    140,404    0.28%   x 1.00
 map_collect::i_to_i::with_linked_list_collect                       31,901,399               31,558,372                   -343,027   -1.08%   x 1.01
 map_collect::i_mod_10_to_i::with_linked_list_collect                21,974,395               21,622,942                   -351,453   -1.60%   x 1.02
 vec_collect::vec_i::with_fold                                       35,716,513               34,863,214                   -853,299   -2.39%   x 1.02
 map_collect::i_to_i::with_fold_vec                                  51,523,306               50,049,271                 -1,474,035   -2.86%   x 1.03
 map_collect::i_to_i::with_linked_list_collect_vec                   23,086,079               22,443,957                   -642,122   -2.78%   x 1.03
 map_collect::i_mod_10_to_i::with_mutex                              80,962,611               77,863,092                 -3,099,519   -3.83%   x 1.04
 map_collect::i_to_i::with_mutex                                     204,617,301              196,787,223                -7,830,078   -3.83%   x 1.04
 vec_collect::vec_i::with_collect                                    2,535,582                2,437,613                     -97,969   -3.86%   x 1.04
 vec_collect::vec_i::with_collect_into_vec                           2,665,272                2,531,158                    -134,114   -5.03%   x 1.05
 map_collect::i_to_i::with_vec_vec_sized                             14,525,832               13,627,798                   -898,034   -6.18%   x 1.07
 vec_collect::vec_i::with_collect_into_vec_reused                    1,227,069                1,109,099                    -117,970   -9.61%   x 1.11
 vec_collect::vec_i_filtered::with_fold                              41,675,555               36,487,521                 -5,188,034  -12.45%   x 1.14
 map_collect::i_mod_10_to_i::with_collect                            6,068,716                5,130,790                    -937,926  -15.46%   x 1.18
 map_collect::i_mod_10_to_i::with_vec_vec_sized                      6,249,221                5,276,541                    -972,680  -15.56%   x 1.18
 map_collect::i_mod_10_to_i::with_linked_list_collect_vec            5,996,844                5,026,503                    -970,341  -16.18%   x 1.19
 map_collect::i_mod_10_to_i::with_linked_list_map_reduce_vec_sized   5,897,802                4,976,394                    -921,408  -15.62%   x 1.19
 map_collect::i_to_i::with_collect                                   14,137,451               11,812,855                 -2,324,596  -16.44%   x 1.20
 map_collect::i_mod_10_to_i::with_linked_list_collect_vec_sized      6,187,408                4,996,645                  -1,190,763  -19.24%   x 1.24
 map_collect::i_to_i::with_linked_list_map_reduce_vec_sized          14,489,709               11,024,575                 -3,465,134  -23.91%   x 1.31
 map_collect::i_mod_10_to_i::with_fold                               1,690,818                1,265,901                    -424,917  -25.13%   x 1.34
 vec_collect::vec_i::with_linked_list_collect_vec                    23,295,026               16,697,819                 -6,597,207  -28.32%   x 1.40
 map_collect::i_mod_10_to_i::with_fold_vec                           2,652,629                1,707,870                    -944,759  -35.62%   x 1.55
 vec_collect::vec_i::with_linked_list_map_reduce_vec_sized           14,930,174               9,183,037                  -5,747,137  -38.49%   x 1.63
 vec_collect::vec_i::with_vec_vec_sized                              15,782,752               9,705,671                  -6,077,081  -38.50%   x 1.63
 vec_collect::vec_i_filtered::with_linked_list_collect_vec           30,026,717               18,254,188                -11,772,529  -39.21%   x 1.64
 vec_collect::vec_i::with_linked_list_collect_vec_sized              15,408,185               8,607,308                  -6,800,877  -44.14%   x 1.79
 vec_collect::vec_i_filtered::with_vec_vec_sized                     22,714,855               9,594,991                 -13,119,864  -57.76%   x 2.37
 vec_collect::vec_i_filtered::with_linked_list_map_reduce_vec_sized  23,022,071               9,483,549                 -13,538,522  -58.81%   x 2.43
 vec_collect::vec_i_filtered::with_linked_list_collect_vec_sized     23,143,333               9,342,023                 -13,801,310  -59.63%   x 2.48
 vec_collect::vec_i_filtered::with_collect                           24,671,619               6,026,401                 -18,645,218  -75.57%   x 4.09
```
Part of rayon-rs#465.

This makes the following program 3x faster (1.2s before, 0.4s after)
on a 6 core (12 thread) machine.

```rust
#![feature(test)]

extern crate test;
extern crate rayon;
use rayon::prelude::*;

fn main() {
    let count = (0..std::u32::MAX)
        .into_par_iter()
        .map(Some)
        .while_some()
        .map(test::black_box)
        .count();
    println!("{}", count);
}
```
Part of rayon-rs#465.

This makes the following program 10x faster (5.7s before, 0.55s after)
on a 6 core (12 thread) machine.

```rust
#![feature(test)]

extern crate test;
extern crate rayon;
use rayon::prelude::*;

fn main() {
    let count = (0..std::u32::MAX)
        .into_par_iter()
        .intersperse(1)
        .map(test::black_box)
        .count();
    println!("{}", count);
}
```
@huonw (Contributor, Author) commented Feb 7, 2019

Hi @cuviper and/or @nikomatsakis, I'd love to see this progress, so a review would be appreciated. Thanks!

@cuviper self-assigned this Feb 7, 2019
@cuviper (Member) commented Feb 7, 2019

> 1.5x, 4x, 8x, 3x, 10x

Woah! I expected only modest gains available here, not entire multiples. Thanks for doing this!

I think we need to be a little more careful about checking `full` flags though, for those that may process an arbitrary number of items before anything actually reaches the base consumer. This applies to `filter`, `filter_map`, and `fold`, as well as `find_any` needing to check its global state. In `while_some` you do have a load of its global state, which is good, but it also needs to consult whether its base is full.

For a tangible (though exaggerated) example of this, I tried adding this in `tests/octillion.rs`:

```rust
#[test]
fn find_any_octillion() {
    let x = octillion().find_any(|&x| x > OCTILLION / 2);
    assert!(x.is_some());
}
```

On the master branch, this takes only a couple milliseconds, even in debug mode, as thread creation itself is the only real work. The higher splits will instantly find a match, short-circuiting the lower splits. (Assuming at least two threads -- if we really add this test, we should guarantee that like some of the other tests in that file do.)

On your branch, the lower splits continue through their entire iterator, without looking up to see that the shared found state was set. I think this can be solved using a `take_while`, and similarly in `filter` etc. We'll give up some performance, sadly, but that seems necessary for short-circuiting to work correctly. Hopefully codegen can at least inline and const-prop the cases where `full` is always false.
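
Roughly, the `take_while` shape being suggested here might look like the following standalone sketch (`consume_chunk` and `found` are hypothetical stand-ins, not the PR's folder code): the shared flag is re-checked before each item, so a chunk stops early once another split has reported a match.

```rust
use std::sync::atomic::{AtomicBool, Ordering};

// `found` stands in for find_any's shared state; this is a sketch,
// not the PR's implementation.
fn consume_chunk<I>(found: &AtomicBool, iter: I, predicate: impl Fn(&u64) -> bool)
where
    I: IntoIterator<Item = u64>,
{
    let hit = iter
        .into_iter()
        // Re-check the shared flag before each item.
        .take_while(|_| !found.load(Ordering::Relaxed))
        .find(|item| predicate(item));
    if hit.is_some() {
        found.store(true, Ordering::Relaxed);
    }
}

fn main() {
    let found = AtomicBool::new(false);
    // Pretend another split already found a match:
    found.store(true, Ordering::Relaxed);
    // This chunk now does essentially no work.
    consume_chunk(&found, 0..1_000_000_000, |&x| x > 500_000_000);
    println!("found: {}", found.load(Ordering::Relaxed));
}
```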

@huonw (Contributor, Author) commented Feb 7, 2019

That's a good point! Fixing it for `find` and `fold` should be easy enough, but I can't quite see how to handle it for `filter`/`filter_map`: `self.base` is moved into the `consume_iter` call, and thus there's no way to call its `full` in the iterator adaptor chain. I may have to revert them.

I don't think that concern applies to `while_some` because it isn't consuming an arbitrary number of elements for each output, and the base's `consume_iter` handles the `full` checking there.

@cuviper (Member) commented Feb 7, 2019

> but I can't quite see how to handle it for `filter`/`filter_map`: `self.base` is moved into the `consume_iter` call, and thus there's no way to call its `full` in the iterator adaptor chain.

That's annoying. You may be able to get by with `take_while().filter().next()`, pulling one item at a time to pass to the consumer. I guess the performance compared to plain `consume` will depend on how many items the filter is actually filtering out. If you do revert, please put a comment in there so we remember these considerations.
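
For illustration, that one-item-at-a-time shape could look something like this with plain iterators (hypothetical names, not rayon's folder code; the real version would hand each surviving item to the base consumer's `consume`):

```rust
// Sketch: filtering still happens on the std side, but items are pulled
// one at a time so a fullness check runs between underlying items.
fn feed_filtered<T, I>(
    iter: I,
    keep: impl Fn(&T) -> bool,
    is_full: impl Fn() -> bool,
    mut consume: impl FnMut(T),
) where
    I: IntoIterator<Item = T>,
{
    let mut filtered = iter
        .into_iter()
        .take_while(|_| !is_full())
        .filter(|item| keep(item));
    // `next()` pulls one surviving item at a time; the take_while above
    // re-checks the fullness flag before each underlying item.
    while let Some(item) = filtered.next() {
        consume(item);
    }
}

fn main() {
    let mut evens = Vec::new();
    feed_filtered(0..20, |x| x % 2 == 0, || false, |x| evens.push(x));
    println!("{:?}", evens);
}
```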

> I don't think that concern applies to `while_some` because it isn't consuming an arbitrary number of elements for each output,

Ah, yes, you're right. It does need to load its own global flag as you're doing, since the base consumer won't check that, but the base can otherwise take care of itself.

@huonw (Contributor, Author) commented Feb 7, 2019

> If you do revert, please put a comment in there so we remember these considerations.

Yes definitely!

Should I also open an issue along the lines of "Folder API doesn't allow for internal iteration optimisation of filter-like adapters", to remember it for a potential (rayon 2?) API change/iteration/improvement?

@cuviper (Member) commented Feb 8, 2019

> Should I also open an issue along the lines of "Folder API doesn't allow for internal iteration optimisation of filter-like adapters", to remember it for a potential (rayon 2?) API change/iteration/improvement?

Sure, although I'm not sure what that change would look like, so if you have ideas please sketch them out. You'd still have the same problem if `consume_iter` took `&mut self`, for instance. Or the folder could have a method to return some associated type `Full` object, but this couldn't reference the folder's own state.
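
For what it's worth, that last idea might look roughly like this hypothetical sketch (the trait, `full_check`, and `FindFolder` are made-up names, not rayon's API): the folder hands out an independently owned check, e.g. a clone of an `Arc`'d flag, which an adaptor could poll from inside a std iterator chain without borrowing the folder itself.

```rust
use std::sync::{
    atomic::{AtomicBool, Ordering},
    Arc,
};

// Hypothetical trait sketch, not rayon's API: instead of `full(&self)`,
// the folder returns a detached fullness check as an associated type.
trait FolderSketch<T>: Sized {
    type Full: Fn() -> bool;
    fn full_check(&self) -> Self::Full;
    fn consume(self, item: T) -> Self;
}

struct FindFolder {
    found: Arc<AtomicBool>,
}

impl FolderSketch<u64> for FindFolder {
    type Full = Box<dyn Fn() -> bool>;

    // The check owns a clone of the shared flag, so it can live inside a
    // take_while closure even after the folder itself has been moved.
    fn full_check(&self) -> Self::Full {
        let found = Arc::clone(&self.found);
        Box::new(move || found.load(Ordering::Relaxed))
    }

    fn consume(self, _item: u64) -> Self {
        self
    }
}

fn main() {
    let folder = FindFolder {
        found: Arc::new(AtomicBool::new(false)),
    };
    let is_full = folder.full_check();
    // The folder can be moved into a consume chain while the detached
    // check keeps working:
    let _folder = folder.consume(1);
    println!("full: {}", is_full());
}
```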

@huonw (Contributor, Author) commented Feb 11, 2019

I had the same thought about using an associated type, and I think not being able to reference the folder's state is an ok limitation: most things either are never full or need to communicate with other threads to check full-ness anyway. I opened #632.

These adaptors may consume many elements before deferring to a base
folder's fullness checks, and so those checks need to be performed
manually. For the `filter`s, there's no way to do this (rayon-rs#632),
so the specialisations just have to be removed. For `fold` and
`find_any` it can be done with a `take_while`.

This extends the octillion tests to confirm this behaviour.

This makes a program like the following slightly slower compared to
the direct `consume_iter` without a check, but it's still faster than
the non-specialized form.

```rust
#![feature(test)]

extern crate test;
extern crate rayon;
use rayon::prelude::*;

fn main() {
    let count = (0..std::u32::MAX)
        .into_par_iter()
        .map(test::black_box)
        .find_any(|_| test::black_box(false));
    println!("{:?}", count);
}
```

```
$ hyperfine ./find-original ./find-no-check ./find-check
Benchmark #1: ./find-original
  Time (mean ± σ):     627.6 ms ±  25.7 ms    [User: 7.130 s, System: 0.014 s]
  Range (min … max):   588.4 ms … 656.4 ms    10 runs

Benchmark #2: ./find-no-check
  Time (mean ± σ):     481.5 ms ±  10.8 ms    [User: 5.415 s, System: 0.013 s]
  Range (min … max):   468.9 ms … 498.2 ms    10 runs

Benchmark #3: ./find-check
  Time (mean ± σ):     562.3 ms ±  11.8 ms    [User: 6.363 s, System: 0.013 s]
  Range (min … max):   542.5 ms … 578.2 ms    10 runs
```

(find-original = without specialization, find-no-check = custom
`consume_iter` without `take_while`, find-check = this commit)
@huonw (Contributor, Author) commented Feb 14, 2019

I added a commit that:

- reverts the change for `filter` and `filter_map`
- adds `take_while` to `fold` and `find_any`
- extends the octillion tests to cover the above

Unsurprisingly, this makes a program like the following slightly slower compared to the direct `consume_iter` without a check, but it's still faster than the non-specialized form.

```rust
#![feature(test)]

extern crate test;
extern crate rayon;
use rayon::prelude::*;

fn main() {
    let count = (0..std::u32::MAX)
        .into_par_iter()
        .map(test::black_box)
        .find_any(|_| test::black_box(false));
    println!("{:?}", count);
}
```
```
$ hyperfine ./find-original ./find-no-check ./find-check
Benchmark #1: ./find-original
  Time (mean ± σ):     627.6 ms ±  25.7 ms    [User: 7.130 s, System: 0.014 s]
  Range (min … max):   588.4 ms … 656.4 ms    10 runs

Benchmark #2: ./find-no-check
  Time (mean ± σ):     481.5 ms ±  10.8 ms    [User: 5.415 s, System: 0.013 s]
  Range (min … max):   468.9 ms … 498.2 ms    10 runs

Benchmark #3: ./find-check
  Time (mean ± σ):     562.3 ms ±  11.8 ms    [User: 6.363 s, System: 0.013 s]
  Range (min … max):   542.5 ms … 578.2 ms    10 runs
```

(find-original = without specialization; find-no-check = custom consume_iter without take_while; find-check = this commit)

@cuviper (Member) left a comment

Looks good!

Can you just run your changes through rustfmt though? I can do it too, but that will steal most of your git-blame lines... 😉

@huonw (Contributor, Author) commented Feb 18, 2019

Done (formatting isn't checked on CI?).

Let me know if you'd like the commits squashed, for a clean history.

@cuviper (Member) commented Feb 18, 2019

> Done (formatting isn't checked on CI?).

No, we need to figure that out... #612

> Let me know if you'd like the commits squashed, for a clean history.

This is clean enough for me, and there are important details in your "check full" commit.

bors r+

bors bot added a commit that referenced this pull request Feb 18, 2019
631: Specialize more `consume_iter`s r=cuviper a=huonw

Co-authored-by: Huon Wilson <[email protected]>
bors bot merged commit 0dbc892 into rayon-rs:master Feb 18, 2019
@huonw deleted the iter-opts branch March 7, 2019 12:16