
Conversation

@huonw (Contributor) commented Feb 3, 2019

This specializes the remaining "obvious" `consume_iter`s for the second part of #465. Benchmarks are included in each commit (the first specializes the types that can just defer to a `std` API; the second two require a bit more work). Summary (on a 6 core/12 thread machine):

- on the included benchmarks:
  - `find` is up to 1.5x faster (`find::size1::parallel_find_middle`)
  - `collect` is up to 4x faster (`vec_collect::vec_i_filtered::with_collect`)
- using a basic program that does something like `(0..std::u32::MAX).into_par_iter().<OPERATION>.map(test::black_box).count()`:
  - with no operation (i.e. just testing `map`/`count`): 8x faster
  - with `.map(Some).while_some()`: 3x faster
  - with `.intersperse(1)`: 10x faster

The remaining types without a specialization are:

```
$ rg --files-with-matches 'fn consume' src/iter | xargs rg --files-without-match 'fn consume_iter'
src/iter/filter_map.rs
src/iter/unzip.rs
src/iter/try_fold.rs
src/iter/try_reduce_with.rs
src/iter/try_reduce.rs
src/iter/filter.rs
src/iter/flat_map.rs
src/iter/find_first_last/mod.rs
src/iter/collect/consumer.rs
```

(I've started a WIP branch for the try_* types at https://github.com/huonw/rayon/tree/try_fold; diff to this PR: huonw/rayon@iter-opts...try_fold.)
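
For readers less familiar with rayon's plumbing, here's a minimal, self-contained sketch of what "specializing `consume_iter`" means, using a toy folder trait rather than rayon's actual plumbing types (all names here are made up for illustration): the default method loops over a chunk item by item, while a specialization hands the whole chunk to a `std` adaptor such as `Iterator::count`, which can use internal-iteration fast paths.

```rust
// Toy illustration only; NOT rayon's real plumbing API.
trait Folder<T>: Sized {
    fn consume(self, item: T) -> Self;

    // Default: an external, per-item loop.
    fn consume_iter<I>(self, iter: I) -> Self
    where
        I: IntoIterator<Item = T>,
    {
        iter.into_iter().fold(self, Self::consume)
    }
}

struct CountFolder(usize);

impl<T> Folder<T> for CountFolder {
    fn consume(self, _item: T) -> Self {
        CountFolder(self.0 + 1)
    }

    // Specialized: let `Iterator::count` drive the whole chunk internally.
    fn consume_iter<I>(self, iter: I) -> Self
    where
        I: IntoIterator<Item = T>,
    {
        CountFolder(self.0 + iter.into_iter().count())
    }
}

fn main() {
    let folder = CountFolder(0).consume_iter(0u32..1_000_000);
    println!("{}", folder.0);
}
```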

huonw added 3 commits February 4, 2019 00:42
These consumers all exist as std::iter adaptors too (except for
`update`, but that one is very simple), and so we can specialize
`consume_iter` to call them, to potentially benefit from any internal
iteration optimizations they have.

Part of rayon-rs#465.

The following example gets 8x faster (1.6s before, 0.2s after) on
a 6 core (12 thread) machine:

```rust
#![feature(test)]

extern crate test;
extern crate rayon;
use rayon::prelude::*;

fn main() {
    let count = (0..std::u32::MAX).into_par_iter().map(test::black_box).count();
    println!("{}", count);
}
```

For the built-in benchmarks, there don't seem to be any true
slow-downs (some benchmarks are so variable that it's very hard to
accurately determine the change), and there are a lot of speed-ups.
This affects the `find` and `collect` benchmarks the most:

```
 name                                find-old.txt ns/iter  find-new.txt ns/iter  diff ns/iter   diff %  speedup
 find::size1::parallel_find_first    19,329                20,003                         674    3.49%   x 0.97
 find::size1::parallel_find_common   24,328                23,183                      -1,145   -4.71%   x 1.05
 find::size1::parallel_find_missing  2,131,717             1,687,769                 -443,948  -20.83%   x 1.26
 find::size1::parallel_find_last     1,606,984             1,201,934                 -405,050  -25.21%   x 1.34
 find::size1::parallel_find_middle   1,242,411             819,751                   -422,660  -34.02%   x 1.52
```

```
 name                                                                collect-old.txt ns/iter  collect-new.txt ns/iter  diff ns/iter   diff %  speedup
 map_collect::i_to_i::with_mutex_vec                                 30,813,044               31,650,037                    836,993    2.72%   x 0.97
 map_collect::i_to_i::with_linked_list_collect_vec_sized             12,769,089               13,077,052                    307,963    2.41%   x 0.98
 map_collect::i_mod_10_to_i::with_mutex_vec                          8,222,478                8,316,029                      93,551    1.14%   x 0.99
 map_collect::i_to_i::with_fold                                      49,522,357               49,662,761                    140,404    0.28%   x 1.00
 map_collect::i_to_i::with_linked_list_collect                       31,901,399               31,558,372                   -343,027   -1.08%   x 1.01
 map_collect::i_mod_10_to_i::with_linked_list_collect                21,974,395               21,622,942                   -351,453   -1.60%   x 1.02
 vec_collect::vec_i::with_fold                                       35,716,513               34,863,214                   -853,299   -2.39%   x 1.02
 map_collect::i_to_i::with_fold_vec                                  51,523,306               50,049,271                 -1,474,035   -2.86%   x 1.03
 map_collect::i_to_i::with_linked_list_collect_vec                   23,086,079               22,443,957                   -642,122   -2.78%   x 1.03
 map_collect::i_mod_10_to_i::with_mutex                              80,962,611               77,863,092                 -3,099,519   -3.83%   x 1.04
 map_collect::i_to_i::with_mutex                                     204,617,301              196,787,223                -7,830,078   -3.83%   x 1.04
 vec_collect::vec_i::with_collect                                    2,535,582                2,437,613                     -97,969   -3.86%   x 1.04
 vec_collect::vec_i::with_collect_into_vec                           2,665,272                2,531,158                    -134,114   -5.03%   x 1.05
 map_collect::i_to_i::with_vec_vec_sized                             14,525,832               13,627,798                   -898,034   -6.18%   x 1.07
 vec_collect::vec_i::with_collect_into_vec_reused                    1,227,069                1,109,099                    -117,970   -9.61%   x 1.11
 vec_collect::vec_i_filtered::with_fold                              41,675,555               36,487,521                 -5,188,034  -12.45%   x 1.14
 map_collect::i_mod_10_to_i::with_collect                            6,068,716                5,130,790                    -937,926  -15.46%   x 1.18
 map_collect::i_mod_10_to_i::with_vec_vec_sized                      6,249,221                5,276,541                    -972,680  -15.56%   x 1.18
 map_collect::i_mod_10_to_i::with_linked_list_collect_vec            5,996,844                5,026,503                    -970,341  -16.18%   x 1.19
 map_collect::i_mod_10_to_i::with_linked_list_map_reduce_vec_sized   5,897,802                4,976,394                    -921,408  -15.62%   x 1.19
 map_collect::i_to_i::with_collect                                   14,137,451               11,812,855                 -2,324,596  -16.44%   x 1.20
 map_collect::i_mod_10_to_i::with_linked_list_collect_vec_sized      6,187,408                4,996,645                  -1,190,763  -19.24%   x 1.24
 map_collect::i_to_i::with_linked_list_map_reduce_vec_sized          14,489,709               11,024,575                 -3,465,134  -23.91%   x 1.31
 map_collect::i_mod_10_to_i::with_fold                               1,690,818                1,265,901                    -424,917  -25.13%   x 1.34
 vec_collect::vec_i::with_linked_list_collect_vec                    23,295,026               16,697,819                 -6,597,207  -28.32%   x 1.40
 map_collect::i_mod_10_to_i::with_fold_vec                           2,652,629                1,707,870                    -944,759  -35.62%   x 1.55
 vec_collect::vec_i::with_linked_list_map_reduce_vec_sized           14,930,174               9,183,037                  -5,747,137  -38.49%   x 1.63
 vec_collect::vec_i::with_vec_vec_sized                              15,782,752               9,705,671                  -6,077,081  -38.50%   x 1.63
 vec_collect::vec_i_filtered::with_linked_list_collect_vec           30,026,717               18,254,188                -11,772,529  -39.21%   x 1.64
 vec_collect::vec_i::with_linked_list_collect_vec_sized              15,408,185               8,607,308                  -6,800,877  -44.14%   x 1.79
 vec_collect::vec_i_filtered::with_vec_vec_sized                     22,714,855               9,594,991                 -13,119,864  -57.76%   x 2.37
 vec_collect::vec_i_filtered::with_linked_list_map_reduce_vec_sized  23,022,071               9,483,549                 -13,538,522  -58.81%   x 2.43
 vec_collect::vec_i_filtered::with_linked_list_collect_vec_sized     23,143,333               9,342,023                 -13,801,310  -59.63%   x 2.48
 vec_collect::vec_i_filtered::with_collect                           24,671,619               6,026,401                 -18,645,218  -75.57%   x 4.09
```
Part of rayon-rs#465.

This makes the following program 3x faster (1.2s before, 0.4s after)
on a 6 core (12 thread) machine.

```rust
#![feature(test)]

extern crate test;
extern crate rayon;
use rayon::prelude::*;

fn main() {
    let count = (0..std::u32::MAX)
        .into_par_iter()
        .map(Some)
        .while_some()
        .map(test::black_box)
        .count();
    println!("{}", count);
}
```
Part of rayon-rs#465.

This makes the following program 10x faster (5.7s before, 0.55s after)
on a 6 core (12 thread) machine.

```rust
#![feature(test)]

extern crate test;
extern crate rayon;
use rayon::prelude::*;

fn main() {
    let count = (0..std::u32::MAX)
        .into_par_iter()
        .intersperse(1)
        .map(test::black_box)
        .count();
    println!("{}", count);
}
```
@huonw (Contributor, Author) commented Feb 7, 2019

Hi @cuviper and/or @nikomatsakis, I'd love to see this progress, so a review would be appreciated. Thanks!

@cuviper self-assigned this Feb 7, 2019
@cuviper (Member) commented Feb 7, 2019

> 1.5x, 4x, 8x, 3x, 10x

Woah! I expected only modest gains available here, not entire multiples. Thanks for doing this!

I think we need to be a little more careful about checking `full` flags though, for those that may process an arbitrary number of items before anything actually reaches the base consumer. This applies to `filter`, `filter_map`, and `fold`, as well as `find_any` needing to check its global state. In `while_some` you do have a load of its global state, which is good, but it also needs to consult whether its base is full.

For a tangible (though exaggerated) example of this, I tried adding this in `tests/octillion.rs`:

```rust
#[test]
fn find_any_octillion() {
    let x = octillion().find_any(|&x| x > OCTILLION / 2);
    assert!(x.is_some());
}
```

On the master branch, this takes only a couple milliseconds, even in debug mode, as thread creation itself is the only real work. The higher splits will instantly find a match, short-circuiting the lower splits. (Assuming at least two threads -- if we really add this test, we should guarantee that like some of the other tests in that file do.)

On your branch, the lower splits continue through their entire iterator, without looking up to see that the shared found state was set. I think this can be solved using a `take_while`, and similarly in `filter` etc. We'll give up some performance, sadly, but that seems necessary for short-circuiting to work correctly. Hopefully codegen can at least inline and const-prop the cases where `full` is always false.
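
Roughly, the `take_while` shape being suggested here might look like the following standalone sketch (`consume_chunk` and `found` are hypothetical stand-ins, not the PR's folder code): the shared flag is re-checked before each item, so a chunk stops early once another split has reported a match.

```rust
use std::sync::atomic::{AtomicBool, Ordering};

// `found` stands in for find_any's shared state; this is a sketch,
// not the PR's implementation.
fn consume_chunk<I>(found: &AtomicBool, iter: I, predicate: impl Fn(&u64) -> bool)
where
    I: IntoIterator<Item = u64>,
{
    let hit = iter
        .into_iter()
        // Re-check the shared flag before each item.
        .take_while(|_| !found.load(Ordering::Relaxed))
        .find(|item| predicate(item));
    if hit.is_some() {
        found.store(true, Ordering::Relaxed);
    }
}

fn main() {
    let found = AtomicBool::new(false);
    // Pretend another split already found a match:
    found.store(true, Ordering::Relaxed);
    // This chunk now does essentially no work.
    consume_chunk(&found, 0..1_000_000_000, |&x| x > 500_000_000);
    println!("found: {}", found.load(Ordering::Relaxed));
}
```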

@huonw (Contributor, Author) commented Feb 7, 2019

That's a good point! Fixing it for `find` and `fold` should be easy enough, but I can't quite see how to handle it for `filter`/`filter_map`: `self.base` is moved into the `consume_iter` call, and thus there's no way to call its `full` in the iterator adaptor chain. I may have to revert them.

I don't think that concern applies to `while_some` because it isn't consuming an arbitrary number of elements for each output, and the base's `consume_iter` handles the `full` checking there.

@cuviper (Member) commented Feb 7, 2019

> but I can't quite see how to handle it for `filter`/`filter_map`: `self.base` is moved into the `consume_iter` call, and thus there's no way to call its `full` in the iterator adaptor chain.

That's annoying. You may be able to get by with `take_while().filter().next()`, pulling one item at a time to pass to the consumer. I guess the performance compared to plain `consume` will depend on how many items the filter is actually filtering out. If you do revert, please put a comment in there so we remember these considerations.
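
For illustration, that one-item-at-a-time shape could look something like this with plain iterators (hypothetical names, not rayon's folder code; the real version would hand each surviving item to the base consumer's `consume`):

```rust
// Sketch: filtering still happens on the std side, but items are pulled
// one at a time so a fullness check runs between underlying items.
fn feed_filtered<T, I>(
    iter: I,
    keep: impl Fn(&T) -> bool,
    is_full: impl Fn() -> bool,
    mut consume: impl FnMut(T),
) where
    I: IntoIterator<Item = T>,
{
    let mut filtered = iter
        .into_iter()
        .take_while(|_| !is_full())
        .filter(|item| keep(item));
    // `next()` pulls one surviving item at a time; the take_while above
    // re-checks the fullness flag before each underlying item.
    while let Some(item) = filtered.next() {
        consume(item);
    }
}

fn main() {
    let mut evens = Vec::new();
    feed_filtered(0..20, |x| x % 2 == 0, || false, |x| evens.push(x));
    println!("{:?}", evens);
}
```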

> I don't think that concern applies to `while_some` because it isn't consuming an arbitrary number of elements for each output,

Ah, yes, you're right. It does need to load its own global flag as you're doing, since the base consumer won't check that, but the base can otherwise take care of itself.

@huonw (Contributor, Author) commented Feb 7, 2019

> If you do revert, please put a comment in there so we remember these considerations.

Yes definitely!

Should I also open an issue along the lines of "Folder API doesn't allow for internal iteration optimisation of filter-like adapters", to remember it for a potential (rayon 2?) API change/iteration/improvement?

@cuviper (Member) commented Feb 8, 2019

> Should I also open an issue along the lines of "Folder API doesn't allow for internal iteration optimisation of filter-like adapters", to remember it for a potential (rayon 2?) API change/iteration/improvement?

Sure, although I'm not sure what that change would look like, so if you have ideas please sketch them out. You'd still have the same problem if `consume_iter` took `&mut self`, for instance. Or the folder could have a method to return some associated type `Full` object, but this couldn't reference the folder's own state.
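
For what it's worth, that last idea might look roughly like this hypothetical sketch (the trait, `full_check`, and `FindFolder` are made-up names, not rayon's API): the folder hands out an independently owned check, e.g. a clone of an `Arc`'d flag, which an adaptor could poll from inside a std iterator chain without borrowing the folder itself.

```rust
use std::sync::{
    atomic::{AtomicBool, Ordering},
    Arc,
};

// Hypothetical trait sketch, not rayon's API: instead of `full(&self)`,
// the folder returns a detached fullness check as an associated type.
trait FolderSketch<T>: Sized {
    type Full: Fn() -> bool;
    fn full_check(&self) -> Self::Full;
    fn consume(self, item: T) -> Self;
}

struct FindFolder {
    found: Arc<AtomicBool>,
}

impl FolderSketch<u64> for FindFolder {
    type Full = Box<dyn Fn() -> bool>;

    // The check owns a clone of the shared flag, so it can live inside a
    // take_while closure even after the folder itself has been moved.
    fn full_check(&self) -> Self::Full {
        let found = Arc::clone(&self.found);
        Box::new(move || found.load(Ordering::Relaxed))
    }

    fn consume(self, _item: u64) -> Self {
        self
    }
}

fn main() {
    let folder = FindFolder {
        found: Arc::new(AtomicBool::new(false)),
    };
    let is_full = folder.full_check();
    // The folder can be moved into a consume chain while the detached
    // check keeps working:
    let _folder = folder.consume(1);
    println!("full: {}", is_full());
}
```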

@huonw (Contributor, Author) commented Feb 11, 2019

I had the same thought about using an associated type, and I think not being able to reference the folder's state is an ok limitation: most things either are never full or need to communicate with other threads to check full-ness anyway. I opened #632.

These adaptors may consume many elements before deferring to a base
folder's fullness checks, and so those checks need to be performed
manually. For the `filter`s, there's no way to do this (rayon-rs#632),
so the specialisations just have to be removed. For `fold` and
`find_any` it can be done with a `take_while`.

This extends the octillion tests to confirm this behaviour.

This makes a program like the following slightly slower compared to
the direct `consume_iter` without a check, but it's still faster than
the non-specialized form.

```rust
#![feature(test)]

extern crate test;
extern crate rayon;
use rayon::prelude::*;

fn main() {
    let count = (0..std::u32::MAX)
        .into_par_iter()
        .map(test::black_box)
        .find_any(|_| test::black_box(false));
    println!("{:?}", count);
}
```

```
$ hyperfine ./find-original ./find-no-check ./find-check
Benchmark #1: ./find-original
  Time (mean ± σ):     627.6 ms ±  25.7 ms    [User: 7.130 s, System: 0.014 s]
  Range (min … max):   588.4 ms … 656.4 ms    10 runs

Benchmark #2: ./find-no-check
  Time (mean ± σ):     481.5 ms ±  10.8 ms    [User: 5.415 s, System: 0.013 s]
  Range (min … max):   468.9 ms … 498.2 ms    10 runs

Benchmark #3: ./find-check
  Time (mean ± σ):     562.3 ms ±  11.8 ms    [User: 6.363 s, System: 0.013 s]
  Range (min … max):   542.5 ms … 578.2 ms    10 runs
```

(find-original = without specialization, find-no-check = custom
`consume_iter` without `take_while`, find-check = this commit)
@huonw (Contributor, Author) commented Feb 14, 2019

I added a commit that:

- reverts the change for `filter` and `filter_map`
- adds `take_while` to `fold` and `find_any`
- extends the octillion tests to cover the above

Unsurprisingly, this makes a program like the following slightly slower compared to the direct `consume_iter` without a check, but it's still faster than the non-specialized form.

```rust
#![feature(test)]

extern crate test;
extern crate rayon;
use rayon::prelude::*;

fn main() {
    let count = (0..std::u32::MAX)
        .into_par_iter()
        .map(test::black_box)
        .find_any(|_| test::black_box(false));
    println!("{:?}", count);
}
```
```
$ hyperfine ./find-original ./find-no-check ./find-check
Benchmark #1: ./find-original
  Time (mean ± σ):     627.6 ms ±  25.7 ms    [User: 7.130 s, System: 0.014 s]
  Range (min … max):   588.4 ms … 656.4 ms    10 runs

Benchmark #2: ./find-no-check
  Time (mean ± σ):     481.5 ms ±  10.8 ms    [User: 5.415 s, System: 0.013 s]
  Range (min … max):   468.9 ms … 498.2 ms    10 runs

Benchmark #3: ./find-check
  Time (mean ± σ):     562.3 ms ±  11.8 ms    [User: 6.363 s, System: 0.013 s]
  Range (min … max):   542.5 ms … 578.2 ms    10 runs
```

(find-original = without specialization; find-no-check = custom consume_iter without take_while; find-check = this commit)

@cuviper (Member) left a comment

Looks good!

Can you just run your changes through rustfmt though? I can do it too, but that will steal most of your git-blame lines... 😉

@huonw (Contributor, Author) commented Feb 18, 2019

Done (formatting isn't checked on CI?).

Let me know if you'd like the commits squashed, for a clean history.

@cuviper (Member) commented Feb 18, 2019

> Done (formatting isn't checked on CI?).

No, we need to figure that out... #612

> Let me know if you'd like the commits squashed, for a clean history.

This is clean enough for me, and there are important details in your "check full" commit.

bors r+

bors bot added a commit that referenced this pull request Feb 18, 2019
631: Specialize more `consume_iter`s r=cuviper a=huonw

Co-authored-by: Huon Wilson <[email protected]>
bors bot merged commit 0dbc892 into rayon-rs:master Feb 18, 2019
@huonw deleted the iter-opts branch March 7, 2019 12:16