Add basic parallel_for support to reduce_util #8986


Closed
wants to merge 37 commits

Conversation

swolchok
Contributor

@swolchok swolchok commented Mar 6, 2025

Initial parallel_for integration in a portable op. Needed for #8932. Feel free to hold review until the rest of the stack is ready and we observe successful parallelization.

swolchok added 25 commits March 4, 2025 11:35
@swolchok
Contributor Author

swolchok commented Mar 6, 2025

c10::irange which seems unnecessary

c10::irange is how we should be writing for loops over a range, full stop.
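For reference, a minimal sketch of the idiom (the function and variable names here are just for illustration):

#include <c10/util/irange.h>

// Hypothetical example: sum the first n elements of data.
int64_t sum_first_n(const int64_t* data, int64_t n) {
  int64_t acc = 0;
  for (const auto i : c10::irange(n)) { // i ranges over [0, n)
    acc += data[i];
  }
  return acc;
}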

It has nested loops, with the inner loop using c10::irange

Sorry, where is the outer loop?

the existing parallel_for is quite ugly

  1. It matches PyTorch core. See the definition of at::parallel_for and the dozens of uses. If we want to materially improve this primitive, we should do it upstream. (I should upstream my addition of a version with default grain size as well.)
  2. I don't see why we can't add the new parallelization capability independent of discussions about whether and how to improve the readability of said capability.
  3. The existing parallel_for API appears to be similar to Intel oneTBB's parallel_for, which also takes a range and a task and passes a subrange to the task. (See https://github.com/uxlfoundation/oneTBB/blob/master/examples/parallel_for/game_of_life/Evolution.cpp#L177 for example.) I think the reason to do it this way is to be useful for cases that don't just need to visit each element in the range. (For example, early termination partway through a chunk gets ugly with a per-element API.)
  4. oneTBB does have a parallel_for_each algorithm that looks like it might be what you want. (See the example, and the rough sketch contrasting the two shapes just below this list.) We could certainly add a parallel_for_each and migrate to it, but again I don't think it needs to block this work.
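For concreteness, a rough sketch of the two shapes (out, in, f, n, and grain_size are placeholders, and parallel_for_each is hypothetical; the chunked form mirrors the existing API):

// Existing chunked shape: the task receives a half-open subrange [begin, end) and loops itself.
parallel_for(0, n, grain_size, [&](int64_t begin, int64_t end) {
  for (const auto i : c10::irange(begin, end)) {
    out[i] = f(in[i]);
  }
});

// Hypothetical per-element shape (a parallel_for_each-style API): the framework owns the
// loop and hands the task one index at a time.
parallel_for_each(0, n, grain_size, [&](int64_t i) {
  out[i] = f(in[i]);
});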

Comment on lines +90 to +92
bool parallel_for(const int64_t begin, const int64_t end, const Func& func) {
  return parallel_for(begin, end, internal::GRAIN_SIZE, func);
}
Contributor

I feel we should let this be explicit instead of providing this convenience. Each operator, in my experience, has its own sweet spot, and with this convenience function you might think "I just need to slap parallel_for on and it's OK." So I feel it's OK to not have the convenience function, just to make authors aware of grain size. Just my opinion.

Contributor Author

Each operator, in my experience, has its own sweet spot

do you have any good references on tuning this? I'm having trouble envisioning the sweet spot being dependent on the operator and also independent of the specific CPU, model, input dimensions, etc. My working assumption was that we just have to pick some number.
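For what it's worth, my rough mental model of the knob (just a sketch; the exact chunking policy is up to the threadpool backend): the grain size bounds how small a chunk each task gets, so it mostly trades scheduling overhead against available parallelism. Roughly:

#include <algorithm>
#include <cstdint>

// Rough intuition only; hypothetical helper, not part of the API.
// e.g. begin = 0, end = 1'000'000, grain_size = 32'768 gives roughly 30 chunks,
// which are then spread across however many threads the pool has.
int64_t approx_num_chunks(int64_t begin, int64_t end, int64_t grain_size) {
  return std::max<int64_t>(1, (end - begin) / std::max<int64_t>(1, grain_size));
}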

Contributor

I'm having trouble envisioning the sweet spot being dependent on the operator and also independent of the specific CPU, model, input dimensions,

I was meaning to say that it is dependent on all of that: operator, input sizes, CPU, etc. For empirical reasons the former, like operator and input sizes, are the easiest to target, and more advanced things can be done with the CPU/SoC. I remember tuning bmm/concat: pytorch/pytorch#59596

Contributor Author

tuning bmm/concat, pytorch/pytorch#59596

I don't think this type of tuning is generally appropriate -- the CPU operators support all CPUs; attempting to optimize for specific characteristics of one CPU is presumably going to hurt the other ones. If you want a CPU-tuned operator it should be built for that CPU.

For the specific case of grain size tuning, we could hypothetically do something similar to dtype selective build where we let the user compile in a model-specific set of carefully selected grain sizes. I suspect there would be a simpler way to get that job done, though.
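Purely as an illustration of that hypothetical (nothing like this exists today; the macro name is made up), it could be as simple as a per-operator default that a model-specific build overrides at compile time:

#include <cstdint>

// Hypothetical sketch only: a compile-time override point for the reduction grain size,
// analogous in spirit to dtype selective build. ET_REDUCE_GRAIN_SIZE is a made-up name.
#ifndef ET_REDUCE_GRAIN_SIZE
#define ET_REDUCE_GRAIN_SIZE 32768
#endif

constexpr int64_t kReduceGrainSize = ET_REDUCE_GRAIN_SIZE;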

Contributor Author

I'll just drop the default grain size version for now. Can always find/replace it back.

swolchok added 2 commits March 5, 2025 20:09
@swolchok
Contributor Author

swolchok commented Mar 6, 2025

The existing parallel_for API appears to be similar to Intel oneTBB's parallel_for

To be fair, the Microsoft Parallel Patterns Library's parallel_for agrees with @mergennachin. I just think that when we make things with the same name as a PyTorch core primitive, they should work the same; ExecuTorch is PyTorch, and IMO it is just an unfortunate detail of the present situation that ExecuTorch doesn't live in pytorch/pytorch.

@kimishpatel
Contributor

@mergennachin

I don't quite follow why you find the existing API ugly.

The other, separate reason to either keep the same API or wholesale-upgrade everything is convenient code sharing, where the same abstractions are used for parallelization. This is used in the AO low-bit kernels, for example.

@mergennachin
Contributor

@kimishpatel

I don't quite follow why you find the existing API ugly.

for (int64_t i = start; i <= end; i++) {
  fn(base + i * stride);
}

is today converted to

parallel_for(start, end + 1, [&](auto start_, auto end_) {
  for (const auto i : c10::irange(start_, end_)) {
    fn(base + i * stride);
  }
});

which looks like it has two for loops (i.e., parallel_for and for)

whereas this looks like it has one for loop (i.e., parallel_for)

parallel_for(start, end + 1, [&](int64_t i) {
  fn(base + i * stride);
});

I don't see why we can't add the new parallelization capability independent of discussions about whether and how to improve the readability of said capability.

@swolchok -- well, in theory, when portable ops was first started, readability was one of the original goals so it looks like we're regressing with this (though today, with all the macros, it's hard to understand what's going on anyway, lol)

but it is unfortunate that pytorch/pytorch's parallel_for is a bit strange, and it seems like the chance to migrate it is long gone...

I agree with the argument that consistency between PT and ET is important here, so I'm not blocking the work.

@mergennachin
Contributor

mergennachin commented Mar 6, 2025

Also, one idea is that we can have multiple versions of parallel_for() (and upstream them to PyTorch) where there are both range-based and non-range-based implementations. It picks the right implementation via compile-time template metaprogramming (e.g., based on whether the lambda takes two arguments vs. one argument, using std::enable_if). The non-range-based implementation could call the range-based one.
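Roughly along these lines, as a sketch (names are placeholders; parallel_for_chunked stands in for the existing range-based primitive):

#include <cstdint>
#include <type_traits>

// Range-based form: func(int64_t begin, int64_t end) gets a subrange.
template <
    typename Func,
    std::enable_if_t<std::is_invocable_v<Func, int64_t, int64_t>, int> = 0>
bool parallel_for(int64_t begin, int64_t end, int64_t grain_size, const Func& func) {
  return parallel_for_chunked(begin, end, grain_size, func);
}

// Element-based form: func(int64_t i) gets one index; implemented on top of the range-based one.
template <
    typename Func,
    std::enable_if_t<
        std::is_invocable_v<Func, int64_t> &&
            !std::is_invocable_v<Func, int64_t, int64_t>,
        int> = 0>
bool parallel_for(int64_t begin, int64_t end, int64_t grain_size, const Func& func) {
  return parallel_for_chunked(
      begin, end, grain_size, [&func](int64_t chunk_begin, int64_t chunk_end) {
        for (int64_t i = chunk_begin; i < chunk_end; ++i) {
          func(i);
        }
      });
}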

@kimishpatel
Contributor

@kimishpatel

I don't quite follow why you find the existing API ugly.

for (int64_t i = start; i <= end; i++) {
  fn(base + i * stride);
}

is today converted to

parallel_for(start, end + 1, [&](auto start_, auto end_) {
  for (const auto i : c10::irange(start_, end_)) {
    fn(base + i * stride);
  }
});

Isn't that just a matter of how the lambda is constructed? I could also do

auto lambda = [&](size_t start, size_t end) {
  for (size_t i = start; i < end; i++) {
    fn(base + i * stride);
  }
};
parallel_for(start, end + 1,
  [&](auto start, auto end) { lambda(start, end); }
);

But I do see your point around parallel_for itself constructing the outer loop around fn(base + i * stride). If that's the case, I can understand it being a bit better (although I am not fully convinced, but that may be because I am used to the existing parallel_for semantics).

@kimishpatel
Contributor

readability was one of the original goals so it looks like we're regressing with this

This is exactly what I said to Scott. But then

though today, with all the macros, it's hard to understand what's going on anyway, lol

Because of this, and other nested lambdas and all the convoluted stuff, I didn't try to hold my ground on readability anymore.

@kimishpatel
Contributor

Also, one idea is that we can have multiple versions of parallel_for() (and upstream them to PyTorch) where there are both range-based and non-range-based implementations. It picks the right implementation via compile-time template metaprogramming (e.g., based on whether the lambda takes two arguments vs. one argument, using std::enable_if). The non-range-based implementation could call the range-based one.

I feel we should just pick parallel_for_each or some such new API if we go down that route.

@swolchok
Contributor Author

swolchok commented Mar 6, 2025

noting that CI is 100% green in case of rebase

swolchok added 5 commits March 6, 2025 09:56
@swolchok
Contributor Author

swolchok commented Mar 7, 2025

This is incorrect -- the per-element function for reductions cares about the order in which it is applied. We need to port parallel_reduce, and we need to figure out some way to get tests to fail with this as written.
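For reference, the shape we need is roughly that of at::parallel_reduce: each chunk reduces its own subrange starting from an identity value, and the partial results are then combined explicitly, rather than every chunk feeding one accumulator in range order. A serial sketch of those semantics (placeholder names, not the actual port):

#include <algorithm>
#include <cstdint>

// Serial sketch of parallel_reduce semantics: reduce_chunk is (begin, end, init) -> T,
// combine is (T, T) -> T. A real implementation would run the chunks in parallel and
// then combine the per-chunk partials.
template <typename T, typename ReduceFn, typename CombineFn>
T parallel_reduce_sketch(
    int64_t begin,
    int64_t end,
    int64_t grain_size,
    T identity,
    const ReduceFn& reduce_chunk,
    const CombineFn& combine) {
  T acc = identity;
  for (int64_t chunk_begin = begin; chunk_begin < end; chunk_begin += grain_size) {
    const int64_t chunk_end = std::min(chunk_begin + grain_size, end);
    acc = combine(acc, reduce_chunk(chunk_begin, chunk_end, identity));
  }
  return acc;
}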

@swolchok
Contributor Author

swolchok commented Mar 7, 2025

get tests to fail

Solved by making parallel_for iterate backward in debug builds. (Will send the PR after the bottom ones get reviewed -- I can only stack up 8 PRs before ghstack tells me my stack is too big and GitHub will rate-limit me.)
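The idea, sketched (not the actual patch; the chunk boundaries here are illustrative):

#include <algorithm>
#include <cstdint>

// Debug-mode sketch: visit chunks back-to-front so callers that silently rely on
// left-to-right application fail loudly under debug tests; elements within a chunk
// are still visited in order by the caller's own loop.
template <typename Func>
void run_chunks_reversed(int64_t begin, int64_t end, int64_t grain_size, const Func& func) {
  for (int64_t chunk_end = end; chunk_end > begin; chunk_end -= grain_size) {
    const int64_t chunk_begin = std::max(begin, chunk_end - grain_size);
    func(chunk_begin, chunk_end);
  }
}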

Labels
CLA Signed: This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed.

4 participants