Add basic parallel_for support to reduce_util #8986


Closed
wants to merge 37 commits

Conversation

swolchok
Contributor

@swolchok swolchok commented Mar 6, 2025

Initial parallel_for integration in a portable op. Needed for #8932. Feel free to hold review until the rest of the stack is ready and we observe successful parallelization.

swolchok added 25 commits March 4, 2025 11:35
@swolchok
Contributor Author

swolchok commented Mar 6, 2025

c10::irange which seems unnecessary

c10::irange is how we should be writing for loops over a range, full stop.
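For reference, a minimal sketch of the idiom (the function and variable names here are just for illustration):

#include <c10/util/irange.h>

// Hypothetical example: sum the first n elements of data.
int64_t sum_first_n(const int64_t* data, int64_t n) {
  int64_t acc = 0;
  for (const auto i : c10::irange(n)) { // i ranges over [0, n)
    acc += data[i];
  }
  return acc;
}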

It has nested loops, with the inner loop using c10::irange

Sorry, where is the outer loop?

the existing parallel_for is quite ugly

  1. It matches PyTorch core. See the definition of at::parallel_for and the dozens of uses. If we want to materially improve this primitive, we should do it upstream. (I should upstream my addition of a version with default grain size as well.)
  2. I don't see why we can't add the new parallelization capability independent of discussions about whether and how to improve the readability of said capability.
  3. The existing parallel_for API appears to be similar to Intel oneTBB's parallel_for, which also takes a range and a task and passes a subrange to the task. (See https://github.com/uxlfoundation/oneTBB/blob/master/examples/parallel_for/game_of_life/Evolution.cpp#L177 for example.) I think the reason to do it this way is to be useful for cases that don't just need to visit each element in the range. (For example, early termination partway through a chunk gets ugly with a per-element API.)
  4. oneTBB does have a parallel_for_each algorithm that looks like it might be what you want. (See the example, and the rough sketch contrasting the two shapes just below this list.) We could certainly add a parallel_for_each and migrate to it, but again I don't think it needs to block this work.
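For concreteness, a rough sketch of the two shapes (out, in, f, n, and grain_size are placeholders, and parallel_for_each is hypothetical; the chunked form mirrors the existing API):

// Existing chunked shape: the task receives a half-open subrange [begin, end) and loops itself.
parallel_for(0, n, grain_size, [&](int64_t begin, int64_t end) {
  for (const auto i : c10::irange(begin, end)) {
    out[i] = f(in[i]);
  }
});

// Hypothetical per-element shape (a parallel_for_each-style API): the framework owns the
// loop and hands the task one index at a time.
parallel_for_each(0, n, grain_size, [&](int64_t i) {
  out[i] = f(in[i]);
});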

Comment on lines +90 to +92
bool parallel_for(const int64_t begin, const int64_t end, const Func& func) {
  return parallel_for(begin, end, internal::GRAIN_SIZE, func);
}
Contributor

I feel we should let this be explicit instead of providing this convenience. Each operator, in my experience, has its own sweet spot, and with this convenience function you might think "I just need to slap parallel_for on and it's OK." So I feel it's OK to not have the convenience function, just to make authors aware of grain size. Just my opinion.

Contributor Author

Each operator, in my experience, has its own sweet spot

do you have any good references on tuning this? I'm having trouble envisioning the sweet spot being dependent on the operator and also independent of the specific CPU, model, input dimensions, etc. My working assumption was that we just have to pick some number.
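For what it's worth, my rough mental model of the knob (just a sketch; the exact chunking policy is up to the threadpool backend): the grain size bounds how small a chunk each task gets, so it mostly trades scheduling overhead against available parallelism. Roughly:

#include <algorithm>
#include <cstdint>

// Rough intuition only; hypothetical helper, not part of the API.
// e.g. begin = 0, end = 1'000'000, grain_size = 32'768 gives roughly 30 chunks,
// which are then spread across however many threads the pool has.
int64_t approx_num_chunks(int64_t begin, int64_t end, int64_t grain_size) {
  return std::max<int64_t>(1, (end - begin) / std::max<int64_t>(1, grain_size));
}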

Contributor

I'm having trouble envisioning the sweet spot being dependent on the operator and also independent of the specific CPU, model, input dimensions,

I was meaning to say that it is dependent on all of that: operator, input sizes, CPU, etc. For empirical reasons the former, like operator and input sizes, are the easiest to target, and more advanced things can be done with the CPU/SoC. I remember tuning bmm/concat: pytorch/pytorch#59596

Contributor Author

tuning bmm/concat, pytorch/pytorch#59596

I don't think this type of tuning is generally appropriate -- the CPU operators support all CPUs; attempting to optimize for specific characteristics of one CPU is presumably going to hurt the other ones. If you want a CPU-tuned operator it should be built for that CPU.

For the specific case of grain size tuning, we could hypothetically do something similar to dtype selective build where we let the user compile in a model-specific set of carefully selected grain sizes. I suspect there would be a simpler way to get that job done, though.
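Purely as an illustration of that hypothetical (nothing like this exists today; the macro name is made up), it could be as simple as a per-operator default that a model-specific build overrides at compile time:

#include <cstdint>

// Hypothetical sketch only: a compile-time override point for the reduction grain size,
// analogous in spirit to dtype selective build. ET_REDUCE_GRAIN_SIZE is a made-up name.
#ifndef ET_REDUCE_GRAIN_SIZE
#define ET_REDUCE_GRAIN_SIZE 32768
#endif

constexpr int64_t kReduceGrainSize = ET_REDUCE_GRAIN_SIZE;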

Contributor Author

I'll just drop the default grain size version for now. Can always find/replace it back.

swolchok added 2 commits March 5, 2025 20:09
@swolchok
Contributor Author

swolchok commented Mar 6, 2025

The existing parallel_for API appears to be similar to Intel oneTBB's parallel_for

To be fair, the Microsoft Parallel Patterns Library's parallel_for agrees with @mergennachin. I just think that when we make things with the same name as a PyTorch core primitive, they should work the same; ExecuTorch is PyTorch, and IMO it is just an unfortunate detail of the present situation that ExecuTorch doesn't live in pytorch/pytorch.

@kimishpatel
Contributor

@mergennachin

I don't quite follow why you find the existing API ugly.

The other, separate reason to either keep the same API or wholesale-upgrade everything is convenient code sharing, where the same abstractions are used for parallelization. This is used in the AO low-bit kernels, for example.

@mergennachin
Contributor

@kimishpatel

I don't quite follow why you find the existing API ugly.

for (int64_t i = start; i <= end; i++) {
  fn(base + i * stride);
}

is today converted to

parallel_for(start, end + 1, [&](auto start_, auto end_) {
  for (const auto i : c10::irange(start_, end_)) {
    fn(base + i * stride);
  }
});

which looks like it has two for loops (i.e., parallel_for and for)

whereas this looks like it has one for loop (i.e., parallel_for)

parallel_for(start, end + 1, [&](int64_t i) {
  fn(base + i * stride);
});

I don't see why we can't add the new parallelization capability independent of discussions about whether and how to improve the readability of said capability.

@swolchok -- well, in theory, when portable ops was first started, readability was one of the original goals so it looks like we're regressing with this (though today, with all the macros, it's hard to understand what's going on anyway, lol)

but it is unfortunate that pytorch/pytorch's parallel_for is a bit strange, and it seems like the chance to migrate it is long gone...

I agree with the argument that consistency between PT and ET is important here, so I'm not blocking the work.

@mergennachin
Contributor

mergennachin commented Mar 6, 2025

Also, one idea is that we can have multiple versions of parallel_for() (and upstream them to PyTorch) where there are both range-based and non-range-based implementations. It picks the right implementation via compile-time template metaprogramming (e.g., based on whether the lambda takes two arguments vs. one argument, using std::enable_if). The non-range-based implementation could call the range-based one.
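Roughly along these lines, as a sketch (names are placeholders; parallel_for_chunked stands in for the existing range-based primitive):

#include <cstdint>
#include <type_traits>

// Range-based form: func(int64_t begin, int64_t end) gets a subrange.
template <
    typename Func,
    std::enable_if_t<std::is_invocable_v<Func, int64_t, int64_t>, int> = 0>
bool parallel_for(int64_t begin, int64_t end, int64_t grain_size, const Func& func) {
  return parallel_for_chunked(begin, end, grain_size, func);
}

// Element-based form: func(int64_t i) gets one index; implemented on top of the range-based one.
template <
    typename Func,
    std::enable_if_t<
        std::is_invocable_v<Func, int64_t> &&
            !std::is_invocable_v<Func, int64_t, int64_t>,
        int> = 0>
bool parallel_for(int64_t begin, int64_t end, int64_t grain_size, const Func& func) {
  return parallel_for_chunked(
      begin, end, grain_size, [&func](int64_t chunk_begin, int64_t chunk_end) {
        for (int64_t i = chunk_begin; i < chunk_end; ++i) {
          func(i);
        }
      });
}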

@kimishpatel
Contributor

@kimishpatel

I don't quite follow why you find the existing API ugly.

for (int64_t i = start; i <= end; i++) {
  fn(base + i * stride);
}

is today converted to

parallel_for(start, end + 1, [&](auto start_, auto end_) {
  for (const auto i : c10::irange(start_, end_)) {
    fn(base + i * stride);
  }
});

Isn't that just a matter of how the lambda is constructed? I could also do

auto lambda = [&](size_t start, size_t end) {
  for (size_t i = start; i < end; i++) {
    fn(base + i * stride);
  }
};
parallel_for(start, end + 1,
  [&](auto start, auto end) { lambda(start, end); }
);

But I do see your point around parallel_for itself constructing the outer loop around fn(base + i * stride). If that's the case, I can understand it being a bit better (although I am not fully convinced, but that may be because I am used to the existing parallel_for semantics).

@kimishpatel
Contributor

readability was one of the original goals so it looks like we're regressing with this

This is exactly what I said to Scott. But then

though today, with all the macros, it's hard to understand what's going on anyway, lol

Because of this, and other nested lambdas and all the convoluted stuff, I didn't try to hold my ground on readability anymore.

@kimishpatel
Contributor

Also, one idea is that we can have multiple versions of parallel_for() (and upstream them to PyTorch) where there are both range-based and non-range-based implementations. It picks the right implementation via compile-time template metaprogramming (e.g., based on whether the lambda takes two arguments vs. one argument, using std::enable_if). The non-range-based implementation could call the range-based one.

I feel we should just pick parallel_for_each or some such new API if we go down that route.

@swolchok
Contributor Author

swolchok commented Mar 6, 2025

noting that CI is 100% green in case of rebase

swolchok added 5 commits March 6, 2025 09:56
@swolchok
Contributor Author

swolchok commented Mar 7, 2025

This is incorrect -- the per-element function for reductions cares about the order in which it is applied. We need to port parallel_reduce, and we need to figure out some way to get tests to fail with this as written.
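For reference, the shape we need is roughly that of at::parallel_reduce: each chunk reduces its own subrange starting from an identity value, and the partial results are then combined explicitly, rather than every chunk feeding one accumulator in range order. A serial sketch of those semantics (placeholder names, not the actual port):

#include <algorithm>
#include <cstdint>

// Serial sketch of parallel_reduce semantics: reduce_chunk is (begin, end, init) -> T,
// combine is (T, T) -> T. A real implementation would run the chunks in parallel and
// then combine the per-chunk partials.
template <typename T, typename ReduceFn, typename CombineFn>
T parallel_reduce_sketch(
    int64_t begin,
    int64_t end,
    int64_t grain_size,
    T identity,
    const ReduceFn& reduce_chunk,
    const CombineFn& combine) {
  T acc = identity;
  for (int64_t chunk_begin = begin; chunk_begin < end; chunk_begin += grain_size) {
    const int64_t chunk_end = std::min(chunk_begin + grain_size, end);
    acc = combine(acc, reduce_chunk(chunk_begin, chunk_end, identity));
  }
  return acc;
}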

@swolchok
Contributor Author

swolchok commented Mar 7, 2025

get tests to fail

Solved by making parallel_for iterate backward in debug builds. (Will send the PR after the bottom ones get reviewed -- I can only stack up 8 PRs before ghstack tells me my stack is too big and GitHub will rate-limit me.)
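The idea, sketched (not the actual patch; the chunk boundaries here are illustrative):

#include <algorithm>
#include <cstdint>

// Debug-mode sketch: visit chunks back-to-front so callers that silently rely on
// left-to-right application fail loudly under debug tests; elements within a chunk
// are still visited in order by the caller's own loop.
template <typename Func>
void run_chunks_reversed(int64_t begin, int64_t end, int64_t grain_size, const Func& func) {
  for (int64_t chunk_end = end; chunk_end > begin; chunk_end -= grain_size) {
    const int64_t chunk_begin = std::max(begin, chunk_end - grain_size);
    func(chunk_begin, chunk_end);
  }
}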

Labels
CLA Signed: This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed.

4 participants