Commit 41f77da

[flang][OpenMP] Upstream do concurrent loop-nest detection.
Upstreams the next part of the `do concurrent` to OpenMP mapping pass (from AMD's ROCm implementation). See #126026 for more context. This PR adds loop-nest detection logic, which lets the pass discover multi-range `do concurrent` loops and then map them as "collapsed" loop nests to OpenMP.
1 parent 0b8b320 commit 41f77da

3 files changed, with 273 additions and 0 deletions

flang/docs/DoConcurrentConversionToOpenMP.md

Lines changed: 74 additions & 0 deletions
@@ -53,6 +53,79 @@ that:
 * It has been tested in a very limited way so far.
 * It has been tested mostly on simple synthetic inputs.
 
+### Loop nest detection
+
+On the `FIR` dialect level, the following loop:
+```fortran
+do concurrent(i=1:n, j=1:m, k=1:o)
+  a(i,j,k) = i + j + k
+end do
+```
+is modelled as a nest of `fir.do_loop` ops such that an outer loop's region
+contains **only** the following:
+1. The operations needed to assign/update the outer loop's induction variable.
+1. The inner loop itself.
+
+So the MLIR structure for the above example looks similar to the following:
+```
+fir.do_loop %i_idx = %34 to %36 step %c1 unordered {
+  %i_idx_2 = fir.convert %i_idx : (index) -> i32
+  fir.store %i_idx_2 to %i_iv#1 : !fir.ref<i32>
+
+  fir.do_loop %j_idx = %37 to %39 step %c1_3 unordered {
+    %j_idx_2 = fir.convert %j_idx : (index) -> i32
+    fir.store %j_idx_2 to %j_iv#1 : !fir.ref<i32>
+
+    fir.do_loop %k_idx = %40 to %42 step %c1_5 unordered {
+      %k_idx_2 = fir.convert %k_idx : (index) -> i32
+      fir.store %k_idx_2 to %k_iv#1 : !fir.ref<i32>
+
+      ... loop nest body goes here ...
+    }
+  }
+}
+```
+This applies to multi-range loops in general; they are represented in the IR as
+a nest of `fir.do_loop` ops with the above nesting structure.
+
+Therefore, the pass detects such "perfectly" nested loop ops to identify multi-range
+loops and map them as "collapsed" loops in OpenMP.
+
+#### Further info regarding loop nest detection
+
+Loop nest detection is currently limited to the scenario described in the previous
+section. This is quite restrictive and can be extended in the future to cover
+more cases. For example, for the following loop nest, even though both loops are
+perfectly nested, at the moment only the outer loop is parallelized:
+```fortran
+do concurrent(i=1:n)
+  do concurrent(j=1:m)
+    a(i,j) = i * j
+  end do
+end do
+```
+
+Similarly, for the following loop nest, even though the intervening statement `x = 41`
+does not have any memory effects that would affect parallelization, this nest is
+not parallelized either (only the outer loop is).
+
+```fortran
+do concurrent(i=1:n)
+  x = 41
+  do concurrent(j=1:m)
+    a(i,j) = i * j
+  end do
+end do
+```
+
+The above also has the consequence that the `j` variable will **not** be
+privatized in the OpenMP parallel/target region. In other words, it will be
+treated as if it were a `shared` variable. For more details about privatization,
+see the "Data environment" section below.
+
+See `flang/test/Transforms/DoConcurrent/loop_nest_test.f90` for more examples
+of what is and is not detected as a perfect loop nest.
+
 <!--
 More details about current status will be added along with relevant parts of the
 implementation in later upstreaming patches.
@@ -150,6 +223,7 @@ targeting OpenMP.
 - [x] Command line options for `flang` and `bbc`.
 - [x] Conversion pass skeleton (no transformations happen yet).
 - [x] Status description and tracking document (this document).
+- [x] Loop nest detection to identify multi-range loops.
 - [ ] Basic host/CPU mapping support.
 - [ ] Basic device/GPU mapping support.
 - [ ] More advanced host and device support (expanded to multiple items as needed).
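
As a rough illustration of what mapping a multi-range loop to a "collapsed" OpenMP nest means, the three-range example from the documentation change above corresponds conceptually to an OpenMP `collapse(3)` loop. The sketch below is a C++ analogue only: the pass itself operates on MLIR rather than on source-level pragmas, and the helper name `fillCollapsed` and the flat array layout are assumptions made for this example.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper (not part of flang): computes a(i,j,k) = i + j + k with
// 1-based indices, mirroring the Fortran `do concurrent(i=1:n, j=1:m, k=1:o)`.
std::vector<int> fillCollapsed(int n, int m, int o) {
  assert(n >= 0 && m >= 0 && o >= 0);
  std::vector<int> a(static_cast<std::size_t>(n) * m * o);

  // `collapse(3)` asks OpenMP to treat the three perfectly nested loops as a
  // single iteration space. Without OpenMP enabled the pragma is ignored and
  // the loops simply run sequentially.
  #pragma omp parallel for collapse(3)
  for (int i = 1; i <= n; ++i)
    for (int j = 1; j <= m; ++j)
      for (int k = 1; k <= o; ++k)
        a[static_cast<std::size_t>(((i - 1) * m + (j - 1)) * o + (k - 1))] =
            i + j + k;

  return a;
}
```

Collapsing is only valid because nothing sits between the loop headers, which is the property the loop nest detection in this commit checks for.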

flang/lib/Optimizer/OpenMP/DoConcurrentConversion.cpp

Lines changed: 110 additions & 0 deletions
@@ -9,8 +9,10 @@
 #include "flang/Optimizer/Dialect/FIROps.h"
 #include "flang/Optimizer/OpenMP/Passes.h"
 #include "flang/Optimizer/OpenMP/Utils.h"
+#include "mlir/Analysis/SliceAnalysis.h"
 #include "mlir/Dialect/OpenMP/OpenMPDialect.h"
 #include "mlir/Transforms/DialectConversion.h"
+#include "mlir/Transforms/RegionUtils.h"
 
 namespace flangomp {
 #define GEN_PASS_DEF_DOCONCURRENTCONVERSIONPASS
@@ -21,6 +23,106 @@ namespace flangomp {
 #define DBGS() (llvm::dbgs() << "[" DEBUG_TYPE << "]: ")
 
 namespace {
+namespace looputils {
+using LoopNest = llvm::SetVector<fir::DoLoopOp>;
+
+/// Loop \p innerLoop is considered perfectly-nested inside \p outerLoop iff
+/// there are no operations in \p outerLoop's body other than:
+///
+/// 1. the operations needed to assign/update \p outerLoop's induction variable.
+/// 2. \p innerLoop itself.
+///
+/// \return true if \p innerLoop is perfectly nested inside \p outerLoop
+/// according to the above definition.
+bool isPerfectlyNested(fir::DoLoopOp outerLoop, fir::DoLoopOp innerLoop) {
+  mlir::ForwardSliceOptions forwardSliceOptions;
+  forwardSliceOptions.inclusive = true;
+  // We don't care about the outer-loop's induction variable's uses within the
+  // inner-loop, so we filter out these uses.
+  //
+  // This filter tells `getForwardSlice` (below) to only collect operations
+  // which produce results defined above (i.e. outside) the inner-loop's body.
+  //
+  // Since `outerLoop.getInductionVar()` is a block argument (to the
+  // outer-loop's body), the filter effectively collects uses of
+  // `outerLoop.getInductionVar()` inside the outer-loop but outside the
+  // inner-loop.
+  forwardSliceOptions.filter = [&](mlir::Operation *op) {
+    return mlir::areValuesDefinedAbove(op->getResults(), innerLoop.getRegion());
+  };
+
+  llvm::SetVector<mlir::Operation *> indVarSlice;
+  mlir::getForwardSlice(outerLoop.getInductionVar(), &indVarSlice,
+                        forwardSliceOptions);
+  llvm::DenseSet<mlir::Operation *> indVarSet(indVarSlice.begin(),
+                                              indVarSlice.end());
+
+  llvm::DenseSet<mlir::Operation *> outerLoopBodySet;
+  // The following walk collects ops inside `outerLoop` that are **not**:
+  // * the outer-loop itself,
+  // * or the inner-loop,
+  // * or the `fir.result` op (the outer-loop's terminator).
+  outerLoop.walk<mlir::WalkOrder::PreOrder>([&](mlir::Operation *op) {
+    if (op == outerLoop)
+      return mlir::WalkResult::advance();
+
+    if (op == innerLoop)
+      return mlir::WalkResult::skip();
+
+    if (mlir::isa<fir::ResultOp>(op))
+      return mlir::WalkResult::advance();
+
+    outerLoopBodySet.insert(op);
+    return mlir::WalkResult::advance();
+  });
+
+  // If `outerLoopBodySet` ends up having the same ops as `indVarSet`, then
+  // `outerLoop` only contains ops that set up its induction variable +
+  // `innerLoop` + the `fir.result` terminator. In other words, `innerLoop` is
+  // perfectly nested inside `outerLoop`.
+  bool result = (outerLoopBodySet == indVarSet);
+  mlir::Location loc = outerLoop.getLoc();
+  LLVM_DEBUG(DBGS() << "Loop pair starting at location " << loc << " is"
+                    << (result ? "" : " not") << " perfectly nested\n");
+
+  return result;
+}
+
+/// Starting with `outerLoop` collect a perfectly nested loop nest, if any. This
+/// function collects as many loops as possible in the nest; in case it fails to
+/// recognize a certain nested loop as part of the nest it just returns the
+/// parent loops it discovered before.
+mlir::LogicalResult collectLoopNest(fir::DoLoopOp currentLoop,
+                                    LoopNest &loopNest) {
+  assert(currentLoop.getUnordered());
+
+  while (true) {
+    loopNest.insert(currentLoop);
+    auto directlyNestedLoops = currentLoop.getRegion().getOps<fir::DoLoopOp>();
+    llvm::SmallVector<fir::DoLoopOp> unorderedLoops;
+
+    for (auto nestedLoop : directlyNestedLoops)
+      if (nestedLoop.getUnordered())
+        unorderedLoops.push_back(nestedLoop);
+
+    if (unorderedLoops.empty())
+      break;
+
+    if (unorderedLoops.size() > 1)
+      return mlir::failure();
+
+    fir::DoLoopOp nestedUnorderedLoop = unorderedLoops.front();
+
+    if (!isPerfectlyNested(currentLoop, nestedUnorderedLoop))
+      return mlir::failure();
+
+    currentLoop = nestedUnorderedLoop;
+  }
+
+  return mlir::success();
+}
+} // namespace looputils
+
 class DoConcurrentConversion : public mlir::OpConversionPattern<fir::DoLoopOp> {
 public:
   using mlir::OpConversionPattern<fir::DoLoopOp>::OpConversionPattern;
@@ -31,6 +133,14 @@ class DoConcurrentConversion : public mlir::OpConversionPattern<fir::DoLoopOp> {
   mlir::LogicalResult
   matchAndRewrite(fir::DoLoopOp doLoop, OpAdaptor adaptor,
                   mlir::ConversionPatternRewriter &rewriter) const override {
+    looputils::LoopNest loopNest;
+    bool hasRemainingNestedLoops =
+        failed(looputils::collectLoopNest(doLoop, loopNest));
+    if (hasRemainingNestedLoops)
+      mlir::emitWarning(doLoop.getLoc(),
+                        "Some `do concurrent` loops are not perfectly-nested. "
+                        "These will be serialized.");
+
     // TODO This will be filled in with the next PRs that upstream the rest of
     // the ROCm implementation.
     return mlir::success();
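
The set comparison at the heart of `isPerfectlyNested` can be hard to picture without MLIR at hand. The following is a minimal, MLIR-free sketch of the same idea, assuming a toy IR in which each op lists the ops whose results it uses. `ToyOp`, `kInductionVar`, and `isPerfectlyNestedToy` are hypothetical names invented for this illustration, and the real forward-slice and region-filtering logic is considerably more involved.

```cpp
#include <cassert>
#include <set>
#include <string>
#include <vector>

// Marker operand meaning "the outer loop's induction variable" (toy convention).
const int kInductionVar = -1;

// Toy stand-in for an operation in the outer loop's body (not an MLIR type).
struct ToyOp {
  std::string name;
  std::vector<int> operands; // indices of earlier ops in the body, or kInductionVar
  bool isInnerLoop;
  bool isTerminator;
};

// Assumes ops appear in definition-before-use order. Returns true when every
// op other than the inner loop and the terminator is reachable from the
// induction variable, i.e. the body holds only induction-variable setup.
bool isPerfectlyNestedToy(const std::vector<ToyOp> &outerBody) {
  std::set<int> ivSlice; // toy forward slice of the induction variable
  std::set<int> bodyOps; // everything except the inner loop and terminator

  for (int idx = 0; idx < static_cast<int>(outerBody.size()); ++idx) {
    const ToyOp &op = outerBody[idx];
    if (op.isInnerLoop || op.isTerminator)
      continue;

    bodyOps.insert(idx);
    for (int operand : op.operands) {
      if (operand == kInductionVar || ivSlice.count(operand) != 0) {
        ivSlice.insert(idx);
        break;
      }
    }
  }

  // Same comparison as in the diff: equal sets mean nothing else is present.
  return bodyOps == ivSlice;
}
```

With this model, a body made of a convert, a store, the inner loop, and the terminator is accepted, while adding an unrelated constant and store (such as the `x = 41` case from the documentation) makes the two sets differ.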
Lines changed: 89 additions & 0 deletions
@@ -0,0 +1,89 @@
+! Tests loop-nest detection algorithm for do-concurrent mapping.
+
+! REQUIRES: asserts
+
+! RUN: %flang_fc1 -emit-hlfir -fopenmp -fdo-concurrent-to-openmp=host \
+! RUN:   -mmlir -debug %s -o - 2> %t.log || true
+
+! RUN: FileCheck %s < %t.log
+
+program main
+  implicit none
+
+contains
+
+subroutine foo(n)
+  implicit none
+  integer :: n, m
+  integer :: i, j, k
+  integer :: x
+  integer, dimension(n) :: a
+  integer, dimension(n, n, n) :: b
+
+  ! CHECK: Loop pair starting at location
+  ! CHECK: loc("{{.*}}":[[# @LINE + 1]]:{{.*}}) is perfectly nested
+  do concurrent(i=1:n, j=1:bar(n*m, n/m))
+    a(i) = n
+  end do
+
+  ! CHECK: Loop pair starting at location
+  ! CHECK: loc("{{.*}}":[[# @LINE + 1]]:{{.*}}) is perfectly nested
+  do concurrent(i=bar(n, x):n, j=1:bar(n*m, n/m))
+    a(i) = n
+  end do
+
+  ! CHECK: Loop pair starting at location
+  ! CHECK: loc("{{.*}}":[[# @LINE + 1]]:{{.*}}) is not perfectly nested
+  do concurrent(i=bar(n, x):n)
+    do concurrent(j=1:bar(n*m, n/m))
+      a(i) = n
+    end do
+  end do
+
+  ! CHECK: Loop pair starting at location
+  ! CHECK: loc("{{.*}}":[[# @LINE + 1]]:{{.*}}) is not perfectly nested
+  do concurrent(i=1:n)
+    x = 10
+    do concurrent(j=1:m)
+      b(i,j,k) = i * j + k
+    end do
+  end do
+
+  ! CHECK: Loop pair starting at location
+  ! CHECK: loc("{{.*}}":[[# @LINE + 1]]:{{.*}}) is not perfectly nested
+  do concurrent(i=1:n)
+    do concurrent(j=1:m)
+      b(i,j,k) = i * j + k
+    end do
+    x = 10
+  end do
+
+  ! CHECK: Loop pair starting at location
+  ! CHECK: loc("{{.*}}":[[# @LINE + 1]]:{{.*}}) is not perfectly nested
+  do concurrent(i=1:n)
+    do concurrent(j=1:m)
+      b(i,j,k) = i * j + k
+      x = 10
+    end do
+  end do
+
+  ! Verify the (i,j) and (j,k) pairs of loops are detected as perfectly nested.
+  !
+  ! CHECK: Loop pair starting at location
+  ! CHECK: loc("{{.*}}":[[# @LINE + 3]]:{{.*}}) is perfectly nested
+  ! CHECK: Loop pair starting at location
+  ! CHECK: loc("{{.*}}":[[# @LINE + 1]]:{{.*}}) is perfectly nested
+  do concurrent(i=bar(n, x):n, j=1:bar(n*m, n/m), k=1:bar(n*m, bar(n*m, n/m)))
+    a(i) = n
+  end do
+end subroutine
+
+pure function bar(n, m)
+  implicit none
+  integer, intent(in) :: n, m
+  integer :: bar
+
+  bar = n + m
+end function
+
+end program main