
[flang][OpenMP] Upstream first part of do concurrent mapping #126026


Merged (1 commit, Apr 2, 2025)
4 changes: 4 additions & 0 deletions clang/include/clang/Driver/Options.td
@@ -6974,6 +6974,10 @@ defm loop_versioning : BoolOptionWithoutMarshalling<"f", "version-loops-for-stri

def fhermetic_module_files : Flag<["-"], "fhermetic-module-files">, Group<f_Group>,
HelpText<"Emit hermetic module files (no nested USE association)">;

def fdo_concurrent_to_openmp_EQ : Joined<["-"], "fdo-concurrent-to-openmp=">,
HelpText<"Try to map `do concurrent` loops to OpenMP [none|host|device]">,
Values<"none, host, device">;
} // let Visibility = [FC1Option, FlangOption]

def J : JoinedOrSeparate<["-"], "J">,
3 changes: 2 additions & 1 deletion clang/lib/Driver/ToolChains/Flang.cpp
@@ -165,7 +165,8 @@ void Flang::addCodegenOptions(const ArgList &Args,
CmdArgs.push_back("-fversion-loops-for-stride");

Args.addAllArgs(CmdArgs,
{options::OPT_flang_experimental_hlfir,
{options::OPT_fdo_concurrent_to_openmp_EQ,
options::OPT_flang_experimental_hlfir,
options::OPT_flang_deprecated_no_hlfir,
options::OPT_fno_ppc_native_vec_elem_order,
options::OPT_fppc_native_vec_elem_order,
155 changes: 155 additions & 0 deletions flang/docs/DoConcurrentConversionToOpenMP.md
@@ -0,0 +1,155 @@
<!--===- docs/DoConcurrentConversionToOpenMP.md

Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.
See https://llvm.org/LICENSE.txt for license information.
SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception

-->

# `DO CONCURRENT` mapping to OpenMP

```{contents}
---
local:
---
```

This document seeks to describe the effort to parallelize `do concurrent` loops
by mapping them to OpenMP worksharing constructs. The goals of this document
are:
* Describing how to instruct `flang` to map `DO CONCURRENT` loops to OpenMP
constructs.
* Tracking the current status of such mapping.
* Describing the limitations of the current implementation.
* Describing next steps.
* Tracking the current upstreaming status (from the AMD ROCm fork).

## Usage

To enable `do concurrent` to OpenMP mapping, `flang` provides a new compiler
flag: `-fdo-concurrent-to-openmp`. This flag has 3 possible values:
1. `host`: this maps `do concurrent` loops to run in parallel on the host CPU.
This maps such loops to the equivalent of `omp parallel do`.
2. `device`: this maps `do concurrent` loops to run in parallel on a target device.
This maps such loops to the equivalent of
`omp target teams distribute parallel do`.
3. `none`: this disables `do concurrent` mapping altogether. In that case, such
loops are emitted as sequential loops.

The `-fdo-concurrent-to-openmp` compiler switch is currently available only when
OpenMP is also enabled, so both options must be passed to `flang` in order to
enable the mapping:
```
flang ... -fopenmp -fdo-concurrent-to-openmp=[host|device|none] ...
```
For mapping to device, the target device architecture must be specified as well.
See `-fopenmp-targets` and `--offload-arch` for more info.
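
As a minimal, self-contained illustration (the `saxpy` subroutine below is a
made-up example, not part of this patch), this is the kind of loop the flag
targets:

```fortran
subroutine saxpy(n, a, x, y)
  implicit none
  integer, intent(in) :: n
  real, intent(in) :: a, x(n)
  real, intent(inout) :: y(n)
  integer :: i

  ! With `-fopenmp -fdo-concurrent-to-openmp=host` this loop is mapped to the
  ! equivalent of `omp parallel do`; with `=device`, to the equivalent of
  ! `omp target teams distribute parallel do`; with `=none`, it stays sequential.
  do concurrent (i = 1:n)
    y(i) = a * x(i) + y(i)
  end do
end subroutine saxpy
```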

## Current status

Under the hood, `do concurrent` mapping is implemented in the
`DoConcurrentConversionPass`. This is still an experimental pass, which means
that:
* It has been tested in a very limited way so far.
* It has been tested mostly on simple synthetic inputs.

<!--
More details about current status will be added along with relevant parts of the
implementation in later upstreaming patches.
-->

## Next steps

This section describes some of the open questions/issues that are not tackled yet
even in the downstream implementation.

### Delayed privatization

So far, we emit the privatization logic for IVs inline in the parallel/target
region. This is sufficient for now since we do not localize/privatize any
sophisticated types of variables yet. Once we need more advanced localization
through `do concurrent`'s locality specifiers (see below), delayed
privatization will give us much cleaner IR. Once the upstream implementation of
delayed privatization supports the constructs required by the pass, we will
switch to it instead of inlined/early privatization.
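
As a short sketch (a made-up fragment; `a`, `n`, and `m` are assumed to be
declared elsewhere), the IVs below are the only variables the pass currently
privatizes, and it does so inline:

```fortran
! `i` and `j` are the loop IVs; their privatization logic is currently emitted
! inline in the parallel/target region rather than via delayed privatization.
do concurrent (i = 1:n, j = 1:m)
  a(i, j) = i + j
end do
```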

### Locality specifiers for `do concurrent`

Locality specifiers will enable the user to control the data environment of the
loop nest in a more fine-grained way. Implementing these specifiers on the
`FIR` dialect level is needed in order to support this in the
`DoConcurrentConversionPass`.

Such specifiers will also unlock a potential solution to the
non-perfectly-nested loops' IVs issue described above. In particular, for a
non-perfectly nested loop, one middle-ground proposal/solution would be to:
* Emit the loop's IV as shared/mapped just like we do currently.
* Emit a warning that the IV of the loop is emitted as shared/mapped.
* Given support for `LOCAL`, recommend that the user explicitly
  localize/privatize the loop's IV if they choose to (see the sketch below).
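
The fragment below is an illustrative sketch (not from the patch; `a`, `b`,
`tmp`, and `n` are assumed to be declared elsewhere) of the Fortran 2018
`local` specifier that such FIR-level modelling would eventually support:

```fortran
! `tmp` is LOCAL, so each iteration gets its own copy; modelling this on the
! FIR level would let the pass privatize it instead of falling back to
! shared/mapped storage.
do concurrent (i = 1:n) local(tmp)
  tmp = 2.0 * a(i)
  b(i) = tmp + 1.0
end do
```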

#### Sharing TableGen clause records from the OpenMP dialect

At the moment, the FIR dialect does not have a way to model locality specifiers
on the IR level. Instead, something similar to early/eager privatization in OpenMP
is done for the locality specifiers in `fir.do_loop` ops. Having locality specifiers
modelled in a way similar to delayed privatization (i.e. the `omp.private` op) and
reductions (i.e. the `omp.declare_reduction` op) would make mapping `do concurrent`
to OpenMP (and other parallel programming models) much easier.

Therefore, one way to approach this problem is to extract the TableGen records
for relevant OpenMP clauses in a shared dialect for "data environment management"
and use these shared records for OpenMP, `do concurrent`, and possibly OpenACC
as well.

#### Supporting reductions

Similar to locality specifiers, mapping reductions from `do concurrent` to OpenMP
is also still an open TODO. We can potentially extend the MLIR infrastructure
proposed in the previous section to share reduction records among the different
relevant dialects as well.
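
For illustration only (a made-up fragment assuming the Fortran 2023 `reduce`
locality specifier; `s`, `x`, and `n` are declared elsewhere), this is the kind
of loop the reduction TODO covers:

```fortran
s = 0.0
! `reduce(+:s)` would ideally be mapped to an OpenMP-style reduction, i.e.
! something modelled along the lines of `omp.declare_reduction`.
do concurrent (i = 1:n) reduce(+:s)
  s = s + x(i)
end do
```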

### More advanced detection of loop nests

As pointed out earlier, any intervening code between the headers of two nested
`do concurrent` loops prevents us from detecting them as a single loop nest. In
some cases this is overly conservative, so more flexible loop-nest detection
logic needs to be implemented.
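
For example (an illustrative fragment, not from the patch), the assignment
between the two headers below is intervening code, so the loops are currently
not detected as a single nest:

```fortran
do concurrent (i = 1:n)
  b(i) = 0.0                ! intervening statement between the two headers ...
  do concurrent (j = 1:m)   ! ... so this is not treated as one loop nest
    a(i, j) = b(i) + j
  end do
end do
```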

### Data-dependence analysis

Right now, we map loop nests without analysing whether such a mapping is safe.
We probably need to at least warn the user about loop nests that are unsafe to
parallelize due to loop-carried dependencies.
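
A minimal made-up example of the kind of unsafe loop such a warning would flag:

```fortran
! Iteration i reads a(i-1), which a different iteration writes, so mapping this
! loop to `parallel do` without analysis is unsafe (the code also violates
! DO CONCURRENT's own requirements on the programmer).
do concurrent (i = 2:n)
  a(i) = a(i-1) + 1.0
end do
```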

### Non-rectangular loop nests

So far, we have not needed the pass for non-rectangular loop nests. For
example:
```fortran
do concurrent(i=1:n)
  do concurrent(j=i:n)
    ...
  end do
end do
```
We defer this to the (hopefully) near future, when the conversion is in good
shape for the samples/projects at hand.

### Generalizing the pass to other parallel programming models

Once we have a stable and capable `do concurrent` to OpenMP mapping, we can take
this in a more generalized direction and allow the pass to target other models;
e.g. OpenACC. This goal should be kept in mind from the get-go even while only
targeting OpenMP.


## Upstreaming status

- [x] Command line options for `flang` and `bbc`.
- [x] Conversion pass skeleton (no transformations happen yet).
- [x] Status description and tracking document (this document).
- [ ] Basic host/CPU mapping support.
- [ ] Basic device/GPU mapping support.
- [ ] More advanced host and device support (expanded to multiple items as needed).
1 change: 1 addition & 0 deletions flang/docs/index.md
@@ -51,6 +51,7 @@ on how to get in touch with us and to learn more about the current status.
DebugGeneration
Directives
DoConcurrent
DoConcurrentConversionToOpenMP
Extensions
F202X
FIRArrayOperations
2 changes: 2 additions & 0 deletions flang/include/flang/Frontend/CodeGenOptions.def
@@ -42,5 +42,7 @@ ENUM_CODEGENOPT(DebugInfo, llvm::codegenoptions::DebugInfoKind, 4, llvm::codeg
ENUM_CODEGENOPT(VecLib, llvm::driver::VectorLibrary, 3, llvm::driver::VectorLibrary::NoLibrary) ///< Vector functions library to use
ENUM_CODEGENOPT(FramePointer, llvm::FramePointerKind, 2, llvm::FramePointerKind::None) ///< Enable the usage of frame pointers

ENUM_CODEGENOPT(DoConcurrentMapping, DoConcurrentMappingKind, 2, DoConcurrentMappingKind::DCMK_None) ///< Map `do concurrent` to OpenMP

#undef CODEGENOPT
#undef ENUM_CODEGENOPT
5 changes: 5 additions & 0 deletions flang/include/flang/Frontend/CodeGenOptions.h
@@ -15,6 +15,7 @@
#ifndef FORTRAN_FRONTEND_CODEGENOPTIONS_H
#define FORTRAN_FRONTEND_CODEGENOPTIONS_H

#include "flang/Optimizer/OpenMP/Utils.h"
#include "llvm/Frontend/Debug/Options.h"
#include "llvm/Frontend/Driver/CodeGenOptions.h"
#include "llvm/Support/CodeGen.h"
@@ -143,6 +144,10 @@ class CodeGenOptions : public CodeGenOptionsBase {
/// (-mlarge-data-threshold).
uint64_t LargeDataThreshold;

/// Optionally map `do concurrent` loops to OpenMP. This is only valid if
/// OpenMP is enabled.
using DoConcurrentMappingKind = flangomp::DoConcurrentMappingKind;

// Define accessors/mutators for code generation options of enumeration type.
#define CODEGENOPT(Name, Bits, Default)
#define ENUM_CODEGENOPT(Name, Type, Bits, Default) \
2 changes: 2 additions & 0 deletions flang/include/flang/Optimizer/OpenMP/Passes.h
@@ -13,6 +13,7 @@
#ifndef FORTRAN_OPTIMIZER_OPENMP_PASSES_H
#define FORTRAN_OPTIMIZER_OPENMP_PASSES_H

#include "flang/Optimizer/OpenMP/Utils.h"
#include "mlir/Dialect/Func/IR/FuncOps.h"
#include "mlir/IR/BuiltinOps.h"
#include "mlir/Pass/Pass.h"
@@ -30,6 +31,7 @@ namespace flangomp {
/// divided into units of work.
bool shouldUseWorkshareLowering(mlir::Operation *op);

std::unique_ptr<mlir::Pass> createDoConcurrentConversionPass(bool mapToDevice);
} // namespace flangomp

#endif // FORTRAN_OPTIMIZER_OPENMP_PASSES_H
30 changes: 30 additions & 0 deletions flang/include/flang/Optimizer/OpenMP/Passes.td
@@ -50,6 +50,36 @@ def FunctionFilteringPass : Pass<"omp-function-filtering"> {
];
}

def DoConcurrentConversionPass : Pass<"omp-do-concurrent-conversion", "mlir::func::FuncOp"> {
let summary = "Map `DO CONCURRENT` loops to OpenMP worksharing loops.";

let description = [{ This is an experimental pass to map `DO CONCURRENT` loops
to their corresponding OpenMP worksharing constructs.

For now the following is supported:
- Mapping simple loops to `parallel do`.

Still TODO:
- More extensive testing.
}];

let dependentDialects = ["mlir::omp::OpenMPDialect"];

let options = [
Option<"mapTo", "map-to",
"flangomp::DoConcurrentMappingKind",
/*default=*/"flangomp::DoConcurrentMappingKind::DCMK_None",
"Try to map `do concurrent` loops to OpenMP [none|host|device]",
[{::llvm::cl::values(
clEnumValN(flangomp::DoConcurrentMappingKind::DCMK_None,
"none", "Do not lower `do concurrent` to OpenMP"),
clEnumValN(flangomp::DoConcurrentMappingKind::DCMK_Host,
"host", "Lower to run in parallel on the CPU"),
clEnumValN(flangomp::DoConcurrentMappingKind::DCMK_Device,
"device", "Lower to run in parallel on the GPU")
)}]>,
];
}

// Needs to be scheduled on Module as we create functions in it
def LowerWorkshare : Pass<"lower-workshare", "::mlir::ModuleOp"> {
26 changes: 26 additions & 0 deletions flang/include/flang/Optimizer/OpenMP/Utils.h
@@ -0,0 +1,26 @@
//===-- Optimizer/OpenMP/Utils.h --------------------------------*- C++ -*-===//
//
// Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.
// See https://llvm.org/LICENSE.txt for license information.
// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception
//
//===----------------------------------------------------------------------===//
//
// Coding style: https://mlir.llvm.org/getting_started/DeveloperGuide/
//
//===----------------------------------------------------------------------===//

#ifndef FORTRAN_OPTIMIZER_OPENMP_UTILS_H
#define FORTRAN_OPTIMIZER_OPENMP_UTILS_H

namespace flangomp {

enum class DoConcurrentMappingKind {
DCMK_None, ///< Do not lower `do concurrent` to OpenMP.
DCMK_Host, ///< Lower to run in parallel on the CPU.
DCMK_Device ///< Lower to run in parallel on the GPU.
};

} // namespace flangomp

#endif // FORTRAN_OPTIMIZER_OPENMP_UTILS_H
18 changes: 15 additions & 3 deletions flang/include/flang/Optimizer/Passes/Pipelines.h
@@ -128,16 +128,28 @@ void createHLFIRToFIRPassPipeline(
mlir::PassManager &pm, bool enableOpenMP,
llvm::OptimizationLevel optLevel = defaultOptLevel);

struct OpenMPFIRPassPipelineOpts {
/// Whether code is being generated for a target device rather than the host
/// device
bool isTargetDevice;

/// Controls how to map `do concurrent` loops; to device, host, or none at
/// all.
Fortran::frontend::CodeGenOptions::DoConcurrentMappingKind
doConcurrentMappingKind;
};

/// Create a pass pipeline for handling certain OpenMP transformations needed
/// prior to FIR lowering.
///
/// WARNING: These passes must be run immediately after the lowering to ensure
/// that the FIR is correct with respect to OpenMP operations/attributes.
///
/// \param pm - MLIR pass manager that will hold the pipeline definition.
/// \param isTargetDevice - Whether code is being generated for a target device
/// rather than the host device.
void createOpenMPFIRPassPipeline(mlir::PassManager &pm, bool isTargetDevice);
/// \param opts - options to control OpenMP code-gen; see struct docs for more
/// details.
void createOpenMPFIRPassPipeline(mlir::PassManager &pm,
OpenMPFIRPassPipelineOpts opts);

#if !defined(FLANG_EXCLUDE_CODEGEN)
void createDebugPasses(mlir::PassManager &pm,
28 changes: 28 additions & 0 deletions flang/lib/Frontend/CompilerInvocation.cpp
@@ -158,6 +158,32 @@ static bool parseDebugArgs(Fortran::frontend::CodeGenOptions &opts,
return true;
}

static void parseDoConcurrentMapping(Fortran::frontend::CodeGenOptions &opts,
llvm::opt::ArgList &args,
clang::DiagnosticsEngine &diags) {
llvm::opt::Arg *arg =
args.getLastArg(clang::driver::options::OPT_fdo_concurrent_to_openmp_EQ);
if (!arg)
return;

using DoConcurrentMappingKind =
Fortran::frontend::CodeGenOptions::DoConcurrentMappingKind;
std::optional<DoConcurrentMappingKind> val =
llvm::StringSwitch<std::optional<DoConcurrentMappingKind>>(
arg->getValue())
.Case("none", DoConcurrentMappingKind::DCMK_None)
.Case("host", DoConcurrentMappingKind::DCMK_Host)
.Case("device", DoConcurrentMappingKind::DCMK_Device)
.Default(std::nullopt);

if (!val.has_value()) {
diags.Report(clang::diag::err_drv_invalid_value)
<< arg->getAsString(args) << arg->getValue();
// Bail out so we never call val.value() on an empty optional below.
return;
}

opts.setDoConcurrentMapping(val.value());
}

static bool parseVectorLibArg(Fortran::frontend::CodeGenOptions &opts,
llvm::opt::ArgList &args,
clang::DiagnosticsEngine &diags) {
@@ -430,6 +456,8 @@ static void parseCodeGenArgs(Fortran::frontend::CodeGenOptions &opts,
clang::driver::options::OPT_funderscoring, false)) {
opts.Underscoring = 0;
}

parseDoConcurrentMapping(opts, args, diags);
}

/// Parses all target input arguments and populates the target