
[Clang][OpenCL][AMDGPU] Allow a kernel to call another kernel #115821

Merged

Conversation

Contributor

@lalaniket8 lalaniket8 commented Nov 12, 2024

Calling one kernel from another is currently not supported in the compiler.
To support it, we emit a stub version of each kernel function body under a
different name mangling scheme, and replace the respective kernel call sites
appropriately.

Fixes #60313

D120566 was an earlier attempt to upstream a solution for this issue.
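
For illustration, a minimal sketch of what this enables (the kernel and stub names follow the tests added in this patch; the lowering described in the comments is paraphrased, not verbatim compiler output):

kernel void callee_kern(global int *A) {
  *A = 1;
}

kernel void caller_kern(global int *A) {
  // On the AMDGPU target this call is now lowered to a call to the stub
  // @__clang_ocl_kern_imp_callee_kern instead of the amdgpu_kernel entry
  // point @callee_kern, which cannot be called directly.
  callee_kern(A);
}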

Contributor Author

lalaniket8 commented Nov 12, 2024

This stack of pull requests is managed by Graphite. Learn more about stacking.


Thank you for submitting a Pull Request (PR) to the LLVM Project!

This PR will be automatically labeled and the relevant teams will be notified.

If you wish to, you can add reviewers by using the "Reviewers" section on this page.

If this is not working for you, it is probably because you do not have write permissions for the repository. In which case you can instead tag reviewers by name in a comment by using @ followed by their GitHub username.

If you have received no comments on your PR for a week, you can request a review by "ping"ing the PR by adding a comment “Ping”. The common courtesy "ping" rate is once a week. Please remember that you are asking for valuable time from other developers.

If you have further questions, they may be answered by the LLVM GitHub User Guide.

You can also ask questions in a comment on this PR, on the LLVM Discord or on the forums.

@lalaniket8 lalaniket8 force-pushed the users/lalaniket8/emit-device-version-of-openCL-kernel branch 4 times, most recently from 9449cea to f8265de on November 13, 2024 10:15
@lalaniket8 lalaniket8 force-pushed the users/lalaniket8/emit-device-version-of-openCL-kernel branch from f8265de to 61c9004 on November 25, 2024 09:07
Contributor

@jhuber6 jhuber6 left a comment

So, the kernel metadata has a lot of special codegen associated with it. It seems the approach here is to turn the kernels into thin wrappers that call an outlined function?

@shiltian
Contributor

Yes, that's the idea. That way, we will not have any function call to an amdgpu_kernel.

@lalaniket8 lalaniket8 force-pushed the users/lalaniket8/emit-device-version-of-openCL-kernel branch from 61c9004 to 00167a3 on November 27, 2024 08:38
@lalaniket8 lalaniket8 marked this pull request as ready for review November 27, 2024 08:40
@llvmbot llvmbot added the clang, clang:frontend and clang:codegen labels Nov 27, 2024
@llvmbot
Member

llvmbot commented Nov 27, 2024

@llvm/pr-subscribers-backend-amdgpu
@llvm/pr-subscribers-backend-x86

@llvm/pr-subscribers-clang

Author: Aniket Lal (lalaniket8)

Changes

OpenCL allows a kernel function to call another kernel function.
To support this, we emit a stub version of each kernel function
under a different name mangling scheme and replace the kernel
call sites appropriately.

Fixes #60313


Patch is 23.29 KiB, truncated to 20.00 KiB below, full version: https://github.com/llvm/llvm-project/pull/115821.diff

15 Files Affected:

  • (modified) clang/include/clang/AST/GlobalDecl.h (+26-11)
  • (modified) clang/lib/AST/Expr.cpp (+2-1)
  • (modified) clang/lib/AST/ItaniumMangle.cpp (+15)
  • (modified) clang/lib/AST/Mangle.cpp (+1-1)
  • (modified) clang/lib/AST/MicrosoftMangle.cpp (+6)
  • (modified) clang/lib/CodeGen/CGBlocks.cpp (+10-6)
  • (modified) clang/lib/CodeGen/CGCall.cpp (+9-2)
  • (modified) clang/lib/CodeGen/CGExpr.cpp (+4-1)
  • (modified) clang/lib/CodeGen/CGOpenCLRuntime.cpp (+9-2)
  • (modified) clang/lib/CodeGen/CGOpenCLRuntime.h (+3-1)
  • (modified) clang/lib/CodeGen/CodeGenModule.cpp (+7)
  • (added) clang/test/CodeGenOpenCL/opencl-kernel-call.cl (+43)
  • (modified) clang/test/CodeGenOpenCL/reflect.cl (+1-1)
  • (modified) clang/test/CodeGenOpenCL/spir-calling-conv.cl (+2-2)
  • (modified) clang/test/CodeGenOpenCL/visibility.cl (+40-13)
diff --git a/clang/include/clang/AST/GlobalDecl.h b/clang/include/clang/AST/GlobalDecl.h
index 386693cabb1fbb..8a9f4b4c60e5e5 100644
--- a/clang/include/clang/AST/GlobalDecl.h
+++ b/clang/include/clang/AST/GlobalDecl.h
@@ -71,6 +71,10 @@ class GlobalDecl {
   GlobalDecl(const FunctionDecl *D, unsigned MVIndex = 0)
       : MultiVersionIndex(MVIndex) {
     if (!D->hasAttr<CUDAGlobalAttr>()) {
+      if (D->hasAttr<OpenCLKernelAttr>()) {
+        Value.setPointerAndInt(D, unsigned(KernelReferenceKind::Kernel));
+        return;
+      }
       Init(D);
       return;
     }
@@ -78,7 +82,8 @@ class GlobalDecl {
   }
   GlobalDecl(const FunctionDecl *D, KernelReferenceKind Kind)
       : Value(D, unsigned(Kind)) {
-    assert(D->hasAttr<CUDAGlobalAttr>() && "Decl is not a GPU kernel!");
+    assert((D->hasAttr<CUDAGlobalAttr>() && "Decl is not a GPU kernel!") ||
+           (D->hasAttr<OpenCLKernelAttr>() && "Decl is not a OpenCL kernel!"));
   }
   GlobalDecl(const NamedDecl *D) { Init(D); }
   GlobalDecl(const BlockDecl *D) { Init(D); }
@@ -130,13 +135,15 @@ class GlobalDecl {
   }
 
   KernelReferenceKind getKernelReferenceKind() const {
-    assert(((isa<FunctionDecl>(getDecl()) &&
-             cast<FunctionDecl>(getDecl())->hasAttr<CUDAGlobalAttr>()) ||
-            (isa<FunctionTemplateDecl>(getDecl()) &&
-             cast<FunctionTemplateDecl>(getDecl())
-                 ->getTemplatedDecl()
-                 ->hasAttr<CUDAGlobalAttr>())) &&
-           "Decl is not a GPU kernel!");
+    assert((((isa<FunctionDecl>(getDecl()) &&
+              cast<FunctionDecl>(getDecl())->hasAttr<CUDAGlobalAttr>()) ||
+             (isa<FunctionTemplateDecl>(getDecl()) &&
+              cast<FunctionTemplateDecl>(getDecl())
+                  ->getTemplatedDecl()
+                  ->hasAttr<CUDAGlobalAttr>())) &&
+            "Decl is not a GPU kernel!") ||
+           (isDeclOpenCLKernel() && "Decl is not a OpenCL kernel!"));
+
     return static_cast<KernelReferenceKind>(Value.getInt());
   }
 
@@ -196,13 +203,21 @@ class GlobalDecl {
   }
 
   GlobalDecl getWithKernelReferenceKind(KernelReferenceKind Kind) {
-    assert(isa<FunctionDecl>(getDecl()) &&
-           cast<FunctionDecl>(getDecl())->hasAttr<CUDAGlobalAttr>() &&
-           "Decl is not a GPU kernel!");
+    assert((isa<FunctionDecl>(getDecl()) &&
+            cast<FunctionDecl>(getDecl())->hasAttr<CUDAGlobalAttr>() &&
+            "Decl is not a GPU kernel!") ||
+           (isDeclOpenCLKernel() && "Decl is not a OpenCL kernel!"));
     GlobalDecl Result(*this);
     Result.Value.setInt(unsigned(Kind));
     return Result;
   }
+
+  bool isDeclOpenCLKernel() const {
+    auto FD = dyn_cast<FunctionDecl>(getDecl());
+    if (FD)
+      return FD->hasAttr<OpenCLKernelAttr>();
+    return FD;
+  }
 };
 
 } // namespace clang
diff --git a/clang/lib/AST/Expr.cpp b/clang/lib/AST/Expr.cpp
index a4fb4d5a1f2ec4..ddd88ac6a21050 100644
--- a/clang/lib/AST/Expr.cpp
+++ b/clang/lib/AST/Expr.cpp
@@ -693,7 +693,8 @@ std::string PredefinedExpr::ComputeName(PredefinedIdentKind IK,
           GD = GlobalDecl(CD, Ctor_Base);
         else if (const CXXDestructorDecl *DD = dyn_cast<CXXDestructorDecl>(ND))
           GD = GlobalDecl(DD, Dtor_Base);
-        else if (ND->hasAttr<CUDAGlobalAttr>())
+        else if (ND->hasAttr<CUDAGlobalAttr>() ||
+                 ND->hasAttr<OpenCLKernelAttr>())
           GD = GlobalDecl(cast<FunctionDecl>(ND));
         else
           GD = GlobalDecl(ND);
diff --git a/clang/lib/AST/ItaniumMangle.cpp b/clang/lib/AST/ItaniumMangle.cpp
index 27a993a631dae9..7e46d6c520fb69 100644
--- a/clang/lib/AST/ItaniumMangle.cpp
+++ b/clang/lib/AST/ItaniumMangle.cpp
@@ -526,6 +526,7 @@ class CXXNameMangler {
   void mangleSourceName(const IdentifierInfo *II);
   void mangleRegCallName(const IdentifierInfo *II);
   void mangleDeviceStubName(const IdentifierInfo *II);
+  void mangleOCLDeviceStubName(const IdentifierInfo *II);
   void mangleSourceNameWithAbiTags(
       const NamedDecl *ND, const AbiTagList *AdditionalAbiTags = nullptr);
   void mangleLocalName(GlobalDecl GD,
@@ -1561,8 +1562,13 @@ void CXXNameMangler::mangleUnqualifiedName(
       bool IsDeviceStub =
           FD && FD->hasAttr<CUDAGlobalAttr>() &&
           GD.getKernelReferenceKind() == KernelReferenceKind::Stub;
+      bool IsOCLDeviceStub =
+          FD && FD->hasAttr<OpenCLKernelAttr>() &&
+          GD.getKernelReferenceKind() == KernelReferenceKind::Stub;
       if (IsDeviceStub)
         mangleDeviceStubName(II);
+      else if (IsOCLDeviceStub)
+        mangleOCLDeviceStubName(II);
       else if (IsRegCall)
         mangleRegCallName(II);
       else
@@ -1780,6 +1786,15 @@ void CXXNameMangler::mangleDeviceStubName(const IdentifierInfo *II) {
       << II->getName();
 }
 
+void CXXNameMangler::mangleOCLDeviceStubName(const IdentifierInfo *II) {
+  // <source-name> ::= <positive length number> __clang_ocl_kern_imp_
+  // <identifier> <number> ::= [n] <non-negative decimal integer> <identifier>
+  // ::= <unqualified source code identifier>
+  StringRef OCLDeviceStubNamePrefix = "__clang_ocl_kern_imp_";
+  Out << II->getLength() + OCLDeviceStubNamePrefix.size() - 1
+      << OCLDeviceStubNamePrefix << II->getName();
+}
+
 void CXXNameMangler::mangleSourceName(const IdentifierInfo *II) {
   // <source-name> ::= <positive length number> <identifier>
   // <number> ::= [n] <non-negative decimal integer>
diff --git a/clang/lib/AST/Mangle.cpp b/clang/lib/AST/Mangle.cpp
index 15be9c62bf8880..1d1c4dd0e39b7a 100644
--- a/clang/lib/AST/Mangle.cpp
+++ b/clang/lib/AST/Mangle.cpp
@@ -539,7 +539,7 @@ class ASTNameGenerator::Implementation {
         GD = GlobalDecl(CtorD, Ctor_Complete);
       else if (const auto *DtorD = dyn_cast<CXXDestructorDecl>(D))
         GD = GlobalDecl(DtorD, Dtor_Complete);
-      else if (D->hasAttr<CUDAGlobalAttr>())
+      else if (D->hasAttr<CUDAGlobalAttr>() || D->hasAttr<OpenCLKernelAttr>())
         GD = GlobalDecl(cast<FunctionDecl>(D));
       else
         GD = GlobalDecl(D);
diff --git a/clang/lib/AST/MicrosoftMangle.cpp b/clang/lib/AST/MicrosoftMangle.cpp
index 94a7ce6c1321d3..e439875a2538ba 100644
--- a/clang/lib/AST/MicrosoftMangle.cpp
+++ b/clang/lib/AST/MicrosoftMangle.cpp
@@ -1161,9 +1161,15 @@ void MicrosoftCXXNameMangler::mangleUnqualifiedName(GlobalDecl GD,
                   ->getTemplatedDecl()
                   ->hasAttr<CUDAGlobalAttr>())) &&
             GD.getKernelReferenceKind() == KernelReferenceKind::Stub;
+        bool IsOCLDeviceStub =
+            ND && (isa<FunctionDecl>(ND) && ND->hasAttr<OpenCLKernelAttr>()) &&
+            GD.getKernelReferenceKind() == KernelReferenceKind::Stub;
         if (IsDeviceStub)
           mangleSourceName(
               (llvm::Twine("__device_stub__") + II->getName()).str());
+        else if (IsOCLDeviceStub)
+          mangleSourceName(
+              (llvm::Twine("__clang_ocl_kern_imp_") + II->getName()).str());
         else
           mangleSourceName(II->getName());
         break;
diff --git a/clang/lib/CodeGen/CGBlocks.cpp b/clang/lib/CodeGen/CGBlocks.cpp
index a7584a95c8ca7b..7f98a897c36907 100644
--- a/clang/lib/CodeGen/CGBlocks.cpp
+++ b/clang/lib/CodeGen/CGBlocks.cpp
@@ -48,7 +48,7 @@ CGBlockInfo::CGBlockInfo(const BlockDecl *block, StringRef name)
 BlockByrefHelpers::~BlockByrefHelpers() {}
 
 /// Build the given block as a global block.
-static llvm::Constant *buildGlobalBlock(CodeGenModule &CGM,
+static llvm::Constant *buildGlobalBlock(CodeGenModule &CGM, GlobalDecl GD,
                                         const CGBlockInfo &blockInfo,
                                         llvm::Constant *blockFn);
 
@@ -1085,8 +1085,10 @@ llvm::Value *CodeGenFunction::EmitBlockLiteral(const CGBlockInfo &blockInfo) {
       blockAddr.getPointer(), ConvertType(blockInfo.getBlockExpr()->getType()));
 
   if (IsOpenCL) {
-    CGM.getOpenCLRuntime().recordBlockInfo(blockInfo.BlockExpression, InvokeFn,
-                                           result, blockInfo.StructureType);
+    CGM.getOpenCLRuntime().recordBlockInfo(
+        blockInfo.BlockExpression, InvokeFn, result, blockInfo.StructureType,
+        CurGD && CurGD.isDeclOpenCLKernel() &&
+            (CurGD.getKernelReferenceKind() == KernelReferenceKind::Kernel));
   }
 
   return result;
@@ -1285,7 +1287,7 @@ CodeGenModule::GetAddrOfGlobalBlock(const BlockExpr *BE,
   return getAddrOfGlobalBlockIfEmitted(BE);
 }
 
-static llvm::Constant *buildGlobalBlock(CodeGenModule &CGM,
+static llvm::Constant *buildGlobalBlock(CodeGenModule &CGM, GlobalDecl GD,
                                         const CGBlockInfo &blockInfo,
                                         llvm::Constant *blockFn) {
   assert(blockInfo.CanBeGlobal);
@@ -1378,7 +1380,9 @@ static llvm::Constant *buildGlobalBlock(CodeGenModule &CGM,
     CGM.getOpenCLRuntime().recordBlockInfo(
         blockInfo.BlockExpression,
         cast<llvm::Function>(blockFn->stripPointerCasts()), Result,
-        literal->getValueType());
+        literal->getValueType(),
+        GD && GD.isDeclOpenCLKernel() &&
+            (GD.getKernelReferenceKind() == KernelReferenceKind::Kernel));
   return Result;
 }
 
@@ -1487,7 +1491,7 @@ llvm::Function *CodeGenFunction::GenerateBlockFunction(
     auto GenVoidPtrTy = getContext().getLangOpts().OpenCL
                             ? CGM.getOpenCLRuntime().getGenericVoidPointerType()
                             : VoidPtrTy;
-    buildGlobalBlock(CGM, blockInfo,
+    buildGlobalBlock(CGM, CurGD, blockInfo,
                      llvm::ConstantExpr::getPointerCast(fn, GenVoidPtrTy));
   }
 
diff --git a/clang/lib/CodeGen/CGCall.cpp b/clang/lib/CodeGen/CGCall.cpp
index 20455dbb820914..215fe5bd9da8dd 100644
--- a/clang/lib/CodeGen/CGCall.cpp
+++ b/clang/lib/CodeGen/CGCall.cpp
@@ -2343,6 +2343,15 @@ void CodeGenModule::ConstructAttributeList(StringRef Name,
   // Collect function IR attributes from the CC lowering.
   // We'll collect the paramete and result attributes later.
   CallingConv = FI.getEffectiveCallingConvention();
+  GlobalDecl GD = CalleeInfo.getCalleeDecl();
+  const Decl *TargetDecl = CalleeInfo.getCalleeDecl().getDecl();
+  if (TargetDecl) {
+    if (auto FD = dyn_cast<FunctionDecl>(TargetDecl)) {
+      if (FD->hasAttr<OpenCLKernelAttr>() &&
+          GD.getKernelReferenceKind() == KernelReferenceKind::Stub)
+        CallingConv = llvm::CallingConv::C;
+    }
+  }
   if (FI.isNoReturn())
     FuncAttrs.addAttribute(llvm::Attribute::NoReturn);
   if (FI.isCmseNSCall())
@@ -2352,8 +2361,6 @@ void CodeGenModule::ConstructAttributeList(StringRef Name,
   AddAttributesFromFunctionProtoType(getContext(), FuncAttrs,
                                      CalleeInfo.getCalleeFunctionProtoType());
 
-  const Decl *TargetDecl = CalleeInfo.getCalleeDecl().getDecl();
-
   // Attach assumption attributes to the declaration. If this is a call
   // site, attach assumptions from the caller to the call as well.
   AddAttributesFromOMPAssumes(FuncAttrs, TargetDecl);
diff --git a/clang/lib/CodeGen/CGExpr.cpp b/clang/lib/CodeGen/CGExpr.cpp
index 229f0e29f02341..e2222a6393b9b1 100644
--- a/clang/lib/CodeGen/CGExpr.cpp
+++ b/clang/lib/CodeGen/CGExpr.cpp
@@ -5692,7 +5692,10 @@ CGCallee CodeGenFunction::EmitCallee(const Expr *E) {
   // Resolve direct calls.
   } else if (auto DRE = dyn_cast<DeclRefExpr>(E)) {
     if (auto FD = dyn_cast<FunctionDecl>(DRE->getDecl())) {
-      return EmitDirectCallee(*this, FD);
+      auto CalleeDecl = FD->hasAttr<OpenCLKernelAttr>()
+                            ? GlobalDecl(FD, KernelReferenceKind::Stub)
+                            : FD;
+      return EmitDirectCallee(*this, CalleeDecl);
     }
   } else if (auto ME = dyn_cast<MemberExpr>(E)) {
     if (auto FD = dyn_cast<FunctionDecl>(ME->getMemberDecl())) {
diff --git a/clang/lib/CodeGen/CGOpenCLRuntime.cpp b/clang/lib/CodeGen/CGOpenCLRuntime.cpp
index 115b618056a445..a78d783831293e 100644
--- a/clang/lib/CodeGen/CGOpenCLRuntime.cpp
+++ b/clang/lib/CodeGen/CGOpenCLRuntime.cpp
@@ -126,14 +126,21 @@ static const BlockExpr *getBlockExpr(const Expr *E) {
 /// corresponding block expression.
 void CGOpenCLRuntime::recordBlockInfo(const BlockExpr *E,
                                       llvm::Function *InvokeF,
-                                      llvm::Value *Block, llvm::Type *BlockTy) {
-  assert(!EnqueuedBlockMap.contains(E) && "Block expression emitted twice");
+                                      llvm::Value *Block, llvm::Type *BlockTy,
+                                      bool isBlkExprInOCLKern) {
+
+  // FIXME: Since OpenCL Kernels are emitted twice (kernel version and stub
+  // version), its constituent BlockExpr will also be emitted twice.
+  assert((!EnqueuedBlockMap.contains(E) ||
+          EnqueuedBlockMap[E].isBlkExprInOCLKern != isBlkExprInOCLKern) &&
+         "Block expression emitted twice");
   assert(isa<llvm::Function>(InvokeF) && "Invalid invoke function");
   assert(Block->getType()->isPointerTy() && "Invalid block literal type");
   EnqueuedBlockMap[E].InvokeFunc = InvokeF;
   EnqueuedBlockMap[E].BlockArg = Block;
   EnqueuedBlockMap[E].BlockTy = BlockTy;
   EnqueuedBlockMap[E].KernelHandle = nullptr;
+  EnqueuedBlockMap[E].isBlkExprInOCLKern = isBlkExprInOCLKern;
 }
 
 llvm::Function *CGOpenCLRuntime::getInvokeFunction(const Expr *E) {
diff --git a/clang/lib/CodeGen/CGOpenCLRuntime.h b/clang/lib/CodeGen/CGOpenCLRuntime.h
index 34613c3516f374..78bb5980cd87dc 100644
--- a/clang/lib/CodeGen/CGOpenCLRuntime.h
+++ b/clang/lib/CodeGen/CGOpenCLRuntime.h
@@ -46,6 +46,7 @@ class CGOpenCLRuntime {
     llvm::Value *KernelHandle;  /// Enqueued block kernel reference.
     llvm::Value *BlockArg;      /// The first argument to enqueued block kernel.
     llvm::Type *BlockTy;        /// Type of the block argument.
+    bool isBlkExprInOCLKern; /// Does the BlockExpr reside in an OpenCL Kernel.
   };
   /// Maps block expression to block information.
   llvm::DenseMap<const Expr *, EnqueuedBlockInfo> EnqueuedBlockMap;
@@ -93,7 +94,8 @@ class CGOpenCLRuntime {
   /// \param InvokeF invoke function emitted for the block expression.
   /// \param Block block literal emitted for the block expression.
   void recordBlockInfo(const BlockExpr *E, llvm::Function *InvokeF,
-                       llvm::Value *Block, llvm::Type *BlockTy);
+                       llvm::Value *Block, llvm::Type *BlockTy,
+                       bool isBlkExprInOCLKern);
 
   /// \return LLVM block invoke function emitted for an expression derived from
   /// the block expression.
diff --git a/clang/lib/CodeGen/CodeGenModule.cpp b/clang/lib/CodeGen/CodeGenModule.cpp
index 716c43431667cc..df196479994e4b 100644
--- a/clang/lib/CodeGen/CodeGenModule.cpp
+++ b/clang/lib/CodeGen/CodeGenModule.cpp
@@ -1887,6 +1887,9 @@ static std::string getMangledNameImpl(CodeGenModule &CGM, GlobalDecl GD,
     } else if (FD && FD->hasAttr<CUDAGlobalAttr>() &&
                GD.getKernelReferenceKind() == KernelReferenceKind::Stub) {
       Out << "__device_stub__" << II->getName();
+    } else if (FD && FD->hasAttr<OpenCLKernelAttr>() &&
+               GD.getKernelReferenceKind() == KernelReferenceKind::Stub) {
+      Out << "__clang_ocl_kern_imp_" << II->getName();
     } else {
       Out << II->getName();
     }
@@ -3850,6 +3853,10 @@ void CodeGenModule::EmitGlobal(GlobalDecl GD) {
 
   // Ignore declarations, they will be emitted on their first use.
   if (const auto *FD = dyn_cast<FunctionDecl>(Global)) {
+
+    if (FD->hasAttr<OpenCLKernelAttr>() && FD->doesThisDeclarationHaveABody())
+      addDeferredDeclToEmit(GlobalDecl(FD, KernelReferenceKind::Stub));
+
     // Update deferred annotations with the latest declaration if the function
     // function was already used or defined.
     if (FD->hasAttr<AnnotateAttr>()) {
diff --git a/clang/test/CodeGenOpenCL/opencl-kernel-call.cl b/clang/test/CodeGenOpenCL/opencl-kernel-call.cl
new file mode 100644
index 00000000000000..f575728f237630
--- /dev/null
+++ b/clang/test/CodeGenOpenCL/opencl-kernel-call.cl
@@ -0,0 +1,43 @@
+// RUN: %clang_cc1 -triple amdgcn-unknown-unknown -emit-llvm -o - %s | FileCheck %s
+
+// CHECK: define dso_local amdgpu_kernel void @callee_kern({{.*}})
+__attribute__((noinline)) kernel void callee_kern(global int *A){
+  *A = 1;
+}
+
+__attribute__((noinline)) kernel void ext_callee_kern(global int *A);
+
+// CHECK: define dso_local void @callee_func({{.*}})
+__attribute__((noinline)) void callee_func(global int *A){
+  *A = 2;
+}
+
+// CHECK: define dso_local amdgpu_kernel void @caller_kern({{.*}})
+kernel void caller_kern(global int* A){
+  callee_kern(A);
+  // CHECK: tail call void @__clang_ocl_kern_imp_callee_kern({{.*}})
+  ext_callee_kern(A);
+  // CHECK: tail call void @__clang_ocl_kern_imp_ext_callee_kern({{.*}})
+  callee_func(A);
+  // CHECK: tail call void @callee_func({{.*}})
+
+}
+
+// CHECK: define dso_local void @__clang_ocl_kern_imp_callee_kern({{.*}})
+
+// CHECK: declare void @__clang_ocl_kern_imp_ext_callee_kern({{.*}})
+
+// CHECK: define dso_local void @caller_func({{.*}})
+void caller_func(global int* A){
+  callee_kern(A);
+  // CHECK: tail call void @__clang_ocl_kern_imp_callee_kern({{.*}}) #7
+  ext_callee_kern(A);
+  // CHECK: tail call void @__clang_ocl_kern_imp_ext_callee_kern({{.*}}) #8
+  callee_func(A);
+  // CHECK: tail call void @callee_func({{.*}})
+}
+
+// CHECK: define dso_local void @__clang_ocl_kern_imp_caller_kern({{.*}}) 
+// CHECK: tail call void @__clang_ocl_kern_imp_callee_kern({{.*}}) 
+// CHECK: tail call void @__clang_ocl_kern_imp_ext_callee_kern({{.*}}) 
+// CHECK: tail call void @callee_func({{.*}})
diff --git a/clang/test/CodeGenOpenCL/reflect.cl b/clang/test/CodeGenOpenCL/reflect.cl
index 9ae4a5f027d358..0e3e50be9745eb 100644
--- a/clang/test/CodeGenOpenCL/reflect.cl
+++ b/clang/test/CodeGenOpenCL/reflect.cl
@@ -13,7 +13,7 @@ bool device_function() {
 }
 
 // CHECK-LABEL: define dso_local spir_kernel void @kernel_function(
-// CHECK-SAME: ptr addrspace(1) noundef align 4 [[I:%.*]]) #[[ATTR2:[0-9]+]] !kernel_arg_addr_space !4 !kernel_arg_access_qual !5 !kernel_arg_type !6 !kernel_arg_base_type !6 !kernel_arg_type_qual !7 {
+// CHECK-SAME: ptr addrspace(1) noundef align 4 [[I:%.*]]) #[[ATTR2:[0-9]+]] !kernel_arg_addr_space !5 !kernel_arg_access_qual !6 !kernel_arg_type !7 !kernel_arg_base_type !7 !kernel_arg_type_qual !8 {
 // CHECK-NEXT:  entry:
 // CHECK-NEXT:    [[I_ADDR:%.*]] = alloca ptr addrspace(1), align 4
 // CHECK-NEXT:    store ptr addrspace(1) [[I]], ptr [[I_ADDR]], align 4
diff --git a/clang/test/CodeGenOpenCL/spir-calling-conv.cl b/clang/test/CodeGenOpenCL/spir-calling-conv.cl
index 569ea0cbe1af60..6c8f20511b8bd6 100644
--- a/clang/test/CodeGenOpenCL/spir-calling-conv.cl
+++ b/clang/test/CodeGenOpenCL/spir-calling-conv.cl
@@ -11,8 +11,8 @@ kernel void foo(global int *A)
   // CHECK: %{{[a-z0-9_]+}} = tail call spir_func i32 @get_dummy_id(i32 noundef 0)
   A[id] = id;
   bar(A);
-  // CHECK: tail call spir_kernel void @bar(ptr addrspace(1) noundef align 4 %A)
+  // CHECK: tail call void @__clang_ocl_kern_imp_bar(ptr addrspace(1) noundef align 4 %A)
 }
 
 // CHECK: declare spir_func i32 @get_dummy_id(i32 noundef)
-// CHECK: declare spir_kernel void @bar(ptr addrspace(1) noundef align 4)
+// CHECK: declare void @__clang_ocl_kern_imp_bar(ptr addrspace(1) noundef align 4)
diff --git a/clang/test/CodeGenOpenCL/visibility.cl b/clang/test/CodeGenOpenCL/visibility.cl
index addfe33377f939..e5dc5b29c5140b 100644
--- a/clang/test/CodeGenOpenCL/visibility.cl
+++ b/clang/test/CodeGenOpenCL/visibility.cl
@@ -85,31 +85,42 @@ __attribute__((visibility("default"))) extern void ext_func_default();
 void use() {
     glob = ext + ext_hidden + ext_protected + ext_default;
     ext_kern();
+    // FVIS-DEFAULT: tail call void @__clang_ocl_kern_imp_ext_kern()
+    // FVIS-PROTECTED: tail call void @__clang_ocl_kern_imp_ext_kern()
+    // FVIS-HIDDEN: tail call void @__clang_ocl_kern_imp_ext_kern()
     ext_kern_hidden();
+    // FVIS-DEFAULT: tail call void @__clang_ocl_kern_imp_ext_kern_hidden()
+    // FVIS-PROTECTED: tail call void @__clang_ocl_kern_imp_ext_kern_hidden()
+    // FVIS-HIDDEN: tail call void @__clang_ocl_kern_imp_ext_kern_hidden()
     ext_kern_protected();
+    // FVIS-DEFAULT: tail call void @__clang_ocl_kern_imp_ext_kern_protected()
+    // FVIS-PROTECTED: tail call void @__clang_ocl_kern_imp_ext_kern_protected()
+    // FVIS-HIDDEN: tail call voi...
[truncated]

@llvmbot
Member

llvmbot commented Nov 27, 2024

@llvm/pr-subscribers-clang-codegen


@lalaniket8 lalaniket8 changed the title [Clang] Emit stub version of OpenCL Kernel [Clang][OpenCL][AMDGPU] Allow a kernel to call another kernel Nov 27, 2024
// <source-name> ::= <positive length number> __clang_ocl_kern_imp_
// <identifier> <number> ::= [n] <non-negative decimal integer> <identifier>
// ::= <unqualified source code identifier>
StringRef OCLDeviceStubNamePrefix = "__clang_ocl_kern_imp_";
Contributor

I'd suggest using some symbols that a regular user-defined function can't be named with, to avoid conflicts.

Contributor Author
@lalaniket8 lalaniket8 Dec 2, 2024

Names starting with a double underscore are reserved, so they shouldn't conflict with user-defined names.

Contributor

But you still can't prevent people from doing it, right?

Contributor

Well, it's undefined behavior if they do, the same that you can technically define a function in C called _Z3foov with a completely different signature from what C++ would expect. This sort of thing is generally fine as long as we make some effort to avoid unintentional conflicts.

Contributor

In terms of how the symbols should behave, this might be backwards. Alternatively, we could preserve the as-is name for the callable implicit function and add some mangling to the kernel entry point. However, I bet there's a lot more code expecting the current behavior, where the kernel name is used directly as the kernel symbol, so maybe changing that is best left for later.
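
For reference, a sketch of the symbol layout this patch produces for an assumed kernel foo on the AMDGPU target: the kernel entry point keeps its original name, and the new reserved prefix goes on the callable stub (in C++ mode the prefix is instead folded into the Itanium <source-name>):

kernel void foo(global int *A) { }
// kernel entry point (name unchanged):  define amdgpu_kernel void @foo(...)
// callable implementation stub:         define void @__clang_ocl_kern_imp_foo(...)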

@lalaniket8 lalaniket8 force-pushed the users/lalaniket8/emit-device-version-of-openCL-kernel branch from 00167a3 to 013801b on November 29, 2024 08:36
if (D->hasAttr<OpenCLKernelAttr>()) {
Value.setPointerAndInt(D, unsigned(KernelReferenceKind::Kernel));
return;
}
Contributor

It looks like there's new commonality between OpenCL and CUDA kernels, and this feels like a generalizable feature. Please add a method to FunctionDecl (and, looking at the rest of the patch, also FunctionTemplateDecl?) named something like isReferenceableKernel() that checks for either of these two attributes. Most of the diffs in this patch should end up just changing various bits of existing code that check specifically for CUDAGlobalAttr to call that method instead.

For this change specifically, you'll want to make sure that getDefaultKernelReference does the right thing for OpenCL kernels.

Contributor Author

Added isReferenceableKernel() to FunctionDecl. Since OpenCL does not support template kernel functions, I have not added it to FunctionTemplateDecl.

CGM.getOpenCLRuntime().recordBlockInfo(
blockInfo.BlockExpression, InvokeFn, result, blockInfo.StructureType,
CurGD && CurGD.isDeclOpenCLKernel() &&
(CurGD.getKernelReferenceKind() == KernelReferenceKind::Kernel));
Contributor

This is trying to protect against double-registering the block, right? But that's only necessary if you're double-emitting the kernel body, which you really shouldn't be doing — there are a lot of other places that assume that function bodies are only emitted once, like static locals. Probably the kernel should be emitted by calling the stub or vice-versa?

Contributor Author

Does that mean we replace the function body of the stub with a call to the respective kernel?
This way we emit the function body only once.
So a kernel-to-kernel function call is replaced by a kernel-to-stub-to-kernel call?

For example, the following snippet:

kernel void callee_kern(){

}
kernel void caller_kern(){
	callee_kern();
}

should generate IR that looks like:

define dso_local amdgpu_kernel void @callee_kern({{.*}}){

}
define dso_local void @__clang_ocl_kern_imp_callee_kern({{.*}}){
  entry:
  tail call amdgpu_kernel void @callee_kern({{.*}})
}
define dso_local amdgpu_kernel void @caller_kern({{.*}}){
  tail call void @__clang_ocl_kern_imp_callee_kern();
}
define dso_local void @__clang_ocl_kern_imp_caller_kern({{.*}}){
  tail call void @__clang_ocl_kern_imp_callee_kern();
}

Contributor

That's one of the options, yeah. You'd always emit stubs as directly calling their kernels (so _clang_ocl_kern_imp_caller_kern would actually just call caller_kern, at least coming out of IRGen), and then you'd let LLVM decide when and whether to inline. The main disadvantage is that there might not already be code to emit calls to kernels; I imagine that's normally only done by the OpenCL runtime, and the ABI is probably pretty different from the normal call ABI. But maybe that's not a big deal.

The second option is that you could emit stubs by making sure the kernel is emitted first and then directly cloning the kernel body into the stub. Whether this is materially different from just inlining, I don't know, but it's an option. We do have code for this sort of thing already — we need it for emitting virtual varargs thunks in C++.

The third option is that you could emit kernels as directly calling their stubs. The advantage here is that all the calls are just normal calls that you definitely already have code to handle. The disadvantage is that you'd always be forcing the stub to be emitted, even in the 99% case that there are no other uses of it. It'll probably get reliably inlined away, but still, it's burning some extra compile time.

The last option is that you could emit kernels by emitting the stub and then directly cloning the function body into the kernel.

Up to you; they all have trade-offs. But I do think you need to not double-emit the kernel body.

Contributor Author

a kernel-to-kernel function call is replaced by a kernel-to-stub-to-kernel call

Calls to kernels (from device and kernel functions) are not supported during backend ISel lowering: https://github.com/llvm/llvm-project/blob/main/llvm/lib/Target/AMDGPU/AMDGPUISelLowering.cpp#L1127
So, we can't have a stub calling a kernel go into the backend.
Inlining the callee kernel body into the stub function might not be helpful in cases where the callee kernel is just a function declaration.

kernel is emitted first and then directly cloning the kernel body into the stub

emit kernels by emitting the stub and then directly cloning the function body into the kernel

Some doubts on these points: wouldn't this also result in having two copies of the kernel body? Would this avoid the problems we see during double emission?

Contributor
@rjmccall rjmccall Dec 4, 2024

Calls to kernels (from device and kernel functions) are not supported during backend ISel lowering:

That makes sense. Okay, so you need to call the stub. That's probably fine — LLVM should reliably clean it up regardless — but it'll cause a lot of test case churn unless you force inlining prior to LLVM.

Some doubts on these points: Wouldn't this also result in having two copies of the kernel body ? Would this avoid the problems we see during double emission ?

Having two copies of a function body is not the problem. This is essentially just inlining, so as long as the kernel body doesn't do something that's semantically incompatible with inlining (which there aren't many examples of — it's mostly just the address-of-label extension, which I assume GPUs aren't going to handle well anyway), it's fine to inline.

The problem is that the visitor for emitting the body does not expect to run twice on the same code, and so it will emit code incorrectly for structures that require supporting global variables and functions, such as blocks and static locals. For example, this patch contains code to work around the double-emission of a block.

Contributor Author

I have implemented the emission of kernels as directly calling their respective stubs. The kernel body now consists only of a call to the stub. The stub contains the kernel body and is emitted with the alwaysinline attribute.
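
A sketch of the resulting structure (assumed shape based on the description above, not verbatim IR): the stub carries the real kernel body and is marked alwaysinline, while the kernel entry point is a thin wrapper that simply calls its stub.

kernel void callee_kern(global int *A) {
  *A = 1;
}
// define dso_local amdgpu_kernel void @callee_kern(ptr addrspace(1) %A)
//   call void @__clang_ocl_kern_imp_callee_kern(ptr addrspace(1) %A)   ; wrapper only
//   ret void
//
// define dso_local void @__clang_ocl_kern_imp_callee_kern(ptr addrspace(1) %A) #0
//   store i32 1, ptr addrspace(1) %A                                   ; real body
//   ret void
//
// attributes #0 = { alwaysinline ... }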

if (auto FD = dyn_cast<FunctionDecl>(TargetDecl)) {
if (FD->hasAttr<OpenCLKernelAttr>() &&
GD.getKernelReferenceKind() == KernelReferenceKind::Stub)
CallingConv = llvm::CallingConv::C;
Contributor

This seems like the wrong place to be doing this. If we're calling the stub, the CGFunctionInfo needs to contain the right CC information for the stub, and then we won't need special overrides like this.

Contributor Author

Moved change of calling convention to CodeGenTypes::arrangeFunctionDeclaration() so that we obtain the correct CGFunctionInfo.

auto CalleeDecl = FD->hasAttr<OpenCLKernelAttr>()
                      ? GlobalDecl(FD, KernelReferenceKind::Stub)
                      : FD;
return EmitDirectCallee(*this, CalleeDecl);
Contributor

Hmm. It looks like the CUDA folks had this same problem and came up with an awkward workaround for it in EmitDirectCallee. We should really just be requesting the right GD in the first place. Could you add a getGlobalDeclForDirectCall function that does the right thing for both modes? If it ends up causing complicated behavior/test changes in CUDA mode, you can feel free to exclude CUDA for now and just leave a comment saying that the workaround should be removed in favor of doing the right thing in that function.

Contributor Author

I have added getGlobalDeclForDirectCall(), which returns the correct GD for OpenCL mode; however, I am a bit unclear on what is required for CUDA mode.

CUDA folks had this same problem and came up with an awkward workaround for it in EmitDirectCallee

Are you referring to the reassignment of CalleePtr in EmitDirectCallee()?
Do we want to pass a GD into EmitDirectCallee() which already contains the correct CalleePtr for CUDA mode?

Contributor

Yeah, that's what I meant, and yeah, I feel like we should just pass down the right GD.

Contributor Author

I see. I can get the correct CalleePtr, which is of type llvm::Constant, but I am not able to figure out how to convert it into a GlobalDecl object.

Collaborator

@yxsamliu yxsamliu left a comment

LGTM. Thanks

anikelal added 8 commits April 7, 2025 14:15
This feature is currently not supported in the compiler.
To facilitate this, we emit a stub version of each kernel
function body with a different name-mangling scheme and
replace the respective kernel call sites appropriately.

Fixes #60313

D120566 was an earlier attempt to upstream a solution
for this issue.
Simplifying isDeclOpenCLKernel() and removing resolved comments
Test aggregate parameters passed byref and byval
@lalaniket8 lalaniket8 force-pushed the users/lalaniket8/emit-device-version-of-openCL-kernel branch from 037f952 to 1b8c7cf Compare April 7, 2025 10:39
@lalaniket8 lalaniket8 merged commit 642481a into main Apr 8, 2025
11 checks passed
@lalaniket8 lalaniket8 deleted the users/lalaniket8/emit-device-version-of-openCL-kernel branch April 8, 2025 04:59

github-actions bot commented Apr 8, 2025

@lalaniket8 Congratulations on having your first Pull Request (PR) merged into the LLVM Project!

Your changes will be combined with recent changes from other authors, then tested by our build bots. If there is a problem with a build, you may receive a report in an email or a comment on this PR.

Please check whether problems have been caused by your change specifically, as the builds can include changes from many authors. It is not uncommon for your change to be included in a build that fails due to someone else's changes, or infrastructure issues.

How to do this, and the rest of the post-merge process, is covered in detail here.

If your change does cause a problem, it may be reverted, or you can revert it yourself. This is a normal part of LLVM development. You can fix your changes and open a new PR to merge them again.

If you don't get any reports, no action is required from you. Your changes are working as expected, well done!

searlmc1 pushed a commit to ROCm/llvm-project that referenced this pull request Apr 8, 2025
To facilitate OpenCL kernel-to-kernel calls, we emit a stub version of each
kernel function body with a different name-mangling scheme and
replace the respective kernel call sites appropriately.

Corresponding upstream PR llvm#115821
Fixes https://ontrack-internal.amd.com/browse/SWDEV-245936
shiltian added a commit that referenced this pull request Apr 8, 2025
Due to a previous workaround allowing kernels to be called from other functions,
Clang currently doesn't use the `byref` attribute for aggregate kernel
arguments. The issue was recently resolved in
#115821. With that fix, we can now
enable the use of `byref` consistently across all languages.

Co-authored-by: Matt Arsenault <[email protected]>
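(For context, a minimal sketch of the aggregate-kernel-argument case that commit is about; the struct and kernel names are made up.)

typedef struct { int x; float y[4]; } Params;

// With kernels no longer directly callable from device code, the frontend is
// free to pass an aggregate kernel argument like this with byref rather than
// byval at the IR level.
kernel void compute(Params p, global float *out) {
  out[0] = p.y[0] + (float)p.x;
}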
@MrSidims MrSidims requested a review from svenvh April 9, 2025 09:36
@MrSidims
Contributor

MrSidims commented Apr 9, 2025

@lalaniket8 @arsenm I don't have a strong opinion, but shouldn't this transformation be done during lowering to the target? The current version of the patch causes odd behavior for LLVM IR to SPIR-V lowering of OpenCL kernels: SPIR-V doesn't allow one EntryPoint to refer to another EntryPoint, so during such lowering the translator moves the kernel's body to an impl function (just like this patch does). Together they result in quite odd behavior: KhronosGroup/SPIRV-LLVM-Translator#3115

@yxsamliu
Collaborator

yxsamliu commented Apr 9, 2025

@lalaniket8 @arsenm I don't have a strong opinion, but shouldn't this transformation be done during lowering to the target? The current version of the patch causes odd behavior for LLVM IR to SPIR-V lowering of OpenCL kernels: SPIR-V doesn't allow one EntryPoint to refer to another EntryPoint, so during such lowering the translator moves the kernel's body to an impl function (just like this patch does). Together they result in quite odd behavior: KhronosGroup/SPIRV-LLVM-Translator#3115

My understanding is that for most targets, kernel and non-kernel functions have different calling conventions, which affects how struct-type arguments are passed. Allowing kernel functions to be callable in IR could cause issues in the mid-end; that is why, for any kernel, a non-kernel __clang_ocl_kern_imp* function is created and used where the kernel function is called. This is necessary to have proper IR.

For the LLVM/SPIR-V translator, this change should be transparent: __clang_ocl_kern_imp* should behave like a common non-kernel function called by a kernel function.

@arsenm
Contributor

arsenm commented Apr 13, 2025

@lalaniket8 @arsenm I don't have a strong opinion, but shouldn't this transformation be done during lowering to the target?

This is the lowering to the target.

Current version of the patch brings odd behavior for LLVM IR to SPIR-V lowering for OpenCL kernels. SPIR-V don't allow one EntryPoint to refer another EntryPoint,

Yes, that's the entire point of this change. We've avoided introducing the invalid situation of an entry calling another entry, which requires 2 IR function bodies with different calling conventions. If you see an entry calling another entry, the calling convention is being set incorrectly on the implementation function.

shiltian added a commit that referenced this pull request Apr 13, 2025
…ents (#134892)

Due to a previous workaround allowing kernels to be called from other
functions, Clang currently doesn't use the `byref` attribute for aggregate
kernel arguments. The issue was recently resolved in #115821. With that fix,
we can now enable the use of `byref` consistently across all languages.

Co-authored-by: Matt Arsenault <[email protected]>

Fixes SWDEV-247226.

Co-authored-by: Matt Arsenault <[email protected]>
llvm-sync bot pushed a commit to arm/arm-toolchain that referenced this pull request Apr 13, 2025
…ernel arguments (#134892)

Due to a previous workaround allowing kernels to be called from other
functions, Clang currently doesn't use the `byref` attribute for aggregate
kernel arguments. The issue was recently resolved in
llvm/llvm-project#115821. With that fix, we can now enable the use of
`byref` consistently across all languages.

Co-authored-by: Matt Arsenault <[email protected]>

Fixes SWDEV-247226.

Co-authored-by: Matt Arsenault <[email protected]>
@MrSidims
Contributor

MrSidims commented Apr 14, 2025

This is the lowering to the target.

My glossary might be lacking some definitions, but what I really meant by lowering is: https://github.com/llvm/llvm-project/blob/main/llvm/lib/Target/AMDGPU or https://github.com/llvm/llvm-project/tree/main/llvm/lib/Target/SPIRV .

I definitely see the value of having this resolved in one place instead of multiple places, and clang is a good candidate for it. On the other hand: (a) some targets don't have such a restriction, and for them this change, while harmless, is still redundant; (b) clang is not the only frontend, especially now in the MLIR world; does this mean that every frontend should make such an adjustment when lowering to AMDGPU or SPIR-V?

Just a note/disclaimer: this change makes sense and it doesn't bring (known to me) regressions in any downstream (the reason I'm here is that it was a surprise to see this during some of my experiments), and adjusting tests in the translator is totally fine for me. What I'm trying to understand is whether it really should be done in clang for OpenCL for every target. Again, some targets might already have such a transformation implemented, and for them this change results in the generation of 3 functions (instead of a single kernel).

Also, do you plan to introduce the same change for OpenMP or Fortran at some point? If so, again, doesn't it make sense to modify https://github.com/llvm/llvm-project/blob/main/llvm/lib/Target/AMDGPU instead of clang?

@bwyma
Contributor

bwyma commented Apr 14, 2025

@t-tye Would you mind reviewing this patch from a debug perspective?

Potentially multiple subprograms with the same source name, the same source correlation in different routines, missing artificial or trampoline attributes, whether stepping between functions works as expected, etc. Your input would be appreciated.

@MrSidims
Contributor

this change makes sense and it doesn't bring (known to me) regressions

Actually, there is incorrect behavior in the following test case: https://godbolt.org/z/dc3T7Mo3G; note that __clang_ocl_kern_imp_sample_kernel_float was generated but never called. @lalaniket8 can this be addressed?

@lalaniket8
Contributor Author

this change makes sense and it doesn't bring (known to me) regressions

Actually, there is incorrect behavior in the following test case: https://godbolt.org/z/dc3T7Mo3G; note that __clang_ocl_kern_imp_sample_kernel_float was generated but never called. @lalaniket8 can this be addressed?

The call was placed and then inlined because the -O2 optimization flag was passed; default inlining happens at -O1 and above.
You can change it to -O0 and see the call to __clang_ocl_kern_imp_sample_kernel_float.
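(A suggested way to observe this, not output taken from this thread; the exact flags may need adjusting for your setup, and the kernel signature below is a made-up stand-in for the godbolt example.)

// sample.cl
kernel void sample_kernel_float(global float *out) {
  out[0] = 1.0f;
}

// clang -x cl -S -emit-llvm --target=amdgcn-amd-amdhsa -O0 sample.cl
//   -> the kernel entry contains a visible call to
//      __clang_ocl_kern_imp_sample_kernel_float
// clang -x cl -S -emit-llvm --target=amdgcn-amd-amdhsa -O2 sample.cl
//   -> the always-inline stub is inlined into the entry, so it appears uncalled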

var-const pushed a commit to ldionne/llvm-project that referenced this pull request Apr 17, 2025
…ents (llvm#134892)

Due to a previous workaround allowing kernels to be called from other
functions, Clang currently doesn't use the `byref` attribute for aggregate
kernel arguments. The issue was recently resolved in llvm#115821. With that
fix, we can now enable the use of `byref` consistently across all languages.

Co-authored-by: Matt Arsenault <[email protected]>

Fixes SWDEV-247226.

Co-authored-by: Matt Arsenault <[email protected]>
@arsenm
Contributor

arsenm commented May 1, 2025

My glossary might be lacking some definitions, but what I really meant by lowering is

The IR is not a direct vehicle for language semantics. This case should explicitly not be representable in the target IR, and the frontend needs to implement the calling convention in terms of valid IR for the target.

What I'm trying to understand is whether it really should be done in clang for OpenCL for every target.

This is implementing a stupid OpenCL feature nobody asked for, so clang should deal with it. Every other language can happily ignore this. AMDGPU, NVPTX, and SPIRV all have the same behavior, where kernels / entry points use a separate calling convention that cannot be called from device code. This call should be unrepresentable in the IR, and forcing it to be representable just makes the backends worse. We cannot move the calling-convention lowering forward without removing this artificial usage.

Labels
backend:AMDGPU backend:X86 clang:codegen IR generation bugs: mangling, exceptions, etc. clang:frontend Language frontend issues, e.g. anything involving "Sema" clang Clang issues not falling into any other category

Successfully merging this pull request may close these issues.

Unsupported calling convention: AMDGPU_KERNEL encountered when codegen some amdgpu (opencl) bitcode
10 participants