-
Notifications
You must be signed in to change notification settings - Fork 13.3k
AMDGPUTargetStreamer generates .kd symbols, breaking LTO requirement, may be discarded by --gc-sections #119479
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Comments
@llvm/issue-subscribers-lld-elf Author: Ben Vanik (benvanik)
The caching added in https://github.com/llvm/llvm-project/commit/3733ed6f1c6b0eef1e13e175ac81ad309fc0b080 by @MaskRay seems to have broken LTO and `--gc-sections` for certain use cases. Specifically the change made in `lld/ELF/MarkLive.cpp` to use the cached `isExported` value instead of calling `includeInDynsym` seems to be causing additional sections to be dropped that should not be (or at least were not before the change): https://github.com/llvm/llvm-project/commit/3733ed6f1c6b0eef1e13e175ac81ad309fc0b080#diff-3c88c62d912008cc04f796b330a035ecda925645264eaef43185ad43991cb8e9L224)
The AMDGPU target inserts special kernel descriptor object symbols that must be preserved into the final ELF for the runtime to load. These match any exported kernel in name with a By reverting the mentioned line in The following repro shows the issue ( [[clang::amdgpu_kernel, gnu::visibility("protected")]] void some_kernel(int n) {
//
} compiled using $ clang \
-x c -std=c23 \
-target amdgcn-amd-amdhsa -march=gfx1100 \
-nogpulib \
-fgpu-rdc \
-fno-ident \
-fvisibility=hidden \
-O3 \
lld_lto_bug.c \
-c -emit-llvm -o lld_lto_bug.bc or since bc files cannot be attached:
Linking with LTO and gc-sections: lld \
-flavor gnu \
-m elf64_amdgpu \
-shared \
-plugin-opt=mcpu=gfx1100 \
-plugin-opt=O3 \
--lto-CGO3 \
--gc-sections \
--print-gc-sections \
--strip-debug \
--discard-all \
--discard-locals \
-o lld_lto_bug.so \
lld_lto_bug.bc Before the commit this will print the expected output (no removal of the rodata):
After the commit with the regression removing the rodata:
This can be verified with llvm-readelf as before:
The And after:
|
Thanks for the detailed reproduce. I think the issue is on AMDGPU's side about the If we add
There is a difference with the bitcode compilation case. LLVM LTO requires that each bitcode file lists all symbol resolution. (You can find a comment "changing the symbol table could break the semantic assumptions of LTO".) However, the AMDGPU backend breaks this contract by not recording While a symbol
In older MarkLive.cpp, To fix this, you can either ensure that the IR symbol table contains If you know the symbol name, you can additionally specify |
@llvm/issue-subscribers-backend-amdgpu Author: Ben Vanik (benvanik)
The caching added in https://github.com/llvm/llvm-project/commit/3733ed6f1c6b0eef1e13e175ac81ad309fc0b080 by @MaskRay seems to have broken LTO and `--gc-sections` for certain use cases. Specifically the change made in `lld/ELF/MarkLive.cpp` to use the cached `isExported` value instead of calling `includeInDynsym` seems to be causing additional sections to be dropped that should not be (or at least were not before the change): https://github.com/llvm/llvm-project/commit/3733ed6f1c6b0eef1e13e175ac81ad309fc0b080#diff-3c88c62d912008cc04f796b330a035ecda925645264eaef43185ad43991cb8e9L224)
The AMDGPU target inserts special kernel descriptor object symbols that must be preserved into the final ELF for the runtime to load. These match any exported kernel in name with a By reverting the mentioned line in The following repro shows the issue ( [[clang::amdgpu_kernel, gnu::visibility("protected")]] void some_kernel(int n) {
//
} compiled using $ clang \
-x c -std=c23 \
-target amdgcn-amd-amdhsa -march=gfx1100 \
-nogpulib \
-fgpu-rdc \
-fno-ident \
-fvisibility=hidden \
-O3 \
lld_lto_bug.c \
-c -emit-llvm -o lld_lto_bug.bc or since bc files cannot be attached: ; ModuleID = 'lld_lto_bug.bc'
source_filename = "lld_lto_bug.c"
target datalayout = "e-p:64:64-p1:64:64-p2:32:32-p3:32:32-p4:64:64-p5:32:32-p6:32:32-p7:160:256:256:32-p8:128:128-p9:192:256:256:32-i64:64-v16:16-v24:32-v32:32-v48:64-v96:128-v192:256-v256:256-v512:512-v1024:1024-v2048:2048-n32:64-S32-A5-G1-ni:7:8:9"
target triple = "amdgcn-amd-amdhsa"
@<!-- -->__oclc_ABI_version = weak_odr hidden local_unnamed_addr addrspace(4) constant i32 500
; Function Attrs: mustprogress nofree norecurse nosync nounwind willreturn memory(none)
define protected amdgpu_kernel void @<!-- -->some_kernel(i32 noundef %n) local_unnamed_addr #<!-- -->0 {
entry:
ret void
}
attributes #<!-- -->0 = { mustprogress nofree norecurse nosync nounwind willreturn memory(none) "amdgpu-no-agpr" "amdgpu-no-completion-action" "amdgpu-no-default-queue" "amdgpu-no-dispatch-id" "amdgpu-no-dispatch-ptr" "amdgpu-no-heap-ptr" "amdgpu-no-hostcall-ptr" "amdgpu-no-implicitarg-ptr" "amdgpu-no-lds-kernel-id" "amdgpu-no-multigrid-sync-arg" "amdgpu-no-queue-ptr" "amdgpu-no-workgroup-id-x" "amdgpu-no-workgroup-id-y" "amdgpu-no-workgroup-id-z" "amdgpu-no-workitem-id-x" "amdgpu-no-workitem-id-y" "amdgpu-no-workitem-id-z" "no-trapping-math"="true" "stack-protector-buffer-size"="8" "target-cpu"="gfx1100" "target-features"="+16-bit-insts,+atomic-fadd-rtn-insts,+ci-insts,+dl-insts,+dot10-insts,+dot12-insts,+dot5-insts,+dot7-insts,+dot8-insts,+dot9-insts,+dpp,+gfx10-3-insts,+gfx10-insts,+gfx11-insts,+gfx8-insts,+gfx9-insts,+wavefrontsize32" "uniform-work-group-size"="false" }
!llvm.module.flags = !{!0, !1, !2}
!0 = !{i32 1, !"amdhsa_code_object_version", i32 500}
!1 = !{i32 1, !"wchar_size", i32 4}
!2 = !{i32 8, !"PIC Level", i32 2} Linking with LTO and gc-sections: lld \
-flavor gnu \
-m elf64_amdgpu \
-shared \
-plugin-opt=mcpu=gfx1100 \
-plugin-opt=O3 \
--lto-CGO3 \
--gc-sections \
--print-gc-sections \
--strip-debug \
--discard-all \
--discard-locals \
-o lld_lto_bug.so \
lld_lto_bug.bc Before the commit this will print the expected output (no removal of the rodata): removing unused section lld_lto_bug_patched.so.lto.o:(.text) After the commit with the regression removing the rodata: removing unused section lld_lto_bug.so.lto.o:(.text)
removing unused section lld_lto_bug.so.lto.o:(.rodata) This can be verified with llvm-readelf as before: Symbol table '.dynsym' contains 3 entries:
Num: Value Size Type Bind Vis Ndx Name
0: 0000000000000000 0 NOTYPE LOCAL DEFAULT UND
1: 0000000000001500 4 FUNC GLOBAL PROTECTED 7 some_kernel
2: 0000000000000480 64 OBJECT GLOBAL PROTECTED 6 some_kernel.kd The And after: Symbol table '.dynsym' contains 2 entries:
Num: Value Size Type Bind Vis Ndx Name
0: 0000000000000000 0 NOTYPE LOCAL DEFAULT UND
1: 0000000000001500 4 FUNC GLOBAL PROTECTED 6 some_kernel |
How would you do this? |
Thanks for looking into it! As a user is there anything I can do for the R_AMDGPU_NONE relocation or whatnot when coming through clang? Shoving some attributes somewhere perhaps? I can certainly add a bunch of
|
I'm not sure this contract is followed in practice. I can think of a wide variety of transforms that should run after LTO and can potentially introduce new symbols. Most instrumentation passes (sanitizers, PGO) run after LTO and generate new symbols that weren't in the IR symbol table. I believe in the last year coroutine splitting was moved after LTO, which splits coroutines into three symbols. Perhaps the contract should be that post-LTO IR passes cannot introduce new exported, dynamic, non-hidden symbols. The AMD backend pass can still use other mechanisms to ensure that kernel descriptors are not GC'd like Instrumentation passes generally do not need to emit new dynamic symbols. They are pretty much always hidden internal symbols. I'm not sure about coroutines. It seems like you could get into trouble if a coroutine gets inlined into two DSOs and then each DSO references different resume and destroy functions. |
Does anyone know why we use the |
(only recently digging into all this, so I don't know the history) the .kd symbol is a descriptor in .rodata that lets the runtime resolve metadata about the function - the non-.kd (original) symbol is the function itself in .text - it's unfortunate that it was named |
Sure, but how does the runtime discover the .kd symbols? Typically custom sections are used to implement these sorts of reflection and registration schemes, but there are other ways. |
Yeah, it's just confusing to me because we could've easily made a custom symbol type on the kernel functions (right now they're indistinguishable from normal functions). After that it would just be a key-value lookup in the |
https://github.com/ROCm/ROCR-Runtime/blob/e93efba9cc892e8ef878ef25ddea16c7773af51a/runtime/hsa-runtime/loader/executable.cpp#L1465 extremely trivial lookup of every symbol in the table it seems. |
My total guess is tooling: if I wanted disassemblers/readelf/breakpoints in gdb/etc to work nicely I'd want native functions separated from the metadata and I'd want the functions with their original name. As much as |
The I know NVIDIA just uses a custom visibility in st_other to mark some functions as kernels. I'm not sure why we need the separate kernel descriptors, since it doesn't seem to contain the register usage information that the notes do, but the notes contain the same information this kernel descriptor does. I'm not a member of the ROCr runtime team so I don't know the motivation behind it, but it feels like it could be better handled than scanning all the symbols, likely these should be in a special section at a minimum, that section can then be marked |
IIRC we used to jam this struct in the text section of the function, and then offset the actual code from it. That was considered wrong so we ended up with this instead |
Personally the per-symbol .kd-like thing felt better to me vs. the global text notes - much easier to read struct fields than have an entire parser when building tooling around something and felt more in-line with ELF as a format (so not also needing a loosely-structured text parser on top of struct-based ELF) - but I'm not on the rocr team and just someone staring at big binaries and seeing more human-readable text than I'd like :) |
I don't know how to reproduce it, can you tell me if this fixes it? diff --git a/llvm/lib/Target/AMDGPU/MCTargetDesc/AMDGPUTargetStreamer.cpp b/llvm/lib/Target/AMDGPU/MCTargetDesc/AMDGPUTargetStreamer.cpp
index ffde4d33f134..6946827b48a1 100644
--- a/llvm/lib/Target/AMDGPU/MCTargetDesc/AMDGPUTargetStreamer.cpp
+++ b/llvm/lib/Target/AMDGPU/MCTargetDesc/AMDGPUTargetStreamer.cpp
@@ -1003,7 +1003,12 @@ void AMDGPUTargetELFStreamer::EmitAmdhsaKernelDescriptor(
if (KernelCodeSymbol->getVisibility() == ELF::STV_DEFAULT)
KernelCodeSymbol->setVisibility(ELF::STV_PROTECTED);
+ Streamer.pushSection();
+ Streamer.switchSection(Context.getELFSection(
+ ".amdhsa.kd", ELF::SHT_PROGBITS, ELF::SHF_ALLOC | ELF::SHF_GNU_RETAIN));
Streamer.emitLabel(KernelDescriptorSymbol);
+ Streamer.emitValueToAlignment(Align(alignof(amdhsa::kernel_descriptor_t)), 0,
+ 1, 0);
Streamer.emitValue(
KernelDescriptor.group_segment_fixed_size,
sizeof(amdhsa::kernel_descriptor_t::group_segment_fixed_size));
@@ -1043,4 +1048,5 @@ void AMDGPUTargetELFStreamer::EmitAmdhsaKernelDescriptor(
sizeof(amdhsa::kernel_descriptor_t::kernarg_preload));
for (uint32_t i = 0; i < sizeof(amdhsa::kernel_descriptor_t::reserved3); ++i)
Streamer.emitInt8(0u);
+ Streamer.popSection();
}
|
My only comment is that there is extreme sensitivity to kernel launch overhead. If we move the kernel descriptors we need to avoid affecting that latency. |
This would just move it out of the data section and into its own section. The current code just scans the whole symbol table for these, so if anything it would be faster to just memcpy the whole thing from the section table. |
cc: @bcahoon |
Tested this and it works! (sorry, long winter vacation :) Here's the diff between current main and this patch, which shows the OBJECT entries being restored: 2c2
< Symbol table '.dynsym' contains 18 entries:
---
> Symbol table '.dynsym' contains 35 entries:
5,21c5,38
< 1: 0000000000004700 4 FUNC GLOBAL PROTECTED 8 iree_hal_amdgpu_device_buffer_fill_x4
< 2: 0000000000004800 4 FUNC GLOBAL PROTECTED 8 iree_hal_amdgpu_device_buffer_fill_x8
< 3: 0000000000004b00 4 FUNC GLOBAL PROTECTED 8 iree_hal_amdgpu_device_buffer_copy_x4
< 4: 0000000000004c00 4 FUNC GLOBAL PROTECTED 8 iree_hal_amdgpu_device_buffer_copy_x8
< 5: 0000000000005600 1484 FUNC GLOBAL PROTECTED 8 iree_hal_amdgpu_device_cmd_return
< 6: 0000000000004500 4 FUNC GLOBAL PROTECTED 8 iree_hal_amdgpu_device_buffer_fill_x1
< 7: 0000000000004900 4 FUNC GLOBAL PROTECTED 8 iree_hal_amdgpu_device_buffer_copy_x1
< 8: 0000000000009a00 908 FUNC GLOBAL PROTECTED 8 iree_hal_amdgpu_device_queue_retire_entry
< 9: 0000000000009e00 5028 FUNC GLOBAL PROTECTED 8 iree_hal_amdgpu_device_queue_scheduler_initialize
< 10: 0000000000004600 4 FUNC GLOBAL PROTECTED 8 iree_hal_amdgpu_device_buffer_fill_x2
< 11: 0000000000004a00 4 FUNC GLOBAL PROTECTED 8 iree_hal_amdgpu_device_buffer_copy_x2
< 12: 0000000000004d00 4 FUNC GLOBAL PROTECTED 8 iree_hal_amdgpu_device_buffer_copy_x64
< 13: 0000000000004f00 1756 FUNC GLOBAL PROTECTED 8 iree_hal_amdgpu_device_cmd_branch
< 14: 000000000000b200 46948 FUNC GLOBAL PROTECTED 8 iree_hal_amdgpu_device_queue_scheduler_tick
< 15: 0000000000016a00 252 FUNC GLOBAL PROTECTED 8 iree_hal_amdgpu_device_trace_buffer_initialize
< 16: 0000000000004e00 128 FUNC GLOBAL PROTECTED 8 iree_hal_amdgpu_device_cmd_dispatch_indirect_update
< 17: 0000000000005c00 15628 FUNC GLOBAL PROTECTED 8 iree_hal_amdgpu_device_cmd_block_issue
---
> 1: 0000000000005200 4 FUNC GLOBAL PROTECTED 9 iree_hal_amdgpu_device_buffer_fill_x8
> 2: 0000000000005500 4 FUNC GLOBAL PROTECTED 9 iree_hal_amdgpu_device_buffer_copy_x4
> 3: 0000000000004f00 4 FUNC GLOBAL PROTECTED 9 iree_hal_amdgpu_device_buffer_fill_x1
> 4: 000000000000a400 908 FUNC GLOBAL PROTECTED 9 iree_hal_amdgpu_device_queue_retire_entry
> 5: 0000000000003550 64 OBJECT GLOBAL PROTECTED 7 iree_hal_amdgpu_device_buffer_fill_x4.kd
> 6: 0000000000003690 64 OBJECT GLOBAL PROTECTED 7 iree_hal_amdgpu_device_buffer_copy_x8.kd
> 7: 0000000000003790 64 OBJECT GLOBAL PROTECTED 7 iree_hal_amdgpu_device_cmd_return.kd
> 8: 0000000000005000 4 FUNC GLOBAL PROTECTED 9 iree_hal_amdgpu_device_buffer_fill_x2
> 9: 0000000000005900 1756 FUNC GLOBAL PROTECTED 9 iree_hal_amdgpu_device_cmd_branch
> 10: 0000000000017400 252 FUNC GLOBAL PROTECTED 9 iree_hal_amdgpu_device_trace_buffer_initialize
> 11: 00000000000035d0 64 OBJECT GLOBAL PROTECTED 7 iree_hal_amdgpu_device_buffer_copy_x1.kd
> 12: 0000000000003850 64 OBJECT GLOBAL PROTECTED 7 iree_hal_amdgpu_device_queue_scheduler_initialize.kd
> 13: 0000000000003610 64 OBJECT GLOBAL PROTECTED 7 iree_hal_amdgpu_device_buffer_copy_x2.kd
> 14: 00000000000036d0 64 OBJECT GLOBAL PROTECTED 7 iree_hal_amdgpu_device_buffer_copy_x64.kd
> 15: 0000000000003890 64 OBJECT GLOBAL PROTECTED 7 iree_hal_amdgpu_device_queue_scheduler_tick.kd
> 16: 0000000000005100 4 FUNC GLOBAL PROTECTED 9 iree_hal_amdgpu_device_buffer_fill_x4
> 17: 0000000000005600 4 FUNC GLOBAL PROTECTED 9 iree_hal_amdgpu_device_buffer_copy_x8
> 18: 0000000000006000 1484 FUNC GLOBAL PROTECTED 9 iree_hal_amdgpu_device_cmd_return
> 19: 0000000000003710 64 OBJECT GLOBAL PROTECTED 7 iree_hal_amdgpu_device_cmd_dispatch_indirect_update.kd
> 20: 00000000000037d0 64 OBJECT GLOBAL PROTECTED 7 iree_hal_amdgpu_device_cmd_block_issue.kd
> 21: 0000000000005300 4 FUNC GLOBAL PROTECTED 9 iree_hal_amdgpu_device_buffer_copy_x1
> 22: 000000000000a800 5028 FUNC GLOBAL PROTECTED 9 iree_hal_amdgpu_device_queue_scheduler_initialize
> 23: 0000000000003590 64 OBJECT GLOBAL PROTECTED 7 iree_hal_amdgpu_device_buffer_fill_x8.kd
> 24: 0000000000003650 64 OBJECT GLOBAL PROTECTED 7 iree_hal_amdgpu_device_buffer_copy_x4.kd
> 25: 0000000000005400 4 FUNC GLOBAL PROTECTED 9 iree_hal_amdgpu_device_buffer_copy_x2
> 26: 0000000000005700 4 FUNC GLOBAL PROTECTED 9 iree_hal_amdgpu_device_buffer_copy_x64
> 27: 000000000000bc00 46948 FUNC GLOBAL PROTECTED 9 iree_hal_amdgpu_device_queue_scheduler_tick
> 28: 00000000000034d0 64 OBJECT GLOBAL PROTECTED 7 iree_hal_amdgpu_device_buffer_fill_x1.kd
> 29: 0000000000003810 64 OBJECT GLOBAL PROTECTED 7 iree_hal_amdgpu_device_queue_retire_entry.kd
> 30: 0000000000005800 128 FUNC GLOBAL PROTECTED 9 iree_hal_amdgpu_device_cmd_dispatch_indirect_update
> 31: 0000000000006600 15628 FUNC GLOBAL PROTECTED 9 iree_hal_amdgpu_device_cmd_block_issue
> 32: 0000000000003510 64 OBJECT GLOBAL PROTECTED 7 iree_hal_amdgpu_device_buffer_fill_x2.kd
> 33: 0000000000003750 64 OBJECT GLOBAL PROTECTED 7 iree_hal_amdgpu_device_cmd_branch.kd
> 34: 00000000000038d0 64 OBJECT GLOBAL PROTECTED 7 iree_hal_amdgpu_device_trace_buffer_initialize.kd |
Cool, @arsenm @b-sumner should we pursue this solution? The only thing missing is changing a bunch of tests, otherwise I don't think it'll affect anything terribly. |
Summary: This patch emits all kernel descriptors to a new section rather than mixing them in with the other symbols in `.rodata` or `.text`. The advantage to this approach is that it will allow the runtime to memcpy the whole section rather than copying individual symbols whose names end in `.kd`. Additionally, this fixes an issue where `--gc-sections` will trim these symbols as they are not used by anything and simply exist as metadata. Adding the `SHF_GNU_RETAIN` prevents this entirely. Fixes: llvm#119479
The caching added in 3733ed6 by @MaskRay seems to have broken LTO and
--gc-sections
for certain use cases. Specifically the change made inlld/ELF/MarkLive.cpp
to use the cachedisExported
value instead of callingincludeInDynsym
seems to be causing additional sections to be dropped that should not be (or at least were not before the change): 3733ed6#diff-3c88c62d912008cc04f796b330a035ecda925645264eaef43185ad43991cb8e9L224)The AMDGPU target inserts special kernel descriptor object symbols that must be preserved into the final ELF for the runtime to load. These match any exported kernel in name with a
.kd
suffix and are emitted byAMDGPUTargetELFStreamer::EmitAmdhsaKernelDescriptor
. Prior to the referenced commit these symbols existed and after they don't.By reverting the mentioned line in
MarkLive.cpp
the original behavior is restored. I'm not familiar with the codebase but I suspectisExported
is not initialized or not safe to cache at that location.The following repro shows the issue (
lld_lto_bug.c
):compiled using
or since bc files cannot be attached:
Linking with LTO and gc-sections:
Before the commit this will print the expected output (no removal of the rodata):
removing unused section lld_lto_bug_patched.so.lto.o:(.text)
After the commit with the regression removing the rodata:
This can be verified with llvm-readelf as before:
The
some_kernel.kd
OBJECT
is what is required at runtime to use the ELF.And after:
The text was updated successfully, but these errors were encountered: