Skip to content

AMDGPUTargetStreamer generates .kd symbols, breaking LTO requirement, may be discarded by --gc-sections #119479

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
Tracked by #19636
benvanik opened this issue Dec 11, 2024 · 23 comments · May be fixed by #122930
Open
Tracked by #19636

Comments

@benvanik
Copy link
Contributor

benvanik commented Dec 11, 2024

The caching added in 3733ed6 by @MaskRay seems to have broken LTO and --gc-sections for certain use cases. Specifically the change made in lld/ELF/MarkLive.cpp to use the cached isExported value instead of calling includeInDynsym seems to be causing additional sections to be dropped that should not be (or at least were not before the change): 3733ed6#diff-3c88c62d912008cc04f796b330a035ecda925645264eaef43185ad43991cb8e9L224)

The AMDGPU target inserts special kernel descriptor object symbols that must be preserved into the final ELF for the runtime to load. These match any exported kernel in name with a .kd suffix and are emitted by AMDGPUTargetELFStreamer::EmitAmdhsaKernelDescriptor. Prior to the referenced commit these symbols existed and after they don't.

By reverting the mentioned line in MarkLive.cpp the original behavior is restored. I'm not familiar with the codebase but I suspect isExported is not initialized or not safe to cache at that location.

The following repro shows the issue (lld_lto_bug.c):

[[clang::amdgpu_kernel, gnu::visibility("protected")]] void some_kernel(int n) {
  //
}

compiled using

$ clang \
  -x c -std=c23 \
  -target amdgcn-amd-amdhsa -march=gfx1100 \
  -nogpulib \
  -fgpu-rdc \
  -fno-ident \
  -fvisibility=hidden \
  -O3 \
  lld_lto_bug.c \
  -c -emit-llvm -o lld_lto_bug.bc

or since bc files cannot be attached:

; ModuleID = 'lld_lto_bug.bc'
source_filename = "lld_lto_bug.c"
target datalayout = "e-p:64:64-p1:64:64-p2:32:32-p3:32:32-p4:64:64-p5:32:32-p6:32:32-p7:160:256:256:32-p8:128:128-p9:192:256:256:32-i64:64-v16:16-v24:32-v32:32-v48:64-v96:128-v192:256-v256:256-v512:512-v1024:1024-v2048:2048-n32:64-S32-A5-G1-ni:7:8:9"
target triple = "amdgcn-amd-amdhsa"

@__oclc_ABI_version = weak_odr hidden local_unnamed_addr addrspace(4) constant i32 500

; Function Attrs: mustprogress nofree norecurse nosync nounwind willreturn memory(none)
define protected amdgpu_kernel void @some_kernel(i32 noundef %n) local_unnamed_addr #0 {
entry:
  ret void
}

attributes #0 = { mustprogress nofree norecurse nosync nounwind willreturn memory(none) "amdgpu-no-agpr" "amdgpu-no-completion-action" "amdgpu-no-default-queue" "amdgpu-no-dispatch-id" "amdgpu-no-dispatch-ptr" "amdgpu-no-heap-ptr" "amdgpu-no-hostcall-ptr" "amdgpu-no-implicitarg-ptr" "amdgpu-no-lds-kernel-id" "amdgpu-no-multigrid-sync-arg" "amdgpu-no-queue-ptr" "amdgpu-no-workgroup-id-x" "amdgpu-no-workgroup-id-y" "amdgpu-no-workgroup-id-z" "amdgpu-no-workitem-id-x" "amdgpu-no-workitem-id-y" "amdgpu-no-workitem-id-z" "no-trapping-math"="true" "stack-protector-buffer-size"="8" "target-cpu"="gfx1100" "target-features"="+16-bit-insts,+atomic-fadd-rtn-insts,+ci-insts,+dl-insts,+dot10-insts,+dot12-insts,+dot5-insts,+dot7-insts,+dot8-insts,+dot9-insts,+dpp,+gfx10-3-insts,+gfx10-insts,+gfx11-insts,+gfx8-insts,+gfx9-insts,+wavefrontsize32" "uniform-work-group-size"="false" }

!llvm.module.flags = !{!0, !1, !2}

!0 = !{i32 1, !"amdhsa_code_object_version", i32 500}
!1 = !{i32 1, !"wchar_size", i32 4}
!2 = !{i32 8, !"PIC Level", i32 2}

Linking with LTO and gc-sections:

lld \
  -flavor gnu \
  -m elf64_amdgpu \
  -shared \
  -plugin-opt=mcpu=gfx1100 \
  -plugin-opt=O3 \
  --lto-CGO3 \
  --gc-sections \
  --print-gc-sections \
  --strip-debug \
  --discard-all \
  --discard-locals \
  -o lld_lto_bug.so \
  lld_lto_bug.bc

Before the commit this will print the expected output (no removal of the rodata):

removing unused section lld_lto_bug_patched.so.lto.o:(.text)

After the commit with the regression removing the rodata:

removing unused section lld_lto_bug.so.lto.o:(.text)
removing unused section lld_lto_bug.so.lto.o:(.rodata)

This can be verified with llvm-readelf as before:

Symbol table '.dynsym' contains 3 entries:
   Num:    Value          Size Type    Bind   Vis       Ndx Name
     0: 0000000000000000     0 NOTYPE  LOCAL  DEFAULT   UND 
     1: 0000000000001500     4 FUNC    GLOBAL PROTECTED   7 some_kernel
     2: 0000000000000480    64 OBJECT  GLOBAL PROTECTED   6 some_kernel.kd

The some_kernel.kd OBJECT is what is required at runtime to use the ELF.

And after:

Symbol table '.dynsym' contains 2 entries:
   Num:    Value          Size Type    Bind   Vis       Ndx Name
     0: 0000000000000000     0 NOTYPE  LOCAL  DEFAULT   UND 
     1: 0000000000001500     4 FUNC    GLOBAL PROTECTED   6 some_kernel
@github-actions github-actions bot added the lld label Dec 11, 2024
@EugeneZelenko EugeneZelenko added lld:ELF and removed lld labels Dec 11, 2024
@llvmbot
Copy link
Member

llvmbot commented Dec 11, 2024

@llvm/issue-subscribers-lld-elf

Author: Ben Vanik (benvanik)

The caching added in https://github.com/llvm/llvm-project/commit/3733ed6f1c6b0eef1e13e175ac81ad309fc0b080 by @MaskRay seems to have broken LTO and `--gc-sections` for certain use cases. Specifically the change made in `lld/ELF/MarkLive.cpp` to use the cached `isExported` value instead of calling `includeInDynsym` seems to be causing additional sections to be dropped that should not be (or at least were not before the change): https://github.com/llvm/llvm-project/commit/3733ed6f1c6b0eef1e13e175ac81ad309fc0b080#diff-3c88c62d912008cc04f796b330a035ecda925645264eaef43185ad43991cb8e9L224)

The AMDGPU target inserts special kernel descriptor object symbols that must be preserved into the final ELF for the runtime to load. These match any exported kernel in name with a .kd suffix and are emitted by AMDGPUTargetELFStreamer::EmitAmdhsaKernelDescriptor. Prior to the referenced commit these symbols existed and after they don't.

By reverting the mentioned line in MarkLive.cpp the original behavior is restored. I'm not familiar with the codebase but I suspect isExported is not initialized or not safe to cache at that location.

The following repro shows the issue (lld_lto_bug.c):

[[clang::amdgpu_kernel, gnu::visibility("protected")]] void some_kernel(int n) {
  //
}

compiled using

$ clang \
  -x c -std=c23 \
  -target amdgcn-amd-amdhsa -march=gfx1100 \
  -nogpulib \
  -fgpu-rdc \
  -fno-ident \
  -fvisibility=hidden \
  -O3 \
  lld_lto_bug.c \
  -c -emit-llvm -o lld_lto_bug.bc

or since bc files cannot be attached:

; ModuleID = 'lld_lto_bug.bc'
source_filename = "lld_lto_bug.c"
target datalayout = "e-p:64:64-p1:64:64-p2:32:32-p3:32:32-p4:64:64-p5:32:32-p6:32:32-p7:160:256:256:32-p8:128:128-p9:192:256:256:32-i64:64-v16:16-v24:32-v32:32-v48:64-v96:128-v192:256-v256:256-v512:512-v1024:1024-v2048:2048-n32:64-S32-A5-G1-ni:7:8:9"
target triple = "amdgcn-amd-amdhsa"

@<!-- -->__oclc_ABI_version = weak_odr hidden local_unnamed_addr addrspace(4) constant i32 500

; Function Attrs: mustprogress nofree norecurse nosync nounwind willreturn memory(none)
define protected amdgpu_kernel void @<!-- -->some_kernel(i32 noundef %n) local_unnamed_addr #<!-- -->0 {
entry:
  ret void
}

attributes #<!-- -->0 = { mustprogress nofree norecurse nosync nounwind willreturn memory(none) "amdgpu-no-agpr" "amdgpu-no-completion-action" "amdgpu-no-default-queue" "amdgpu-no-dispatch-id" "amdgpu-no-dispatch-ptr" "amdgpu-no-heap-ptr" "amdgpu-no-hostcall-ptr" "amdgpu-no-implicitarg-ptr" "amdgpu-no-lds-kernel-id" "amdgpu-no-multigrid-sync-arg" "amdgpu-no-queue-ptr" "amdgpu-no-workgroup-id-x" "amdgpu-no-workgroup-id-y" "amdgpu-no-workgroup-id-z" "amdgpu-no-workitem-id-x" "amdgpu-no-workitem-id-y" "amdgpu-no-workitem-id-z" "no-trapping-math"="true" "stack-protector-buffer-size"="8" "target-cpu"="gfx1100" "target-features"="+16-bit-insts,+atomic-fadd-rtn-insts,+ci-insts,+dl-insts,+dot10-insts,+dot12-insts,+dot5-insts,+dot7-insts,+dot8-insts,+dot9-insts,+dpp,+gfx10-3-insts,+gfx10-insts,+gfx11-insts,+gfx8-insts,+gfx9-insts,+wavefrontsize32" "uniform-work-group-size"="false" }

!llvm.module.flags = !{!0, !1, !2}

!0 = !{i32 1, !"amdhsa_code_object_version", i32 500}
!1 = !{i32 1, !"wchar_size", i32 4}
!2 = !{i32 8, !"PIC Level", i32 2}

Linking with LTO and gc-sections:

lld \
  -flavor gnu \
  -m elf64_amdgpu \
  -shared \
  -plugin-opt=mcpu=gfx1100 \
  -plugin-opt=O3 \
  --lto-CGO3 \
  --gc-sections \
  --print-gc-sections \
  --strip-debug \
  --discard-all \
  --discard-locals \
  -o lld_lto_bug.so \
  lld_lto_bug.bc

Before the commit this will print the expected output (no removal of the rodata):

removing unused section lld_lto_bug_patched.so.lto.o:(.text)

After the commit with the regression removing the rodata:

removing unused section lld_lto_bug.so.lto.o:(.text)
removing unused section lld_lto_bug.so.lto.o:(.rodata)

This can be verified with llvm-readelf as before:

Symbol table '.dynsym' contains 3 entries:
   Num:    Value          Size Type    Bind   Vis       Ndx Name
     0: 0000000000000000     0 NOTYPE  LOCAL  DEFAULT   UND 
     1: 0000000000001500     4 FUNC    GLOBAL PROTECTED   7 some_kernel
     2: 0000000000000480    64 OBJECT  GLOBAL PROTECTED   6 some_kernel.kd

The some_kernel.kd OBJECT is what is required at runtime to use the ELF.

And after:

Symbol table '.dynsym' contains 2 entries:
   Num:    Value          Size Type    Bind   Vis       Ndx Name
     0: 0000000000000000     0 NOTYPE  LOCAL  DEFAULT   UND 
     1: 0000000000001500     4 FUNC    GLOBAL PROTECTED   6 some_kernel

@MaskRay
Copy link
Member

MaskRay commented Dec 11, 2024

Thanks for the detailed reproduce. I think the issue is on AMDGPU's side about the .kd symbol.

If we add --save-temps to get lld_lto_bug.so.lto.o and perform a link using the ELF relocatable object file, some_kernel.kd (.rodata) will be retained as expected.

% myld.lld -shared --gc-sections --print-gc-sections lld_lto_bug.so.lto.o
removing unused section lld_lto_bug.so.lto.o:(.text)

There is a difference with the bitcode compilation case.

LLVM LTO requires that each bitcode file lists all symbol resolution.
(Some backend may generate runtime library calls while these libcall symbols appear in the IR symtab but are not listed as referenced.)

(You can find a comment "changing the symbol table could break the semantic assumptions of LTO".)

However, the AMDGPU backend breaks this contract by not recording some_kernel.kd in the IR symbol table, so the LTO library doesn't know some_kernel.kd.

While a symbol some_kernel.kd is created as LTO compilation output, this is no guarantee on whether the symbol should be retained/discarded.

% myllvm-nm a.bc
---------------- W __oclc_ABI_version
---------------- T some_kernel

In older MarkLive.cpp, includeInDynsym returns true due to -shared (retain), and now isExported is false (discard).

To fix this, you can either ensure that the IR symbol table contains some_kernel.kd or synthesize a R_AMDGPU_NONE marker relocation from some_kernel to some_kernel.kd.

If you know the symbol name, you can additionally specify -u some_kernel.kd to retain the symbol.
--export-dynamic-symbol '*.kd' cannot retain the symbol because some_kernel.kd is unknown to lld LTO.

@llvmbot
Copy link
Member

llvmbot commented Dec 11, 2024

@llvm/issue-subscribers-backend-amdgpu

Author: Ben Vanik (benvanik)

The caching added in https://github.com/llvm/llvm-project/commit/3733ed6f1c6b0eef1e13e175ac81ad309fc0b080 by @MaskRay seems to have broken LTO and `--gc-sections` for certain use cases. Specifically the change made in `lld/ELF/MarkLive.cpp` to use the cached `isExported` value instead of calling `includeInDynsym` seems to be causing additional sections to be dropped that should not be (or at least were not before the change): https://github.com/llvm/llvm-project/commit/3733ed6f1c6b0eef1e13e175ac81ad309fc0b080#diff-3c88c62d912008cc04f796b330a035ecda925645264eaef43185ad43991cb8e9L224)

The AMDGPU target inserts special kernel descriptor object symbols that must be preserved into the final ELF for the runtime to load. These match any exported kernel in name with a .kd suffix and are emitted by AMDGPUTargetELFStreamer::EmitAmdhsaKernelDescriptor. Prior to the referenced commit these symbols existed and after they don't.

By reverting the mentioned line in MarkLive.cpp the original behavior is restored. I'm not familiar with the codebase but I suspect isExported is not initialized or not safe to cache at that location.

The following repro shows the issue (lld_lto_bug.c):

[[clang::amdgpu_kernel, gnu::visibility("protected")]] void some_kernel(int n) {
  //
}

compiled using

$ clang \
  -x c -std=c23 \
  -target amdgcn-amd-amdhsa -march=gfx1100 \
  -nogpulib \
  -fgpu-rdc \
  -fno-ident \
  -fvisibility=hidden \
  -O3 \
  lld_lto_bug.c \
  -c -emit-llvm -o lld_lto_bug.bc

or since bc files cannot be attached:

; ModuleID = 'lld_lto_bug.bc'
source_filename = "lld_lto_bug.c"
target datalayout = "e-p:64:64-p1:64:64-p2:32:32-p3:32:32-p4:64:64-p5:32:32-p6:32:32-p7:160:256:256:32-p8:128:128-p9:192:256:256:32-i64:64-v16:16-v24:32-v32:32-v48:64-v96:128-v192:256-v256:256-v512:512-v1024:1024-v2048:2048-n32:64-S32-A5-G1-ni:7:8:9"
target triple = "amdgcn-amd-amdhsa"

@<!-- -->__oclc_ABI_version = weak_odr hidden local_unnamed_addr addrspace(4) constant i32 500

; Function Attrs: mustprogress nofree norecurse nosync nounwind willreturn memory(none)
define protected amdgpu_kernel void @<!-- -->some_kernel(i32 noundef %n) local_unnamed_addr #<!-- -->0 {
entry:
  ret void
}

attributes #<!-- -->0 = { mustprogress nofree norecurse nosync nounwind willreturn memory(none) "amdgpu-no-agpr" "amdgpu-no-completion-action" "amdgpu-no-default-queue" "amdgpu-no-dispatch-id" "amdgpu-no-dispatch-ptr" "amdgpu-no-heap-ptr" "amdgpu-no-hostcall-ptr" "amdgpu-no-implicitarg-ptr" "amdgpu-no-lds-kernel-id" "amdgpu-no-multigrid-sync-arg" "amdgpu-no-queue-ptr" "amdgpu-no-workgroup-id-x" "amdgpu-no-workgroup-id-y" "amdgpu-no-workgroup-id-z" "amdgpu-no-workitem-id-x" "amdgpu-no-workitem-id-y" "amdgpu-no-workitem-id-z" "no-trapping-math"="true" "stack-protector-buffer-size"="8" "target-cpu"="gfx1100" "target-features"="+16-bit-insts,+atomic-fadd-rtn-insts,+ci-insts,+dl-insts,+dot10-insts,+dot12-insts,+dot5-insts,+dot7-insts,+dot8-insts,+dot9-insts,+dpp,+gfx10-3-insts,+gfx10-insts,+gfx11-insts,+gfx8-insts,+gfx9-insts,+wavefrontsize32" "uniform-work-group-size"="false" }

!llvm.module.flags = !{!0, !1, !2}

!0 = !{i32 1, !"amdhsa_code_object_version", i32 500}
!1 = !{i32 1, !"wchar_size", i32 4}
!2 = !{i32 8, !"PIC Level", i32 2}

Linking with LTO and gc-sections:

lld \
  -flavor gnu \
  -m elf64_amdgpu \
  -shared \
  -plugin-opt=mcpu=gfx1100 \
  -plugin-opt=O3 \
  --lto-CGO3 \
  --gc-sections \
  --print-gc-sections \
  --strip-debug \
  --discard-all \
  --discard-locals \
  -o lld_lto_bug.so \
  lld_lto_bug.bc

Before the commit this will print the expected output (no removal of the rodata):

removing unused section lld_lto_bug_patched.so.lto.o:(.text)

After the commit with the regression removing the rodata:

removing unused section lld_lto_bug.so.lto.o:(.text)
removing unused section lld_lto_bug.so.lto.o:(.rodata)

This can be verified with llvm-readelf as before:

Symbol table '.dynsym' contains 3 entries:
   Num:    Value          Size Type    Bind   Vis       Ndx Name
     0: 0000000000000000     0 NOTYPE  LOCAL  DEFAULT   UND 
     1: 0000000000001500     4 FUNC    GLOBAL PROTECTED   7 some_kernel
     2: 0000000000000480    64 OBJECT  GLOBAL PROTECTED   6 some_kernel.kd

The some_kernel.kd OBJECT is what is required at runtime to use the ELF.

And after:

Symbol table '.dynsym' contains 2 entries:
   Num:    Value          Size Type    Bind   Vis       Ndx Name
     0: 0000000000000000     0 NOTYPE  LOCAL  DEFAULT   UND 
     1: 0000000000001500     4 FUNC    GLOBAL PROTECTED   6 some_kernel

@MaskRay MaskRay changed the title [lld] Regression in LTO/gc-sections for certain symbols impacting AMDGPU ELFs AMDGPUTargetStreamer generates .kd symbols, breaking LTO requirement, may be discarded by --gc-sections Dec 11, 2024
@arsenm
Copy link
Contributor

arsenm commented Dec 11, 2024

synthesize a R_AMDGPU_NONE marker relocation from some_kernel to some_kernel.kd.

How would you do this?

@benvanik
Copy link
Contributor Author

benvanik commented Dec 11, 2024

Thanks for looking into it!

As a user is there anything I can do for the R_AMDGPU_NONE relocation or whatnot when coming through clang? Shoving some attributes somewhere perhaps? I can certainly add a bunch of -u's to my compile commands but that's not great and will likely bite other users too :)

I understand the bug may be in the AMDGPU target but this change did break the existing (incorrect, but relied upon) behavior. Is there anyone you're aware of who knows that part of the system? (sounds like @arsenm may be the best contact!) If it's something missing in AMDGPUTargetELFStreamer::EmitAmdhsaKernelDescriptor that builds the MCSymbolELF is there any hints as to what it may be? I can't seem to find anywhere that tests this case (emitting the ELFs with LLD using LTO) but I'm not an LLVM dev and this is a tricky intersection of subprojects.

@rnk
Copy link
Collaborator

rnk commented Dec 11, 2024

However, the AMDGPU backend breaks this contract by not recording some_kernel.kd in the IR symbol table, so the LTO library doesn't know some_kernel.kd.

I'm not sure this contract is followed in practice. I can think of a wide variety of transforms that should run after LTO and can potentially introduce new symbols. Most instrumentation passes (sanitizers, PGO) run after LTO and generate new symbols that weren't in the IR symbol table. I believe in the last year coroutine splitting was moved after LTO, which splits coroutines into three symbols.

Perhaps the contract should be that post-LTO IR passes cannot introduce new exported, dynamic, non-hidden symbols. The AMD backend pass can still use other mechanisms to ensure that kernel descriptors are not GC'd like @llvm.used or referencing descriptors from a section used in combination with __start_/__end_ symbols or whatever the platform native registration discovery mechanism happens to be.

Instrumentation passes generally do not need to emit new dynamic symbols. They are pretty much always hidden internal symbols. I'm not sure about coroutines. It seems like you could get into trouble if a coroutine gets inlined into two DSOs and then each DSO references different resume and destroy functions.

@jhuber6
Copy link
Contributor

jhuber6 commented Dec 12, 2024

Does anyone know why we use the .kd thing in the first place? I always thought it was weird that we didn't just use the actual symbol, but I'm assuming that's far in the past.

@benvanik
Copy link
Contributor Author

(only recently digging into all this, so I don't know the history) the .kd symbol is a descriptor in .rodata that lets the runtime resolve metadata about the function - the non-.kd (original) symbol is the function itself in .text - it's unfortunate that it was named .kd as I'm finding most flags (__start/__end IIRC) seem to only accept valid C names.

@rnk
Copy link
Collaborator

rnk commented Dec 12, 2024

Sure, but how does the runtime discover the .kd symbols? Typically custom sections are used to implement these sorts of reflection and registration schemes, but there are other ways.

@jhuber6
Copy link
Contributor

jhuber6 commented Dec 12, 2024

(only recently digging into all this, so I don't know the history) the .kd symbol is a descriptor in .rodata that lets the runtime resolve metadata about the function - the non-.kd (original) symbol is the function itself in .text - it's unfortunate that it was named .kd as I'm finding most flags (__start/__end IIRC) seem to only accept valid C names.

Yeah, it's just confusing to me because we could've easily made a custom symbol type on the kernel functions (right now they're indistinguishable from normal functions). After that it would just be a key-value lookup in the .notes section where we store the metadata. I have no clue why we need an extra symbol, but at this point it's probably baked into the runtime too deep.

@jhuber6
Copy link
Contributor

jhuber6 commented Dec 12, 2024

Sure, but how does the runtime discover the .kd symbols? Typically custom sections are used to implement these sorts of reflection and registration schemes, but there are other ways.

https://github.com/ROCm/ROCR-Runtime/blob/e93efba9cc892e8ef878ef25ddea16c7773af51a/runtime/hsa-runtime/loader/executable.cpp#L1465 extremely trivial lookup of every symbol in the table it seems.

@benvanik
Copy link
Contributor Author

My total guess is tooling: if I wanted disassemblers/readelf/breakpoints in gdb/etc to work nicely I'd want native functions separated from the metadata and I'd want the functions with their original name. As much as .kd bothers me changing that would break compatibility so it's likely going to have to remain until the compatibility window elapses - today the issue is that with this new change things (sometimes, depending on flow through clang) don't work ;(

@jhuber6
Copy link
Contributor

jhuber6 commented Dec 12, 2024

My total guess is tooling: if I wanted disassemblers/readelf/breakpoints in gdb/etc to work nicely I'd want native functions separated from the metadata and I'd want the functions with their original name. As much as .kd bothers me changing that would break compatibility so it's likely going to have to remain until the compatibility window elapses - today the issue is that with this new change things (sometimes, depending on flow through clang) don't work ;(

The .kd symbol seems to have been introduced in COV3, no reason we couldn't change it for COV7 but keep handling for backward compatibility. It seems that the way we currently handle this is to just scan every symbol for these 'kernel descriptors' which are basically just little structs that contain some basic information about it.

I know NVIDIA just uses a custom visibility in st_other to mark some functions as kernels. I'm not sure why we need the separate kernel descriptors, since it doesn't seem to contain the register usage information that the notes do, but the notes contain the same information this kernel descriptor does.

I'm not a member of the ROCr runtime team so I don't know the motivation behind it, but it feels like it could be better handled than scanning all the symbols, likely these should be in a special section at a minimum, that section can then be marked SHF_GNU_RETAIN which would resolve this whole problem.

@arsenm
Copy link
Contributor

arsenm commented Dec 12, 2024

Does anyone know why we use the .kd thing in the first place? I always thought it was weird that we didn't just use the actual symbol, but I'm assuming that's far in the past.

IIRC we used to jam this struct in the text section of the function, and then offset the actual code from it. That was considered wrong so we ended up with this instead

@benvanik
Copy link
Contributor Author

Personally the per-symbol .kd-like thing felt better to me vs. the global text notes - much easier to read struct fields than have an entire parser when building tooling around something and felt more in-line with ELF as a format (so not also needing a loosely-structured text parser on top of struct-based ELF) - but I'm not on the rocr team and just someone staring at big binaries and seeing more human-readable text than I'd like :)

@jhuber6
Copy link
Contributor

jhuber6 commented Dec 13, 2024

I don't know how to reproduce it, can you tell me if this fixes it?

diff --git a/llvm/lib/Target/AMDGPU/MCTargetDesc/AMDGPUTargetStreamer.cpp b/llvm/lib/Target/AMDGPU/MCTargetDesc/AMDGPUTargetStreamer.cpp
index ffde4d33f134..6946827b48a1 100644
--- a/llvm/lib/Target/AMDGPU/MCTargetDesc/AMDGPUTargetStreamer.cpp
+++ b/llvm/lib/Target/AMDGPU/MCTargetDesc/AMDGPUTargetStreamer.cpp
@@ -1003,7 +1003,12 @@ void AMDGPUTargetELFStreamer::EmitAmdhsaKernelDescriptor(
   if (KernelCodeSymbol->getVisibility() == ELF::STV_DEFAULT)
     KernelCodeSymbol->setVisibility(ELF::STV_PROTECTED);
 
+  Streamer.pushSection();
+  Streamer.switchSection(Context.getELFSection(
+      ".amdhsa.kd", ELF::SHT_PROGBITS, ELF::SHF_ALLOC | ELF::SHF_GNU_RETAIN));
   Streamer.emitLabel(KernelDescriptorSymbol);
+  Streamer.emitValueToAlignment(Align(alignof(amdhsa::kernel_descriptor_t)), 0,
+                                1, 0);
   Streamer.emitValue(
       KernelDescriptor.group_segment_fixed_size,
       sizeof(amdhsa::kernel_descriptor_t::group_segment_fixed_size));
@@ -1043,4 +1048,5 @@ void AMDGPUTargetELFStreamer::EmitAmdhsaKernelDescriptor(
                      sizeof(amdhsa::kernel_descriptor_t::kernarg_preload));
   for (uint32_t i = 0; i < sizeof(amdhsa::kernel_descriptor_t::reserved3); ++i)
     Streamer.emitInt8(0u);
+  Streamer.popSection();
 }

@jhuber6
Copy link
Contributor

jhuber6 commented Dec 20, 2024

@benvanik Any update? @arsenm @b-sumner do you think the above is reasonable? Pretty much just puts all the .kd symbols in their own section and keeps it from being deadstripped.

@b-sumner
Copy link

My only comment is that there is extreme sensitivity to kernel launch overhead. If we move the kernel descriptors we need to avoid affecting that latency.

@jhuber6
Copy link
Contributor

jhuber6 commented Dec 20, 2024

My only comment is that there is extreme sensitivity to kernel launch overhead. If we move the kernel descriptors we need to avoid affecting that latency.

This would just move it out of the data section and into its own section. The current code just scans the whole symbol table for these, so if anything it would be faster to just memcpy the whole thing from the section table.

@kuhar
Copy link
Member

kuhar commented Jan 13, 2025

cc: @bcahoon

@benvanik
Copy link
Contributor Author

benvanik commented Jan 13, 2025

@jhuber6:

I don't know how to reproduce it, can you tell me if this fixes it?

diff --git a/llvm/lib/Target/AMDGPU/MCTargetDesc/AMDGPUTargetStreamer.cpp b/llvm/lib/Target/AMDGPU/MCTargetDesc/AMDGPUTargetStreamer.cpp

Tested this and it works! (sorry, long winter vacation :)

Here's the diff between current main and this patch, which shows the OBJECT entries being restored:

2c2
< Symbol table '.dynsym' contains 18 entries:
---
> Symbol table '.dynsym' contains 35 entries:
5,21c5,38
<      1: 0000000000004700     4 FUNC    GLOBAL PROTECTED   8 iree_hal_amdgpu_device_buffer_fill_x4
<      2: 0000000000004800     4 FUNC    GLOBAL PROTECTED   8 iree_hal_amdgpu_device_buffer_fill_x8
<      3: 0000000000004b00     4 FUNC    GLOBAL PROTECTED   8 iree_hal_amdgpu_device_buffer_copy_x4
<      4: 0000000000004c00     4 FUNC    GLOBAL PROTECTED   8 iree_hal_amdgpu_device_buffer_copy_x8
<      5: 0000000000005600  1484 FUNC    GLOBAL PROTECTED   8 iree_hal_amdgpu_device_cmd_return
<      6: 0000000000004500     4 FUNC    GLOBAL PROTECTED   8 iree_hal_amdgpu_device_buffer_fill_x1
<      7: 0000000000004900     4 FUNC    GLOBAL PROTECTED   8 iree_hal_amdgpu_device_buffer_copy_x1
<      8: 0000000000009a00   908 FUNC    GLOBAL PROTECTED   8 iree_hal_amdgpu_device_queue_retire_entry
<      9: 0000000000009e00  5028 FUNC    GLOBAL PROTECTED   8 iree_hal_amdgpu_device_queue_scheduler_initialize
<     10: 0000000000004600     4 FUNC    GLOBAL PROTECTED   8 iree_hal_amdgpu_device_buffer_fill_x2
<     11: 0000000000004a00     4 FUNC    GLOBAL PROTECTED   8 iree_hal_amdgpu_device_buffer_copy_x2
<     12: 0000000000004d00     4 FUNC    GLOBAL PROTECTED   8 iree_hal_amdgpu_device_buffer_copy_x64
<     13: 0000000000004f00  1756 FUNC    GLOBAL PROTECTED   8 iree_hal_amdgpu_device_cmd_branch
<     14: 000000000000b200 46948 FUNC    GLOBAL PROTECTED   8 iree_hal_amdgpu_device_queue_scheduler_tick
<     15: 0000000000016a00   252 FUNC    GLOBAL PROTECTED   8 iree_hal_amdgpu_device_trace_buffer_initialize
<     16: 0000000000004e00   128 FUNC    GLOBAL PROTECTED   8 iree_hal_amdgpu_device_cmd_dispatch_indirect_update
<     17: 0000000000005c00 15628 FUNC    GLOBAL PROTECTED   8 iree_hal_amdgpu_device_cmd_block_issue
---
>      1: 0000000000005200     4 FUNC    GLOBAL PROTECTED   9 iree_hal_amdgpu_device_buffer_fill_x8
>      2: 0000000000005500     4 FUNC    GLOBAL PROTECTED   9 iree_hal_amdgpu_device_buffer_copy_x4
>      3: 0000000000004f00     4 FUNC    GLOBAL PROTECTED   9 iree_hal_amdgpu_device_buffer_fill_x1
>      4: 000000000000a400   908 FUNC    GLOBAL PROTECTED   9 iree_hal_amdgpu_device_queue_retire_entry
>      5: 0000000000003550    64 OBJECT  GLOBAL PROTECTED   7 iree_hal_amdgpu_device_buffer_fill_x4.kd
>      6: 0000000000003690    64 OBJECT  GLOBAL PROTECTED   7 iree_hal_amdgpu_device_buffer_copy_x8.kd
>      7: 0000000000003790    64 OBJECT  GLOBAL PROTECTED   7 iree_hal_amdgpu_device_cmd_return.kd
>      8: 0000000000005000     4 FUNC    GLOBAL PROTECTED   9 iree_hal_amdgpu_device_buffer_fill_x2
>      9: 0000000000005900  1756 FUNC    GLOBAL PROTECTED   9 iree_hal_amdgpu_device_cmd_branch
>     10: 0000000000017400   252 FUNC    GLOBAL PROTECTED   9 iree_hal_amdgpu_device_trace_buffer_initialize
>     11: 00000000000035d0    64 OBJECT  GLOBAL PROTECTED   7 iree_hal_amdgpu_device_buffer_copy_x1.kd
>     12: 0000000000003850    64 OBJECT  GLOBAL PROTECTED   7 iree_hal_amdgpu_device_queue_scheduler_initialize.kd
>     13: 0000000000003610    64 OBJECT  GLOBAL PROTECTED   7 iree_hal_amdgpu_device_buffer_copy_x2.kd
>     14: 00000000000036d0    64 OBJECT  GLOBAL PROTECTED   7 iree_hal_amdgpu_device_buffer_copy_x64.kd
>     15: 0000000000003890    64 OBJECT  GLOBAL PROTECTED   7 iree_hal_amdgpu_device_queue_scheduler_tick.kd
>     16: 0000000000005100     4 FUNC    GLOBAL PROTECTED   9 iree_hal_amdgpu_device_buffer_fill_x4
>     17: 0000000000005600     4 FUNC    GLOBAL PROTECTED   9 iree_hal_amdgpu_device_buffer_copy_x8
>     18: 0000000000006000  1484 FUNC    GLOBAL PROTECTED   9 iree_hal_amdgpu_device_cmd_return
>     19: 0000000000003710    64 OBJECT  GLOBAL PROTECTED   7 iree_hal_amdgpu_device_cmd_dispatch_indirect_update.kd
>     20: 00000000000037d0    64 OBJECT  GLOBAL PROTECTED   7 iree_hal_amdgpu_device_cmd_block_issue.kd
>     21: 0000000000005300     4 FUNC    GLOBAL PROTECTED   9 iree_hal_amdgpu_device_buffer_copy_x1
>     22: 000000000000a800  5028 FUNC    GLOBAL PROTECTED   9 iree_hal_amdgpu_device_queue_scheduler_initialize
>     23: 0000000000003590    64 OBJECT  GLOBAL PROTECTED   7 iree_hal_amdgpu_device_buffer_fill_x8.kd
>     24: 0000000000003650    64 OBJECT  GLOBAL PROTECTED   7 iree_hal_amdgpu_device_buffer_copy_x4.kd
>     25: 0000000000005400     4 FUNC    GLOBAL PROTECTED   9 iree_hal_amdgpu_device_buffer_copy_x2
>     26: 0000000000005700     4 FUNC    GLOBAL PROTECTED   9 iree_hal_amdgpu_device_buffer_copy_x64
>     27: 000000000000bc00 46948 FUNC    GLOBAL PROTECTED   9 iree_hal_amdgpu_device_queue_scheduler_tick
>     28: 00000000000034d0    64 OBJECT  GLOBAL PROTECTED   7 iree_hal_amdgpu_device_buffer_fill_x1.kd
>     29: 0000000000003810    64 OBJECT  GLOBAL PROTECTED   7 iree_hal_amdgpu_device_queue_retire_entry.kd
>     30: 0000000000005800   128 FUNC    GLOBAL PROTECTED   9 iree_hal_amdgpu_device_cmd_dispatch_indirect_update
>     31: 0000000000006600 15628 FUNC    GLOBAL PROTECTED   9 iree_hal_amdgpu_device_cmd_block_issue
>     32: 0000000000003510    64 OBJECT  GLOBAL PROTECTED   7 iree_hal_amdgpu_device_buffer_fill_x2.kd
>     33: 0000000000003750    64 OBJECT  GLOBAL PROTECTED   7 iree_hal_amdgpu_device_cmd_branch.kd
>     34: 00000000000038d0    64 OBJECT  GLOBAL PROTECTED   7 iree_hal_amdgpu_device_trace_buffer_initialize.kd

@jhuber6
Copy link
Contributor

jhuber6 commented Jan 13, 2025

Tested this and it works! (sorry, long winter vacation :)

Here's the diff between current main and this patch, which shows the OBJECT entries being restored:

Cool, @arsenm @b-sumner should we pursue this solution? The only thing missing is changing a bunch of tests, otherwise I don't think it'll affect anything terribly.

@arsenm
Copy link
Contributor

arsenm commented Jan 14, 2025

Cool, @arsenm @b-sumner should we pursue this solution? The only thing missing is changing a bunch of tests, otherwise I don't think it'll affect anything terribly.

Sounds fine to me but I'm no elf expert

jhuber6 added a commit to jhuber6/llvm-project that referenced this issue Jan 14, 2025
Summary:
This patch emits all kernel descriptors to a new section rather than
mixing  them in with the other symbols in `.rodata` or `.text`. The
advantage to this approach is that it will allow the runtime to memcpy
the whole section rather than copying individual symbols whose names end
in `.kd`.

Additionally, this fixes an issue where `--gc-sections` will trim these
symbols as they are not used by anything and simply exist as metadata.
Adding the `SHF_GNU_RETAIN` prevents this entirely.

Fixes: llvm#119479
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging a pull request may close this issue.

9 participants