merge main into amd-staging by ronlieb · Pull Request #1675 · ROCm/llvm-project

ronlieb · 2026-03-07T01:43:48Z

No description provided.

…#183639) When a HIP kernel uses placement new with a function returning an aggregate via sret (e.g. `new (out) T(make_t())`), and the placement destination is in global memory (addrspace 1), the sret pointer was addrspacecast'd to addrspace 5 (private), producing an invalid pointer that faults at runtime. Instead of casting the caller's pointer directly, materialise a temporary alloca in the callee's expected address space, pass that as the sret argument, and copy the result back to the original destination after the call.
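The source pattern described above can be sketched in host-only C++. This is an illustration under assumptions, not the Clang codegen change itself: `Payload`, `make_payload`, `construct_direct`, and `construct_via_temp` are hypothetical names, and the second function only mirrors, at source level, the "temporary, then copy back" strategy that the fix applies when the address spaces differ.

```cpp
#include <new>
#include <type_traits>

// Hypothetical aggregate, large enough that typical ABIs return it
// indirectly through a hidden sret pointer.
struct Payload {
  int values[16];
};

// Hypothetical factory returning the aggregate by value.
Payload make_payload(int seed) {
  Payload p{};
  for (int i = 0; i < 16; ++i)
    p.values[i] = seed + i;
  return p;
}

// The pattern from the commit message: new (out) T(make_t()).
// The compiler may pass `out` directly as the sret destination.
Payload *construct_direct(void *out, int seed) {
  return new (out) Payload(make_payload(seed));
}

// Source-level analogue of the fix: materialise the result in a local
// temporary first, then copy it into the destination.
Payload *construct_via_temp(void *out, int seed) {
  static_assert(std::is_trivially_copyable_v<Payload>,
                "copy-back is only valid for trivially copyable types");
  Payload tmp = make_payload(seed);
  return new (out) Payload(tmp);
}
```

On a host target both functions produce the same bytes; the difference only matters when the destination and the callee's expected sret address space disagree, as in the HIP case.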

…est (llvm#185031) As per llvm#182730 (review)

@AaronBallman

Part of the implementation of [[RFC] Emitting Auditable SARIF Logs from Clang](https://discourse.llvm.org/t/rfc-emitting-auditable-sarif-logs-from-clang/88624)

SARIF diagnostics require that each rule have a stable `id` property to identify that rule across runs, even when the compiler or analysis tool has changed. We were previously setting the `id` property to the numeric value of the enum value for that diagnostic within the Clang implementation; this value changes whenever an unrelated diagnostic is inserted or removed earlier in the list.

This change sets the `id` property to the _text_ of that same enum value. This value would only change if someone renames the enum value for that diagnostic, which should happen much less frequently than renumbering.

For now, we will just assume that renaming happens infrequently enough that existing consumers of SARIF will not notice. In the future, we could take advantage of SARIF's support for `deprecatedIds`, which let a rule specify the IDs by which it was previously known. This would let us rename, split, or combine diagnostics while still being able to correlate the new diagnostic IDs with older SARIF logs and/or suppressions.

Nothing in this change affects how warnings are configured on the command line or in `#pragma clang diagnostic`. Those still use warning groups, not the stable IDs.

### Potential discussion topics

From @AaronBallman on the RFC:

> We believe some open questions remain (things like whether a unique ID is on the per-diagnostic level or on the diagnostic group level, whether the ID is explicitly spelled in the .td file or implicitly generated, whether we document the IDs, etc), but we think those questions are best decided in PR discussions with interested parties rather than an RFC.

As a starting point, this PR proposes the following answers to those open questions:

- _whether a unique ID is on the per-diagnostic level or on the diagnostic group level_
  - per-diagnostic level. For my justification, see [this portion of the RFC discussion](https://discourse.llvm.org/t/rfc-emitting-auditable-sarif-logs-from-clang/88624/11?u=dbartol.).
- _whether the ID is explicitly spelled in the .td file or implicitly generated_
  - Implicitly generated, but I'd be happy to have a way to explicitly specify it. I just think that the in-code identifier is a reasonable default, and manually reviewing the IDs of thousands of existing diagnostics would add little benefit.
- _whether we document the IDs_
  - For now, the IDs are only exposed to the user (and other tools) in the SARIF file, so I don't think we need to document these. We could certainly add this information to the output of `diagtool` in the future if users find it relevant.
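To make the renumbering problem concrete, here is a minimal sketch. The enums, enumerator names, and helper functions are all made up for illustration and do not correspond to real Clang diagnostics or to the actual SARIF emitter code.

```cpp
#include <string>

// Hypothetical diagnostic list before an unrelated change...
enum class DiagV1 { err_expected_semi, warn_unused_variable };
// ...and after a new diagnostic is inserted earlier in the list.
enum class DiagV2 { err_expected_semi, err_new_unrelated, warn_unused_variable };

// Old scheme: the rule id is the numeric value of the enumerator.
template <typename Enum>
std::string numeric_id(Enum diag) {
  return std::to_string(static_cast<int>(diag));
}

// New scheme: the rule id is the spelling of the enumerator.
std::string text_id(DiagV1 diag) {
  switch (diag) {
  case DiagV1::err_expected_semi: return "err_expected_semi";
  case DiagV1::warn_unused_variable: return "warn_unused_variable";
  }
  return "";
}

std::string text_id(DiagV2 diag) {
  switch (diag) {
  case DiagV2::err_expected_semi: return "err_expected_semi";
  case DiagV2::err_new_unrelated: return "err_new_unrelated";
  case DiagV2::warn_unused_variable: return "warn_unused_variable";
  }
  return "";
}
```

In this sketch the numeric id of `warn_unused_variable` shifts from 1 to 2 when the unrelated diagnostic is inserted, while the text id is identical in both versions.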

We would previously include the FP80 sources into the Windows build if we built with the GNU driver rather than the `cl` driver.

A bit of a small nitpick, close it if unnecessary. (clang-tidy warnings)

This fixes 3da28bf. Co-authored-by: Google Bazel Bot <google-bazel-bot@google.com>

…n EmitNullBaseClassInitialization (llvm#184558)

When splitting memory stores around multiple virtual base pointers (vbptrs) in the Microsoft ABI, the calculation for the size of the memory region after each vbptr was incorrect.

The old, buggy calculation:

`SplitAfterSize = LastStoreSize - SplitAfterOffset`

This subtracts an absolute offset from a relative size, causing incorrect (too small) sizes after the second vbptr. The correct size should be:

`SplitAfterSize = (LastStoreOffset + LastStoreSize) - SplitAfterOffset`

Since all store regions extend to the end of the non-virtual portion (NVSize), this patch uses the simplified form:

`SplitAfterSize = NVSize - SplitAfterOffset`

The bug causes the assertion failure: "negative store size!"

Fixes llvm#42101
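The three formulas can be compared on a small worked example. The layout numbers below are assumptions chosen for illustration (a non-virtual size of 48 bytes, 8-byte vbptrs at offsets 8 and 32); the `Store` struct and function names are hypothetical and not taken from the Clang source.

```cpp
#include <cstdint>

// Hypothetical record of a pending store: absolute offset plus size.
struct Store {
  int64_t offset;
  int64_t size;
};

// Old calculation: mixes a relative size with an absolute offset.
int64_t split_after_size_old(Store last, int64_t split_after_offset) {
  return last.size - split_after_offset;
}

// Correct general form: end of the last store minus the new start.
int64_t split_after_size_fixed(Store last, int64_t split_after_offset) {
  return (last.offset + last.size) - split_after_offset;
}

// Simplified form used by the patch: every store region ends at NVSize.
int64_t split_after_size_nvsize(int64_t nv_size, int64_t split_after_offset) {
  return nv_size - split_after_offset;
}
```

With these numbers, the region left after the first vbptr is `{offset 16, size 32}`. Splitting it around the second vbptr gives `SplitAfterOffset = 40`: the old formula yields `32 - 40 = -8` (a negative store size), while both corrected forms yield `48 - 40 = 8`. The first split happens to work under the old formula only because `LastStoreOffset` is 0 there.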

Asyncmarks record the current wait state and so should not allow waitcnts that occur after them to be merged into waitcnts that occur before.

CI didn't flag that the benchmark was using the outdated Ctx call when landing the Mustache MD patch since this benchmark isn't tested. Also added missing libraries in CMake that prevented me from building the benchmark locally.

Design document for MLIR dialect-agnostic calling convention lowering that builds on the LLVM ABI Lowering Library (llvm/lib/ABI/) as the single source of truth for ABI classification. Dialects use the library via an adapter layer: ABITypeMapper maps dialect types to abi::Type*, the library classifies arguments and returns, and a dialect-specific ABIRewriteContext applies the decisions back to IR operations. Targets x86_64 and AArch64, with parity against Classic Clang CodeGen validated through differential testing.

When we call `getLoc()` with an invalid `SourceLocation` and `currSrcLoc` is also invalid, we were crashing or asserting. I tracked down one case where this was happening (generating an argument in a vtable thunk) and fixed that to provide a location. I also am updating the `getLoc()` implementation so that it will use an unknown location in release builds rather than crashing because the location isn't critical for correct compilation.

The test is failing on the lldb-x86_64-win buildbot.

…4730) This patch adds `flang -fc1` option `-ffp-maxmin-behavior` and propagates it throughout Flang, so that semantics context, lowering and the pass pipeline builder can use it. MAX/MIN intrinsic and OpenACC max/min reduction lowering are now controlled by the option. I kept the `Legacy` mode, which is the default and matches the current behavior. I am going to test and merge a follow-up patch that replaces `Legacy` with `Portable`. RFC: https://discourse.llvm.org/t/flang-canonical-and-optimizable-representation-for-min-max/90037

llvm#184886) Fixes cannot select errors for other types of shift amounts. I've made a new RISCVISD node that only allows an immediate operand. It's assumed that the lowering code will only allow valid immediates so I'm not using a TImmLeaf in the match.

llvm#185091) …et types Fix for buildbot crash on llvm#183639 The UseTemp path in AggExprEmitter::withReturnValueSlot copies back via EmitAggregateCopy, which asserts that the type has a trivial copy/move constructor or assignment operator. Gate the DestASMismatch condition on isTriviallyCopyableType so that non-trivially-copyable types (e.g. std::exception_ptr) fall through to the addrspacecast path instead. Fix buildbot crash: https://lab.llvm.org/buildbot/#/builders/73/builds/19803
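The gating condition can be pictured with a standard type trait. This is only a sketch of the decision, assuming the gate behaves like `std::is_trivially_copyable`; `PlainAggregate` and `can_use_temp_copy` are hypothetical names, and the real check lives in Clang's `isTriviallyCopyableType`.

```cpp
#include <exception>
#include <type_traits>

// Hypothetical aggregate that can be copied bitwise.
struct PlainAggregate {
  int data[8];
};

// Hypothetical helper mirroring the gate: take the "temporary + copy back"
// path only when the address spaces mismatch AND a plain copy is valid.
template <typename T>
constexpr bool can_use_temp_copy(bool dest_as_mismatch) {
  return dest_as_mismatch && std::is_trivially_copyable_v<T>;
}
```

Under this sketch, `can_use_temp_copy<PlainAggregate>(true)` is true, while `can_use_temp_copy<std::exception_ptr>(true)` is false, so the latter falls through to the other path as the commit message describes.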

No behavior change.

The user can now manually toggle the light or dark theme instead of waiting for the system theme to change. Also fixes a typo that caused some overflow issues even when there was no content to cause an overflow.

…lvm#182896)

Current tooling for the WebAssembly component model uses import modules and names such as `$root` and `[thread-index]`. Importing these from assembly files requires support for non-valid identifiers in `.import_name` and `.import_module` directives. This PR adds support for specifying those as strings, e.g.:

```asm
.import_module __wasm_component_model_builtin_thread_index, "$root"
.import_name __wasm_component_model_builtin_thread_index, "[thread-index]"
```

The tests for mlir-reduce are currently scattered. To centralize them, I added the split-input-file feature to mlir-reduce. It is part of llvm#184974.

…ass` switch (llvm#185072) This removes the `wasm-disable-fix-irreducible-control-flow-pass` switch. It was originally added in llvm#67715 as a way to avoid the potentially absurd compile times the pass used to bring. However with the successful merge of llvm#184441, the pass itself has been fixed to avoid this issue. Given that, it is no longer necessary nor desirable to keep this switch.

…184902)

This is a prerequisite for full ARM64 Windows ASan support. The runtime interception changes needed to make ASan functional end-to-end on ARM64 Windows will be opened separately.

Motivated by microsoft/STL#6095 (more specifically [this reference to clang-cl](microsoft/STL#6095.))

The latest MSVC toolset includes ARM64 AddressSanitizer support. This change adds AArch64 to the Windows 64-bit shadow mapping condition when compiling with `-fsanitize=address` with `clang-cl`. Without this, consumers on Windows who target ARM64 with `clang-cl -fsanitize=address` and then link with `link.exe` will see this at runtime:

```text
ERROR: AddressSanitizer: access-violation on unknown address ...
```

since the shadow memory offset is not properly assigned. Windows ARM64 uses the same dynamic shadow allocation strategy as x64 via `__asan_shadow_memory_dynamic_address`.

…4898) We have to materialize `fir.box` before adding a `fir.convert` to a memref type. Otherwise we get: `'fir.convert' op invalid type conversion'!fir.box<!fir.array<?xi32>>' / 'memref<?xi32, strided<[?], offset: ?>>'`

…5101) Reverts llvm#182532 to unblock CI. The original patch causes some test failures related to undef bits, as it incorrectly assumes `std::uniform_int_distribution` returns the same result with different C++ stdlib vendors.

When an op's assembly format prints an attribute via `printStrippedAttrOrType`, two independent space-emission mechanisms would fire: the op format generator emits a space before each argument, and the attribute's generated `print` method also emits a leading space (`shouldEmitSpace` initialized to true). This caused double spaces like `gpu.shuffle xor`. The usual workaround for this was to add double backticks to consume the leading space. Fixed by removing the leading space from generated attr/type `print()` methods and compensating in the print dispatcher by conditionally adding a space between the mnemonic and `print` call when the format starts with a name or keyword rather than punctuation. Also remove some workarounds for the double-spacing in op formats and fix tests that now don't have leading spaces. Assisted-by: claude

…lvm.matrix.multiply` (llvm#184882)

Fixes llvm#99138

- Defines a `__builtin_hlsl_mul` clang builtin in `Builtins.td`.
- Links the `__builtin_hlsl_mul` clang builtin with `hlsl_alias_intrinsics.h` under the name `mul` for matrix cases
- Implement scalar and vector elementwise multiplication cases of the `mul` function in `hlsl_intrinsics.h` and `hlsl_intrinsic_helpers.h`
- Adds sema for `__builtin_hlsl_mul` to `CheckBuiltinFunctionCall` in `SemaHLSL.cpp`
- Adds codegen for `__builtin_hlsl_mul` to `EmitHLSLBuiltinExpr` in `CGHLSLBuiltins.cpp`
  - Vector-vector cases lower to `dot` (except double vectors, which expand to scalar multiply-adds).
  - Matrix-matrix, matrix-vector, and vector-matrix multiplication lower to the `llvm.matrix.multiply` intrinsic
- Adds codegen tests to `clang/test/CodeGenHLSL/builtins/mul.hlsl`
- Adds sema tests to `clang/test/SemaHLSL/BuiltIns/mul-errors.hlsl`
- Implements lowering of the `llvm.matrix.multiply` intrinsic to DXIL in `DXILIntrinsicExpansion.cpp`

Note: Currently the SPIRV backend does not support row-major matrix memory layouts when lowering matrix multiply, and just assumes column-major layout. Therefore this PR also makes the DirectX backend only assume column-major layout. Implementing support for row-major order shall be done in a separate PR. (Tracked by llvm#184906)

This PR locally passes the `mul` offload tests in both DirectX 12 and Vulkan: llvm/offload-test-suite#941

Assisted-by: claude-opus-4.6

…nsfer op (llvm#185106) Add an attribute to signal the presence of managed or unified symbols in the data transfer. In some cases, the presence of such symbols requires inserting synchronization. Adding the attribute to the op during lowering makes such data transfers easier to recognize.

llvm#185078) They are not allowed by the HW.

This commit simplifies the cumbersome process of swapping the respective layout members for `__split_buffer` and `vector`.

…ayout (llvm#184280) Fixes llvm#183127 and llvm#184371 This PR makes the matrix truncation cast implementation use the new matrix flattened index helper functions introduced by llvm#182904 so that it reads elements from the source matrix using the default matrix memory layout instead of always assuming column-major order. This PR also fixes a bug where matrix truncation truncated the wrong elements. Assisted-by: claude-opus-4.6

Fix more typos in the AArch64 codebase using the https://github.com/crate-ci/typos Rust package. commit-id:9f4d826d Reviewers: davemgreen Pull Request: llvm#183086

Fix more typos in the AArch64 codebase using the https://github.com/crate-ci/typos Rust package. commit-id:33a1bb8d Reviewers: davemgreen Reviewed By: davemgreen Pull Request: llvm#183087

Reverts llvm#180102

This lets us find functions where we pessimize codegen by removing lifetimes. Reviewers: vitalybuka Reviewed By: vitalybuka Pull Request: llvm#183858

This PR replaces the Get*CallbackAtIndex pattern in the PluginManager with returning a snapshot of callbacks that the caller can iterate over using a range-based for loop. This is a continuation of llvm#184452, which added thread safety by using snapshots. However, that introduced a bunch of unnecessary copies, which are largely eliminated again by getting the snapshot once when gathering all the callbacks, rather than on each iteration when querying a plugin for a given index. It also eliminates the possibility of the snapshot changing underneath you when iterating over the plugins. This change was largely mechanical and I used Claude to do the menial work of updating the signatures and call sites.
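The snapshot idea can be sketched in a few lines. This is a simplified stand-in, not LLDB's actual `PluginManager`: `CallbackRegistry`, `CreateCallback`, `GetSnapshot`, and `sum_all` are hypothetical names, and `std::vector` is used only to keep the example dependency-free.

```cpp
#include <mutex>
#include <vector>

// Hypothetical callback type standing in for a plugin create callback.
using CreateCallback = int (*)();

class CallbackRegistry {
public:
  void Register(CreateCallback cb) {
    std::lock_guard<std::mutex> guard(m_mutex);
    m_callbacks.push_back(cb);
  }

  // Copy the callbacks once, under the lock. Callers then iterate over
  // their own copy without holding the lock, and the sequence they see
  // cannot change mid-iteration.
  std::vector<CreateCallback> GetSnapshot() const {
    std::lock_guard<std::mutex> guard(m_mutex);
    return m_callbacks;
  }

private:
  mutable std::mutex m_mutex;
  std::vector<CreateCallback> m_callbacks;
};

int one() { return 1; }
int two() { return 2; }

// One snapshot for the whole loop, instead of one lookup (and one copy)
// per index as with a Get*CallbackAtIndex style API.
int sum_all(const CallbackRegistry &registry) {
  int total = 0;
  for (CreateCallback cb : registry.GetSnapshot())
    total += cb();
  return total;
}
```

The temporary returned by `GetSnapshot()` lives for the duration of the range-based for loop, so a single copy serves every iteration.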

…e Scanning (llvm#183396)

This PR fixes two issues of the in-memory buffer we use for the input file when a dependency scanner performs by-name queries.

First, it renames the buffer. The temporary file was named `ScanningByName-%%%%%%%%.input`, which leads to weird diagnostics such as

```
ScanningByName-2d42a1e9.input:1:1: fatal error: could not build module 'X'
```

This PR changes the name of the file buffer, so we get diagnostics such as

```
module-include.input:1:1: fatal error: could not build module 'X'
```

which is more indicative.

Additionally, this PR fixes a bug where the source location may overflow the temporary buffer by creating a 64k empty string which the temporary buffer occupies. When the source location overflows, the diagnostics could point to some random file that comes after the fake file and is incorrect. Currently, the maximum number of unique names from Apple's SDKs is around 3000. A 64k buffer per dependency scanning worker gives us around 20x capacity per worker (which scans fewer names than 3000 when the scanning is done in parallel). A fatal error is added to catch overflows.

…-bit targets (llvm#181288)

This PR optimizes 32-bit unsigned division by constants when the magic constant is 33 bits (IsAdd=true case in UnsignedDivisionByConstantInfo) on 64-bit targets.

## Overview

Compiler optimization for constant division of `uint32_t` variables (such as `x / 7`) is based on the method proposed by Granlund and Montgomery in 1994 (hereafter referred to as the GM method). However, the GM method for the IsAdd=true case was optimized for 32-bit CPUs, not 64-bit CPUs. This patch provides optimizations specifically for 64-bit CPUs (such as x86_64 and Apple M-series). A simple benchmark demonstrates over 60% speedup on both Intel Xeon and Apple M4 processors.

## The GM Method

The GM method for `x / 7` can be expressed in C code as follows, where the constants `c` and `a` are magic numbers determined by the divisor:

```cpp
uint32_t udiv_original(uint32_t x) {
  // Widen before multiplying so the product is computed in 64 bits.
  uint64_t v = uint64_t(x) * c;
  v >>= 32;
  uint32_t t = uint32_t(x) - uint32_t(v);
  t >>= 1;
  t += uint32_t(v);
  t >>= a - 33;
  return t;
}
```

For example, division by 7 on x86_64 generates 7 instructions:

```asm
movl %edi, %eax
imulq $613566757, %rax, %rax
shrq $32, %rax
subl %eax, %edi
shrl %edi
addl %edi, %eax
shrl $2, %eax
```

## Proposed Solution

This patch generates the following optimized code:

```cpp
uint32_t udiv_optimized(uint32_t x) {
  uint128_t v = uint128_t(x) * ((c + 0x100000000) << (64 - a));
  return uint32_t(v >> 64);
}
```

Since a 64-bit right shift of a 128-bit variable extracts the upper 64 bits, this code eliminates the need for shifts after multiplication.

The implementation pre-shifts the 33-bit magic constant `c = 2^32 + Magic` left by `(64-a)` bits and uses the high 64 bits of a 64 x 64 -> 128 bit multiplication directly. This eliminates the add/sub/shift sequence.

After optimization, division by 7 becomes 4 instructions (or 3 with BMI2):

```asm
# Standard (4 instructions)
movl %edi, %eax
movabsq $2635249153617166336, %rcx
mulq %rcx
movq %rdx, %rax

# With BMI2 (3 instructions)
movl %edi, %edx
movabsq $2635249153617166336, %rax
mulxq %rax, %rax, %rax
```
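Both sequences can be checked against plain division for the divisor 7. The sketch below hard-codes the constants that appear in the listings above (`613566757`, a final shift of 2, and `2635249153617166336`, which equals `(2^32 + 613566757) << 29`). The function names are hypothetical, and `unsigned __int128` is a GCC/Clang extension assumed to be available on the 64-bit host.

```cpp
#include <cstdint>

// Classic GM sequence for x / 7: multiply, then sub/shift/add/shift.
uint32_t udiv7_gm(uint32_t x) {
  uint64_t v = (uint64_t(x) * 613566757ull) >> 32;
  uint32_t t = x - uint32_t(v);
  t >>= 1;
  t += uint32_t(v);
  t >>= 2;
  return t;
}

// 64-bit variant: a single 64 x 64 -> 128 bit multiply by the pre-shifted
// 33-bit constant, keeping only the high 64 bits of the product.
uint32_t udiv7_wide(uint32_t x) {
  const uint64_t magic = 2635249153617166336ull; // (2^32 + 613566757) << 29
  unsigned __int128 v = (unsigned __int128)x * magic;
  return uint32_t(v >> 64);
}

// True when both sequences agree with ordinary division for this input.
bool agrees_with_division(uint32_t x) {
  return udiv7_gm(x) == x / 7 && udiv7_wide(x) == x / 7;
}
```

Calling `agrees_with_division` on boundary values such as 6, 7, and `0xFFFFFFFF` is a quick sanity check; it does not replace the exhaustive validation a compiler change would need.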

When input is zero or sign extended.

Most of the plugins have only a small number of instances. Use `llvm::SmallVector` instead of `std::vector`. Depends on llvm#184837

The default move constructor wasn't nulling out the callbacks. Combined with the fact that llvm::sys::DynamicLibrary has no explicit move constructor and hence library.isValid() still returned true after having moved-from, we would end up calling plugin_term_callback() when destroying the moved-from PluginInfo, calling it prematurely.
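The failure mode and one way to avoid it can be sketched as follows. This is a simplified model under assumptions, not LLDB's actual `PluginInfo`: `LibraryHandle` is a hypothetical stand-in for a handle type whose "move" is really a copy, and the explicit move constructor shown is one plausible shape for a fix.

```cpp
#include <utility>

using TermCallback = void (*)();

// Hypothetical stand-in for a library handle without an explicit move
// constructor: after a "move", the source still reports isValid().
struct LibraryHandle {
  void *handle = nullptr;
  bool isValid() const { return handle != nullptr; }
};

struct PluginInfo {
  LibraryHandle library;
  TermCallback plugin_term_callback = nullptr;

  PluginInfo(LibraryHandle lib, TermCallback term)
      : library(lib), plugin_term_callback(term) {}

  // Explicit move: take the callback and null it out in the source, so the
  // moved-from object's destructor has nothing left to call.
  PluginInfo(PluginInfo &&other) noexcept
      : library(other.library),
        plugin_term_callback(
            std::exchange(other.plugin_term_callback, nullptr)) {
    other.library = LibraryHandle{};
  }

  ~PluginInfo() {
    if (library.isValid() && plugin_term_callback)
      plugin_term_callback();
  }
};

// Counter used to observe how often the termination callback runs.
int g_term_calls = 0;
void count_term() { ++g_term_calls; }
```

With a defaulted move constructor in this model, both the moved-from and the moved-to object would invoke the callback on destruction; with the explicit version, only the final owner does.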

This header has a case-sensitivity syntax error; delete it, since it is unused.

…put (llvm#185061) PR llvm#182083 forgot to switch over to use the newly added `DebugMapFilter` when parsing `--allow/--disallow` YAML input. It was still using `ObjectFileList`/`ObjectFileEntry`, which was added initially in the same PR and was later intended to be replaced by `DebugMapFilter`. This patch switches over to use `DebugMapFilter`, adds necessary YAML traits, and removes `ObjectFileList`/`ObjectFileEntry`.

Before this patch, a moved-from object was left in an inconsistent state: for some types the move reset the contents and switched to T_Null, while for others it preserved the type and value. This patch sets T_Null in all cases. When setting T_Null, the value must also be destroyed, which matters for types like std::string: with ASan, the SSO buffer must be unpoisoned. Fixes false container overflows after llvm#184693: https://lab.llvm.org/buildbot/#/builders/169/builds/20655/steps/11/logs/stdio

Other individual feature tests appear before CPU tests, so this moves this test there to make it consistent.

…test coverage for llvm#184033 (llvm#185128) Some were just missing vector / demandedelts handling; others were missing entirely.

…ules (llvm#184742) This PR enhances insert_strided_slice layout rules to handle slice layout and adjust the layout to fit the src shape. It adds dropDims as layout utility function.

Reported by buildbot: https://lab.llvm.org/buildbot/#/builders/55/builds/25078

Eventually, we want clang-doc to support arena allocation, but the widespread use of owning pointers in the data types prevents this. Rather than have wide scale refactoring, we can introduce a type alias that can be swapped out atomically to switch from smart pointers to raw pointers. This is the first of several refactorings that are intended to make the transition simpler.

…ds (llvm#184769) Explain how to use the `-std` flag in clang-tidy tests and reorganize the content on C++ pitfalls into a new subsection for better readability. Related discussion: llvm#184741 As for AI usage: the documentation is partially rephrased by Gemini 3.

z1-cciauto · 2026-03-07T01:45:51Z

PSDB Link: https://compiler-ci.amd.com/job/compiler-psdb-amd-staging/4517

ronlieb · 2026-03-07T04:48:42Z

!PSDB

z1-cciauto · 2026-03-07T04:50:28Z

PSDB Link: https://compiler-ci.amd.com/job/compiler-psdb-amd-staging/4520

VigneshwarJ and others added 30 commits March 6, 2026 12:41

[DebugInfo][Reassociate] Use debug records instead of intrinsics in t…

c3f0a2c

…est (llvm#185031) As per llvm#182730 (review)

builtins: adjust FP80 source management (llvm#183871)

57f1ec6

We would previously include the FP80 sources into the Windows build if we built with the GNU driver rather than the `cl` driver.

[LoopFusion] remove else after return (NFC) (llvm#184993)

9cc615a

A bit of a small nitpick, close it if unnecessary. (clang-tidy warnings)

[Bazel] Fixes 3da28bf (llvm#185082)

216a3f1

This fixes 3da28bf. Co-authored-by: Google Bazel Bot <google-bazel-bot@google.com>

[AMDGPU] fix asyncmark soft waitcnt bug (llvm#184851)

918d0fe

Asyncmarks record the current wait state and so should not allow waitcnts that occur after them to be merged into waitcnts that occur before.

[clang-doc] Fix benchmark not compiling (llvm#185065)

2cb01dc

CI didn't flag that the benchmark was using the outdated Ctx call when landing the Mustache MD patch since this benchmark isn't tested. Also added missing libraries in CMake that prevented me from building the benchmark locally.

[lldb][bytecode] Disable bytecode.test on windows (llvm#185096)

a8783dc

The test is failing on the lldb-x86_64-win buildbot.

builtins: Make cmake formatting self-consistent aftr llvm#183871

4d53c42

No behavior change.

[clang-doc] Add button toggle for light/dark theme (llvm#181587)

eada0f5

The user can now manually toggle the light or dark theme instead of waiting for the system theme to change. Also fixes a typo that caused some overflow issues even when there was no content to cause an overflow.

[mlir][reducer] Add split-input-file to mlir-reduce (llvm#184970)

a99d4a6

The tests for mlir-reduce are currently scattered. To centralize them, I added the split-input-file feature to mlir-reduce. It is part of llvm#184974.

[gn] port 3da28bf (DiagnosticStableIDs)

bc55e5e

[AMDGPU] Disable negative imm offset for async load/store instructions (

ab5844d

llvm#185078) They are not allowed by the HW.

[libcxx] Add __split_buffer::__swap_layouts (llvm#180102)

65f39a1

This commit simplifies the cumbersome process of swapping the respective layout members for `__split_buffer` and `vector`.

[AArch64][GlobalISel] Add more gisel test coverage. NFC

89d6936

Icohedron and others added 22 commits March 6, 2026 14:16

[AArch64] Fix more typos (NFC)

43f7838

Fix more typos in the AArch64 codebase using the https://github.com/crate-ci/typos Rust package. commit-id:9f4d826d Reviewers: davemgreen Pull Request: llvm#183086

[ARM] Fix more typos (NFC)

cf21ea9

Fix more typos in the AArch64 codebase using the https://github.com/crate-ci/typos Rust package. commit-id:33a1bb8d Reviewers: davemgreen Reviewed By: davemgreen Pull Request: llvm#183087

Revert "[libcxx] adds __split_buffer::__swap_layouts" (llvm#185120)

01a9705

Reverts llvm#180102

[HWASan] add optimization remark for supported lifetimes

610ed83

This lets us find functions where we pessimize codegen by removing lifetimes. Reviewers: vitalybuka Reviewed By: vitalybuka Pull Request: llvm#183858

[RISCV][P-ext] Custom legalize i64 SHL to WSLL(I)/WSLA(I) (llvm#185079)

3164d54

When input is zero or sign extended.

[lldb] Use llvm::SmallVector in the PluginManager (NFC) (llvm#184912)

541d546

Most of the plugins have only a small number of instances. Use `llvm::SmallVector` instead of `std::vector`. Depends on llvm#184837

[bolt][NFC] Remove unused ReorderUtils.h (llvm#184642)

f540ad6

This header has a case-sensitivity syntax error; delete it, since it is unused.

[NewPM] Port for AArch64A53Fix835769 (llvm#184965)

e435e07

[WebAssembly] Move a wide-arithmetic test (NFC) (llvm#184950)

16cf423

Other individual feature tests appear before CPU tests, so this moves this test there to make it consistent.

[X86] known-never-zero.ll - add ROTL/ROTR/BITREVERSE/BSWAP/CTPOP/ABS …

38b47c0

…test coverage for llvm#184033 (llvm#185128) Some were just missing vector / demandedelts handling; others were missing entirely.

[MLIR][XeGPU] Enhancing insert_strided_slice layout setup and infer r…

fe11a43

…ules (llvm#184742) This PR enhances insert_strided_slice layout rules to handle slice layout and adjust the layout to fit the src shape. It adds dropDims as layout utility function.

[mlir][test] Fix memory leak after llvm#184202 (llvm#185142)

5bc0501

Reported by buildbot: https://lab.llvm.org/buildbot/#/builders/55/builds/25078

merge main into amd-staging

bf535ae

ronlieb requested review from a team and dpalermo March 7, 2026 01:43

ronlieb requested a review from fabianmcg as a code owner March 7, 2026 01:43

ronlieb removed the request for review from fabianmcg March 7, 2026 01:44

dpalermo approved these changes Mar 7, 2026

ronlieb wants to merge 55 commits into amd-staging from amd/merge/upstream_merge_20260306184437

20 participants
