gpl: opt-in HPWL GPU acceleration via Kokkos #10370
Conversation
Move NesterovBaseCommon::getHpwl()'s OpenMP loop body from nesterovBase.cpp into a dedicated translation unit (src/hpwl.cpp). Behavior is unchanged — same loop, same threading, same return. This isolates the HPWL kernel to a single TU so future work can swap the implementation as a build-system concern (without leaking preprocessor branches into the consumer-facing headers or sources).

Signed-off-by: minjae <develop.minjae@gmail.com>
Add an ENABLE_GPU project option (default OFF) and a Kokkos parallel_reduce HPWL kernel that takes the place of the OpenMP loop when the option is on. The two implementations live in sibling translation units under src/gpl/src/ and src/gpl/src/gpu/; CMake links exactly one based on ENABLE_GPU, so the consumer-facing headers stay free of preprocessor branches.

GPU kernel:
* src/gpu/hpwl.cpp implements NesterovBaseCommon::getHpwl() with Kokkos parallel_for + parallel_reduce. Per-net bounding boxes are computed serially within each kernel iteration (sgizler 80b04e1 pattern) so the reduction is bit-identical to the CPU loop on both Serial and CUDA backends.
* GNet::setBox() lets the GPU path push *already-computed* bboxes back into the host objects without re-iterating pins on the host (which would defeat the point of the GPU kernel), so that subsequent gNet->lx() / ly() / ux() / uy() consumers stay consistent with the CPU updateBox() side effect.
* FP contraction is disabled (-ffp-contract=off, --fmad=false) for the GPU kernels to keep them bit-stable across compilers (mirrors sgizler 80b04e1 / 678a691 / Issue The-OpenROAD-Project#10336).
* Kokkos is initialized lazily on the first GPU HPWL call and finalized via std::atexit, so the existing openroad main() stays untouched and the OpenMP-thread-binding warning is suppressed (golden-file regression-friendly).

Build system:
* root CMakeLists.txt probes the installed Kokkos (CUDA / HIP / SYCL / host-only) and switches CMake's language and CUDA arch to match. Modules (here: gpl) only key off ENABLE_GPU and Kokkos_ENABLE_*.
* src/gpl/src/gpu/CMakeLists.txt encapsulates the GPU build settings; target_sources / set_source_files_properties use TARGET_DIRECTORY so source-file properties set in the subdirectory are visible to gpl_lib (created in the parent), and the parent src/ is added as a PRIVATE include so kernels can use plain "..h" lookups for private gpl headers.
Build robustness across environments:
* nvcc is auto-discovered from CUDA_HOME / standard install prefixes when CMake's enable_language(CUDA) does not see it on PATH (IDE-launched configures whose bundled CMake does not inherit the shell PATH; CI runners that scope environment per-step). Kokkos's find_package already located the toolkit at this point, so the legacy "No CMAKE_CUDA_COMPILER could be found" failure mode is avoided.
* When the system C++ compiler is gcc 13+, an older g++ (g++-12 / g++-11) is auto-pinned as the CUDA host compiler. This sidesteps nvcc < 13's inability to parse glibc 2.38+'s _Float128 type that ships with gcc 13's standard library headers; the system C++ compiler stays unchanged for non-CUDA TUs. A WARNING surfaces clear remediation when neither fallback is installed.
* nvcc 12.8 cannot parse fmt 11's nontype-template-parameter user-defined literals (fmt/bundled/format.h: operator""_a with fixed_string). FMT_USE_NONTYPE_TEMPLATE_ARGS=0 is defined for CUDA TUs so they fall back to fmt's legacy literal implementation; CXX TUs continue with the modern path.

Default OFF: existing CPU-only builds and CI are unaffected.

Signed-off-by: minjae <develop.minjae@gmail.com>
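As a rough sketch of the two CMake mechanisms described in this commit (hedged: `GPU_NVCC` and `GPU_HOST_GXX` are invented variable names, the exact probe logic lives in the commit itself, and a later commit in this PR narrows the host-compiler pin to nvcc < 13):

```cmake
# nvcc auto-discovery: enable_language(CUDA) cannot see nvcc on PATH in
# IDE-launched or per-step-scoped CI configures, so probe CUDA_HOME and
# the standard install prefix first.
if(NOT CMAKE_CUDA_COMPILER)
  find_program(GPU_NVCC nvcc
               HINTS "$ENV{CUDA_HOME}/bin"
               PATHS /usr/local/cuda/bin)
  if(GPU_NVCC)
    set(CMAKE_CUDA_COMPILER "${GPU_NVCC}" CACHE FILEPATH "CUDA compiler" FORCE)
  endif()
endif()

# gcc 13+ host-compiler fallback: nvcc < 13 cannot parse glibc 2.38+'s
# _Float128 in gcc 13's standard library headers. Only the CUDA host
# compiler changes; non-CUDA TUs keep the system C++ compiler.
if(CMAKE_CXX_COMPILER_ID STREQUAL "GNU"
   AND CMAKE_CXX_COMPILER_VERSION VERSION_GREATER_EQUAL 13)
  find_program(GPU_HOST_GXX NAMES g++-12 g++-11)
  if(GPU_HOST_GXX)
    set(CMAKE_CUDA_HOST_COMPILER "${GPU_HOST_GXX}"
        CACHE FILEPATH "CUDA host compiler" FORCE)
  else()
    message(WARNING "gcc 13+ detected but no g++-12 / g++-11 fallback found; "
                    "install one or set -DCMAKE_CUDA_HOST_COMPILER manually.")
  endif()
endif()
```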
A GoogleTest binary that links gpl_lib + Kokkos and exercises the GPU HPWL backend's lifecycle end-to-end. Registered only when ENABLE_GPU=ON so default builds and CI are unaffected.
* EmptyDesign covers the zero-net early exit and verifies Kokkos can be initialized inside a unit-test process without a NesterovBase fixture.
* SingleNetThreePins / RandomMatchesCpu are skeletons that GTEST_SKIP for now — they require a synthetic GNet/GPin builder helper that does not yet exist outside the integration harness.

Bit-exactness against the OpenMP path is asserted end-to-end by the existing gpl integration tests (ctest -R gpl) when the same tree is built with ENABLE_GPU=OFF vs ON; this unit test is intentionally narrower.

Signed-off-by: minjae <develop.minjae@gmail.com>
The link-time dispatch refactor moved the OpenMP HPWL body into a new src/hpwl.cpp translation unit, but only the CMake build was updated. Bazel's //src/gpl srcs list now also enumerates that file so the macOS-Bazel CI link of //:openroad resolves NesterovBaseCommon::getHpwl().

Drop hpwl_gpu_test.cc as well: its single active test exercised a local CPU helper, not the GPU kernel, and the two GTEST_SKIP placeholders pending a synthetic GNet/GPin builder helper are not the prevailing OpenROAD style for unit tests. Bit-exactness across the OpenMP and Kokkos paths is already covered end-to-end by the gpl integration tests (ctest -R gpl, 60/60 PASS) when the same tree is built with ENABLE_GPU=OFF vs ON; a unit-level CPU/GPU comparison can land alongside the future helper.

Signed-off-by: minjae <develop.minjae@gmail.com>
The nested src/gpl/src/gpu/CMakeLists.txt that the previous commit
introduced relied on set_source_files_properties' TARGET_DIRECTORY
option (CMake 3.18+) so source-file LANGUAGE=CUDA could reach the
gpl_lib target created in the parent directory. This conflicts with
the project's cmake_minimum_required(VERSION 3.16).
Inline the GPU build logic back into src/gpl/CMakeLists.txt and treat
src/gpu/ as a file-layout subdirectory only. This matches existing
single-file-deep gpl modules (src/gpl has no other nested CMake
files); larger sub-modules elsewhere in the tree (e.g. src/odb/src/db)
own their own targets and CMakeLists, which is the right pattern
once GPU file count grows but is over-engineering for one
translation unit. CPU sources continue to find their headers via
the compiler's same-directory default; sources under src/gpu/ get
src/ added as a PRIVATE include so plain "..h" lookups resolve
private gpl headers.
Two further build-system tightenings in the same commit, both
narrowing previously-broad behavior:
* CMAKE_CUDA_ARCHITECTURES "native" is dropped as the implicit
fallback: it required CMake 3.24+ and silently chose the configure
host's GPU when the Kokkos package did not pin an architecture.
When neither Kokkos_CUDA_ARCHITECTURES nor CMAKE_CUDA_ARCHITECTURES
is provided, FATAL_ERROR now surfaces clear remediation
("set -DCMAKE_CUDA_ARCHITECTURES=89 / 120 / ... or rebuild Kokkos
with the target architecture baked in").
* The gcc 13+ -> g++-12/g++-11 host compiler pin is now gated on
nvcc < 13, probed via execute_process(nvcc --version) before
enable_language(CUDA). Pairings that already work (gcc 13 + nvcc
13) no longer trigger the pin, and CMAKE_CUDA_HOST_COMPILER cache
mutation is reserved for the actually-broken combination.
Signed-off-by: minjae <develop.minjae@gmail.com>
Three small cleanups to the Kokkos HPWL kernel that came out of upstream review preparation:
* ensureKokkosInitialized() is wrapped in std::call_once so the is_initialized() / initialize() / atexit() sequence stays safe if a future caller drops the master-thread invariant. The per-call cost is one acquired flag check.
* Drop the OMP_PROC_BIND / OMP_PLACES setenv pair from the GPU TU. Modifying process-global environment from a library init is invasive (every other OpenMP user in the openroad process picks up the change); Kokkos's set_disable_warnings(true) already silences the "OMP_PROC_BIND not set" warning that the setenv was working around. Verified end-to-end on cpu-server with ENABLE_GPU=ON ctest -R gpl (60/60 PASS, golden files unaffected).
* Drop the omp.h include and the assert(omp_get_thread_num() == 0) prologue. The assert was meaningful in the OpenMP path (where getHpwl() opens its own #pragma omp parallel and would be UB if called from inside another OpenMP region); the GPU TU runs no OpenMP region, and including omp.h in a TU compiled by nvcc relies on libgomp's stub omp.h being present, which is fragile across toolchains.

Also clarifies GNet::setBox's caller invariant: the values must equal what updateBox() would have produced from the same pin set; the function performs no validation, and exists only so the GPU path can avoid re-iterating the pin list on the host.

Signed-off-by: minjae <develop.minjae@gmail.com>
…issing

Replace the bare find_package(Kokkos REQUIRED) with a QUIET probe followed by an explicit FATAL_ERROR that names Kokkos_ROOT, CMAKE_PREFIX_PATH, and the upstream Kokkos repository, and points forward to the planned etc/DependencyInstaller.sh -gpu option. CMake's default "Could not find a package configuration file provided by 'Kokkos'" message is technically correct, but it does not explain what Kokkos is, why ENABLE_GPU=ON needs it, or the most common cache variable a user reaches for first (Kokkos_ROOT, not Kokkos_DIR). The new message is a no-op on the working path and only fires when the configure would have failed anyway.

Signed-off-by: minjae <develop.minjae@gmail.com>
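A sketch of the probe shape this describes (message text abridged and paraphrased from the commit description, not the commit's exact wording):

```cmake
# QUIET probe, then an explicit FATAL_ERROR with actionable remediation.
find_package(Kokkos QUIET)
if(NOT Kokkos_FOUND)
  message(FATAL_ERROR
      "ENABLE_GPU=ON requires an installed Kokkos package.\n"
      "Point CMake at it with -DKokkos_ROOT=<prefix> or CMAKE_PREFIX_PATH, "
      "or build it from the upstream Kokkos repository.\n"
      "(Planned: etc/DependencyInstaller.sh -gpu will automate this.)")
endif()
```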
The opt-in GPU block had grown to ~110 lines in the root CMakeLists.txt (Kokkos discovery, nvcc auto-discovery, CUDA host compiler probing, architecture handling, fmt workaround, language enablement). Move it into a dedicated module file under cmake/, alongside the existing FindTCL.cmake and GetGitRevisionDescription.cmake helpers. The root CMakeLists.txt now keeps a single high-level entry point under if(ENABLE_GPU): include(KokkosBackend). Default-OFF builds do not load the file at all. Behavior is unchanged on both paths; the module is verified against the same fresh ENABLE_GPU=ON build and the Kokkos-not-found path that triggered the FATAL_ERROR introduced in the previous commit. Signed-off-by: minjae <develop.minjae@gmail.com>
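The resulting root-level entry point, as described, reduces to a few lines (sketch; the option docstring is invented):

```cmake
option(ENABLE_GPU "Build the Kokkos GPU kernels (requires Kokkos)" OFF)

if(ENABLE_GPU)
  # Default-OFF builds never load the module file at all.
  list(APPEND CMAKE_MODULE_PATH "${PROJECT_SOURCE_DIR}/cmake")
  include(KokkosBackend)
endif()
```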
Code Review
This pull request introduces GPU acceleration via Kokkos for the Half-Perimeter Wirelength (HPWL) calculation in the Global Placement (GPL) module. It implements a link-time dispatch mechanism to switch between the existing OpenMP CPU implementation and a new Kokkos-based GPU implementation. Feedback focuses on performance optimizations for the GPU hot path, specifically recommending the reuse of persistent device views to avoid redundant allocations and transfers, and merging kernels to reduce launch overhead and improve cache locality.
```cpp
std::vector<int> h_net_off(n_nets + 1, 0);
for (int i = 0; i < n_nets; ++i) {
  h_net_off[i + 1]
      = h_net_off[i] + static_cast<int>(gNetStor_[i].getGPins().size());
}
const int total_pins = h_net_off[n_nets];

std::vector<int> h_pin_cx(total_pins);
std::vector<int> h_pin_cy(total_pins);
for (int i = 0; i < n_nets; ++i) {
  int off = h_net_off[i];
  for (auto* gPin : gNetStor_[i].getGPins()) {
    h_pin_cx[off] = gPin->cx();
    h_pin_cy[off] = gPin->cy();
    ++off;
  }
}

// ---- 2. Mirror inputs to device ----
using ExecSpace = Kokkos::DefaultExecutionSpace;
Kokkos::View<int*, ExecSpace> d_net_off("hpwl_net_off", n_nets + 1);
Kokkos::View<int*, ExecSpace> d_pin_cx("hpwl_pin_cx", total_pins);
Kokkos::View<int*, ExecSpace> d_pin_cy("hpwl_pin_cy", total_pins);

Kokkos::View<int*, Kokkos::HostSpace, Kokkos::MemoryUnmanaged> h_net_off_view(
    h_net_off.data(), n_nets + 1);
Kokkos::View<int*, Kokkos::HostSpace, Kokkos::MemoryUnmanaged> h_pin_cx_view(
    h_pin_cx.data(), total_pins);
Kokkos::View<int*, Kokkos::HostSpace, Kokkos::MemoryUnmanaged> h_pin_cy_view(
    h_pin_cy.data(), total_pins);

Kokkos::deep_copy(d_net_off, h_net_off_view);
Kokkos::deep_copy(d_pin_cx, h_pin_cx_view);
Kokkos::deep_copy(d_pin_cy, h_pin_cy_view);

// Per-net bbox outputs (kept on device for reduction; mirrored back at end).
Kokkos::View<int*, ExecSpace> d_lx("hpwl_net_lx", n_nets);
Kokkos::View<int*, ExecSpace> d_ly("hpwl_net_ly", n_nets);
Kokkos::View<int*, ExecSpace> d_ux("hpwl_net_ux", n_nets);
Kokkos::View<int*, ExecSpace> d_uy("hpwl_net_uy", n_nets);
```
The current implementation performs multiple host and device memory allocations, data flattening, and H2D transfers on every call to getHpwl(). Since this function is a hot path called frequently during the Nesterov placement loop, these overheads (especially std::vector allocations and cudaMalloc calls hidden behind Kokkos::View constructors) are likely to dominate the computation time.
In accordance with performance guidelines for persistent state, consider the following optimizations:
- Reuse Device Views: Move `d_net_off`, `d_pin_cx`, etc., to be persistent members so they are only allocated once or when the netlist size changes, rather than being treated as temporary scratch buffers.
- Avoid Host Flattening: Maintain pin coordinates on the device throughout the placement loop to avoid the cost of flattening `gNetStor_` and copying to the device in every iteration.
- Reuse Mirror Views: `Kokkos::create_mirror_view` can be called once and reused to avoid host-side allocations during the D2H copy.
References
- When creating worker objects, distinguish between persistent state that must be copied and scratch buffers. Persistent state should be reused to avoid redundant allocations and transfers.
- Performance optimizations for loops are necessary when the containing function is part of a hot path called frequently, such as the Nesterov placement loop.
Thanks for catching this — the per-call allocation pattern is intentional for this PR's scope (pattern-establishing first cut). Persistent device views and amortized H2D transfers will be needed once multiple kernels share state across a Nesterov iteration (WLEN gradient + density gradient + HPWL all reading the same pin coordinates), and that's the natural place to introduce them. I'll revisit this getHpwl() to share the persistent state when the WLEN gradient PR lands.
For this PR, HPWL alone is small relative to placement gradient updates; the alloc overhead, while real, sits below the noise floor of an end-to-end placement run.
```cpp
Kokkos::parallel_for(
    "hpwl_bbox",
    Kokkos::RangePolicy<ExecSpace>(0, n_nets),
    KOKKOS_LAMBDA(const int i) {
      int lx = INT_MAX;
      int ly = INT_MAX;
      int ux = INT_MIN;
      int uy = INT_MIN;
      const int begin = d_net_off(i);
      const int end = d_net_off(i + 1);
      // Serial over pins for determinism (sgizler 80b04e1c1 pattern: do not
      // rely on parallel_reduce ordering even though min/max are commutative
      // — keeps results bit-identical to the CPU updateBox() loop).
      for (int j = begin; j < end; ++j) {
        const int x = d_pin_cx(j);
        const int y = d_pin_cy(j);
        if (x < lx) {
          lx = x;
        }
        if (y < ly) {
          ly = y;
        }
        if (x > ux) {
          ux = x;
        }
        if (y > uy) {
          uy = y;
        }
      }
      d_lx(i) = lx;
      d_ly(i) = ly;
      d_ux(i) = ux;
      d_uy(i) = uy;
    });

// ---- 4. Sum HPWL across nets (int64 reduction → backend-deterministic) ----
int64_t total_hpwl = 0;
Kokkos::parallel_reduce(
    "hpwl_sum",
    Kokkos::RangePolicy<ExecSpace>(0, n_nets),
    KOKKOS_LAMBDA(const int i, int64_t& acc) {
      const int lx = d_lx(i);
      const int ly = d_ly(i);
      const int ux = d_ux(i);
      const int uy = d_uy(i);
      // Dangling net (no pins): GNet::getHpwl() returns 0 in this case.
      if (ux < lx) {
        return;
      }
      acc += static_cast<int64_t>(ux - lx) + static_cast<int64_t>(uy - ly);
    },
    total_hpwl);
```
The hpwl_bbox and hpwl_sum kernels can be merged into a single parallel_reduce operation. This would reduce kernel launch overhead and improve cache locality by processing each net's pins and contributing to the total HPWL sum in a single pass. This is a recommended optimization given that this function is called repeatedly within the placement loop.
References
- A performance optimization for a loop is necessary if the containing function is called frequently (hot path).
Good observation. One nuance to flag: the kernels are split because hpwl_bbox writes per-net bboxes to four View<int*> outputs (mirrored back to the host via GNet::setBox() so downstream consumers like routeBase and timing-driven weighting see consistent gNet->lx()/ly()/ux()/uy()), while hpwl_sum only produces the reduction scalar. A fused single-pass parallel_reduce is feasible but would need a custom reducer that emits both the per-net bboxes and the running sum total in one launch.
Will fold this in once the kernel-launch pattern stabilizes across the WLEN / density kernels — at that point the right shape (custom reducer vs. two passes vs. a Kokkos::TeamPolicy two-level decomposition) will be more obvious. Keeping the simpler two-kernel form here so the reviewable surface stays focused on the dispatch shape.
```cpp
// Determinism: integer arithmetic; bit-exact across Kokkos backends
// (Serial / OpenMP / Threads / CUDA) and against the OpenMP CPU loop.

#include <Kokkos_Core.hpp>
```
warning: 'Kokkos_Core.hpp' file not found [clang-diagnostic-error]
#include <Kokkos_Core.hpp>
^

```cpp
#include <Kokkos_Core.hpp>
#include <climits>
#include <cstdint>
#include <cstdlib>
```
warning: included header cstdint is not used directly [misc-include-cleaner]
```cpp
#include <cstdlib>
#include <mutex>
#include <vector>
```
warning: included header vector is not used directly [misc-include-cleaner]
```cpp
#include <cstdint>

#include "nesterovBase.h"
#include "omp.h"
```
warning: 'omp.h' file not found [clang-diagnostic-error]
#include "omp.h"
^
Note on the Clang-Tidy CI failures

The four diagnostics surfaced by the Clang-Tidy job on this PR are environmental rather than source bugs: the first two come from the Clang-Tidy job not having Kokkos / OpenMP headers on its include path; #5352 sees the same diagnostics from the same toolchain. The cascading "is not used directly" warnings on `<cstdint>` and `<vector>` are downstream of those same missing headers.

Happy to follow up if there's an upstream way to surface Kokkos to Clang-Tidy's effective include set (via
@calewis review?
Summary
This is the first PR in an incremental series moving GPU-amenable hot paths in `src/gpl/` to Kokkos behind a single `ENABLE_GPU` opt-in flag. Each PR (HPWL here, then WLEN gradient / density gradient / Poisson FFT / `etc/DependencyInstaller` wiring) lands independently and follows the same link-time dispatch pattern. Direct continuation of the discussion in #5352.

This PR adds an `ENABLE_GPU` project option (default OFF) and a Kokkos `parallel_reduce` HPWL kernel sitting in a sibling translation unit (`src/gpl/src/gpu/hpwl.cpp`) next to the existing OpenMP loop (`src/gpl/src/hpwl.cpp`). CMake links exactly one based on `ENABLE_GPU`; no preprocessor branching in consumer code. HPWL was chosen as the first kernel because integer-only arithmetic makes bit-exactness trivial to audit while the dispatch shape is being settled.

Type of Change
Impact
- Default builds (`ENABLE_GPU=OFF`) are byte-identical to master; no Kokkos or CUDA dependency; CMake CI and Bazel `//:openroad` unaffected.
- `ENABLE_GPU=ON` requires a Kokkos package on the host. CMake discovers Kokkos, switches on the matching language, and links the Kokkos kernel into `gpl_lib`. Implementation details (lazy lifecycle, `GNet::setBox`, determinism flags, host-compiler/nvcc compatibility helpers) are in commit 2's body and `cmake/KokkosBackend.cmake`.
- The Kokkos kernel is bit-exact against the OpenMP path: integer arithmetic, serial per-net bbox reduction, `int64_t` cross-net sum. `ctest -R gpl` is identical between OFF and ON builds (60/60 PASS each).

Verification
- `etc/Build.sh` default build OK.
- `ENABLE_GPU=ON` build OK on Linux x86_64 + CUDA 12.8 + Kokkos 5.1.1 + sm_89 (gcc 13 / g++-12 host).
- `ctest -R gpl`: 60/60 PASS on both builds (12.2s OFF, 31.7s ON).
- `clang-format` clean. All commits signed off (DCO).

Test plan
A unit-level CPU/GPU comparison test belongs alongside a synthetic `GNet`/`GPin` builder helper that doesn't yet exist in the test harness; deferred.

Related Issues
gpl2/) — strategy context.