Motivation
In DPX (Dual Partition X-celerator) mode, the number of CUs visible on a device depends on the partition configuration. The previous device identification logic relied on hardcoded CU counts (multiProcessorCount == 304 for MI300, == 80 || == 64 for MI308) to select the correct pre-compiled kernel binary (.co file). In DPX mode the CU count no longer matches these expected values, so device identification fails and the wrong kernel path is selected, which results in "file not found" errors and kernel launch failures.
Technical Details
Replace CU-count-based MI300/MI308 GPU identification with PCI chip ID detection via hipDeviceAttributePciChipId. The PCI chip ID is a fixed hardware identifier, so it stays the same regardless of DPX partition mode, CU masking, or container environment.
MI308 device IDs (0x74A2, 0x74A8, 0x74B6, 0x74BC) are identified from the official AMD device ID registry; all other gfx942 devices default to MI300.
csrc/include/aiter_hip_common.h: Add get_pci_chip_id() helper using hipDeviceGetAttribute(hipDeviceAttributePciChipId) and is_mi308_device() that checks against known MI308 chip IDs.
csrc/cpp_itfs/mha_fwd.cu: Update get_kernel_co_name() to use is_mi308_device() instead of CU count comparison for selecting the correct .co kernel binary path (MI308/ vs MI300/).
aiter/jit/utils/chip_info.py: Add _get_pci_chip_id() using ctypes to call hipDeviceGetAttribute directly, and update get_device_name() to use chip ID instead of CU count.
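The selection rule described above can be sketched in a few lines of Python. This is a minimal illustration, not the code from this PR: the names MI308_CHIP_IDS, is_mi308_chip, and kernel_dir_for are hypothetical, and the chip ID is passed in as a plain integer instead of being queried through hipDeviceGetAttribute. Only the four MI308 IDs and the "everything else on gfx942 is MI300" default come from the description.

```python
from typing import Optional

# MI308 PCI chip IDs as listed in this PR description.
MI308_CHIP_IDS = frozenset({0x74A2, 0x74A8, 0x74B6, 0x74BC})


def is_mi308_chip(chip_id: int) -> bool:
    """Return True if chip_id is one of the known MI308 chip IDs."""
    return chip_id in MI308_CHIP_IDS


def kernel_dir_for(gfx_arch: str, chip_id: int) -> Optional[str]:
    """Pick the kernel subdirectory (hypothetical helper for illustration).

    On gfx942, known MI308 chip IDs map to "MI308" and every other chip ID
    defaults to "MI300". Other architectures are outside this sketch and
    return None.
    """
    if gfx_arch != "gfx942":
        return None
    return "MI308" if is_mi308_chip(chip_id) else "MI300"


if __name__ == "__main__":
    # The visible CU count plays no part in the decision, so the result is
    # the same whichever partition mode the device is running in.
    print(kernel_dir_for("gfx942", 0x74A2))
    print(kernel_dir_for("gfx942", 0x0000))
```

Because the decision depends only on the chip ID, a partitioned device that reports a reduced CU count still resolves to the same kernel directory as the unpartitioned one.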
Test Plan
Test Result
Submission Checklist