[ROCm][Bugfix] Fix chunk alignment when using context parallelism with TRITON_MLA#46114

Open
micah-wil wants to merge 4 commits into vllm-project:main from micah-wil:micah/fix-mla-attn

micah-wil (Contributor) commented Jun 18, 2026

There is a bug in mla_attention when using context parallelism on ROCm. On CUDA, max_context_chunk is aligned properly because of the self.aot_schedule path, but that path is CUDA-only, so on ROCm the chunk is left misaligned. The misalignment causes zero accuracy in test_context_parallel.py on ROCm with dcp_size=4 using TRITON_MLA:

tests/distributed/test_context_parallel.py::test_cp_generation[deepseek-ai/DeepSeek-V2-Lite-Chat-parallel_setup0-mp-auto-test_options0]

>               raise RuntimeError(
                    f"Test subprocess '{f.__name__}' failed "
                    f"({_format_subprocess_exit(result.returncode)}):\n{tb}"
                )
E               RuntimeError: Test subprocess 'test_cp_generation' failed (exit code 1):
E               Traceback (most recent call last):
E                 File "<string>", line 12, in <module>
E                 File "/projects/vllm/tests/utils.py", line 1739, in wrapper
E                   return f(*args, **kwargs)
E                          ^^^^^^^^^^^^^^^^^^
E                 File "/projects/vllm/tests/distributed/test_context_parallel.py", line 296, in test_cp_generation
E                   _test_cp_gsm8k(
E                 File "/projects/vllm/tests/distributed/test_context_parallel.py", line 255, in _test_cp_gsm8k
E                   assert accuracy >= min_accuracy, (
E                          ^^^^^^^^^^^^^^^^^^^^^^^^
E               AssertionError: TP+DCP accuracy too low: 0.000 < 0.500

tests/utils.py:1795: RuntimeError

This PR resolves the issue by rounding max_context_chunk down to the nearest size that divides evenly across GPUs and lands on a clean cache-block boundary. With this change, the test case above passes. I also updated the corresponding test so that it no longer uses CUDA-only attention backends, and I set the baseline accuracy now that the test runs successfully on ROCm. The test was passing but flaky with MIN_ACCURACY=0.64; over 10 runs, the lowest accuracy I saw was about 0.51.
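
For illustration, the kind of alignment described above can be sketched as follows. This is a minimal standalone example, not the code in this PR: the function name `align_max_context_chunk` and the parameters `dcp_world_size` and `block_size` are hypothetical, and it assumes the required alignment is the product of the number of context-parallel ranks and the KV-cache block size.

```python
def align_max_context_chunk(
    max_context_chunk: int,
    dcp_world_size: int,  # hypothetical name: number of context-parallel ranks
    block_size: int,      # hypothetical name: tokens per KV-cache block
) -> int:
    """Round a chunk size down to a multiple of dcp_world_size * block_size.

    Assumption: a chunk is "aligned" when it splits evenly across ranks
    and each rank's share ends on a cache-block boundary.
    """
    alignment = dcp_world_size * block_size
    aligned = (max_context_chunk // alignment) * alignment
    if aligned == 0:
        # The chunk budget is too small to hold one block per rank.
        raise ValueError(
            f"max_context_chunk={max_context_chunk} is smaller than the "
            f"required alignment {alignment}"
        )
    return aligned


# With 4 ranks and 16-token blocks the alignment is 64 tokens,
# so a 1000-token chunk shrinks to 960 (15 * 64).
print(align_max_context_chunk(1000, dcp_world_size=4, block_size=16))  # 960
```

Rounding down rather than up keeps the chunk within the original workspace budget, at the cost of a slightly smaller chunk.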

Signed-off-by: Micah Williamson <micah.williamson@amd.com>
Signed-off-by: Micah Williamson <micah.williamson@amd.com>
mergify bot added the rocm (Related to AMD ROCm) and bug (Something isn't working) labels Jun 18, 2026
github-project-automation bot moved this to Todo in AMD Jun 18, 2026
micah-wil changed the title from "[ROCm][Bugfix]" to "[ROCm][Bugfix] Fix chunk alignment when using context parallelism with TRITON_ATTN" Jun 19, 2026
micah-wil changed the title from "[ROCm][Bugfix] Fix chunk alignment when using context parallelism with TRITON_ATTN" to "[ROCm][Bugfix] Fix chunk alignment when using context parallelism with TRITON_MLA" Jun 19, 2026
Signed-off-by: Micah Williamson <micah.williamson@amd.com>