Commit 9b718e1
Correct Attention FLOPS estimation in flops_counter.py (#4929)
### What does this PR do?
Fix Attention FLOPS Calculation for Causal LLMs
### Problem
The attention FLOPS calculations for all causal LLMs are missing the
`/2` factor for the causal (lower-triangular) attention mask. This
causes a **2× overestimation** of attention FLOPS.
Additionally, DeepSeek V3's MLA attention incorrectly uses the same
dimension for both the Q@K^T and attn@V operations, even though
`v_head_dim` differs from `q_head_dim`.
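
As a rough illustration of where the coefficients come from, the sketch below splits the attention matmul cost into its Q@K^T and attn@V parts. It assumes the common convention that one multiply-add counts as 2 FLOPs and that the backward pass costs about twice the forward pass. The function name and its arguments are hypothetical and are not the code in `flops_counter.py`.

```python
def attention_matmul_flops(seq_len, num_heads, qk_head_dim, v_head_dim, causal):
    """Approximate training FLOPs (forward + backward) for Q@K^T and attn@V in one layer.

    Hypothetical helper for illustration only.
    """
    # Forward pass: each matmul performs seq_len * seq_len * dim multiply-adds
    # per head, and one multiply-add is counted as 2 FLOPs.
    qk_forward = 2 * seq_len * seq_len * qk_head_dim * num_heads
    av_forward = 2 * seq_len * seq_len * v_head_dim * num_heads
    # Assumption: backward costs about 2x forward, so training is 3x forward.
    total = 3 * (qk_forward + av_forward)
    # A causal mask needs only about half of the score matrix.
    return total // 2 if causal else total


# Equal head dims: 12 * seq^2 * d without the mask, 6 * seq^2 * d with it.
print(attention_matmul_flops(8, 2, 4, 4, causal=False))
print(attention_matmul_flops(8, 2, 4, 4, causal=True))
# Different head dims with the mask: 3 * seq^2 * (q + v).
print(attention_matmul_flops(8, 2, 6, 2, causal=True))
```

With equal head dimensions this reduces to the `12 *` and `6 *` coefficients, and with distinct Q/K and V dimensions it reduces to `3 * (q + v)`.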
### Changes Summary
| Function | Models | Change |
|----------|--------|--------|
| `_estimate_qwen2_flops` | qwen2, llama, qwen3, mistral, etc. | `12 *` → `6 *` |
| `_estimate_qwen3_vl_flops` | qwen3_vl | `12 *` → `6 *` |
| `_estimate_qwen3_vl_moe_flops` | qwen3_vl_moe | `12 *` → `6 *` |
| `_estimate_qwen2_moe_flops` | qwen2_moe, qwen3_moe | `12 *` → `6 *` |
| `_estimate_gemma3_flops` | gemma3_text | `12 *` → `6 *` |
| `_estimate_apertus_flops` | apertus | `12 *` → `6 *` |
| `_estimate_gpt_oss_flops` | gpt_oss | `12 *` → `6 *` |
| `_estimate_deepseek_v3_flops` | deepseek_v3 | `12 * q` → `3 * (q + v)` |
| `_estimate_qwen3_vit_flop` | ViT (vision) | **No change** (bidirectional) |
For causal (autoregressive) attention, only the lower triangular portion
of the attention matrix is computed, which is roughly half of the `seq²`
entries:
```
Attention Matrix (causal):
[✓ · · ·]
[✓ ✓ · ·]
[✓ ✓ ✓ ·]
[✓ ✓ ✓ ✓]
```
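
The `/2` factor is an approximation: the lower triangle including the diagonal holds `n * (n + 1) / 2` of the `n²` entries. The short sketch below (the helper name is hypothetical) shows how quickly that fraction approaches one half as the sequence grows.

```python
def causal_fraction(seq_len):
    """Fraction of the seq_len x seq_len score matrix kept by a causal mask."""
    kept = seq_len * (seq_len + 1) // 2  # lower triangle, diagonal included
    return kept / (seq_len * seq_len)


for n in (4, 1024, 32768):
    print(n, causal_fraction(n))
```

For the 4×4 diagram above, 10 of 16 entries are kept. At typical training sequence lengths the fraction is within a fraction of a percent of 0.5.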
| Model Type | Before | After | Overestimation |
|------------|--------|-------|----------------|
| Standard GQA/MHA | `12 * seq² * d` | `6 * seq² * d` | **2.0×** |
| DeepSeek V3 MLA | `12 * seq² * q` | `3 * seq² * (q+v)` | **2.4×** |
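
The overestimation factors in the table can be checked with a few lines of arithmetic. The DeepSeek V3 head dimensions used here (`q_head_dim = 128 + 64 = 192`, `v_head_dim = 128`) are an assumption taken from the published model configuration, not from this PR. The shared `seq²` term cancels and is omitted.

```python
# Standard GQA/MHA: both matmuls use the same head dim d, so d cancels.
d = 128  # arbitrary value; the ratio does not depend on it
standard_ratio = (12 * d) / (6 * d)

# MLA: Q@K^T uses q_head_dim, attn@V uses v_head_dim.
# Assumed DeepSeek V3 values: qk_nope_head_dim=128, qk_rope_head_dim=64, v_head_dim=128.
q_head_dim = 128 + 64
v_head_dim = 128
mla_ratio = (12 * q_head_dim) / (3 * (q_head_dim + v_head_dim))

print(f"standard: {standard_ratio:.1f}x, MLA: {mla_ratio:.1f}x")
```

In general the MLA ratio is `4q / (q + v)`, so it equals 2.0× only when the two dimensions match.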
### Reference
This fix aligns with [Megatron-LM's FLOPS
calculation](https://github.com/NVIDIA/Megatron-LM/blob/main/megatron/training/training.py):
- Uses `/2` for causal attention
- Separately accounts for `q_head_dim` and `v_head_dim` in MLA
### Checklist Before Starting
- [ ] Search for similar PRs. Paste at least one query link here: ...
- [ ] Format the PR title as `[{modules}] {type}: {description}` (This
will be checked by the CI)
  - `{modules}` include `fsdp`, `megatron`, `veomni`, `sglang`, `vllm`,
    `rollout`, `trainer`, `ci`, `training_utils`, `recipe`, `hardware`,
    `deployment`, `ray`, `worker`, `single_controller`, `misc`, `perf`,
    `model`, `algo`, `env`, `tool`, `ckpt`, `doc`, `data`, `cfg`, `reward`
  - If this PR involves multiple modules, separate them with `,` like
    `[megatron, fsdp, doc]`
  - `{type}` is in `feat`, `fix`, `refactor`, `chore`, `test`
  - If this PR breaks any API (CLI arguments, config, function signature,
    etc.), add `[BREAKING]` to the beginning of the title.
    - Example: `[BREAKING][fsdp, megatron] feat: dynamic batching`
### Checklist Before Submitting
> [!IMPORTANT]
> Please check all the following items before requesting a review,
otherwise the reviewer might deprioritize this PR for review.
- [ ] Read the [Contribute
Guide](https://github.com/volcengine/verl/blob/main/CONTRIBUTING.md).
- [ ] Apply [pre-commit
checks](https://github.com/volcengine/verl/blob/main/CONTRIBUTING.md#code-linting-and-formatting):
`pre-commit install && pre-commit run --all-files --show-diff-on-failure
--color=always`
- [ ] Add / Update [the
documentation](https://github.com/volcengine/verl/tree/main/docs).
- [ ] Add unit or end-to-end test(s) to [the CI
workflow](https://github.com/volcengine/verl/tree/main/.github/workflows)
to cover all the code. If not feasible, explain why: ...
- [ ] Once your PR is ready for CI, send a message in [the `ci-request`
channel](https://verl-project.slack.com/archives/C091TCESWB1) in [the
`verl` Slack
workspace](https://join.slack.com/t/verl-project/shared_invite/zt-3855yhg8g-CTkqXu~hKojPCmo7k_yXTQ).
(If not accessible, please try [the Feishu group
(飞书群)](https://applink.larkoffice.com/client/chat/chatter/add_by_link?link_token=772jd4f1-cd91-441e-a820-498c6614126a).)
- [ ] If your PR is related to the `recipe` submodule, please also
update the reference to the submodule commit via `git submodule update
--remote` or `cd recipe && git pull origin main`.

1 parent 07d4033 commit 9b718e1
1 file changed, +11 -8 lines changed

Changed hunks (original line → new lines): 115 → 115, 152 → 152, 200 → 200, 307 → 307-310, 344 → 347, 412 → 415, 452 → 455, 523 → 526.