Voxtral Realtime: enable bf16 for Metal backend with quantization #17845

mergennachin merged 2 commits into main.
Conversation
Pull request overview
Enables and recommends bf16 for Voxtral Realtime exports on Metal when using quantization, updating CI export arguments and user-facing docs to reflect the preferred configuration for memory/throughput.
Changes:
- Update Voxtral Realtime docs to include bf16 memory footprint numbers and recommend `--dtype bf16` for Metal quantized exports.
- Adjust the example Metal export command(s) to include `--dtype bf16` alongside `fpa4w`.
- Update the Metal CI export script to pass `--dtype bf16` for the `quantized-int4-metal` configuration.
Reviewed changes
Copilot reviewed 4 out of 4 changed files in this pull request and generated 3 comments.
| File | Description |
|---|---|
| examples/models/voxtral_realtime/model.md | Updates memory calculations and guidance around bf16 + quantization for Metal/CUDA. |
| examples/models/voxtral_realtime/export_voxtral_rt.py | Updates usage example to show Metal export with bf16 + fpa4w. |
| examples/models/voxtral_realtime/README.md | Updates Metal backend table and export examples to recommend bf16 with fpa4w. |
| .ci/scripts/export_model_artifact.sh | Ensures Metal int4 quantized CI export passes --dtype bf16. |
Force-pushed from 40b6144 to 52027ff:
The Metal AOTI backend already handles bf16 correctly (fp32 attention masks, fp32 RoPE upcast, dtype-agnostic KV caches and SDPA). Enable `--dtype bf16` as the default recipe for Metal CI and update all documentation to recommend bf16 with `fpa4w` quantization.

Fix a Metal shader compilation bug in the streaming encoder where `bool.to(bf16)` generates `bfloat tmp = 0.0;`: Metal Shading Language doesn't support implicit float-to-bfloat literal conversion, so the generated shader fails to compile. Use `.float()` instead and let `mul_` handle type promotion.
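A minimal sketch of the mask fix described above, assuming the streaming encoder zeroes padded positions with an in-place multiply; the function and tensor names are illustrative, not the actual ExecuTorch code:

```python
import torch

def apply_padding_mask(x: torch.Tensor, mask: torch.Tensor) -> torch.Tensor:
    """Zero out padded positions of `x` (fp32 or bf16) using a bool `mask`."""
    # Before: casting the bool mask straight to the activation dtype.
    # With x in bf16, the AOTI-generated Metal shader contains
    # `bfloat tmp = 0.0;`, which MSL rejects (no implicit float-to-bfloat
    # literal conversion), so shader compilation fails.
    #   x.mul_(mask.to(x.dtype))

    # After: cast the bool mask to fp32 and let mul_'s type promotion handle
    # bf16; the float->bf16 in-place cast is allowed because both operands
    # are floating-point types.
    x.mul_(mask.float())
    return x

# Works identically for fp32 and bf16 activations.
x = torch.randn(2, 4, dtype=torch.bfloat16)
mask = torch.tensor([[1, 1, 0, 0], [1, 0, 0, 0]], dtype=torch.bool)
print(apply_padding_mask(x, mask))
```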
Force-pushed from 52027ff to 77b74fd.
Pull request overview
Copilot reviewed 4 out of 4 changed files in this pull request and generated 2 comments.
> fp32: ≈ 832 MB, bf16: ≈ 416 MB. Encoder KV caches (streaming):
> 32 layers × 2 × 1500 × 32 × 64 × bytes_per_elem. fp32: ≈ 786 MB,
> bf16: ≈ 393 MB.
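The streaming-encoder figures quoted above check out; a quick sanity calculation, assuming "MB" here means 10^6 bytes:

```python
# KV-cache size: layers x (K and V) x max frames x heads x head_dim x bytes/elem
layers, kv, frames, heads, head_dim = 32, 2, 1500, 32, 64
elems = layers * kv * frames * heads * head_dim
print(f"fp32: {elems * 4 / 1e6:.0f} MB")  # ~786 MB
print(f"bf16: {elems * 2 / 1e6:.0f} MB")  # ~393 MB
```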
> **Metal:** `MetalSDPA` uses `torch.ops.aten._scaled_dot_product_attention_math_for_mps`,
> which handles GQA natively (the kernel infers the group ratio from differing
> Q vs K/V head counts), avoiding the memory bandwidth overhead of
> `repeat_interleave`. Uses explicit additive attention masks
> that must match the Q/K/V dtype (the kernel reads masks as `device T*`).
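The mask-dtype constraint can be illustrated with the public PyTorch SDPA API rather than the internal MPS op; a sketch assuming a typical GQA layout (32 query heads over 8 K/V heads) and a PyTorch recent enough to ship `enable_gqa`:

```python
import torch
import torch.nn.functional as F

B, H_q, H_kv, S, D = 1, 32, 8, 16, 64
dtype = torch.bfloat16

q = torch.randn(B, H_q, S, D, dtype=dtype)
k = torch.randn(B, H_kv, S, D, dtype=dtype)
v = torch.randn(B, H_kv, S, D, dtype=dtype)

# Explicit additive causal mask: 0 where attention is allowed, -inf elsewhere.
# Built in the same dtype as Q/K/V, mirroring the Metal kernel's requirement
# that the mask buffer is read as `device T*`.
mask = torch.full((S, S), float("-inf"), dtype=dtype).triu(diagonal=1)

# GQA handled without repeat_interleave: the group ratio (32 / 8 = 4) comes
# from the differing Q vs K/V head counts.
out = F.scaled_dot_product_attention(q, k, v, attn_mask=mask, enable_gqa=True)
print(out.shape)  # torch.Size([1, 32, 16, 64])
```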