1. Verified Flash-RL is installed per the Tutorial.
2. Currently using #36 for experiments.
3. Setup: 70B LLM, 32k sequence length, training batch size 512, mini batch size 32, 32 H200 nodes.

Per-step time (in seconds) for FP8 (blue) vs. BF16 (yellow) doesn't differ much:

<img width="348" height="213" alt="Image" src="https://github.com/user-attachments/assets/1b603122-ef9d-4bcf-b5ab-6da387eb5be6" />
<img width="364" height="286" alt="Image" src="https://github.com/user-attachments/assets/4f645355-a038-470e-a4ef-a86801420587" />

The FP8 overhead is even more apparent in smaller models (8B) with an 8k response length.
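For reference, here is a minimal sketch of how the two runs above might be launched with the setup from item 3. The `FLASHRL_CONFIG` environment variable as the FP8 rollout switch and the verl-style Hydra override keys are assumptions on my end, not taken from this report; adjust to your actual launch script.

```python
import os
import subprocess

# Setup from item 3, expressed as (assumed) verl-style Hydra overrides.
SETUP = [
    "data.max_response_length=32768",                  # 32k sequence length
    "data.train_batch_size=512",                       # training batch size 512
    "actor_rollout_ref.actor.ppo_mini_batch_size=32",  # mini batch size 32
    "trainer.nnodes=32",                               # 32 H200 nodes
]

def launch(precision: str) -> None:
    """Launch one PPO run with rollouts in 'fp8' or 'bf16'."""
    env = dict(os.environ)
    if precision == "fp8":
        # Assumption: Flash-RL enables FP8 rollout via this env var.
        env["FLASHRL_CONFIG"] = "fp8"
    cmd = ["python", "-m", "verl.trainer.main_ppo", *SETUP]
    subprocess.run(cmd, env=env, check=True)

if __name__ == "__main__":
    for precision in ("fp8", "bf16"):
        launch(precision)  # per-step times are then read from the trainer logs
```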