1. Verified Flash-RL is installed per the Tutorial.
2. Currently using #36 for experiments.
3. Setup: 70B LLM, 32k sequence length, training batch size 512, mini batch size 32, 32 H200 nodes.

Per-step time (in seconds) for FP8 (blue) vs. BF16 (yellow) doesn't differ much:

<img width="348" height="213" alt="Image" src="https://github.com/user-attachments/assets/1b603122-ef9d-4bcf-b5ab-6da387eb5be6" />
<img width="364" height="286" alt="Image" src="https://github.com/user-attachments/assets/4f645355-a038-470e-a4ef-a86801420587" />

The FP8 overhead is even more apparent in smaller models (8B) with an 8k response length.
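For reference, here is a minimal sketch of how the two runs above might be launched with the setup from item 3. The `FLASHRL_CONFIG` environment variable as the FP8 rollout switch and the verl-style Hydra override keys are assumptions on my end, not taken from this report; adjust to your actual launch script.

```python
import os
import subprocess

# Setup from item 3, expressed as (assumed) verl-style Hydra overrides.
SETUP = [
    "data.max_response_length=32768",                  # 32k sequence length
    "data.train_batch_size=512",                       # training batch size 512
    "actor_rollout_ref.actor.ppo_mini_batch_size=32",  # mini batch size 32
    "trainer.nnodes=32",                               # 32 H200 nodes
]

def launch(precision: str) -> None:
    """Launch one PPO run with rollouts in 'fp8' or 'bf16'."""
    env = dict(os.environ)
    if precision == "fp8":
        # Assumption: Flash-RL enables FP8 rollout via this env var.
        env["FLASHRL_CONFIG"] = "fp8"
    cmd = ["python", "-m", "verl.trainer.main_ppo", *SETUP]
    subprocess.run(cmd, env=env, check=True)

if __name__ == "__main__":
    for precision in ("fp8", "bf16"):
        launch(precision)  # per-step times are then read from the trainer logs
```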