System Info
Two issues:
- It seems the current implementation of the FLOPs counter accumulates the total FLOPs across all training steps, which causes the TFLOPS/s/GPU metric to go up as the number of training steps increases (a minimal sketch of the suspected behavior follows this list; reference: flops counter error with PyTorch 2.5 and 2.6, pytorch/pytorch#145947 (comment)).
- Actually, this one is a question: in the training config, what is the impact of the default num_freeze_layers: int = 1 on the mllama model (Llama 3.2 11B Vision)? I suspect it may be contributing to the underestimate of total model TFLOPS (see the layer-freezing sketch below).
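
For the first issue, here is a minimal sketch of the suspected accumulation behavior, using torch.utils.flop_counter.FlopCounterMode directly rather than the repo's wrapper; the toy model and the metric arithmetic are illustrative assumptions, not the actual training-loop code.

```python
# Minimal sketch of the suspected bug: FlopCounterMode.get_total_flops() is
# cumulative across steps, but the timing is per step, so the ratio grows.
# The toy model and loop below are illustrative, not the repo's training loop.
import time

import torch
from torch.utils.flop_counter import FlopCounterMode

model = torch.nn.Linear(1024, 1024)
inputs = torch.randn(8, 1024)

flop_counter = FlopCounterMode(display=False)
with flop_counter:
    for step in range(3):
        t0 = time.time()
        model(inputs).sum().backward()
        elapsed = time.time() - t0
        # BUG: total FLOPs over *all* steps so far divided by *this* step's
        # time -> the reported TFLOPS/s roughly doubles, triples, ... per step.
        tflops_per_s = flop_counter.get_total_flops() / elapsed / 1e12
        print(f"step {step}: {tflops_per_s:.6f} TFLOPS/s (inflated)")
```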
Thank you!
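
Regarding the second point, a hedged sketch of how a num_freeze_layers-style option is commonly implemented; the helper name and the `model.layers` attribute are assumptions for illustration, not the repo's actual code.

```python
import torch


def freeze_first_n_layers(model: torch.nn.Module, num_freeze_layers: int) -> None:
    """Hypothetical helper: disable gradients for the first N transformer blocks.

    Assumes the blocks live at `model.layers`; the real mllama / llama-recipes
    code may organize and name things differently.
    """
    for i, layer in enumerate(model.layers):
        if i < num_freeze_layers:
            for param in layer.parameters():
                param.requires_grad = False
```

If the FLOPs counter only observes ops from the executed backward pass, frozen layers contribute no weight-gradient FLOPs, which could plausibly show up as a lower total-model TFLOPS estimate than full fine-tuning.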
Information
- The official example scripts
- My own modified scripts
🐛 Describe the bug
Steps to reproduce can be found in this ticket: pytorch/pytorch#145947 (comment)
Error logs
Explained above and in this ticket: pytorch/pytorch#145947 (comment)
Expected behavior
FLOPs should be counted per training step (not accumulated across the whole run), so the TFLOPS/s/GPU metric stays roughly constant as training progresses.
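
A hedged sketch of what the expected per-step metric could look like, tracking the delta of the cumulative counter each step; same illustrative toy model as above, not the repo's actual code.

```python
# Sketch of the expected behavior: derive per-step FLOPs as a delta of the
# cumulative counter so TFLOPS/s stays roughly constant across steps.
import time

import torch
from torch.utils.flop_counter import FlopCounterMode

model = torch.nn.Linear(1024, 1024)
inputs = torch.randn(8, 1024)

flop_counter = FlopCounterMode(display=False)
prev_total = 0
with flop_counter:
    for step in range(3):
        t0 = time.time()
        model(inputs).sum().backward()
        elapsed = time.time() - t0
        total = flop_counter.get_total_flops()
        flops_this_step = total - prev_total  # per-step delta, not cumulative
        prev_total = total
        tflops_per_s = flops_this_step / elapsed / 1e12
        print(f"step {step}: {tflops_per_s:.6f} TFLOPS/s (steady)")
```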