Multi-node multi-GPU training often times out. After a timeout, how can training automatically resume from an already-saved model? #5027
Unanswered
jiejie1993
asked this question in
Community | Q&A
Replies: 1 comment
any update?
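During multi-node multi-GPU training, an NCCL timeout occurs. torch has `--max-restarts` to restart training, but how can the latest saved model be loaded automatically? Using `--load-checkpoint` requires the saved model to be present on every node, yet during training the model is only saved on the master node, and copying it to all nodes by hand means the restart is no longer automatic. Is there a way to automatically restart interrupted training and resume from the most recently saved model?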
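For anyone landing here with the same problem, one possible direction is sketched below. It is not a confirmed answer from the maintainers. The idea is that when `torchrun --max-restarts` relaunches the workers, the training script runs again from the top, so the script can look for the newest checkpoint at startup instead of relying on a fixed `--load-checkpoint` path. The sketch assumes the checkpoint directory sits on storage that every node can read (for example an NFS mount), which sidesteps the "only the master node has the file" issue. The `step_<N>` naming scheme and the `find_latest_checkpoint` helper are hypothetical and only for illustration.

```python
import re
from pathlib import Path
from typing import Optional

# Hypothetical naming scheme: each checkpoint is saved as "step_<N>"
# (a file or a directory) inside one checkpoint root.
_STEP_PATTERN = re.compile(r"^step_(\d+)$")


def find_latest_checkpoint(ckpt_root: str) -> Optional[Path]:
    """Return the checkpoint with the highest step number, or None if none exists."""
    root = Path(ckpt_root)
    if not root.is_dir():
        return None

    best_step = -1
    best_path: Optional[Path] = None
    for entry in root.iterdir():
        match = _STEP_PATTERN.match(entry.name)
        if match is None:
            continue  # ignore anything that does not follow the naming scheme
        step = int(match.group(1))  # compare numerically, so step_100 > step_20
        if step > best_step:
            best_step, best_path = step, entry
    return best_path
```

At the start of the training script, the result could be passed to whatever loading routine the script already uses for `--load-checkpoint`. If it returns `None`, training starts from scratch. If shared storage is not available, an alternative would be to have every node write its own copy of the checkpoint, at the cost of extra disk usage.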