RAFT training reference Improvement #5590

Merged
merged 10 commits into from
Mar 15, 2022

YosuaMichael
Contributor

@YosuaMichael YosuaMichael commented Mar 11, 2022

Addresses some of the tasks in #5056:

  • Change function name from validate to evaluate
  • Support --device and enable training in non-distributed mode
  • Include optimizer and scheduler in the checkpoint

Sample command to run in non-distributed mode on CPU:

 python train.py \
    --dataset-root $dataset_root \
    --name $name_chairs \
    --model raft_small \
    --train-dataset chairs \
    --batch-size 2 \
    --lr 0.0004 \
    --weight-decay 0.0001 \
    --epochs 2 \
    --output-dir $out_chairs \
    --device cpu

To test on CPU, I ran on a mock dataset by replacing https://github.com/pytorch/vision/blob/main/torchvision/datasets/_optical_flow.py with https://gist.github.com/YosuaMichael/9c49729243ff9d467ece06ab8641680d.

Note that, for now, running in distributed mode with torchrun requires --device cuda.
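
The device constraint above can be sketched as follows. This is a minimal illustration, not the code from this PR: the helper names `get_args_parser`, `is_distributed`, and `check_device` are hypothetical, and detecting distributed mode through the `RANK` and `WORLD_SIZE` environment variables (which torchrun exports) is an assumption about how the script decides.

```python
import argparse
import os


def get_args_parser():
    # Only the --device flag is shown; the real script has many more options.
    parser = argparse.ArgumentParser(description="Device selection sketch")
    parser.add_argument("--device", default="cuda", type=str, help="device to use, e.g. cuda or cpu")
    return parser


def is_distributed(env=None):
    # torchrun exports RANK and WORLD_SIZE; a plain `python train.py` run does not.
    env = os.environ if env is None else env
    return "RANK" in env and "WORLD_SIZE" in env


def check_device(args, env=None):
    # Hypothetical guard: reject any non-cuda device when launched in distributed mode.
    distributed = is_distributed(env)
    if distributed and args.device != "cuda":
        raise ValueError("Distributed mode (torchrun) requires --device cuda")
    return distributed
```

With a guard like this, `python train.py --device cpu` runs as a single process, while launching with torchrun and `--device cpu` fails early with a clear message.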

@facebook-github-bot

facebook-github-bot commented Mar 11, 2022

💊 CI failures summary and remediations

As of commit 0e7ab27 (more details on the Dr. CI page):


💚 💚 Looks good so far! There are no failures yet. 💚 💚


Member

@NicolasHug NicolasHug left a comment

Thank you for the PR @YosuaMichael. There are 2 minor issues (see below), but otherwise this looks great!

@YosuaMichael
Contributor Author

Update:

Added support for saving the optimizer and scheduler in the checkpoint.
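
A minimal sketch of what bundling optimizer and scheduler state into a checkpoint can look like. It relies only on the `state_dict()` / `load_state_dict()` protocol that PyTorch modules, optimizers, and LR schedulers expose. The helper names `build_checkpoint` and `restore_checkpoint`, the dictionary keys, and the `epoch` field are illustrative assumptions, not the code merged in this PR.

```python
def build_checkpoint(model, optimizer, scheduler, epoch):
    """Collect everything needed to resume training into one dict.

    Hypothetical helper: the keys and signature are assumptions for illustration.
    """
    return {
        "model": model.state_dict(),
        "optimizer": optimizer.state_dict(),
        "scheduler": scheduler.state_dict(),
        "epoch": epoch,
    }


def restore_checkpoint(checkpoint, model, optimizer=None, scheduler=None):
    """Load saved states back and return the epoch to resume from."""
    model.load_state_dict(checkpoint["model"])
    # Older checkpoints may hold only model weights, so these keys are optional.
    if optimizer is not None and "optimizer" in checkpoint:
        optimizer.load_state_dict(checkpoint["optimizer"])
    if scheduler is not None and "scheduler" in checkpoint:
        scheduler.load_state_dict(checkpoint["scheduler"])
    # Resume at the epoch after the saved one; start from 0 if none was saved.
    return checkpoint.get("epoch", -1) + 1
```

With PyTorch, the dict would typically be written with `torch.save(build_checkpoint(...), path)` and read back with `torch.load(path, map_location="cpu")` before calling `restore_checkpoint`.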

@YosuaMichael
Contributor Author

Hi @NicolasHug, I decided to include the commit that saves the optimizer and scheduler in this PR as well: 09d78d1
Could you review it too? Thanks!

@YosuaMichael YosuaMichael changed the title from "Enable RAFT training reference to run on cpu and non-distributed mode" to "RAFT training reference Improvement" on Mar 11, 2022
Member

@NicolasHug NicolasHug left a comment

Thanks @YosuaMichael, we're almost there :). I made a few comments below.

Member

@NicolasHug NicolasHug left a comment

Thanks @YosuaMichael, nice work! There was one minor issue left, which I fixed in 2857e21: when no train dataset is specified, we want to go straight to evaluate without touching train_dataset. The previous code would fail because it is None.
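
The control flow described above can be sketched roughly as follows. This is an illustration under assumptions, not the code in 2857e21: `main`, `evaluate_fn`, `train_fn`, and the returned strings are hypothetical names used only to make the branch visible.

```python
def main(args, evaluate_fn, train_fn):
    # Evaluation-only run: return before anything reads args.train_dataset.
    if args.train_dataset is None:
        evaluate_fn(args)
        return "evaluated"

    # Only reached when a training dataset was actually requested.
    train_fn(args)
    return "trained"
```

Returning early keeps every later use of `train_dataset` on the training path, so a `None` value never reaches code that expects a dataset name.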

@NicolasHug NicolasHug merged commit 3aa2a93 into pytorch:main Mar 15, 2022
@github-actions

Hey @NicolasHug!

You merged this PR, but no labels were added. The list of valid labels is available at https://github.com/pytorch/vision/blob/main/.github/process_commit.py

facebook-github-bot pushed a commit that referenced this pull request Apr 5, 2022
Summary:
* Change optical flow train.py function name from validate to evaluate so it is similar to other references

* Add --device as a parameter and enable running in non-distributed mode

* Format with ufmt

* Fix unnecessary param and bug

* Enable saving the optimizer and scheduler in the checkpoint

* Fix bugs when evaluating before resuming and when saving or loading a model without DDP

* Fix case where --train-dataset is None

(Note: this ignores all push blocking failures!)

Reviewed By: YosuaMichael

Differential Revision: D35216768

fbshipit-source-id: 3b575d9f4a51caed920ff402e160a26ff6f3c2d4

Co-authored-by: Nicolas Hug <[email protected]>