
Add pretrained weights on Chairs and Things for raft_large #5060


Merged: 6 commits merged into pytorch:main on Dec 8, 2021

Conversation

NicolasHug (Member) commented Dec 8, 2021

Towards #4644

This PR adds pretrained Chairs + Things weights for raft_large. The weights can be evaluated on the training sets of Sintel or Kitti.


Some manual tests making sure everything works fine:

Using --weights Raft_Large_Weights.C_T_V2

(raft) ➜  vision git:(raft_pretrained_CT) ✗ torchrun --nproc_per_node 8 --nnodes 1 references/optical_flow/train.py --val-dataset sintel --batch-size 10 --dataset-root /data/home/nicolashug/cluster/work/downloads --model raft_large --weights Raft_Large_Weights.C_T_V2
Sintel val clean Total time: 0:00:15
Batch-processed 1040 / 1041 samples. Going to process the remaining samples individually, if any.
Sintel val clean epe: 1.3825	1px: 0.9028	3px: 0.9573	5px: 0.9697	per_image_epe: 1.3782	f1: 4.0234
Sintel val final Total time: 0:00:12
Batch-processed 1040 / 1041 samples. Going to process the remaining samples individually, if any.
Sintel val final epe: 2.7148	1px: 0.8526	3px: 0.9203	5px: 0.9392	per_image_epe: 2.7199	f1: 7.6100

Using --pretrained

(raft) ➜  vision git:(raft_pretrained_CT) ✗ torchrun --nproc_per_node 8 --nnodes 1 references/optical_flow/train.py --val-dataset sintel --batch-size 10 --dataset-root /data/home/nicolashug/cluster/work/downloads --model raft_large --pretrained
Sintel val clean Total time: 0:00:14
Batch-processed 1040 / 1041 samples. Going to process the remaining samples individually, if any.
Sintel val clean epe: 1.3825	1px: 0.9028	3px: 0.9573	5px: 0.9697	per_image_epe: 1.3782	f1: 4.0234
Sintel val final Total time: 0:00:12
Batch-processed 1040 / 1041 samples. Going to process the remaining samples individually, if any.
Sintel val final epe: 2.7148	1px: 0.8526	3px: 0.9203	5px: 0.9392	per_image_epe: 2.7199	f1: 7.6100

Using --weights Raft_Large_Weights.C_T_V1 (Original weights)

Sintel val clean Total time: 0:04:06
Batch-processed 1041 / 1041 samples. Going to process the remaining samples individually, if any.
Sintel val clean epe: 1.4411	1px: 0.9016	3px: 0.9560	5px: 0.9684	per_image_epe: 1.4411	f1: 4.1593
Sintel val final Total time: 0:04:02
Batch-processed 1041 / 1041 samples. Going to process the remaining samples individually, if any.
Sintel val final epe: 2.7894	1px: 0.8528	3px: 0.9190	5px: 0.9381	per_image_epe: 2.7894	f1: 7.7217
Checking that both the string and the enum weight specifications work and that the model is built on CPU:

from torchvision.prototype.models.optical_flow import raft_large, Raft_Large_Weights
assert not next(raft_large(weights="Raft_Large_Weights.C_T_V2").parameters()).is_cuda
assert not next(raft_large(weights=Raft_Large_Weights.C_T_V2).parameters()).is_cuda
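For reference, a minimal inference sketch with the new weights. The two-image call and the list of flow estimates it returns reflect the current RAFT implementation; the random inputs and their sizes are just for illustration (real images should go through the weights' RaftEval preset first):

import torch
from torchvision.prototype.models.optical_flow import raft_large, Raft_Large_Weights

model = raft_large(weights=Raft_Large_Weights.C_T_V2).eval()
# Two image batches with H and W divisible by 8 (required by the 1/8-resolution feature maps).
img1 = torch.rand(1, 3, 368, 496)
img2 = torch.rand(1, 3, 368, 496)
with torch.no_grad():
    flow_predictions = model(img1, img2)  # one flow estimate per refinement iteration
print(flow_predictions[-1].shape)  # final estimate: (1, 2, 368, 496)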

For my own sanity and for reference, here's the slurm script that I used (obviously, these weights correspond to the ones in things/raft-things.pth):

#!/bin/bash
#SBATCH --partition=train
#SBATCH --cpus-per-task=96  # 12 CPUs per GPU
#SBATCH --gpus-per-node=8
#SBATCH --nodes=1
#SBATCH --time=70:00:00
#SBATCH --output=/data/home/nicolashug/cluster/experiments/slurm-%j.out
#SBATCH --error=/data/home/nicolashug/cluster/experiments/slurm-%j.err



n_gpus=8  # If you modify these, also update the equivalent above.
n_nodes=1

output_dir=~/cluster/experiments/id_$SLURM_JOB_ID
mkdir -p $output_dir

this_script=./train.sh  # depends on where you call it from
cp $this_script $output_dir

function unused_port() {
    # Find a random unused port. It's needed if you run multiple sbatches on the same node
    N=${1:-1}
    comm -23 \
        <(seq "1025" "65535" | sort) \
        <(ss -Htan |
            awk '{print $4}' |
            cut -d':' -f2 |
            sort -u) |
        shuf |
        head -n "$N"
}
master_port=$(unused_port)

dataset_root=/data/home/nicolashug/cluster/work/downloads

# FlyingChairs
batch_size_chairs=2
lr_chairs=0.0004
num_steps_chairs=100000
name_chairs=raft_chairs
wdecay_chairs=0.0001

chairs_dir=$output_dir/chairs
mkdir -p $chairs_dir
torchrun --nproc_per_node $n_gpus --nnodes $n_nodes --master_port $master_port references/optical_flow/train.py \
    --dataset-root $dataset_root \
    --name $name_chairs \
    --train-dataset chairs \
    --batch-size $batch_size_chairs \
    --lr $lr_chairs \
    --weight-decay $wdecay_chairs \
    --num-steps $num_steps_chairs \
    --output-dir $chairs_dir

# FlyingThings3D
batch_size_things=2
lr_things=0.000125
num_steps_things=100000
name_things=raft_things
wdecay_things=0.0001

things_dir=$output_dir/things
mkdir -p $things_dir
torchrun --nproc_per_node $n_gpus --nnodes $n_nodes --master_port $master_port references/optical_flow/train.py \
    --dataset-root $dataset_root \
    --name $name_things \
    --train-dataset things \
    --batch-size $batch_size_things \
    --lr $lr_things \
    --weight-decay $wdecay_things \
    --num-steps $num_steps_things \
    --freeze-batch-norm \
    --output-dir $things_dir \
    --resume $chairs_dir/$name_chairs.pth

# Sintel S+K+H
batch_size_sintel_skh=2
lr_sintel_skh=0.000125
num_steps_sintel_skh=100000
name_sintel_skh=raft_sintel_skh
wdecay_sintel_skh=0.00001
gamma_sintel_skh=0.85

sintel_skh_dir=$output_dir/sintel_skh
mkdir -p $sintel_skh_dir
torchrun --nproc_per_node $n_gpus --nnodes $n_nodes --master_port $master_port references/optical_flow/train.py \
    --dataset-root $dataset_root \
    --name $name_sintel_skh \
    --train-dataset sintel_SKH \
    --batch-size $batch_size_sintel_skh \
    --lr $lr_sintel_skh \
    --weight-decay $wdecay_sintel_skh \
    --gamma $gamma_sintel_skh \
    --num-steps $num_steps_sintel_skh \
    --freeze-batch-norm \
    --output-dir $sintel_skh_dir \
    --resume $things_dir/$name_things.pth

# Kitti
batch_size_kitti=2
lr_kitti=0.0001
num_steps_kitti=50000
name_kitti=raft_kitti
wdecay_kitti=0.00001
gamma_kitti=0.85

kitti_dir=$output_dir/kitti
mkdir -p $kitti_dir
torchrun --nproc_per_node $n_gpus --nnodes $n_nodes --master_port $master_port references/optical_flow/train.py \
    --dataset-root $dataset_root \
    --name $name_kitti \
    --train-dataset kitti \
    --batch-size $batch_size_kitti \
    --lr $lr_kitti \
    --weight-decay $wdecay_kitti \
    --gamma $gamma_kitti \
    --num-steps $num_steps_kitti \
    --freeze-batch-norm \
    --output-dir $kitti_dir \
    --resume $sintel_skh_dir/$name_sintel_skh.pth

The code to map the original paper's weights to ours is:

def map_orig_to_ours(orig, mine=None):
    # TODO: remove
    d = {}
    used_s_orig = set()
    used_s_mine = set()

    def assert_and_add(s_orig, s_mine):
        # print(s_orig, s_mine)
        # print(orig[s_orig].shape, mine[s_mine].shape)

        assert s_orig not in used_s_orig
        assert s_mine not in used_s_mine

        if mine is not None:
            assert s_mine in mine
        assert s_orig in orig
        if mine is not None:
            assert orig[s_orig].shape == mine[s_mine].shape
        d["module." + s_mine] = orig[s_orig]
        used_s_orig.add(s_orig)
        used_s_mine.add(s_mine)

    for encoder_orig, encoder_mine in (
        ("fnet", "feature_encoder"),
        ("cnet", "context_encoder"),
    ):
        for attr in ("bias", "weight"):
            s_orig = f"module.{encoder_orig}.conv1.{attr}"
            s_mine = f"{encoder_mine}.convnormrelu.0.{attr}"
            assert_and_add(s_orig, s_mine)

            s_orig = f"module.{encoder_orig}.conv2.{attr}"
            s_mine = f"{encoder_mine}.conv.{attr}"
            assert_and_add(s_orig, s_mine)

            for layer in (1, 2, 3):
                for block in (0, 1):
                    for conv in (1, 2):
                        s_orig = f"module.{encoder_orig}.layer{layer}.{block}.conv{conv}.{attr}"
                        s_mine = f"{encoder_mine}.layer{layer}.{block}.convnormrelu{conv}.0.{attr}"
                        assert_and_add(s_orig, s_mine)

            for layer in (2, 3):
                s_orig = f"module.{encoder_orig}.layer{layer}.0.downsample.0.{attr}"
                s_mine = f"{encoder_mine}.layer{layer}.0.downsample.0.{attr}"
                assert_and_add(s_orig, s_mine)

    encoder_orig, encoder_mine = "cnet", "context_encoder"
    for attr in (
        "bias",
        "weight",
        "running_mean",
        "running_var",
        "num_batches_tracked",
    ):
        s_orig = f"module.{encoder_orig}.norm1.{attr}"
        s_mine = f"{encoder_mine}.convnormrelu.1.{attr}"
        assert_and_add(s_orig, s_mine)
        for layer in (1, 2, 3):
            for block in (0, 1):
                for norm in (1, 2):
                    s_orig = f"module.{encoder_orig}.layer{layer}.{block}.norm{norm}.{attr}"
                    s_mine = f"{encoder_mine}.layer{layer}.{block}.convnormrelu{norm}.1.{attr}"
                    assert_and_add(s_orig, s_mine)
        for layer in (2, 3):
            s_orig = f"module.{encoder_orig}.layer{layer}.0.downsample.1.{attr}"
            s_mine = f"{encoder_mine}.layer{layer}.0.downsample.1.{attr}"
            assert_and_add(s_orig, s_mine)

    corr_orig, corr_mine = (
        "module.update_block.encoder.",
        "update_block.motion_encoder.",
    )
    for attr in ("bias", "weight"):
        for i in (1, 2):
            s_orig = f"{corr_orig}convc{i}.{attr}"
            s_mine = f"{corr_mine}convcorr{i}.0.{attr}"
            assert_and_add(s_orig, s_mine)
            s_orig = f"{corr_orig}convf{i}.{attr}"
            s_mine = f"{corr_mine}convflow{i}.0.{attr}"
            assert_and_add(s_orig, s_mine)
        s_orig = f"{corr_orig}conv.{attr}"
        s_mine = f"{corr_mine}conv.0.{attr}"
        assert_and_add(s_orig, s_mine)

    rec_orig, rec_mine = "module.update_block.gru", "update_block.recurrent_block"
    for attr in ("bias", "weight"):
        for i in (1, 2):
            for conv in ("convz", "convr", "convq"):
                s_orig = f"{rec_orig}.{conv}{i}.{attr}"
                s_mine = f"{rec_mine}.convgru{i}.{conv}.{attr}"
                assert_and_add(s_orig, s_mine)

    flow_orig, flow_mine = "module.update_block.flow_head", "update_block.flow_head"
    for attr in ("bias", "weight"):
        for i in (1, 2):
            s_orig = f"{flow_orig}.conv{i}.{attr}"
            s_mine = f"{flow_mine}.conv{i}.{attr}"
            assert_and_add(s_orig, s_mine)
    for s_orig, s_mine in zip(
        (
            "module.update_block.mask.0.weight",
            "module.update_block.mask.0.bias",
            "module.update_block.mask.2.weight",
            "module.update_block.mask.2.bias",
        ),
        (
            "mask_predictor.convrelu.0.weight",
            "mask_predictor.convrelu.0.bias",
            "mask_predictor.conv.weight",
            "mask_predictor.conv.bias",
        ),
    ):
        assert_and_add(s_orig, s_mine)

    if mine is not None:
        print(len(d), len(orig), len(mine))
        assert not (set(mine.keys()) - set(d.keys()))
    return d
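A rough sketch of how such a mapping might be applied end to end. The checkpoint path, the DataParallel wrapping (to reproduce the "module." prefix produced by the mapping), and strict=False are assumptions for illustration, not part of this PR:

import torch
from torchvision.prototype.models.optical_flow import raft_large

orig = torch.load("things/raft-things.pth", map_location="cpu")  # original RAFT checkpoint
remapped = map_orig_to_ours(orig)  # keys come back prefixed with "module."
model = torch.nn.DataParallel(raft_large())  # wrapping adds the matching "module." prefix
model.load_state_dict(remapped, strict=False)  # strict=False in case some buffers are not covered
torch.save(model.module.state_dict(), "raft_large_C_T_V2.pth")  # save an un-prefixed state dict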

cc @datumbox

NicolasHug added labels on Dec 8, 2021: module: models, module: reference scripts, other (if you have no clue or if you will manually handle the PR in the release notes).
facebook-github-bot commented Dec 8, 2021

💊 CI failures summary and remediations

As of commit 57aff36 (more details on the Dr. CI page):


💚 💚 Looks good so far! There are no failures yet. 💚 💚


This comment was automatically generated by Dr. CI.

Please report bugs/suggestions to the (internal) Dr. CI Users group.


@@ -19,6 +20,9 @@
)


_MODELS_URLS = {"raft_large": "https://download.pytorch.org/models/raft_large_C_T_V2-1bb1363a.pth"}
NicolasHug (Member Author): Once the PR is merged, I will upload this to manifold.

Contributor: FYI, all current models use model_urls.
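For context, a sketch of the convention the comment refers to: the existing (non-prototype) builders expose a module-level model_urls dict, so the equivalent here would look something like the following (same URL, only the name differs):

model_urls = {"raft_large": "https://download.pytorch.org/models/raft_large_C_T_V2-1bb1363a.pth"}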

NicolasHug mentioned this pull request on Dec 8, 2021.
datumbox (Contributor) left a comment: Thanks @NicolasHug, I've added a few minor comments and nits. Let me know what you think.

Also, there are a few more prototype tests where you should add models.optical_flow, for example test_schema_meta_validation.

There you need to add the schema for optical flow:

def test_schema_meta_validation(model_fn):
    classification_fields = ["size", "categories", "acc@1", "acc@5"]
    defaults = {
        "all": ["interpolation", "recipe"],
        "models": classification_fields,
        "detection": ["categories", "map"],
        "quantization": classification_fields + ["backend", "quantization", "unquantized"],
        "segmentation": ["categories", "mIoU", "acc"],
        "video": classification_fields,
    }

In your case it's going to be empty, unless you add epe or size.
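For illustration, one way the defaults dict above could be extended; the "optical_flow" key and the empty field list are assumptions about how the test would be updated:

classification_fields = ["size", "categories", "acc@1", "acc@5"]
defaults = {
    "all": ["interpolation", "recipe"],
    "models": classification_fields,
    "detection": ["categories", "map"],
    "quantization": classification_fields + ["backend", "quantization", "unquantized"],
    "segmentation": ["categories", "mIoU", "acc"],
    "video": classification_fields,
    # hypothetical new entry: empty for now, or e.g. ["epe"] once such a field exists in the meta-data
    "optical_flow": [],
}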


    transforms=RaftEval,
    meta={
        "recipe": "https://github.com/princeton-vl/RAFT",
        "sintel_train_cleanpass_epe": 1.4411,
Contributor: Does it make sense to rename one of them as the default epe? This would allow you to add the metric to the schema of meta-data for optical flow models. It's also worth considering introducing a dictionary entry in the meta-data that holds the epe values for different datasets, etc.

NicolasHug (Member Author), replying to "Does it make sense to rename one of them as the default epe?": Unfortunately no, because the rest of the weights will be trained on Sintel, so reporting the epe on the train set would not be relevant.

NicolasHug (Member Author): I'm happy to have a dict or something else to properly keep track of the other metrics, though; ultimately I think it would make sense to also have 1px, 3px, etc. We'll have a better idea of what it should look like once the rest of the weights are available.

Contributor: Sounds good, no strong opinions. You could dump all the metrics in an epe dictionary; then you would be able to include it in the schema. Up to you.
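For illustration, one possible shape for such an epe dictionary in the meta-data, reusing the numbers already quoted for the original (C_T_V1) weights; this is only a sketch, not the structure adopted in this PR:

meta = {
    "recipe": "https://github.com/princeton-vl/RAFT",
    # hypothetical grouping of the per-dataset metrics
    "epe": {
        "sintel_train_cleanpass": 1.4411,
        "sintel_train_finalpass": 2.7894,
    },
}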

datumbox (Contributor) left a comment: LGTM, thanks @NicolasHug.

NicolasHug merged commit 849d02b into pytorch:main on Dec 8, 2021.
facebook-github-bot pushed a commit that referenced this pull request Dec 17, 2021
…5060)

Reviewed By: fmassa

Differential Revision: D33185004

fbshipit-source-id: bdd968bd22775c2f63a8e67877b6482bfb58cc5a