
Add pretrained weights on Chairs and Things for raft_large #5060


Merged: 6 commits merged into pytorch:main on Dec 8, 2021

Conversation

NicolasHug (Member) commented Dec 8, 2021

Towards #4644

This PR adds pretrained Chairs + Things weights for raft_large. The weights can be evaluated on the training sets of Sintel or Kitti.


Some manual tests making sure everything works fine:

Using --weights Raft_Large_Weights.C_T_V2

(raft) ➜  vision git:(raft_pretrained_CT) ✗ torchrun --nproc_per_node 8 --nnodes 1 references/optical_flow/train.py --val-dataset sintel --batch-size 10 --dataset-root /data/home/nicolashug/cluster/work/downloads --model raft_large --weights Raft_Large_Weights.C_T_V2
Sintel val clean Total time: 0:00:15
Batch-processed 1040 / 1041 samples. Going to process the remaining samples individually, if any.
Sintel val clean epe: 1.3825	1px: 0.9028	3px: 0.9573	5px: 0.9697	per_image_epe: 1.3782	f1: 4.0234
Sintel val final Total time: 0:00:12
Batch-processed 1040 / 1041 samples. Going to process the remaining samples individually, if any.
Sintel val final epe: 2.7148	1px: 0.8526	3px: 0.9203	5px: 0.9392	per_image_epe: 2.7199	f1: 7.6100

Using --pretrained

(raft) ➜  vision git:(raft_pretrained_CT) ✗ torchrun --nproc_per_node 8 --nnodes 1 references/optical_flow/train.py --val-dataset sintel --batch-size 10 --dataset-root /data/home/nicolashug/cluster/work/downloads --model raft_large --pretrained
Sintel val clean Total time: 0:00:14
Batch-processed 1040 / 1041 samples. Going to process the remaining samples individually, if any.
Sintel val clean epe: 1.3825	1px: 0.9028	3px: 0.9573	5px: 0.9697	per_image_epe: 1.3782	f1: 4.0234
Sintel val final Total time: 0:00:12
Batch-processed 1040 / 1041 samples. Going to process the remaining samples individually, if any.
Sintel val final epe: 2.7148	1px: 0.8526	3px: 0.9203	5px: 0.9392	per_image_epe: 2.7199	f1: 7.6100

Using --weights Raft_Large_Weights.C_T_V1 (Original weights)

Sintel val clean Total time: 0:04:06
Batch-processed 1041 / 1041 samples. Going to process the remaining samples individually, if any.
Sintel val clean epe: 1.4411	1px: 0.9016	3px: 0.9560	5px: 0.9684	per_image_epe: 1.4411	f1: 4.1593
Sintel val final Total time: 0:04:02
Batch-processed 1041 / 1041 samples. Going to process the remaining samples individually, if any.
Sintel val final epe: 2.7894	1px: 0.8528	3px: 0.9190	5px: 0.9381	per_image_epe: 2.7894	f1: 7.7217
Checking that both the string and the enum weight specifications work and that the model is built on CPU:

from torchvision.prototype.models.optical_flow import raft_large, Raft_Large_Weights
assert not next(raft_large(weights="Raft_Large_Weights.C_T_V2").parameters()).is_cuda
assert not next(raft_large(weights=Raft_Large_Weights.C_T_V2).parameters()).is_cuda
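For reference, a minimal inference sketch with the new weights. The two-image call and the list of flow estimates it returns reflect the current RAFT implementation; the random inputs and their sizes are just for illustration (real images should go through the weights' RaftEval preset first):

import torch
from torchvision.prototype.models.optical_flow import raft_large, Raft_Large_Weights

model = raft_large(weights=Raft_Large_Weights.C_T_V2).eval()
# Two image batches with H and W divisible by 8 (required by the 1/8-resolution feature maps).
img1 = torch.rand(1, 3, 368, 496)
img2 = torch.rand(1, 3, 368, 496)
with torch.no_grad():
    flow_predictions = model(img1, img2)  # one flow estimate per refinement iteration
print(flow_predictions[-1].shape)  # final estimate: (1, 2, 368, 496)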

For my own sanity and for reference, here's the slurm script that I used (obviously, these weights correspond to the ones in things/raft-things.pth):

#!/bin/bash
#SBATCH --partition=train
#SBATCH --cpus-per-task=96  # 12 CPUs per GPU
#SBATCH --gpus-per-node=8
#SBATCH --nodes=1
#SBATCH --time=70:00:00
#SBATCH --output=/data/home/nicolashug/cluster/experiments/slurm-%j.out
#SBATCH --error=/data/home/nicolashug/cluster/experiments/slurm-%j.err



n_gpus=8  # If you modify these, also update the equivalent above.
n_nodes=1

output_dir=~/cluster/experiments/id_$SLURM_JOB_ID
mkdir -p $output_dir

this_script=./train.sh  # depends on where you call it from
cp $this_script $output_dir

function unused_port() {
    # Find a random unused port. It's needed if you run multiple sbatches on the same node
    N=${1:-1}
    comm -23 \
        <(seq "1025" "65535" | sort) \
        <(ss -Htan |
            awk '{print $4}' |
            cut -d':' -f2 |
            sort -u) |
        shuf |
        head -n "$N"
}
master_port=$(unused_port)

dataset_root=/data/home/nicolashug/cluster/work/downloads

# FlyingChairs
batch_size_chairs=2
lr_chairs=0.0004
num_steps_chairs=100000
name_chairs=raft_chairs
wdecay_chairs=0.0001

chairs_dir=$output_dir/chairs
mkdir -p $chairs_dir
torchrun --nproc_per_node $n_gpus --nnodes $n_nodes --master_port $master_port references/optical_flow/train.py \
    --dataset-root $dataset_root \
    --name $name_chairs \
    --train-dataset chairs \
    --batch-size $batch_size_chairs \
    --lr $lr_chairs \
    --weight-decay $wdecay_chairs \
    --num-steps $num_steps_chairs \
    --output-dir $chairs_dir

# FlyingThings3D
batch_size_things=2
lr_things=0.000125
num_steps_things=100000
name_things=raft_things
wdecay_things=0.0001

things_dir=$output_dir/things
mkdir -p $things_dir
torchrun --nproc_per_node $n_gpus --nnodes $n_nodes --master_port $master_port references/optical_flow/train.py \
    --dataset-root $dataset_root \
    --name $name_things \
    --train-dataset things \
    --batch-size $batch_size_things \
    --lr $lr_things \
    --weight-decay $wdecay_things \
    --num-steps $num_steps_things \
    --freeze-batch-norm \
    --output-dir $things_dir \
    --resume $chairs_dir/$name_chairs.pth

# Sintel S+K+H
batch_size_sintel_skh=2
lr_sintel_skh=0.000125
num_steps_sintel_skh=100000
name_sintel_skh=raft_sintel_skh
wdecay_sintel_skh=0.00001
gamma_sintel_skh=0.85

sintel_skh_dir=$output_dir/sintel_skh
mkdir -p $sintel_skh_dir
torchrun --nproc_per_node $n_gpus --nnodes $n_nodes --master_port $master_port references/optical_flow/train.py \
    --dataset-root $dataset_root \
    --name $name_sintel_skh \
    --train-dataset sintel_SKH \
    --batch-size $batch_size_sintel_skh \
    --lr $lr_sintel_skh \
    --weight-decay $wdecay_sintel_skh \
    --gamma $gamma_sintel_skh \
    --num-steps $num_steps_sintel_skh \
    --freeze-batch-norm \
    --output-dir $sintel_skh_dir \
    --resume $things_dir/$name_things.pth

# Kitti
batch_size_kitti=2
lr_kitti=0.0001
num_steps_kitti=50000
name_kitti=raft_kitti
wdecay_kitti=0.00001
gamma_kitti=0.85

kitti_dir=$output_dir/kitti
mkdir -p $kitti_dir
torchrun --nproc_per_node $n_gpus --nnodes $n_nodes --master_port $master_port references/optical_flow/train.py \
    --dataset-root $dataset_root \
    --name $name_kitti \
    --train-dataset kitti \
    --batch-size $batch_size_kitti \
    --lr $lr_kitti \
    --weight-decay $wdecay_kitti \
    --gamma $gamma_kitti \
    --num-steps $num_steps_kitti \
    --freeze-batch-norm \
    --output-dir $kitti_dir \
    --resume $sintel_skh_dir/$name_sintel_skh.pth

The code to map the original paper's weights to ours is:

def map_orig_to_ours(orig, mine=None):
    # TODO: remove
    d = {}
    used_s_orig = set()
    used_s_mine = set()

    def assert_and_add(s_orig, s_mine):
        # print(s_orig, s_mine)
        # print(orig[s_orig].shape, mine[s_mine].shape)

        assert s_orig not in used_s_orig
        assert s_mine not in used_s_mine

        if mine is not None:
            assert s_mine in mine
        assert s_orig in orig
        if mine is not None:
            assert orig[s_orig].shape == mine[s_mine].shape
        d["module." + s_mine] = orig[s_orig]
        used_s_orig.add(s_orig)
        used_s_mine.add(s_mine)

    for encoder_orig, encoder_mine in (
        ("fnet", "feature_encoder"),
        ("cnet", "context_encoder"),
    ):
        for attr in ("bias", "weight"):
            s_orig = f"module.{encoder_orig}.conv1.{attr}"
            s_mine = f"{encoder_mine}.convnormrelu.0.{attr}"
            assert_and_add(s_orig, s_mine)

            s_orig = f"module.{encoder_orig}.conv2.{attr}"
            s_mine = f"{encoder_mine}.conv.{attr}"
            assert_and_add(s_orig, s_mine)

            for layer in (1, 2, 3):
                for block in (0, 1):
                    for conv in (1, 2):
                        s_orig = f"module.{encoder_orig}.layer{layer}.{block}.conv{conv}.{attr}"
                        s_mine = f"{encoder_mine}.layer{layer}.{block}.convnormrelu{conv}.0.{attr}"
                        assert_and_add(s_orig, s_mine)

            for layer in (2, 3):
                s_orig = f"module.{encoder_orig}.layer{layer}.0.downsample.0.{attr}"
                s_mine = f"{encoder_mine}.layer{layer}.0.downsample.0.{attr}"
                assert_and_add(s_orig, s_mine)

    encoder_orig, encoder_mine = "cnet", "context_encoder"
    for attr in (
        "bias",
        "weight",
        "running_mean",
        "running_var",
        "num_batches_tracked",
    ):
        s_orig = f"module.{encoder_orig}.norm1.{attr}"
        s_mine = f"{encoder_mine}.convnormrelu.1.{attr}"
        assert_and_add(s_orig, s_mine)
        for layer in (1, 2, 3):
            for block in (0, 1):
                for norm in (1, 2):
                    s_orig = f"module.{encoder_orig}.layer{layer}.{block}.norm{norm}.{attr}"
                    s_mine = f"{encoder_mine}.layer{layer}.{block}.convnormrelu{norm}.1.{attr}"
                    assert_and_add(s_orig, s_mine)
        for layer in (2, 3):
            s_orig = f"module.{encoder_orig}.layer{layer}.0.downsample.1.{attr}"
            s_mine = f"{encoder_mine}.layer{layer}.0.downsample.1.{attr}"
            assert_and_add(s_orig, s_mine)

    corr_orig, corr_mine = (
        "module.update_block.encoder.",
        "update_block.motion_encoder.",
    )
    for attr in ("bias", "weight"):
        for i in (1, 2):
            s_orig = f"{corr_orig}convc{i}.{attr}"
            s_mine = f"{corr_mine}convcorr{i}.0.{attr}"
            assert_and_add(s_orig, s_mine)
            s_orig = f"{corr_orig}convf{i}.{attr}"
            s_mine = f"{corr_mine}convflow{i}.0.{attr}"
            assert_and_add(s_orig, s_mine)
        s_orig = f"{corr_orig}conv.{attr}"
        s_mine = f"{corr_mine}conv.0.{attr}"
        assert_and_add(s_orig, s_mine)

    rec_orig, rec_mine = "module.update_block.gru", "update_block.recurrent_block"
    for attr in ("bias", "weight"):
        for i in (1, 2):
            for conv in ("convz", "convr", "convq"):
                s_orig = f"{rec_orig}.{conv}{i}.{attr}"
                s_mine = f"{rec_mine}.convgru{i}.{conv}.{attr}"
                assert_and_add(s_orig, s_mine)

    flow_orig, flow_mine = "module.update_block.flow_head", "update_block.flow_head"
    for attr in ("bias", "weight"):
        for i in (1, 2):
            s_orig = f"{flow_orig}.conv{i}.{attr}"
            s_mine = f"{flow_mine}.conv{i}.{attr}"
            assert_and_add(s_orig, s_mine)
    for s_orig, s_mine in zip(
        (
            "module.update_block.mask.0.weight",
            "module.update_block.mask.0.bias",
            "module.update_block.mask.2.weight",
            "module.update_block.mask.2.bias",
        ),
        (
            "mask_predictor.convrelu.0.weight",
            "mask_predictor.convrelu.0.bias",
            "mask_predictor.conv.weight",
            "mask_predictor.conv.bias",
        ),
    ):
        assert_and_add(s_orig, s_mine)

    if mine is not None:
        print(len(d), len(orig), len(mine))
        assert not (set(mine.keys()) - set(d.keys()))
    return d
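A rough sketch of how such a mapping might be applied end to end. The checkpoint path, the DataParallel wrapping (to reproduce the "module." prefix produced by the mapping), and strict=False are assumptions for illustration, not part of this PR:

import torch
from torchvision.prototype.models.optical_flow import raft_large

orig = torch.load("things/raft-things.pth", map_location="cpu")  # original RAFT checkpoint
remapped = map_orig_to_ours(orig)  # keys come back prefixed with "module."
model = torch.nn.DataParallel(raft_large())  # wrapping adds the matching "module." prefix
model.load_state_dict(remapped, strict=False)  # strict=False in case some buffers are not covered
torch.save(model.module.state_dict(), "raft_large_C_T_V2.pth")  # save an un-prefixed state dict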

cc @datumbox

NicolasHug added labels on Dec 8, 2021: module: models, module: reference scripts, other (if you have no clue or if you will manually handle the PR in the release notes).
facebook-github-bot commented Dec 8, 2021

💊 CI failures summary and remediations

As of commit 57aff36 (more details on the Dr. CI page):


💚 💚 Looks good so far! There are no failures yet. 💚 💚


This comment was automatically generated by Dr. CI.

Please report bugs/suggestions to the (internal) Dr. CI Users group.


@@ -19,6 +20,9 @@
)


_MODELS_URLS = {"raft_large": "https://download.pytorch.org/models/raft_large_C_T_V2-1bb1363a.pth"}
NicolasHug (Member Author): Once the PR is merged, I will upload this to manifold.

Contributor: FYI, all current models use model_urls.
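For context, a sketch of the convention the comment refers to: the existing (non-prototype) builders expose a module-level model_urls dict, so the equivalent here would look something like the following (same URL, only the name differs):

model_urls = {"raft_large": "https://download.pytorch.org/models/raft_large_C_T_V2-1bb1363a.pth"}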

NicolasHug mentioned this pull request on Dec 8, 2021.
datumbox (Contributor) left a comment: Thanks @NicolasHug, I've added a few minor comments and nits. Let me know what you think.

Also, there are a few more prototype tests where you should add models.optical_flow, for example test_schema_meta_validation.

There you need to add the schema for optical flow:

def test_schema_meta_validation(model_fn):
    classification_fields = ["size", "categories", "acc@1", "acc@5"]
    defaults = {
        "all": ["interpolation", "recipe"],
        "models": classification_fields,
        "detection": ["categories", "map"],
        "quantization": classification_fields + ["backend", "quantization", "unquantized"],
        "segmentation": ["categories", "mIoU", "acc"],
        "video": classification_fields,
    }

In your case it's going to be empty, unless you add epe or size.
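For illustration, one way the defaults dict above could be extended; the "optical_flow" key and the empty field list are assumptions about how the test would be updated:

classification_fields = ["size", "categories", "acc@1", "acc@5"]
defaults = {
    "all": ["interpolation", "recipe"],
    "models": classification_fields,
    "detection": ["categories", "map"],
    "quantization": classification_fields + ["backend", "quantization", "unquantized"],
    "segmentation": ["categories", "mIoU", "acc"],
    "video": classification_fields,
    # hypothetical new entry: empty for now, or e.g. ["epe"] once such a field exists in the meta-data
    "optical_flow": [],
}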


    transforms=RaftEval,
    meta={
        "recipe": "https://github.com/princeton-vl/RAFT",
        "sintel_train_cleanpass_epe": 1.4411,
Contributor: Does it make sense to rename one of them as the default epe? This would allow you to add the metric to the schema of meta-data for optical flow models. It's also worth considering introducing a dictionary entry in the meta-data that holds the epe values for different datasets, etc.

NicolasHug (Member Author), replying to "Does it make sense to rename one of them as the default epe?": Unfortunately no, because the rest of the weights will be trained on Sintel, so reporting the epe on the train set would not be relevant.

NicolasHug (Member Author): I'm happy to have a dict or something else to properly keep track of the other metrics, though; ultimately I think it would make sense to also have 1px, 3px, etc. We'll have a better idea of what it should look like once the rest of the weights are available.

Contributor: Sounds good, no strong opinions. You could dump all the metrics in an epe dictionary; then you would be able to include it in the schema. Up to you.
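For illustration, one possible shape for such an epe dictionary in the meta-data, reusing the numbers already quoted for the original (C_T_V1) weights; this is only a sketch, not the structure adopted in this PR:

meta = {
    "recipe": "https://github.com/princeton-vl/RAFT",
    # hypothetical grouping of the per-dataset metrics
    "epe": {
        "sintel_train_cleanpass": 1.4411,
        "sintel_train_finalpass": 2.7894,
    },
}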

datumbox (Contributor) left a comment: LGTM, thanks @NicolasHug.

NicolasHug merged commit 849d02b into pytorch:main on Dec 8, 2021.
facebook-github-bot pushed a commit that referenced this pull request Dec 17, 2021
…5060)

Reviewed By: fmassa

Differential Revision: D33185004

fbshipit-source-id: bdd968bd22775c2f63a8e67877b6482bfb58cc5a