
Commit 32e5700

bjuncek and vfdev-5 authored
[docs] minor README changes for VideoReference PR (#2957)
* removing the tab?
* initial commit
* Addressing Victor's comments

Co-authored-by: vfdev <[email protected]>
1 parent 3298a96 commit 32e5700

Lines changed: 6 additions & 7 deletions
@@ -1,18 +1,18 @@
# Video Classification

- TODO: Add some info about the context, dataset we use etc
+ We present a simple training script that can be used to replicate the results of [ResNet-based video models](https://research.fb.com/wp-content/uploads/2018/04/a-closer-look-at-spatiotemporal-convolutions-for-action-recognition.pdf). All models are trained on the [Kinetics400 dataset](https://deepmind.com/research/open-source/kinetics), a benchmark dataset for human-action recognition. The accuracy is reported on the traditional validation split.

## Data preparation

If you already have downloaded [Kinetics400 dataset](https://deepmind.com/research/open-source/kinetics),
please proceed directly to the next section.

- To download videos, one can use https://github.com/Showmax/kinetics-downloader
+ To download videos, one can use https://github.com/Showmax/kinetics-downloader. Please note that the dataset can take upwards of 400 GB, depending on the quality setting used during download.

## Training

We assume the training and validation AVI videos are stored at `/data/kinectics400/train` and
- `/data/kinectics400/val`.
+ `/data/kinectics400/val`. For training we suggest starting with the hyperparameters reported in the [paper](https://research.fb.com/wp-content/uploads/2018/04/a-closer-look-at-spatiotemporal-convolutions-for-action-recognition.pdf) in order to match the performance of those models. The clip sampling strategy is a particularly important training parameter, and we suggest using random temporal jittering, that is, sampling multiple clips from each video with random start times at every epoch. This functionality is built into our training script, and the relevant hyperparameters are set by default.

### Multiple GPUs

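The random temporal jittering described in the training paragraph above can be reproduced with the clip samplers that ship with torchvision. The sketch below is illustrative rather than an excerpt from `train.py`: it assumes the AVI files sit under `/data/kinectics400/train/<class>/*.avi` as described, that the installed torchvision build exposes `Kinetics400` and `RandomClipSampler`, and the clip length and clips-per-video values are arbitrary choices for the example.

```python
# Illustrative sketch, not the reference script: build the dataset from the
# expected folder layout and draw randomly positioned clips from each video.
from torchvision.datasets import Kinetics400
from torchvision.datasets.samplers import RandomClipSampler

dataset = Kinetics400(
    "/data/kinectics400/train",  # one sub-folder per action class, AVI files inside
    frames_per_clip=16,          # frames per sampled clip (illustrative value)
    step_between_clips=1,        # stride between candidate clip start positions
    extensions=("avi",),
)  # indexing every video to build the clip metadata can take a while

# Random temporal jittering: up to 5 randomly placed clips per video; recreating
# the sampler each epoch yields different start times.
sampler = RandomClipSampler(dataset.video_clips, max_clips_per_video=5)

for clip_idx in list(sampler)[:3]:
    video, audio, label = dataset[int(clip_idx)]
    print(video.shape, label)    # video is a [T, H, W, C] uint8 tensor
```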
@@ -21,7 +21,8 @@ Run the training on a single node with 8 GPUs:
python -m torch.distributed.launch --nproc_per_node=8 --use_env train.py --data-path=/data/kinectics400 --train-dir=train --val-dir=val --batch-size=16 --cache-dataset --sync-bn --apex
```

-
+ **Note:** All our models were trained on 8 nodes with 8 V100 GPUs each, for a total of 64 GPUs. The expected training time on 64 GPUs is about 24 hours, depending on the storage solution.
+ **Note 2:** Hyperparameters for exact replication of our training can be found [here](https://github.com/pytorch/vision/blob/master/torchvision/models/video/README.md). Some hyperparameters, such as the learning rate, are scaled linearly with the number of GPUs.

### Single GPU

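Note 2 in the hunk above states that some hyperparameters, such as the learning rate, are scaled linearly with the number of GPUs. A tiny sketch of that rule; the base values are placeholders, not the actual hyperparameters from the linked README:

```python
def scale_lr(base_lr: float, base_gpus: int, num_gpus: int) -> float:
    """Linear scaling rule: grow the learning rate in proportion to the GPU count."""
    return base_lr * num_gpus / base_gpus

# Placeholder numbers: a setting tuned for 8 GPUs, scaled up to the 64-GPU run.
print(scale_lr(base_lr=0.01, base_gpus=8, num_gpus=64))  # 0.08
```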
@@ -30,6 +31,4 @@ python -m torch.distributed.launch --nproc_per_node=8 --use_env train.py --data-

```bash
python train.py --data-path=/data/kinectics400 --train-dir=train --val-dir=val --batch-size=8 --cache-dataset
- ```
-
-
+ ```
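For completeness, the ResNet-based video architectures that this README trains are exposed under `torchvision.models.video`. The snippet below is a sketch assuming a torchvision build that ships these models with Kinetics400-pretrained weights behind the `pretrained=True` flag; the dummy clip shape is illustrative.

```python
import torch
from torchvision.models.video import r2plus1d_18

# pretrained=True downloads weights trained on Kinetics400, i.e. the outcome of
# the training procedure this README describes.
model = r2plus1d_18(pretrained=True)
model.eval()

# Dummy clip shaped [batch, channels, frames, height, width].
clip = torch.randn(1, 3, 16, 112, 112)
with torch.no_grad():
    logits = model(clip)
print(logits.shape)  # torch.Size([1, 400]): one score per Kinetics400 class
```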
