@@ -143,6 +143,60 @@ torchrun --nproc_per_node=8 train.py\
```
Here `$MODEL` is one of `regnet_x_32gf`, `regnet_y_16gf` and `regnet_y_32gf`.

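For example (our own illustration, not part of the original instructions), you could pick one of the listed variants by setting the variable before launching the command above:

```
# Hypothetical usage: select a RegNet variant; the training command above
# then picks it up via $MODEL.
export MODEL=regnet_y_16gf
```
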
### Vision Transformer

#### vit_b_16
```
torchrun --nproc_per_node=8 train.py\
    --model vit_b_16 --epochs 300 --batch-size 512 --opt adamw --lr 0.003 --wd 0.3\
    --lr-scheduler cosineannealinglr --lr-warmup-method linear --lr-warmup-epochs 30\
    --lr-warmup-decay 0.033 --amp --label-smoothing 0.11 --mixup-alpha 0.2 --auto-augment ra\
    --clip-grad-norm 1 --ra-sampler --cutmix-alpha 1.0 --model-ema
```

Note that the above command corresponds to training on a single node with 8 GPUs.
For generating the pre-trained weights, we trained with 8 nodes, each with 8 GPUs (for a total of 64 GPUs),
and `--batch-size 64`.
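
For reference, a multi-node launch for this recipe could look roughly like the sketch below. This is our own illustration rather than the exact command used for the released weights: the rendezvous backend, endpoint, port and job id are placeholders, while the training flags mirror the single-node command above with the per-GPU batch size lowered to 64 as noted. The same command would be run on each of the 8 nodes.

```
# Sketch of an 8-node x 8-GPU launch (64 GPUs total). RDZV_HOST is a placeholder
# for the address of the rendezvous node; the c10d backend, port and job id are
# illustrative choices, not values from the original recipe.
torchrun --nnodes=8 --nproc_per_node=8\
    --rdzv_backend=c10d --rdzv_endpoint=$RDZV_HOST:29500 --rdzv_id=vit_b_16_job\
    train.py\
    --model vit_b_16 --epochs 300 --batch-size 64 --opt adamw --lr 0.003 --wd 0.3\
    --lr-scheduler cosineannealinglr --lr-warmup-method linear --lr-warmup-epochs 30\
    --lr-warmup-decay 0.033 --amp --label-smoothing 0.11 --mixup-alpha 0.2 --auto-augment ra\
    --clip-grad-norm 1 --ra-sampler --cutmix-alpha 1.0 --model-ema
```

The same launch pattern applies to the other ViT variants below, adjusting `--nnodes` and the per-GPU `--batch-size` to the values given in each subsection.
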
#### vit_b_32
```
torchrun --nproc_per_node=8 train.py\
    --model vit_b_32 --epochs 300 --batch-size 512 --opt adamw --lr 0.003 --wd 0.3\
    --lr-scheduler cosineannealinglr --lr-warmup-method linear --lr-warmup-epochs 30\
    --lr-warmup-decay 0.033 --amp --label-smoothing 0.11 --mixup-alpha 0.2 --auto-augment imagenet\
    --clip-grad-norm 1 --ra-sampler --cutmix-alpha 1.0 --model-ema
```

Note that the above command corresponds to training on a single node with 8 GPUs.
For generating the pre-trained weights, we trained with 2 nodes, each with 8 GPUs (for a total of 16 GPUs),
and `--batch-size 256`.

#### vit_l_16
```
torchrun --nproc_per_node=8 train.py\
    --model vit_l_16 --epochs 600 --batch-size 128 --lr 0.5 --lr-scheduler cosineannealinglr\
    --lr-warmup-method linear --lr-warmup-epochs 5 --label-smoothing 0.1 --mixup-alpha 0.2\
    --auto-augment ta_wide --random-erase 0.1 --weight-decay 0.00002 --norm-weight-decay 0.0\
    --clip-grad-norm 1 --ra-sampler --cutmix-alpha 1.0 --model-ema --val-resize-size 232
```

Note that the above command corresponds to training on a single node with 8 GPUs.
For generating the pre-trained weights, we trained with 2 nodes, each with 8 GPUs (for a total of 16 GPUs),
and `--batch-size 64`.

#### vit_l_32
```
torchrun --nproc_per_node=8 train.py\
    --model vit_l_32 --epochs 300 --batch-size 512 --opt adamw --lr 0.003 --wd 0.3\
    --lr-scheduler cosineannealinglr --lr-warmup-method linear --lr-warmup-epochs 30\
    --lr-warmup-decay 0.033 --amp --label-smoothing 0.11 --mixup-alpha 0.2 --auto-augment ra\
    --clip-grad-norm 1 --ra-sampler --cutmix-alpha 1.0 --model-ema
```

Note that the above command corresponds to training on a single node with 8 GPUs.
For generating the pre-trained weights, we trained with 8 nodes, each with 8 GPUs (for a total of 64 GPUs),
and `--batch-size 64`.
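
As a quick sanity check (our own arithmetic, not part of the original notes), the effective number of images per optimization step in each multi-node setup matches the corresponding single-node command:

```
# images per step = nodes x GPUs-per-node x per-GPU batch size
echo $(( 8 * 8 * 64 ))    # vit_b_16: 4096, same as 1 x 8 x 512
echo $(( 2 * 8 * 256 ))   # vit_b_32: 4096, same as 1 x 8 x 512
echo $(( 2 * 8 * 64 ))    # vit_l_16: 1024, same as 1 x 8 x 128
echo $(( 8 * 8 * 64 ))    # vit_l_32: 4096, same as 1 x 8 x 512
```
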
## Mixed precision training
Automatic Mixed Precision (AMP) training on GPU in PyTorch can be enabled with the [torch.cuda.amp](https://pytorch.org/docs/stable/amp.html?highlight=amp#module-torch.cuda.amp) module.