7. Fine-tuning SOTA video models on your own dataset

This is a step-by-step video action recognition tutorial using the GluonCV toolkit. Readers should have basic knowledge of deep learning and be familiar with the Gluon API. New users may first go through A 60-minute Gluon Crash Course. You can Start Training Now or Dive into Deep.

Fine-tuning is an important way to obtain good video models on your own data when you don’t have a large annotated dataset or the computing resources to train a model from scratch for your use case. In this tutorial, we provide a simple unified solution. The only thing you need to prepare is a text file containing information about your videos (e.g., the path to each video); we will take care of the rest. You can start fine-tuning from many popular pre-trained models (e.g., I3D, I3D-nonlocal, SlowFast) using a single command line.

Start Training Now

Note

Feel free to skip the tutorial because the training script is self-contained and ready to launch.

Download Full Python Script: train_recognizer.py

For more training command options, please run python train_recognizer.py -h. Please check out the model_zoo for the training commands that reproduce the pretrained models.

First, let’s import the necessary libraries into Python.

from __future__ import division

import argparse, time, logging, os, sys, math

import numpy as np
import mxnet as mx
import gluoncv as gcv
from mxnet import gluon, nd, init, context
from mxnet import autograd as ag
from mxnet.gluon import nn
from mxnet.gluon.data.vision import transforms

from gluoncv.data.transforms import video
from gluoncv.data import VideoClsCustom
from gluoncv.model_zoo import get_model
from gluoncv.utils import makedirs, LRSequential, LRScheduler, split_and_load, TrainingHistory

Custom DataLoader

We provide a general dataloader for you to use on your own dataset. Your data can be stored in any hierarchy, either in video format or already decoded to frames. The only thing you need to prepare is a text file, train.txt.

If your data is stored in image format (already decoded to frames), your train.txt should look like:

video_001 200 0
video_001 200 0
video_002 300 0
video_003 100 1
video_004 400 2
......
video_100 200 10

There are three items in each line, separated by spaces. The first item is the path to a training video, e.g., video_001, which should be a folder containing the frames of video_001.mp4. The second item is the number of frames in that video, e.g., 200. The third item is the label of that video, e.g., 0.
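A list file in this format can be generated with a few lines of Python. The sketch below is a minimal illustration and not part of GluonCV: the helper name build_frame_list, the labels mapping, and the accepted image extensions are all assumptions made for this example. It counts the image files in each frame folder and writes one "path num_frames label" line per video.

```python
import os
import tempfile


def build_frame_list(root, labels, out_path, exts=(".jpg", ".jpeg", ".png")):
    """Write one 'path num_frames label' line per frame folder.

    root     -- directory that contains the frame folders
    labels   -- dict mapping a folder path (relative to root) to an integer label
    out_path -- where to write the list file, e.g. train.txt
    """
    lines = []
    for rel_dir, label in sorted(labels.items()):
        frame_dir = os.path.join(root, rel_dir)
        # Count only image files; anything else in the folder is ignored.
        num_frames = sum(
            1 for name in os.listdir(frame_dir) if name.lower().endswith(exts)
        )
        lines.append("%s %d %d" % (rel_dir, num_frames, label))
    with open(out_path, "w") as f:
        f.write("\n".join(lines) + "\n")
    return lines


# Demo on a throwaway directory so the sketch runs anywhere.
with tempfile.TemporaryDirectory() as root:
    for folder, n in (("video_001", 3), ("video_002", 2)):
        os.makedirs(os.path.join(root, folder))
        for i in range(n):
            open(os.path.join(root, folder, "img_%05d.jpg" % i), "w").close()
    print(build_frame_list(root, {"video_001": 0, "video_002": 1},
                           os.path.join(root, "train.txt")))
```

For a real dataset, the labels mapping would typically be derived from however the class of each video is recorded, for example a parent folder name or a separate annotation file.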

If your data is stored in video format, your train.txt should look like:

video_001.mp4 200 0
video_001.mp4 200 0
video_002.mp4 300 0
video_003.mp4 100 1
video_004.mp4 400 2
......
video_100.mp4 200 10

Similarly, there are three items in each line, separated by spaces. The first item is the path to a training video, e.g., video_001.mp4. The second item is the number of frames in that video, but you can put any number here because our video loader will recompute the number of frames automatically during training. The third item is the label of that video, e.g., 0.
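Before launching training, it can help to sanity-check the list file. The following is a hedged sketch that is independent of GluonCV's own loader: parse_setting_line, summarize, and VideoRecord are hypothetical names introduced here. Splitting from the right is a choice made for this example so that a path containing spaces still parses; it does not imply that the library's loader accepts such paths.

```python
from collections import Counter, namedtuple

VideoRecord = namedtuple("VideoRecord", ["path", "num_frames", "label"])


def parse_setting_line(line):
    """Split 'path num_frames label' into a typed record."""
    # Split from the right so only the last two fields must be space-free.
    parts = line.strip().rsplit(None, 2)
    if len(parts) != 3:
        raise ValueError("expected 3 space-separated items, got: %r" % line)
    path, num_frames, label = parts
    return VideoRecord(path, int(num_frames), int(label))


def summarize(lines):
    """Return (records, per-label counts), skipping blank lines."""
    records = [parse_setting_line(l) for l in lines if l.strip()]
    return records, Counter(r.label for r in records)


sample = ["video_001.mp4 200 0", "video_002.mp4 300 0", "video_003.mp4 100 1"]
records, counts = summarize(sample)
print(records[0])
print(dict(counts))
```

The per-label counts make it easy to spot missing or badly unbalanced classes before any GPU time is spent.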

Once you have prepared train.txt, you are good to go. Just use our general dataloader VideoClsCustom to load your data.

In this tutorial, we will use the UCF101 dataset as an example. For your own dataset, just set root to your data directory and setting to your prepared text file. Let’s first define some basics.

num_gpus = 1
ctx = [mx.gpu(i) for i in range(num_gpus)]
transform_train = video.VideoGroupTrainTransform(size=(224, 224), scale_ratios=[1.0, 0.8], mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225])
per_device_batch_size = 5
num_workers = 0
batch_size = per_device_batch_size * num_gpus

train_dataset = VideoClsCustom(root=os.path.expanduser('~/.mxnet/datasets/ucf101/rawframes'),
                               setting=os.path.expanduser('~/.mxnet/datasets/ucf101/ucfTrainTestlist/ucf101_train_split_1_rawframes.txt'),
                               train=True,
                               new_length=32,
                               transform=transform_train)
print('Load %d training samples.' % len(train_dataset))
train_data = gluon.data.DataLoader(train_dataset, batch_size=batch_size,
                                   shuffle=True, num_workers=num_workers)

Out:

Load 9537 training samples.
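The batch size settings above determine how many iterations make up one epoch. As a small worked example, the sketch below computes that count from the 9537 samples reported in the output. The function name and the keep_last flag are illustrative; whether a final partial batch is kept depends on the DataLoader's last_batch setting, which is assumed here to keep it.

```python
import math


def batches_per_epoch(num_samples, per_device_batch_size, num_gpus, keep_last=True):
    """Number of iterations in one pass over the dataset."""
    # The global batch is split evenly across devices.
    batch_size = per_device_batch_size * num_gpus
    if keep_last:
        # A final, smaller batch is still served.
        return math.ceil(num_samples / batch_size)
    # A final, smaller batch is dropped.
    return num_samples // batch_size


print(batches_per_epoch(9537, 5, 1))                   # 1908
print(batches_per_epoch(9537, 5, 1, keep_last=False))  # 1907
print(batches_per_epoch(9537, 5, 4))                   # 477
```

With more GPUs the per-device batch size stays the same, so the number of iterations per epoch shrinks roughly in proportion to num_gpus.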

Custom Network

You can always define your own network architecture. Here, we want to show how to fine-tune a pre-trained model. Since I3D is a very popular network, we will use I3D with a ResNet50 backbone trained on the Kinetics400 dataset (i.e., i3d_resnet50_v1_kinetics400) as an example.

For simple fine-tuning, people usually just replace the last classification (dense) layer with one whose output size matches the number of classes in their dataset, leaving everything else unchanged. In GluonCV, you can get your customized model with one line of code.
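Before looking at that call, here is a rough picture of what replacing only the classifier amounts to. The sketch is a toy model of the idea and not GluonCV's actual implementation: the function transfer_matching_params, the parameter names, and the rule of copying whenever name and shape agree are all assumptions made for illustration. The shapes mimic a 400-class pretrained head and a 101-class target head.

```python
def transfer_matching_params(pretrained, target):
    """Copy pretrained entries into target where name and shape agree.

    Both arguments map parameter name -> (shape, values). Returns the list
    of names that were copied and the list of names that were left at their
    fresh initialization.
    """
    loaded, skipped = [], []
    for name, (shape, _values) in list(target.items()):
        source = pretrained.get(name)
        if source is not None and source[0] == shape:
            target[name] = source
            loaded.append(name)
        else:
            skipped.append(name)
    return loaded, skipped


# Toy stand-ins for real weights: the values are just tags.
pretrained = {
    "conv_weight": ((64, 3, 5, 7, 7), "kinetics400"),
    "dense_weight": ((400, 2048), "kinetics400"),
    "dense_bias": ((400,), "kinetics400"),
}
target = {
    "conv_weight": ((64, 3, 5, 7, 7), "random"),
    "dense_weight": ((101, 2048), "random"),
    "dense_bias": ((101,), "random"),
}
loaded, skipped = transfer_matching_params(pretrained, target)
print("loaded:", loaded)
print("skipped:", skipped)
```

Only the classifier stays at its fresh initialization, which is the same pattern as the "is done" and "is skipped" entries in the log printed by the GluonCV call that follows.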

net = get_model(name='i3d_resnet50_v1_custom', nclass=101)
net.collect_params().reset_ctx(ctx)
print(net)

Out:

conv14_weight is done with shape:  (64, 3, 5, 7, 7)
batchnorm5_gamma is done with shape:  (64,)
batchnorm5_beta is done with shape:  (64,)
batchnorm5_running_mean is done with shape:  (64,)
batchnorm5_running_var is done with shape:  (64,)
layer1_0_conv0_weight is done with shape:  (64, 64, 3, 1, 1)
layer1_0_batchnorm0_gamma is done with shape:  (64,)
layer1_0_batchnorm0_beta is done with shape:  (64,)
layer1_0_batchnorm0_running_mean is done with shape:  (64,)
layer1_0_batchnorm0_running_var is done with shape:  (64,)
layer1_0_conv1_weight is done with shape:  (64, 64, 1, 3, 3)
layer1_0_batchnorm1_gamma is done with shape:  (64,)
layer1_0_batchnorm1_beta is done with shape:  (64,)
layer1_0_batchnorm1_running_mean is done with shape:  (64,)
layer1_0_batchnorm1_running_var is done with shape:  (64,)
layer1_0_conv2_weight is done with shape:  (256, 64, 1, 1, 1)
layer1_0_batchnorm2_gamma is done with shape:  (256,)
layer1_0_batchnorm2_beta is done with shape:  (256,)
layer1_0_batchnorm2_running_mean is done with shape:  (256,)
layer1_0_batchnorm2_running_var is done with shape:  (256,)
layer1_downsample_conv0_weight is done with shape:  (256, 64, 1, 1, 1)
layer1_downsample_batchnorm0_gamma is done with shape:  (256,)
layer1_downsample_batchnorm0_beta is done with shape:  (256,)
layer1_downsample_batchnorm0_running_mean is done with shape:  (256,)
layer1_downsample_batchnorm0_running_var is done with shape:  (256,)
layer1_1_conv0_weight is done with shape:  (64, 256, 3, 1, 1)
layer1_1_batchnorm0_gamma is done with shape:  (64,)
layer1_1_batchnorm0_beta is done with shape:  (64,)
layer1_1_batchnorm0_running_mean is done with shape:  (64,)
layer1_1_batchnorm0_running_var is done with shape:  (64,)
layer1_1_conv1_weight is done with shape:  (64, 64, 1, 3, 3)
layer1_1_batchnorm1_gamma is done with shape:  (64,)
layer1_1_batchnorm1_beta is done with shape:  (64,)
layer1_1_batchnorm1_running_mean is done with shape:  (64,)
layer1_1_batchnorm1_running_var is done with shape:  (64,)
layer1_1_conv2_weight is done with shape:  (256, 64, 1, 1, 1)
layer1_1_batchnorm2_gamma is done with shape:  (256,)
layer1_1_batchnorm2_beta is done with shape:  (256,)
layer1_1_batchnorm2_running_mean is done with shape:  (256,)
layer1_1_batchnorm2_running_var is done with shape:  (256,)
layer1_2_conv0_weight is done with shape:  (64, 256, 3, 1, 1)
layer1_2_batchnorm0_gamma is done with shape:  (64,)
layer1_2_batchnorm0_beta is done with shape:  (64,)
layer1_2_batchnorm0_running_mean is done with shape:  (64,)
layer1_2_batchnorm0_running_var is done with shape:  (64,)
layer1_2_conv1_weight is done with shape:  (64, 64, 1, 3, 3)
layer1_2_batchnorm1_gamma is done with shape:  (64,)
layer1_2_batchnorm1_beta is done with shape:  (64,)
layer1_2_batchnorm1_running_mean is done with shape:  (64,)
layer1_2_batchnorm1_running_var is done with shape:  (64,)
layer1_2_conv2_weight is done with shape:  (256, 64, 1, 1, 1)
layer1_2_batchnorm2_gamma is done with shape:  (256,)
layer1_2_batchnorm2_beta is done with shape:  (256,)
layer1_2_batchnorm2_running_mean is done with shape:  (256,)
layer1_2_batchnorm2_running_var is done with shape:  (256,)
layer2_0_conv0_weight is done with shape:  (128, 256, 3, 1, 1)
layer2_0_batchnorm0_gamma is done with shape:  (128,)
layer2_0_batchnorm0_beta is done with shape:  (128,)
layer2_0_batchnorm0_running_mean is done with shape:  (128,)
layer2_0_batchnorm0_running_var is done with shape:  (128,)
layer2_0_conv1_weight is done with shape:  (128, 128, 1, 3, 3)
layer2_0_batchnorm1_gamma is done with shape:  (128,)
layer2_0_batchnorm1_beta is done with shape:  (128,)
layer2_0_batchnorm1_running_mean is done with shape:  (128,)
layer2_0_batchnorm1_running_var is done with shape:  (128,)
layer2_0_conv2_weight is done with shape:  (512, 128, 1, 1, 1)
layer2_0_batchnorm2_gamma is done with shape:  (512,)
layer2_0_batchnorm2_beta is done with shape:  (512,)
layer2_0_batchnorm2_running_mean is done with shape:  (512,)
layer2_0_batchnorm2_running_var is done with shape:  (512,)
layer2_downsample_conv0_weight is done with shape:  (512, 256, 1, 1, 1)
layer2_downsample_batchnorm0_gamma is done with shape:  (512,)
layer2_downsample_batchnorm0_beta is done with shape:  (512,)
layer2_downsample_batchnorm0_running_mean is done with shape:  (512,)
layer2_downsample_batchnorm0_running_var is done with shape:  (512,)
layer2_1_conv0_weight is done with shape:  (128, 512, 1, 1, 1)
layer2_1_batchnorm0_gamma is done with shape:  (128,)
layer2_1_batchnorm0_beta is done with shape:  (128,)
layer2_1_batchnorm0_running_mean is done with shape:  (128,)
layer2_1_batchnorm0_running_var is done with shape:  (128,)
layer2_1_conv1_weight is done with shape:  (128, 128, 1, 3, 3)
layer2_1_batchnorm1_gamma is done with shape:  (128,)
layer2_1_batchnorm1_beta is done with shape:  (128,)
layer2_1_batchnorm1_running_mean is done with shape:  (128,)
layer2_1_batchnorm1_running_var is done with shape:  (128,)
layer2_1_conv2_weight is done with shape:  (512, 128, 1, 1, 1)
layer2_1_batchnorm2_gamma is done with shape:  (512,)
layer2_1_batchnorm2_beta is done with shape:  (512,)
layer2_1_batchnorm2_running_mean is done with shape:  (512,)
layer2_1_batchnorm2_running_var is done with shape:  (512,)
layer2_2_conv0_weight is done with shape:  (128, 512, 3, 1, 1)
layer2_2_batchnorm0_gamma is done with shape:  (128,)
layer2_2_batchnorm0_beta is done with shape:  (128,)
layer2_2_batchnorm0_running_mean is done with shape:  (128,)
layer2_2_batchnorm0_running_var is done with shape:  (128,)
layer2_2_conv1_weight is done with shape:  (128, 128, 1, 3, 3)
layer2_2_batchnorm1_gamma is done with shape:  (128,)
layer2_2_batchnorm1_beta is done with shape:  (128,)
layer2_2_batchnorm1_running_mean is done with shape:  (128,)
layer2_2_batchnorm1_running_var is done with shape:  (128,)
layer2_2_conv2_weight is done with shape:  (512, 128, 1, 1, 1)
layer2_2_batchnorm2_gamma is done with shape:  (512,)
layer2_2_batchnorm2_beta is done with shape:  (512,)
layer2_2_batchnorm2_running_mean is done with shape:  (512,)
layer2_2_batchnorm2_running_var is done with shape:  (512,)
layer2_3_conv0_weight is done with shape:  (128, 512, 1, 1, 1)
layer2_3_batchnorm0_gamma is done with shape:  (128,)
layer2_3_batchnorm0_beta is done with shape:  (128,)
layer2_3_batchnorm0_running_mean is done with shape:  (128,)
layer2_3_batchnorm0_running_var is done with shape:  (128,)
layer2_3_conv1_weight is done with shape:  (128, 128, 1, 3, 3)
layer2_3_batchnorm1_gamma is done with shape:  (128,)
layer2_3_batchnorm1_beta is done with shape:  (128,)
layer2_3_batchnorm1_running_mean is done with shape:  (128,)
layer2_3_batchnorm1_running_var is done with shape:  (128,)
layer2_3_conv2_weight is done with shape:  (512, 128, 1, 1, 1)
layer2_3_batchnorm2_gamma is done with shape:  (512,)
layer2_3_batchnorm2_beta is done with shape:  (512,)
layer2_3_batchnorm2_running_mean is done with shape:  (512,)
layer2_3_batchnorm2_running_var is done with shape:  (512,)
layer3_0_conv0_weight is done with shape:  (256, 512, 3, 1, 1)
layer3_0_batchnorm0_gamma is done with shape:  (256,)
layer3_0_batchnorm0_beta is done with shape:  (256,)
layer3_0_batchnorm0_running_mean is done with shape:  (256,)
layer3_0_batchnorm0_running_var is done with shape:  (256,)
layer3_0_conv1_weight is done with shape:  (256, 256, 1, 3, 3)
layer3_0_batchnorm1_gamma is done with shape:  (256,)
layer3_0_batchnorm1_beta is done with shape:  (256,)
layer3_0_batchnorm1_running_mean is done with shape:  (256,)
layer3_0_batchnorm1_running_var is done with shape:  (256,)
layer3_0_conv2_weight is done with shape:  (1024, 256, 1, 1, 1)
layer3_0_batchnorm2_gamma is done with shape:  (1024,)
layer3_0_batchnorm2_beta is done with shape:  (1024,)
layer3_0_batchnorm2_running_mean is done with shape:  (1024,)
layer3_0_batchnorm2_running_var is done with shape:  (1024,)
layer3_downsample_conv0_weight is done with shape:  (1024, 512, 1, 1, 1)
layer3_downsample_batchnorm0_gamma is done with shape:  (1024,)
layer3_downsample_batchnorm0_beta is done with shape:  (1024,)
layer3_downsample_batchnorm0_running_mean is done with shape:  (1024,)
layer3_downsample_batchnorm0_running_var is done with shape:  (1024,)
layer3_1_conv0_weight is done with shape:  (256, 1024, 1, 1, 1)
layer3_1_batchnorm0_gamma is done with shape:  (256,)
layer3_1_batchnorm0_beta is done with shape:  (256,)
layer3_1_batchnorm0_running_mean is done with shape:  (256,)
layer3_1_batchnorm0_running_var is done with shape:  (256,)
layer3_1_conv1_weight is done with shape:  (256, 256, 1, 3, 3)
layer3_1_batchnorm1_gamma is done with shape:  (256,)
layer3_1_batchnorm1_beta is done with shape:  (256,)
layer3_1_batchnorm1_running_mean is done with shape:  (256,)
layer3_1_batchnorm1_running_var is done with shape:  (256,)
layer3_1_conv2_weight is done with shape:  (1024, 256, 1, 1, 1)
layer3_1_batchnorm2_gamma is done with shape:  (1024,)
layer3_1_batchnorm2_beta is done with shape:  (1024,)
layer3_1_batchnorm2_running_mean is done with shape:  (1024,)
layer3_1_batchnorm2_running_var is done with shape:  (1024,)
layer3_2_conv0_weight is done with shape:  (256, 1024, 3, 1, 1)
layer3_2_batchnorm0_gamma is done with shape:  (256,)
layer3_2_batchnorm0_beta is done with shape:  (256,)
layer3_2_batchnorm0_running_mean is done with shape:  (256,)
layer3_2_batchnorm0_running_var is done with shape:  (256,)
layer3_2_conv1_weight is done with shape:  (256, 256, 1, 3, 3)
layer3_2_batchnorm1_gamma is done with shape:  (256,)
layer3_2_batchnorm1_beta is done with shape:  (256,)
layer3_2_batchnorm1_running_mean is done with shape:  (256,)
layer3_2_batchnorm1_running_var is done with shape:  (256,)
layer3_2_conv2_weight is done with shape:  (1024, 256, 1, 1, 1)
layer3_2_batchnorm2_gamma is done with shape:  (1024,)
layer3_2_batchnorm2_beta is done with shape:  (1024,)
layer3_2_batchnorm2_running_mean is done with shape:  (1024,)
layer3_2_batchnorm2_running_var is done with shape:  (1024,)
layer3_3_conv0_weight is done with shape:  (256, 1024, 1, 1, 1)
layer3_3_batchnorm0_gamma is done with shape:  (256,)
layer3_3_batchnorm0_beta is done with shape:  (256,)
layer3_3_batchnorm0_running_mean is done with shape:  (256,)
layer3_3_batchnorm0_running_var is done with shape:  (256,)
layer3_3_conv1_weight is done with shape:  (256, 256, 1, 3, 3)
layer3_3_batchnorm1_gamma is done with shape:  (256,)
layer3_3_batchnorm1_beta is done with shape:  (256,)
layer3_3_batchnorm1_running_mean is done with shape:  (256,)
layer3_3_batchnorm1_running_var is done with shape:  (256,)
layer3_3_conv2_weight is done with shape:  (1024, 256, 1, 1, 1)
layer3_3_batchnorm2_gamma is done with shape:  (1024,)
layer3_3_batchnorm2_beta is done with shape:  (1024,)
layer3_3_batchnorm2_running_mean is done with shape:  (1024,)
layer3_3_batchnorm2_running_var is done with shape:  (1024,)
layer3_4_conv0_weight is done with shape:  (256, 1024, 3, 1, 1)
layer3_4_batchnorm0_gamma is done with shape:  (256,)
layer3_4_batchnorm0_beta is done with shape:  (256,)
layer3_4_batchnorm0_running_mean is done with shape:  (256,)
layer3_4_batchnorm0_running_var is done with shape:  (256,)
layer3_4_conv1_weight is done with shape:  (256, 256, 1, 3, 3)
layer3_4_batchnorm1_gamma is done with shape:  (256,)
layer3_4_batchnorm1_beta is done with shape:  (256,)
layer3_4_batchnorm1_running_mean is done with shape:  (256,)
layer3_4_batchnorm1_running_var is done with shape:  (256,)
layer3_4_conv2_weight is done with shape:  (1024, 256, 1, 1, 1)
layer3_4_batchnorm2_gamma is done with shape:  (1024,)
layer3_4_batchnorm2_beta is done with shape:  (1024,)
layer3_4_batchnorm2_running_mean is done with shape:  (1024,)
layer3_4_batchnorm2_running_var is done with shape:  (1024,)
layer3_5_conv0_weight is done with shape:  (256, 1024, 1, 1, 1)
layer3_5_batchnorm0_gamma is done with shape:  (256,)
layer3_5_batchnorm0_beta is done with shape:  (256,)
layer3_5_batchnorm0_running_mean is done with shape:  (256,)
layer3_5_batchnorm0_running_var is done with shape:  (256,)
layer3_5_conv1_weight is done with shape:  (256, 256, 1, 3, 3)
layer3_5_batchnorm1_gamma is done with shape:  (256,)
layer3_5_batchnorm1_beta is done with shape:  (256,)
layer3_5_batchnorm1_running_mean is done with shape:  (256,)
layer3_5_batchnorm1_running_var is done with shape:  (256,)
layer3_5_conv2_weight is done with shape:  (1024, 256, 1, 1, 1)
layer3_5_batchnorm2_gamma is done with shape:  (1024,)
layer3_5_batchnorm2_beta is done with shape:  (1024,)
layer3_5_batchnorm2_running_mean is done with shape:  (1024,)
layer3_5_batchnorm2_running_var is done with shape:  (1024,)
layer4_0_conv0_weight is done with shape:  (512, 1024, 1, 1, 1)
layer4_0_batchnorm0_gamma is done with shape:  (512,)
layer4_0_batchnorm0_beta is done with shape:  (512,)
layer4_0_batchnorm0_running_mean is done with shape:  (512,)
layer4_0_batchnorm0_running_var is done with shape:  (512,)
layer4_0_conv1_weight is done with shape:  (512, 512, 1, 3, 3)
layer4_0_batchnorm1_gamma is done with shape:  (512,)
layer4_0_batchnorm1_beta is done with shape:  (512,)
layer4_0_batchnorm1_running_mean is done with shape:  (512,)
layer4_0_batchnorm1_running_var is done with shape:  (512,)
layer4_0_conv2_weight is done with shape:  (2048, 512, 1, 1, 1)
layer4_0_batchnorm2_gamma is done with shape:  (2048,)
layer4_0_batchnorm2_beta is done with shape:  (2048,)
layer4_0_batchnorm2_running_mean is done with shape:  (2048,)
layer4_0_batchnorm2_running_var is done with shape:  (2048,)
layer4_downsample_conv0_weight is done with shape:  (2048, 1024, 1, 1, 1)
layer4_downsample_batchnorm0_gamma is done with shape:  (2048,)
layer4_downsample_batchnorm0_beta is done with shape:  (2048,)
layer4_downsample_batchnorm0_running_mean is done with shape:  (2048,)
layer4_downsample_batchnorm0_running_var is done with shape:  (2048,)
layer4_1_conv0_weight is done with shape:  (512, 2048, 3, 1, 1)
layer4_1_batchnorm0_gamma is done with shape:  (512,)
layer4_1_batchnorm0_beta is done with shape:  (512,)
layer4_1_batchnorm0_running_mean is done with shape:  (512,)
layer4_1_batchnorm0_running_var is done with shape:  (512,)
layer4_1_conv1_weight is done with shape:  (512, 512, 1, 3, 3)
layer4_1_batchnorm1_gamma is done with shape:  (512,)
layer4_1_batchnorm1_beta is done with shape:  (512,)
layer4_1_batchnorm1_running_mean is done with shape:  (512,)
layer4_1_batchnorm1_running_var is done with shape:  (512,)
layer4_1_conv2_weight is done with shape:  (2048, 512, 1, 1, 1)
layer4_1_batchnorm2_gamma is done with shape:  (2048,)
layer4_1_batchnorm2_beta is done with shape:  (2048,)
layer4_1_batchnorm2_running_mean is done with shape:  (2048,)
layer4_1_batchnorm2_running_var is done with shape:  (2048,)
layer4_2_conv0_weight is done with shape:  (512, 2048, 1, 1, 1)
layer4_2_batchnorm0_gamma is done with shape:  (512,)
layer4_2_batchnorm0_beta is done with shape:  (512,)
layer4_2_batchnorm0_running_mean is done with shape:  (512,)
layer4_2_batchnorm0_running_var is done with shape:  (512,)
layer4_2_conv1_weight is done with shape:  (512, 512, 1, 3, 3)
layer4_2_batchnorm1_gamma is done with shape:  (512,)
layer4_2_batchnorm1_beta is done with shape:  (512,)
layer4_2_batchnorm1_running_mean is done with shape:  (512,)
layer4_2_batchnorm1_running_var is done with shape:  (512,)
layer4_2_conv2_weight is done with shape:  (2048, 512, 1, 1, 1)
layer4_2_batchnorm2_gamma is done with shape:  (2048,)
layer4_2_batchnorm2_beta is done with shape:  (2048,)
layer4_2_batchnorm2_running_mean is done with shape:  (2048,)
layer4_2_batchnorm2_running_var is done with shape:  (2048,)
dense2_weight is skipped with shape:  (101, 2048)
dense2_bias is skipped with shape:  (101,)
Downloading /root/.mxnet/models/i3d_resnet50_v1_kinetics400-568a722e.zip from https://apache-mxnet.s3-accelerate.dualstack.amazonaws.com/gluon/models/i3d_resnet50_v1_kinetics400-568a722e.zip...

100%|##########| 208483/208483 [00:03<00:00, 67850.67KB/s]
I3D_ResNetV1(
  (first_stage): HybridSequential(
    (0): Conv3D(3 -> 64, kernel_size=(5, 7, 7), stride=(2, 2, 2), padding=(2, 3, 3), bias=False)
    (1): BatchNorm(axis=1, eps=1e-05, momentum=0.9, fix_gamma=False, use_global_stats=False, in_channels=64)
    (2): Activation(relu)
    (3): MaxPool3D(size=(1, 3, 3), stride=(2, 2, 2), padding=(0, 1, 1), ceil_mode=False, global_pool=False, pool_type=max, layout=NCDHW)
  )
  (pool2): MaxPool3D(size=(2, 1, 1), stride=(2, 1, 1), padding=(0, 0, 0), ceil_mode=False, global_pool=False, pool_type=max, layout=NCDHW)
  (res_layers): HybridSequential(
    (0): HybridSequential(
      (0): Bottleneck(
        (bottleneck): HybridSequential(
          (0): Conv3D(64 -> 64, kernel_size=(3, 1, 1), stride=(1, 1, 1), padding=(1, 0, 0), bias=False)
          (1): BatchNorm(axis=1, eps=1e-05, momentum=0.9, fix_gamma=False, use_global_stats=False, in_channels=64)
          (2): Activation(relu)
          (3): Conv3D(64 -> 64, kernel_size=(1, 3, 3), stride=(1, 1, 1), padding=(0, 1, 1), bias=False)
          (4): BatchNorm(axis=1, eps=1e-05, momentum=0.9, fix_gamma=False, use_global_stats=False, in_channels=64)
          (5): Activation(relu)
          (6): Conv3D(64 -> 256, kernel_size=(1, 1, 1), stride=(1, 1, 1), bias=False)
          (7): BatchNorm(axis=1, eps=1e-05, momentum=0.9, fix_gamma=False, use_global_stats=False, in_channels=256)
        )
        (conv1): Conv3D(64 -> 64, kernel_size=(3, 1, 1), stride=(1, 1, 1), padding=(1, 0, 0), bias=False)
        (conv2): Conv3D(64 -> 64, kernel_size=(1, 3, 3), stride=(1, 1, 1), padding=(0, 1, 1), bias=False)
        (bn1): BatchNorm(axis=1, eps=1e-05, momentum=0.9, fix_gamma=False, use_global_stats=False, in_channels=64)
        (bn2): BatchNorm(axis=1, eps=1e-05, momentum=0.9, fix_gamma=False, use_global_stats=False, in_channels=64)
        (conv3): Conv3D(64 -> 256, kernel_size=(1, 1, 1), stride=(1, 1, 1), bias=False)
        (bn3): BatchNorm(axis=1, eps=1e-05, momentum=0.9, fix_gamma=False, use_global_stats=False, in_channels=256)
        (relu): Activation(relu)
        (downsample): HybridSequential(
          (0): Conv3D(64 -> 256, kernel_size=(1, 1, 1), stride=(1, 1, 1), bias=False)
          (1): BatchNorm(axis=1, eps=1e-05, momentum=0.9, fix_gamma=False, use_global_stats=True, in_channels=256)
        )
      )
      (1): Bottleneck(
        (bottleneck): HybridSequential(
          (0): Conv3D(256 -> 64, kernel_size=(3, 1, 1), stride=(1, 1, 1), padding=(1, 0, 0), bias=False)
          (1): BatchNorm(axis=1, eps=1e-05, momentum=0.9, fix_gamma=False, use_global_stats=False, in_channels=64)
          (2): Activation(relu)
          (3): Conv3D(64 -> 64, kernel_size=(1, 3, 3), stride=(1, 1, 1), padding=(0, 1, 1), bias=False)
          (4): BatchNorm(axis=1, eps=1e-05, momentum=0.9, fix_gamma=False, use_global_stats=False, in_channels=64)
          (5): Activation(relu)
          (6): Conv3D(64 -> 256, kernel_size=(1, 1, 1), stride=(1, 1, 1), bias=False)
          (7): BatchNorm(axis=1, eps=1e-05, momentum=0.9, fix_gamma=False, use_global_stats=False, in_channels=256)
        )
        (conv1): Conv3D(256 -> 64, kernel_size=(3, 1, 1), stride=(1, 1, 1), padding=(1, 0, 0), bias=False)
        (conv2): Conv3D(64 -> 64, kernel_size=(1, 3, 3), stride=(1, 1, 1), padding=(0, 1, 1), bias=False)
        (bn1): BatchNorm(axis=1, eps=1e-05, momentum=0.9, fix_gamma=False, use_global_stats=False, in_channels=64)
        (bn2): BatchNorm(axis=1, eps=1e-05, momentum=0.9, fix_gamma=False, use_global_stats=False, in_channels=64)
        (conv3): Conv3D(64 -> 256, kernel_size=(1, 1, 1), stride=(1, 1, 1), bias=False)
        (bn3): BatchNorm(axis=1, eps=1e-05, momentum=0.9, fix_gamma=False, use_global_stats=False, in_channels=256)
        (relu): Activation(relu)
      )
      (2): Bottleneck(
        (bottleneck): HybridSequential(
          (0): Conv3D(256 -> 64, kernel_size=(3, 1, 1), stride=(1, 1, 1), padding=(1, 0, 0), bias=False)
          (1): BatchNorm(axis=1, eps=1e-05, momentum=0.9, fix_gamma=False, use_global_stats=False, in_channels=64)
          (2): Activation(relu)
          (3): Conv3D(64 -> 64, kernel_size=(1, 3, 3), stride=(1, 1, 1), padding=(0, 1, 1), bias=False)
          (4): BatchNorm(axis=1, eps=1e-05, momentum=0.9, fix_gamma=False, use_global_stats=False, in_channels=64)
          (5): Activation(relu)
          (6): Conv3D(64 -> 256, kernel_size=(1, 1, 1), stride=(1, 1, 1), bias=False)
          (7): BatchNorm(axis=1, eps=1e-05, momentum=0.9, fix_gamma=False, use_global_stats=False, in_channels=256)
        )
        (conv1): Conv3D(256 -> 64, kernel_size=(3, 1, 1), stride=(1, 1, 1), padding=(1, 0, 0), bias=False)
        (conv2): Conv3D(64 -> 64, kernel_size=(1, 3, 3), stride=(1, 1, 1), padding=(0, 1, 1), bias=False)
        (bn1): BatchNorm(axis=1, eps=1e-05, momentum=0.9, fix_gamma=False, use_global_stats=False, in_channels=64)
        (bn2): BatchNorm(axis=1, eps=1e-05, momentum=0.9, fix_gamma=False, use_global_stats=False, in_channels=64)
        (conv3): Conv3D(64 -> 256, kernel_size=(1, 1, 1), stride=(1, 1, 1), bias=False)
        (bn3): BatchNorm(axis=1, eps=1e-05, momentum=0.9, fix_gamma=False, use_global_stats=False, in_channels=256)
        (relu): Activation(relu)
      )
    )
    (1): HybridSequential(
      (0): Bottleneck(
        (bottleneck): HybridSequential(
          (0): Conv3D(256 -> 128, kernel_size=(3, 1, 1), stride=(1, 1, 1), padding=(1, 0, 0), bias=False)
          (1): BatchNorm(axis=1, eps=1e-05, momentum=0.9, fix_gamma=False, use_global_stats=False, in_channels=128)
          (2): Activation(relu)
          (3): Conv3D(128 -> 128, kernel_size=(1, 3, 3), stride=(1, 2, 2), padding=(0, 1, 1), bias=False)
          (4): BatchNorm(axis=1, eps=1e-05, momentum=0.9, fix_gamma=False, use_global_stats=False, in_channels=128)
          (5): Activation(relu)
          (6): Conv3D(128 -> 512, kernel_size=(1, 1, 1), stride=(1, 1, 1), bias=False)
          (7): BatchNorm(axis=1, eps=1e-05, momentum=0.9, fix_gamma=False, use_global_stats=False, in_channels=512)
        )
        (conv1): Conv3D(256 -> 128, kernel_size=(3, 1, 1), stride=(1, 1, 1), padding=(1, 0, 0), bias=False)
        (conv2): Conv3D(128 -> 128, kernel_size=(1, 3, 3), stride=(1, 2, 2), padding=(0, 1, 1), bias=False)
        (bn1): BatchNorm(axis=1, eps=1e-05, momentum=0.9, fix_gamma=False, use_global_stats=False, in_channels=128)
        (bn2): BatchNorm(axis=1, eps=1e-05, momentum=0.9, fix_gamma=False, use_global_stats=False, in_channels=128)
        (conv3): Conv3D(128 -> 512, kernel_size=(1, 1, 1), stride=(1, 1, 1), bias=False)
        (bn3): BatchNorm(axis=1, eps=1e-05, momentum=0.9, fix_gamma=False, use_global_stats=False, in_channels=512)
        (relu): Activation(relu)
        (downsample): HybridSequential(
          (0): Conv3D(256 -> 512, kernel_size=(1, 1, 1), stride=(1, 2, 2), bias=False)
          (1): BatchNorm(axis=1, eps=1e-05, momentum=0.9, fix_gamma=False, use_global_stats=True, in_channels=512)
        )
      )
      (1): Bottleneck(
        (bottleneck): HybridSequential(
          (0): Conv3D(512 -> 128, kernel_size=(1, 1, 1), stride=(1, 1, 1), bias=False)
          (1): BatchNorm(axis=1, eps=1e-05, momentum=0.9, fix_gamma=False, use_global_stats=False, in_channels=128)
          (2): Activation(relu)
          (3): Conv3D(128 -> 128, kernel_size=(1, 3, 3), stride=(1, 1, 1), padding=(0, 1, 1), bias=False)
          (4): BatchNorm(axis=1, eps=1e-05, momentum=0.9, fix_gamma=False, use_global_stats=False, in_channels=128)
          (5): Activation(relu)
          (6): Conv3D(128 -> 512, kernel_size=(1, 1, 1), stride=(1, 1, 1), bias=False)
          (7): BatchNorm(axis=1, eps=1e-05, momentum=0.9, fix_gamma=False, use_global_stats=False, in_channels=512)
        )
        (conv1): Conv3D(512 -> 128, kernel_size=(1, 1, 1), stride=(1, 1, 1), bias=False)
        (conv2): Conv3D(128 -> 128, kernel_size=(1, 3, 3), stride=(1, 1, 1), padding=(0, 1, 1), bias=False)
        (bn1): BatchNorm(axis=1, eps=1e-05, momentum=0.9, fix_gamma=False, use_global_stats=False, in_channels=128)
        (bn2): BatchNorm(axis=1, eps=1e-05, momentum=0.9, fix_gamma=False, use_global_stats=False, in_channels=128)
        (conv3): Conv3D(128 -> 512, kernel_size=(1, 1, 1), stride=(1, 1, 1), bias=False)
        (bn3): BatchNorm(axis=1, eps=1e-05, momentum=0.9, fix_gamma=False, use_global_stats=False, in_channels=512)
        (relu): Activation(relu)
      )
      (2): Bottleneck(
        (bottleneck): HybridSequential(
          (0): Conv3D(512 -> 128, kernel_size=(3, 1, 1), stride=(1, 1, 1), padding=(1, 0, 0), bias=False)
          (1): BatchNorm(axis=1, eps=1e-05, momentum=0.9, fix_gamma=False, use_global_stats=False, in_channels=128)
          (2): Activation(relu)
          (3): Conv3D(128 -> 128, kernel_size=(1, 3, 3), stride=(1, 1, 1), padding=(0, 1, 1), bias=False)
          (4): BatchNorm(axis=1, eps=1e-05, momentum=0.9, fix_gamma=False, use_global_stats=False, in_channels=128)
          (5): Activation(relu)
          (6): Conv3D(128 -> 512, kernel_size=(1, 1, 1), stride=(1, 1, 1), bias=False)
          (7): BatchNorm(axis=1, eps=1e-05, momentum=0.9, fix_gamma=False, use_global_stats=False, in_channels=512)
        )
        (conv1): Conv3D(512 -> 128, kernel_size=(3, 1, 1), stride=(1, 1, 1), padding=(1, 0, 0), bias=False)
        (conv2): Conv3D(128 -> 128, kernel_size=(1, 3, 3), stride=(1, 1, 1), padding=(0, 1, 1), bias=False)
        (bn1): BatchNorm(axis=1, eps=1e-05, momentum=0.9, fix_gamma=False, use_global_stats=False, in_channels=128)
        (bn2): BatchNorm(axis=1, eps=1e-05, momentum=0.9, fix_gamma=False, use_global_stats=False, in_channels=128)
        (conv3): Conv3D(128 -> 512, kernel_size=(1, 1, 1), stride=(1, 1, 1), bias=False)
        (bn3): BatchNorm(axis=1, eps=1e-05, momentum=0.9, fix_gamma=False, use_global_stats=False, in_channels=512)
        (relu): Activation(relu)
      )
      (3): Bottleneck(
        (bottleneck): HybridSequential(
          (0): Conv3D(512 -> 128, kernel_size=(1, 1, 1), stride=(1, 1, 1), bias=False)
          (1): BatchNorm(axis=1, eps=1e-05, momentum=0.9, fix_gamma=False, use_global_stats=False, in_channels=128)
          (2): Activation(relu)
          (3): Conv3D(128 -> 128, kernel_size=(1, 3, 3), stride=(1, 1, 1), padding=(0, 1, 1), bias=False)
          (4): BatchNorm(axis=1, eps=1e-05, momentum=0.9, fix_gamma=False, use_global_stats=False, in_channels=128)
          (5): Activation(relu)
          (6): Conv3D(128 -> 512, kernel_size=(1, 1, 1), stride=(1, 1, 1), bias=False)
          (7): BatchNorm(axis=1, eps=1e-05, momentum=0.9, fix_gamma=False, use_global_stats=False, in_channels=512)
        )
        (conv1): Conv3D(512 -> 128, kernel_size=(1, 1, 1), stride=(1, 1, 1), bias=False)
        (conv2): Conv3D(128 -> 128, kernel_size=(1, 3, 3), stride=(1, 1, 1), padding=(0, 1, 1), bias=False)
        (bn1): BatchNorm(axis=1, eps=1e-05, momentum=0.9, fix_gamma=False, use_global_stats=False, in_channels=128)
        (bn2): BatchNorm(axis=1, eps=1e-05, momentum=0.9, fix_gamma=False, use_global_stats=False, in_channels=128)
        (conv3): Conv3D(128 -> 512, kernel_size=(1, 1, 1), stride=(1, 1, 1), bias=False)
        (bn3): BatchNorm(axis=1, eps=1e-05, momentum=0.9, fix_gamma=False, use_global_stats=False, in_channels=512)
        (relu): Activation(relu)
      )
    )
    (2): HybridSequential(
      (0): Bottleneck(
        (bottleneck): HybridSequential(
          (0): Conv3D(512 -> 256, kernel_size=(3, 1, 1), stride=(1, 1, 1), padding=(1, 0, 0), bias=False)
          (1): BatchNorm(axis=1, eps=1e-05, momentum=0.9, fix_gamma=False, use_global_stats=False, in_channels=256)
          (2): Activation(relu)
          (3): Conv3D(256 -> 256, kernel_size=(1, 3, 3), stride=(1, 2, 2), padding=(0, 1, 1), bias=False)
          (4): BatchNorm(axis=1, eps=1e-05, momentum=0.9, fix_gamma=False, use_global_stats=False, in_channels=256)
          (5): Activation(relu)
          (6): Conv3D(256 -> 1024, kernel_size=(1, 1, 1), stride=(1, 1, 1), bias=False)
          (7): BatchNorm(axis=1, eps=1e-05, momentum=0.9, fix_gamma=False, use_global_stats=False, in_channels=1024)
        )
        (conv1): Conv3D(512 -> 256, kernel_size=(3, 1, 1), stride=(1, 1, 1), padding=(1, 0, 0), bias=False)
        (conv2): Conv3D(256 -> 256, kernel_size=(1, 3, 3), stride=(1, 2, 2), padding=(0, 1, 1), bias=False)
        (bn1): BatchNorm(axis=1, eps=1e-05, momentum=0.9, fix_gamma=False, use_global_stats=False, in_channels=256)
        (bn2): BatchNorm(axis=1, eps=1e-05, momentum=0.9, fix_gamma=False, use_global_stats=False, in_channels=256)
        (conv3): Conv3D(256 -> 1024, kernel_size=(1, 1, 1), stride=(1, 1, 1), bias=False)
        (bn3): BatchNorm(axis=1, eps=1e-05, momentum=0.9, fix_gamma=False, use_global_stats=False, in_channels=1024)
        (relu): Activation(relu)
        (downsample): HybridSequential(
          (0): Conv3D(512 -> 1024, kernel_size=(1, 1, 1), stride=(1, 2, 2), bias=False)
          (1): BatchNorm(axis=1, eps=1e-05, momentum=0.9, fix_gamma=False, use_global_stats=True, in_channels=1024)
        )
      )
      (1): Bottleneck(
        (bottleneck): HybridSequential(
          (0): Conv3D(1024 -> 256, kernel_size=(1, 1, 1), stride=(1, 1, 1), bias=False)
          (1): BatchNorm(axis=1, eps=1e-05, momentum=0.9, fix_gamma=False, use_global_stats=False, in_channels=256)
          (2): Activation(relu)
          (3): Conv3D(256 -> 256, kernel_size=(1, 3, 3), stride=(1, 1, 1), padding=(0, 1, 1), bias=False)
          (4): BatchNorm(axis=1, eps=1e-05, momentum=0.9, fix_gamma=False, use_global_stats=False, in_channels=256)
          (5): Activation(relu)
          (6): Conv3D(256 -> 1024, kernel_size=(1, 1, 1), stride=(1, 1, 1), bias=False)
          (7): BatchNorm(axis=1, eps=1e-05, momentum=0.9, fix_gamma=False, use_global_stats=False, in_channels=1024)
        )
        (conv1): Conv3D(1024 -> 256, kernel_size=(1, 1, 1), stride=(1, 1, 1), bias=False)
        (conv2): Conv3D(256 -> 256, kernel_size=(1, 3, 3), stride=(1, 1, 1), padding=(0, 1, 1), bias=False)
        (bn1): BatchNorm(axis=1, eps=1e-05, momentum=0.9, fix_gamma=False, use_global_stats=False, in_channels=256)
        (bn2): BatchNorm(axis=1, eps=1e-05, momentum=0.9, fix_gamma=False, use_global_stats=False, in_channels=256)
        (conv3): Conv3D(256 -> 1024, kernel_size=(1, 1, 1), stride=(1, 1, 1), bias=False)
        (bn3): BatchNorm(axis=1, eps=1e-05, momentum=0.9, fix_gamma=False, use_global_stats=False, in_channels=1024)
        (relu): Activation(relu)
      )
      (2): Bottleneck(
        (bottleneck): HybridSequential(
          (0): Conv3D(1024 -> 256, kernel_size=(3, 1, 1), stride=(1, 1, 1), padding=(1, 0, 0), bias=False)
          (1): BatchNorm(axis=1, eps=1e-05, momentum=0.9, fix_gamma=False, use_global_stats=False, in_channels=256)
          (2): Activation(relu)
          (3): Conv3D(256 -> 256, kernel_size=(1, 3, 3), stride=(1, 1, 1), padding=(0, 1, 1), bias=False)
          (4): BatchNorm(axis=1, eps=1e-05, momentum=0.9, fix_gamma=False, use_global_stats=False, in_channels=256)
          (5): Activation(relu)
          (6): Conv3D(256 -> 1024, kernel_size=(1, 1, 1), stride=(1, 1, 1), bias=False)
          (7): BatchNorm(axis=1, eps=1e-05, momentum=0.9, fix_gamma=False, use_global_stats=False, in_channels=1024)
        )
        (conv1): Conv3D(1024 -> 256, kernel_size=(3, 1, 1), stride=(1, 1, 1), padding=(1, 0, 0), bias=False)
        (conv2): Conv3D(256 -> 256, kernel_size=(1, 3, 3), stride=(1, 1, 1), padding=(0, 1, 1), bias=False)
        (bn1): BatchNorm(axis=1, eps=1e-05, momentum=0.9, fix_gamma=False, use_global_stats=False, in_channels=256)
        (bn2): BatchNorm(axis=1, eps=1e-05, momentum=0.9, fix_gamma=False, use_global_stats=False, in_channels=256)
        (conv3): Conv3D(256 -> 1024, kernel_size=(1, 1, 1), stride=(1, 1, 1), bias=False)
        (bn3): BatchNorm(axis=1, eps=1e-05, momentum=0.9, fix_gamma=False, use_global_stats=False, in_channels=1024)
        (relu): Activation(relu)
      )
      (3): Bottleneck(
        (bottleneck): HybridSequential(
          (0): Conv3D(1024 -> 256, kernel_size=(1, 1, 1), stride=(1, 1, 1), bias=False)
          (1): BatchNorm(axis=1, eps=1e-05, momentum=0.9, fix_gamma=False, use_global_stats=False, in_channels=256)
          (2): Activation(relu)
          (3): Conv3D(256 -> 256, kernel_size=(1, 3, 3), stride=(1, 1, 1), padding=(0, 1, 1), bias=False)
          (4): BatchNorm(axis=1, eps=1e-05, momentum=0.9, fix_gamma=False, use_global_stats=False, in_channels=256)
          (5): Activation(relu)
          (6): Conv3D(256 -> 1024, kernel_size=(1, 1, 1), stride=(1, 1, 1), bias=False)
          (7): BatchNorm(axis=1, eps=1e-05, momentum=0.9, fix_gamma=False, use_global_stats=False, in_channels=1024)
        )
        (conv1): Conv3D(1024 -> 256, kernel_size=(1, 1, 1), stride=(1, 1, 1), bias=False)
        (conv2): Conv3D(256 -> 256, kernel_size=(1, 3, 3), stride=(1, 1, 1), padding=(0, 1, 1), bias=False)
        (bn1): BatchNorm(axis=1, eps=1e-05, momentum=0.9, fix_gamma=False, use_global_stats=False, in_channels=256)
        (bn2): BatchNorm(axis=1, eps=1e-05, momentum=0.9, fix_gamma=False, use_global_stats=False, in_channels=256)
        (conv3): Conv3D(256 -> 1024, kernel_size=(1, 1, 1), stride=(1, 1, 1), bias=False)
        (bn3): BatchNorm(axis=1, eps=1e-05, momentum=0.9, fix_gamma=False, use_global_stats=False, in_channels=1024)
        (relu): Activation(relu)
      )
      (4): Bottleneck(
        (bottleneck): HybridSequential(
          (0): Conv3D(1024 -> 256, kernel_size=(3, 1, 1), stride=(1, 1, 1), padding=(1, 0, 0), bias=False)
          (1): BatchNorm(axis=1, eps=1e-05, momentum=0.9, fix_gamma=False, use_global_stats=False, in_channels=256)
          (2): Activation(relu)
          (3): Conv3D(256 -> 256, kernel_size=(1, 3, 3), stride=(1, 1, 1), padding=(0, 1, 1), bias=False)
          (4): BatchNorm(axis=1, eps=1e-05, momentum=0.9, fix_gamma=False, use_global_stats=False, in_channels=256)
          (5): Activation(relu)
          (6): Conv3D(256 -> 1024, kernel_size=(1, 1, 1), stride=(1, 1, 1), bias=False)
          (7): BatchNorm(axis=1, eps=1e-05, momentum=0.9, fix_gamma=False, use_global_stats=False, in_channels=1024)
        )
        (conv1): Conv3D(1024 -> 256, kernel_size=(3, 1, 1), stride=(1, 1, 1), padding=(1, 0, 0), bias=False)
        (conv2): Conv3D(256 -> 256, kernel_size=(1, 3, 3), stride=(1, 1, 1), padding=(0, 1, 1), bias=False)
        (bn1): BatchNorm(axis=1, eps=1e-05, momentum=0.9, fix_gamma=False, use_global_stats=False, in_channels=256)
        (bn2): BatchNorm(axis=1, eps=1e-05, momentum=0.9, fix_gamma=False, use_global_stats=False, in_channels=256)
        (conv3): Conv3D(256 -> 1024, kernel_size=(1, 1, 1), stride=(1, 1, 1), bias=False)
        (bn3): BatchNorm(axis=1, eps=1e-05, momentum=0.9, fix_gamma=False, use_global_stats=False, in_channels=1024)
        (relu): Activation(relu)
      )
      (5): Bottleneck(
        (bottleneck): HybridSequential(
          (0): Conv3D(1024 -> 256, kernel_size=(1, 1, 1), stride=(1, 1, 1), bias=False)
          (1): BatchNorm(axis=1, eps=1e-05, momentum=0.9, fix_gamma=False, use_global_stats=False, in_channels=256)
          (2): Activation(relu)
          (3): Conv3D(256 -> 256, kernel_size=(1, 3, 3), stride=(1, 1, 1), padding=(0, 1, 1), bias=False)
          (4): BatchNorm(axis=1, eps=1e-05, momentum=0.9, fix_gamma=False, use_global_stats=False, in_channels=256)
          (5): Activation(relu)
          (6): Conv3D(256 -> 1024, kernel_size=(1, 1, 1), stride=(1, 1, 1), bias=False)
          (7): BatchNorm(axis=1, eps=1e-05, momentum=0.9, fix_gamma=False, use_global_stats=False, in_channels=1024)
        )
        (conv1): Conv3D(1024 -> 256, kernel_size=(1, 1, 1), stride=(1, 1, 1), bias=False)
        (conv2): Conv3D(256 -> 256, kernel_size=(1, 3, 3), stride=(1, 1, 1), padding=(0, 1, 1), bias=False)
        (bn1): BatchNorm(axis=1, eps=1e-05, momentum=0.9, fix_gamma=False, use_global_stats=False, in_channels=256)
        (bn2): BatchNorm(axis=1, eps=1e-05, momentum=0.9, fix_gamma=False, use_global_stats=False, in_channels=256)
        (conv3): Conv3D(256 -> 1024, kernel_size=(1, 1, 1), stride=(1, 1, 1), bias=False)
        (bn3): BatchNorm(axis=1, eps=1e-05, momentum=0.9, fix_gamma=False, use_global_stats=False, in_channels=1024)
        (relu): Activation(relu)
      )
    )
    (3): HybridSequential(
      (0): Bottleneck(
        (bottleneck): HybridSequential(
          (0): Conv3D(1024 -> 512, kernel_size=(1, 1, 1), stride=(1, 1, 1), bias=False)
          (1): BatchNorm(axis=1, eps=1e-05, momentum=0.9, fix_gamma=False, use_global_stats=False, in_channels=512)
          (2): Activation(relu)
          (3): Conv3D(512 -> 512, kernel_size=(1, 3, 3), stride=(1, 2, 2), padding=(0, 1, 1), bias=False)
          (4): BatchNorm(axis=1, eps=1e-05, momentum=0.9, fix_gamma=False, use_global_stats=False, in_channels=512)
          (5): Activation(relu)
          (6): Conv3D(512 -> 2048, kernel_size=(1, 1, 1), stride=(1, 1, 1), bias=False)
          (7): BatchNorm(axis=1, eps=1e-05, momentum=0.9, fix_gamma=False, use_global_stats=False, in_channels=2048)
        )
        (conv1): Conv3D(1024 -> 512, kernel_size=(1, 1, 1), stride=(1, 1, 1), bias=False)
        (conv2): Conv3D(512 -> 512, kernel_size=(1, 3, 3), stride=(1, 2, 2), padding=(0, 1, 1), bias=False)
        (bn1): BatchNorm(axis=1, eps=1e-05, momentum=0.9, fix_gamma=False, use_global_stats=False, in_channels=512)
        (bn2): BatchNorm(axis=1, eps=1e-05, momentum=0.9, fix_gamma=False, use_global_stats=False, in_channels=512)
        (conv3): Conv3D(512 -> 2048, kernel_size=(1, 1, 1), stride=(1, 1, 1), bias=False)
        (bn3): BatchNorm(axis=1, eps=1e-05, momentum=0.9, fix_gamma=False, use_global_stats=False, in_channels=2048)
        (relu): Activation(relu)
        (downsample): HybridSequential(
          (0): Conv3D(1024 -> 2048, kernel_size=(1, 1, 1), stride=(1, 2, 2), bias=False)
          (1): BatchNorm(axis=1, eps=1e-05, momentum=0.9, fix_gamma=False, use_global_stats=True, in_channels=2048)
        )
      )
      (1): Bottleneck(
        (bottleneck): HybridSequential(
          (0): Conv3D(2048 -> 512, kernel_size=(3, 1, 1), stride=(1, 1, 1), padding=(1, 0, 0), bias=False)
          (1): BatchNorm(axis=1, eps=1e-05, momentum=0.9, fix_gamma=False, use_global_stats=False, in_channels=512)
          (2): Activation(relu)
          (3): Conv3D(512 -> 512, kernel_size=(1, 3, 3), stride=(1, 1, 1), padding=(0, 1, 1), bias=False)
          (4): BatchNorm(axis=1, eps=1e-05, momentum=0.9, fix_gamma=False, use_global_stats=False, in_channels=512)
          (5): Activation(relu)
          (6): Conv3D(512 -> 2048, kernel_size=(1, 1, 1), stride=(1, 1, 1), bias=False)
          (7): BatchNorm(axis=1, eps=1e-05, momentum=0.9, fix_gamma=False, use_global_stats=False, in_channels=2048)
        )
        (conv1): Conv3D(2048 -> 512, kernel_size=(3, 1, 1), stride=(1, 1, 1), padding=(1, 0, 0), bias=False)
        (conv2): Conv3D(512 -> 512, kernel_size=(1, 3, 3), stride=(1, 1, 1), padding=(0, 1, 1), bias=False)
        (bn1): BatchNorm(axis=1, eps=1e-05, momentum=0.9, fix_gamma=False, use_global_stats=False, in_channels=512)
        (bn2): BatchNorm(axis=1, eps=1e-05, momentum=0.9, fix_gamma=False, use_global_stats=False, in_channels=512)
        (conv3): Conv3D(512 -> 2048, kernel_size=(1, 1, 1), stride=(1, 1, 1), bias=False)
        (bn3): BatchNorm(axis=1, eps=1e-05, momentum=0.9, fix_gamma=False, use_global_stats=False, in_channels=2048)
        (relu): Activation(relu)
      )
      (2): Bottleneck(
        (bottleneck): HybridSequential(
          (0): Conv3D(2048 -> 512, kernel_size=(1, 1, 1), stride=(1, 1, 1), bias=False)
          (1): BatchNorm(axis=1, eps=1e-05, momentum=0.9, fix_gamma=False, use_global_stats=False, in_channels=512)
          (2): Activation(relu)
          (3): Conv3D(512 -> 512, kernel_size=(1, 3, 3), stride=(1, 1, 1), padding=(0, 1, 1), bias=False)
          (4): BatchNorm(axis=1, eps=1e-05, momentum=0.9, fix_gamma=False, use_global_stats=False, in_channels=512)
          (5): Activation(relu)
          (6): Conv3D(512 -> 2048, kernel_size=(1, 1, 1), stride=(1, 1, 1), bias=False)
          (7): BatchNorm(axis=1, eps=1e-05, momentum=0.9, fix_gamma=False, use_global_stats=False, in_channels=2048)
        )
        (conv1): Conv3D(2048 -> 512, kernel_size=(1, 1, 1), stride=(1, 1, 1), bias=False)
        (conv2): Conv3D(512 -> 512, kernel_size=(1, 3, 3), stride=(1, 1, 1), padding=(0, 1, 1), bias=False)
        (bn1): BatchNorm(axis=1, eps=1e-05, momentum=0.9, fix_gamma=False, use_global_stats=False, in_channels=512)
        (bn2): BatchNorm(axis=1, eps=1e-05, momentum=0.9, fix_gamma=False, use_global_stats=False, in_channels=512)
        (conv3): Conv3D(512 -> 2048, kernel_size=(1, 1, 1), stride=(1, 1, 1), bias=False)
        (bn3): BatchNorm(axis=1, eps=1e-05, momentum=0.9, fix_gamma=False, use_global_stats=False, in_channels=2048)
        (relu): Activation(relu)
      )
    )
  )
  (st_avg): GlobalAvgPool3D(size=(1, 1, 1), stride=(1, 1, 1), padding=(0, 0, 0), ceil_mode=True, global_pool=True, pool_type=avg, layout=NCDHW)
  (head): HybridSequential(
    (0): Dropout(p = 0.8, axes=())
    (1): Dense(2048 -> 101, linear)
  )
  (fc): Dense(2048 -> 101, linear)
)

We also provide other customized network architectures for you to use on your own dataset. You can simply change the dataset part in any pretrained model name to custom, e.g., slowfast_4x16_resnet50_kinetics400 to slowfast_4x16_resnet50_custom.
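The renaming rule above is plain string manipulation, so it can be sketched without loading any model. The helper `to_custom_name` and its `known_datasets` tuple are hypothetical names introduced here for illustration; they are not part of GluonCV, and the sketch does not check that the resulting name is actually registered in the model zoo.

```python
def to_custom_name(pretrained_name,
                   known_datasets=("kinetics400", "ucf101", "hmdb51", "sthsthv2")):
    """Swap the trailing dataset token of a model name for 'custom'.

    `known_datasets` is an illustrative assumption, not an official list.
    """
    # Split off the last underscore-separated token, e.g. 'kinetics400'
    prefix, _, dataset = pretrained_name.rpartition("_")
    if not prefix or dataset not in known_datasets:
        raise ValueError("no recognised dataset suffix in %r" % pretrained_name)
    return prefix + "_custom"


print(to_custom_name("slowfast_4x16_resnet50_kinetics400"))
# slowfast_4x16_resnet50_custom
```

The returned string is what you would then pass as the model name when building the network for your own dataset.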

Once you have the dataloader and network for your own dataset, the rest is the same as in previous tutorials. Just define the optimizer, loss and metric, and kickstart the training.

Optimizer, Loss and Metric¶

# Learning rate decay factor
lr_decay = 0.1
# Epochs where learning rate decays
lr_decay_epoch = [40, 80, 100]

# Stochastic gradient descent
optimizer = 'sgd'
# Set parameters
optimizer_params = {'learning_rate': 0.001, 'wd': 0.0001, 'momentum': 0.9}

# Define our trainer for net
trainer = gluon.Trainer(net.collect_params(), optimizer, optimizer_params)
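To see what the decay settings above amount to, here is a minimal sketch that computes the learning rate in effect at a given epoch. It assumes the decay is applied at the start of each listed epoch, as the training loop later in this tutorial does; `lr_at_epoch` is a hypothetical helper and not part of Gluon.

```python
def lr_at_epoch(epoch, base_lr=0.001, decay=0.1, decay_epochs=(40, 80, 100)):
    """Learning rate in effect during `epoch` under step decay."""
    # Count how many decay milestones have been reached so far
    num_decays = sum(1 for e in decay_epochs if epoch >= e)
    return base_lr * decay ** num_decays


for epoch in (0, 39, 40, 80, 100):
    print(epoch, lr_at_epoch(epoch))
```

With these values the rate stays at 0.001 for the first 40 epochs and drops by a factor of 10 at each milestone.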

In order to optimize our model, we need a loss function. For classification tasks, we usually use softmax cross entropy as the loss function.

loss_fn = gluon.loss.SoftmaxCrossEntropyLoss()
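For intuition, the quantity this loss computes for a single sample can be sketched in plain Python. This is an illustrative re-implementation under the assumption of raw scores (logits) and an integer class label, not the Gluon code itself; `softmax_cross_entropy` is a hypothetical name.

```python
import math


def softmax_cross_entropy(logits, label):
    """Negative log-probability of the true class for one sample."""
    # Subtract the maximum before exponentiating for numerical stability
    m = max(logits)
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    # -log(softmax(logits)[label]) == log_sum_exp - logits[label]
    return log_sum_exp - logits[label]


# With identical scores for 4 classes, the loss is log(4)
print(softmax_cross_entropy([0.0, 0.0, 0.0, 0.0], 2))
```

The loss is small when the score of the true class dominates the others and grows as the model assigns it less probability.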

For simplicity, we use accuracy as the metric to monitor the training process. We also record the metric value after each epoch and plot the history at the end of training.

train_metric = mx.metric.Accuracy()
train_history = TrainingHistory(['training-acc'])
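As a rough sketch of what the accuracy metric measures, the snippet below takes the highest-scoring class of each sample as the prediction and reports the fraction that match the labels. It is a plain-Python illustration, assuming one row of class scores per sample; `accuracy` is a hypothetical helper, not the MXNet implementation.

```python
def accuracy(scores, labels):
    """Fraction of samples whose highest-scoring class equals the label."""
    correct = 0
    for row, y in zip(scores, labels):
        # Index of the largest score is the predicted class
        predicted = max(range(len(row)), key=row.__getitem__)
        correct += int(predicted == y)
    return correct / len(labels)


scores = [[0.1, 0.9], [0.8, 0.2], [0.3, 0.7]]
labels = [1, 0, 0]
print(accuracy(scores, labels))  # 2 of the 3 predictions are correct
```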

Training¶

After all the preparations, we can finally start training! The script is as follows.

Note

To finish the tutorial quickly, we only fine-tune for 3 epochs, with 100 iterations per epoch, on UCF101. In your own experiments, set the hyper-parameters according to your dataset.

# Number of fine-tuning epochs (3, as stated in the note above)
epochs = 3
lr_decay_count = 0

for epoch in range(epochs):
    tic = time.time()
    train_metric.reset()
    train_loss = 0

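    # Learning rate decay (the bounds check avoids an IndexError
    # once every epoch in lr_decay_epoch has been passed)
    if lr_decay_count < len(lr_decay_epoch) and epoch == lr_decay_epoch[lr_decay_count]:
        trainer.set_learning_rate(trainer.learning_rate*lr_decay)
        lr_decay_count += 1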

    # Loop through each batch of training data
    for i, batch in enumerate(train_data):
        # Extract data and label
        data = split_and_load(batch[0], ctx_list=ctx, batch_axis=0)
        label = split_and_load(batch[1], ctx_list=ctx, batch_axis=0)

        # AutoGrad
        with ag.record():
            output = []
            for _, X in enumerate(data):
                X = X.reshape((-1,) + X.shape[2:])
                pred = net(X)
                output.append(pred)
            loss = [loss_fn(yhat, y) for yhat, y in zip(output, label)]

        # Backpropagation
        for l in loss:
            l.backward()

        # Optimize
        trainer.step(batch_size)

        # Update metrics
        train_loss += sum([l.mean().asscalar() for l in loss])
        train_metric.update(label, output)

        # Stop after 100 iterations to keep the tutorial short
        if i + 1 == 100:
            break

    name, acc = train_metric.get()

    # Update history and print metrics
    train_history.update([acc])
    print('[Epoch %d] train=%f loss=%f time: %f' %
        (epoch, acc, train_loss / (i+1), time.time()-tic))

# We can plot the metric scores with:
train_history.plot()

We can see that the training accuracy increases quickly. In fact, if you look back at tutorial 4 (Dive Deep into Training I3D models on Kinetics400) and compare the training curves, you will see that fine-tuning achieves a much better result in much less time. Try fine-tuning other SOTA video models on your own dataset and see how it goes.
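One step in the training loop that is easy to misread is `X.reshape((-1,) + X.shape[2:])`, which merges the first two axes of each batch into one before the forward pass. The sketch below only computes the resulting shape; `merge_leading_axes` is a hypothetical helper and the example shape is an assumption chosen for illustration, not necessarily what your dataloader produces.

```python
def merge_leading_axes(shape):
    """Shape produced by reshape((-1,) + shape[2:])."""
    if len(shape) < 2:
        raise ValueError("need at least two axes to merge")
    # The -1 entry is inferred as the product of the two leading axes
    return (shape[0] * shape[1],) + tuple(shape[2:])


# Hypothetical batch: 8 samples, 1 clip each, 3 channels, 32 frames, 224x224
print(merge_leading_axes((8, 1, 3, 32, 224, 224)))
# (8, 3, 32, 224, 224)
```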

Total running time of the script: ( 0 minutes 5.860 seconds)
