
1. Getting Started with Pre-trained I3D Models on Kinetics400

Kinetics400 is an action recognition dataset of realistic action videos collected from YouTube. With 306,245 short trimmed videos from 400 action categories, it is one of the largest and most widely used datasets in the research community for benchmarking state-of-the-art video action recognition models.

I3D (Inflated 3D Networks) is a widely adopted 3D video classification network. It uses 3D convolution to learn spatiotemporal information directly from videos. I3D was proposed to improve over C3D (Convolutional 3D Networks) by inflating 2D models into 3D. This way, we can not only reuse the architecture of 2D models (e.g., ResNet, Inception), but also bootstrap the model weights from 2D pretrained models. In this manner, training 3D networks for video classification becomes feasible and yields much better results.
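To give a sense of how inflation works, here is a minimal sketch (an illustrative helper, not GluonCV's internal code): a 2D kernel of shape (out_channels, in_channels, kH, kW) is repeated along a new temporal axis and rescaled so that a temporally constant input produces the same response as the original 2D filter.

import torch

def inflate_conv_weight(weight_2d, time_dim=3):
    # (out, in, kH, kW) -> (out, in, T, kH, kW); divide by T to preserve the
    # activation magnitude on a video whose frames are all identical
    weight_3d = weight_2d.unsqueeze(2).repeat(1, 1, time_dim, 1, 1)
    return weight_3d / time_dim

w2d = torch.randn(64, 3, 7, 7)      # e.g., the first conv layer of a 2D ResNet
w3d = inflate_conv_weight(w2d, 7)   # an I3D-style 7x7x7 inflated stem kernel
print(w3d.shape)                    # torch.Size([64, 3, 7, 7, 7])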

In this tutorial, we will demonstrate how to load a pre-trained I3D model from gluoncv-model-zoo and classify a video clip from the Internet or your local disk into one of the 400 action classes.

Step by Step

We will try out a pre-trained I3D model on a single video clip.

First, please follow the installation guide to install PyTorch and GluonCV if you haven’t done so yet.

import numpy as np
import decord
import torch

from gluoncv.torch.utils.model_utils import download
from gluoncv.torch.data.transforms.videotransforms import video_transforms, volume_transforms
from gluoncv.torch.engine.config import get_cfg_defaults
from gluoncv.torch.model_zoo import get_model

Then, we download a video and extract a 32-frame clip from it (every other frame from the first 64 frames).

url = 'https://github.com/bryanyzhu/tiny-ucf101/raw/master/abseiling_k400.mp4'
video_fname = download(url)
vr = decord.VideoReader(video_fname)
frame_id_list = range(0, 64, 2)
video_data = vr.get_batch(frame_id_list).asnumpy()
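If you want to sanity-check the decoded clip, decord returns frames in (num_frames, height, width, channels) layout:

print(video_data.shape)  # expected: (32, H, W, 3), i.e., 32 frames sampled with a temporal stride of 2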

Now we define transformations for the video clip. This transformation function does four things: (1) resize the shorter side of the video clip to short_side_size, (2) center crop the video clip to crop_size x crop_size, (3) transpose the video clip to num_channels x num_frames x height x width, and (4) normalize it with the mean and standard deviation computed over all ImageNet images.

crop_size = 224
short_side_size = 256
transform_fn = video_transforms.Compose([video_transforms.Resize(short_side_size, interpolation='bilinear'),
                                         video_transforms.CenterCrop(size=(crop_size, crop_size)),
                                         volume_transforms.ClipToTensor(),
                                         video_transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225])])


clip_input = transform_fn(video_data)
print('Video data is downloaded and preprocessed.')
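As a quick check, ClipToTensor arranges the clip in (num_channels, num_frames, height, width) order, so the preprocessed tensor should look like this:

print(clip_input.shape)  # expected: torch.Size([3, 32, 224, 224])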

Next, we load a pre-trained I3D model. Make sure to set the pretrained option in the configuration file to True.

config_file = './scripts/action-recognition/configuration/i3d_resnet50_v1_kinetics400.yaml'
cfg = get_cfg_defaults()
cfg.merge_from_file(config_file)
model = get_model(cfg)
model.eval()  # switch to evaluation mode so BatchNorm and Dropout behave correctly at inference
print('%s model is successfully loaded.' % cfg.CONFIG.MODEL.NAME)
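If you prefer not to edit the YAML file, the same flag can be set on the config object before building the model. This assumes the GluonCV config key CONFIG.MODEL.PRETRAINED; please check your configuration defaults.

cfg.CONFIG.MODEL.PRETRAINED = True  # assumed key; equivalent to editing the YAML file
model = get_model(cfg)
model.eval()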

Finally, we prepare the video clip and feed it to the model.

with torch.no_grad():
    pred = model(torch.unsqueeze(clip_input, dim=0)).numpy()
print('The input video clip is classified to be class %d' % (np.argmax(pred)))

We can see that our pre-trained model predicts this video clip to be abseiling action with high confidence.
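To inspect that confidence yourself, you can convert the raw logits into probabilities and list the top-5 class indices (mapping the indices to human-readable names requires the Kinetics400 label list, which this tutorial does not download):

probs = torch.softmax(torch.from_numpy(pred), dim=1)[0]
top5_prob, top5_idx = torch.topk(probs, k=5)
for p, i in zip(top5_prob.tolist(), top5_idx.tolist()):
    print('class %d: probability %.4f' % (i, p))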

Next Step

If you would like to dive deeper into finetuning SOTA video models on your own datasets, feel free to read the next tutorial on finetuning.
