3. Extracting video features from pre-trained models

Feature extraction is useful when you don’t have a large annotated dataset or the computing resources to train a model from scratch for your use case. It is also useful for visualizing what the model has learned. In this tutorial, we provide a simple unified solution. The only thing you need to prepare is a text file containing information about your videos (e.g., the paths to your videos); we will take care of the rest. You can extract strong video features from many popular pre-trained models in the GluonCV video model zoo using a single command line.


Feel free to skip the tutorial, because the feature extraction script is self-contained and ready to launch.

Download Full Python Script: feat_extract_pytorch.py

Please check out the model_zoo to select your preferred pre-trained model.

python feat_extract_pytorch.py --config-file CONFIG

Prepare Data

Your data can be stored in any hierarchy. Just use the format we adopt for training models in the previous tutorial and save the data annotation file as video.txt.

/home/ubuntu/your_data/video_001.mp4 200 0
/home/ubuntu/your_data/video_001.mp4 300 1
/home/ubuntu/your_data/video_002.mp4 100 2
/home/ubuntu/your_data/video_003.mp4 400 2
/home/ubuntu/your_data/video_004.mp4 200 1
/home/ubuntu/your_data/video_100.mp4 100 3

Each line has three fields: the path to the video, the number of video frames, and the video label. However, the second and third fields are not used in the code; they are just placeholders, so you can put any positive number in these two places.
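If your videos already sit in one directory tree, the annotation file can be generated rather than written by hand. The following is a minimal sketch, not part of the GluonCV script: the helper names, the list of video extensions, and the placeholder value 1 are all assumptions made for illustration.

```python
from pathlib import Path

# Assumed set of video extensions; adjust it to match your own data.
VIDEO_EXTENSIONS = {".mp4", ".avi", ".mkv", ".webm"}


def build_annotation_lines(video_dir, placeholder_frames=1, placeholder_label=1):
    """Return one 'path num_frames label' line per video found under video_dir.

    The frame count and label are placeholders, since the feature
    extraction script only reads the path.
    """
    root = Path(video_dir).resolve()
    lines = []
    for path in sorted(root.rglob("*")):
        if path.is_file() and path.suffix.lower() in VIDEO_EXTENSIONS:
            lines.append(f"{path} {placeholder_frames} {placeholder_label}")
    return lines


def write_annotation_file(video_dir, output_file="video.txt"):
    """Write the annotation lines to output_file and return the video count."""
    lines = build_annotation_lines(video_dir)
    Path(output_file).write_text("\n".join(lines) + "\n")
    return len(lines)
```

For example, write_annotation_file("/home/ubuntu/your_data") would write video.txt to the current directory. Because the format is space-separated, paths that contain spaces are likely to be misparsed, so it is safer to avoid them.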

Note that, at this moment, we only support extracting features from videos directly.

Once you have prepared video.txt, you can start extracting features with:

python feat_extract_pytorch.py --config-file ./scripts/action-recognition/configuration/i3d_resnet50_v1_feat.yaml

The extracted features will be saved to a directory defined in the config file. Each video will have one feature file. For example, video_001.mp4 will have a feature file named i3d_resnet50_v1_kinetics400_video_001_feat.npy. The feature is extracted from the center of the video using a 32-frame clip.
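To make the naming and sampling concrete, here is a rough sketch of how a feature file name and a centered 32-frame clip could be derived. Both helpers are hypothetical: the name pattern is inferred from the single example above, and the sampling assumes consecutive frames with the last frame repeated for short videos, which may differ from what the script actually does (for instance, it may use a frame stride).

```python
from pathlib import Path


def feature_file_name(video_path, model_name="i3d_resnet50_v1_kinetics400"):
    """Build a feature file name following the pattern in the example above."""
    return f"{model_name}_{Path(video_path).stem}_feat.npy"


def center_clip_indices(num_frames, clip_len=32):
    """Return clip_len consecutive frame indices centered in the video.

    Assumption: videos shorter than clip_len repeat their last frame.
    """
    if num_frames <= 0:
        raise ValueError("num_frames must be positive")
    start = max(0, (num_frames - clip_len) // 2)
    return [min(start + i, num_frames - 1) for i in range(clip_len)]
```

For a 200-frame video, this sketch selects frames 84 through 115. The saved .npy file itself can be read back with numpy.load.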

There are many other options and other models you can choose, e.g., resnet50_v1b_feat.yaml, slowfast_4x16_resnet50_feat.yaml, tpn_resnet50_f32s2_feat.yaml, r2plus1d_v1_resnet50_feat.yaml, i3d_slow_resnet50_f32s2_feat.yaml. Try extracting features from these state-of-the-art video models on your own dataset and see which one performs best.
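One way to compare several models is to run the script once per config file. The sketch below only builds the command lines. It assumes that all of the listed config files live in the same directory as i3d_resnet50_v1_feat.yaml, which the tutorial does not state explicitly, and build_commands is a hypothetical helper.

```python
import shlex

# Assumed location: the directory used for i3d_resnet50_v1_feat.yaml above.
CONFIG_DIR = "./scripts/action-recognition/configuration"

CONFIG_NAMES = [
    "i3d_resnet50_v1_feat.yaml",
    "resnet50_v1b_feat.yaml",
    "slowfast_4x16_resnet50_feat.yaml",
    "tpn_resnet50_f32s2_feat.yaml",
    "r2plus1d_v1_resnet50_feat.yaml",
    "i3d_slow_resnet50_f32s2_feat.yaml",
]


def build_commands(config_names=CONFIG_NAMES, config_dir=CONFIG_DIR):
    """Return one argument list per config file, ready for subprocess.run."""
    return [
        ["python", "feat_extract_pytorch.py", "--config-file", f"{config_dir}/{name}"]
        for name in config_names
    ]


if __name__ == "__main__":
    # Print the commands instead of running them, so they can be checked first.
    for cmd in build_commands():
        print(shlex.join(cmd))
```

Each argument list can be passed to subprocess.run(cmd, check=True) to launch the extraction for that model.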
