Table Of Contents

Action Recognition


MXNet

Here is the model zoo for the video action recognition task. The graph below plots inference throughput versus validation accuracy for the Kinetics400 pre-trained models.

[Figure: inference throughput vs. validation accuracy of Kinetics400 pre-trained models]

Hint

Training commands work with this script: Download train_recognizer.py

A model can have multiple sets of trained parameters, each identified by a hashtag. Parameters shown with a grey name can be downloaded by passing the corresponding hashtag.

  • Download default pretrained weights: net = get_model('i3d_resnet50_v1_kinetics400', pretrained=True)

  • Download weights given a hashtag: net = get_model('i3d_resnet50_v1_kinetics400', pretrained='568a722e')

The test script Download test_recognizer.py can be used for evaluating the models on various datasets.

The inference script Download inference.py can be used for running inference on a list of videos (for demo purposes).


Kinetics400 Dataset

The following table lists pre-trained models trained on Kinetics400.

Note

Our pre-trained models reproduce results from recent state-of-the-art approaches. Please check the reference paper for further information.

All models are trained with an input size of 224x224, except InceptionV3, which is trained and evaluated at 299x299, and the C3D and R2+1D models, which are trained and evaluated at 112x112.

Clip Length is the number of frames in an input clip. 32 (64/2) means the clip contains 32 frames, obtained by randomly selecting 64 consecutive frames from the video and then keeping every other frame. This strategy is widely adopted to reduce computation and memory cost.
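The "32 (64/2)" sampling described above can be sketched in a few lines of Python. This is a minimal illustration of the strategy, not the exact loader code used by the training scripts; the function name, the 300-frame video, and the use of `random` are all illustrative assumptions:

```python
import random

def sample_clip(num_video_frames, clip_window=64, stride=2):
    """Pick `clip_window` consecutive frames at a random offset,
    then keep every `stride`-th one: "32 (64/2)" -> a 32-frame clip."""
    start = random.randint(0, num_video_frames - clip_window)
    return list(range(start, start + clip_window, stride))

# e.g. a 10-second video at 30 fps has 300 frames
indices = sample_clip(300)  # 32 frame indices, spaced 2 apart
```

Skipping every other frame halves the frames that must be decoded and fed through the network while still covering a 64-frame temporal window.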

Segments is the number of segments used during training. For testing (the numbers reported here), we follow convention and use 250 views for 2D networks (25 frames with 10-crop) and 30 views for 3D networks (10 clips with 3-crop).
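Multi-view testing amounts to averaging the per-view class scores before taking the argmax. The sketch below shows the 30-view case for a 3D network (10 clips x 3 crops); `predict_view` is a hypothetical stand-in for a real forward pass, and the score values are fabricated purely so the example runs:

```python
# 30-view testing for a 3D network: 10 clips x 3 spatial crops.
num_clips, num_crops, num_classes = 10, 3, 400

def predict_view(clip, crop):
    # Stand-in for a model forward pass; returns fake per-class scores.
    return [((clip * 31 + crop * 17 + c) % 97) / 97.0 for c in range(num_classes)]

views = [predict_view(clip, crop)
         for clip in range(num_clips) for crop in range(num_crops)]
# Average the 30 score vectors element-wise, then pick the best class.
avg = [sum(scores) / len(views) for scores in zip(*views)]
prediction = max(range(num_classes), key=avg.__getitem__)
```

Averaging over many temporal clips and spatial crops reduces the variance of a single-view prediction, which is why reported top-1 numbers use this protocol.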

For the SlowFast family of networks, our accuracy is slightly below the numbers reported in the paper. The official SlowFast implementation re-encodes every video to a fixed frame rate of 30 fps; for a fair comparison with the other methods we do not adopt that step, which accounts for the small gap.

| Name | Pretrained | Segments | Clip Length | Top-1 | Hashtag | Train Command | Train Log |
|------|------------|----------|-------------|-------|---------|---------------|-----------|
| inceptionv1_kinetics400 [3] | ImageNet | 7 | 1 | 69.1 | 6dcdafb1 | shell script | log |
| inceptionv3_kinetics400 [3] | ImageNet | 7 | 1 | 72.5 | 8a4a6946 | shell script | log |
| resnet18_v1b_kinetics400 [3] | ImageNet | 7 | 1 | 65.5 | 46d5a985 | shell script | log |
| resnet34_v1b_kinetics400 [3] | ImageNet | 7 | 1 | 69.1 | 8a8d0d8d | shell script | log |
| resnet50_v1b_kinetics400 [3] | ImageNet | 7 | 1 | 69.9 | cc757e5c | shell script | log |
| resnet101_v1b_kinetics400 [3] | ImageNet | 7 | 1 | 71.3 | 5bb6098e | shell script | log |
| resnet152_v1b_kinetics400 [3] | ImageNet | 7 | 1 | 71.5 | 9bc70c66 | shell script | log |
| c3d_kinetics400 [2] | Scratch | 1 | 16 (32/2) | 59.5 | a007b5fa | shell script | log |
| p3d_resnet50_kinetics400 [5] | Scratch | 1 | 16 (32/2) | 71.6 | 671ba81c | shell script | log |
| p3d_resnet101_kinetics400 [5] | Scratch | 1 | 16 (32/2) | 72.6 | b30e3a63 | shell script | log |
| r2plus1d_resnet18_kinetics400 [6] | Scratch | 1 | 16 (32/2) | 70.8 | 5a14d1f9 | shell script | log |
| r2plus1d_resnet34_kinetics400 [6] | Scratch | 1 | 16 (32/2) | 71.6 | de2e592b | shell script | log |
| r2plus1d_resnet50_kinetics400 [6] | Scratch | 1 | 16 (32/2) | 73.9 | deaefb14 | shell script | log |
| i3d_inceptionv1_kinetics400 [4] | ImageNet | 1 | 32 (64/2) | 71.8 | 81e0be10 | shell script | log |
| i3d_inceptionv3_kinetics400 [4] | ImageNet | 1 | 32 (64/2) | 73.6 | f14f8a99 | shell script | log |
| i3d_resnet50_v1_kinetics400 [4] | ImageNet | 1 | 32 (64/2) | 74.0 | 568a722e | shell script | log |
| i3d_resnet101_v1_kinetics400 [4] | ImageNet | 1 | 32 (64/2) | 75.1 | 6b69f655 | shell script | log |
| i3d_nl5_resnet50_v1_kinetics400 [7] | ImageNet | 1 | 32 (64/2) | 75.2 | 3c0e47ea | shell script | log |
| i3d_nl10_resnet50_v1_kinetics400 [7] | ImageNet | 1 | 32 (64/2) | 75.3 | bfb58c41 | shell script | log |
| i3d_nl5_resnet101_v1_kinetics400 [7] | ImageNet | 1 | 32 (64/2) | 76.0 | fbfc1d30 | shell script | log |
| i3d_nl10_resnet101_v1_kinetics400 [7] | ImageNet | 1 | 32 (64/2) | 76.1 | 59186c31 | shell script | log |
| slowfast_4x16_resnet50_kinetics400 [8] | Scratch | 1 | 36 (64/1) | 75.3 | 9d650f51 | shell script | log |
| slowfast_8x8_resnet50_kinetics400 [8] | Scratch | 1 | 40 (64/1) | 76.6 | d6b25339 | shell script | log |
| slowfast_8x8_resnet101_kinetics400 [8] | Scratch | 1 | 40 (64/1) | 77.2 | fbde1a7c | shell script | log |

UCF101 Dataset

The following table lists pre-trained models trained on UCF101.

Note

Our pre-trained models reproduce results from recent state-of-the-art approaches. Please check the reference paper for further information.

The top-1 accuracy shown below is on official split 1 of the UCF101 dataset, not the average over the 3 splits.

InceptionV3 is trained and evaluated with an input size of 299x299.

K400 denotes the Kinetics400 dataset; models marked with it are initialized from weights pre-trained on Kinetics400.

| Name | Pretrained | Segments | Clip Length | Top-1 | Hashtag | Train Command | Train Log |
|------|------------|----------|-------------|-------|---------|---------------|-----------|
| vgg16_ucf101 [3] | ImageNet | 3 | 1 | 83.4 | d6dc1bba | shell script | log |
| vgg16_ucf101 [1] | ImageNet | 1 | 1 | 81.5 | 05e319d4 | shell script | log |
| inceptionv3_ucf101 [3] | ImageNet | 3 | 1 | 88.1 | 13ef5c3b | shell script | log |
| inceptionv3_ucf101 [1] | ImageNet | 1 | 1 | 85.6 | 0c453da8 | shell script | log |
| i3d_resnet50_v1_ucf101 [4] | ImageNet | 1 | 32 (64/2) | 83.9 | 7afc7286 | shell script | log |
| i3d_resnet50_v1_ucf101 [4] | ImageNet, K400 | 1 | 32 (64/2) | 95.4 | 760d0981 | shell script | log |

HMDB51 Dataset

The following table lists pre-trained models trained on HMDB51.

Note

Our pre-trained models reproduce results from recent state-of-the-art approaches. Please check the reference paper for further information.

The top-1 accuracy shown below is on official split 1 of the HMDB51 dataset, not the average over the 3 splits.

| Name | Pretrained | Segments | Clip Length | Top-1 | Hashtag | Train Command | Train Log |
|------|------------|----------|-------------|-------|---------|---------------|-----------|
| resnet50_v1b_hmdb51 [3] | ImageNet | 3 | 1 | 55.2 | 682591e2 | shell script | log |
| resnet50_v1b_hmdb51 [1] | ImageNet | 1 | 1 | 52.2 | ba66ee4b | shell script | log |
| i3d_resnet50_v1_hmdb51 [4] | ImageNet | 1 | 32 (64/2) | 48.5 | 0d0ad559 | shell script | log |
| i3d_resnet50_v1_hmdb51 [4] | ImageNet, K400 | 1 | 32 (64/2) | 70.9 | 2ec6bf01 | shell script | log |

Something-Something-V2 Dataset

The following table lists pre-trained models trained on Something-Something-V2.

Note

Our pre-trained models reproduce results from recent state-of-the-art approaches. Please check the reference paper for further information.

| Name | Pretrained | Segments | Clip Length | Top-1 | Hashtag | Train Command | Train Log |
|------|------------|----------|-------------|-------|---------|---------------|-----------|
| resnet50_v1b_sthsthv2 [3] | ImageNet | 8 | 1 | 35.5 | 80ee0c6b | shell script | log |
| i3d_resnet50_v1_sthsthv2 [4] | ImageNet | 1 | 16 (32/2) | 50.6 | 01961e4c | shell script | log |

PyTorch

Here is the PyTorch model zoo for the video action recognition task.

Hint

Training commands work with this script: Download train_ddp_pytorch.py

python train_ddp_pytorch.py --config-file CONFIG

The test script Download test_ddp_pytorch.py can be used to evaluate performance on various datasets. Set MODEL.PRETRAINED = True in the configuration file if you would like to use the trained models from our model zoo.

python test_ddp_pytorch.py --config-file CONFIG
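The CONFIG argument points to a YAML configuration file. A minimal sketch of the relevant fragment is shown below; the MODEL.PRETRAINED key comes from the note above, while the exact key names and nesting should be checked against the config file shipped with each model:

```yaml
# Evaluate a model-zoo checkpoint instead of training from scratch.
MODEL:
  NAME: i3d_resnet50_v1_kinetics400
  PRETRAINED: True   # load the trained weights from the model zoo
```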

Kinetics400 Dataset

The following table lists our models trained on Kinetics400.

Note

Our pre-trained models reproduce results from recent state-of-the-art approaches. Please check the reference paper for further information.

All models are trained with an input size of 224x224, except the R2+1D models, which are trained and evaluated at 112x112.

Clip Length is the number of frames in an input clip. 32 (64/2) means the clip contains 32 frames, obtained by randomly selecting 64 consecutive frames from the video and then keeping every other frame. This strategy is widely adopted to reduce computation and memory cost.

Segment is the number of segments used during training. For testing (the numbers reported here), we follow convention and use 250 views for 2D networks (25 frames with 10-crop) and 30 views for 3D networks (10 clips with 3-crop).

| Name | Pretrained | Segment | Clip Length | Top-1 | Hashtag | Config |
|------|------------|---------|-------------|-------|---------|--------|
| resnet18_v1b_kinetics400 [3] | ImageNet | 7 | 1 | 66.73 | 854b23e4 | config |
| resnet34_v1b_kinetics400 [3] | ImageNet | 7 | 1 | 69.85 | 124a2fa4 | config |
| resnet50_v1b_kinetics400 [3] | ImageNet | 7 | 1 | 70.88 | 9939dbdf | config |
| resnet101_v1b_kinetics400 [3] | ImageNet | 7 | 1 | 72.25 | 172afa3b | config |
| resnet152_v1b_kinetics400 [3] | ImageNet | 7 | 1 | 72.45 | 3dedb835 | config |
| r2plus1d_v1_resnet18_kinetics400 [6] | Scratch | 1 | 16 (32/2) | 71.72 | 340a5952 | config |
| r2plus1d_v1_resnet34_kinetics400 [6] | Scratch | 1 | 16 (32/2) | 72.63 | 5102fd17 | config |
| r2plus1d_v1_resnet50_kinetics400 [6] | Scratch | 1 | 16 (32/2) | 74.92 | 9a3b665c | config |
| r2plus1d_v2_resnet152_kinetics400 [6] | IG65M | 1 | 16 (32/2) | 81.34 | 42707ffc | config |
| ircsn_v2_resnet152_f32s2_kinetics400 [10] | IG65M | 1 | 32 (64/2) | 83.18 | 82855d2c | config |
| i3d_resnet50_v1_kinetics400 [4] | ImageNet | 1 | 32 (64/2) | 74.87 | 18545497 | config |
| i3d_resnet101_v1_kinetics400 [4] | ImageNet | 1 | 32 (64/2) | 75.1 | a9bb4f89 | config |
| i3d_nl5_resnet50_v1_kinetics400 [7] | ImageNet | 1 | 32 (64/2) | 75.17 | 9df1e103 | config |
| i3d_nl10_resnet50_v1_kinetics400 [7] | ImageNet | 1 | 32 (64/2) | 75.93 | 281e1e8a | config |
| i3d_nl5_resnet101_v1_kinetics400 [7] | ImageNet | 1 | 32 (64/2) | 75.81 | 2cea8edd | config |
| i3d_nl10_resnet101_v1_kinetics400 [7] | ImageNet | 1 | 32 (64/2) | 75.93 | 526a2ed0 | config |
| slowfast_4x16_resnet50_kinetics400 [8] | Scratch | 1 | 32 (64/2) | 75.25 | 1d1eadb2 | config |
| slowfast_8x8_resnet50_kinetics400 [8] | Scratch | 1 | 32 (64/2) | 76.66 | e94e9a57 | config |
| slowfast_8x8_resnet101_kinetics400 [8] | Scratch | 1 | 32 (64/2) | 76.95 | db5e9fef | config |
| i3d_slow_resnet50_f32s2_kinetics400 [8] | Scratch | 1 | 32 (64/2) | 77.89 | 078c817b | config |
| i3d_slow_resnet50_f16s4_kinetics400 [8] | Scratch | 1 | 16 (64/4) | 76.36 | a3e419f1 | config |
| i3d_slow_resnet50_f8s8_kinetics400 [8] | Scratch | 1 | 8 (64/8) | 74.41 | 1c3d98a1 | config |
| i3d_slow_resnet101_f32s2_kinetics400 [8] | Scratch | 1 | 32 (64/2) | 78.57 | db37cd51 | config |
| i3d_slow_resnet101_f16s4_kinetics400 [8] | Scratch | 1 | 16 (64/4) | 77.11 | cb6b78d9 | config |
| i3d_slow_resnet101_f8s8_kinetics400 [8] | Scratch | 1 | 8 (64/8) | 76.15 | 82e399c1 | config |
| tpn_resnet50_f8s8_kinetics400 [9] | Scratch | 1 | 8 (64/8) | 77.04 | 368108eb | config |
| tpn_resnet50_f16s4_kinetics400 [9] | Scratch | 1 | 16 (64/4) | 77.33 | 6bf899df | config |
| tpn_resnet50_f32s2_kinetics400 [9] | Scratch | 1 | 32 (64/2) | 78.9 | 27710ce8 | config |
| tpn_resnet101_f8s8_kinetics400 [9] | Scratch | 1 | 8 (64/8) | 78.1 | 092c2f7f | config |
| tpn_resnet101_f16s4_kinetics400 [9] | Scratch | 1 | 16 (64/4) | 79.39 | 647080df | config |
| tpn_resnet101_f32s2_kinetics400 [9] | Scratch | 1 | 32 (64/2) | 79.7 | a94422a9 | config |

Kinetics700 Dataset

The following table lists our models trained on Kinetics700.

| Name | Pretrained | Segment | Clip Length | Top-1 | Hashtag | Config |
|------|------------|---------|-------------|-------|---------|--------|
| i3d_slow_resnet101_f16s4_kinetics700 [8] | Scratch | 1 | 16 (64/4) | 67.65 | b5be1a2e | config |

Something-Something-V2 Dataset

The following table lists our models trained on Something-Something-V2.

Note

Our pre-trained models reproduce results from recent state-of-the-art approaches. Please check the reference paper for further information.

| Name | Pretrained | Segment | Clip Length | Top-1 | Hashtag | Config |
|------|------------|---------|-------------|-------|---------|--------|
| resnet50_v1b_sthsthv2 [3] | ImageNet | 8 | 1 | 35.16 | cbb9167b | config |
| i3d_resnet50_v1_sthsthv2 [4] | ImageNet | 1 | 16 (32/2) | 49.61 | e975d989 | config |

Reference

[1] Limin Wang, Yuanjun Xiong, Zhe Wang and Yu Qiao. “Towards Good Practices for Very Deep Two-Stream ConvNets.” arXiv preprint arXiv:1507.02159, 2015.

[2] Du Tran, Lubomir Bourdev, Rob Fergus, Lorenzo Torresani and Manohar Paluri. “Learning Spatiotemporal Features with 3D Convolutional Networks.” In International Conference on Computer Vision (ICCV), 2015.

[3] Limin Wang, Yuanjun Xiong, Zhe Wang, Yu Qiao, Dahua Lin, Xiaoou Tang and Luc Van Gool. “Temporal Segment Networks: Towards Good Practices for Deep Action Recognition.” In European Conference on Computer Vision (ECCV), 2016.

[4] Joao Carreira and Andrew Zisserman. “Quo Vadis, Action Recognition? A New Model and the Kinetics Dataset.” In Computer Vision and Pattern Recognition (CVPR), 2017.

[5] Zhaofan Qiu, Ting Yao and Tao Mei. “Learning Spatio-Temporal Representation with Pseudo-3D Residual Networks.” In International Conference on Computer Vision (ICCV), 2017.

[6] Du Tran, Heng Wang, Lorenzo Torresani, Jamie Ray, Yann LeCun and Manohar Paluri. “A Closer Look at Spatiotemporal Convolutions for Action Recognition.” In Computer Vision and Pattern Recognition (CVPR), 2018.

[7] Xiaolong Wang, Ross Girshick, Abhinav Gupta and Kaiming He. “Non-local Neural Networks.” In Computer Vision and Pattern Recognition (CVPR), 2018.

[8] Christoph Feichtenhofer, Haoqi Fan, Jitendra Malik and Kaiming He. “SlowFast Networks for Video Recognition.” In International Conference on Computer Vision (ICCV), 2019.

[9] Ceyuan Yang, Yinghao Xu, Jianping Shi, Bo Dai and Bolei Zhou. “Temporal Pyramid Network for Action Recognition.” In Computer Vision and Pattern Recognition (CVPR), 2020.

[10] Du Tran, Heng Wang, Lorenzo Torresani and Matt Feiszli. “Video Classification with Channel-Separated Convolutional Networks.” In International Conference on Computer Vision (ICCV), 2019.