5. DistributedDataParallel (DDP) Framework

Training deep neural networks on videos is very time consuming. For example, training a state-of-the-art SlowFast network on Kinetics400 dataset (with 240K 10-seconds short videos) using a server with 8 V100 GPUs takes more than 10 days. Slow training causes long research cycles and is not friendly for new comers and students to work on video related problems. Using distributed training is a natural choice. Spreading the huge computation over multiple machines can speed up training a lot. However, only a few open sourced Github repositories on video understanding support distributed training, and they often lack documentation for this feature. Besides, there is not much information/tutorial online on how to perform distributed training for deep video models.

Hence, we provide a simple tutorial here to demonstrate how to use our DistributedDataParallel (DDP) framework to perform efficient distributed training. Note that, even in a single instance with multiple GPUs, DDP should be used and is much more efficient that vanilla dataparallel.

Distributed training

There are two ways in which we can distribute the workload of training a neural network across multiple devices, data parallelism and model parallelism. Data parallelism refers to the case where each device stores a complete copy of the model. Each device works with a different part of the dataset, and the devices collectively update a shared model. When models are so large that they don’t fit into device memory, then model parallelism is useful. Here, different devices are assigned the task of learning different parts of the model. In this tutorial, we describe how to train a model with devices distributed across machines in a data parallel way. To be specific, we adopt DistributedDataParallel (DDP), which implements data parallelism at the module level that can be applied across multiple machines. DDP spawn multiple processes and create a single GPU instance per process. It can spread the computation more evenly and particular be useful for deep video model training.

In order to keep this tutorial concise, I wouldn’t go into details of what is DDP. Readers can refer to Pytorch Official Tutorials for more information.

How to use our DDP framework?

In order to perform distributed training, you need to (1) prepare the cluster; (2) prepare environment; and (3) prepare your code and data.

We need a cluster that each node can communicate with each other. The first step is to generate ssh keys for each machine. For better illustration, let’s assume we have 2 machines, node1 and node2.

First, ssh into node1 and type

ssh-keygen -t rsa

Just follow the default, you will have a file named id_rsa and a file named id_rsa.pub, both under the ~/.ssh/ folder. id_rsa is the private RSA key, and id_rsa.pub is its public key.

Second, copy both files (id_rsa and id_rsa.pub) of node1 to all other machines. For each machine, you will find an authorized_keys file under ~/.ssh/ folder as well. Append authorized_keys with the content of id_rsa.pub This step will make sure all the machines in the cluster is able to communicate with each other.

Before moving on to next step, it is better to perform some sanity checks to make sure the communication is good. For example, if you can successfully ssh into other machines, it means they can communicate with each other now. You are good to go. If there is any error during ssh, you can use option -vvv to get verbose information for debugging.

Once you get the cluster ready, it is time to prepare the enviroment. Please check GluonCV installation guide for more information. Every machine should have the same enviroment, such as CUDA, PyTorch and GluonCV, so that the code is runnable.

Now it is time to prepare your code and data in each node. In terms of code change, you only need to modify the DDP_CONFIG part in the yaml configuration file. For example, in our case we have 2 nodes, then change WORLD_SIZE to 2. WOLRD_URLS contains all machines’ IP used for training (only support LAN IP for now), you can put their IP addresses in the list. If AUTO_RANK_MATCH is True, the launcher will automatically assign a world rank number to each machine in WOLRD_URLS, and consider the first machine as the root. Please make sure to use root’s IP for DIST_URL. If AUTO_RANK_MATCH is False, you need to manually set a ranking number to each instance. The instance assigned with rank=0 will be considered as the root machine. We suggest always enable AUTO_RANK_MATCH. An example configuration look like below,

DDP_CONFIG:
  AUTO_RANK_MATCH: True
  WORLD_SIZE: 2 # Total Number of machines
  WORLD_RANK: 0 # Rank of this machine
  DIST_URL: 'tcp://172.31.72.195:23456'
  WOLRD_URLS: ['172.31.72.195', '172.31.72.196']

  GPU_WORLD_SIZE: 8 # Total Number of GPUs, will be assigned automatically
  GPU_WORLD_RANK: 0 # Rank of GPUs, will be assigned automatically
  DIST_BACKEND: 'nccl'
  GPU: 0 # Rank of GPUs in the machine, will be assigned automatically
  DISTRIBUTED: True

Once it is done, you can kickstart the training on each machine. Simply run train_ddp_pytorch.py/test_ddp_pytorch.py with the desire configuration file on each instance, e.g.,

python train_ddp_pytorch.py --config-file XXX.yaml

If you are using multiple instances for training, we suggest you start running on the root instance firstly and then start the code on other instances. The log will only be shown on the root instance by default.

In the end, we want to point out that we have integrated dataloader and training/testing loop in our DDP framework. If you simply want to try out our model zoo on your dataset/usecase, please see previous tutorial on how to finetune. If you have your new video model, you can add it to the model zoo (e.g., a single .py file) and enjoy the speed up brought by our DDP framework. You don’t need to handle the multiprocess dataloading and the underlying distributed training setup.

Total running time of the script: ( 0 minutes 0.000 seconds)

Gallery generated by Sphinx-Gallery