
Prepare COCO datasets

COCO is a large-scale object detection, segmentation, and captioning dataset. This tutorial walks through the steps of preparing this dataset for GluonCV.


You need 42.7 GB disk space to download and extract this dataset. SSD is preferred over HDD because of its better performance.

The total time to prepare the dataset depends on your Internet speed and disk performance. For example, it often takes 20 min on AWS EC2 with EBS.
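Before starting the download, it can help to confirm that the target disk has enough free space. The following is a minimal sketch using only the Python standard library; the helper name `has_enough_space` is hypothetical, and the 42.7 GB figure is taken from the requirement above and treated as decimal gigabytes, which is an assumption.

```python
import shutil
from pathlib import Path

# Disk space quoted in this tutorial, interpreted as decimal GB (assumption).
REQUIRED_GB = 42.7


def has_enough_space(path, required_gb=REQUIRED_GB):
    """Return True if the filesystem holding `path` has at least `required_gb` GB free."""
    free_gb = shutil.disk_usage(path).free / 1e9
    return free_gb >= required_gb


# Check the home directory, since the data is extracted under it by default.
home = Path.home()
if has_enough_space(home):
    print("Enough free space under", home)
else:
    print("Not enough free space under", home)
```

If the dataset will live on a different volume, pass a directory on that volume instead of the home directory.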

Prepare the dataset

We need the following four files from COCO, whose sizes are 18 GB, 778 MB, 241 MB, and 401 MB.


The easiest way to download and unpack these files is to download the helper script and run it, which will automatically download and extract the data into ~/.mxnet/datasets/coco.

If you already have the above files sitting on your disk, you can set --download-dir to point to them. For example, assuming the files are saved in ~/coco/, you can run:

python --download-dir ~/coco
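When reusing files that are already on disk, a quick listing of the directory can confirm that all four are present and roughly the expected size before pointing --download-dir at it. This is a sketch under stated assumptions: the helper `list_archives` is hypothetical, and it assumes the downloaded files are .zip archives stored directly in the directory.

```python
from pathlib import Path


def list_archives(directory):
    """Return (file name, size in bytes) for each .zip file directly inside `directory`."""
    root = Path(directory).expanduser()
    if not root.is_dir():
        return []
    return sorted((p.name, p.stat().st_size) for p in root.glob("*.zip"))


# ~/coco is the example location used above.
for name, size in list_archives("~/coco"):
    print(f"{name}: {size / 1e6:.0f} MB")
```

An empty listing means the directory is missing or contains no archives, in which case the default download path is the simpler option.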

Read with GluonCV

Loading images and labels is straightforward with gluoncv.data.COCODetection.

from gluoncv import data, utils
from matplotlib import pyplot as plt

train_dataset = data.COCODetection(splits=['instances_train2017'])
val_dataset = data.COCODetection(splits=['instances_val2017'])
print('Num of training images:', len(train_dataset))
print('Num of validation images:', len(val_dataset))


loading annotations into memory...
Done (t=16.19s)
creating index...
index created!
loading annotations into memory...
Done (t=0.46s)
creating index...
index created!
Num of training images: 117266
Num of validation images: 4952

Now let’s visualize one example.

train_image, train_label = train_dataset[0]
bounding_boxes = train_label[:, :4]
class_ids = train_label[:, 4:5]
print('Image size (height, width, RGB):', train_image.shape)
print('Num of objects:', bounding_boxes.shape[0])
print('Bounding boxes (num_boxes, x_min, y_min, x_max, y_max):\n',
      bounding_boxes)
print('Class IDs (num_boxes, ):\n', class_ids)

utils.viz.plot_bbox(train_image.asnumpy(), bounding_boxes, scores=None,
                    labels=class_ids, class_names=train_dataset.classes)
plt.show()  # display the figure when running as a plain script


Image size (height, width, RGB): (480, 640, 3)
Num of objects: 8
Bounding boxes (num_boxes, x_min, y_min, x_max, y_max):
 [[  1.08 187.69 611.67 472.53]
 [311.73   4.31 630.01 231.99]
 [249.6  229.27 564.84 473.35]
 [  0.    13.51 433.48 387.63]
 [376.2   40.36 450.75  85.89]
 [465.78  38.97 522.85  84.64]
 [385.7   73.66 468.72 143.17]
 [364.05   2.49 457.81  72.56]]
Class IDs (num_boxes, ):
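The boxes printed above use corner format, (x_min, y_min, x_max, y_max), in pixels. As a small illustration of how to read them, the sketch below converts a few of the printed boxes into widths and heights using plain Python. The helper `corner_to_size` is hypothetical and not part of GluonCV.

```python
# First three boxes copied from the output above, as (x_min, y_min, x_max, y_max).
boxes = [
    (1.08, 187.69, 611.67, 472.53),
    (311.73, 4.31, 630.01, 231.99),
    (249.6, 229.27, 564.84, 473.35),
]


def corner_to_size(box):
    """Convert a corner-format box to its (width, height) in pixels."""
    x_min, y_min, x_max, y_max = box
    return x_max - x_min, y_max - y_min


for box in boxes:
    width, height = corner_to_size(box)
    print(f"width={width:.2f}, height={height:.2f}")
```

Each width and height should stay within the 640 x 480 image size reported above, which is a quick sanity check that the coordinates are in the expected order.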

Finally, to use both train_dataset and val_dataset for training, we can pass them through data transformations and load them with a data loader.
