02. Predict depth from an image sequence or a video with pre-trained Monodepth2 models

This article will demonstrate how to estimate depth from your image sequence or video stream.

Please follow the installation guide to install MXNet and GluonCV if you have not done so already.

First, import the necessary modules.

import os
import argparse
import time
import PIL.Image as pil
import numpy as np

import mxnet as mx
from mxnet.gluon.data.vision import transforms

import gluoncv
from gluoncv.model_zoo.monodepthv2.layers import disp_to_depth

import matplotlib as mpl
import matplotlib.cm as cm
import cv2

# using cpu
ctx = mx.cpu(0)

Prepare the data

In this tutorial, we use one sequence from the KITTI RAW dataset as an example. Because KITTI RAW provides image sequences rather than videos, the input format in this tutorial is an image sequence.

Follow the command to download example data:

mkdir -p ~/.mxnet/datasets/kitti/example
cd ~/.mxnet/datasets/kitti/example
wget https://s3.eu-central-1.amazonaws.com/avg-kitti/raw_data/2011_09_26_drive_0095/2011_09_26_drive_0095_sync.zip
unzip 2011_09_26_drive_0095_sync.zip

After getting the dataset, we can easily load images with PIL.

data_path = os.path.expanduser("~/.mxnet/datasets/kitti/example/2011_09_26/2011_09_26_drive_0095_sync/image_02/data")

files = sorted(os.listdir(data_path))  # sort to keep the frames in temporal order

raw_img_sequences = []
for file in files:
    file = os.path.join(data_path, file)
    img = pil.open(file).convert('RGB')
    raw_img_sequences.append(img)

original_width, original_height = raw_img_sequences[0].size

Loading the model

In this tutorial, we feed each frame of the image sequence into a depth estimation model to obtain the corresponding depth map.

For the model, we use monodepth2_resnet18_kitti_mono_stereo_640x192, as it is accurate and can recover the scaling factor of the stereo baseline.

model_zoo = 'monodepth2_resnet18_kitti_mono_stereo_640x192'
model = gluoncv.model_zoo.get_model(model_zoo, pretrained_base=False, ctx=ctx, pretrained=True)


Downloading /root/.mxnet/models/monodepth2_resnet18_kitti_mono_stereo_640x192-9515c219.zip from https://apache-mxnet.s3-accelerate.dualstack.amazonaws.com/gluon/models/monodepth2_resnet18_kitti_mono_stereo_640x192-9515c219.zip...


Prediction loop

For each frame, we perform the following steps:

  • load a frame from the image sequence

  • pre-process the image

  • estimate the disparity for the image

  • convert the disparity to a depth map

  • store the depth map in the prediction sequence
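The pre-processing step in the loop below uses transforms.ToTensor, which maps an HWC uint8 image in [0, 255] to a CHW float32 array in [0, 1] (the batch dimension is added separately with expand_dims). As a reference, the same conversion can be sketched in plain NumPy; the helper name to_tensor is illustrative:

```python
import numpy as np

def to_tensor(img):
    """Convert an HWC uint8 image to a CHW float32 array scaled to [0, 1].

    A NumPy sketch of what transforms.ToTensor does to a single frame.
    """
    return img.astype(np.float32).transpose(2, 0, 1) / 255.0

frame = np.zeros((192, 640, 3), dtype=np.uint8)
frame[:, :, 0] = 255  # a pure-red frame
chw = to_tensor(frame)
print(chw.shape)  # (3, 192, 640)
```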

min_depth = 0.1
max_depth = 100

# when using a stereo or mono+stereo model, we can recover the real depth value
scale_factor = 5.4
MIN_DEPTH = 1e-3
MAX_DEPTH = 80

feed_height = 192
feed_width = 640

pred_depth_sequences = []
pred_disp_sequences = []
for img in raw_img_sequences:
    img = img.resize((feed_width, feed_height), pil.LANCZOS)
    img = transforms.ToTensor()(mx.nd.array(img)).expand_dims(0).as_in_context(context=ctx)

    outputs = model.predict(img)
    pred_disp, _ = disp_to_depth(outputs[("disp", 0)], min_depth, max_depth)
    pred_disp = pred_disp.squeeze().as_in_context(mx.cpu()).asnumpy()
    pred_disp = cv2.resize(src=pred_disp, dsize=(original_width, original_height))

    pred_depth = 1 / pred_disp
    pred_depth *= scale_factor
    pred_depth[pred_depth < MIN_DEPTH] = MIN_DEPTH
    pred_depth[pred_depth > MAX_DEPTH] = MAX_DEPTH

    pred_depth_sequences.append(pred_depth)
    pred_disp_sequences.append(pred_disp)
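The disp_to_depth call rescales the network's sigmoid output into a disparity range derived from min_depth and max_depth, and depth is the reciprocal of that scaled disparity. A NumPy sketch of the conversion, following the monodepth2 formulation (the function name disp_to_depth_np is illustrative):

```python
import numpy as np

def disp_to_depth_np(disp, min_depth=0.1, max_depth=100):
    """Rescale a sigmoid disparity map into [1/max_depth, 1/min_depth]
    and return both the scaled disparity and its reciprocal depth.

    A NumPy sketch of gluoncv.model_zoo.monodepthv2.layers.disp_to_depth.
    """
    min_disp = 1.0 / max_depth
    max_disp = 1.0 / min_depth
    scaled_disp = min_disp + (max_disp - min_disp) * disp
    depth = 1.0 / scaled_disp
    return scaled_disp, depth

disp = np.array([0.0, 0.5, 1.0])
scaled_disp, depth = disp_to_depth_np(disp)
# disp = 0 maps to max_depth, disp = 1 maps to min_depth
```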

Store results

Here we provide an example of storing the prediction results, including:

  • storing the depth maps

  • storing the disparity maps and saving them to a video

output_path = os.path.join(os.path.expanduser("."), "tmp")

rgb_path = os.path.join(output_path, 'rgb')
if not os.path.exists(rgb_path):
    os.makedirs(rgb_path)
output_sequences = []
for raw_img, pred, file in zip(raw_img_sequences, pred_disp_sequences, files):
    vmax = np.percentile(pred, 95)
    normalizer = mpl.colors.Normalize(vmin=pred.min(), vmax=vmax)
    mapper = cm.ScalarMappable(norm=normalizer, cmap='magma')
    colormapped_im = (mapper.to_rgba(pred)[:, :, :3] * 255).astype(np.uint8)
    im = pil.fromarray(colormapped_im)

    raw_img = np.array(raw_img)
    pred = np.array(im)
    output = np.concatenate((raw_img, pred), axis=0)
    output_sequences.append(output)

    pred_out_file = os.path.join(rgb_path, file)
    cv2.imwrite(pred_out_file, cv2.cvtColor(pred, cv2.COLOR_RGB2BGR))

width = int(output_sequences[0].shape[1] + 0.5)
height = int(output_sequences[0].shape[0] + 0.5)
fourcc = cv2.VideoWriter_fourcc(*'mp4v')
out = cv2.VideoWriter(
    os.path.join(output_path, 'demo.mp4'), fourcc, 20.0, (width, height))

for frame in output_sequences:
    frame = cv2.cvtColor(frame, cv2.COLOR_RGB2BGR)
    out.write(frame)

    # uncomment to display the frames
    # cv2.imshow('demo', frame)

    # if cv2.waitKey(25) & 0xFF == ord('q'):
    #    break

We release the video writer before exiting:

out.release()
# cv2.destroyAllWindows()

The result video for the example:


You can start with the example code.

Download the script to run the demo

Download demo.py

This example command will load an image sequence then store a video:

python demo.py --model_zoo monodepth2_resnet18_kitti_mono_stereo_640x192 --input_format image --data_path ~/.mxnet/datasets/kitti/example/2011_09_26/2011_09_26_drive_0095_sync/image_02/data --output_format video

This example command will load an image sequence then store the corresponding colorized disparity sequence:

python demo.py --model_zoo monodepth2_resnet18_kitti_mono_stereo_640x192 --input_format image --data_path ~/.mxnet/datasets/kitti/example/2011_09_26/2011_09_26_drive_0095_sync/image_02/data --output_format image

For more demo command options, please run python demo.py -h


This tutorial loads the entire image sequence or video into a list, so it is not suitable for very long inputs. It is intended only as a simple example of running prediction with a pretrained Monodepth2 model.
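To handle long sequences without exhausting memory, the frame paths can be streamed one at a time instead of accumulated in lists; each frame can then be loaded, predicted on, and written to the video before the next one is read. A minimal sketch (the helper name iter_frame_paths is illustrative, not part of GluonCV):

```python
import os

def iter_frame_paths(data_path):
    """Yield frame paths in temporal order, one at a time, so a long
    sequence never has to be held in memory all at once."""
    for name in sorted(os.listdir(data_path)):
        yield os.path.join(data_path, name)

# each yielded path can be opened with PIL, fed to the model, and the
# colorized result appended to the cv2.VideoWriter immediately
```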

Total running time of the script: ( 0 minutes 38.922 seconds)
