Due to the YouTube license, we cannot directly offer our processed data. However, you can follow the steps below to download the raw data and process it yourself.
[ NEW❗️]: We just released the OpenDV-mini subset!
Please feel free to try the mini subset by following the steps below. The necessary information is also contained in our OpenDV-YouTube Google Sheet (subset videos are marked as Mini in the Mini / Full Set column).
- The complete dataset, OpenDV-YouTube, is the largest driving video dataset to date, containing more than 1700 hours of real-world driving videos, roughly 300 times larger than the widely used nuScenes dataset.
- The mini subset, OpenDV-mini, contains about 28 hours of videos with diverse geographical distribution and various camera settings. Among these, 25 hours are used as mini-train and the other 3 hours as mini-val.
We recommend processing the dataset in a Linux environment, since Windows may have issues with the file paths.
Install the required packages by running the following commands.
conda create -n opendv python=3.10 -y
conda activate opendv
pip install -r requirements.txt
In case the metadata of the downloaded videos is fragmented, we recommend installing ffmpeg<=3.4.9. Instead of using the following commands, you can also clone and build it directly from the official repository.
# 1. prepare yasm for ffmpeg. If it is already installed on your machine, skip to the next step.
wget https://tortall.net/projects/yasm/releases/yasm-1.3.0.tar.gz
tar -xzvf yasm-1.3.0.tar.gz
cd yasm-1.3.0
./configure
make
make install
# 2. install ffmpeg<=3.4.9.
wget https://ffmpeg.org/releases/ffmpeg-3.4.9.tar.gz
tar -xzvf ffmpeg-3.4.9.tar.gz
cd ffmpeg-3.4.9
./configure
make
make install
# 3. check the installation. Sometimes you may need to reactivate the conda environment to see it working.
ffprobe
First, download the OpenDV-YouTube Google Sheet as a csv file. By default, you should save the file as meta/OpenDV-YouTube.csv. You can change it to any path you want, as long as you also change the csv_path in the command in the next step.
Then, run the following command to preprocess the metadata. The default values for --csv_path (or -i) and --json_path (or -o) are meta/OpenDV-YouTube.csv and meta/OpenDV-YouTube.json, respectively.
python scripts/meta_preprocess.py -i CSV_PATH -o JSON_PATH
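You can also quickly sanity-check the exported csv with the standard library, before or after running the script above. This is only a sketch: the column name Mini / Full Set comes from the sheet described above, and the guarded check is there in case the header differs slightly.
import csv

# quick sanity check of the exported sheet (path follows the default above).
csv_path = "meta/OpenDV-YouTube.csv"
with open(csv_path, "r", encoding="utf-8") as f:
    rows = list(csv.DictReader(f))

print("total videos listed:", len(rows))
# the "Mini / Full Set" column marks which videos belong to OpenDV-mini.
if rows and "Mini / Full Set" in rows[0]:
    mini_count = sum(1 for row in rows if row["Mini / Full Set"].strip() == "Mini")
    print("videos marked as Mini:", mini_count)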
To download the raw data from YouTube, you should first adjust the configurations in configs/download.json.
Note that the script supports multi-threaded downloading, so please set num_workers to a proper value according to your hardware and network conditions.
Also, the format key in the config file should strictly obey the format selection rules of the youtube-dl package. We do not recommend changing it unless you are familiar with the package.
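If you prefer to edit the config programmatically, a minimal sketch is given below. Only the num_workers, format and method keys mentioned in this section are assumed; any other keys in configs/download.json are left untouched.
import json

# load the download config, tune the worker count, and write it back.
cfg_path = "configs/download.json"
with open(cfg_path, "r") as f:
    cfg = json.load(f)

cfg["num_workers"] = 8  # set according to your CPU cores and network bandwidth

print("format selector:", cfg.get("format"))  # keep as-is unless you know the youtube-dl format rules
print("download method:", cfg.get("method"))  # e.g. youtube-dl or yt-dlp, see the note below

with open(cfg_path, "w") as f:
    json.dump(cfg, f, indent=4)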
Now you can run the following command to download the raw video data.
python scripts/youtube_download.py >> download_output.txt
The download will take about
If you wish to use the mini subset, simply add the --mini option to your command, i.e. run the following command.
python scripts/youtube_download.py --mini >> download_output.txt
You may refer to download_exceptions.txt to check whether the download was successful. The file will be automatically generated by the script in the root of the opendv codebase.
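To get a quick summary of how many videos failed, you can count the records in the exception file. The sketch below assumes nothing about the record format beyond one entry per line.
import os

# count how many exception records were logged during the download.
exception_file = "download_exceptions.txt"
if os.path.exists(exception_file):
    with open(exception_file, "r") as f:
        records = [line for line in f if line.strip()]
    print("{} exception record(s) logged in {}".format(len(records), exception_file))
else:
    print("no exception file found.")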
If downloading with youtube-dl is not successful, you can change the method in the config from youtube-dl to yt-dlp.
When the download is finished, first set the configurations in configs/video2img.json to the values you expect. The script also supports multi-threaded processing, so you can set num_workers to a proper value according to your hardware.
Note that if you want to align with the annotations we provide, frame_rate should not be changed.
Then, you can run the following command to preprocess the raw video data.
python scripts/video2img.py >> vid2img_output.txt
The preprocessing will take about
If you wish to use the mini subset, simply add the --mini option to your command, i.e. run the following command.
python scripts/video2img.py --mini >> vid2img_output.txt
You may refer to vid2img_exceptions.txt to check the processing status.
The full annotation data, including commands and contexts of video clips, is available at OpenDV-YouTube-Language. The files are in json format, with a total size of about 14 GB.
The annotation data is aligned with the structure of the preprocessed data. You can use the following code to load the annotations for each split.
import json

# for train
full_annos = []
for split_id in range(10):
    split = json.load(open("10hz_YouTube_train_split{}.json".format(split_id), "r"))
    full_annos.extend(split)

# for val
val_annos = json.load(open("10hz_YouTube_val.json", "r"))
Annotations will be loaded into full_annos as a list, where each element contains the annotations for one video clip. All elements in the list are dictionaries with the following structure.
{
    "cmd": <int> -- the command of the ego vehicle in the video clip.
    "blip": <str> -- the context, i.e. the BLIP description of the center frame in the video clip.
    "folder": <str> -- the relative path from the processed OpenDV-YouTube dataset root to the image folder of the video clip.
    "first_frame": <str> -- the filename of the first frame in the clip. Note that this frame is included in the clip.
    "last_frame": <str> -- the filename of the last frame in the clip. Note that this frame is included in the clip.
}
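For example, to recover the ordered image paths of a single clip from its folder, first_frame and last_frame fields, a sketch like the following should work. The data root below is a placeholder for wherever you extracted the frames, and we assume frame filenames sort chronologically (which holds for zero-padded frame indices).
import os

DATA_ROOT = "path/to/processed/OpenDV-YouTube"  # placeholder: your processed dataset root

def clip_frame_paths(anno, data_root=DATA_ROOT):
    # "folder" is relative to the processed dataset root.
    folder = os.path.join(data_root, anno["folder"])
    frames = sorted(os.listdir(folder))
    start = frames.index(anno["first_frame"])
    end = frames.index(anno["last_frame"])
    # both the first and the last frame are included in the clip.
    return [os.path.join(folder, name) for name in frames[start:end + 1]]

# example: frame paths of the first training clip
# paths = clip_frame_paths(full_annos[0])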
The command, i.e. the cmd field, can be converted to natural language using the map_category_to_caption function. You may refer to cmd2caption.py for details.
The context, i.e. the blip field, is the description of the center frame in the video clip, generated by BLIP2.
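As a quick usage example, continuing from the loading snippet above, you can inspect the command distribution and a sample context without any extra dependencies.
from collections import Counter

# distribution of ego-vehicle commands over all annotated clips.
cmd_counts = Counter(anno["cmd"] for anno in full_annos)
print(cmd_counts.most_common())

# language annotations of a single clip.
sample = full_annos[0]
print("command id:", sample["cmd"])
print("BLIP context:", sample["blip"])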
If you find our work helpful, please cite the following paper.
@inproceedings{yang2024genad,
  title={Generalized Predictive Model for Autonomous Driving},
  author={Jiazhi Yang and Shenyuan Gao and Yihang Qiu and Li Chen and Tianyu Li and Bo Dai and Kashyap Chitta and Penghao Wu and Jia Zeng and Ping Luo and Jun Zhang and Andreas Geiger and Yu Qiao and Hongyang Li},
  booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition},
  year={2024}
}