data pipeline
The data pipeline provides the main functionalities of packing data, data sampling (datasets), augmentation, and data loading. This part is crucial for training efficiency, because the speed of data loading is the bottleneck in many training cases. To this end, our data pipeline has the following key features:
- easily adapts to any annotation format: just write your own annotation parser
- does not rely on any training framework (e.g., PyTorch, MXNet), so it is easy to port
- supports multiple dataset types for highly efficient data access
- adopts libjpeg-turbo for very fast JPEG decoding
- uses multi-threading to take full advantage of multi-core CPUs
Different datasets have different annotation formats. The COCO style is conventional, so you usually have to convert your annotations to it. Here, you do not need to. A parser is an interpreter that understands an annotation format and generates samples for datasets. In general, you write your own parser by inheriting from the base parser. We have implemented three parsers, for WIDERFACE, TT100K and COCO, for your reference.
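The sketch below shows roughly what a custom parser looks like. It is only an illustration: the class and method names, the annotation layout, and the sample fields other than the keys described in the next section are assumptions, so adapt them to the actual base parser in the code.

```python
# A minimal sketch of a custom parser, assuming a plain-dict sample format.
# In the real code, inherit from the provided base parser class.
import json


class MyDatasetParser:
    def __init__(self, annotation_file):
        self.annotation_file = annotation_file

    def generate_samples(self):
        # read your own annotation format and convert it to samples
        with open(self.annotation_file) as f:
            annotations = json.load(f)
        samples = []
        for anno in annotations:
            samples.append({
                'image_path': anno['file_name'],  # enables a disk-based dataset
                'bboxes': anno['boxes'],          # hypothetical field names
                'labels': anno['labels'],
            })
        return samples
```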
Our dataset behaves much like the dataset class in PyTorch. Generally, there are three types of datasets:
- disk-based dataset
- memory-based compressed dataset
- memory-based uncompressed dataset
A disk-based dataset only keeps image paths for subsequent data access. To obtain the final image, two steps are performed: 1) read the raw compressed bytes into memory (disk I/O); 2) decode the bytes to recover the image. Obviously, the whole process is time-consuming. The advantage is that this kind of dataset is memory-friendly.
How to build a disk-based dataset: make sure that the key 'image_path' is set in the samples generated by your parser. Refer to tt100k_parser.py for reference.
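For example, a sample for a disk-based dataset could look like the dict below; only the 'image_path' key comes from this document, the path and the other field names are hypothetical.

```python
# Hypothetical sample for a disk-based dataset: only the image path is kept.
sample = {
    'image_path': '/data/tt100k/train/00001.jpg',  # hypothetical path
    'bboxes': [[10, 20, 110, 220]],                # hypothetical annotation fields
    'labels': [3],
}
```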
A memory-based compressed dataset keeps the compressed data bytes of each sample. To recover the final image, only one step is needed: decompress the bytes (in most cases, the image is compressed in JPEG format). Disk I/O is avoided, so the efficiency is much better than that of a disk-based dataset. However, you have to make sure that memory can hold all the compressed bytes of the dataset.
How to build a memory-based compressed dataset: make sure that the key 'image_bytes' is filled in the samples generated by your parser. Refer to widerface_parser.py for reference.
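A sample for this dataset type could be built as follows; the raw JPEG bytes are read once at packing time and decoded later (e.g., by libjpeg-turbo). The path and the fields other than 'image_bytes' are hypothetical.

```python
# Hypothetical sample for a memory-based compressed dataset.
with open('/data/widerface/train/00001.jpg', 'rb') as f:  # hypothetical path
    jpeg_bytes = f.read()

sample = {
    'image_bytes': jpeg_bytes,          # required key for this dataset type
    'bboxes': [[10, 20, 110, 220]],     # hypothetical annotation fields
    'labels': [1],
}
```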
A memory-based uncompressed dataset stores the original decoded images, so both disk I/O and decompression are avoided. However, you need a very large amount of memory to hold all the data, which is not feasible in most cases.
How to build a memory-based uncompressed dataset: make sure that the key 'image' is filled in all samples generated by your parser.
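A sample for this dataset type could be built as follows; decoding via OpenCV is only an example, and everything except the 'image' key is hypothetical.

```python
# Hypothetical sample for a memory-based uncompressed dataset: the image is
# decoded up front and stored as a raw array.
import cv2

image = cv2.imread('/data/mydataset/00001.jpg')  # hypothetical path, HxWx3 array

sample = {
    'image': image,                     # required key for this dataset type
    'bboxes': [[10, 20, 110, 220]],     # hypothetical annotation fields
    'labels': [2],
}
```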
In summary, the first two kinds of dataset are the most commonly used. Of course, you can create a hybrid dataset as needed; all you have to do is write a parser that generates the proper samples.
Once you finish the parser, you should pack your dataset next. Packing is very simple; just read the code for reference.
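Conceptually, packing just serializes the samples produced by a parser into a single file. The sketch below uses pickle purely as an illustration and does not reflect the repository's actual packing API; the parser class is the hypothetical one from the sketch above.

```python
# Illustrative only: pack the samples produced by a parser into one file.
# The repository provides its own packing code; this just shows the idea.
import pickle

parser = MyDatasetParser('annotations.json')   # hypothetical parser
samples = parser.generate_samples()

with open('mydataset_train.pkl', 'wb') as f:
    pickle.dump(samples, f)
```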
In the data pipeline, samplers are crucial. We have two types of samplers: dataset samplers and region samplers.
A dataset sampler decides the order in which samples are sent to the training process. Its function is easy to understand, so please check the code for reference. If you have more complex ideas, you can implement a new dataset sampler.
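For illustration, a minimal dataset sampler that yields sample indices in a shuffled order each epoch could look like this; the base class and interface in the repository may differ.

```python
# A minimal sketch of a dataset sampler: yield indices in shuffled order.
import random


class RandomOrderDatasetSampler:
    def __init__(self, num_samples, shuffle=True):
        self.num_samples = num_samples
        self.shuffle = shuffle

    def __iter__(self):
        indices = list(range(self.num_samples))
        if self.shuffle:
            random.shuffle(indices)
        return iter(indices)
```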
A region sampler determines which part of the image is cropped for use. This operation has a great impact on the data distribution, and an appropriate region sampler can boost performance noticeably. Some may view region samplers as a form of data augmentation, which is correct. Here, we combine region samplers with dataset samplers to build a more logical data pipeline. We have implemented some region samplers for your reference. In most cases, RandomBBoxCropRegionSampler is enough. Of course, you can write your own region samplers.
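The following sketch only conveys the idea of a region sampler: pick a crop region, here centered on a randomly chosen ground-truth box. It mimics the spirit of RandomBBoxCropRegionSampler but is not its actual implementation, and all names and parameters are assumptions.

```python
# A minimal sketch of a region sampler: return a fixed-size crop region
# centered on a randomly chosen ground-truth box.
import random


class SimpleBBoxCropRegionSampler:
    def __init__(self, crop_size=512):
        self.crop_size = crop_size

    def sample_region(self, image_height, image_width, bboxes):
        # pick a ground-truth box and center the crop on it, clamped to the image
        x1, y1, x2, y2 = random.choice(bboxes)
        cx, cy = (x1 + x2) / 2, (y1 + y2) / 2
        left = int(min(max(cx - self.crop_size / 2, 0), max(image_width - self.crop_size, 0)))
        top = int(min(max(cy - self.crop_size / 2, 0), max(image_height - self.crop_size, 0)))
        return left, top, left + self.crop_size, top + self.crop_size
```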
We rely purely on albumentations for data augmentation. In our code, some simple augmentations are employed, such as horizontal flip and image normalization. Compose these augmentation operators to form your augmentation pipeline. Here, we only provide two pipelines: COCO train/val pipelines and WIDERFACE train/val pipelines. In TT100K_augmentation_pipeline.py, we create an augmentation pipeline for TT100K in an independent .py file; it also serves as an example of how to write your own augmentation pipeline.
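As an example, a small albumentations pipeline with horizontal flip and normalization can be composed as below; the exact operators and parameters used in the provided pipelines may differ.

```python
# A small augmentation pipeline built with albumentations: horizontal flip
# plus normalization, with bounding boxes transformed alongside the image.
import albumentations as A

train_pipeline = A.Compose(
    [
        A.HorizontalFlip(p=0.5),
        A.Normalize(mean=(0.485, 0.456, 0.406), std=(0.229, 0.224, 0.225)),
    ],
    bbox_params=A.BboxParams(format='pascal_voc', label_fields=['labels']),
)

# augmented = train_pipeline(image=image, bboxes=sample['bboxes'], labels=sample['labels'])
```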
We implement a multi-threaded data loader for fast batch producing. (In fact, we tried multi-processing and found it slower than multi-threading.) The data loader combines all of the above modules to prepare data batches. The num_workers must be set to a proper value, say 4-10, according to the maximum number of physical CPU cores.
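To illustrate the multi-threading idea (not the repository's actual loader class), a batch can be assembled by a thread pool like this:

```python
# Illustration of multi-threaded batch producing with a thread pool; the
# repository's data loader has its own interface, this only shows the idea.
from concurrent.futures import ThreadPoolExecutor


def load_sample(index):
    # placeholder for: fetch sample, crop region, run augmentation pipeline
    return {'index': index}


def load_batch(indices, num_workers=8):
    # num_workers should roughly match the number of physical CPU cores
    with ThreadPoolExecutor(max_workers=num_workers) as pool:
        return list(pool.map(load_sample, indices))


batch = load_batch(range(16))
```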
- You can put any information you need throughout the training process into sample objects by adding key-value pairs. Also be aware that Sample has some reserved keys.
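Assuming samples are dict-like, adding extra information could look like the following; the key names here are hypothetical, and the keys described above ('image_path', 'image_bytes', 'image') are examples of ones you should not overwrite.

```python
# Attach custom information to a sample via key-value pairs (hypothetical keys).
sample['difficulty'] = 'hard'
sample['source'] = 'batch_2021_06'
```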