Warning: This is not official code. It does not achieve promising results yet and is somewhat slow.
Open an issue or contact me directly at yhshi@bupt.edu.cn.
-
This repo attempts to reproduce "Tree-Structured Reinforcement Learning for Sequential Object Localization" with PyTorch 1.4.0.
-
Requirements (higher or lower versions might also work)
- pytorch 1.4.0
- numpy 1.18.1
- Prefetch version of DataLoader: IgorSusmelj/pytorch-styleguide#5
- Pillow (PIL) 7.1.2
- matplotlib 3.2.1
- tensorboard 2.1.0
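
The prefetching DataLoader listed above prepares the next batch in a background thread while the current one is being consumed. The sketch below illustrates that idea with only the standard library. It is not the code from the linked issue, and the names `prefetch` and `max_prefetch` are hypothetical.

```python
import queue
import threading


def prefetch(iterable, max_prefetch=2):
    """Yield items from `iterable`, producing them in a background thread.

    Illustrative only: `prefetch` and `max_prefetch` are hypothetical names,
    not part of PyTorch or of the linked style guide.
    """
    buffer = queue.Queue(maxsize=max_prefetch)
    done = object()  # sentinel marking the end of the stream

    def worker():
        try:
            for item in iterable:
                buffer.put(item)  # blocks once max_prefetch items are waiting
        finally:
            buffer.put(done)  # always signal the consumer, even on error

    # Daemon thread: it will not keep the process alive if iteration stops early.
    threading.Thread(target=worker, daemon=True).start()

    while True:
        item = buffer.get()
        if item is done:
            return
        yield item


if __name__ == "__main__":
    # Any iterable works here, for example a torch.utils.data.DataLoader.
    for batch in prefetch(range(5)):
        print(batch)
```

Items come out in the same order as the wrapped iterable produces them; only the waiting time is overlapped with the consumer's work.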
-
Branches
-
main: (Try this first)
- basic code
- load images from disk
- fix the aspect ratio to 1:1 and the image size to (224, 224)
- support the combined VOC2007 and VOC2012 dataset: training on 07+12 trainval (~16000 images), testing on 07 test (~5000 images)
-
faster:
- pre-extract feature maps to speed things up
- support loading all feature maps into memory
- keep the original aspect ratio and resize the longest edge to a defined size (e.g. 500)
- support VOC2007 only: training on 07 trainval (~5000 images), testing on 07 test (~5000 images)
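
The two branches size their input images differently. A minimal sketch of the two target-size rules follows; the function names are hypothetical, and whether images smaller than the defined size are upscaled is an assumption here, not something confirmed by the repo.

```python
def fixed_size(width, height, size=224):
    """main branch rule: ignore the aspect ratio and use a square target."""
    return (size, size)


def longest_edge_size(width, height, max_size=500):
    """faster branch rule: keep the aspect ratio, longest edge -> max_size.

    Assumption: smaller images are scaled up as well as larger ones down.
    """
    scale = max_size / max(width, height)
    new_width = max(1, round(width * scale))
    new_height = max(1, round(height * scale))
    return (new_width, new_height)


if __name__ == "__main__":
    print(fixed_size(500, 375))
    print(longest_edge_size(1000, 500, max_size=500))
```

Either result is a (width, height) pair that could be passed to PIL's `Image.resize`.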
-
-
Usage
-
training
- mkdir model_params
- change settings via set_args() and the global variables
- python train.py
-
evaluating
- e.g. python evaluate.py --model_name debug --model_check_point epoch_x_iter_x.pth.tar
- result.json will be saved to ./model_params/debug
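
To show how the two flags in the evaluation command fit together, here is a hypothetical parser. Only the flag names and the ./model_params/<model_name> output location come from this README; the actual evaluate.py may define more options or different defaults.

```python
import argparse
import os


def build_parser():
    # Hypothetical parser: only the two flag names are taken from the README.
    parser = argparse.ArgumentParser(description="Evaluate a saved checkpoint")
    parser.add_argument("--model_name", required=True,
                        help="sub-directory of ./model_params, e.g. debug")
    parser.add_argument("--model_check_point", required=True,
                        help="checkpoint file name, e.g. epoch_x_iter_x.pth.tar")
    return parser


def output_dir(model_name, root="./model_params"):
    # Results are written under ./model_params/<model_name>, per the README.
    return os.path.join(root, model_name)


if __name__ == "__main__":
    args = build_parser().parse_args(
        ["--model_name", "debug", "--model_check_point", "epoch_x_iter_x.pth.tar"]
    )
    print(output_dir(args.model_name))
```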
-
visualizing
- e.g. python visualize_tree.py --json_path ./model_params/debug/results.json --vis_dir vis --num_images 50
- ./model_params/debug/vis will be created to store the bbox-tree pictures
-
extracting
- e.g. python preprocess.py --year 2007 --image_set trainval --max_size 500
- see preprocess.py in branch faster for more details
-
-
Dataset
- torchvision.datasets.VOCDetection supports downloading VOC2007 and VOC2012 by setting download=True
- Or mkdir data and follow the file structure shown below:

data
├── feature_map_2007_trainval.h5
├── img_info_2007_trainval.pkl
├── voc2007
│   └── VOCdevkit
└── voc2012
    └── VOCdevkit
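
A small check of the layout shown above can catch a misplaced folder before training starts. This is a sketch: `missing_entries` and `load_voc` are hypothetical helpers, and treating the two feature-map files as optional (presumably only needed on the faster branch) is an assumption.

```python
from pathlib import Path

# Entries taken from the tree above. Assumption: the two feature-map files
# are only required when working with pre-extracted features.
VOC_DIRS = ["voc2007/VOCdevkit", "voc2012/VOCdevkit"]
FEATURE_FILES = ["feature_map_2007_trainval.h5", "img_info_2007_trainval.pkl"]


def missing_entries(root="data", need_features=False):
    """Return the expected relative paths that do not exist under `root`."""
    root = Path(root)
    expected = list(VOC_DIRS)
    if need_features:
        expected += FEATURE_FILES
    return [rel for rel in expected if not (root / rel).exists()]


def load_voc(root, year, image_set, download=False):
    """Build a torchvision VOCDetection dataset rooted at e.g. data/voc2007.

    With download=True torchvision fetches the archive into `root` first.
    """
    from torchvision.datasets import VOCDetection  # imported lazily
    return VOCDetection(root, year=year, image_set=image_set, download=download)


if __name__ == "__main__":
    print(missing_entries("data", need_features=True))
```

An empty list from `missing_entries` means every expected entry was found.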
-
-
Overview of the model and search method from original paper:
-
Major implementation differences between this repo and the original paper (for the sake of speed):
-
(ResNet50 + average pooling) instead of (VGG16 + flatten) is used in the feature map extractor; see model/encoder.py for more details.
-
Only 4 layers of the tree are generated (including the root node) during the evaluation search.
-
The action history in the state is reduced from 50 to 3.*
-
The maximum number of search steps during training is reduced from 50 to 15.
* : In my opinion, since I adopt a 4-layer tree, which requires only 3 actions to go from root to leaf, a larger action history would simply add extra zeros to the input.
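
The footnote's argument can be checked with a toy encoding of the action history. The sketch assumes the history is stored as concatenated one-hot vectors and that there are 13 actions (5 scaling plus 8 translation); both are assumptions about the setup, and `encode_history` is a hypothetical helper, not code from this repo.

```python
NUM_ACTIONS = 13  # assumed: 5 scaling + 8 translation actions


def encode_history(actions, history_size, num_actions=NUM_ACTIONS):
    """Concatenate one-hot vectors of the most recent actions, zero-padded."""
    recent = actions[-history_size:] if history_size > 0 else []
    vector = [0] * (history_size * num_actions)
    for slot, action in enumerate(recent):
        vector[slot * num_actions + action] = 1
    return vector


if __name__ == "__main__":
    path = [0, 7, 12]  # a root-to-leaf path in a 4-layer tree has 3 actions
    short = encode_history(path, history_size=3)
    long = encode_history(path, history_size=50)
    # Same three non-zero entries; the longer vector only adds zeros.
    print(len(short), sum(short))
    print(len(long), sum(long))
```

With at most 3 actions per path, the 50-slot history carries the same three non-zero entries as the 3-slot one, and every remaining entry is zero.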
-
-
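
The 4-layer search tree mentioned in the implementation differences above can be sketched as a breadth-first expansion. The sketch assumes every node produces two children (one per action group); `expand_tree` and `toy_step` are hypothetical names, and `toy_step` is a stand-in for the agent, not the repo's search code.

```python
from collections import deque


def expand_tree(root_box, levels, step):
    """Breadth-first expansion in which every node yields two children.

    `step(box)` is a hypothetical callback returning the two child boxes.
    `levels` counts the root, so levels=4 means 3 actions from root to leaf.
    """
    nodes = [root_box]
    frontier = deque([(root_box, 1)])
    while frontier:
        box, level = frontier.popleft()
        if level == levels:
            continue  # leaves are not expanded further
        for child in step(box):
            nodes.append(child)
            frontier.append((child, level + 1))
    return nodes


def toy_step(box):
    # Toy stand-in for the agent: split the box into left and right halves.
    x1, y1, x2, y2 = box
    mid = (x1 + x2) / 2
    return [(x1, y1, mid, y2), (mid, y1, x2, y2)]


if __name__ == "__main__":
    boxes = expand_tree((0, 0, 224, 224), levels=4, step=toy_step)
    print(len(boxes))  # 1 + 2 + 4 + 8 nodes
```

Under this assumption a 4-layer tree gives 15 candidate boxes per image, 8 of which are leaves.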
Results (based on Branch faster)
-
Problems (Help wanted!)
- Huge gap between my implementation and the original paper.
- The loss has real trouble converging.
- Reward seems to improve slowly.
- Visualization of bbox scaling and moving is not very convincing.
- Scaling up the dataset (07+12) brings little improvement in final performance but takes 3x the training time.