This repository was archived by the owner on Nov 21, 2023. It is now read-only.

Error at: caffe2/core/context_gpu.cu:343: out of memory #5

Closed
@ghost

Description

Hi, thanks for the great work!

We ran into an out-of-memory issue while running test_net.py on the COCO dataset with a 2x TITAN X setup. The installation went fine and the COCO dataset was placed in /lib/datasets/data/coco. The following line was added to test_net.py so that the 3rd and 4th GPUs are used:
os.environ['CUDA_VISIBLE_DEVICES'] = "2,3"
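
A minimal sketch of where that assignment would sit, assuming it is placed near the top of tools/test_net.py (it has to run before Caffe2 creates its CUDA context, otherwise the default GPUs may still be selected):

    import os

    # Assumed placement: restrict visible devices to GPUs 2 and 3 before any
    # Caffe2/CUDA initialization; subprocesses launched later inherit this.
    os.environ['CUDA_VISIBLE_DEVICES'] = "2,3"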

We then ran the following command:
./tools/test_net.py --cfg configs/12_2017_baselines/e2e_mask_rcnn_R-101-FPN_2x.yaml --multi-gpu-testing TEST.WEIGHTS https://s3-us-west-2.amazonaws.com/detectron/35861858/12_2017_baselines/e2e_mask_rcnn_R-101-FPN_2x.yaml.02_32_51.SgT4y1cO/output/train/coco_2014_train:coco_2014_valminusminival/generalized_rcnn/model_final.pkl NUM_GPUS 2

and encountered the following error:

terminate called after throwing an instance of 'caffe2::EnforceNotMet'
  what():  [enforce fail at context_gpu.cu:343] error == cudaSuccess. 2 vs 0. Error at: /home/user/default/caffe2/caffe2/core/context_gpu.cu:343: out of memory Error from operator:
input: "gpu_0/roi_feat_fpn2" input: "gpu_0/roi_feat_fpn3" input: "gpu_0/roi_feat_fpn4" input: "gpu_0/roi_feat_fpn5" output: "gpu_0/roi_feat_shuffled" output: "gpu_0/_concat_roi_feat" name: "" type: "Concat" arg { name: "axis" i: 0 } device_option { device_type: 1 cuda_gpu_id: 0 }
*** Aborted at 1516700071 (unix time) try "date -d @1516700071" if you are using GNU date ***
PC: @     0x7faf48c0c428 gsignal
*** SIGABRT (@0x3e800001c16) received by PID 7190 (TID 0x7fae72ffd700) from PID 7190; stack trace: ***
    @     0x7faf48fb2390 (unknown)
    @     0x7faf48c0c428 gsignal
    @     0x7faf48c0e02a abort
    @     0x7faf45bf484d __gnu_cxx::__verbose_terminate_handler()
    @     0x7faf45bf26b6 (unknown)
    @     0x7faf45bf2701 std::terminate()
    @     0x7faf45c1dd38 (unknown)
    @     0x7faf48fa86ba start_thread
    @     0x7faf48cde3dd clone
    @                0x0 (unknown)
Aborted (core dumped)
Traceback (most recent call last):
  File "./tools/test_net.py", line 168, in <module>
    main(ind_range=args.range, multi_gpu_testing=args.multi_gpu_testing)
  File "./tools/test_net.py", line 133, in main
    results = parent_func(multi_gpu=multi_gpu_testing)
  File "/home/user/default/Detectron/lib/core/test_engine.py", line 59, in test_net_on_dataset
    num_images, output_dir
  File "/home/user/default/Detectron/lib/core/test_engine.py", line 82, in multi_gpu_test_net_on_dataset
    'detection', num_images, binary, output_dir
  File "/home/user/default/Detectron/lib/utils/subprocess.py", line 83, in process_in_parallel
    log_subprocess_output(i, p, output_dir, tag, start, end)
  File "/home/user/default/Detectron/lib/utils/subprocess.py", line 121, in log_subprocess_output
    assert ret == 0, 'Range subprocess failed (exit code: {})'.format(ret)
AssertionError: Range subprocess failed (exit code: 134)
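
For context, exit code 134 is 128 + SIGABRT (signal 6), i.e. the per-range worker was killed by the abort triggered by the CUDA out-of-memory error above. A minimal sketch of the kind of check that raises this assertion, assuming the pattern visible in lib/utils/subprocess.py rather than the exact Detectron code:

    import subprocess

    def run_range_worker(cmd):
        # The command is run through a shell, so a worker killed by SIGABRT
        # makes the shell exit with 128 + 6 = 134.
        p = subprocess.Popen(cmd, shell=True)
        ret = p.wait()
        # Any non-zero status fails the whole multi-GPU test run.
        assert ret == 0, 'Range subprocess failed (exit code: {})'.format(ret)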

Are there any settings that we are missing? Thank you!
