This repository was archived by the owner on Nov 21, 2023. It is now read-only.

Error at: caffe2/core/context_gpu.cu:343: out of memory #5

Closed
@ghost

Description

Hi, thanks for the great work!

We ran into an out-of-memory issue while running test_net.py on the COCO dataset with a 2x TITAN X setup. The installation went fine and the COCO dataset was placed in /lib/datasets/data/coco. The following line was added to test_net.py so that the 3rd and 4th GPUs are used:
os.environ['CUDA_VISIBLE_DEVICES'] = "2,3"
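
A minimal sketch of where that assignment would sit, assuming it is placed near the top of tools/test_net.py (it has to run before Caffe2 creates its CUDA context, otherwise the default GPUs may still be selected):

    import os

    # Assumed placement: restrict visible devices to GPUs 2 and 3 before any
    # Caffe2/CUDA initialization; subprocesses launched later inherit this.
    os.environ['CUDA_VISIBLE_DEVICES'] = "2,3"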

We then ran the following command:
./tools/test_net.py --cfg configs/12_2017_baselines/e2e_mask_rcnn_R-101-FPN_2x.yaml --multi-gpu-testing TEST.WEIGHTS https://s3-us-west-2.amazonaws.com/detectron/35861858/12_2017_baselines/e2e_mask_rcnn_R-101-FPN_2x.yaml.02_32_51.SgT4y1cO/output/train/coco_2014_train:coco_2014_valminusminival/generalized_rcnn/model_final.pkl NUM_GPUS 2

and encountered the following error:

terminate called after throwing an instance of 'caffe2::EnforceNotMet'
  what():  [enforce fail at context_gpu.cu:343] error == cudaSuccess. 2 vs 0. Error at: /home/user/default/caffe2/caffe2/core/context_gpu.cu:343: out of memory Error from operator:
input: "gpu_0/roi_feat_fpn2" input: "gpu_0/roi_feat_fpn3" input: "gpu_0/roi_feat_fpn4" input: "gpu_0/roi_feat_fpn5" output: "gpu_0/roi_feat_shuffled" output: "gpu_0/_concat_roi_feat" name: "" type: "Concat" arg { name: "axis" i: 0 } device_option { device_type: 1 cuda_gpu_id: 0 }
*** Aborted at 1516700071 (unix time) try "date -d @1516700071" if you are using GNU date ***
PC: @     0x7faf48c0c428 gsignal
*** SIGABRT (@0x3e800001c16) received by PID 7190 (TID 0x7fae72ffd700) from PID 7190; stack trace: ***
    @     0x7faf48fb2390 (unknown)
    @     0x7faf48c0c428 gsignal
    @     0x7faf48c0e02a abort
    @     0x7faf45bf484d __gnu_cxx::__verbose_terminate_handler()
    @     0x7faf45bf26b6 (unknown)
    @     0x7faf45bf2701 std::terminate()
    @     0x7faf45c1dd38 (unknown)
    @     0x7faf48fa86ba start_thread
    @     0x7faf48cde3dd clone
    @                0x0 (unknown)
Aborted (core dumped)
Traceback (most recent call last):
  File "./tools/test_net.py", line 168, in <module>
    main(ind_range=args.range, multi_gpu_testing=args.multi_gpu_testing)
  File "./tools/test_net.py", line 133, in main
    results = parent_func(multi_gpu=multi_gpu_testing)
  File "/home/user/default/Detectron/lib/core/test_engine.py", line 59, in test_net_on_dataset
    num_images, output_dir
  File "/home/user/default/Detectron/lib/core/test_engine.py", line 82, in multi_gpu_test_net_on_dataset
    'detection', num_images, binary, output_dir
  File "/home/user/default/Detectron/lib/utils/subprocess.py", line 83, in process_in_parallel
    log_subprocess_output(i, p, output_dir, tag, start, end)
  File "/home/user/default/Detectron/lib/utils/subprocess.py", line 121, in log_subprocess_output
    assert ret == 0, 'Range subprocess failed (exit code: {})'.format(ret)
AssertionError: Range subprocess failed (exit code: 134)
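
For context, exit code 134 is 128 + SIGABRT (signal 6), i.e. the per-range worker was killed by the abort triggered by the CUDA out-of-memory error above. A minimal sketch of the kind of check that raises this assertion, assuming the pattern visible in lib/utils/subprocess.py rather than the exact Detectron code:

    import subprocess

    def run_range_worker(cmd):
        # The command is run through a shell, so a worker killed by SIGABRT
        # makes the shell exit with 128 + 6 = 134.
        p = subprocess.Popen(cmd, shell=True)
        ret = p.wait()
        # Any non-zero status fails the whole multi-GPU test run.
        assert ret == 0, 'Range subprocess failed (exit code: {})'.format(ret)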

Are there any settings that we are missing? Thank you!
