This repository has been archived by the owner on Sep 16, 2024. It is now read-only.

Resource exhausted: OOM when allocating tensor with shape[144,12,20,2048] #179

Open
coderclear opened this issue Apr 26, 2018 · 1 comment

Comments


coderclear commented Apr 26, 2018

I can only train with batch_size=1; if batch_size>1, an error occurs.
With batch_size=1 and all other parameters at their defaults, the loss does not go down. Below is the loss log:
step 19960 loss = 1.322, (0.337 sec/step)
step 19961 loss = 1.341, (0.336 sec/step)
step 19962 loss = 1.302, (0.336 sec/step)
step 19963 loss = 1.324, (0.337 sec/step)
step 19964 loss = 1.317, (0.335 sec/step)
step 19965 loss = 1.298, (0.337 sec/step)
step 19966 loss = 1.319, (0.336 sec/step)
step 19967 loss = 1.304, (0.335 sec/step)
step 19968 loss = 1.294, (0.336 sec/step)
step 19969 loss = 1.305, (0.336 sec/step)
step 19970 loss = 1.347, (0.335 sec/step)
step 19971 loss = 1.314, (0.337 sec/step)
step 19972 loss = 1.304, (0.337 sec/step)
step 19973 loss = 1.310, (0.336 sec/step)
step 19974 loss = 1.301, (0.336 sec/step)
step 19975 loss = 1.301, (0.337 sec/step)
step 19976 loss = 1.387, (0.336 sec/step)
step 19977 loss = 1.320, (0.335 sec/step)
step 19978 loss = 1.305, (0.336 sec/step)
step 19979 loss = 1.309, (0.336 sec/step)
step 19980 loss = 1.302, (0.336 sec/step)
step 19981 loss = 1.304, (0.335 sec/step)
step 19982 loss = 1.325, (0.337 sec/step)
step 19983 loss = 1.321, (0.336 sec/step)
step 19984 loss = 1.316, (0.336 sec/step)
step 19985 loss = 1.332, (0.337 sec/step)
step 19986 loss = 1.299, (0.336 sec/step)
step 19987 loss = 1.312, (0.336 sec/step)
step 19988 loss = 1.290, (0.335 sec/step)
step 19989 loss = 1.323, (0.337 sec/step)
step 19990 loss = 1.318, (0.336 sec/step)
step 19991 loss = 1.307, (0.336 sec/step)
step 19992 loss = 1.364, (0.336 sec/step)
step 19993 loss = 1.324, (0.335 sec/step)
step 19994 loss = 1.314, (0.335 sec/step)
step 19995 loss = 1.301, (0.336 sec/step)
step 19996 loss = 1.291, (0.336 sec/step)
step 19997 loss = 1.317, (0.338 sec/step)
step 19998 loss = 1.322, (0.337 sec/step)
step 19999 loss = 1.293, (0.335 sec/step)
The checkpoint has been created.
step 20000 loss = 1.320, (11.272 sec/step)
With batch_size>1, the following error occurs:

2018-04-26 13:43:56.846459: W C:\tf_jenkins\home\workspace\rel-win\M\windows-gpu\PY\35\tensorflow\core\common_runtime\bfc_allocator.cc:277] *************************************************************************************************___
2018-04-26 13:43:56.849249: W C:\tf_jenkins\home\workspace\rel-win\M\windows-gpu\PY\35\tensorflow\core\framework\op_kernel.cc:1192] Resource exhausted: OOM when allocating tensor with shape[144,12,20,2048]
Traceback (most recent call last):
File "F:\soft\anaconda\lib\site-packages\tensorflow\python\client\session.py", line 1327, in _do_call
return fn(*args)
File "F:\soft\anaconda\lib\site-packages\tensorflow\python\client\session.py", line 1306, in _run_fn
status, run_metadata)
File "F:\soft\anaconda\lib\contextlib.py", line 66, in exit
next(self.gen)
File "F:\soft\anaconda\lib\site-packages\tensorflow\python\framework\errors_impl.py", line 466, in raise_exception_on_not_ok_status
pywrap_tensorflow.TF_GetCode(status))
tensorflow.python.framework.errors_impl.ResourceExhaustedError: OOM when allocating tensor with shape[2304,5,7,2048]
[[Node: fc1_voc12_c3/convolution/SpaceToBatchND = SpaceToBatchND[T=DT_FLOAT, Tblock_shape=DT_INT32, Tpaddings=DT_INT32, _device="/job:localhost/replica:0/task:0/gpu:0"](res5c_relu, fc1_voc12_c3/convolution/SpaceToBatchND/block_shape, fc1_voc12_c3/convolution/SpaceToBatchND/paddings)]]
[[Node: ExpandDims/_1095 = _Recv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/cpu:0", send_device="/job:localhost/replica:0/task:0/gpu:0", send_device_incarnation=1, tensor_name="edge_2405_ExpandDims", tensor_type=DT_INT64, _device="/job:localhost/replica:0/task:0/cpu:0"]()]]

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "train.py", line 272, in
main()
File "train.py", line 252, in main
loss_value, images, labels, preds, summary, _ = sess.run([reduced_loss, image_batch, label_batch, pred, total_summary, train_op], feed_dict=feed_dict)
File "F:\soft\anaconda\lib\site-packages\tensorflow\python\client\session.py", line 895, in run
run_metadata_ptr)
File "F:\soft\anaconda\lib\site-packages\tensorflow\python\client\session.py", line 1124, in _run
feed_dict_tensor, options, run_metadata)
File "F:\soft\anaconda\lib\site-packages\tensorflow\python\client\session.py", line 1321, in _do_run
options, run_metadata)
File "F:\soft\anaconda\lib\site-packages\tensorflow\python\client\session.py", line 1340, in _do_call
raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.ResourceExhaustedError: OOM when allocating tensor with shape[2304,5,7,2048]
[[Node: fc1_voc12_c3/convolution/SpaceToBatchND = SpaceToBatchND[T=DT_FLOAT, Tblock_shape=DT_INT32, Tpaddings=DT_INT32, _device="/job:localhost/replica:0/task:0/gpu:0"](res5c_relu, fc1_voc12_c3/convolution/SpaceToBatchND/block_shape, fc1_voc12_c3/convolution/SpaceToBatchND/paddings)]]
[[Node: ExpandDims/_1095 = _Recv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/cpu:0", send_device="/job:localhost/replica:0/task:0/gpu:0", send_device_incarnation=1, tensor_name="edge_2405_ExpandDims", tensor_type=DT_INT64, _device="/job:localhost/replica:0/task:0/cpu:0"]()]]

Caused by op 'fc1_voc12_c3/convolution/SpaceToBatchND', defined at:
File "train.py", line 272, in
main()
File "train.py", line 146, in main
net = DeepLabResNetModel({'data': image_batch}, is_training=args.is_training, num_classes=args.num_classes)
File "D:\UBUNTU\github\tensorflow-deeplab-resnet\kaffe\tensorflow\network.py", line 48, in init
self.setup(is_training, num_classes)
File "D:\UBUNTU\github\tensorflow-deeplab-resnet\deeplab_resnet\model.py", line 411, in setup
.atrous_conv(3, 3, num_classes, 24, padding='SAME', relu=False, name='fc1_voc12_c3'))
File "D:\UBUNTU\github\tensorflow-deeplab-resnet\kaffe\tensorflow\network.py", line 22, in layer_decorated
layer_output = op(self, layer_input, *args, **kwargs)
File "D:\UBUNTU\github\tensorflow-deeplab-resnet\kaffe\tensorflow\network.py", line 173, in atrous_conv
output = convolve(input, kernel)
File "D:\UBUNTU\github\tensorflow-deeplab-resnet\kaffe\tensorflow\network.py", line 168, in
convolve = lambda i, k: tf.nn.atrous_conv2d(i, k, dilation, padding=padding)
File "F:\soft\anaconda\lib\site-packages\tensorflow\python\ops\nn_ops.py", line 974, in atrous_conv2d
name=name)
File "F:\soft\anaconda\lib\site-packages\tensorflow\python\ops\nn_ops.py", line 672, in convolution
op=op)
File "F:\soft\anaconda\lib\site-packages\tensorflow\python\ops\nn_ops.py", line 451, in with_space_to_batch
paddings=paddings)
File "F:\soft\anaconda\lib\site-packages\tensorflow\python\ops\gen_array_ops.py", line 3359, in space_to_batch_nd
name=name)
File "F:\soft\anaconda\lib\site-packages\tensorflow\python\framework\op_def_library.py", line 767, in apply_op
op_def=op_def)
File "F:\soft\anaconda\lib\site-packages\tensorflow\python\framework\ops.py", line 2628, in create_op
original_op=self._default_original_op, op_def=op_def)
File "F:\soft\anaconda\lib\site-packages\tensorflow\python\framework\ops.py", line 1204, in init
self._traceback = self._graph._extract_stack() # pylint: disable=protected-access

ResourceExhaustedError (see above for traceback): OOM when allocating tensor with shape[2304,5,7,2048]
[[Node: fc1_voc12_c3/convolution/SpaceToBatchND = SpaceToBatchND[T=DT_FLOAT, Tblock_shape=DT_INT32, Tpaddings=DT_INT32, _device="/job:localhost/replica:0/task:0/gpu:0"](res5c_relu, fc1_voc12_c3/convolution/SpaceToBatchND/block_shape, fc1_voc12_c3/convolution/SpaceToBatchND/paddings)]]
[[Node: ExpandDims/_1095 = _Recv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/cpu:0", send_device="/job:localhost/replica:0/task:0/gpu:0", send_device_incarnation=1, tensor_name="edge_2405_ExpandDims", tensor_type=DT_INT64, _device="/job:localhost/replica:0/task:0/cpu:0"]()]]
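For reference, the intermediate tensor in the trace, shape [2304,5,7,2048] in float32, already occupies about 2304·5·7·2048·4 bytes ≈ 0.62 GiB, so increasing the batch size quickly exhausts GPU memory. A minimal sketch of one common mitigation follows (not from this thread; how train.py actually builds its tf.Session is an assumption here): let the TF 1.x allocator grow on demand instead of grabbing all GPU memory up front.

```python
# Hedged sketch: TF 1.x session configuration that allocates GPU memory
# incrementally. This does not add memory, but it avoids a single large
# up-front allocation and makes the real usage visible.
import tensorflow as tf

config = tf.ConfigProto()
config.gpu_options.allow_growth = True  # grow GPU allocations as needed

with tf.Session(config=config) as sess:
    # ... build and train the DeepLab-ResNet graph here, as train.py does ...
    pass
```

If that is not enough, the remaining options on a fixed GPU are a smaller batch size or smaller input crops, since the activation above scales linearly with both.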


ghost commented May 12, 2018

This could be due to your GPU's capabilities (insufficient memory).
Try decreasing your learning rate by a factor of 10 to bring the loss down.
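A hedged sketch of what that change might look like (the optimizer setup below is an assumption modeled on typical DeepLab-ResNet training scripts, not this repository's exact code; check how train.py defines its learning-rate argument):

```python
# Hedged sketch: lowering the base learning rate by 10x for a TF 1.x
# momentum optimizer, as typically used in DeepLab-style training.
import tensorflow as tf

base_lr = 2.5e-4                      # verify the repo's actual default
optimizer = tf.train.MomentumOptimizer(learning_rate=base_lr / 10.0,
                                       momentum=0.9)
# train_op = optimizer.minimize(reduced_loss)  # wire into the existing loss
```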
