This repository has been archived by the owner on Mar 21, 2024. It is now read-only.

parallel_for's throw_on_error results in terminate #1448

Closed
evelkey opened this issue May 31, 2021 · 3 comments

Comments

@evelkey

evelkey commented May 31, 2021

We're using Thrust with Torch 1.7.1 and MinkowskiEngine 0.5.4 and are experiencing a deterministic issue that makes the library unusable for long-running processes.

When we run parallel_for on large arrays, several memory allocations take place; if one of them hits an OOM error, the process simply terminates instead of raising an exception that could be handled. This is the result of an exception being thrown from a noexcept function.

Environment:

  • Ubuntu 18.04
  • CUDA 10.2
  • Driver Version: 460.73.01

Code to reproduce:

#include <algorithm>
#include <chrono>
#include <iostream>
#include <new>
#include <numeric>
#include <random>
#include <vector>

#include <cuda_runtime.h>
#include <thrust/copy.h>
#include <thrust/device_vector.h>
#include <thrust/execution_policy.h>
#include <thrust/host_vector.h>
#include <thrust/iterator/constant_iterator.h>
#include <thrust/iterator/counting_iterator.h>
#include <thrust/iterator/zip_iterator.h>
#include <thrust/sequence.h>
#include <thrust/sort.h>
#include <thrust/tuple.h>

int main() {
  std::random_device rd;
  std::mt19937 engine;
  engine.seed(rd());
  std::uniform_real_distribution<float> u(0, 90.);

  // Parameter for Tesla V100 16GB VRAM (Use 250M for 8GB VRAM):
  int N = 550000000;

  std::vector<float> v(N);
  std::generate(v.begin(), v.end(), [&]() { return u(engine); });
  thrust::host_vector<float> hv(v.begin(), v.end());
  thrust::device_vector<float> dv = hv;

  thrust::device_vector<float> res(dv.begin(), dv.end());

  thrust::device_vector<int> index(N);
  thrust::sequence(thrust::device, index.begin(), index.end(), 0, 1);

  while (1) {
    try {
      std::cout << "step" << std::endl;
      thrust::sort_by_key(thrust::device,            //
                          dv.begin(),                // key begin
                          dv.end(),                  // key end
                          thrust::make_zip_iterator( // value begin
                              thrust::make_tuple(    //
                                  dv.begin(),        //
                                  index.begin()      //
                                  )));
  } catch (const std::bad_alloc&) {
      std::cout << "bad_alloc" << std::endl;
    } catch (...) {
      std::cout << "other error" << std::endl;
    }
    // On CUDA 10.2 neither handler runs: the exception cannot be caught
    // and the process terminates instead.
  }

  cudaDeviceSynchronize();
  return 0;
}

Traceback:

#0  0x00007f07d8108fb7 in raise () from /lib/x86_64-linux-gnu/libc.so.6
#1  0x00007f07d810a921 in abort () from /lib/x86_64-linux-gnu/libc.so.6
#2  0x00007f07d8afd957 in ?? () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
#3  0x00007f07d8b03ae6 in std::rethrow_exception(std::__exception_ptr::exception_ptr) () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
#4  0x00007f07d8b02b49 in __cxa_throw_bad_array_new_length () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
#5  0x00007f07d8b034b8 in __gxx_personality_v0 () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
#6  0x00007f07d84cb573 in _Unwind_GetTextRelBase () from /lib/x86_64-linux-gnu/libgcc_s.so.1
#7  0x00007f07d84cbad1 in _Unwind_RaiseException () from /lib/x86_64-linux-gnu/libgcc_s.so.1
#8  0x00007f07d8b03d47 in __cxa_throw () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
#9  0x000056485a33fb4d in thrust::cuda_cub::throw_on_error(cudaError, char const*) ()
#10 0x000056485a35c134 in void thrust::cuda_cub::parallel_for<thrust::cuda_cub::par_t, thrust::cuda_cub::for_each_f<thrust::pointer<thrust::tuple<float, int, thrust::null_type, thrust::null_type, thrust::null_type, thrust::null_type, thrust::null_type, thrust::null_type, thrust::null_type, thrust::null_type>, thrust::cuda_cub::par_t, thrust::use_default, thrust::use_default>, thrust::detail::wrapped_function<thrust::detail::allocator_traits_detail::gozer, void> >, long>(thrust::cuda_cub::execution_policy<thrust::cuda_cub::par_t>&, thrust::cuda_cub::for_each_f<thrust::pointer<thrust::tuple<float, int, thrust::null_type, thrust::null_type, thrust::null_type, thrust::null_type, thrust::null_type, thrust::null_type, thrust::null_type, thrust::null_type>, thrust::cuda_cub::par_t, thrust::use_default, thrust::use_default>, thrust::detail::wrapped_function<thrust::detail::allocator_traits_detail::gozer, void> >, long) ()

I found that the issue is not present on CUDA 11.1, so we started migrating the codebase over, but this might be useful for someone else who encounters the issue.

@alliepiper
Collaborator

Just to be clear: this was fixed between 10.2 and 11.1 and doesn't happen in newer versions?

@evelkey
Author

evelkey commented Jun 1, 2021

Yes, it's fixed in CUDA >= 11.0.221 (tested with 11.1 as well), but we needed to build a custom PyTorch to fix some of the known issues similar to this: isl-org/Open3D#3324 and #1401. We solved the issue by building PyTorch with -Xcompiler=-fno-gnu-unique.
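For reference, in an nvcc-based build the flag is forwarded to the host compiler like this (the file names here are placeholders, not from our actual build):

```shell
# -Xcompiler passes the flag through nvcc to the host compiler (gcc/g++);
# -fno-gnu-unique avoids STB_GNU_UNIQUE symbols, which can pin shared
# libraries in memory across dlclose and cause issues like the ones linked above.
nvcc -Xcompiler=-fno-gnu-unique -o repro repro.cu
```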

@alliepiper
Collaborator

Sounds good. I'll close this since it's no longer an active issue.
