
Hot fix preventing dead lock during shutdown #179

Merged: 1 commit merged into NVIDIA:caffe-0.15 from caffe-0.15-multigpu-ws-fix on Jun 25, 2016

Conversation

drnikolaev

No description provided.

@@ -37,7 +37,7 @@ void GPUMemory::Manager::init(const vector<int>& gpus, Mode m, bool debug) {
   // Just in case someone installed 'no cleanup' arena before
   delete cub_allocator_;
   cub_allocator_ = new cub::CachingDeviceAllocator(BIN_GROWTH, MIN_BIN,
-      MAX_BIN, MAX_CACHED_BYTES, false, debug_);
+      MAX_BIN, MAX_CACHED_BYTES, true, debug_);
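
For context, the argument flipped from false to true here is the fifth parameter of the cub::CachingDeviceAllocator constructor, skip_cleanup, which tells the allocator not to call FreeAllCached() from its destructor. A minimal sketch of the same call, assuming CUB's cub/util_allocator.cuh API; the numeric values are arbitrary placeholders, not Caffe's BIN_GROWTH/MIN_BIN/MAX_BIN/MAX_CACHED_BYTES constants:

```cpp
#include <cub/util_allocator.cuh>

int main() {
  // Constructor parameters, in order: bin_growth, min_bin, max_bin,
  // max_cached_bytes, skip_cleanup, debug. With skip_cleanup == true the
  // destructor skips FreeAllCached(), so it never touches the CUDA runtime
  // during static/process teardown.
  cub::CachingDeviceAllocator allocator(
      /* bin_growth       */ 2,
      /* min_bin          */ 6,
      /* max_bin          */ 22,
      /* max_cached_bytes */ size_t(1) << 30,
      /* skip_cleanup     */ true,
      /* debug            */ false);
  (void)allocator;  // a real caller would use DeviceAllocate()/DeviceFree()
  return 0;
}
```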


'true' means 'skip cleanup'. I thought the deallocate() fix was precisely so that we could keep not skipping cleanup?

@borisfom borisfom Jun 24, 2016

On a second note, since the only instance of mgr is global, having cleanup is useless and could only add latency and more opportunities for deadlocks. So the change is good, let's keep cleanup off.

@drnikolaev drnikolaev (Author) Jun 24, 2016

That was the problem. We do clean what we can. The rest is cleaned by the driver. Scope doesn't help either. All this is caused by different lifetimes for objects working with different GPUs. Look here: https://github.com/NVIDIA/caffe/blob/caffe-0.15/src/caffe/parallel.cpp#L316-L319 This loop runs from 1, not from 0. We treat GPU 0 differently. I don't claim it's wrong. But it makes memory management more interesting. :)
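
A simplified, hypothetical sketch of the shutdown-ordering pattern described above (the Worker type and the printed messages are illustrative; this is not the actual code behind the linked parallel.cpp loop). Workers are created only for GPUs 1..N-1, so the objects holding memory on different GPUs are destroyed at different times:

```cpp
#include <cstdio>
#include <memory>
#include <vector>

// Hypothetical stand-in for a per-GPU worker that owns device memory.
struct Worker {
  explicit Worker(int gpu) : gpu_(gpu) { std::printf("GPU %d worker created\n", gpu_); }
  ~Worker() { std::printf("GPU %d worker returning memory\n", gpu_); }
  int gpu_;
};

int main() {
  std::vector<int> gpus = {0, 1, 2, 3};
  std::vector<std::unique_ptr<Worker>> workers;
  // Mirrors the pattern in the linked loop: it starts at 1, so GPU 0 gets no
  // worker; it is handled by the root solver, which has a different lifetime.
  for (size_t i = 1; i < gpus.size(); ++i) {
    workers.emplace_back(new Worker(gpus[i]));
  }
  // Workers for GPUs 1..3 are destroyed here; whatever owns GPU 0 is torn down
  // elsewhere, possibly after the CUDA runtime has already begun unloading,
  // which is why cleanup in a static allocator's destructor can deadlock.
  return 0;
}
```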


Right, so not having cleanup in the allocator destructor is fine. And yes, we'd still run into the timing issues with other objects being destructed on exit and trying to return GPU memory.
We can in fact take it a bit further: the first time you get this 'cudart being unloaded' error, set a flag in mgr so that we don't even try to issue any CUDA calls from then on.
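
A minimal sketch of that idea, under the assumption of a guard around the manager's deallocation path (SafeFree and the cuda_unloading flag are illustrative names, not part of this change): once cudaFree reports cudaErrorCudartUnloading, remember it and stop issuing CUDA calls.

```cpp
#include <atomic>
#include <cuda_runtime.h>

// Hypothetical guard; illustrative names, not the actual GPUMemory::Manager code.
static std::atomic<bool> cuda_unloading{false};

// Free a device pointer unless the CUDA runtime is already being unloaded.
inline void SafeFree(void* ptr) {
  if (ptr == nullptr || cuda_unloading.load(std::memory_order_relaxed)) {
    return;  // runtime is going away; let the driver reclaim the memory
  }
  cudaError_t err = cudaFree(ptr);
  if (err == cudaErrorCudartUnloading) {
    // First sighting of the "runtime shutting down" error: remember it and
    // never touch the CUDA runtime again during process teardown.
    cuda_unloading.store(true, std::memory_order_relaxed);
  }
}
```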

@drnikolaev drnikolaev (Author)

Guess what? I tried that too and found that this is the only place where we need this check. I mean line 124 below. So, we are covered now.

@drnikolaev drnikolaev merged commit a7c9144 into NVIDIA:caffe-0.15 Jun 25, 2016
@drnikolaev drnikolaev deleted the caffe-0.15-multigpu-ws-fix branch September 22, 2016 04:37