
Hot fix preventing dead lock during shutdown #179

Merged: 1 commit merged into NVIDIA:caffe-0.15 from caffe-0.15-multigpu-ws-fix on Jun 25, 2016

Conversation

drnikolaev

No description provided.

@@ -37,7 +37,7 @@ void GPUMemory::Manager::init(const vector<int>& gpus, Mode m, bool debug) {
   // Just in case someone installed 'no cleanup' arena before
   delete cub_allocator_;
   cub_allocator_ = new cub::CachingDeviceAllocator(BIN_GROWTH, MIN_BIN,
-      MAX_BIN, MAX_CACHED_BYTES, false, debug_);
+      MAX_BIN, MAX_CACHED_BYTES, true, debug_);
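
For context, the argument flipped from false to true here is the fifth parameter of the cub::CachingDeviceAllocator constructor, skip_cleanup, which tells the allocator not to call FreeAllCached() from its destructor. A minimal sketch of the same call, assuming CUB's cub/util_allocator.cuh API; the numeric values are arbitrary placeholders, not Caffe's BIN_GROWTH/MIN_BIN/MAX_BIN/MAX_CACHED_BYTES constants:

```cpp
#include <cub/util_allocator.cuh>

int main() {
  // Constructor parameters, in order: bin_growth, min_bin, max_bin,
  // max_cached_bytes, skip_cleanup, debug. With skip_cleanup == true the
  // destructor skips FreeAllCached(), so it never touches the CUDA runtime
  // during static/process teardown.
  cub::CachingDeviceAllocator allocator(
      /* bin_growth       */ 2,
      /* min_bin          */ 6,
      /* max_bin          */ 22,
      /* max_cached_bytes */ size_t(1) << 30,
      /* skip_cleanup     */ true,
      /* debug            */ false);
  (void)allocator;  // a real caller would use DeviceAllocate()/DeviceFree()
  return 0;
}
```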


'true' means 'skip cleanup'. I thought the deallocate() fix was precisely so that we could keep not skipping cleanup?

@borisfom borisfom Jun 24, 2016

On a second note, since the only instance of mgr is global, having cleanup is useless and could only add latency and more opportunities for deadlocks. So the change is good, let's keep cleanup off.

@drnikolaev drnikolaev (Author) Jun 24, 2016

That was the problem. We do clean what we can. The rest is cleaned by the driver. Scope doesn't help either. All this is caused by different lifetimes for objects working with different GPUs. Look here: https://github.com/NVIDIA/caffe/blob/caffe-0.15/src/caffe/parallel.cpp#L316-L319 This loop runs from 1, not from 0. We treat GPU 0 differently. I don't claim it's wrong. But it makes memory management more interesting. :)
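
A simplified, hypothetical sketch of the shutdown-ordering pattern described above (the Worker type and the printed messages are illustrative; this is not the actual code behind the linked parallel.cpp loop). Workers are created only for GPUs 1..N-1, so the objects holding memory on different GPUs are destroyed at different times:

```cpp
#include <cstdio>
#include <memory>
#include <vector>

// Hypothetical stand-in for a per-GPU worker that owns device memory.
struct Worker {
  explicit Worker(int gpu) : gpu_(gpu) { std::printf("GPU %d worker created\n", gpu_); }
  ~Worker() { std::printf("GPU %d worker returning memory\n", gpu_); }
  int gpu_;
};

int main() {
  std::vector<int> gpus = {0, 1, 2, 3};
  std::vector<std::unique_ptr<Worker>> workers;
  // Mirrors the pattern in the linked loop: it starts at 1, so GPU 0 gets no
  // worker; it is handled by the root solver, which has a different lifetime.
  for (size_t i = 1; i < gpus.size(); ++i) {
    workers.emplace_back(new Worker(gpus[i]));
  }
  // Workers for GPUs 1..3 are destroyed here; whatever owns GPU 0 is torn down
  // elsewhere, possibly after the CUDA runtime has already begun unloading,
  // which is why cleanup in a static allocator's destructor can deadlock.
  return 0;
}
```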


Right, so not having cleanup in the allocator destructor is fine. And yes, we'd still run into the timing issues with other objects being destructed on exit and trying to return GPU memory.
We can in fact take it a bit further: the first time you get this 'cudart being unloaded' error, set a flag in mgr so that we don't even try to issue any CUDA calls from then on.
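
A minimal sketch of that idea, under the assumption of a guard around the manager's deallocation path (SafeFree and the cuda_unloading flag are illustrative names, not part of this change): once cudaFree reports cudaErrorCudartUnloading, remember it and stop issuing CUDA calls.

```cpp
#include <atomic>
#include <cuda_runtime.h>

// Hypothetical guard; illustrative names, not the actual GPUMemory::Manager code.
static std::atomic<bool> cuda_unloading{false};

// Free a device pointer unless the CUDA runtime is already being unloaded.
inline void SafeFree(void* ptr) {
  if (ptr == nullptr || cuda_unloading.load(std::memory_order_relaxed)) {
    return;  // runtime is going away; let the driver reclaim the memory
  }
  cudaError_t err = cudaFree(ptr);
  if (err == cudaErrorCudartUnloading) {
    // First sighting of the "runtime shutting down" error: remember it and
    // never touch the CUDA runtime again during process teardown.
    cuda_unloading.store(true, std::memory_order_relaxed);
  }
}
```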

@drnikolaev drnikolaev (Author)

Guess what? I tried that too and found that this is the only place where we need this check. I mean line 124 below. So, we are covered now.

@drnikolaev drnikolaev merged commit a7c9144 into NVIDIA:caffe-0.15 Jun 25, 2016
@drnikolaev drnikolaev deleted the caffe-0.15-multigpu-ws-fix branch September 22, 2016 04:37