
Serialize GPU tree building with GPU lock. #1335

Merged (1 commit) Nov 6, 2020

Conversation

porcuquine
Collaborator

@porcuquine commented Oct 31, 2020

Don't try to build more than one tree on the GPU at a time. There is no benefit to doing so, since tree building already uses the GPU's compute as efficiently as it can, and it causes problems: the GPU tree builder was not designed to be used this way.

This assumes only a single GPU is available for building trees, but that assumption was already present. We can overcome that limitation with multi-GPU scheduling, which is coming.
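
For clarity, here is a minimal, self-contained sketch of the locking pattern this describes, using only the standard library. The names (`GPU_LOCK`, `build_tree`) are illustrative rather than the actual code in this repository, and the CPU fold merely stands in for GPU hashing so the example runs anywhere:

```rust
use std::sync::Mutex;

// One lock per process: only one tree build may touch the GPU at a time.
static GPU_LOCK: Mutex<()> = Mutex::new(());

// Stand-in for the real GPU tree builder; this sketch just folds the leaves
// on the CPU so it compiles and runs without a GPU.
fn build_tree(leaves: &[u64]) -> u64 {
    // Hold the lock for the full duration of the (simulated) GPU work.
    let _guard = GPU_LOCK.lock().expect("GPU lock poisoned");
    leaves
        .iter()
        .fold(0u64, |acc, l| acc.wrapping_mul(31).wrapping_add(*l))
}

fn main() {
    // Several jobs racing for the GPU; the lock serializes them.
    let handles: Vec<_> = (0..4u64)
        .map(|i| std::thread::spawn(move || build_tree(&[i, i + 1, i + 2])))
        .collect();
    for h in handles {
        println!("root = {}", h.join().unwrap());
    }
}
```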

@porcuquine mentioned this pull request Oct 31, 2020
@porcuquine marked this pull request as ready for review October 31, 2020 07:23
@karalabe

I don't think that assumption is right. I've seen plenty of times that my GPU sits idle while running PC2. By allowing more than one concurrent PC2, I can fill that gap.

@porcuquine
Collaborator Author

Okay, help me understand the nature of the idleness you see. It's possible that serializing at the tree level is too coarse. The underlying assumption of avoiding simultaneous hashing calls to the same GPU is sound.

For context, PC2 involves building two trees: the more expensive 'column tree', which yields comm_c, and the cheaper simple tree, which yields comm_r_last. In either case there are thousands of calls to the GPU, each batching a configurable number of hashes (on the order of hundreds of thousands).

I think the current code interleaves writing results to disk with the GPU hashing. I'd like to understand whether that is what you are observing, or whether you are seeing larger, macro blocks of time with no activity on the GPU.

What I'm describing would look like very rapid alternation between high and low GPU utilization — so each process would have 'cracks' which another might fill. I'm guessing this is what you describe.

If you're seeing something else, we should try to understand that. For now, I will assume this is the issue. In that case, we can either make the lock granularity much finer — or we can move the disk-writing out of the hot path. I think the latter would be more economical, and it will still let us stick to one-tree-at-a-time on the GPU (which will make macro scheduling simpler).

We already read the input from disk in a separate thread, so a similar approach should be possible for the writing. Please let me know whether my descriptions match your observations, or if you are seeing something else. If you are seeing long periods of complete idleness during PC2, then that may be indicative of some other problem best addressed on its own.
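
To make the writer-thread idea concrete, here is a hedged sketch (not the project's code) of moving disk writes off the hashing path: finished batches go over a channel to a dedicated writer thread, so the next GPU batch can start while the previous one is persisted. The file name and batch representation are purely illustrative:

```rust
use std::fs::File;
use std::io::{BufWriter, Write};
use std::sync::mpsc;
use std::thread;

fn main() -> std::io::Result<()> {
    let (tx, rx) = mpsc::channel::<Vec<u8>>();

    // Writer thread: drains finished batches and appends them to disk.
    let writer = thread::spawn(move || -> std::io::Result<()> {
        let mut out = BufWriter::new(File::create("tree_r_last.dat")?);
        for batch in rx {
            out.write_all(&batch)?;
        }
        out.flush()
    });

    // The hashing loop hands each finished batch to the writer and
    // immediately starts the next one instead of blocking on the write.
    for i in 0u8..8 {
        let batch = vec![i; 1024]; // stand-in for a batch of hashed tree nodes
        tx.send(batch).expect("writer thread hung up");
    }
    drop(tx); // closing the channel lets the writer loop finish

    writer.join().expect("writer thread panicked")
}
```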

@porcuquine
Collaborator Author

A further detail: each 'tree' mentioned above is actually 8 trees (for 32GiB sectors) of 4GiB each. So we're really talking about 16 trees.

Looking at this a little more closely, I think the following is happening: for both trees, after the tree is built, all the disk-writing happens at once. It's plausible that this all-at-once disk-writing represents what I called a 'macro pause' above. And there might be on the order of 16 of them per PC2.

It would still be better to eliminate these 'write gaps' (breaks in hashing while output is written) by moving them to another thread. Long-term, it should be okay to schedule each of the 4GiB trees individually (so they can interleave, even across PC2 jobs). Short-term, it's still likely better to avoid creating many batch hashers simultaneously.

@porcuquine force-pushed the feat/serialize-tree-building branch 2 times, most recently from 70d068f to 6dc1d9b on October 31, 2020 21:05
@porcuquine
Collaborator Author

The latest commits here hold the GPU lock only during the tree-building phase for each subtree, so they should divide a 32GiB sector into 32 interleavable units. Each of these is still a 'tree' from the tree builders' perspective, though.
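
As a sketch of the narrower lock scope (with made-up names, not the actual code in this branch): the lock is taken per subtree and released before that subtree's nodes are written, so another job can start hashing on the GPU while this one writes its output.

```rust
use std::sync::Mutex;

static GPU_LOCK: Mutex<()> = Mutex::new(());

// Stand-in for a GPU batch-hash of one subtree.
fn build_subtree(leaves: &[u64]) -> Vec<u64> {
    leaves
        .iter()
        .map(|l| l.wrapping_mul(0x9e37_79b9_7f4a_7c15))
        .collect()
}

// Stand-in for persisting the subtree; note it runs outside the GPU lock.
fn write_subtree(nodes: &[u64]) {
    println!("wrote {} nodes", nodes.len());
}

fn process_sector(subtrees: Vec<Vec<u64>>) {
    for leaves in subtrees {
        let nodes = {
            // Critical section: only the hashing itself is serialized.
            let _guard = GPU_LOCK.lock().expect("GPU lock poisoned");
            build_subtree(&leaves)
        }; // lock released here
        // Another job's subtree build can use the GPU while this one's
        // output is being written.
        write_subtree(&nodes);
    }
}

fn main() {
    process_sector(vec![vec![1, 2, 3], vec![4, 5, 6]]);
}
```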

@dignifiedquire previously approved these changes Nov 6, 2020
Contributor

@dignifiedquire left a comment


one nit and rebase needed, otherwise lgtm
