
Serialize GPU tree building with GPU lock. #1335

Merged (1 commit) Nov 6, 2020

Conversation

porcuquine
Collaborator

@porcuquine commented Oct 31, 2020

Don't try to build more than one tree on the GPU at a time. There is no benefit to doing so, since tree building already uses the GPU's compute as efficiently as it can, and it causes problems: the GPU tree builder was not designed to be used this way.

This assumes only a single GPU is available for building trees, but that assumption was already present. We can overcome that limitation with multi-GPU scheduling, which is coming.
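
For clarity, here is a minimal, self-contained sketch of the locking pattern this describes, using only the standard library. The names (`GPU_LOCK`, `build_tree`) are illustrative rather than the actual code in this repository, and the CPU fold merely stands in for GPU hashing so the example runs anywhere:

```rust
use std::sync::Mutex;

// One lock per process: only one tree build may touch the GPU at a time.
static GPU_LOCK: Mutex<()> = Mutex::new(());

// Stand-in for the real GPU tree builder; this sketch just folds the leaves
// on the CPU so it compiles and runs without a GPU.
fn build_tree(leaves: &[u64]) -> u64 {
    // Hold the lock for the full duration of the (simulated) GPU work.
    let _guard = GPU_LOCK.lock().expect("GPU lock poisoned");
    leaves
        .iter()
        .fold(0u64, |acc, l| acc.wrapping_mul(31).wrapping_add(*l))
}

fn main() {
    // Several jobs racing for the GPU; the lock serializes them.
    let handles: Vec<_> = (0..4u64)
        .map(|i| std::thread::spawn(move || build_tree(&[i, i + 1, i + 2])))
        .collect();
    for h in handles {
        println!("root = {}", h.join().unwrap());
    }
}
```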

@porcuquine mentioned this pull request Oct 31, 2020
@porcuquine marked this pull request as ready for review October 31, 2020 07:23
@karalabe

I don't think that assumption is right. I've seen plenty of times that my GPU sits idle while running PC2. By allowing more than one concurrent PC2, I can fill that gap.

@porcuquine
Collaborator Author

Okay, help me understand the nature of the idleness you see. It's possible that serializing at the tree level is too coarse. The underlying assumption of avoiding simultaneous hashing calls to the same GPU is sound.

For context, PC2 involves building two trees: the more expensive 'column tree', which yields comm_c, and the cheaper simple tree, which yields comm_r_last. In either case there are thousands of calls to the GPU, each batching a configurable number of hashes (on the order of hundreds of thousands).

I think the current code interleaves writing results to disk with the GPU hashing. I'd like to understand whether that is what you are observing, or whether you are seeing larger, macro blocks of time with no activity on the GPU.

What I'm describing would look like very rapid alternation between high and low GPU utilization — so each process would have 'cracks' which another might fill. I'm guessing this is what you describe.

If you're seeing something else, we should try to understand that. For now, I will assume this is the issue. In that case, we can either make the lock granularity much finer — or we can move the disk-writing out of the hot path. I think the latter would be more economical, and it will still let us stick to one-tree-at-a-time on the GPU (which will make macro scheduling simpler).

We already read the input from disk in a separate thread, so a similar approach should be possible for the writing. Please let me know whether my descriptions match your observations, or if you are seeing something else. If you are seeing long periods of complete idleness during PC2, then that may be indicative of some other problem best addressed on its own.
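
To make the writer-thread idea concrete, here is a hedged sketch (not the project's code) of moving disk writes off the hashing path: finished batches go over a channel to a dedicated writer thread, so the next GPU batch can start while the previous one is persisted. The file name and batch representation are purely illustrative:

```rust
use std::fs::File;
use std::io::{BufWriter, Write};
use std::sync::mpsc;
use std::thread;

fn main() -> std::io::Result<()> {
    let (tx, rx) = mpsc::channel::<Vec<u8>>();

    // Writer thread: drains finished batches and appends them to disk.
    let writer = thread::spawn(move || -> std::io::Result<()> {
        let mut out = BufWriter::new(File::create("tree_r_last.dat")?);
        for batch in rx {
            out.write_all(&batch)?;
        }
        out.flush()
    });

    // The hashing loop hands each finished batch to the writer and
    // immediately starts the next one instead of blocking on the write.
    for i in 0u8..8 {
        let batch = vec![i; 1024]; // stand-in for a batch of hashed tree nodes
        tx.send(batch).expect("writer thread hung up");
    }
    drop(tx); // closing the channel lets the writer loop finish

    writer.join().expect("writer thread panicked")
}
```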

@porcuquine
Collaborator Author

A further detail: each 'tree' mentioned above is actually 8 trees (for 32GiB sectors) of 4GiB each. So we're really talking about 16 trees.

Looking at this a little more closely, I think the following is happening: for both trees, after the tree is built, all the disk-writing happens at once. It's plausible that this all-at-once disk-writing represents what I called a 'macro pause' above. And there might be on the order of 16 of them per PC2.

It would still be better to eliminate these 'write gaps' (breaks in hashing while output is written) by moving them to another thread. Long-term, it should be okay to schedule each of the 4GiB trees individually (so they can interleave, even across PC2 jobs). Short-term, it's still likely better to avoid creating many batch hashers simultaneously.

@porcuquine force-pushed the feat/serialize-tree-building branch 2 times, most recently from 70d068f to 6dc1d9b on October 31, 2020 21:05
@porcuquine
Collaborator Author

The latest commits here hold the GPU lock only during the tree-building phase for each subtree, so they should divide a 32GiB sector into 32 interleavable units. Each of these is still a 'tree' from the tree builders' perspective, though.
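
As a sketch of the narrower lock scope (with made-up names, not the actual code in this branch): the lock is taken per subtree and released before that subtree's nodes are written, so another job can start hashing on the GPU while this one writes its output.

```rust
use std::sync::Mutex;

static GPU_LOCK: Mutex<()> = Mutex::new(());

// Stand-in for a GPU batch-hash of one subtree.
fn build_subtree(leaves: &[u64]) -> Vec<u64> {
    leaves
        .iter()
        .map(|l| l.wrapping_mul(0x9e37_79b9_7f4a_7c15))
        .collect()
}

// Stand-in for persisting the subtree; note it runs outside the GPU lock.
fn write_subtree(nodes: &[u64]) {
    println!("wrote {} nodes", nodes.len());
}

fn process_sector(subtrees: Vec<Vec<u64>>) {
    for leaves in subtrees {
        let nodes = {
            // Critical section: only the hashing itself is serialized.
            let _guard = GPU_LOCK.lock().expect("GPU lock poisoned");
            build_subtree(&leaves)
        }; // lock released here
        // Another job's subtree build can use the GPU while this one's
        // output is being written.
        write_subtree(&nodes);
    }
}

fn main() {
    process_sector(vec![vec![1, 2, 3], vec![4, 5, 6]]);
}
```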

@dignifiedquire previously approved these changes Nov 6, 2020
Contributor

@dignifiedquire left a comment


one nit and rebase needed, otherwise lgtm
