
RFC: Checkpoint Sharding Callback #458

Merged · 3 commits · Feb 13, 2024

Changes from 1 commit
update diagram and description
BlaziusMaximus committed Jan 23, 2024
commit d06f486e5d2b4281d6aecdba70803936160a29ef
2 changes: 1 addition & 1 deletion rfcs/20231213-checkpoint-sharding-callback.md
@@ -47,7 +47,7 @@ For teams that previously had large and unwieldy shards, this results in a much

![A diagram showcasing the difference between sharding by device and sharding according to a max shard size.](./20231213-checkpoint-sharding-callback/callback-diagram.png "Checkpoint sharding callback diagram")

- On the left is the checkpoint layout for the default `ShardByTaskPolicy`. In the first **Shard** file, we have tensor (slice)s **Alpha**, **Omega**, and **Gamma**. These tensors are placed in this shard since they were stored in the same device memory before saving. Therefore, Alpha, Omega, Gamma all belong to one device, while Delta, Kappa, and the other tensors in the second shard belong to a different device. With `max_shard_size`, these tensors would instead be split into different **Shards**, as shown on the right. Every shard on the right has a maximum size of 500MB, so some tensors are split among multiple shards. For example, Alpha is a 1GB tensor, so it is split into two 500MB shards. Additionally, tensors that belong to a particular device can be grouped with tensors that belong to a different device, as in the case of the shard on the right containing Gamma[1].
+ On the left is the checkpoint layout for the default `ShardByTaskPolicy`. In the first **Shard** file, we have tensor (slice)s **Alpha**, **Omega**, and **Gamma**. These tensors are placed in this shard since they were stored in the same device task before saving. Therefore, Alpha, Omega, Gamma all belong to one task, while Delta, Kappa, and the other tensors in the second shard belong to a different task. With `max_shard_size`, these tensors would instead be split into different **Shards**, as shown on the right. Every shard on the right has a maximum size of 500MB, so some tensors are split among multiple shards. For example, Alpha is a 1GB tensor, so it is split into two 500MB shards.
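The `max_shard_size` behavior described above can be sketched as a greedy packing routine: tensors fill the current shard until the size limit is reached, oversized tensors are split into slices, and leftover slices from different tensors may share a shard. This is a minimal, framework-free illustration of the layout in the diagram; the function name `shard_by_max_size` and the slice-labeling scheme are assumptions for illustration, not the RFC's actual API.

```python
def shard_by_max_size(tensors, max_mb):
    """Greedily pack (name, size_mb) tensors into shards of at most max_mb.

    Tensors larger than max_mb are split into multiple slices, labeled
    "Name[0]", "Name[1]", etc. (an illustrative convention, not the RFC's).
    """
    shards = [[]]      # each shard is a list of (label, slice_size_mb)
    free = max_mb      # space remaining in the current shard
    for name, size in tensors:
        part = 0
        while size > 0:
            if free == 0:
                shards.append([])  # current shard is full; start a new one
                free = max_mb
            take = min(size, free)
            # Unsplit tensors keep their plain name; slices get an index.
            label = name if part == 0 and take == size else f"{name}[{part}]"
            shards[-1].append((label, take))
            free -= take
            size -= take
            part += 1
    return shards


# Mirroring the example above: a 1GB Alpha is split into two 500MB slices.
layout = shard_by_max_size(
    [("Alpha", 1000), ("Omega", 200), ("Gamma", 600)], max_mb=500
)
```

Every shard produced this way respects the 500MB cap, and, as the diagram shows, a shard may mix slices originating from different tasks (here, `Omega` and `Gamma[0]` share a shard).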

### Detailed Design

Binary file modified rfcs/20231213-checkpoint-sharding-callback/.DS_Store
Binary file not shown.