Skip to content
New issue

Have a question about this project? # for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “#”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? # to your account

Error early if we don't have enough disk space #154

Open
achalddave opened this issue Dec 14, 2023 · 0 comments
Open

Error early if we don't have enough disk space #154

achalddave opened this issue Dec 14, 2023 · 0 comments
Labels
good first issue Good for newcomers

Comments

@achalddave
Copy link
Collaborator

For medium-to-large models, if the user doesn't have enough disk space (or, more commonly, has accidentally specified a path on a volume with not enough disk space), we train for a full "epoch," and crash while saving the checkpoint. It would be nice to either:

Option 1: Save a dummy checkpoint at the very start, before training. If this succeeds, assume that future checkpoints will work if --delete-previous-checkpoint is specified. As an addition, we could check if there is num_checkpoints * size(initial checkpoint) disk space remaining if --delete-previous-checkpoint is not specified, but this is not necessary.

Option 2: Estimate the size of the checkpoint (based on number of parameters) and check if we have enough disk space based on number of checkpoints requested.

@achalddave achalddave added the good first issue Good for newcomers label Dec 19, 2023
# for free to join this conversation on GitHub. Already have an account? # to comment
Labels
good first issue Good for newcomers
Projects
None yet
Development

No branches or pull requests

1 participant