batch config issue? #17
Merged: fixed with pull request Samyamr/batchconfig #33
I am having this same issue using 0.9 but not 0.8 (using an AWS p4 machine).
There are a few things in how the train batch size is configured that do not seem correct to me, and a few things that we do not currently support.

The invariant

train_batch_size == train_micro_batch_size_per_gpu * gradient_accumulation_steps * world_size

should always hold, but currently it does not in some cases.
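As a minimal sketch of that invariant (function and parameter names here are illustrative, not DeepSpeed's actual internals), the resolved values should satisfy:

```python
def check_batch_config(train_batch_size, micro_batch_per_gpu,
                       grad_accum_steps, world_size):
    # The global batch must equal the per-GPU micro batch times the
    # accumulation steps times the number of data-parallel ranks.
    expected = micro_batch_per_gpu * grad_accum_steps * world_size
    if train_batch_size != expected:
        raise ValueError(
            f"train_batch_size ({train_batch_size}) != "
            f"{micro_batch_per_gpu} * {grad_accum_steps} * {world_size} "
            f"(= {expected})")
```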
For example, when train_micro_batch_size_per_gpu and gradient_accumulation_steps are None in the ds_config, they are initialized to train_batch_size and 1 respectively, which leads to

train_batch_size == train_batch_size * 1 * world_size
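Concretely, with train_batch_size=32 and world_size=4, that defaulting requires 32 == 32 * 1 * 4, i.e. 32 == 128, which is false. A hypothetical defaulting rule that would preserve the invariant might divide through instead (illustrative sketch, assuming the division is exact):

```python
def default_micro_batch(train_batch_size, world_size, grad_accum_steps=1):
    # Derive the per-GPU micro batch from the global batch instead of
    # copying train_batch_size verbatim.
    denom = grad_accum_steps * world_size
    if train_batch_size % denom != 0:
        raise ValueError(
            f"train_batch_size ({train_batch_size}) must be divisible "
            f"by gradient_accumulation_steps * world_size ({denom})")
    return train_batch_size // denom
```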
If train_micro_batch_size_per_gpu > per_device_batch_size, we should throw a config error. Currently, it is assigned to be equal to per_device_batch_size instead.
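A sketch of the proposed check (per_device_batch_size here stands for whatever batch size the training script supplies; names are illustrative):

```python
def validate_micro_batch(micro_batch_per_gpu, per_device_batch_size):
    # Reject the config rather than silently clamping the micro batch.
    if micro_batch_per_gpu > per_device_batch_size:
        raise ValueError(
            f"train_micro_batch_size_per_gpu ({micro_batch_per_gpu}) "
            f"cannot exceed per_device_batch_size "
            f"({per_device_batch_size})")
```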
We do not currently support the user providing only train_micro_batch_size_per_gpu, or train_micro_batch_size_per_gpu together with gradient_accumulation_steps.
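Supporting those cases would amount to inferring the missing global batch size from the same identity, for example (illustrative sketch):

```python
def infer_train_batch_size(micro_batch_per_gpu, world_size,
                           grad_accum_steps=1):
    # When only the micro batch (and optionally the accumulation steps)
    # is given, the global batch follows directly from the identity.
    return micro_batch_per_gpu * grad_accum_steps * world_size
```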