batch config issue? #17
Merged: fixed with pull request Samyamr/batchconfig #33
I am having this same issue using 0.9 but not 0.8 (using an AWS p4 machine).
There are a few things in how the train batch size is configured that do not seem correct to me, and a few things that we do not currently support.

The invariant

train_batch_size == train_micro_batch_size_per_gpu * gradient_accumulation_steps * world_size

should always hold, but currently it does not in some cases.
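As a minimal sketch of that invariant (function and parameter names here are illustrative, not DeepSpeed's actual internals), the resolved values should satisfy:

```python
def check_batch_config(train_batch_size, micro_batch_per_gpu,
                       grad_accum_steps, world_size):
    # The global batch must equal the per-GPU micro batch times the
    # accumulation steps times the number of data-parallel ranks.
    expected = micro_batch_per_gpu * grad_accum_steps * world_size
    if train_batch_size != expected:
        raise ValueError(
            f"train_batch_size ({train_batch_size}) != "
            f"{micro_batch_per_gpu} * {grad_accum_steps} * {world_size} "
            f"(= {expected})")
```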
For example, when train_micro_batch_size_per_gpu and gradient_accumulation_steps are None in the ds_config, they are initialized to train_batch_size and 1 respectively, which leads to

train_batch_size == train_batch_size * 1 * world_size
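Concretely, with train_batch_size=32 and world_size=4, that defaulting requires 32 == 32 * 1 * 4, i.e. 32 == 128, which is false. A hypothetical defaulting rule that would preserve the invariant might divide through instead (illustrative sketch, assuming the division is exact):

```python
def default_micro_batch(train_batch_size, world_size, grad_accum_steps=1):
    # Derive the per-GPU micro batch from the global batch instead of
    # copying train_batch_size verbatim.
    denom = grad_accum_steps * world_size
    if train_batch_size % denom != 0:
        raise ValueError(
            f"train_batch_size ({train_batch_size}) must be divisible "
            f"by gradient_accumulation_steps * world_size ({denom})")
    return train_batch_size // denom
```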
If train_micro_batch_size_per_gpu > per_device_batch_size, we should throw a config error. Currently, it is assigned to be equal to per_device_batch_size instead.
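A sketch of the proposed check (per_device_batch_size here stands for whatever batch size the training script supplies; names are illustrative):

```python
def validate_micro_batch(micro_batch_per_gpu, per_device_batch_size):
    # Reject the config rather than silently clamping the micro batch.
    if micro_batch_per_gpu > per_device_batch_size:
        raise ValueError(
            f"train_micro_batch_size_per_gpu ({micro_batch_per_gpu}) "
            f"cannot exceed per_device_batch_size "
            f"({per_device_batch_size})")
```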
We do not currently support the user providing only train_micro_batch_size_per_gpu, or train_micro_batch_size_per_gpu together with gradient_accumulation_steps.
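Supporting those cases would amount to inferring the missing global batch size from the same identity, for example (illustrative sketch):

```python
def infer_train_batch_size(micro_batch_per_gpu, world_size,
                           grad_accum_steps=1):
    # When only the micro batch (and optionally the accumulation steps)
    # is given, the global batch follows directly from the identity.
    return micro_batch_per_gpu * grad_accum_steps * world_size
```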