deepspeed github repo migration #99

Merged · 2 commits · Feb 3, 2025
compute/accelerator/README.md (2 changes: 1 addition & 1 deletion)
@@ -716,7 +716,7 @@ AMD GPUs run on [ROCm](https://www.amd.com/en/products/software/rocm.html) - not
The API is via [Habana SynapseAI® SDK](https://habana.ai/training-software/) which supports PyTorch and TensorFlow.

Useful integrations:
-- [HF Optimum Habana](https://github.com/huggingface/optimum-habana) which also includes - [DeepSpeed](https://github.com/microsoft/DeepSpeed) integration.
+- [HF Optimum Habana](https://github.com/huggingface/optimum-habana) which also includes - [DeepSpeed](https://github.com/deepspeedai/DeepSpeed) integration.



debug/pytorch.md (2 changes: 1 addition & 1 deletion)
@@ -689,7 +689,7 @@ This was a simple low-dimensional example, but in reality the tensors are much b

Now you might say that the `1e-6` discrepancy can be safely ignored. And it's often so, as long as this is a final result. If this tensor from the example above is now fed through 100 layers of `matmul`s, this tiny discrepancy is going to compound and spread out to impact many other elements, with the final outcome being quite different from the same action performed on another type of device.

-For example, see this [discussion](https://github.com/microsoft/DeepSpeed/issues/4932) - the users reported that when doing Llama-2-7b inference they were getting quite different logits depending on how the model was initialized. To clarify, the initial discussion was about Deepspeed potentially being the problem, but in later comments you can see that it was reduced to just which device the model's buffers were initialized on. The trained weights aren't an issue since they are loaded from the checkpoint, but the buffers are recreated from scratch when the model is loaded, so that's where the problem emerges.
+For example, see this [discussion](https://github.com/deepspeedai/DeepSpeed/issues/4932) - the users reported that when doing Llama-2-7b inference they were getting quite different logits depending on how the model was initialized. To clarify, the initial discussion was about Deepspeed potentially being the problem, but in later comments you can see that it was reduced to just which device the model's buffers were initialized on. The trained weights aren't an issue since they are loaded from the checkpoint, but the buffers are recreated from scratch when the model is loaded, so that's where the problem emerges.

It's uncommon that small variations make much of a difference, but sometimes the difference can be clearly seen, as in this example where the same image is produced on a CPU and an MPS device.
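
To make the compounding effect concrete, here is a small self-contained sketch (not taken from the issue above; the dimensions, init scheme and layer count are arbitrary) that perturbs a single element by `1e-6` and pushes both copies through a stack of matmul+relu layers:

```python
import torch

torch.manual_seed(0)
dim = 512
x = torch.randn(dim)
x_perturbed = x.clone()
x_perturbed[0] += 1e-6  # a single tiny discrepancy, e.g. from a different device's kernel

# 100 layers of matmul + relu with variance-preserving (He) init so activations stay O(1)
weights = [torch.randn(dim, dim) * (2.0 / dim) ** 0.5 for _ in range(100)]

a, b = x, x_perturbed
for w in weights:
    a = torch.relu(w @ a)
    b = torch.relu(w @ b)

diff = (a - b).abs()
# the difference is no longer confined to a single element - it has spread across the output
print(f"max |a-b|: {diff.max().item():.2e}, elements affected: {(diff > 0).sum().item()}/{dim}")
```

Run it with different seeds: the exact numbers vary, but the single-element difference always ends up smeared across a large fraction of the output elements.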

inference/README.md (2 changes: 1 addition & 1 deletion)
@@ -619,7 +619,7 @@ This section is trying hard to be neutral and not recommend any particular frame

### DeepSpeed-FastGen

-[DeepSpeed-FastGen](https://github.com/microsoft/DeepSpeed/tree/master/blogs/deepspeed-fastgen) from [the DeepSpeed team](https://github.com/microsoft/DeepSpeed).
+[DeepSpeed-FastGen](https://github.com/deepspeedai/DeepSpeed/tree/master/blogs/deepspeed-fastgen) from [the DeepSpeed team](https://github.com/deepspeedai/DeepSpeed).

### TensorRT-LLM

network/benchmarks/README.md (4 changes: 2 additions & 2 deletions)
@@ -114,7 +114,7 @@ Notes:

You may get results anywhere between 5Gbps and 1600Gbps (as of this writing). The minimal speed to prevent being network bound will depend on your particular training framework, but typically you'd want at least 400Gbps or higher. Though we trained BLOOM on 50Gbps.

-Frameworks that shard weights and optim states like [Deepspeed](https://github.com/microsoft/DeepSpeed) w/ ZeRO Stage-3 generate a lot more traffic than frameworks like [Megatron-Deepspeed](https://github.com/bigscience-workshop/Megatron-DeepSpeed) which do tensor and pipeline parallelism in addition to data parallelism. The latter ones only send activations across and thus don't need as much bandwidth. But they are much more complicated to set up and run.
+Frameworks that shard weights and optim states like [Deepspeed](https://github.com/deepspeedai/DeepSpeed) w/ ZeRO Stage-3 generate a lot more traffic than frameworks like [Megatron-Deepspeed](https://github.com/bigscience-workshop/Megatron-DeepSpeed) which do tensor and pipeline parallelism in addition to data parallelism. The latter ones only send activations across and thus don't need as much bandwidth. But they are much more complicated to set up and run.

Of course, an efficient framework will overlap communications and compute, so that while one stage is fetching data, the other stage in parallel runs computations. So as long as the communication overhead is smaller than the compute, the network requirements are satisfied and don't have to be super fantastic.
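
As a back-of-the-envelope sanity check (all numbers below are made-up placeholders, and the ~3x-model-size traffic factor for ZeRO-3 is only a rough approximation), you can compare how long the per-step traffic needs against the measured compute time of one step:

```python
# Rough check: can the per-step inter-node traffic be hidden behind the compute?
model_params    = 70e9   # placeholder model size
bytes_per_param = 2      # bf16
traffic_factor  = 3      # rough ZeRO-3 approximation: param all-gather fwd + bwd, grad reduce-scatter
network_gbps    = 400    # inter-node bandwidth per node, Gbit/s
compute_time_s  = 2.0    # measured compute time of one step, placeholder

traffic_bits = model_params * bytes_per_param * traffic_factor * 8
comm_time_s = traffic_bits / (network_gbps * 1e9)

print(f"comm ~{comm_time_s:.1f}s vs compute ~{compute_time_s:.1f}s per step")
print("likely network-bound" if comm_time_s > compute_time_s else "comms can potentially be hidden")
```

If the first number dominates, no amount of overlap will hide the traffic and the setup is network-bound.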

@@ -124,7 +124,7 @@ To get reasonable GPU throughput when training at scale (64+GPUs) with DeepSpeed
2. 200-400 Gbps is ok
3. 800-1000 Gbps is ideal

-[full details](https://github.com/microsoft/DeepSpeed/issues/2928#issuecomment-1463041491)
+[full details](https://github.com/deepspeedai/DeepSpeed/issues/2928#issuecomment-1463041491)

Of course, the requirements are higher for A100 GPU nodes and even higher for H100s (but no such benchmark information has been shared yet).

stabs/incoming.md (2 changes: 1 addition & 1 deletion)
@@ -107,7 +107,7 @@ Make a new benchmark section:

1. nccl-tests
2. `all_reduce_bench.py`
-3. https://github.com/microsoft/DeepSpeedExamples/tree/master/benchmarks/communication
+3. https://github.com/deepspeedai/DeepSpeedExamples/tree/master/benchmarks/communication
4. like nccl-tests, another common set of benchmarks used at HPC sites are the OSU microbenchmarks like osu_lat, osu_bw, and osu_bibw.

https://mvapich.cse.ohio-state.edu/benchmarks/
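
For illustration, a stripped-down version of the kind of all-reduce bandwidth probe listed above might look like this (a sketch only, using the nccl-tests `2*(n-1)/n` busbw convention; use the real tools for actual numbers):

```python
# launch with e.g.: torchrun --nproc_per_node=8 all_reduce_probe.py
import os
import time

import torch
import torch.distributed as dist

dist.init_process_group("nccl")
local_rank = int(os.environ["LOCAL_RANK"])
torch.cuda.set_device(local_rank)

size_gb = 4
tensor = torch.ones(int(size_gb * 2**30) // 4, dtype=torch.float32, device="cuda")

for _ in range(5):  # warmup
    dist.all_reduce(tensor)
torch.cuda.synchronize()

iters = 20
start = time.perf_counter()
for _ in range(iters):
    dist.all_reduce(tensor)
torch.cuda.synchronize()
elapsed = (time.perf_counter() - start) / iters

n = dist.get_world_size()
busbw = size_gb / elapsed * 2 * (n - 1) / n  # ring all-reduce bus bandwidth, GB/s
if dist.get_rank() == 0:
    print(f"payload: {size_gb}GB, avg time: {elapsed:.3f}s, busbw: {busbw:.1f} GB/s")

dist.destroy_process_group()
```
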
training/fault-tolerance/README.md (2 changes: 1 addition & 1 deletion)
@@ -309,7 +309,7 @@ for batch in iterator:
train_step(batch)
```

-footnote: don't do this unless you really have to, since caching makes things faster. Ideally figure out the fragmentation issue instead. For example, look up `max_split_size_mb` in the doc for [`PYTORCH_CUDA_ALLOC_CONF`](https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) as it controls how memory is allocated. Some frameworks like [Deepspeed](https://github.com/microsoft/DeepSpeed) solve this by pre-allocating tensors at start time and then reusing them again and again, preventing the issue of fragmentation altogether.
+footnote: don't do this unless you really have to, since caching makes things faster. Ideally figure out the fragmentation issue instead. For example, look up `max_split_size_mb` in the doc for [`PYTORCH_CUDA_ALLOC_CONF`](https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) as it controls how memory is allocated. Some frameworks like [Deepspeed](https://github.com/deepspeedai/DeepSpeed) solve this by pre-allocating tensors at start time and then reusing them again and again, preventing the issue of fragmentation altogether.

footnote: this simplified example would work for a single node. For multiple nodes you'd need to gather the stats from all participating nodes and find the one that has the least amount of memory left and act upon that.
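
A hedged sketch of what that multi-node variant could look like (the helper name is made up; the idea is simply that every rank reports its free GPU memory and all ranks act on the global minimum, so they flush in the same iteration):

```python
import torch
import torch.distributed as dist

def global_min_free_gpu_memory_gb() -> float:
    """Smallest amount of free GPU memory (in GB) across all participating ranks."""
    free_bytes, _total_bytes = torch.cuda.mem_get_info()
    t = torch.tensor([float(free_bytes)], dtype=torch.float64, device="cuda")
    dist.all_reduce(t, op=dist.ReduceOp.MIN)
    return t.item() / 2**30

# inside the training loop, e.g. every N steps:
# if global_min_free_gpu_memory_gb() < threshold_gb:
#     torch.cuda.empty_cache()
```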

training/model-parallelism/README.md (14 changes: 7 additions & 7 deletions)
@@ -145,7 +145,7 @@ PyTorch:
- [PyTorch FSDP: Experiences on Scaling Fully Sharded Data Parallel](https://arxiv.org/abs/2304.11277)

Main DeepSpeed ZeRO Resources:
-- [Project's github](https://github.com/microsoft/deepspeed)
+- [Project's github](https://github.com/deepspeedai/DeepSpeed)
- [Usage docs](https://www.deepspeed.ai/getting-started/)
- [API docs](https://deepspeed.readthedocs.io/en/latest/index.html)
- [Blog posts](https://www.microsoft.com/en-us/research/search/?q=deepspeed)
@@ -372,7 +372,7 @@ Here it's important to see how DP rank 0 doesn't see GPU2 and DP rank 1 doesn't
Since each dimension requires at least 2 GPUs, here you'd need at least 4 GPUs.
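
For the 2x2 case above, the grouping can be spelled out explicitly (a hypothetical enumeration; real frameworks construct these process groups through their own APIs):

```python
# 2 DP replicas x 2 PP stages = 4 GPUs, matching the picture above
dp_degree, pp_degree = 2, 2

# DP only ever sees GPU0 and GPU1; GPU2 and GPU3 are hidden behind them as the second pipeline stage
pp_groups = [[dp_rank, dp_rank + dp_degree] for dp_rank in range(dp_degree)]                    # [[0, 2], [1, 3]]
dp_groups = [[stage * dp_degree + r for r in range(dp_degree)] for stage in range(pp_degree)]   # [[0, 1], [2, 3]]

print("pipeline groups:", pp_groups)       # each DP rank forwards activations down its own pipeline
print("data-parallel groups:", dp_groups)  # gradient all-reduce happens per pipeline stage
```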

Implementations:
-- [DeepSpeed](https://github.com/microsoft/DeepSpeed)
+- [DeepSpeed](https://github.com/deepspeedai/DeepSpeed)
- [Megatron-LM](https://github.com/NVIDIA/Megatron-LM)
- [Varuna](https://github.com/microsoft/varuna)
- [SageMaker](https://arxiv.org/abs/2111.05972)
@@ -393,7 +393,7 @@ This diagram is from a blog post [3D parallelism: Scaling to trillion-parameter
Since each dimension requires at least 2 GPUs, here you'd need at least 8 GPUs.

Implementations:
-- [DeepSpeed](https://github.com/microsoft/DeepSpeed) - DeepSpeed also includes an even more efficient DP, which they call ZeRO-DP.
+- [DeepSpeed](https://github.com/deepspeedai/DeepSpeed) - DeepSpeed also includes an even more efficient DP, which they call ZeRO-DP.
- [Megatron-LM](https://github.com/NVIDIA/Megatron-LM)
- [Varuna](https://github.com/microsoft/varuna)
- [SageMaker](https://arxiv.org/abs/2111.05972)
@@ -448,7 +448,7 @@ During compute each sequence chunk is projected onto QKV and then gathered to th

![deepspeed-ulysses sp](images/deepspeed-ulysses.png)

-[source](https://github.com/microsoft/DeepSpeed/tree/master/blogs/deepspeed-ulysses)
+[source](https://github.com/deepspeedai/DeepSpeed/tree/master/blogs/deepspeed-ulysses)

On the diagram:
1. Input sequences N are partitioned across P available devices.
@@ -468,7 +468,7 @@ Example: Let's consider seqlen=8K, num_heads=128 and a single node of num_gpus=8
b. the attention computation is done on the first 16 sub-heads
the same logic is performed on the remaining 7 GPUs, each computing 8k attention over its 16 sub-heads

-You can read the specifics of the very efficient comms [here](https://github.com/microsoft/DeepSpeed/tree/master/blogs/deepspeed-ulysses#significant-communication-volume-reduction).
+You can read the specifics of the very efficient comms [here](https://github.com/deepspeedai/DeepSpeed/tree/master/blogs/deepspeed-ulysses#significant-communication-volume-reduction).

DeepSpeed-Ulysses keeps communication volume consistent by increasing GPUs proportional to message size or sequence length.
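
Sticking with the numbers from the example above (and assuming a head dimension of 128, which the text doesn't specify), the per-GPU shapes before and after the all-to-all look like this:

```python
# shape-only illustration of DeepSpeed-Ulysses head/sequence re-partitioning
P, seqlen, num_heads, head_dim = 8, 8192, 128, 128   # head_dim is an assumption

# before attention: each GPU holds its 1K sequence chunk but all 128 heads
per_gpu_before = (seqlen // P, num_heads, head_dim)   # (1024, 128, 128)

# after the all-to-all: each GPU holds the full 8K sequence but only 16 heads
per_gpu_after = (seqlen, num_heads // P, head_dim)    # (8192, 16, 128)

# the per-GPU volume is unchanged - only the partitioning axis moved from sequence to heads
assert per_gpu_before[0] * per_gpu_before[1] == per_gpu_after[0] * per_gpu_after[1]
print(per_gpu_before, "->", per_gpu_after)
```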

@@ -496,7 +496,7 @@ Paper: [Ring Attention with Blockwise Transformers for Near-Infinite Context](ht

SP Implementations:
- [Megatron-LM](https://github.com/NVIDIA/Megatron-LM)
-- [Deepspeed](https://github.com/microsoft/DeepSpeed)
+- [Deepspeed](https://github.com/deepspeedai/DeepSpeed)
- [Colossal-AI](https://colossalai.org/)
- [torchtitan](https://github.com/pytorch/torchtitan)

@@ -659,7 +659,7 @@ If the network were to be 5x faster, that is 212GB/s (1700Gbps) then:

which would be insignificant compared to the compute time, especially if some of it is successfully overlapped with the compute.

-Also, the Deepspeed team empirically [benchmarked a 176B model](https://github.com/microsoft/DeepSpeed/issues/2928#issuecomment-1463041491) on 384 V100 GPUs (24 DGX-2 nodes) and found that:
+Also, the Deepspeed team empirically [benchmarked a 176B model](https://github.com/deepspeedai/DeepSpeed/issues/2928#issuecomment-1463041491) on 384 V100 GPUs (24 DGX-2 nodes) and found that:

1. With 100 Gbps IB, we only have <20 TFLOPs per GPU (bad)
2. With 200-400 Gbps IB, we achieve reasonable TFLOPs around 30-40 per GPU (ok)