deepspeed github repo migration #99

Merged · 2 commits · Feb 3, 2025
compute/accelerator/README.md (2 changes: 1 addition & 1 deletion)
@@ -716,7 +716,7 @@ AMD GPUs run on [ROCm](https://www.amd.com/en/products/software/rocm.html) - not
The API is via [Habana SynapseAI® SDK](https://habana.ai/training-software/) which supports PyTorch and TensorFlow.

Useful integrations:
-- [HF Optimum Habana](https://github.com/huggingface/optimum-habana) which also includes - [DeepSpeed](https://github.com/microsoft/DeepSpeed) integration.
+- [HF Optimum Habana](https://github.com/huggingface/optimum-habana) which also includes - [DeepSpeed](https://github.com/deepspeedai/DeepSpeed) integration.



debug/pytorch.md (2 changes: 1 addition & 1 deletion)
@@ -689,7 +689,7 @@ This was a simple low-dimensional example, but in reality the tensors are much b

Now you might say that the `1e-6` discrepancy can be safely ignored. And it's often so, as long as this is a final result. If this tensor from the example above is now fed through 100 layers of `matmul`s, this tiny discrepancy is going to compound and spread out to impact many other elements, with the final outcome being quite different from the same action performed on another type of device.

-For example, see this [discussion](https://github.com/microsoft/DeepSpeed/issues/4932) - the users reported that when doing Llama-2-7b inference they were getting quite different logits depending on how the model was initialized. To clarify, the initial discussion was about Deepspeed potentially being the problem, but in later comments you can see that it was reduced to just which device the model's buffers were initialized on. The trained weights aren't an issue since they are loaded from the checkpoint, but the buffers are recreated from scratch when the model is loaded, so that's where the problem emerges.
+For example, see this [discussion](https://github.com/deepspeedai/DeepSpeed/issues/4932) - the users reported that when doing Llama-2-7b inference they were getting quite different logits depending on how the model was initialized. To clarify, the initial discussion was about Deepspeed potentially being the problem, but in later comments you can see that it was reduced to just which device the model's buffers were initialized on. The trained weights aren't an issue since they are loaded from the checkpoint, but the buffers are recreated from scratch when the model is loaded, so that's where the problem emerges.

It's uncommon that small variations make much of a difference, but sometimes the difference can be clearly seen, as in this example where the same image is produced on a CPU and an MPS device.
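
To make the compounding effect concrete, here is a small self-contained sketch (not taken from the issue above; the dimensions, init scheme and layer count are arbitrary) that perturbs a single element by `1e-6` and pushes both copies through a stack of matmul+relu layers:

```python
import torch

torch.manual_seed(0)
dim = 512
x = torch.randn(dim)
x_perturbed = x.clone()
x_perturbed[0] += 1e-6  # a single tiny discrepancy, e.g. from a different device's kernel

# 100 layers of matmul + relu with variance-preserving (He) init so activations stay O(1)
weights = [torch.randn(dim, dim) * (2.0 / dim) ** 0.5 for _ in range(100)]

a, b = x, x_perturbed
for w in weights:
    a = torch.relu(w @ a)
    b = torch.relu(w @ b)

diff = (a - b).abs()
# the difference is no longer confined to a single element - it has spread across the output
print(f"max |a-b|: {diff.max().item():.2e}, elements affected: {(diff > 0).sum().item()}/{dim}")
```

Run it with different seeds: the exact numbers vary, but the single-element difference always ends up smeared across a large fraction of the output elements.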

inference/README.md (2 changes: 1 addition & 1 deletion)
@@ -619,7 +619,7 @@ This section is trying hard to be neutral and not recommend any particular frame

### DeepSpeed-FastGen

-[DeepSpeed-FastGen](https://github.com/microsoft/DeepSpeed/tree/master/blogs/deepspeed-fastgen) from [the DeepSpeed team](https://github.com/microsoft/DeepSpeed).
+[DeepSpeed-FastGen](https://github.com/deepspeedai/DeepSpeed/tree/master/blogs/deepspeed-fastgen) from [the DeepSpeed team](https://github.com/deepspeedai/DeepSpeed).

### TensorRT-LLM

network/benchmarks/README.md (4 changes: 2 additions & 2 deletions)
@@ -114,7 +114,7 @@ Notes:

You may get results anywhere between 5Gbps and 1600Gbps (as of this writing). The minimal speed to prevent being network bound will depend on your particular training framework, but typically you'd want at least 400Gbps or higher. Though we trained BLOOM on 50Gbps.

-Frameworks that shard weights and optim states like [Deepspeed](https://github.com/microsoft/DeepSpeed) w/ ZeRO Stage-3 generate a lot more traffic than frameworks like [Megatron-Deepspeed](https://github.com/bigscience-workshop/Megatron-DeepSpeed) which do tensor and pipeline parallelism in addition to data parallelism. The latter ones only send activations across and thus don't need as much bandwidth. But they are much more complicated to set up and run.
+Frameworks that shard weights and optim states like [Deepspeed](https://github.com/deepspeedai/DeepSpeed) w/ ZeRO Stage-3 generate a lot more traffic than frameworks like [Megatron-Deepspeed](https://github.com/bigscience-workshop/Megatron-DeepSpeed) which do tensor and pipeline parallelism in addition to data parallelism. The latter ones only send activations across and thus don't need as much bandwidth. But they are much more complicated to set up and run.

Of course, an efficient framework will overlap communications and compute, so that while one stage is fetching data, the other stage in parallel runs computations. So as long as the communication overhead is smaller than the compute, the network requirements are satisfied and don't have to be super fantastic.
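
As a back-of-the-envelope sanity check (all numbers below are made-up placeholders, and the ~3x-model-size traffic factor for ZeRO-3 is only a rough approximation), you can compare how long the per-step traffic needs against the measured compute time of one step:

```python
# Rough check: can the per-step inter-node traffic be hidden behind the compute?
model_params    = 70e9   # placeholder model size
bytes_per_param = 2      # bf16
traffic_factor  = 3      # rough ZeRO-3 approximation: param all-gather fwd + bwd, grad reduce-scatter
network_gbps    = 400    # inter-node bandwidth per node, Gbit/s
compute_time_s  = 2.0    # measured compute time of one step, placeholder

traffic_bits = model_params * bytes_per_param * traffic_factor * 8
comm_time_s = traffic_bits / (network_gbps * 1e9)

print(f"comm ~{comm_time_s:.1f}s vs compute ~{compute_time_s:.1f}s per step")
print("likely network-bound" if comm_time_s > compute_time_s else "comms can potentially be hidden")
```

If the first number dominates, no amount of overlap will hide the traffic and the setup is network-bound.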

@@ -124,7 +124,7 @@ To get reasonable GPU throughput when training at scale (64+GPUs) with DeepSpeed
2. 200-400 Gbps is ok
3. 800-1000 Gbps is ideal

-[full details](https://github.com/microsoft/DeepSpeed/issues/2928#issuecomment-1463041491)
+[full details](https://github.com/deepspeedai/DeepSpeed/issues/2928#issuecomment-1463041491)

Of course, the requirements are higher for A100 GPU nodes and even higher for H100s (but no such benchmark information has been shared yet).

stabs/incoming.md (2 changes: 1 addition & 1 deletion)
@@ -107,7 +107,7 @@ Make a new benchmark section:

1. nccl-tests
2. `all_reduce_bench.py`
-3. https://github.com/microsoft/DeepSpeedExamples/tree/master/benchmarks/communication
+3. https://github.com/deepspeedai/DeepSpeedExamples/tree/master/benchmarks/communication
4. like nccl-tests, another common set of benchmarks used at HPC sites are the OSU microbenchmarks like osu_lat, osu_bw, and osu_bibw.

https://mvapich.cse.ohio-state.edu/benchmarks/
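
For illustration, a stripped-down version of the kind of all-reduce bandwidth probe listed above might look like this (a sketch only, using the nccl-tests `2*(n-1)/n` busbw convention; use the real tools for actual numbers):

```python
# launch with e.g.: torchrun --nproc_per_node=8 all_reduce_probe.py
import os
import time

import torch
import torch.distributed as dist

dist.init_process_group("nccl")
local_rank = int(os.environ["LOCAL_RANK"])
torch.cuda.set_device(local_rank)

size_gb = 4
tensor = torch.ones(int(size_gb * 2**30) // 4, dtype=torch.float32, device="cuda")

for _ in range(5):  # warmup
    dist.all_reduce(tensor)
torch.cuda.synchronize()

iters = 20
start = time.perf_counter()
for _ in range(iters):
    dist.all_reduce(tensor)
torch.cuda.synchronize()
elapsed = (time.perf_counter() - start) / iters

n = dist.get_world_size()
busbw = size_gb / elapsed * 2 * (n - 1) / n  # ring all-reduce bus bandwidth, GB/s
if dist.get_rank() == 0:
    print(f"payload: {size_gb}GB, avg time: {elapsed:.3f}s, busbw: {busbw:.1f} GB/s")

dist.destroy_process_group()
```
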
training/fault-tolerance/README.md (2 changes: 1 addition & 1 deletion)
@@ -309,7 +309,7 @@ for batch in iterator:
train_step(batch)
```

-footnote: don't do this unless you really have to, since caching makes things faster. Ideally figure out the fragmentation issue instead. For example, look up `max_split_size_mb` in the doc for [`PYTORCH_CUDA_ALLOC_CONF`](https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) as it controls how memory is allocated. Some frameworks like [Deepspeed](https://github.com/microsoft/DeepSpeed) solve this by pre-allocating tensors at start time and then reusing them again and again, preventing the issue of fragmentation altogether.
+footnote: don't do this unless you really have to, since caching makes things faster. Ideally figure out the fragmentation issue instead. For example, look up `max_split_size_mb` in the doc for [`PYTORCH_CUDA_ALLOC_CONF`](https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) as it controls how memory is allocated. Some frameworks like [Deepspeed](https://github.com/deepspeedai/DeepSpeed) solve this by pre-allocating tensors at start time and then reusing them again and again, preventing the issue of fragmentation altogether.

footnote: this simplified example would work for a single node. For multiple nodes you'd need to gather the stats from all participating nodes and find the one that has the least amount of memory left and act upon that.
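
A hedged sketch of what that multi-node variant could look like (the helper name is made up; the idea is simply that every rank reports its free GPU memory and all ranks act on the global minimum, so they flush in the same iteration):

```python
import torch
import torch.distributed as dist

def global_min_free_gpu_memory_gb() -> float:
    """Smallest amount of free GPU memory (in GB) across all participating ranks."""
    free_bytes, _total_bytes = torch.cuda.mem_get_info()
    t = torch.tensor([float(free_bytes)], dtype=torch.float64, device="cuda")
    dist.all_reduce(t, op=dist.ReduceOp.MIN)
    return t.item() / 2**30

# inside the training loop, e.g. every N steps:
# if global_min_free_gpu_memory_gb() < threshold_gb:
#     torch.cuda.empty_cache()
```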

training/model-parallelism/README.md (14 changes: 7 additions & 7 deletions)
@@ -145,7 +145,7 @@ PyTorch:
- [PyTorch FSDP: Experiences on Scaling Fully Sharded Data Parallel](https://arxiv.org/abs/2304.11277)

Main DeepSpeed ZeRO Resources:
-- [Project's github](https://github.com/microsoft/deepspeed)
+- [Project's github](https://github.com/deepspeedai/DeepSpeed)
- [Usage docs](https://www.deepspeed.ai/getting-started/)
- [API docs](https://deepspeed.readthedocs.io/en/latest/index.html)
- [Blog posts](https://www.microsoft.com/en-us/research/search/?q=deepspeed)
@@ -372,7 +372,7 @@ Here it's important to see how DP rank 0 doesn't see GPU2 and DP rank 1 doesn't
Since each dimension requires at least 2 GPUs, here you'd need at least 4 GPUs.
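
For the 2x2 case above, the grouping can be spelled out explicitly (a hypothetical enumeration; real frameworks construct these process groups through their own APIs):

```python
# 2 DP replicas x 2 PP stages = 4 GPUs, matching the picture above
dp_degree, pp_degree = 2, 2

# DP only ever sees GPU0 and GPU1; GPU2 and GPU3 are hidden behind them as the second pipeline stage
pp_groups = [[dp_rank, dp_rank + dp_degree] for dp_rank in range(dp_degree)]                    # [[0, 2], [1, 3]]
dp_groups = [[stage * dp_degree + r for r in range(dp_degree)] for stage in range(pp_degree)]   # [[0, 1], [2, 3]]

print("pipeline groups:", pp_groups)       # each DP rank forwards activations down its own pipeline
print("data-parallel groups:", dp_groups)  # gradient all-reduce happens per pipeline stage
```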

Implementations:
-- [DeepSpeed](https://github.com/microsoft/DeepSpeed)
+- [DeepSpeed](https://github.com/deepspeedai/DeepSpeed)
- [Megatron-LM](https://github.com/NVIDIA/Megatron-LM)
- [Varuna](https://github.com/microsoft/varuna)
- [SageMaker](https://arxiv.org/abs/2111.05972)
@@ -393,7 +393,7 @@ This diagram is from a blog post [3D parallelism: Scaling to trillion-parameter
Since each dimension requires at least 2 GPUs, here you'd need at least 8 GPUs.

Implementations:
-- [DeepSpeed](https://github.com/microsoft/DeepSpeed) - DeepSpeed also includes an even more efficient DP, which they call ZeRO-DP.
+- [DeepSpeed](https://github.com/deepspeedai/DeepSpeed) - DeepSpeed also includes an even more efficient DP, which they call ZeRO-DP.
- [Megatron-LM](https://github.com/NVIDIA/Megatron-LM)
- [Varuna](https://github.com/microsoft/varuna)
- [SageMaker](https://arxiv.org/abs/2111.05972)
@@ -448,7 +448,7 @@ During compute each sequence chunk is projected onto QKV and then gathered to th

![deepspeed-ulysses sp](images/deepspeed-ulysses.png)

-[source](https://github.com/microsoft/DeepSpeed/tree/master/blogs/deepspeed-ulysses)
+[source](https://github.com/deepspeedai/DeepSpeed/tree/master/blogs/deepspeed-ulysses)

On the diagram:
1. Input sequences N are partitioned across P available devices.
@@ -468,7 +468,7 @@ Example: Let's consider seqlen=8K, num_heads=128 and a single node of num_gpus=8
b. the attention computation is done on the first 16 sub-heads
the same logic is performed on the remaining 7 GPUs, each computing 8k attention over its 16 sub-heads

-You can read the specifics of the very efficient comms [here](https://github.com/microsoft/DeepSpeed/tree/master/blogs/deepspeed-ulysses#significant-communication-volume-reduction).
+You can read the specifics of the very efficient comms [here](https://github.com/deepspeedai/DeepSpeed/tree/master/blogs/deepspeed-ulysses#significant-communication-volume-reduction).

DeepSpeed-Ulysses keeps communication volume consistent by increasing GPUs proportional to message size or sequence length.
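
Sticking with the numbers from the example above (and assuming a head dimension of 128, which the text doesn't specify), the per-GPU shapes before and after the all-to-all look like this:

```python
# shape-only illustration of DeepSpeed-Ulysses head/sequence re-partitioning
P, seqlen, num_heads, head_dim = 8, 8192, 128, 128   # head_dim is an assumption

# before attention: each GPU holds its 1K sequence chunk but all 128 heads
per_gpu_before = (seqlen // P, num_heads, head_dim)   # (1024, 128, 128)

# after the all-to-all: each GPU holds the full 8K sequence but only 16 heads
per_gpu_after = (seqlen, num_heads // P, head_dim)    # (8192, 16, 128)

# the per-GPU volume is unchanged - only the partitioning axis moved from sequence to heads
assert per_gpu_before[0] * per_gpu_before[1] == per_gpu_after[0] * per_gpu_after[1]
print(per_gpu_before, "->", per_gpu_after)
```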

@@ -496,7 +496,7 @@ Paper: [Ring Attention with Blockwise Transformers for Near-Infinite Context](ht

SP Implementations:
- [Megatron-LM](https://github.com/NVIDIA/Megatron-LM)
-- [Deepspeed](https://github.com/microsoft/DeepSpeed)
+- [Deepspeed](https://github.com/deepspeedai/DeepSpeed)
- [Colossal-AI](https://colossalai.org/)
- [torchtitan](https://github.com/pytorch/torchtitan)

@@ -659,7 +659,7 @@ If the network were to be 5x faster, that is 212GB/s (1700Gbps) then:

which would be insignificant compared to the compute time, especially if some of it is successfully overlapped with the compute.

-Also, the Deepspeed team empirically [benchmarked a 176B model](https://github.com/microsoft/DeepSpeed/issues/2928#issuecomment-1463041491) on 384 V100 GPUs (24 DGX-2 nodes) and found that:
+Also, the Deepspeed team empirically [benchmarked a 176B model](https://github.com/deepspeedai/DeepSpeed/issues/2928#issuecomment-1463041491) on 384 V100 GPUs (24 DGX-2 nodes) and found that:

1. With 100 Gbps IB, we only have <20 TFLOPs per GPU (bad)
2. With 200-400 Gbps IB, we achieve reasonable TFLOPs around 30-40 per GPU (ok)