Tensor parallelism (TP) is an important memory optimization for training large-scale deep learning models. Despite the popularity of training Hugging Face (HF) models, the model scaling options for the HF trainer were previously limited to sharded data parallelism through ZeRO/FSDP. While ZeRO3 offers superior memory efficiency, it incurs significant communication costs. ZeRO (1/2) has lower communication overhead, but for very large models it cannot be used directly due to memory limitations. Combining TP with ZeRO (1/2) therefore offers a more balanced trade-off between memory and performance. Moreover, TP alleviates the batch scaling limitations imposed by ZeRO/FSDP.
We are pleased to announce that DeepSpeed now provides native automatic tensor parallel training for Hugging Face (HF) transformers. This new feature builds on DeepSpeed's AutoTP mechanism, which was previously restricted to inference. AutoTP training can be combined with ZeRO to unlock unprecedented efficiency benefits for HF model post-training, including:
1. Model scaling with lower communication costs than FSDP/ZeRO3 (e.g., use AutoTP + ZeRO1 to achieve ZeRO3 memory savings).
2. Batch size scaling for faster training and increased throughput.
3. Context length scaling to enable new application scenarios.
We have integrated AutoTP training with ZeRO1 and ZeRO2, with ZeRO3 integration on the way. AutoTP training is available in DeepSpeed versions >= 0.16.4.
The following is a batch scaling experiment of Llama3 8B training conducted on Gaudi2 Accelerator.

Figure 1. Batch scaling experiment on Gaudi2, showing throughput improvements from 2 to 4 cards by combining AutoTP and ZeRO. The micro-batch size (mbs) used is the maximum possible value for the given configuration. A higher speedup indicates better performance.
Figure 2 illustrates the basic flowchart. The division between TP and ZeRO is implemented through the AutoTP parser and the ZeRO wrapper in Accelerate. In addition, the TP-based dataloader and save mechanism are supported in both DeepSpeed and Accelerate.
Although we evaluated AutoTP training with Llama2 & Llama3 models in this blog, we expect compatibility with other Hugging Face models, especially those previously validated with AutoTP inference.
Requirements
- deepspeed >= 0.16.4
- transformers >= 4.50.1
- accelerate >= 1.6.0
Enable TP training
Similar to ZeRO, AutoTP training is enabled through the DeepSpeed configuration file by specifying [tensor_parallel][autotp_size].
"ZeRO_optimization": {
"stage": 1,
"gather_16bit_weights_on_model_save": true,
...
},
"tensor_parallel":{
"autotp_size": 4
},
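For context, here is a minimal sketch of how such a config plugs into a standard HF Trainer run. The script structure, file names, model choice, and dummy dataset below are illustrative assumptions, not the blog's example code.

```python
# Illustrative sketch: a plain Hugging Face Trainer run that picks up a DeepSpeed
# config (assumed to be saved as ds_config.json, containing the snippet above).
# Launch with, e.g.: deepspeed --num_gpus 8 train.py
from datasets import Dataset
from transformers import AutoModelForCausalLM, AutoTokenizer, Trainer, TrainingArguments

model_name = "meta-llama/Llama-2-7b-hf"  # any AutoTP-compatible HF model
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained(model_name)

# Tiny dummy dataset so the sketch is self-contained; replace with your SFT data.
enc = tokenizer(["Hello world"] * 64, truncation=True, padding="max_length", max_length=32)
train_dataset = Dataset.from_dict(dict(enc)).add_column("labels", list(enc["input_ids"]))

args = TrainingArguments(
    output_dir="./autotp_sft",
    per_device_train_batch_size=4,
    gradient_accumulation_steps=1,
    bf16=True,
    deepspeed="ds_config.json",  # the config containing "tensor_parallel": {"autotp_size": 4}
)

Trainer(model=model, args=args, train_dataset=train_dataset).train()
```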
The parallel configuration follows this logic:

tp_size = autotp_size
dp_size = num_gpus / tp_size

Note that the global batch size (gbs) changes with different TP settings:

gbs (dp only) = per_device_batch_size * num_gpus * gradient_accumulation_steps
gbs (dp with tp) = per_device_batch_size * num_gpus / tp_size * gradient_accumulation_steps
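For example, a quick sanity check of these formulas under assumed values (8 GPUs, per-device batch size 4, gradient accumulation 2, autotp_size 4):

```python
# Worked example of the global batch size (gbs) formulas above; all values assumed.
per_device_batch_size = 4
num_gpus = 8
gradient_accumulation_steps = 2
tp_size = 4  # autotp_size

gbs_dp_only = per_device_batch_size * num_gpus * gradient_accumulation_steps
gbs_dp_with_tp = per_device_batch_size * (num_gpus // tp_size) * gradient_accumulation_steps

print(gbs_dp_only)     # 64
print(gbs_dp_with_tp)  # 16  (dp_size = 8 / 4 = 2)
```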
Save Model
Saving checkpoints and model files is fully compatible with HF transformers. The trainer.save_model() method saves the original model. Ensure that gather_16bit_weights_on_model_save is set to true in the DeepSpeed configuration file.
"ZeRO_optimization": {
...
"gather_16bit_weights_on_model_save": true,
},
trainer.save_model(your_saved_path)
Models saved this way can be directly used for HF format inference without intermediate transformations.
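For instance, a hedged sketch of reloading the saved directory with the standard HF API; the path is a placeholder, and the tokenizer should be loaded from the base model if it was not saved alongside the weights.

```python
# Illustrative: the directory written by trainer.save_model() loads directly
# with the standard Hugging Face API. "your_saved_path" is a placeholder.
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained("your_saved_path", torch_dtype="auto")
tokenizer = AutoTokenizer.from_pretrained("your_saved_path")

inputs = tokenizer("DeepSpeed AutoTP training is", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=20)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```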
Saving Checkpoints and Resuming
Saving checkpoints remains compatible with HF transformers. Use trainer.save_state() or set the save interval for automatic saving; these checkpoints can then be used to resume training.
trainer.train(resume_from_checkpoint="your_saved_path/checkpoint-1200")
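A short illustrative sketch of the save-interval approach; the interval, paths, and config name are placeholders, and the Trainer setup is assumed to be the same as above.

```python
# Illustrative: automatic checkpointing every N steps via TrainingArguments,
# then resuming from a specific checkpoint directory. Values are placeholders.
from transformers import TrainingArguments

args = TrainingArguments(
    output_dir="your_saved_path",
    save_strategy="steps",
    save_steps=400,              # writes your_saved_path/checkpoint-400, -800, ...
    deepspeed="ds_config.json",
)

# Later, with the same Trainer setup:
# trainer.train(resume_from_checkpoint="your_saved_path/checkpoint-1200")
```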
We validated AutoTP training using a supervised fine-tuning (SFT) task: stanford_alpaca. The original benchmark model used in this project is Llama2-7B. The example code is also available here.
Training Loss Curve
The following loss curves depict SFT training, where gbs is uniformly set to 32 and other configurations match the default experiment settings from stanford_alpaca. The loss curves are largely consistent across the following setups:
- ZeRO3
- TP only (ZeRO disabled)
- ZeRO1 and ZeRO1 + AutoTP
- ZeRO2 and ZeRO2 + AutoTP
Resuming Training
We tested resuming training from the step-1200 checkpoint with AutoTP + ZeRO1 and AutoTP + ZeRO2; the recovered training curves align with the original ones.
Model Evaluation
We conducted inference evaluations on the MMLU task. The MMLU scores for AutoTP + ZeRO1 vs. ZeRO1 and for AutoTP + ZeRO2 vs. ZeRO2 are consistent, and all show a clear improvement over the base model before SFT.
Groups | Version | Filter | n-shot | Metric | Model before SFT | ZeRO1 DP8 training | ZeRO1 TP4 DP2 training | ZeRO2 DP8 training | ZeRO2 TP4 DP2 training |
---|---|---|---|---|---|---|---|---|---|
mmlu | 2 | none | | acc | 0.4185 ± 0.0041 | 0.4472 ± 0.0041 | 0.4444 ± 0.0041 | 0.4543 ± 0.0041 | 0.4529 ± 0.0041 |
- humanities | 2 | none | | acc | 0.3979 ± 0.0069 | 0.4185 ± 0.0070 | 0.4145 ± 0.0069 | 0.4274 ± 0.0070 | 0.4272 ± 0.0070 |
- other | 2 | none | | acc | 0.4712 ± 0.0089 | 0.5249 ± 0.0087 | 0.5182 ± 0.0088 | 0.5282 ± 0.0087 | 0.5269 ± 0.0087 |
- social sciences | 2 | none | | acc | 0.4742 ± 0.0089 | 0.5070 ± 0.0089 | 0.5083 ± 0.0088 | 0.5151 ± 0.0088 | 0.5115 ± 0.0089 |
- stem | 2 | none | | acc | 0.3428 ± 0.0084 | 0.3549 ± 0.0084 | 0.3539 ± 0.0084 | 0.3622 ± 0.0084 | 0.3609 ± 0.0084 |
Table 1. MMLU scores for Llama2-7B inference.
If you define your own dataloader, please ensure data consistency within deepspeed.utils.groups.get_tensor_model_parallel_group(). DeepSpeed provides basic validation functions to assist with this.
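As a hedged sketch of one way to check this manually in a custom dataloader: the all_gather comparison below is illustrative and is not DeepSpeed's built-in validation helper.

```python
# Illustrative consistency check: every rank within a tensor-parallel group should
# receive an identical batch. Manual check, not DeepSpeed's built-in helper.
import torch
import torch.distributed as dist
from deepspeed.utils import groups

def assert_batch_consistent_across_tp(batch: torch.Tensor) -> None:
    tp_group = groups.get_tensor_model_parallel_group()
    world_size = dist.get_world_size(group=tp_group)
    if world_size == 1:
        return
    # Tensors must live on the communication device (e.g., GPU/HPU) for the backend in use.
    gathered = [torch.empty_like(batch) for _ in range(world_size)]
    dist.all_gather(gathered, batch.contiguous(), group=tp_group)
    for other in gathered[1:]:
        assert torch.equal(gathered[0], other), "Batch differs across TP ranks"
```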
Furthermore, if you are not using the transformers library, you can replace the TensorParallel_Layer layer and its subclasses as needed. See the prepare_tp_model function in unit/model_parallelism/test_autotp_training.py. You can also define custom shard and gather behavior for subclasses of TensorParallel_Layer.
Ongoing work includes:
- Optimization: communication and activation optimizations.
- Usability: support for the Transformers TP plan, decoupling the AutoTP parser, and testing with more models.
In principle, features supported by ZeRO should also work with AutoTP training, though extensive testing is still pending.
Bug reports, enhancement requests, and additional model training examples are welcome.
This work was made possible through a deep collaboration between Intel and Microsoft. The contributors include Mingzhi Liu, Guokai Ma, Kiefer Kuah, Yejing Lai, Kurt Chen, Yejun Guo, Guangxin Xu, Xiaofei Feng, and Yang Wang from Intel; Guanhua Wang and Olatunji Ruwase from Microsoft.