🎵 HarmoniCa: Harmonizing Training and Inference for Better Feature Caching in Diffusion Transformer Acceleration
Yushi Huang*, Zining Wang*, Ruihao Gong📧, Jing Liu, Xinjie Zhang, Jinyang Guo, Xianglong Liu, Jun Zhang📧
(* denotes equal contribution, 📧 denotes corresponding author.)
This is the official implementation of our paper HarmoniCa, a novel training-based framework that sets a new state-of-the-art for block-wise caching of Diffusion Transformers, achieving over 40% latency reduction (i.e., roughly a 1.67× speedup).
Figure: (Left) Generation comparison on DiT-XL/2.
- May 3, 2025: 🔥 We release our Python code for DiT-XL/2 presented in our paper. Have a try!
- May 1, 2025: 🌟 Our paper has been accepted by ICML 2025! 🎉 Cheers!
Diffusion Transformers (DiTs) excel in generative tasks but face practical deployment challenges due to high inference costs. Feature caching, which stores and retrieves redundant computations, offers the potential for acceleration. Existing learning-based caching, though adaptive, overlooks the impact of the prior timestep. It also suffers from misaligned objectives (aligned predicted noise vs. high-quality images) between training and inference. These two discrepancies compromise both performance and efficiency. To this end, we harmonize training and inference with a novel learning-based caching framework dubbed HarmoniCa. It first incorporates Step-Wise Denoising Training (SDT) to ensure the continuity of the denoising process, so that prior steps can be leveraged. In addition, an Image Error Proxy-Guided Objective (IEPO) is applied to balance image quality against cache utilization through an efficient proxy that approximates the image error. Extensive experiments across various models, samplers, and resolutions demonstrate the superior performance and speedup of our framework.
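For intuition, the sketch below illustrates the general block-wise caching idea with a learnable per-step router. Everything here is a hypothetical simplification (the class name `CachedDiTBlock`, the per-step logits, and the thresholding rule are our assumptions, not the HarmoniCa implementation); see `train_router.py` and `sample.py` for the actual code.

```python
# Conceptual sketch only: block-wise feature caching with a learnable router.
import torch
import torch.nn as nn

class CachedDiTBlock(nn.Module):
    """Wraps one transformer block and either recomputes or reuses its output."""

    def __init__(self, block: nn.Module, num_steps: int):
        super().__init__()
        self.block = block
        # One learnable routing logit per denoising step. After training, a step
        # whose sigmoid(logit) exceeds the threshold recomputes the block;
        # otherwise the cached output from a prior step is reused.
        self.router = nn.Parameter(torch.zeros(num_steps))
        self.cache = None

    def forward(self, x: torch.Tensor, step: int, thres: float = 0.1) -> torch.Tensor:
        recompute = self.cache is None or torch.sigmoid(self.router[step]) > thres
        if recompute:
            self.cache = self.block(x)
        return self.cache
```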
After cloning the repository, follow the steps below to train the model and run inference.
With PyTorch (>2.0) installed, run the following commands to install the required packages and download the pre-trained models.
pip install accelerate diffusers timm torchvision wandb
python download.py
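Optionally, you can sanity-check the environment before launching training. This snippet is not part of the repository; it only verifies the PyTorch version and GPU visibility.

```python
# Quick environment check (illustrative, not part of the repo).
import torch

assert int(torch.__version__.split(".")[0]) >= 2, "PyTorch >= 2.0 is required"
print("CUDA available:", torch.cuda.is_available())
print("Visible GPUs:", torch.cuda.device_count())
```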
The following example command trains the cache router for DiT-XL/2. More details about the training setup can be found in our paper.
export CUDA_VISIBLE_DEVICES=0,1,2,3
torchrun --nnodes=1 --nproc_per_node=4 --master_port 12345 train_router.py --results-dir results --model DiT-XL/2 --image-size 256 --num-classes 1000 --epochs 2000 --global-batch-size 64 --global-seed 42 --vae ema --num-works 8 --log-every 100 --ckpt-every 1000 --wandb --num-sampling-steps 10 --l1 7e-8 --lr 0.01 --max-steps 20000 --cfg-scale 1.5 --ste-threshold 0.1 --lambda-c 500
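For intuition only, here is a loose sketch of step-wise training over the full denoising trajectory, which is the flavor of training SDT performs. All names (`stepwise_loss`, `step_fn`) are hypothetical, and the exact SDT and IEPO objectives are defined in the paper and implemented in `train_router.py`.

```python
# Conceptual sketch: accumulate per-step discrepancies along the whole
# denoising trajectory, so training mirrors the inference-time process.
import torch

def stepwise_loss(student, teacher, step_fn, x_init, timesteps):
    """`student` may reuse cached block outputs across steps; `teacher` does not."""
    x_s = x_init.clone()
    x_t = x_init.clone()
    total = x_init.new_zeros(())
    for t in timesteps:
        eps_s = student(x_s, t)              # may reuse cached features
        with torch.no_grad():
            eps_t = teacher(x_t, t)          # cache-free reference pass
        total = total + torch.mean((eps_s - eps_t) ** 2)
        x_s = step_fn(x_s, eps_s, t)         # caller-supplied sampler update
        x_t = step_fn(x_t, eps_t, t)
    return total
```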
Here is the corresponding command for inference.
python sample.py --model DiT-XL/2 --vae ema --image-size 256 --num-classes 1000 --cfg-scale 4 --num-sampling-steps 10 --seed 42 --accelerate-method dynamiclayer --ddim-sample --path Path/To/The/Trained/Router/ --thres 0.1
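To explore how the caching threshold affects the speed/quality trade-off, a small helper like the one below sweeps `--thres` with the same flags as the command above. The helper itself and the chosen threshold values are illustrative examples, not part of the repository.

```python
# Hypothetical helper: sweep the inference-time caching threshold.
import subprocess

for thres in (0.05, 0.1, 0.2):  # example values only
    subprocess.run([
        "python", "sample.py",
        "--model", "DiT-XL/2", "--vae", "ema",
        "--image-size", "256", "--num-classes", "1000",
        "--cfg-scale", "4", "--num-sampling-steps", "10",
        "--seed", "42", "--accelerate-method", "dynamiclayer",
        "--ddim-sample",
        "--path", "Path/To/The/Trained/Router/",
        "--thres", str(thres),
    ], check=True)
```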
- TODO: Training and inference code for PixArt models.
Our code builds upon DiT and Learning-to-Cache.
If you find our HarmoniCa useful or relevant to your research, please kindly cite our paper:
@inproceedings{huang2025harmonica,
title={HarmoniCa: Harmonizing Training and Inference for Better Feature Caching in Diffusion Transformer Acceleration},
author={Yushi Huang and Zining Wang and Ruihao Gong and Jing Liu and Xinjie Zhang and Jinyang Guo and Xianglong Liu and Jun Zhang},
booktitle={Forty-second International Conference on Machine Learning},
year={2025},
}