This summary includes three parts:
- some repositories that you can follow
- some representative person or labs that you can follow
- some important works in the different research interests
For example, LLMSys-PaperList contains many excellent articles and keeps being updated (which I believe is the most important quality of a paper list). Awesome-LLM-Inference and Awesome_LLM_Accelerate-PaperList are also worth reading.
Besides, awesome-AI-system also works very well, and you can find other repositories in its contents.
The blog "Large Transformer Model Inference Optimization" helped me a lot at the beginning.
The blog "OpenAI Keynote on Building Scalable AI Infrastructure" also seems to be leading guidance.
Follow others' research, and find your own ideas.
It is not my intention to judge the work of these pioneers, and I understand that the limits of my knowledge will lead me to leave out many important people.
If you have a different opinion, please feel free to communicate with me through an issue.
In no particular order!!
Apologies in advance if I misremember or misspell some names.
Zhihao JIA: FlexFlow and other impressive work, important role in MLSys, affiliated with CMU
Tianqi CHEN: TVM, XGBoost, and other impressive work, important role in machine learning systems and ML compilers, affiliated with CMU
Song HAN: many important works in efficient ML including sparsity and quantization. BTW, the course TinyML and Efficient Deep Learning Computing is highly recommended; affiliated with MIT
Zhen DONG: many important works in quantization and high-performance ML, affiliated with UCB
Tri DAO: author of FlashAttention, affiliated with Princeton
Ce ZHANG: famous in efficient MLsys, affiliated with UChicago
Ion Stoica: Alpa, Ray, Spark, et al.
SPCL: Scalable Parallel Computing Lab, affiliated with ETHz
Luo MAI: affiliated with University of Edinburgh
IPADS: focuses more on PURE systems, but also makes great progress in MLSys, affiliated with SJTU
EPCC: Emerging Parallel Computing Center, where parallel computing and MLSys are naturally combined, affiliated with SJTU
Xin JIN: FastServe and LLMCad are impressive work, affiliated with PKU
Bin CUI: important role in MLSys including DL, GNN, and MoE, affiliated with PKU
Jidong ZHAI: leading many important work in MLSys, affiliated with THU
Lingxiao MA: many important works in MLSys at top conferences, affiliated with MSRA
Cheng LI: high-performance systems and MLSys, affiliated with USTC
Xupeng Miao: SpotServe, SpecInfer, HET, et al.
Chuan WU: some important works in distributed machine learning systems, affiliated with HKU
James CHENG: affiliated with CUHK
Kai CHEN: database work combines well with MLSys, affiliated with HKUST
Lei CHEN: database work combines well with MLSys; many papers, so I recommend you focus on his top-conference papers, affiliated with HKUST
Yang YOU: leader of Colossal-AI, affiliated with NUS
Wei WANG: works in systems and MLSys, affiliated with HKUST
I hope to summarize these impressive works by their research direction.
But my summary is surely not comprehensive enough, and I am looking forward to your additions.
Perhaps someone should write a detailed survey.
Periodically checking the "cited by" of the papers with ⭐ will be helpful.
Paragraphs with 💡 are not perfect.
- ⭐ Dissecting the Runtime Performance of the Training, Fine-tuning, and Inference of Large Language Models: evaluations help you find the bottleneck
- ⭐ Full Stack Optimization of Transformer Inference: a Survey: a survey by UCB
- ⭐ Towards Efficient Generative Large Language Model Serving: A Survey from Algorithms to Systems: worth a read
- ⭐ Deep Learning Workload Scheduling in GPU Datacenters: A Survey: survey for GPU Datacenters DL Workload Scheduling
- ⭐ Towards Efficient and Reliable LLM Serving: A Real-World Workload Study: a benchmark for LLM serving
- ⭐ LLM Inference Unveiled: Survey and Roofline Model Insights: both survey and analysis
- A SURVEY OF RESOURCE-EFFICIENT LLM AND MULTIMODAL FOUNDATION MODELS: worth reading
- Training and Serving System of Foundation Models: A Comprehensive Survey
- Model Compression and Efficient Inference for Large Language Models: A Survey
- ⭐ Towards Coarse-to-Fine Evaluation of Inference Efficiency for Large Language Models
- ⭐ A Survey on Efficient Inference for Large Language Models: worth reading
- Beyond the Speculative Game: A Survey of Speculative Execution in Large Language Models
- ⭐ Navigating Challenges and Technical Debt in Large Language Models Deployment: important
- The CAP Principle for LLM Serving: another angle
- Demystifying Data Management for Large Language Models: talks about data management in LLMs, by Xupeng MIAO, accepted by SIGMOD'24
- Benchmarking LLM Inference Backends: vLLM, LMDeploy, MLC-LLM, TensorRT-LLM, and TGI: with code
- A Survey on Mixture of Experts
- Analyzing LLM performance: The impact of high-bandwidth memory on model inference: analysis of inference
- Inference Optimization of Foundation Models on AI Accelerators
- LLM Inference Serving: Survey of Recent Advances and Opportunities: newest
- Contemporary Model Compression on Large Language Models Inference: survey in model compression
- ⭐ Fire-Flyer AI-HPC: A Cost-Effective Software-Hardware Co-Design for Deep Learning: bring insights for MLSys
- Resource-efficient Algorithms and Systems of Foundation Models: A Survey
- ⭐ A Survey on Inference Optimization Techniques for Mixture of Experts Models: a survey on MoE models
- Deploying Foundation Model Powered Agent Services: A Survey: survey for AI agent service
Making useful benchmarks or evaluations is helpful.
- MLPerf Inference Benchmark: inference github, a well-known benchmark
- llmperf: evaluate both performance and correctness, but based on ray
- The Importance of Workload Choice in Evaluating LLM Inference Systems: important angles in LLM inference systems
- Vidur: A Large-Scale Simulation Framework For LLM Inference: test the performance of LLM inference
- Metron: Holistic Performance Evaluation Framework for LLM Inference Systems: an evaluation framework
- LLMServingSim: A HW/SW Co-Simulation Infrastructure for LLM Inference Serving at Scale: a simulator
- LLM-Inference-Bench: Inference Benchmarking of Large Language Models on AI Accelerators: inference + hardware
- LLMCompass: Enabling Efficient Hardware Design for Large Language Model Inference: a performance evaluation framework, can be used to estimate the time cost
- Predicting LLM Inference Latency: A Roofline-Driven ML Method: predict inference performance based on the Roofline model
- GUIDE: A Global Unified Inference Engine for Deploying Large Language Models in Heterogeneous Environments: a work for predicting LLMSys performance
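If you want to build a small evaluation of your own before reaching for the tools above, here is a minimal, hypothetical sketch of measuring time-to-first-token (TTFT) and decode throughput for any streaming generation callable; `dummy_generate` is only a stand-in, not a real backend.

```python
import time

def dummy_generate(prompt, max_new_tokens=32):
    """Stand-in for a real streaming backend; yields one token at a time."""
    for i in range(max_new_tokens):
        time.sleep(0.01)  # pretend each decode step takes 10 ms
        yield f"tok{i}"

def benchmark(stream_fn, prompt, max_new_tokens=32):
    start = time.perf_counter()
    ttft, n_tokens = None, 0
    for _ in stream_fn(prompt, max_new_tokens=max_new_tokens):
        now = time.perf_counter()
        if ttft is None:
            ttft = now - start          # time to first token
        n_tokens += 1
    total = time.perf_counter() - start
    decode_tput = (n_tokens - 1) / (total - ttft) if n_tokens > 1 else 0.0
    return {"ttft_s": ttft, "total_s": total, "decode_tok_per_s": decode_tput}

if __name__ == "__main__":
    print(benchmark(dummy_generate, "hello world"))
```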
Medusa: Simple Framework for Accelerating LLM Generation with Multiple Decoding Heads, pdf
prior paper: Blockwise Parallel Decoding for Deep Autoregressive Models
Break the Sequential Dependency of LLM Inference Using Lookahead Decoding: by lookahead decoding
Both frameworks use parallel decoding and deserve more detailed research.
There are some interesting papers about parallel decoding.
- Fast Chain-of-Thought: A Glance of Future from Parallel Decoding Leads to Answers Faster
- Skeleton-of-Thought: Large Language Models Can Do Parallel Decoding
- ProPD: Dynamic Token Tree Pruning and Generation for LLM Parallel Decoding
- APAR: LLMs Can Do Auto-Parallel Auto-Regressive Decoding: how to make it auto-parallel?
In fact, I'm not so familiar with this topic. But perhaps OpenAI's o1 used this...
Spending more time on inference than on pre-training.
- ⭐ Large Language Monkeys: Scaling Inference Compute with Repeated Sampling: Starter material, apply repeated sampling
- ⭐ Scaling LLM Test-Time Compute Optimally can be More Effective than Scaling Model Parameters: starter material, scaling LLM test-time compute to improve accuracy
- Strategic Chain-of-Thought: Guiding Accurate Reasoning in LLMs through Strategy Elicitation: it seems few people have explored the efficiency of CoT; the two-stage method gives me some thoughts
- Fast Best-of-N Decoding via Speculative Rejection: optimize alignment in inference, accepted by NIPS'24
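To make the repeated-sampling idea above concrete, here is a minimal best-of-N sketch; `sample_answer` and `score` are hypothetical stand-ins for a stochastic model and a verifier/reward model, and works like Speculative Rejection above aim to cut the cost of exactly this loop.

```python
import random

def sample_answer(question, temperature=1.0):
    """Hypothetical stand-in for one stochastic LLM sample."""
    return f"answer-{random.randint(0, 3)}"

def score(question, answer):
    """Hypothetical stand-in for a verifier / reward model."""
    return random.random()

def best_of_n(question, n=8):
    # Spend more inference compute: draw N samples, keep the highest-scoring one.
    candidates = [sample_answer(question) for _ in range(n)]
    return max(candidates, key=lambda a: score(question, a))

print(best_of_n("What is 2 + 2?", n=8))
```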
This topic is about GPT-o1, aka the strawberry.
- ⭐ Reverse engineering OpenAI’s o1: a leading blog for introduction in OpenAI’s o1
- ⭐ Chain-of-Thought Prompting Elicits Reasoning in Large Language Models: base work
- Tree of Thoughts: Deliberate Problem Solving with Large Language Models: an improvement based on CoT
- Large Language Model Guided Tree-of-Thought: also a ToT
- Let's Verify Step by Step: verify by step can be helpful
- Language Agent Tree Search Unifies Reasoning Acting and Planning in Language Models: what is Language Agent Tree Search (LATS)? accepted by ICML'24
- Critique-out-Loud Reward Models
- Generative Verifiers: Reward Modeling as Next-Token Prediction: a verifier, by DeepMind
Also known as speculative sampling; a form of model collaboration.
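Before the paper list, a minimal sketch of the draft-then-verify loop these works build on. For clarity it uses greedy matching and calls the target model once per position, whereas the actual speculative sampling papers verify all k draft tokens in a single target forward pass and use a probabilistic accept/reject rule that preserves the target distribution; `draft_next` and `target_next` are hypothetical single-token predictors.

```python
def speculative_decode(prompt, draft_next, target_next, k=4, max_new_tokens=32):
    """Greedy draft-then-verify sketch (not the exact accept/reject rule
    from the speculative sampling papers)."""
    tokens = list(prompt)
    while len(tokens) - len(prompt) < max_new_tokens:
        # 1. The cheap draft model proposes k tokens autoregressively.
        draft, ctx = [], tokens[:]
        for _ in range(k):
            t = draft_next(ctx)
            draft.append(t)
            ctx.append(t)
        # 2. Verify the draft against the target model and keep the longest
        #    matching prefix plus one target token. (Here the target is called
        #    per position for clarity; real systems verify all k positions in
        #    one batched forward pass.)
        accepted = []
        for t in draft:
            expected = target_next(tokens + accepted)
            if t == expected:
                accepted.append(t)
            else:
                accepted.append(expected)  # correction token from the target
                break
        else:
            accepted.append(target_next(tokens + accepted))  # bonus token
        tokens.extend(accepted)
    return tokens

# Toy usage: the "draft" mostly agrees with the "target" but not always.
target = lambda ctx: len(ctx) % 7
draft = lambda ctx: len(ctx) % 7 if len(ctx) % 5 else 0
print(speculative_decode([1, 2, 3], draft, target, k=4, max_new_tokens=10))
```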
- ⭐ Accelerating Large Language Model Decoding with Speculative Sampling: opening of Speculative Decoding, by DeepMind
- ⭐ Fast inference from transformers via speculative decoding: work from a similar period as the one above, by Google, accepted by ICML'23
- SpecInfer: Accelerating Generative Large Language Model Serving with Speculative Inference and Token Tree Verification: paper under guidance of Zhihao JIA, use Tree decoding and a set of draft models
- LLMCad: Fast and Scalable On-device Large Language Model Inference: paper under guidance of Xin JIN, speculative decoding for on-device LLM inference based on tree decoding and other optimizations
- Speculative Decoding with Big Little Decoder: similar to speculative decoding, accepted in NIPS'23
- Online Speculative Decoding: update draft model online
- Predictive Pipelined Decoding: A Compute-Latency Trade-off for Exact LLM Decoding: the trade-off analysis deserves a read
- The Synergy of Speculative Decoding and Batching in Serving Large Language Models: analysis of combining spec decoding with batching
- REST: Retrieval-Based Speculative Decoding: use retrieval for spec decoding, some familiar names in the authors list
- Cascade Speculative Drafting for Even Faster LLM Inference: by UIUC
- Multi-Candidate Speculative Decoding: multiple draft models
- ⭐ Unlocking Efficiency in Large Language Model Inference: A Comprehensive Survey of Speculative Decoding: survey for Speculative Decoding
- BiTA: Bi-Directional Tuning for Lossless Acceleration in Large Language Models
- EAGLE: Speculative Sampling Requires Rethinking Feature Uncertainty
- GliDe with a CaPE: A Low-Hassle Method to Accelerate Speculative Decoding: a work with Yang YOU's name
- Decoding Speculative Decoding: provide some insight into the selection of draft models
- Ouroboros: Speculative Decoding with Large Model Enhanced Drafting: perhaps tree speculative decoding?
- ⭐ Speculative Streaming: Fast LLM Inference without Auxiliary Models: a promising method for speculative decoding
- Sequoia: Scalable, Robust, and Hardware-aware Speculative Decoding: accelerating spec decoding
- Chimera: A Lossless Decoding Method for Accelerating Large Language Models Inference by Fusing all Tokens: accelerate spec decoding by fusing all tokens
- Minions: Accelerating Large Language Model Inference with Adaptive and Collective Speculative Decoding: using several SSMs, adaptive SSM prediction length, pipelining SSM decode and LLM verify
- Recurrent Drafter for Fast Speculative Decoding in Large Language Models
- Optimal Block-Level Draft Verification for Accelerating Speculative Decoding
- Accelerating LLM Inference with Staged Speculative Decoding: token tree and a second stage of speculative decoding
- TriForce: Lossless Acceleration of Long Sequence Generation with Hierarchical Speculative Decoding: combine KV cache with spec decoding
- EMS-SD: Efficient Multi-sample Speculative Decoding for Accelerating Large Language Models: algorithm optimization in spec decoding
- SpecExec: Massively Parallel Speculative Decoding for Interactive LLM Inference on Consumer Devices: any difference with specinfer?
- Optimizing Speculative Decoding for Serving Large Language Models Using Goodput: model the speculative decoding length
- MagicDec: Breaking the Latency-Throughput Tradeoff for Long Context Generation with Speculative Decoding: spec decoding for long-context
- QSpec: Speculative Decoding with Complementary Quantization Schemes: spec decoding with quantization, a novel A+B
- Cerberus: Efficient Inference with Adaptive Parallel Decoding and Sequential Knowledge Enhancement: optimization on Medusa
- The N-Grammys: Accelerating autoregressive inference with learning-free batched speculation: use learning-free, negligible-cost draft strategies, namely N-grams obtained from the model weights and the context
- EdgeLLM: Fast On-device LLM Inference with Speculative Decoding: seems an extended work of LLMCad
- Think Big, Generate Quick: LLM-to-SLM for Fast Autoregressive Decoding: use both LLM and SLM
- Adaptive Skeleton Graph Decoding: successor of Skeleton-of-Thought
Some knowledge about data parallelism, tensor parallelism, and pipeline parallelism will help in this track.
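As a refresher, here is a minimal NumPy sketch of the column-/row-parallel linear layers used by tensor parallelism (Megatron-LM style); real systems place the shards on different GPUs and replace the explicit concatenation/summation with all-gather/all-reduce collectives.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal((2, 8))        # (batch, hidden)
W1 = rng.standard_normal((8, 16))      # first linear layer
W2 = rng.standard_normal((16, 8))      # second linear layer
tp = 2                                 # tensor-parallel degree

# Column-parallel: each rank holds a slice of W1's output columns.
W1_shards = np.split(W1, tp, axis=1)
partial_h = [x @ w for w in W1_shards]          # no communication needed yet
# Row-parallel: each rank holds the matching slice of W2's input rows.
W2_shards = np.split(W2, tp, axis=0)
partial_y = [h @ w for h, w in zip(partial_h, W2_shards)]
y_tp = sum(partial_y)                           # all-reduce in a real system

assert np.allclose(y_tp, (x @ W1) @ W2)         # matches the unsharded result
```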
- ⭐ Efficiently Scaling Transformer Inference: uses model parallelism to accelerate inference, by Google, in MLSys'23
- HexGen: Generative Inference of Foundation Model over Heterogeneous Decentralized Environment: a distributed inference engine that supports asymmetric partitioning of the inference computation
- InternEvo: Efficient Long-sequence Large Language Model Training via Hybrid Parallelism and Redundant Sharding: Efficient Long-sequence training
- Liger: Interleaving Intra- and Inter-Operator Parallelism for Distributed Large Model Inference: accepted by PPoPP'24
- MegaScale: Scaling Large Language Model Training to More Than 10,000 GPUs: full-stack approach of LLM training
- DSP: Dynamic Sequence Parallelism for Multi-Dimensional Transformers: sequence parallel by Yang YOU
- LoongServe: Efficiently Serving Long-context Large Language Models with Elastic Sequence Parallelism: Elastic Sequence Parallelism?
- GraphPipe: Improving Performance and Scalability of DNN Training with Graph Pipeline Parallelism: this could be potential in inference
- TeraPipe: Token-Level Pipeline Parallelism for Training Large-Scale Language Models: pipeline parallelism
- QUART: Latency-Aware FaaS System for Pipelining Large Model Inference: pipeline in serving and fast expanding
- Mnemosyne: Parallelization Strategies for Efficiently Serving Multi-Million Context Length LLM Inference Requests Without Approximations: optimize sequence parallel
- CSPS: A Communication-Efficient Sequence-Parallelism based Serving System for Transformer based Models with Long Prompts: optimize sequence parallel
- ⭐ PipeInfer: Accelerating LLM Inference using Asynchronous Pipelined Speculation: pipeline parallelism and speculation, accepted by SC'24
- Overlap Communication with Dependent Computation via Decomposition in Large Deep Learning Models: overlap comm with comp, similar to Liger
- Centauri: Enabling Efficient Scheduling for Communication-Computation Overlap in Large Model Training via Communication Partitioning: accepted by ASPLOS'24
- T3: Transparent Tracking & Triggering for Fine-grained Overlap of Compute & Collectives: many work about overlap in LLM, accepted by ASPLOS'24
- FLUX: Fast Software-based Communication Overlap On GPUs Through Kernel Fusion: Fine-grained decomposition, perhaps provide some experiment result
- Kraken: Inherently Parallel Transformers For Efficient Multi-Device Inference: modify the model design for fast decoding, based on comm-comp overlapping
- NanoFlow: Towards Optimal Large Language Model Serving Throughput: overlapping based on nano-batches, with some interesting engineering implementation
- Domino: Eliminating Communication in LLM Training via Generic Tensor Slicing and Overlapping: overlapping, provided by Deepspeed team
An enduring topic in efficient machine learning.
We mainly focus on semi-structured and structured pruning because they can accelerate computation.
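For intuition, here is a minimal NumPy sketch of magnitude-based 2:4 (N:M) pruning, the pattern that sparse Tensor Cores can accelerate: in every group of 4 weights along the last dimension, keep the 2 with the largest magnitude. This only builds the sparsity mask; real methods also calibrate or fine-tune to recover accuracy.

```python
import numpy as np

def prune_2_to_4(w: np.ndarray) -> np.ndarray:
    """Zero out the 2 smallest-magnitude weights in every group of 4
    along the last dimension (assumes that dimension is divisible by 4)."""
    out = w.copy()
    groups = out.reshape(-1, 4)
    drop = np.argsort(np.abs(groups), axis=1)[:, :2]   # 2 smallest |w| per group
    np.put_along_axis(groups, drop, 0.0, axis=1)
    return groups.reshape(w.shape)

w = np.random.default_rng(0).standard_normal((4, 8))
w_sparse = prune_2_to_4(w)
assert np.count_nonzero(w_sparse) == w.size // 2       # exactly 50% sparsity
```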
- ⭐ Accelerating Sparse Deep Neural Networks: use N:M sparsity to fully utilize the hardware for acceleration, by Nvidia
- ⭐ Deja Vu: Contextual Sparsity for Efficient LLMs at Inference Time: interesting paper on using sparsity, under guidance of Tri DAO and Ce ZHANG, accepted by ICML'23
- Dynamic Context Pruning for Efficient and Interpretable Autoregressive Transformers
- Dynamic N:M Fine-Grained Structured Sparse Attention Mechanism: accepted by PPoPP'23
- ⭐ PIT: Optimization of Dynamic Sparse Deep Learning Models via Permutation Invariant Transformation: a novel way to deal with dynamic sparsity, may be used for GNN and MoE, accepted by SOSP'23
- DéjàVu: KV-cache Streaming for Fast, Fault-tolerant Generative LLM Serving: seems a follow-up work of Deja Vu, also focuses on the KV-Cache
- FFSplit: Split Feed-Forward Network For Optimizing Accuracy-Efficiency Trade-off in Language Model Inference: sparsity in FFN
- ProSparse: Introducing and Enhancing Intrinsic Activation Sparsity within Large Language Models: a simple and effective sparsification method named "ProSparse"
- Turbo Sparse: Achieving LLM SOTA Performance with Minimal Activated Parameters: work for PowerInfer
- Pruning Large Language Models to Intra-module Low-rank Architecture with Transitional Activations: pruning for LLM
- MInference 1.0: Accelerating Pre-filling for Long-Context LLMs via Dynamic Sparse Attention: inference framework based on sparse attention, by Microsoft
- ReLU Strikes Back: Exploiting Activation Sparsity in Large Language Models: use ReLU to improve sparsity, just like PowerInfer
- CoreInfer: Accelerating Large Language Model Inference with Semantics-Inspired Adaptive Sparse Activation: algorithm optimization that can utilize sparsity to accelerate inference
- Star Attention: Efficient LLM Inference over Long Sequences: a two-phase block-sparse approximation
- Lexico: Extreme KV Cache Compression via Sparse Coding over Universal Dictionaries: use sparse coding over universal dictionaries to compress the KV cache, which is novel
Low-precision for memory and computing efficiency.
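As a baseline for the methods below, here is a minimal sketch of symmetric per-channel INT8 weight quantization and dequantization; the papers in this list refine it with activation outlier handling, group-wise scales, lower bit-widths, KV-cache quantization, and so on.

```python
import numpy as np

def quantize_int8(w: np.ndarray):
    """Symmetric per-output-channel INT8 quantization: w ≈ q * scale."""
    scale = np.abs(w).max(axis=1, keepdims=True) / 127.0   # one scale per row
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize_int8(q: np.ndarray, scale: np.ndarray) -> np.ndarray:
    return q.astype(np.float32) * scale

w = np.random.default_rng(0).standard_normal((4, 64)).astype(np.float32)
q, s = quantize_int8(w)
w_hat = dequantize_int8(q, s)
print("max abs error:", np.abs(w - w_hat).max())   # small, bounded by scale / 2
```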
- Understanding and Overcoming the Challenges of Efficient Transformer Quantization
- ⭐ LLM.int8(): 8-bit Matrix Multiplication for Transformers at Scale: by UW
- ⭐ SmoothQuant: Accurate and Efficient Post-Training Quantization for Large Language Models: paper under guidance of Song HAN
- ⭐ AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration: paper under guidance of Song HAN
- Atom: Low-bit Quantization for Efficient and Accurate LLM Serving: paper under guidance of Tianqi CHEN; quantization itself is not the point, designing how to quantize is; in review for MLSys'24
- FineQuant: Unlocking Efficiency with Fine-Grained Weight-Only Quantization for LLMs
- QUIK: Towards End-to-End 4-Bit Inference on Generative Large Language Models
- Understanding the Impact of Post-Training Quantization on Large Language Models: tech report will help
- ⭐ LLM-FP4: 4-Bit Floating-Point Quantized Transformers: by HKUST, accepted in EMNLP'23
- ⭐ Enabling Fast 2-bit LLM on GPUs: Memory Alignment, Sparse Outlier, and Asynchronous Dequantization: by SJTU, accepted in DAC'24
- INT4 Weight + FP8 KV-Cache: optimization for LLM inference: INT4 weight + FP8 KV-Cache + continuous batching
- KIVI: A Tuning-Free Asymmetric 2bit Quantization for KV Cache
- KVQuant: Towards 10 Million Context Length LLM Inference with KV Cache Quantization: quant KV cache
- QUICK: Quantization-aware Interleaving and Conflict-free Kernel for efficient LLM inference: simple and crude optimization work
- LLM-PQ: Serving LLM on Heterogeneous Clusters with Phase-Aware Partition and Adaptive Quantization: for heterogeneous clusters and adaptive quantization, under guidance of Chuan WU, accepted by PPoPP'24 (poster)
- IntactKV: Improving Large Language Model Quantization by Keeping Pivot Tokens Intact: use pivot token
- QAQ: Quality Adaptive Quantization for LLM KV Cache
- QServe: W4A8KV4 Quantization and System Co-design for Efficient LLM Serving: quantization in inference, under guidance of Song HAN
- Does compressing activations help model parallel training?: analysis of compression (including pruning and quantization) in MP training, accepted by MLSys'24
- Unlocking Data-free Low-bit Quantization with Matrix Decomposition for KV Cache Compression: compress KV cache with quantization
- Mitigating Quantization Errors Due to Activation Spikes in GLU-Based LLMs: with targeted activate function
- FP6-LLM: Efficiently Serving Large Language Models Through FP6-Centric Algorithm-System Co-Design: FPx quantization, accepted by ATC'24
- Demystifying the Compression of Mixture-of-Experts Through a Unified Framework: combine quantization with MoE
- PQCache: Product Quantization-based KVCache for Long Context LLM Inference: apply quantization and Maximum Inner-Product Search for KV Cache compression
- Fast Matrix Multiplications for Lookup Table-Quantized LLMs: provide efficient kernels for lookup quantization
- Ladder: Enabling Efficient Low-Precision Deep Learning Computing through Hardware-aware Tensor Transformation: a computation optimization for Low-Precision
- Quant-LLM: Accelerating the Serving of Large Language Models via FP6-Centric Algorithm-System Co-Design on Modern GPUs: a computation optimization for 6-bit LLM
- Mixture of Experts with Mixture of Precisions for Tuning Quality of Service: quantization on MoE models
- Zero-Delay QKV Compression for Mitigating KV Cache and Network Bottlenecks in LLM Inference: compress the KV Cache
- ABQ-LLM: Arbitrary-Bit Quantized Inference Acceleration for Large Language Models: quantization matrix multiplication of arbitrary precision combinations based on BTC (Binary TensorCore) equivalents
- Progressive Mixed-Precision Decoding for Efficient LLM Inference: gradual lowering of precision deeper in the generated sequence, together with a spectrum of precision-switching schedulers
- COMET: Towards Practical W4A4KV4 LLMs Serving: provides a quantization algorithm, quantization kernels, and an SM schedule method
- MixQ: Taming Dynamic Outliers in Mixed-Precision Quantization by Online Prediction: quantization with outliers, optimization on AWQ, accepted by SC'24
- Flash Communication: Reducing Tensor Parallelization Bottleneck for Fast Large Language Model Inference: low-bit compression to accelerate communication
- Unifying KV Cache Compression for Large Language Models with LeanKV: combine quantization and sparsity to compress the KV cache
- MixLLM: LLM Quantization with Global Mixed-precision between Output-features and Highly-efficient System Design: mix quantization, effectively assigning the larger bit-width to output features that need it most to achieve good accuracy with low memory consumption
Perhaps the most important way to improve throughput in LLM inference.
The blog Dissecting Batching Effects in GPT Inference helped me a lot at the beginning.
Update 2023/12/12: I'd like to use the term Continuous Batching in place of the Dynamic Batching I used before. The name Dynamic Batching is more likely to be used in Triton.
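To make "continuous batching" concrete, here is a minimal sketch of iteration-level scheduling: the batch is re-formed at every decode step, retiring finished requests immediately and admitting waiting ones; `decode_one_step` is a hypothetical stand-in for one batched forward pass.

```python
import random
from collections import deque

def decode_one_step(batch):
    """Hypothetical stand-in for one batched forward pass: returns the
    requests that emitted EOS at this step."""
    return [r for r in batch if random.random() < 0.2]

def continuous_batching(requests, max_batch_size=8, max_steps=1000):
    waiting, running, done = deque(requests), [], []
    for step in range(max_steps):
        # Admit new requests whenever a slot frees up (iteration-level scheduling).
        while waiting and len(running) < max_batch_size:
            running.append(waiting.popleft())
        if not running:
            break
        for r in decode_one_step(running):
            running.remove(r)            # retire finished requests immediately,
            done.append((step, r))       # without waiting for the rest of the batch
    return done

for step, req in continuous_batching([f"req{i}" for i in range(20)]):
    print(f"step {step:3d}: finished {req}")
```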
- ⭐ Orca: A Distributed Serving System for Transformer-Based Generative Models: continuous batching without redundant computation, accepted by OSDI'22
- Fast Distributed Inference Serving for Large Language Models: considering Job Completion Time(JCT) in LLM serving, paper under guidance of Xin JIN
- Response Length Perception and Sequence Scheduling: An LLM-Empowered LLM Inference Pipeline: schedule based on response length prediction by LLM, paper under guidance of Yang YOU
- S3: Increasing GPU Utilization during Generative Inference for Higher Throughput: idea similar to above, by Harvard University
- SARATHI: Efficient LLM Inference by Piggybacking Decodes with Chunked Prefills: blocking the prefill phase and reduce pipeline bubbles, by MSRIndia
- Flover: A Temporal Fusion Framework for Efficient Autoregressive Model Parallel Inference: accepted by HiPC'23
- Handling heavy-tailed input of transformer inference on GPUs: accepted by ICS'22
- CoFB: latency-constrained co-scheduling of flows and batches for deep learning inference service on the CPU–GPU system: Some form of inference service
- TCB: Accelerating Transformer Inference Services with Request Concatenation: perhaps similar to ByteTransformer, accepted by ICPP'22
- Fairness in Serving Large Language Models: under guidance of Ion Stoica, accepted by OSDI'24
- Characterizing and understanding deep neural network batching systems on GPUs: benchmarking is important
- Hydragen: High-Throughput LLM Inference with Shared Prefixes
- RelayAttention for Efficient Large Language Model Serving with Long System Prompts: think about the memory access of KV cache
- Taming Throughput-Latency Tradeoff in LLM Inference with Sarathi-Serve: follow-up work of sarathi
- Efficient Interactive LLM Serving with Proxy Model-based Sequence Length Prediction: predict length
- LiveMind: Low-latency Large Language Models with Simultaneous Inference: perform inferences with incomplete prompts, to take advantage of streaming prompt
- A Queueing Theoretic Perspective on Low-Latency LLM Inference with Variable Token Length: theoretical analysis of latency
- ElasticBatch: A Learning-Augmented Elastic Scheduling System for Batch Inference on MIG
- Prepacking: A Simple Method for Fast Prefilling and Increased Throughput in Large Language Models: seems similar to ORCA or bytetransformer?
- BATON: Enhancing Batch-wise Inference Efficiency for Large Language Models via Dynamic Re-batching: optimization on ORCA, dynamic re-batching
- EcoServe: Maximizing Multi-Resource Utilization with SLO Guarantees in LLM Serving: A fusion monster with a variety of optimization techniques
- ⭐ AcceLLM: Accelerating LLM Inference using Redundancy for Load Balancing and Data Locality: what is the redundancy here?
This part includes some impressive works that optimize LLM computation by observing the underlying computational properties, such as FlashAttention.
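The core trick behind FlashAttention and Flash-Decoding is computing softmax(qKᵀ)V block by block with a running max and normalizer, so the full attention row never has to be materialized in slow memory. Below is a minimal single-query NumPy sketch of that rescaling logic; the real kernels also tile over queries and fuse everything on-chip.

```python
import numpy as np

def online_softmax_attention(q, K, V, block=32):
    """Single-query attention computed over key/value blocks with a running
    max and normalizer, so no full softmax row is ever stored."""
    m = -np.inf                                 # running max of the logits
    l = 0.0                                     # running softmax normalizer
    acc = np.zeros_like(V[0], dtype=np.float64) # running weighted sum of values
    for start in range(0, len(K), block):
        k_blk, v_blk = K[start:start + block], V[start:start + block]
        s = k_blk @ q                           # logits for this block
        m_new = max(m, s.max())
        correction = np.exp(m - m_new)          # rescale previous partial results
        p = np.exp(s - m_new)
        l = l * correction + p.sum()
        acc = acc * correction + p @ v_blk
        m = m_new
    return acc / l

rng = np.random.default_rng(0)
q = rng.standard_normal(64)
K, V = rng.standard_normal((200, 64)), rng.standard_normal((200, 64))
logits = K @ q
ref = (np.exp(logits - logits.max()) / np.exp(logits - logits.max()).sum()) @ V
assert np.allclose(online_softmax_attention(q, K, V), ref)
```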
- ⭐ FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness: one of the most important work these years, both simple and easy to use, by Tri DAO
- ⭐ FlashAttention-2: Faster Attention with Better Parallelism and Work Partitioning: you'd better not ignore it
- ⭐ Flash-Decoding for long-context inference: you'd better not ignore it, too
- ⭐ Flash-LLM: Enabling Cost-Effective and Highly-Efficient Large Generative Model Inference with Unstructured Sparsity: successor to FlashAttention in inference, accepted by VLDB'24
- ⭐ FlashDecoding++: Faster Large Language Model Inference on GPUs: worth reading, Flash-Decoding follow-up
- SubGen: Token Generation in Sublinear Time and Memory
- DeFT: Flash Tree-attention with IO-Awareness for Efficient Tree-search-based LLM Inference
- Lean Attention: Hardware-Aware Scalable Attention Mechanism for the Decode-Phase of Transformers: modification in self-attention
- FlashAttention-3: Fast and Accurate Attention with Asynchrony and Low-precision
- Flex Attention: A Programming Model for Generating Optimized Attention Kernels: auto-generated attention kernel
- Splitwise: Efficient generative LLM inference using phase splitting: splitting prefill and decode in a map-reduce style, by UW and Microsoft
- DistServe: Disaggregating Prefill and Decoding for Goodput-optimized Large Language Model Serving: also split the prefill and decode, accepted by OSDI'24
- Inference without Interference: Disaggregate LLM Inference for Mixed Downstream Workloads: seems a combination of SARATHI and Splitwise
- ExeGPT: Constraint-Aware Resource Scheduling for LLM Inference: similar to splitwise, accepted by ASPLOS'24
- Splitwiser: Efficient LLM Inference with Constrained Resources
- ToEx: Accelerating Generation Stage of Transformer-based Language Models via Token-adaptive Early Exit: Token-adaptive Early Exit
- Automatic Task Parallelization of Dataflow Graphs in ML/DL models
- MonoNN: Enabling a New Monolithic Optimization Space for Neural Network Inference Tasks on Modern GPU-Centric Architectures: compilation optimization on the computation graph
- POD-Attention: Unlocking Full Prefill-Decode Overlap for Faster LLM Inference: optimize attention kernel in mix-batching
This part is inspired by the PagedAttention of vLLM, and there are many top-conference papers discussing memory management for DL computing on GPUs.
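For background, here is a minimal sketch of the PagedAttention idea: KV entries live in fixed-size physical blocks, and each sequence keeps a block table mapping logical token positions to those blocks, so memory is allocated on demand without large contiguous reservations. Real systems do the gather inside the attention kernel; this is only an illustration.

```python
import numpy as np

BLOCK_SIZE, NUM_BLOCKS, HEAD_DIM = 16, 64, 8
kv_pool = np.zeros((NUM_BLOCKS, BLOCK_SIZE, 2, HEAD_DIM))   # physical KV blocks
free_blocks = list(range(NUM_BLOCKS))

class Sequence:
    def __init__(self):
        self.length = 0
        self.block_table = []            # logical block index -> physical block

    def append_kv(self, k, v):
        if self.length % BLOCK_SIZE == 0:                  # current block is full
            self.block_table.append(free_blocks.pop())     # allocate on demand
        blk = self.block_table[-1]
        kv_pool[blk, self.length % BLOCK_SIZE, 0] = k
        kv_pool[blk, self.length % BLOCK_SIZE, 1] = v
        self.length += 1

    def gather_keys(self):
        """Gather this sequence's keys via the block table (what a paged
        attention kernel does implicitly)."""
        rows = [kv_pool[self.block_table[i // BLOCK_SIZE], i % BLOCK_SIZE, 0]
                for i in range(self.length)]
        return np.stack(rows)

seq = Sequence()
for t in range(40):                                   # 40 tokens -> 3 blocks
    seq.append_kv(np.full(HEAD_DIM, t), np.full(HEAD_DIM, -t))
print(len(seq.block_table), seq.gather_keys().shape)  # 3 blocks, keys (40, 8)
```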
- ⭐ Efficient Memory Management for Large Language Model Serving with PagedAttention: memory page management for the KV-Cache in Attention-type models, accepted by SOSP'23 (many papers cite the vLLM project instead of this paper, which makes it harder for us to find its cited-by)
- ⭐ AutoScratch: ML-Optimized Cache Management for Inference-Oriented GPUs: cache management for inference, accepted by MLSys'23
- Improving Computation and Memory Efficiency for Real-world Transformer Inference on GPUs: block-based data layout, accepted by TACO'October-2023
- AttMEMO : Accelerating Transformers with Memoization on Big Memory Systems: a unique observation that there is rich similarity in attention computation across inference sequences
- BPIPE: memory-balanced pipeline parallelism for training large language models: memory balance can perhaps also work well in inference, by SNU, accepted by ICML'23
- Improving Large Language Model Throughput with Efficient LongTerm Memory Management: perhaps a new view
- CacheGen: Fast Context Loading for Language Model Applications
- Infinite-LLM: Efficient LLM Service for Long Context with DistAttention and Distributed KVCache
- Quantized Side Tuning: Fast and Memory-Efficient Tuning of Quantized Large Language Models: consider the memory consumption in fine-tuning
- Get More with LESS: Synthesizing Recurrence with KV Cache Compression for Efficient LLM Inference
- Dynamic Memory Compression: Retrofitting LLMs for Accelerated Inference: compress KV Cache
- LLM as a System Service on Mobile Devices: LLM as a service on Mobile devices
- DistMind: Efficient Resource Disaggregation for Deep Learning Workloads: by Xin JIN, accepted by ToN'Jan24
- ALISA: Accelerating Large Language Model Inference via Sparsity-Aware KV Caching: sparsity in KV Cache, accepted by ISCA'24
- AttentionStore: Cost-effective Attention Reuse across Multi-turn Conversations in Large Language Model Serving: a hierarchical KV caching system that leverages cost-effective memory/storage mediums to save KV caches for all requests
- vAttention: Dynamic Memory Management for Serving LLMs without PagedAttention: improve PagedAttention
- Layer-Condensed KV Cache for Efficient Inference of Large Language Models: only computes and caches the KVs of a small number of layers
- MiniCache: KV Cache Compression in Depth Dimension for Large Language Models: compress KV cache
- CacheBlend: Fast Large Language Model Serving with Cached Knowledge Fusion: very popular idea recently
- Block Transformer: Global-to-Local Language Modeling for Fast Inference: build KV Cache block from many tokens' KV Cache
- MemServe: Context Caching for Disaggregated LLM Serving with Elastic Memory Pool: KV Cache management in P/D disaggregation arch
- Cost-Efficient Large Language Model Serving for Multi-turn Conversations with CachedAttention: multi-round chat and memory management, accepted by ATC'24
- Stateful Large Language Model Serving with Pensieve: similar to cachedattention
- Mooncake: A KVCache-centric Disaggregated Architecture for LLM Serving: P/D disaggregation architecture and KV Cache management
- P/D-Serve: Serving Disaggregated Large Language Model at Scale: a P/D based system, with D2D access optimization
- InfiniGen: Efficient Generative Inference of Large Language Models with Dynamic KV Cache Management: offload KV Cache
- Keep the Cost Down: A Review on Methods to Optimize LLM's KV-Cache Consumption: a survey for optimizing KV Cache
- vTensor: Flexible Virtual Tensor Management for Efficient LLM Serving: tensor management especially for llm inference
- Token-Picker: Accelerating Attention in Text Generation with Minimized Memory Transfer via Probability Estimation: remove unimportant tokens in KV Cache
- CacheGen: KV Cache Compression and Streaming for Fast Large Language Model Serving: compression and streaming transfer of the KV Cache, accepted by SIGCOMM'24
- Compute Or Load KV Cache? Why Not Both?: recompute and load together for long context
- LayerKV: Optimizing Large Language Model Serving with Layer-wise KV Cache Management: manage KV Cache by layers
- Harnessing Your DRAM and SSD for Sustainable and Accessible LLM Inference with Mixed-Precision and Multi-level Caching: compress KV cache and multi-level memory
- EPIC: Efficient Position-Independent Context Caching for Serving Large Language Models: better prefix-cache
- ShadowKV: KV Cache in Shadows for High-Throughput Long-Context LLM Inference: Low-rank KV cache and dynamic rebuild KV cache
- ⭐ VL-Cache: Sparsity and Modality-Aware KV Cache Compression for Vision-Language Model Inference Acceleration: the first work I see that optimize KV cache in vision models
- ArkVale: Efficient Generative LLM Inference with Recallable Key-Value Eviction: KV cache page evict and recall, accepted by NIPS'24
- SpeedLoader: An I/O efficient scheme for heterogeneous and distributed LLM operation: an optimization on ZeRO? redesigns the data flow of heterogeneous hardware and sharded model training to minimize excessive communication overhead, accepted by NIPS'24
- ⭐ KunServe: Elastic and Efficient Large Language Model Serving with Parameter-centric Memory Management: memory management for KV cache and parameter, seems a novel work considering the weights migration
- SYMPHONY: Improving Memory Management for LLM Inference Workloads: dynamically migrates K,V caches to enable finegrained scheduling of inference requests
- Reducing shared memory footprint to leverage high throughput on Tensor Cores and its flexible API extension library: implement some APIs to reduce the shared memory footprint, accepted in HPC Asia'23
- Benchmarking and Dissecting the Nvidia Hopper GPU Architecture: help us understand GPUs
- SLO-aware GPU Frequency Scaling for Energy Efficient LLM Inference Serving: optimizing energy consumption by lowering GPU frequency
- Foreseer: Knowledge-Driven Acceleration of Memory-Bound Matrix Multiplications for Large Language Model Inference: similar to cutlass, optimization on intel GPU
- Tackling the Dynamicity in a Production LLM Serving System with SOTA Optimizations via Hybrid Prefill/Decode/Verify Scheduling on Efficient Meta-kernels: for Huawei Ascend (perhaps it also works for NVIDIA?)
Heterogeneous scenarios and single PCs are becoming increasingly important.
Optimizing the computation on CPUs or SSDs calls for different methods.
- Efficient LLM Inference on CPUs: LLMs with quantization on CPUs, by Intel, accepted by NIPS'23
- Inference Performance Optimization for Large Language Models on CPUs: xFasterTransformer, LLM inference optimization on CPUs, by Intel
- Distributed Inference Performance Optimization for LLMs on CPUs: similar work to the above, by Intel
- Exploiting Intel Advanced Matrix Extensions (AMX) for Large Language Model Inference: inference on CPUs based on advanced hardware
- TURNIP: A "Nondeterministic" GPU Runtime with CPU RAM Offload: free to run operations such as GPU kernel calls in many different orders
- Improving Throughput-oriented Generative Inference with CPUs: cooperation of CPUs and GPUs, accepted by APSys'23
- Chrion: Optimizing Recurrent Neural Network Inference by Collaboratively Utilizing CPUs and GPUs: execute the operators on the CPU and GPU in parallel, by SJTU
- EdgeNN: Efficient Neural Network Inference for CPU-GPU Integrated Edge Devices: inference on edge devices, accepted by ICDE'23
- PowerInfer: Fast Large Language Model Serving with a Consumer-grade GPU: by SJTU IPADS
- LLM in a flash: Efficient Large Language Model Inference with Limited Memory: by Apple
- Efficient LLM inference solution on Intel GPU: Intel GPUs are interesting
- FastDecode: High-Throughput GPU-Efficient LLM Serving using Heterogeneous Pipelines: efficient serving with a CPU-GPU system
- Efficient and Economic Large Language Model Inference with Attention Offloading: similar to FastDecode
- Petals: Collaborative Inference and Fine-tuning of Large Models: looks like heterogeneous resources are being utilized
- NoMAD-Attention: Efficient LLM Inference on CPUs Through Multiply-add-free Attention
- ⭐ A Quantitative Analysis and Guidelines of Data Streaming Accelerator in Modern Intel Xeon Scalable Processors: use CPUs for DL, accepted by ASPLOS'24
- LM-Offload: Performance Model-Guided Generative Inference of Large Language Models with Parallelism Control: based on offloading
- T-MAC: CPU Renaissance via Table Lookup for Low-Bit LLM Deployment on Edge: computation on CPUs with quantization
- TBA: Faster Large Language Model Training Using SSD-Based Activation Offloading: how to use SSDs?
- InstInfer: In-Storage Attention Offloading for Cost-Effective Long-Context LLM Inference: offload the KV Cache to CSDs (Computational Storage Drives)
- TwinPilots: A New Computing Paradigm for GPU-CPU Parallel LLM Inference: some ideas on using CPUs
- Improving Throughput-oriented LLM Inference with CPU Computations: pipelining in CPU-GPU inference
- Understanding Performance Implications of LLM Inference on CPUs: analysis of using CPUs for inference
- Pie: Pooling CPU Memory for LLM Inference: use CPU memory to enlarge batch size and improve throughput, by Ion Stoica
- NEO: Saving GPU Memory Crisis with CPU Offloading for Online LLM Inference: offload KV cache and attention to the CPU for a larger batch size, similar to FastDecode, by Ion Stoica
- Task Scheduling for Efficient Inference of Large Language Models on Single Moderate GPU Systems: more like inference on personal devices
- Efficient LLM Inference with I/O-Aware Partial KV Cache Recomputation: use recomputation and transfer together to re-produce the KV cache; can use their runtime and split parallelism
Inspired by the AI PC, this opens up a new area.
- FlexGen: High-Throughput Generative Inference of Large Language Models with a Single GPU: inference a 30B model with a 16GB GPU, accepted by ICML'23
- LLM as a System Service on Mobile Devices: an intro for LLM on private devices
- PowerInfer: Fast Large Language Model Serving with a Consumer-grade GPU: based on sparsity in NN Layers
- ⭐ LLM for Mobile: An Initial Roadmap: a road map
- PowerInfer-2: Fast Large Language Model Inference on a Smartphone: work on smartphone
- Cambricon-LLM: A Chiplet-Based Hybrid Architecture for On-Device Inference of 70B LLM: on edge devices, accepted by MICRO'24
- FusionAI: Decentralized Training and Deploying LLMs with Massive Consumer-Level GPUs: a decentralized system on consumer-level GPUs, though there will be some problems
- Distributed Inference and Fine-tuning of Large Language Models Over The Internet: some techniques in this paper will be instructive
- ⭐ HeteGen: Heterogeneous Parallel Inference for Large Language Models on Resource-Constrained Devices: heterogeneous parallel computing using CPUs and GPUs
- Metis: Fast Automatic Distributed Training on Heterogeneous GPUs: accepted by ATC'24
- Helix: Distributed Serving of Large Language Models via Max-Flow on Heterogeneous GPUs: algorithmic analysis for heterogeneous GPUs
- Mélange: Cost Efficient Large Language Model Serving by Exploiting GPU Heterogeneity: making heterogeneity-aware GPU provisioning decisions for LLM serving
In this part, researchers provide some algorithm-based methods to optimize LLM inference.
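Many of the entries below share the "heavy hitter" intuition from H2O/Scissorhands/Keyformer: score cached tokens by the attention they have received and evict low-scoring ones while protecting recent tokens. A minimal sketch of that selection step is below; the papers differ in the exact scoring, budgets, and eviction policies.

```python
import numpy as np

def select_kv_to_keep(attn_weights, budget, num_recent=4):
    """attn_weights: (num_queries, num_cached_tokens) attention observed so far.
    Keep the `num_recent` newest tokens plus the highest-scoring 'heavy hitter'
    tokens, up to `budget` tokens in total."""
    num_tokens = attn_weights.shape[1]
    scores = attn_weights.sum(axis=0)                  # accumulated attention per token
    recent = set(range(num_tokens - num_recent, num_tokens))
    candidates = [i for i in np.argsort(scores)[::-1] if i not in recent]
    keep = sorted(recent | set(candidates[:max(0, budget - num_recent)]))
    return keep                                        # indices of KV entries to keep

rng = np.random.default_rng(0)
attn = rng.random((8, 32))
print(select_kv_to_keep(attn, budget=12))              # 12 of 32 KV entries survive
```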
- H2O: Heavy-Hitter Oracle for Efficient Generative Inference of Large Language Models: accepted by NIPS'23
- Scissorhands: Exploiting the Persistence of Importance Hypothesis for LLM KV Cache Compression at Test Time: consider the different importance of tokens in KV Cache, similar to H2O
- ⭐ SkipDecode: Autoregressive Skip Decoding with Batching and Caching for Efficient LLM Inference: skipping may be a useful method, like spec decoding
- Inference with Reference: Lossless Acceleration of Large Language Models: also a potential optimization
- Efficient Streaming Language Models with Attention Sinks: streaming LLM for infinite sequence lengths, by MIT and under guidance of Song HAN
- Keyformer: KV Cache Reduction through Key Tokens Selection for Efficient Generative Inference: also important tokens, just like H2O, accepted by MLSys'24
- Q-Hitter: A Better Token Oracle for Efficient LLM Inference via Sparse-Quantized KV Cache: an optimization to H2O, accepted by MLSys'24
- RetrievalAttention: Accelerating Long-Context LLM Inference via Vector Retrieval: use approximate nearest neighbor search to search the most relevant KV cache
- CritiPrefill: A Segment-wise Criticality-based Approach for Prefilling Acceleration in LLMs: based on observation: adjacent query tokens tend to focus on similar subsets of the past Key-Value (KV) cache
- TidalDecode: Fast and Accurate LLM Decoding with Position Persistent Sparse Attention: sparse attention
- SwiftKV: Fast Prefill-Optimized Inference with Knowledge-Preserving Model Transformation: algorithm optimization for less KV Cache
- Activation Sequence Caching: High-Throughput and Memory-Efficient Generative Inference with a Single GPU: use characterization results to optimize KV Cache management
- ⭐ DeepSpeed Inference: Enabling Efficient Inference of Transformer Models at Unprecedented Scale: you must know DeepSpeed
- DeepSpeed-FastGen: High-throughput Text Generation for LLMs via MII and DeepSpeed-Inference
- DeepSpeed Model Implementations for Inference (MII)
- ByteTransformer: A High-Performance Transformer Boosted for Variable-Length Inputs: developed by ByteDance, accepted by IPDPS'23
- TurboTransformers: an efficient GPU serving system for transformer models: by Tencent Inc, accepted by PPoPP'21
- Accelerating Generative AI with PyTorch II: GPT, Fast: a blog in PyTorch, use only PyTorch code, gpt-fast
- FlexFlow Serve: Low-Latency, High-Performance LLM Serving: based on FlexFlow
- FlashInfer: Kernel Library for LLM Serving
- Efficiently Programming Large Language Models using SGLang: we can get some optimization from here
- Inferflow: an Efficient and Highly Configurable Inference Engine for Large Language Models: different parallel, by Tencent
LLM server providers will focus on this part. Engineering practices are just as important as algorithm optimization.
- ⭐ AlpaServe: Statistical Multiplexing with Model Parallelism for Deep Learning Serving: accepted by OSDI'23
- ⭐ STI: Turbocharge NLP Inference at the Edge via Elastic Pipelining: elasticity will be important in the future, accepted by ASPLOS'23
- INFaaS: Automated Model-less Inference Serving: accepted by ATC'21
- Tabi: An Efficient Multi-Level Inference System for Large Language Models: under guidance of Kai CHEN, accepted by EuroSys'23
- FrugalGPT: How to Use Large Language Models While Reducing Cost and Improving Performance: cost is what the service provider cares about most
- FaaSwap: SLO-Aware, GPU-Efficient Serverless Inference via Model Swapping
- Shockwave: Fair and Efficient Cluster Scheduling for Dynamic Adaptation in Machine Learning: accepted by NSDI'23
- Cocktail: A Multidimensional Optimization for Model Serving in Cloud: model ensembling, accepted by NSDI'22
- SLA-Driven ML INFERENCE FRAMEWORK FOR CLOUDS WITH HETEROGENEOUS ACCELERATORS: accepted by MLSys'22
- FaST-GShare: Enabling Efficient Spatio-Temporal GPU Sharing in Serverless Computing for Deep Learning Inference: accepted by ICPP'23
- Flashpoint: A Low-latency Serverless Platform for Deep Learning Inference Serving
- BATCH: Machine Learning Inference Serving on Serverless Platforms with Adaptive Batching: accepted by SC'20
- MArk: exploiting cloud services for cost-effective, SLO-aware machine learning inference serving: accepted by ATC'19
- ⭐ MLaaS in the Wild: Workload Analysis and Scheduling in Large-Scale Heterogeneous GPU Clusters: challenges and solutions in real-world scenarios, accepted by NSDI'22
- SuperServe: Fine-Grained Inference Serving for Unpredictable Workloads: under the guidance of Ion Stoica
- Learned Best-Effort LLM Serving: a best-effort serving system by UCB
- Microsecond-scale Preemption for Concurrent GPU-accelerated DNN Inferences: accepted by OSDI'22, enables microsecond-scale kernel preemption and controlled concurrent execution in GPU scheduling
- PipeSwitch: fast pipelined context switching for deep learning applications: PipeSwitch, a system that enables unused cycles of an inference application to be filled by training or other inference applications, accepted by OSDI'20
- ⭐ Paella: Low-latency Model Serving with Software-defined GPU Scheduling: how the tasks are scheduled to GPUs, accepted by SOSP'23
- OTAS: An Elastic Transformer Serving System via Token Adaptation: elasticity in serving while considering SLOs
- DeltaZip: Multi-Tenant Language Model Serving via Delta Compression: multi-tenancy is interesting
- ServerlessLLM: Locality-Enhanced Serverless Inference for Large Language Models: finds different problems in serving LLMs
- Fast and Efficient Model Serving Using Multi-GPUs with Direct-Host-Access: accepted by EuroSys'23
- Towards Pareto Optimal Throughput in Small Language Model Serving: small language model serving
- MOPAR: A Model Partitioning Framework for Deep Learning Inference Services on Serverless Platforms
- Andes: Defining and Enhancing Quality-of-Experience in LLM-Based Text Streaming Services: the idea of QoE
- FlexLLM: A System for Co-Serving Large Language Model Inference and Parameter-Efficient Finetuning: how to find novel questions?
- Deferred Continuous Batching in Resource-Efficient Large Language Model Serving: similar to FlexLLM
- LLMServingSim: A Simulation Infrastructure for LLM Inference Serving Systems: provides some features about LLM serving
- Slice-Level Scheduling for High Throughput and Load Balanced LLM Serving: improvements to ORCA (SLS) and FastServe (ILS)
- Offline Energy-Optimal LLM Serving: Workload-Based Energy Models for LLM Inference on Heterogeneous Systems: considers serving efficiency from the energy view
- Power-aware Deep Learning Model Serving with μ-Serve: considers energy
- Eloquent: A More Robust Transmission Scheme for LLM Token Streaming: a new token transmission scheme, useful in chatbots
- Responsive ML inference in multi-tenanted environments using AQUA: serves several LLMs by time-sharing GPU cycles and offloading context to other GPUs in multi-tenanted environments
- Towards SLO-Optimized LLM Serving via Automatic Inference Engine Tuning: the effect of hyper-parameters in the inference engine
- Intelligent Router for LLM Workloads: Improving Performance Through Workload-Aware Scheduling: request scheduling
- Efficient LLM Scheduling by Learning to Rank: rank requests based on output length prediction and schedule accordingly
- UELLM: A Unified and Efficient Approach for LLM Inference Serving: serving optimization in MaaS clouds
- One Queue Is All You Need: Resolving Head-of-Line Blocking in Large Language Model Serving: scheduling the requests
- ConServe: Harvesting GPUs for Low-Latency and High-Throughput Large Language Model Serving: harvest stranded GPU resources for offline LLM inference tasks
- LLM-Pilot: Characterize and Optimize Performance of your LLM Inference Services: accepted by SC'24
- Revisiting SLO and Goodput Metrics in LLM Serving: checks the SLO and goodput metrics in LLM serving
- Hops: Fine-grained heterogeneous sensing, efficient and fair Deep Learning cluster scheduling system: schedules tasks in a multi-tenant deep learning (DL) cluster, accepted by SoCC'24
- ⭐ Ensuring Fair LLM Serving Amid Diverse Applications: ensures fair LLM access across diverse applications, with a Copilot trace analysis
- BlendServe: Optimizing Offline Inference for Auto-regressive Large Models with Resource-aware Batching: exploits the relaxed latency requirements in offline batch inference to reorder and overlap requests with varied resource demands while ensuring high prefix sharing
- BatchLLM: Optimizing Large Batched LLM Inference with Global Prefix Sharing and Throughput-oriented Token Batching: similar to BlendServe
- ⭐ A System for Microserving of LLMs: seems an idea and industrial practice that makes sense
- Enabling Elastic Model Serving with MultiWorld: optimizing collective communication lib for LLM inference
- Flexible Scheduling of Network and Computing Resources for Distributed AI Tasks
- AdapCC: Making Collective Communication in Distributed Machine Learning Adaptive: communicating strategy based on runtime, ICDCS'24
- Crux: GPU-Efficient Communication Scheduling for Deep Learning Training: a communication scheduler that aims to maximize GPU computation utilization by mitigating the communication contention among DLT jobs, SIGCOMM'24
- TENPLEX: Changing Resources of Deep Learning Jobs using Parallelizable Tensor Collections: by Luo MAI, similar to SpotServe?
- SpotServe: Serving Generative Large Language Models on Preemptible Instances: by Xupeng MIAO and under guidance of Zhihao JIA
- Parcae: Proactive, Liveput-Optimized DNN Training on Preemptible Instances: by team of SpotServe
- FaPES: Enabling Efficient Elastic Scaling for Serverless Machine Learning Platforms: a FaaS-oriented Performance-aware Elastic Scaling system to enable efficient resource allocation in serverless platforms for ML jobs, accepted by SoCC'24
- Compass: A Decentralized Scheduler for Latency-Sensitive ML Workflows: scheduler for latency-sensitive request
- Llumnix: Dynamic Scheduling for Large Language Model Serving: scheduling across multiple instances may be helpful for me now
- Arlo: Serving Transformer-based Language Models with Dynamic Input Lengths: solves dynamic input lengths by multi-instance deployment and request scheduling
- Intelligent Router for LLM Workloads: Improving Performance Through Workload-Aware Scheduling: scheduling based on an output length predictor
- Is the GPU Half-Empty or Half-Full? Practical Scheduling Techniques for LLMs: request scheduling in cluster and on instance
- Fast Inference for Augmented Large Language Models: schedule for Augmented LLM
- ALISE: Accelerating Large Language Model Serving with Speculative Scheduling: a hodgepodge of prediction-based scheduling, memory management, and quantization
- The Effect of Scheduling and Preemption on the Efficiency of LLM Inference Serving: cost model in request scheduling
- Queue Management for SLO-Oriented Large Language Model Serving: scheduling for requests with different models and different SLO requirements
- FastSwitch: Optimizing Context Switching Efficiency in Fairness-aware Large Language Model Serving: fairness and request switch
- ChunkAttention: Efficient Self-Attention with Prefix-Aware KV Cache and Two-Phase Partition: share prefix and optimize KV Cache
- S-LoRA: Serving Thousands of Concurrent LoRA Adapters: the beginning of serving for LoRA, under the guidance of Ion Stoica, accepted by MLSys'24
- Dynamic LoRA Serving System for Offline Context Learning: successor of S-LoRA
- CaraServe: CPU-Assisted and Rank-Aware LoRA Serving for Generative LLM Inference: serving LoRA is becoming more and more important
- PUNICA: MULTI-TENANT LORA SERVING: accepted by MLSys'24
- Petals: Collaborative Inference and Fine-tuning of Large Models
- LoRA-Switch: Boosting the Efficiency of Dynamic LLM Adapters via System-Algorithm Co-design: maybe useful, kernel optimization
- dLoRA: Dynamically Orchestrating Requests and Adapters for LoRA LLM Serving: accepted by OSDI'24
- Enhancing LoRA Model Serving Capacity via Adaptive Operator Scheduling for Multi-Tenancy on GPU: optimize SGMV kernels
- V-LoRA: An Efficient and Flexible System Boosts Vision Applications with LoRA LMM: LoRA for vision models, and optimize LoRA kernels
- Efficient Multi-task LLM Quantization and Serving for Multiple LoRA Adapters: facilitates the sharing of a single quantized model for multiple LoRA adapters, accepted by NIPS'24
- Comparative Analysis and Optimization of LoRA Adapter Co-serving for Large Language Models: more like a survey for LoRA serving
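For context on what the systems above are serving, here is a minimal NumPy sketch of a LoRA adapter at inference time: the frozen base weight is shared across tenants, and only the small low-rank factors differ per adapter; multi-tenant systems such as S-LoRA and Punica batch these small low-rank computations across many adapters (e.g. with SGMV kernels).

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_out, rank, alpha = 64, 64, 8, 16.0
W = rng.standard_normal((d_in, d_out))            # frozen base weight (shared)
A = rng.standard_normal((d_in, rank)) * 0.01      # per-adapter low-rank factors
B = rng.standard_normal((rank, d_out)) * 0.01

def lora_forward(x, W, A, B, alpha, rank):
    # Base path plus the low-rank update; only A and B differ per tenant/adapter.
    return x @ W + (alpha / rank) * (x @ A @ B)

x = rng.standard_normal((4, d_in))
print(lora_forward(x, W, A, B, alpha, rank).shape)  # (4, 64)
```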
For LoRA, but not for serving:
- ASPEN: High-Throughput LoRA Fine-Tuning of Large Language Models with a Single GPU
- LoRAMoE: Alleviate World Knowledge Forgetting in Large Language Models via MoE-Style Plugin: potential new style of LoRA
- Higher Layers Need More LoRA Experts
- GaLore: Memory-Efficient LLM Training by Gradient Low-Rank Projection
- FlexLLM: A System for Co-Serving Large Language Model Inference and Parameter-Efficient Finetuning: how to find novel questions?
- LoRA Meets Dropout under a Unified Framework: Analyze LoRA algorithmically
- HydraLoRA: An Asymmetric LoRA Architecture for Efficient Fine-Tuning: algorithm optimization for LoRA
- SBoRA: Low-Rank Adaptation with Regional Weight Updates: an algorithm optimization for LoRA
- A Survey on LoRA of Large Language Models: survey of LoRA, including parallel LoRA computing and Multi-LoRA, github
- mLoRA: Fine-Tuning LoRA Adapters via Highly-Efficient Pipeline Parallelism in Multiple GPUs: can study the LoRA-aware pipeline parallelism scheme, github
- MixLoRA: Enhancing Large Language Models Fine-Tuning with LoRA-based Mixture of Experts: LoRA based MoE, github
- GongBu: Easily Fine-tuning LLMs for Domain-specific Adaptation: LLM fine-tuning tools
- Deferred Continuous Batching in Resource-Efficient Large Language Model Serving
- Latency-Guaranteed Co-Location of Inference and Training for Reducing Data Center Expenses: place training and inference together, control the inference latency to the desired SLO, while maximizing the throughput of the training jobs co-located on the same GPUs, accepted by ICDCS'24
Long context is a hot topic recently.
- Challenges in Deploying Long-Context Transformers: A Theoretical Peak Performance Analysis
- Ada-KV: Optimizing KV Cache Eviction by Adaptive Budget Allocation for Efficient LLM Inference: like an update to H2O or Deja Vu et al.; each attention head has a different memory budget
- Context Parallelism for Scalable Million-Token Inference
- TokenSelect: Efficient Long-Context Inference and Length Extrapolation for LLMs via Dynamic Token-Level KV Cache Selection: select some important KV cache to take part in attention computation
Processing different ML workloads in a cluster.
- PAL: A Variability-Aware Policy for Scheduling ML Workloads in GPU Clusters: serve multiple different loads in GPU cluster, accepted by SC'24
- PipeLLM: Fast and Confidential Large Language Model Services with Speculative Pipelined Encryption: why Encryption in LLM inference? by IPADS, accepted by ASPLOS'25
- Topology-aware Preemptive Scheduling for Co-located LLM Workloads: schedule different workloads
- ⭐ Chameleon: a heterogeneous and disaggregated accelerator system for retrieval-augmented language models: retrieval will be helpful, but how to use it?
- Generative Dense Retrieval: Memory Can Be a Burden: accepted by EACL'24
- ⭐ Accelerating Retrieval-Augmented Language Model Serving with Speculation: also a paper for RaLM
- RAGCache: Efficient Knowledge Caching for Retrieval-Augmented Generation: improve RAG inference with caching, under guidance of Xin JIN
- FlashRAG: A Modular Toolkit for Efficient Retrieval-Augmented Generation Research
- NinjaLLM: Fast, Scalable and Cost-effective RAG using Amazon SageMaker and AWS Trainium and Inferentia2
- Speculative RAG: Enhancing Retrieval Augmented Generation through Drafting: RAG with spec decoding, different draft models with different RAG
- CacheBlend: Fast Large Language Model Serving for RAG with Cached Knowledge Fusion: optimize KV cache reuse(prefix cache)
- RAGServe: Fast Quality-Aware RAG Systems with Configuration Adaptation: trade-off between latency and quality
Here are two repositories that have some papers for MoE: Papers: MoE/Ensemble, and MOE papers to read.
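As background for the systems below, here is a minimal NumPy sketch of top-k token routing in an MoE layer: a router scores experts per token, the top-k experts process the token, and their outputs are combined with normalized gate weights. The systems in this section mostly fight the resulting all-to-all communication, load imbalance, and expert memory footprint.

```python
import numpy as np

rng = np.random.default_rng(0)
num_experts, d_model, top_k = 8, 16, 2
tokens = rng.standard_normal((10, d_model))                  # 10 tokens
router_w = rng.standard_normal((d_model, num_experts))
experts = [rng.standard_normal((d_model, d_model)) for _ in range(num_experts)]

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

logits = tokens @ router_w                                   # (10, num_experts)
topk_idx = np.argsort(logits, axis=1)[:, -top_k:]            # chosen experts per token
gates = softmax(np.take_along_axis(logits, topk_idx, axis=1))

out = np.zeros_like(tokens)
for t in range(tokens.shape[0]):
    for j in range(top_k):                                   # dispatch + weighted combine
        e = topk_idx[t, j]
        out[t] += gates[t, j] * (tokens[t] @ experts[e])
print(out.shape)                                             # (10, 16)
```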
- ⭐ DeepSpeed-MoE: Advancing Mixture-of-Experts Inference and Training to Power Next-Generation AI Scale: accepted by ICML'22
- Accelerating Distributed MoE Training and Inference with Lina: both training and inference, accepted by ATC'23
- MegaBlocks: Efficient Sparse Training with Mixture-of-Experts: accepted by MLSys'23
- Tutel: Adaptive Mixture-of-Experts at Scale: accepted by MLSys'23
- Pre-gated MoE: An Algorithm-System Co-Design for Fast and Scalable Mixture-of-Expert Inference: accepted by ISCA'24
- Optimizing Mixture of Experts using Dynamic Recompilations: under guidance of Zhihao JIA
- Serving MoE Models on Resource-constrained Edge Devices via Dynamic Expert Swapping: expert swapping is interesting
- Towards MoE Deployment: Mitigating Inefficiencies in Mixture-of-Expert (MoE) Inference: some hot optimizations for inference, accepted by NIPS'24
- Exploiting Transformer Activation Sparsity with Dynamic Inference
- SE-MoE: A Scalable and Efficient Mixture-of-Experts Distributed Training and Inference System
- Who Says Elephants Can't Run: Bringing Large Scale MoE Models into Cloud Scale Production: accepted by ACL'22
- Fast Inference of Mixture-of-Experts Language Models with Offloading: combines MoE with offloading (see the offloading sketch after this list)
- ⭐ MoE-Infinity: Activation-Aware Expert Offloading for Efficient MoE Serving: under guidance of Luo MAI, activation-aware expert offloading for MoE inference
- Fiddler: CPU-GPU Orchestration for Fast Inference of Mixture-of-Experts Models
- FlexMoE: Scaling Large-scale Sparse Pre-trained Model Training via Dynamic Device Placement: trains MoE with a new scheduling plan, may also work for inference
- Exploiting Inter-Layer Expert Affinity for Accelerating Mixture-of-Experts Model Inference
- EdgeMoE: Fast On-Device Inference of MoE-based Large Language Models: quantized experts and expert management
- Toward Inference-optimal Mixture-of-Expert Large Language Models: analysis of MoE training choices based on inference cost
- Parm: Efficient Training of Large Sparsely-Activated Models with Dedicated Schedules: dedicated schedules and communication optimization for MP+EP+ESP MoE training, maybe useful for inference, accepted by InfoCom'24
- SiDA: Sparsity-Inspired Data-Aware Serving for Efficient and Scalable Large Mixture-of-Experts Models: based on offloading, accepted by MLSys'24
- Merge, Then Compress: Demystify Efficient SMoE with Hints from Its Routing Policy: introduces some features of MoE, accepted by ICLR'24
- Demystifying the Compression of Mixture-of-Experts Through a Unified Framework: also introduces some features of MoE
- Dynamic Mixture of Experts: An Auto-Tuning Approach for Efficient Transformer Models: an introductory paper on an algorithm-side change to MoE
- Efficient All-to-All Collective Communication Schedules for Direct-Connect Topologies: all-to-all communication, accepted by HPDC'24
- Scattered Mixture-of-Experts Implementation: ScatterMoE, an implementation of sparse MoE
- Shortcut-connected Expert Parallelism for Accelerating Mixture-of-Experts: the shortcut connection looks more like an algorithmic optimization, and it opens up opportunities for overlapping
- DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model: an open-source model whose inference is based on expert parallelism
- SwapMoE: Serving Off-the-shelf MoE-based Large Language Models with Tunable Memory Budget: MoE expert offloading, at the cost of reduced accuracy
- ProMoE: Fast MoE-based LLM Serving using Proactive Caching: an optimization on Pre-gated MoE, by IPADS
- Read-ME: Refactorizing LLMs as Router-Decoupled Mixture of Experts with System Co-Design: a pre-gating router decoupled from the MoE backbone enables system-friendly pre-computing and lookahead scheduling, accepted by NIPS'24
- MoEsaic: Shared Mixture of Experts: shares experts among different MoE instances; "MoE's modular architecture lets users compose their model from popular off-the-shelf experts" is a new scenario
- HOBBIT: A Mixed Precision Expert Offloading System for Fast MoE Inference: uses quantization to reduce the overhead of loading uncached experts, targets edge devices
- ExpertFlow: Optimized Expert Activation and Token Allocation for Efficient Mixture-of-Experts Inference: prediction- and offloading-based optimization
- MoE-Lightning: High-Throughput MoE Inference on Memory-constrained GPUs: uses an offloading pipeline to accelerate MoE inference on a single GPU
- ⭐ MoE-CAP: Cost-Accuracy-Performance Benchmarking for Mixture-of-Experts Systems: benchmarking for MoE systems
- ⭐ Lynx: Enabling Efficient MoE Inference through Dynamic Batch-Aware Expert Selection: damn! I had considered this before :( . The key insight is that expert importance varies significantly across tokens and inference phases; Lynx exploits this to avoid activating all experts
- ⭐ EPS-MoE: Expert Pipeline Scheduler for Cost-Efficient MoE Inference: GEMM implementation optimization and all-to-all communication overlap
- ⭐ Optimizing Mixture-of-Experts Inference Time Combining Model Deployment and Communication Scheduling: optimizes the all-to-all order and co-locates experts from different models
- ScheMoE: An Extensible Mixture-of-Experts Distributed Training System with Tasks Scheduling: schedules computation and communication in MoE training, perhaps useful for MoE inference, accepted by EuroSys'24
- ST-MoE: Designing Stable and Transferable Sparse Expert Models: an early work on MoE
- Lancet: Accelerating Mixture-of-Experts Training via Whole Graph Computation-Communication Overlapping: computation-communication overlapping, accepted by MLSys'24
- Scaling Beyond the GPU Memory Limit for Large Mixture-of-Experts Model Training: training with offloading, accepted by ICML'24
- MPMoE: Memory Efficient MoE for Pre-Trained Models With Adaptive Pipeline Parallelism
- Prediction Is All MoE Needs: Expert Load Distribution Goes from Fluctuating to Stabilizing: predicts expert workload to optimize training; the load stabilizes in the middle and late stages of training, though this may not transfer well to inference
- SmartMoE: Efficiently Training Sparsely-Activated Models through Combining Offline and Online Parallelization: parallel strategies for MoE, accepted by ATC'23
- APTMoE: Affinity-Aware Pipeline Tuning for MoE Models on Bandwidth-Constrained GPU Nodes: fine-tunes MoE models with CPU help plus some algorithmic insights, accepted by SC'24
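Many of the serving papers above (MoE-Infinity, Fiddler, SwapMoE, ProMoE, HOBBIT, MoE-Lightning) optimize the same basic mechanism: top-k gating activates only a few experts, so cold experts can live off-GPU and be loaded on demand. Below is a minimal, illustrative sketch of that mechanism under my own assumptions, not any system's real implementation; prediction, prefetching, caching, and quantization are all omitted.

```python
# Hedged sketch of top-k gated MoE inference with expert offloading: experts live
# off-GPU and only the ones selected by the router are moved to the compute device
# on demand (here simply via .to(device)).
import torch
import torch.nn as nn

class OffloadedMoE(nn.Module):
    def __init__(self, hidden, n_experts=8, top_k=2, device="cpu"):
        super().__init__()
        self.router = nn.Linear(hidden, n_experts)
        self.experts = nn.ModuleList(nn.Linear(hidden, hidden) for _ in range(n_experts))
        self.top_k = top_k
        self.device = device          # where activated experts are executed

    def forward(self, x):             # x: (tokens, hidden)
        gate = self.router(x).softmax(dim=-1)
        weights, idx = gate.topk(self.top_k, dim=-1)      # (tokens, top_k)
        out = torch.zeros_like(x)
        for e in idx.unique().tolist():                   # only touch activated experts
            expert = self.experts[e].to(self.device)      # "load" the expert on demand
            mask = (idx == e)                             # (tokens, top_k)
            token_ids = mask.any(dim=-1).nonzero(as_tuple=True)[0]
            w = (weights * mask).sum(dim=-1)[token_ids].unsqueeze(-1)
            out[token_ids] += w * expert(x[token_ids])
        return out

moe = OffloadedMoE(hidden=64)
y = moe(torch.randn(10, 64))
print(y.shape)
```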
- MOSEL: Inference Serving Using Dynamic Modality Selection: improving system throughput by 3.6x with an accuracy guarantee and shortening job completion times by 11x
- Generative AI Beyond LLMs: System Implications of Multi-Modal Generation: by META
- Speed Is All You Need: On-Device Acceleration of Large Diffusion Models via GPU-Aware Optimizations: by Google
- Accelerating Text-to-image Editing via Cache-enabled Sparse Diffusion Inference: optimization for diffusion models by cache
- DISTMM: Accelerating distributed multimodal model training: helpful although it is made for training, accepted by NSDI'24
- Addressing Model and Data Heterogeneity in Multimodal Large Language Model Training: distributed multimodal training
- DistTrain: Addressing Model and Data Heterogeneity with Disaggregated Training for Multimodal Large Language Models: disaggregated multimodal model training, under guidance of Xin JIN; multimodal is getting more popular recently
- Efficient Multi-Task Large Model Training via Data Heterogeneity-aware Model Management: efficient multimodal model training
- Approximate Caching for Efficiently Serving Text-to-Image Diffusion Models: serving diffusion models, accepted by NSDI'24 (see the caching sketch after this list)
- DiffusionPipe: Training Large Diffusion Models with Efficient Pipelines: accepted by MLSys'24
- SwiftDiffusion: Efficient Diffusion Model Serving with Add-on Modules: more papers in diffusion models
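The approximate-caching idea for diffusion serving (the NSDI'24 paper above) can be pictured as reusing a partially denoised latent from a similar earlier prompt and only running the remaining steps. The sketch below is my own toy illustration under assumed embeddings, thresholds, and a stand-in denoising step; it is not the paper's system.

```python
# Hedged sketch of approximate caching for text-to-image serving: if a new prompt's
# embedding is close to a cached one, start denoising from that prompt's saved
# intermediate latent instead of pure noise, skipping the early steps.
import numpy as np

latent_cache = []   # list of (prompt_embedding, intermediate_latent, step)

def lookup(embedding, threshold=0.9):
    for cached_emb, latent, step in latent_cache:
        sim = embedding @ cached_emb / (np.linalg.norm(embedding) * np.linalg.norm(cached_emb))
        if sim >= threshold:
            return latent, step          # resume from this partially denoised latent
    return None, 0                       # cache miss: start from pure noise

def serve(embedding, total_steps, denoise_step):
    latent, start = lookup(embedding)
    if latent is None:
        latent = np.random.randn(4, 64, 64)
    for step in range(start, total_steps):
        latent = denoise_step(latent, step)
        if step == total_steps // 2:     # save a midpoint latent for future reuse
            latent_cache.append((embedding, latent.copy(), step))
    return latent

img = serve(np.random.randn(768), total_steps=30,
            denoise_step=lambda x, t: x * 0.98)   # stand-in for a real scheduler step
print(img.shape)
```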
What is this? Maybe serving multiple LLMs and compound AI systems?
- Are More LLM Calls All You Need? Towards Scaling Laws of Compound Inference Systems: a new scenario, by Stanford
- ALTO: An Efficient Network Orchestrator for Compound AI Systems: also new to me, by Stanford
- Proteus: A High-Throughput Inference-Serving System with Accuracy Scaling: accuracy scaling is interesting, accepted by ASPLOS'24
- MuxServe: Flexible Multiplexing for Efficient Multiple LLM Serving: multiple LLMs
- ROUTERBENCH: A Benchmark for Multi-LLM Routing System: but what is multi-LLM?
- Expert Router: Orchestrating Efficient Language Model Inference through Prompt Classification
- BlockLLM: Multi-tenant Finer-grained Serving for Large Language Models
- Prompt Cache: Modular Attention Reuse for Low-Latency Inference: prompt KV cache reuse, accepted by MLSys'24
- Preble: Efficient Distributed Prompt Scheduling for LLM Serving: similar to BlockLLM?
- Conveyor: Efficient Tool-aware LLM Serving with Tool Partial Execution: for LLM-based Applications
- RouteLLM: Learning to Route LLMs with Preference Data: use multiple LLMs for efficient serving (a toy routing sketch follows this list)
- USHER: Holistic Interference Avoidance for Resource Optimized ML Inference: inference several models simultaneously
- Teola: Towards End-to-End Optimization of LLM-based Applications: end-to-end optimization
- Parrot: Efficient Serving of LLM-based Applications with Semantic Variable: accepted by OSDI'24
- Orion: Interference-aware, Fine-grained GPU Sharing for ML Applications: many LLM apps share GPU, accepted by EuroSys'24
- Characterization of Large Language Model Development in the Datacenter: fault-tolerant serving in the future?
- Lazarus: Resilient and Elastic Training of Mixture-of-Experts Models with Adaptive Expert Placement: Fault Tolerance in MoE training
- Partial Experts Checkpoint: Efficient Fault Tolerance for Sparse Mixture-of-Experts Model Training: checkpointing in MoE
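Routing-style systems above (RouteLLM, Expert Router) share one core loop: score the request, then pick a cheap or expensive model. Here is a toy sketch with a placeholder difficulty score and placeholder model callables; none of this is any system's real API.

```python
# Hedged sketch of multi-LLM routing: a cheap difficulty score decides whether a
# request goes to a small or a large model. Score, threshold, and models are toys.
def route(query, small_model, large_model, threshold=0.5):
    difficulty = min(len(query.split()) / 100.0, 1.0)   # toy proxy for query difficulty
    model = large_model if difficulty >= threshold else small_model
    return model(query)

small = lambda q: f"[7B answer] {q[:20]}..."
large = lambda q: f"[70B answer] {q[:20]}..."
print(route("What is 2 + 2?", small, large))
print(route(" ".join(["explain"] * 80) + " transformers in depth", small, large))
```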
It is usually related to CPU-GPU heterogeneity and GPU power consumption.
- DynamoLLM: Designing LLM Inference Clusters for Performance and Energy Efficiency
- Offline Energy-Optimal LLM Serving: Workload-Based Energy Models for LLM Inference on Heterogeneous Systems
- Apparate: Rethinking Early Exits to Tame Latency-Throughput Tensions in ML Serving: early exits, accepted by SOSP'24 (see the early-exit sketch after this list)
- Improving DNN Inference Throughput Using Practical, Per-Input Compute Adaptation: early exits and some system optimization, accepted by SOSP'24
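Both SOSP'24 papers above exploit early exits. Below is a minimal, illustrative sketch of the mechanism (intermediate exit heads plus a confidence threshold); the model, heads, and threshold are my assumptions, not Apparate's design.

```python
# Hedged sketch of early-exit inference: after each block, a small exit head
# estimates the prediction confidence; if it is high enough we stop early and
# skip the remaining layers.
import torch
import torch.nn as nn

class EarlyExitNet(nn.Module):
    def __init__(self, hidden=64, n_layers=6, n_classes=10, threshold=0.9):
        super().__init__()
        self.layers = nn.ModuleList(nn.Linear(hidden, hidden) for _ in range(n_layers))
        self.exits = nn.ModuleList(nn.Linear(hidden, n_classes) for _ in range(n_layers))
        self.threshold = threshold

    def forward(self, x):
        for depth, (layer, exit_head) in enumerate(zip(self.layers, self.exits)):
            x = torch.relu(layer(x))
            probs = exit_head(x).softmax(dim=-1)
            if probs.max() >= self.threshold:            # confident enough: exit now
                return probs, depth
        return probs, len(self.layers) - 1               # fell through to the last layer

net = EarlyExitNet()
probs, exited_at = net(torch.randn(1, 64))
print(exited_at)
```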
Wise men learn from others.
- Orca 2: Teaching Small Language Models How to Reason
- FiDO: Fusion-in-Decoder optimized for stronger performance and faster inference: optimization for retrieval-augmented language models
- Optimizing Dynamic Neural Networks with Brainstorm: this idea has the potential to go further, accepted by OSDI'23
- Ring Attention with Blockwise Transformers for Near-Infinite Context: blockwise attention that passes KV blocks around a ring of devices for near-infinite context (see the blockwise-attention sketch after this list)
- Reducing Activation Recomputation in Large Transformer Models: by NVIDIA
- Cheaply Estimating Inference Efficiency Metrics for Autoregressive Transformer Models: an interesting performance metric, accepted by NIPS'23
- FEC: Efficient Deep Recommendation Model Training with Flexible Embedding Communication: accepted by SIGMOD'23
- Efficient Multi-GPU Graph Processing with Remote Work Stealing: accepted by ICDE'23
- ARK: GPU-driven Code Execution for Distributed Deep Learning: accepted by NSDI'23
- Sequential Aggregation and Rematerialization: Distributed Full-batch Training of Graph Neural Networks on Large Graphs: accepted by MLSys'22
- Golgi: Performance-Aware, Resource-Efficient Function Scheduling for Serverless Computing: Scheduling for Serverless Computing
- FastFold: Optimizing AlphaFold Training and Inference on GPU Clusters: expand to other ML models instead of LLM
- Arrow Matrix Decomposition: A Novel Approach for Communication-Efficient Sparse Matrix Multiplication
- FinGPT-HPC: Efficient Pretraining and Finetuning Large Language Models for Financial Applications with High-Performance Computing
- Two-Face: Combining Collective and One-Sided Communication for Efficient Distributed SpMM: efficient SpMM, accepted by ASPLOS'24
- GMLake: Efficient and Transparent GPU Memory Defragmentation for Large-scale DNN Training with Virtual Memory Stitching: GPU memory pool, accepted by ASPLOS'24
- QuickLLaMA: Query-aware Inference Acceleration for Large Language Models: an inference-friendly LLaMA architecture
- HybridFlow: A Flexible and Efficient RLHF Framework: framework for RLHF
- Marconi: Prefix Caching for the Era of Hybrid LLMs: prefix caching for new model architectures that combine attention with SSMs
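Ring Attention (listed above) relies on blockwise attention with online-softmax rescaling, so KV blocks can be consumed one at a time as they arrive from neighboring devices. The sketch below shows only the single-device numerical core under my own simplifications; the distributed ring exchange is omitted.

```python
# Hedged sketch of blockwise attention with online-softmax rescaling, the building
# block of Ring Attention: process KV blocks one at a time while keeping a running
# (max, normalizer, accumulator) so the result matches full attention.
import torch

def blockwise_attention(q, k, v, block=128):
    d = q.shape[-1]
    m = torch.full((q.shape[0], 1), float("-inf"))   # running max per query
    l = torch.zeros(q.shape[0], 1)                   # running softmax normalizer
    acc = torch.zeros_like(q)                        # running weighted-value sum
    for start in range(0, k.shape[0], block):
        kb, vb = k[start:start + block], v[start:start + block]
        s = q @ kb.T / d ** 0.5                      # (Q, block) scores
        m_new = torch.maximum(m, s.max(dim=-1, keepdim=True).values)
        scale = (m - m_new).exp()
        p = (s - m_new).exp()
        acc = acc * scale + p @ vb
        l = l * scale + p.sum(dim=-1, keepdim=True)
        m = m_new
    return acc / l

q, k, v = torch.randn(4, 64), torch.randn(1024, 64), torch.randn(1024, 64)
ref = (q @ k.T / 64 ** 0.5).softmax(dim=-1) @ v
print(torch.allclose(blockwise_attention(q, k, v), ref, atol=1e-5))   # True
```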
I'd like to create a separate area for data flows. It's just my preference.
- ⭐ FLAT: An Optimized Dataflow for Mitigating Attention Bottlenecks: dataflow in inference
- Pathways: Asynchronous Distributed Dataflow for ML: accepted by MLSys'22
- VirtualFlow: Decoupling Deep Learning Models from the Underlying Hardware: accepted by MLSys'22
How about data pre-processing overhead in training?
Just my preference.
- Boosting Distributed Full-graph GNN Training with Asynchronous One-bit Communication
- GNNPipe: Scaling Deep GNN Training with Pipelined Model Parallelism
- PckGNN: Optimizing Aggregation Operators with Packing Strategies in Graph Neural Networks: accepted by IPDPS'24
- NPA: Improving Large-scale Graph Neural Networks with Non-parametric Attention: SIGMOD'24
- Eliminating Data Processing Bottlenecks in GNN Training over Large Graphs via Two-level Feature Compression: compress node features in graph, accepted by VLDB'24
- Mega: More Efficient Graph Attention for GNNs: optimize graph attention efficiency, ICDCS'24
- TORCHGT: A Holistic System for Large-Scale Graph Transformer Training: graph transformer model