# EleutherAI ML Scalability & Performance Reading Group

My annotated papers, slides, and meeting recordings for the EleutherAI ML Scalability & Performance research paper reading group.

## Sessions

- **Session 1:** Intro to GPU architecture, CUDA, NCCL, and common ML performance bottlenecks
- **Session 2:** FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness
- **Session 3:** ZeRO: Memory Optimizations Toward Training Trillion Parameter Models
- **Session 4:**
  - Sequence Parallelism: Long Sequence Training from System Perspective
  - Blockwise Parallel Transformer for Large Context Models
  - Ring Attention with Blockwise Transformers for Near-Infinite Context Length
- **Session 5:** Efficient Memory Management for Large Language Model Serving with PagedAttention