diff --git a/README.md b/README.md
index b230ddf5..aa033f2c 100644
--- a/README.md
+++ b/README.md
@@ -12,11 +12,27 @@
-
 📚 **Modern CUDA Learn Notes with PyTorch** for Beginners: It includes **Tensor/CUDA Cores, TF32/F16/BF16/F8**, [📖150+ CUDA Kernels🔥🔥(Easy -> Hard++)](#cuda-kernel) with PyTorch bindings, [📖100+ LLM/VLM/CV/CUDA/CuTe🔥](#my-blogs-part-1) blogs, [📖toy-hgemm⚡️⚡️](./kernels/hgemm) which can achieve `98%~100%` performance of **cuBLAS**, and [📖flash-attention-mma⚡️⚡️](./kernels/flash-attn) using Tensor Cores with pure MMA PTX. Welcome to 🌟👆🏻star this repo to support me, many thanks ~ 🎉🎉
-
+## 📖 News 🔥🔥
+
+
+- [2025-01-08]: [📚Fully QKV Fine-grained Tiling](#mma-tiling-qkv) has been refactored into 🤖[cuffpa-py](https://github.com/DefTruth/cuffpa-py): 📚FFPA - Yet another Faster Flash Prefill Attention with O(1)🎉SRAM complexity for headdim > 256, ~1.5x🎉faster vs SDPA EA.
+- [2024-12-02]: The HGEMM MMA kernels have been refactored into 🤖[hgemm-tensorcores-mma](https://github.com/DefTruth/hgemm-tensorcores-mma): ⚡️Write HGEMM from scratch using Tensor Cores with WMMA, MMA PTX and CuTe API.
+
+## 📖 Contents👇👀
+
+
+
+- [📖 HGEMM Benchmark](#hgemm-mma-bench)
+- [📖 FA2-MMA Benchmark](#fa-mma-bench)
+- [📖 150+ CUDA Kernels](#cuda-kernel)
+- [📖 100+ Blogs(LLM/CUDA)](#my-blogs-part-1)
+
+## 📖 HGEMM-MMA Benchmark 🎉🎉
+
+
@@ -24,6 +40,7 @@
+
 Currently, on NVIDIA L20, RTX 4090 and RTX 3080 Laptop, compared with cuBLAS's default Tensor Cores algorithm, the `HGEMM (WMMA/MMA/CuTe)` in this repo (`blue`🔵) can achieve `98%~100%` of its (`orange`🟠) performance. Please check [toy-hgemm library⚡️⚡️](./kernels/hgemm) or [hgemm-tensorcores-mma⚡️⚡️](https://github.com/DefTruth/hgemm-tensorcores-mma) repo for more details.
 ![toy-hgemm-library](https://github.com/user-attachments/assets/962bda14-b494-4423-b8eb-775da9f5503d)
@@ -40,6 +57,9 @@ Currently, on NVIDIA L20, RTX 4090 and RTX 3080 Laptop, compared with cuBLAS's d
 |Collective Store (Shfl)|Row Major (NN)|Col Major (TN)| SGEMM FP32/TF32|
 |✔️|✔️|✔️|✔️|
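The `98%~100%` figure above is a throughput comparison against cuBLAS. As a rough illustration of how such a comparison is typically timed (and not the benchmark harness this repo actually uses), the sketch below times repeated calls to `cublasHgemm` and to a custom HGEMM kernel with CUDA events. Everything named here is a placeholder: `my_hgemm` is a naive kernel standing in for the optimized WMMA/MMA/CuTe kernels under ./kernels/hgemm, and the problem size and launch configuration are arbitrary example values.

```cuda
#include <cuda_fp16.h>
#include <cublas_v2.h>
#include <cstdio>

// Placeholder kernel: a naive HGEMM (one thread per C element), standing in
// for the optimized WMMA/MMA/CuTe kernels under ./kernels/hgemm.
__global__ void my_hgemm(const half* A, const half* B, half* C,
                         int M, int N, int K) {
  int row = blockIdx.y * blockDim.y + threadIdx.y;
  int col = blockIdx.x * blockDim.x + threadIdx.x;
  if (row >= M || col >= N) return;
  float acc = 0.0f;
  for (int k = 0; k < K; ++k)
    acc += __half2float(A[row * K + k]) * __half2float(B[k * N + col]);
  C[row * N + col] = __float2half(acc);
}

int main() {
  const int M = 2048, N = 2048, K = 2048, iters = 10;  // example sizes
  half *A, *B, *C;
  cudaMalloc(&A, sizeof(half) * M * K);
  cudaMalloc(&B, sizeof(half) * K * N);
  cudaMalloc(&C, sizeof(half) * M * N);
  cudaMemset(A, 0, sizeof(half) * M * K);
  cudaMemset(B, 0, sizeof(half) * K * N);

  cublasHandle_t handle;
  cublasCreate(&handle);
  const half alpha = __float2half(1.0f), beta = __float2half(0.0f);
  dim3 block(16, 16), grid((N + 15) / 16, (M + 15) / 16);

  // Warm-up both paths so cuBLAS initialization stays outside the timed region.
  cublasHgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N, N, M, K,
              &alpha, B, N, A, K, &beta, C, N);
  my_hgemm<<<grid, block>>>(A, B, C, M, N, K);
  cudaDeviceSynchronize();

  cudaEvent_t beg, end;
  cudaEventCreate(&beg);
  cudaEventCreate(&end);
  float cublas_ms = 0.0f, custom_ms = 0.0f;

  // cuBLAS reference: row-major C = A*B expressed as column-major C^T = B^T*A^T.
  cudaEventRecord(beg);
  for (int i = 0; i < iters; ++i)
    cublasHgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N, N, M, K,
                &alpha, B, N, A, K, &beta, C, N);
  cudaEventRecord(end);
  cudaEventSynchronize(end);
  cudaEventElapsedTime(&cublas_ms, beg, end);

  // Custom kernel on the same problem size.
  cudaEventRecord(beg);
  for (int i = 0; i < iters; ++i)
    my_hgemm<<<grid, block>>>(A, B, C, M, N, K);
  cudaEventRecord(end);
  cudaEventSynchronize(end);
  cudaEventElapsedTime(&custom_ms, beg, end);

  // 2*M*N*K FLOPs per GEMM; GFLOPs per call divided by ms gives TFLOPS.
  double gflops = 2.0 * M * N * K * 1e-9;
  printf("cuBLAS: %.3f ms (%.1f TFLOPS) | custom: %.3f ms (%.1f TFLOPS)\n",
         cublas_ms / iters, gflops / (cublas_ms / iters),
         custom_ms / iters, gflops / (custom_ms / iters));

  cublasDestroy(handle);
  cudaFree(A); cudaFree(B); cudaFree(C);
  return 0;
}
```

Built with something like `nvcc -O3 -arch=sm_80 hgemm_bench.cu -lcublas`; only the timing structure is the point here, and the scripts under ./kernels/hgemm remain the place to reproduce the actual plots.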
+## 📖 FA2-MMA Benchmark 🎉🎉
+
+
 I have also implemented **FlashAttention-2** using pure MMA PTX instructions, which supports features such as Multi-Stages, Tile MMA, Tile Warp, Shared KV SMEM, **Fully Shared QKV SMEM**, **Prefetch Q s2r**, **Prefetch K/V g2s**, **QKV Fine-grained Tiling**, Collective Store, etc. Please refer to [flash-attention-mma⚡️⚡️](./kernels/flash-attn) for more details.
@@ -131,7 +151,7 @@ __global__ void // Q, K, V, O -> [B, H, N, D]
 flash_attn_mma_stages_split_q_tiling_qk_kernel(half* Q, half* K, half* V, half* O, ...);
 ```
-- 📚 Split Q + Fully QKV Fine-grained Tiling (**O(Brx16)~O(1) SRAM** vs FA2 **O(4xBrxd) SRAM**)
+- 📚 Split Q + Fully QKV Fine-grained Tiling (**O(2xBrx16)~O(1) SRAM** vs FA2 **O(4xBrxd) SRAM**)
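To make the SRAM comparison in the bullet above concrete: with 2-byte half-precision elements and an illustrative tile height of Br = 64 (an assumption; the kernels pick their own tile shapes), the two expressions quoted there, O(4xBrxd) for the FA2-style Q/K/V/O tiles and O(2xBrx16) for fully QKV fine-grained tiling, work out as in the sketch below. The numbers follow purely from those formulas and are not measurements of the kernels.

```cuda
#include <cstdio>

int main() {
  // Assumptions for illustration only: Br (row-tile height) and the head
  // dimensions d are example values; 2 bytes per element corresponds to half.
  const int Br = 64;
  const int bytes_per_elem = 2;
  const int dims[] = {64, 128, 256, 512};
  for (int d : dims) {
    // The two expressions quoted in the bullet above.
    size_t fa2_smem    = 4ull * Br * d  * bytes_per_elem;  // O(4*Br*d): grows with d
    size_t tiling_smem = 2ull * Br * 16 * bytes_per_elem;  // O(2*Br*16): constant in d
    printf("d=%4d: FA2-style ~%4zu KB vs fully QKV fine-grained tiling ~%zu KB\n",
           d, fa2_smem / 1024, tiling_smem / 1024);
  }
  return 0;
}
```

Under these assumptions the fine-grained-tiling footprint stops growing with head dimension, which is what the O(1) SRAM claim for headdim > 256 in the news entry above refers to.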
@@ -142,7 +162,6 @@ flash_attn_mma_stages_split_q_tiling_qk_kernel(half* Q, half* K, half* V, half*
 __global__ void // Q, K, V, O -> [B, H, N, D]
 flash_attn_mma_stages_split_q_tiling_qkv_kernel(half* Q, half* K, half* V, half* O, ...);
 ```
-
 ## ©️Citations🎉🎉
 ```BibTeX
@@ -538,7 +557,7 @@ GNU General Public License v3.0
 ## 🎉Contribute ([©️back👆🏻](#contents))
-
+
 How to contribute? Star this repo or check [🌤🌤CONTRIBUTE🎉🎉](https://github.com/DefTruth/CUDA-Learn-Notes/issues/50).
@@ -552,7 +571,9 @@ How to contribute? Star this repo or check [🌤🌤CONTRIBUTE🎉🎉](https://
-## 📖 References ([©️back👆🏻](#contents))
+## 📖 References ([©️back👆🏻](#contents))
+
+
 - [flash-attention-minimal](https://github.com/tspeterkim/flash-attention-minimal)
 - [tiny-flash-attention](https://github.com/66RING/tiny-flash-attention)
 - [cute-gemm](https://github.com/reed-lau/cute-gemm)