perf: fix the performance issue of `append_paged_kv_cache` (#588)
`append_paged_kv_cache` performs poorly at small batch sizes, a long-standing known issue that this PR fixes. This PR also adds support for non-contiguous append keys/values (which could be sliced from a fused QKV matrix).

We first call a Triton kernel to convert `append_indptr` to `batch_indices` and `positions`, which is similar to the [CSR2COO conversion](https://docs.nvidia.com/cuda/cusparse/#cusparse-t-csr2coo) for sparse matrices. After the conversion, we can use element parallelism instead of batch parallelism.

It would also be worth trying Triton for the second kernel, `AppendPagedKVCacheKernel`; I think the performance should be fine, but I'll leave it for future work.

Some todo items:
1. Add torch.compile support.

After this PR (reference numbers can be found in #583):

```bash
model: l1b seqlens: [1, 1, 1, 1, 1, 1, 1, 1]
single_layer: 0.006ms
all_layers: 0.094ms
throughput: 5.563GB/s
model: l1b seqlens: [4993, 1, 1, 1, 1, 1, 1, 1]
single_layer: 0.014ms
all_layers: 0.216ms
throughput: 1514.280GB/s
model: l1b seqlens: [5000]
single_layer: 0.014ms
all_layers: 0.216ms
throughput: 1517.017GB/s
model: l1b seqlens: [625, 625, 625, 625, 625, 625, 625, 625]
single_layer: 0.014ms
all_layers: 0.217ms
throughput: 1510.863GB/s
---
model: l3b seqlens: [1, 1, 1, 1, 1, 1, 1, 1]
single_layer: 0.006ms
all_layers: 0.165ms
throughput: 11.123GB/s
model: l3b seqlens: [4993, 1, 1, 1, 1, 1, 1, 1]
single_layer: 0.021ms
all_layers: 0.580ms
throughput: 1975.732GB/s
model: l3b seqlens: [5000]
single_layer: 0.021ms
all_layers: 0.586ms
throughput: 1958.078GB/s
model: l3b seqlens: [625, 625, 625, 625, 625, 625, 625, 625]
single_layer: 0.021ms
all_layers: 0.581ms
throughput: 1973.174GB/s
---
model: l8b seqlens: [1, 1, 1, 1, 1, 1, 1, 1]
single_layer: 0.006ms
all_layers: 0.185ms
throughput: 11.321GB/s
model: l8b seqlens: [4993, 1, 1, 1, 1, 1, 1, 1]
single_layer: 0.021ms
all_layers: 0.661ms
throughput: 1982.815GB/s
model: l8b seqlens: [5000]
single_layer: 0.021ms
all_layers: 0.662ms
throughput: 1980.227GB/s
model: l8b seqlens: [625, 625, 625, 625, 625, 625, 625, 625]
single_layer: 0.021ms
all_layers: 0.667ms
throughput: 1964.861GB/s
---
model: l70b-tp8 seqlens: [1, 1, 1, 1, 1, 1, 1, 1]
single_layer: 0.006ms
all_layers: 0.457ms
throughput: 1.434GB/s
model: l70b-tp8 seqlens: [4993, 1, 1, 1, 1, 1, 1, 1]
single_layer: 0.009ms
all_layers: 0.710ms
throughput: 576.866GB/s
model: l70b-tp8 seqlens: [5000]
single_layer: 0.009ms
all_layers: 0.685ms
throughput: 598.366GB/s
model: l70b-tp8 seqlens: [625, 625, 625, 625, 625, 625, 625, 625]
single_layer: 0.009ms
all_layers: 0.690ms
throughput: 593.453GB/s
```

cc @abcdabcd987
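
For readers unfamiliar with CSR2COO, the conversion from `append_indptr` to `batch_indices` and `positions` described in this commit message can be illustrated with a small CPU-side sketch. This is plain Python, not the Triton kernel from the PR; the function name `indptr_to_batch_indices_positions` is hypothetical, and it assumes `positions` counts from the start of each appended segment (the real kernel may offset positions differently, for example by the tokens already in the cache).

```python
from typing import List, Sequence, Tuple


def indptr_to_batch_indices_positions(
    append_indptr: Sequence[int],
) -> Tuple[List[int], List[int]]:
    """Expand CSR-style offsets into one (batch index, position) pair per element.

    Hypothetical helper for illustration only; not part of the library API.
    """
    batch_indices: List[int] = []
    positions: List[int] = []
    # append_indptr has one more entry than there are requests in the batch.
    for batch_idx in range(len(append_indptr) - 1):
        start, end = append_indptr[batch_idx], append_indptr[batch_idx + 1]
        for offset in range(end - start):
            batch_indices.append(batch_idx)
            # Assumption: position is the offset within the appended segment.
            positions.append(offset)
    return batch_indices, positions


# Three requests appending 2, 1, and 3 tokens respectively.
batch_indices, positions = indptr_to_batch_indices_positions([0, 2, 3, 6])
print(batch_indices)  # [0, 0, 1, 2, 2, 2]
print(positions)      # [0, 1, 0, 0, 1, 2]
```

Once every appended element carries its own `(batch_indices[i], positions[i])` pair, each element can be processed independently of the others, which is what makes element parallelism possible regardless of how unevenly the tokens are spread across requests.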
Showing 9 changed files with 285 additions and 93 deletions.