bugfix: fix the prefill/append attention kernel accuracy issue on sm75 #448

yzh119 · 2024-08-16T00:55:31Z

As reported by @esmeetu, the prefill/append attention kernel produces incorrect results on sm75. This PR fixes the issue.

We need another round of kernel configuration checks before releasing the official sm75 wheel. For example, 1024 threads per block is too large for sm75; we should use smaller values such as 512 or 256. @zhyncs, would you mind helping with this?
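As a rough illustration of the kind of configuration check mentioned above, the sketch below clamps a requested block size by compute capability. Everything in it is hypothetical: the helper names, and the choice of 512 as the sm75 cap (taken from the values suggested in this comment), are assumptions and not FlashInfer's actual kernel configuration logic.

```python
# Hypothetical helpers, NOT part of FlashInfer: pick a threads-per-block value
# that respects an assumed per-architecture cap.

WARP_SIZE = 32  # CUDA warp size


def assumed_block_size_cap(major: int, minor: int) -> int:
    """Return an assumed upper bound on threads per block for an architecture."""
    if (major, minor) == (7, 5):
        # Assumption from the PR discussion: prefer 512 (or 256) on sm75.
        return 512
    return 1024


def clamp_block_size(requested: int, major: int, minor: int) -> int:
    """Clamp `requested` to the assumed cap, rounded down to a warp multiple."""
    capped = min(requested, assumed_block_size_cap(major, minor))
    # Keep at least one full warp.
    return max(WARP_SIZE, (capped // WARP_SIZE) * WARP_SIZE)


if __name__ == "__main__":
    print(clamp_block_size(1024, 7, 5))  # sm75
    print(clamp_block_size(1024, 8, 0))  # sm80
```

A real check would also need to account for register and shared-memory usage per kernel, which this sketch ignores.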

zhyncs · 2024-08-16T05:25:35Z

@yzh119 ok

@LiuXiaoxuanPKU

🤖 I have created a release *beep* *boop*

---

## [0.1.6](v0.1.5...v0.1.6) (2024-08-27)

### SM75 Support

Starting from [0.1.6](v0.1.5...v0.1.6), our pre-built wheels include experimental support for sm75 (Turing architecture GPUs such as Tesla T4, Quadro RTX 6000 and RTX 2080).

### API Changes

#### `plan`/`run`

Starting from [0.1.6](v0.1.5...v0.1.6), the `begin_forward`/`forward`/`end_forward` APIs are replaced with the new `plan`/`run` API.

- `forward` is renamed to `run`, which is more precise and consistent with the naming convention of cutlass's python API.
- `begin_forward` is renamed to `plan`, which is consistent with the naming convention of nvmath API.
- `end_forward` is deprecated and has no effect after this PR.

There is one slight difference between the old `forward` and the new `run` API:

- All extra arguments such as `causal` and `logits_soft_cap` are provided in the `plan` (previously `begin_forward`) API and cached until the next `plan` call, so only the query and KV-Cache tensors need to be passed to `run`.

The old `begin_forward`/`forward`/`end_forward` APIs are still functional, but we will gradually deprecate them in future releases.

Check [#466](#466) for more details.

#### `MultiLevelCascadeAttentionWrapper`

Starting from [0.1.6](v0.1.5...v0.1.6), we introduce a new `MultiLevelCascadeAttentionWrapper` API for cascade inference, which supports multi-level cascade inference where all levels' KV-Cache can be managed in a unified Paged KV-Cache.

See the [documentation](https://docs.flashinfer.ai/api/python/cascade.html#flashinfer.cascade.MultiLevelCascadeAttentionWrapper) and [tutorial](https://docs.flashinfer.ai/tutorials/kv_layout.html#multi-level-cascade-inference-data-layout) for API usage and an explanation of the layout.

The old `BatchDecodeWithSharedPrefixPagedKVCacheWrapper` and `BatchPrefillWithSharedPrefixPagedKVCacheWrapper` will be deprecated in future releases.

### Features

* sm75 support ([#448](#448), [#449](#449))
* add `MultiLevelCascadeAttentionWrapper` API ([#462](#462)) ([1e37989](1e37989))
* add accept num, emit num metric for ChainSpeculativeSampling ([#450](#450)) ([fa38b5e](fa38b5e))
* support bmm fp8 ([#469](#469)) ([f1c0b68](f1c0b68))

### Refactor

* refactor: replace `begin_forward`/`forward`/`end_forward` with `plan`/`run` [#466](#466)

### Misc

* misc: improve error handling of sampling kernels ([#456](#456)) ([0dce178](0dce178))

### Performance Improvements

* slight optimization on f16->f8 fragment layout swizzling ([#453](#453)) ([0d61871](0d61871))
* slight optimization on fragment layout swizzle ([#458](#458)) ([7c397cb](7c397cb))
* use persistent kernel for merging attention states ([#459](#459)) ([be6bf5b](be6bf5b))

### Acknowledgement

We thank [@LiuXiaoxuanPKU](https://github.com/LiuXiaoxuanPKU) for enhancing the speculative sampling operator, [@merrymercy](https://github.com/merrymercy) for API change suggestions and [@zhyncs](https://github.com/zhyncs) for integrating the fp8 BMM cuBLAS implementation.

---

This PR was generated with [Release Please](https://github.com/googleapis/release-please). See [documentation](https://github.com/googleapis/release-please#release-please).

---------

Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: Zihao Ye <expye@outlook.com>
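The `plan`/`run` split described in the release notes above can be illustrated with a toy, pure-Python sketch. `ToyAttentionWrapper` is a hypothetical stand-in and not FlashInfer's wrapper: it only shows the pattern of caching `causal` and `logits_soft_cap` in `plan` and passing just the query and key/value data to `run`. The soft-cap formula (`cap * tanh(score / cap)`) and the alignment of queries to the end of the key sequence are assumptions made for the example.

```python
import math
from typing import List, Optional

Matrix = List[List[float]]


class ToyAttentionWrapper:
    """Hypothetical stand-in (NOT FlashInfer): demonstrates plan/run caching."""

    def __init__(self) -> None:
        self._planned = False
        self._causal = False
        self._logits_soft_cap: Optional[float] = None

    def plan(self, causal: bool = False,
             logits_soft_cap: Optional[float] = None) -> None:
        # Extra arguments are stored here and reused until the next plan().
        self._causal = causal
        self._logits_soft_cap = logits_soft_cap
        self._planned = True

    def run(self, q: Matrix, k: Matrix, v: Matrix) -> Matrix:
        # Only the query and key/value data are passed at run time.
        if not self._planned:
            raise RuntimeError("call plan() before run()")
        if self._causal and len(q) > len(k):
            raise ValueError("causal mode expects len(q) <= len(k)")
        scale = 1.0 / math.sqrt(len(q[0]))
        offset = len(k) - len(q)  # assumption: queries align to the end of k
        out: Matrix = []
        for i, q_row in enumerate(q):
            last = i + offset if self._causal else len(k) - 1
            scores = []
            for j in range(last + 1):
                s = sum(a * b for a, b in zip(q_row, k[j])) * scale
                if self._logits_soft_cap is not None:
                    cap = self._logits_soft_cap
                    s = cap * math.tanh(s / cap)  # assumed soft-cap form
                scores.append(s)
            # Numerically stable softmax over the visible keys.
            m = max(scores)
            weights = [math.exp(s - m) for s in scores]
            total = sum(weights)
            out.append([
                sum(w * v[j][c] for j, w in enumerate(weights)) / total
                for c in range(len(v[0]))
            ])
        return out


if __name__ == "__main__":
    wrapper = ToyAttentionWrapper()
    wrapper.plan(causal=True)            # configure once
    eye = [[1.0, 0.0], [0.0, 1.0]]
    print(wrapper.run(eye, eye, eye))    # reuse the cached configuration
    print(wrapper.run(eye, eye, eye))    # no need to repeat `causal`
```

For the real signatures of `plan` and `run`, refer to #466 and the FlashInfer documentation rather than this sketch.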

upd

92d1364

yzh119 merged commit 338b2f5 into main Aug 16, 2024

zhyncs deleted the sm75-prefill-fix branch August 16, 2024 05:25

zhyncs mentioned this pull request Aug 17, 2024

[Bug] T4 not work sgl-project/sglang#1058

Closed

4 tasks

yzh119 mentioned this pull request Aug 27, 2024

chore(main): release 0.1.6 #447

Merged
