
# perf: faster fp8->fp16 dequantization for pre sm_90 arch #439

Merged: 8 commits merged into main, Aug 11, 2024

Conversation

yzh119
Collaborator

@yzh119 yzh119 commented Aug 11, 2024

The hardware fp8->fp16 fast-conversion instruction is not available on sm_80 and sm_89, which makes #420 slow on these architectures.

This PR uses Marlin's fast fp8->fp16x4 conversion algorithm (copied from the vLLM project) to accelerate those cases.

Co-authored-by: Antoni Baum <antoni@anyscale.com>
Co-authored-by: Cody Yu <cody@anyscale.com>

@yzh119 yzh119 merged commit c93f647 into main Aug 11, 2024
@yzh119 yzh119 deleted the faster-f8-f16-dequant branch August 11, 2024 07:50
yzh119 added a commit that referenced this pull request Aug 13, 2024
🤖 I have created a release *beep* *boop*
---


## [0.1.5](v0.1.4...v0.1.5) (2024-08-13)


### Bug Fixes

* Fix PagedPrefill Python API and some typos
([#441](#441))
([3fff008](3fff008))
* Fix prefill kernels' LSE result for empty kv-cache
([#440](#440))
([6ac28f4](6ac28f4))

### Features

* Decouple float and int workspace buffers
([#442](#442))
([a7ee566](a7ee566))


### Performance Improvements

* Faster fp8->fp16 dequantization for pre sm_90 arch
([#439](#439))
([c93f647](c93f647))

### Acknowledgement

We thank the community for their contributions and feedback:
[@comaniac](https://github.com/comaniac),
[@hnyls2002](https://github.com/hnyls2002),
[@jianfei-wangg](https://github.com/jianfei-wangg),
[@Yard1](https://github.com/Yard1).


---
This PR was generated with [Release
Please](https://github.com/googleapis/release-please). See
[documentation](https://github.com/googleapis/release-please#release-please).

---------

Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: Zihao Ye <expye@outlook.com>
yzh119 added a commit that referenced this pull request Aug 13, 2024
Follow-up of #439: use `constexpr` in if conditions so that `BIAS_OFFSET` won't exceed 32 at compile time.
zhyncs pushed a commit that referenced this pull request Aug 14, 2024