feat: add llama 3.1 style rope #401

Merged
merged 13 commits into main from llama-3.1-rope on Jul 27, 2024
Conversation

@yzh119 yzh119 (Collaborator) commented Jul 27, 2024

Reference implementation: https://github.com/meta-llama/llama-models/blob/709a61fd810157f75fbb314e7287089eec06d9c3/models/llama3_1/api/model.py#L41

This PR also exposes BatchQKApplyRotaryInPlaceKernel to the PyTorch APIs; previously it was only used in the TVM wrappers.
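For context: Llama 3.1 style RoPE differs from standard RoPE only in a one-time rescaling of the rotary frequencies before the rotation is applied. The sketch below follows the referenced Meta implementation with its default constants (scale factor 8, low/high frequency factors 1 and 4, original context length 8192); it is an illustrative Python version, not the CUDA kernel added in this PR.

```python
import math
import torch

def llama31_scale_freqs(freqs: torch.Tensor,
                        scale_factor: float = 8.0,
                        low_freq_factor: float = 1.0,
                        high_freq_factor: float = 4.0,
                        old_context_len: int = 8192) -> torch.Tensor:
    """Rescale RoPE frequencies Llama 3.1 style (illustrative sketch)."""
    low_freq_wavelen = old_context_len / low_freq_factor
    high_freq_wavelen = old_context_len / high_freq_factor
    new_freqs = []
    for freq in freqs.tolist():
        wavelen = 2 * math.pi / freq
        if wavelen < high_freq_wavelen:
            new_freqs.append(freq)                 # short wavelengths: keep as-is
        elif wavelen > low_freq_wavelen:
            new_freqs.append(freq / scale_factor)  # long wavelengths: shrink
        else:
            # medium band: smoothly interpolate between the two regimes
            smooth = (old_context_len / wavelen - low_freq_factor) / (
                high_freq_factor - low_freq_factor)
            new_freqs.append((1 - smooth) * freq / scale_factor + smooth * freq)
    return torch.tensor(new_freqs, dtype=freqs.dtype, device=freqs.device)

# Example: base RoPE frequencies for head_dim=128, rope_theta=5e5
head_dim, rope_theta = 128, 5e5
freqs = 1.0 / (rope_theta ** (torch.arange(0, head_dim, 2).float() / head_dim))
scaled_freqs = llama31_scale_freqs(freqs)
```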

@yzh119 yzh119 merged commit 4c89dec into main Jul 27, 2024
yzh119 added a commit that referenced this pull request Jul 29, 2024
🤖 I have created a release *beep* *boop*
---

## [0.1.2](v0.1.1...v0.1.2) (2024-07-29)

### Bugfix
* Fix the sampling kernel bug for cu118
([#386](#386),
[#387](#387))
([0cd499](0cd4994),
[dc3f18](dc3f184))

### Features

* add llama 3.1 style rope
([#401](#401))
([4c89dec](4c89dec))
* non-inplace rope operators
([#405](#405))
([74ffba1](74ffba1))
* sliding window attention
([#406](#406))
([28cffd3](28cffd3))
* support non-contiguous (packed) input for prefill kernels
([#404](#404))
([68c3719](68c3719))


### Performance Improvements

* slight optimization on merge states
([#313](#313))
([701c813](701c813))

---
This PR was generated with [Release
Please](https://github.com/googleapis/release-please). See
[documentation](https://github.com/googleapis/release-please#release-please).

---------

Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: Zihao Ye <expye@outlook.com>
@yzh119 yzh119 deleted the llama-3.1-rope branch August 3, 2024 00:20
@chenzhuofu

Awesome!

@chenzhuofu chenzhuofu commented Aug 25, 2024

It looks like llama-3.1-rope hasn't been incorporated into PosEncodingMode, so I think I may explicitly call BatchQKApplyLlama31Rotary and then run AttentionKernel with PosEncodingMode::kNone. What do you think? @yzh119
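A rough, purely illustrative sketch of that split (rotate q/k with the scaled frequencies up front, then run attention with positional encoding disabled, i.e. what PosEncodingMode::kNone expresses on the C++ side), written in plain PyTorch; none of the helpers below are flashinfer's actual API:

```python
import torch
import torch.nn.functional as F

def rotate(x: torch.Tensor, freqs: torch.Tensor, pos: torch.Tensor) -> torch.Tensor:
    """Interleaved RoPE rotation; x: [seq, heads, head_dim], freqs: [head_dim // 2]."""
    angles = pos[:, None] * freqs[None, :]                 # [seq, head_dim // 2]
    cos, sin = angles.cos()[:, None, :], angles.sin()[:, None, :]
    x1, x2 = x[..., 0::2], x[..., 1::2]
    out = torch.empty_like(x)
    out[..., 0::2] = x1 * cos - x2 * sin
    out[..., 1::2] = x1 * sin + x2 * cos
    return out

seq, heads, head_dim = 16, 8, 128
q, k, v = (torch.randn(seq, heads, head_dim) for _ in range(3))
pos = torch.arange(seq, dtype=torch.float32)
# Base frequencies; a real Llama 3.1 setup would rescale them as in the PR description.
freqs = 1.0 / (5e5 ** (torch.arange(0, head_dim, 2).float() / head_dim))

# Step 1: apply rope explicitly (the role BatchQKApplyLlama31Rotary would play).
q, k = rotate(q, freqs, pos), rotate(k, freqs, pos)

# Step 2: attention with no positional encoding, i.e. the kNone path.
out = F.scaled_dot_product_attention(
    q.transpose(0, 1), k.transpose(0, 1), v.transpose(0, 1), is_causal=True)
```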

@yzh119 yzh119 (Collaborator, Author) commented Sep 1, 2024

@chenzhuofu, yes: the wheel size would explode if we folded llama 3.1 style rope into PosEncodingMode.

I'm refactoring the codebase to use JIT compilation, and the issue should be resolved soon.
