feat: add rotary_dim argument to rope APIs for partial apply rope #599

Merged: 3 commits into main from rope-dim on Nov 10, 2024

Conversation

yzh119 (Collaborator) commented Nov 10, 2024

This PR implements the final piece of #530, so that we can apply rotary embedding to only the first `rotary_dim` entries of each head dimension instead of the entire head dimension.
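As a reference for the intended semantics (not the CUDA kernel this PR adds), a minimal PyTorch sketch of partial RoPE might look as follows; `apply_partial_rope` and the non-interleaved (GPT-NeoX style) layout are illustrative assumptions, not the library API:

```python
import torch

def apply_partial_rope(x: torch.Tensor, pos: torch.Tensor,
                       rotary_dim: int, theta: float = 1e4) -> torch.Tensor:
    """Rotate only the first `rotary_dim` entries of the last axis;
    pass the remaining head dimensions through unchanged.

    x:   [seq_len, num_heads, head_dim]
    pos: [seq_len] absolute positions
    """
    rot, rest = x[..., :rotary_dim], x[..., rotary_dim:]
    half = rotary_dim // 2
    # Inverse frequencies, assuming a non-interleaved layout where the
    # first and second halves of the rotary slice are paired.
    freqs = theta ** (-torch.arange(half, device=x.device).float() / half)
    angles = pos[:, None].float() * freqs[None, :]   # [seq_len, half]
    cos = angles.cos()[:, None, :]                   # broadcast over heads
    sin = angles.sin()[:, None, :]
    x1, x2 = rot[..., :half], rot[..., half:]
    rotated = torch.cat([x1 * cos - x2 * sin,
                         x1 * sin + x2 * cos], dim=-1)
    return torch.cat([rotated.to(x.dtype), rest], dim=-1)

# e.g. rotate only the first 64 of 128 head dimensions:
q = torch.randn(1024, 32, 128, dtype=torch.float16)
q_out = apply_partial_rope(q, torch.arange(1024), rotary_dim=64)
```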

We also add a simple benchmark for RoPE; below are the results on an H100:

batch_size:   1, append_len:     1, num_qo_heads:    32, num_kv_heads:     8, head_dim:   128, use_cos_sin_cache: False, latency: 23us, throughput:   0.876GB/s
batch_size:   1, append_len:     1, num_qo_heads:    32, num_kv_heads:     8, head_dim:   128, use_cos_sin_cache: True, latency: 26us, throughput:   0.801GB/s
batch_size:   1, append_len:   128, num_qo_heads:    32, num_kv_heads:     8, head_dim:   128, use_cos_sin_cache: False, latency: 27us, throughput:  95.735GB/s
batch_size:   1, append_len:   128, num_qo_heads:    32, num_kv_heads:     8, head_dim:   128, use_cos_sin_cache: True, latency: 27us, throughput:  95.639GB/s
batch_size:   1, append_len:  1024, num_qo_heads:    32, num_kv_heads:     8, head_dim:   128, use_cos_sin_cache: False, latency: 31us, throughput: 672.889GB/s
batch_size:   1, append_len:  1024, num_qo_heads:    32, num_kv_heads:     8, head_dim:   128, use_cos_sin_cache: True, latency: 32us, throughput: 662.972GB/s
---
batch_size:  19, append_len:     1, num_qo_heads:    32, num_kv_heads:     8, head_dim:   128, use_cos_sin_cache: False, latency: 27us, throughput:  14.559GB/s
batch_size:  19, append_len:     1, num_qo_heads:    32, num_kv_heads:     8, head_dim:   128, use_cos_sin_cache: True, latency: 27us, throughput:  14.435GB/s
batch_size:  19, append_len:   128, num_qo_heads:    32, num_kv_heads:     8, head_dim:   128, use_cos_sin_cache: False, latency: 37us, throughput: 1339.450GB/s
batch_size:  19, append_len:   128, num_qo_heads:    32, num_kv_heads:     8, head_dim:   128, use_cos_sin_cache: True, latency: 37us, throughput: 1340.399GB/s
batch_size:  19, append_len:  1024, num_qo_heads:    32, num_kv_heads:     8, head_dim:   128, use_cos_sin_cache: False, latency: 148us, throughput: 2696.563GB/s
batch_size:  19, append_len:  1024, num_qo_heads:    32, num_kv_heads:     8, head_dim:   128, use_cos_sin_cache: True, latency: 148us, throughput: 2689.104GB/s
---
batch_size:  99, append_len:     1, num_qo_heads:    32, num_kv_heads:     8, head_dim:   128, use_cos_sin_cache: False, latency: 27us, throughput:  74.186GB/s
batch_size:  99, append_len:     1, num_qo_heads:    32, num_kv_heads:     8, head_dim:   128, use_cos_sin_cache: True, latency: 27us, throughput:  74.452GB/s
batch_size:  99, append_len:   128, num_qo_heads:    32, num_kv_heads:     8, head_dim:   128, use_cos_sin_cache: False, latency: 110us, throughput: 2350.830GB/s
batch_size:  99, append_len:   128, num_qo_heads:    32, num_kv_heads:     8, head_dim:   128, use_cos_sin_cache: True, latency: 110us, throughput: 2359.814GB/s
batch_size:  99, append_len:  1024, num_qo_heads:    32, num_kv_heads:     8, head_dim:   128, use_cos_sin_cache: False, latency: 717us, throughput: 2895.389GB/s
batch_size:  99, append_len:  1024, num_qo_heads:    32, num_kv_heads:     8, head_dim:   128, use_cos_sin_cache: True, latency: 718us, throughput: 2891.385GB/s
---
batch_size: 128, append_len:     1, num_qo_heads:    32, num_kv_heads:     8, head_dim:   128, use_cos_sin_cache: False, latency: 27us, throughput:  95.449GB/s
batch_size: 128, append_len:     1, num_qo_heads:    32, num_kv_heads:     8, head_dim:   128, use_cos_sin_cache: True, latency: 27us, throughput:  95.646GB/s
batch_size: 128, append_len:   128, num_qo_heads:    32, num_kv_heads:     8, head_dim:   128, use_cos_sin_cache: False, latency: 130us, throughput: 2576.101GB/s
batch_size: 128, append_len:   128, num_qo_heads:    32, num_kv_heads:     8, head_dim:   128, use_cos_sin_cache: True, latency: 130us, throughput: 2582.447GB/s
batch_size: 128, append_len:  1024, num_qo_heads:    32, num_kv_heads:     8, head_dim:   128, use_cos_sin_cache: False, latency: 924us, throughput: 2906.154GB/s
batch_size: 128, append_len:  1024, num_qo_heads:    32, num_kv_heads:     8, head_dim:   128, use_cos_sin_cache: True, latency: 925us, throughput: 2903.484GB/s
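For reference, the reported throughput is consistent with counting one read and one write of the q and k tensors at 2 bytes per element (fp16). A small sanity-check helper (hypothetical, not part of the benchmark script):

```python
def approx_throughput_gb_s(batch_size: int, append_len: int,
                           num_qo_heads: int, num_kv_heads: int,
                           head_dim: int, latency_us: float,
                           bytes_per_elem: int = 2) -> float:
    """Bytes read plus written for q and k, divided by latency."""
    nnz = batch_size * append_len
    io_bytes = 2 * bytes_per_elem * nnz * (num_qo_heads + num_kv_heads) * head_dim
    return io_bytes / (latency_us * 1e-6) / 1e9

# Approximately reproduces the last row above (~2905 GB/s vs. 2906 GB/s
# reported; the gap is latency rounding to whole microseconds).
print(approx_throughput_gb_s(128, 1024, 32, 8, 128, 924))
```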

yzh119 merged commit eb9bc71 into main on Nov 10, 2024
yzh119 mentioned this pull request on Nov 10, 2024
yzh119 deleted the rope-dim branch on November 10, 2024 at 08:46
yzh119 added a commit that referenced this pull request Nov 10, 2024
The TVM wrapper was broken in #599 because of API changes; this PR fixes the issue.