feat: add rotary_dim argument to rope APIs for partial apply rope #599

Merged: 3 commits into main from rope-dim on Nov 10, 2024

Conversation

yzh119 (Collaborator) commented Nov 10, 2024

This PR implements the final piece of #530, so that we can apply rotary embedding to only the first `rotary_dim` entries of each head dimension instead of the entire head dimension.
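As a reference for the intended semantics (not the CUDA kernel this PR adds), a minimal PyTorch sketch of partial RoPE might look as follows; `apply_partial_rope` and the non-interleaved (GPT-NeoX style) layout are illustrative assumptions, not the library API:

```python
import torch

def apply_partial_rope(x: torch.Tensor, pos: torch.Tensor,
                       rotary_dim: int, theta: float = 1e4) -> torch.Tensor:
    """Rotate only the first `rotary_dim` entries of the last axis;
    pass the remaining head dimensions through unchanged.

    x:   [seq_len, num_heads, head_dim]
    pos: [seq_len] absolute positions
    """
    rot, rest = x[..., :rotary_dim], x[..., rotary_dim:]
    half = rotary_dim // 2
    # Inverse frequencies, assuming a non-interleaved layout where the
    # first and second halves of the rotary slice are paired.
    freqs = theta ** (-torch.arange(half, device=x.device).float() / half)
    angles = pos[:, None].float() * freqs[None, :]   # [seq_len, half]
    cos = angles.cos()[:, None, :]                   # broadcast over heads
    sin = angles.sin()[:, None, :]
    x1, x2 = rot[..., :half], rot[..., half:]
    rotated = torch.cat([x1 * cos - x2 * sin,
                         x1 * sin + x2 * cos], dim=-1)
    return torch.cat([rotated.to(x.dtype), rest], dim=-1)

# e.g. rotate only the first 64 of 128 head dimensions:
q = torch.randn(1024, 32, 128, dtype=torch.float16)
q_out = apply_partial_rope(q, torch.arange(1024), rotary_dim=64)
```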

We also add a simple benchmark for RoPE; below are the results on an H100:

batch_size:   1, append_len:     1, num_qo_heads:    32, num_kv_heads:     8, head_dim:   128, use_cos_sin_cache: False, latency: 23us, throughput:   0.876GB/s
batch_size:   1, append_len:     1, num_qo_heads:    32, num_kv_heads:     8, head_dim:   128, use_cos_sin_cache: True, latency: 26us, throughput:   0.801GB/s
batch_size:   1, append_len:   128, num_qo_heads:    32, num_kv_heads:     8, head_dim:   128, use_cos_sin_cache: False, latency: 27us, throughput:  95.735GB/s
batch_size:   1, append_len:   128, num_qo_heads:    32, num_kv_heads:     8, head_dim:   128, use_cos_sin_cache: True, latency: 27us, throughput:  95.639GB/s
batch_size:   1, append_len:  1024, num_qo_heads:    32, num_kv_heads:     8, head_dim:   128, use_cos_sin_cache: False, latency: 31us, throughput: 672.889GB/s
batch_size:   1, append_len:  1024, num_qo_heads:    32, num_kv_heads:     8, head_dim:   128, use_cos_sin_cache: True, latency: 32us, throughput: 662.972GB/s
---
batch_size:  19, append_len:     1, num_qo_heads:    32, num_kv_heads:     8, head_dim:   128, use_cos_sin_cache: False, latency: 27us, throughput:  14.559GB/s
batch_size:  19, append_len:     1, num_qo_heads:    32, num_kv_heads:     8, head_dim:   128, use_cos_sin_cache: True, latency: 27us, throughput:  14.435GB/s
batch_size:  19, append_len:   128, num_qo_heads:    32, num_kv_heads:     8, head_dim:   128, use_cos_sin_cache: False, latency: 37us, throughput: 1339.450GB/s
batch_size:  19, append_len:   128, num_qo_heads:    32, num_kv_heads:     8, head_dim:   128, use_cos_sin_cache: True, latency: 37us, throughput: 1340.399GB/s
batch_size:  19, append_len:  1024, num_qo_heads:    32, num_kv_heads:     8, head_dim:   128, use_cos_sin_cache: False, latency: 148us, throughput: 2696.563GB/s
batch_size:  19, append_len:  1024, num_qo_heads:    32, num_kv_heads:     8, head_dim:   128, use_cos_sin_cache: True, latency: 148us, throughput: 2689.104GB/s
---
batch_size:  99, append_len:     1, num_qo_heads:    32, num_kv_heads:     8, head_dim:   128, use_cos_sin_cache: False, latency: 27us, throughput:  74.186GB/s
batch_size:  99, append_len:     1, num_qo_heads:    32, num_kv_heads:     8, head_dim:   128, use_cos_sin_cache: True, latency: 27us, throughput:  74.452GB/s
batch_size:  99, append_len:   128, num_qo_heads:    32, num_kv_heads:     8, head_dim:   128, use_cos_sin_cache: False, latency: 110us, throughput: 2350.830GB/s
batch_size:  99, append_len:   128, num_qo_heads:    32, num_kv_heads:     8, head_dim:   128, use_cos_sin_cache: True, latency: 110us, throughput: 2359.814GB/s
batch_size:  99, append_len:  1024, num_qo_heads:    32, num_kv_heads:     8, head_dim:   128, use_cos_sin_cache: False, latency: 717us, throughput: 2895.389GB/s
batch_size:  99, append_len:  1024, num_qo_heads:    32, num_kv_heads:     8, head_dim:   128, use_cos_sin_cache: True, latency: 718us, throughput: 2891.385GB/s
---
batch_size: 128, append_len:     1, num_qo_heads:    32, num_kv_heads:     8, head_dim:   128, use_cos_sin_cache: False, latency: 27us, throughput:  95.449GB/s
batch_size: 128, append_len:     1, num_qo_heads:    32, num_kv_heads:     8, head_dim:   128, use_cos_sin_cache: True, latency: 27us, throughput:  95.646GB/s
batch_size: 128, append_len:   128, num_qo_heads:    32, num_kv_heads:     8, head_dim:   128, use_cos_sin_cache: False, latency: 130us, throughput: 2576.101GB/s
batch_size: 128, append_len:   128, num_qo_heads:    32, num_kv_heads:     8, head_dim:   128, use_cos_sin_cache: True, latency: 130us, throughput: 2582.447GB/s
batch_size: 128, append_len:  1024, num_qo_heads:    32, num_kv_heads:     8, head_dim:   128, use_cos_sin_cache: False, latency: 924us, throughput: 2906.154GB/s
batch_size: 128, append_len:  1024, num_qo_heads:    32, num_kv_heads:     8, head_dim:   128, use_cos_sin_cache: True, latency: 925us, throughput: 2903.484GB/s
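For reference, the reported throughput is consistent with counting one read and one write of the q and k tensors at 2 bytes per element (fp16). A small sanity-check helper (hypothetical, not part of the benchmark script):

```python
def approx_throughput_gb_s(batch_size: int, append_len: int,
                           num_qo_heads: int, num_kv_heads: int,
                           head_dim: int, latency_us: float,
                           bytes_per_elem: int = 2) -> float:
    """Bytes read plus written for q and k, divided by latency."""
    nnz = batch_size * append_len
    io_bytes = 2 * bytes_per_elem * nnz * (num_qo_heads + num_kv_heads) * head_dim
    return io_bytes / (latency_us * 1e-6) / 1e9

# Approximately reproduces the last row above (~2905 GB/s vs. 2906 GB/s
# reported; the gap is latency rounding to whole microseconds).
print(approx_throughput_gb_s(128, 1024, 32, 8, 128, 924))
```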

yzh119 merged commit eb9bc71 into main on Nov 10, 2024
yzh119 mentioned this pull request on Nov 10, 2024
yzh119 deleted the rope-dim branch on November 10, 2024 at 08:46
yzh119 added a commit that referenced this pull request Nov 10, 2024
The TVM wrapper was broken in #599 because of API changes; this PR fixes the issue.