Open
Description
The multi-query attention paper reports up to 10x speed-ups compared to incremental decoding with a multi-head attention model. We've implemented multi-query attention, but only observed up to 25% speed-ups when it is fully integrated into the Transformers model. We did observe up to 2x speed-ups for a simplified version of the attention layer (without softmax and layer normalization). See more details here.
Further inference gains are likely possible but require further investigation. For example, we would like to benchmark the difference in a more optimized inference environment such as Deepspeed-inference. We are also happy to discuss other solutions and directions in the #wg-inference channel.
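For context, here is a minimal sketch of the multi-query attention idea during one incremental decoding step, assuming a PyTorch-style interface. The function name, argument names, and shapes are illustrative assumptions, not the actual Transformers implementation: all query heads share a single key/value head, which shrinks the KV cache (and its memory traffic) by a factor of `num_heads`.

```python
import torch

def multi_query_attention_step(query, key_cache, value_cache, new_key, new_value):
    """One incremental decoding step with multi-query attention (hypothetical sketch).

    query:       (batch, num_heads, head_dim)   -- projections for one new token, per head
    key_cache:   (batch, past_len, head_dim)    -- single key head, shared across query heads
    value_cache: (batch, past_len, head_dim)    -- single value head, shared across query heads
    new_key:     (batch, 1, head_dim)
    new_value:   (batch, 1, head_dim)
    """
    # Append the new token's shared key/value to the cache.
    keys = torch.cat([key_cache, new_key], dim=1)        # (batch, past_len + 1, head_dim)
    values = torch.cat([value_cache, new_value], dim=1)  # (batch, past_len + 1, head_dim)

    head_dim = query.shape[-1]
    # Each query head attends over the single shared key head.
    scores = torch.einsum("bhd,btd->bht", query, keys) / head_dim ** 0.5
    weights = torch.softmax(scores, dim=-1)
    # Weighted sum of the shared values, one context vector per query head.
    context = torch.einsum("bht,btd->bhd", weights, values)
    return context, keys, values
```

The softmax and the per-head einsum contractions are exactly the pieces left out of the simplified benchmark mentioned above, which is consistent with the gap between the 2x and 25% numbers: the shared KV cache reduces memory bandwidth, but the remaining per-head compute and surrounding layers still dominate in the full model.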