Improve inference speed of multi-query attention model #3

Open
Description

@harm-devries

The multi-query attention paper reports up to 10x speed-ups compared to incremental decoding with a multi-head attention model. We've implemented multi-query attention, but only observed up to 25% speed-ups when it is fully integrated into the Transformers model. We did observe up to 2x speed-ups for a simplified version of the attention layer (without softmax and layer normalization). See more details here.
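
For reference, here is a minimal sketch (not this repo's implementation; the function name `mqa_decode_step` is made up for illustration) of one incremental-decoding step with multi-query attention: all query heads share a single key/value head, so the KV cache loses its head dimension.

```python
import torch
import torch.nn.functional as F

def mqa_decode_step(q, new_k, new_v, k_cache, v_cache):
    # q:                (batch, n_heads, 1, head_dim)  queries for the new token
    # new_k, new_v:     (batch, 1, head_dim)           shared key/value for the new token
    # k_cache, v_cache: (batch, past_len, head_dim)    shared cache from previous steps
    k = torch.cat([k_cache, new_k], dim=1)             # (batch, past_len + 1, head_dim)
    v = torch.cat([v_cache, new_v], dim=1)
    # The single K/V head is broadcast over all query heads.
    scores = torch.einsum("bhqd,bkd->bhqk", q, k) / q.shape[-1] ** 0.5
    attn = F.softmax(scores, dim=-1)
    out = torch.einsum("bhqk,bkd->bhqd", attn, v)      # (batch, n_heads, 1, head_dim)
    return out, k, v

# Example shapes: batch=1, n_heads=16, head_dim=64, 128 cached tokens.
q = torch.randn(1, 16, 1, 64)
out, k, v = mqa_decode_step(q,
                            torch.randn(1, 1, 64), torch.randn(1, 1, 64),
                            torch.randn(1, 128, 64), torch.randn(1, 128, 64))
```

Because the cache has no head dimension, the memory traffic per decoding step is roughly n_heads times smaller than for multi-head attention, which is the saving the paper's reported speed-up relies on.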

Further inference gains are likely possible but require further investigation. For example, we would like to benchmark the difference in a more optimized inference environment such as DeepSpeed-Inference; a rough sketch of such a benchmark is given below. We are also happy to discuss other solutions and directions in the #wg-inference channel.
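
As an illustration only, a rough sketch of what a DeepSpeed-Inference benchmark could look like. The model name is a placeholder and the `deepspeed.init_inference` arguments follow common usage examples; they are assumptions, not something decided in this issue, and may differ between DeepSpeed versions.

```python
import time
import torch
import deepspeed
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # placeholder, not the model discussed in this issue
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

# Wrap the model with DeepSpeed-Inference kernels.
engine = deepspeed.init_inference(
    model,
    mp_size=1,                       # no tensor parallelism
    dtype=torch.float16,
    replace_with_kernel_inject=True, # use DeepSpeed's fused inference kernels
)

# Time greedy incremental decoding.
inputs = tokenizer("Hello, my name is", return_tensors="pt").to("cuda")
torch.cuda.synchronize()
start = time.time()
engine.module.generate(**inputs, max_new_tokens=128, do_sample=False)
torch.cuda.synchronize()
print(f"incremental decoding took {time.time() - start:.2f}s")
```

The same timing loop run against the plain Transformers model would give the baseline to compare against.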

Labels: help wanted (Extra attention is needed)
