Exllama kernels support for AWQ models #28634

IlyasMoutawwakil · 2024-01-22T09:20:19Z

What does this PR do?

Following casper-hansen/AutoAWQ#313
ExllamaV2 offers up to a 2x speedup compared to GEMM, while also being compatible with AMD ROCm.

Before submitting

This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
Did you read the contributor guideline, Pull Request section?
Was this discussed/approved via a Github issue or the forum? Please add a link
to it if that's the case.
Did you make sure to update the documentation with your changes? Here are the
documentation guidelines, and
here are tips on formatting docstrings.
Did you write any new necessary tests?

Who can review?

@SunMarc and @younesbelkada

HuggingFaceDocBuilderDev · 2024-01-23T09:36:33Z

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

SunMarc

Thanks for making the exllama kernels compatible with AWQ models! This will make AWQ so much faster! I've left a few minor comments.

src/transformers/integrations/awq.py

src/transformers/modeling_utils.py

src/transformers/utils/quantization_config.py

Co-authored-by: Marc Sun <57196510+SunMarc@users.noreply.github.com>

IlyasMoutawwakil · 2024-01-25T09:01:13Z

I guess all points are addressed.
@casper-hansen when is 0.1.9 planned ?

younesbelkada

Thanks a lot for adding the ex-llama v2 support ! 🔥
Let's add autoawq==0.1.9 in the Dockerfile:

transformers/docker/transformers-all-latest-gpu/Dockerfile

Line 59 in bbe30c6

    
           RUN python3 -m pip install --no-cache-dir https://github.com/casper-hansen/AutoAWQ/releases/download/v0.1.8/autoawq-0.1.8+cu118-cp38-cp38-linux_x86_64.whl

- @casper-hansen could you confirm that 0.1.9 is planned sometime soon?
Do you also know if this feature is supported on NVIDIA T4 GPUs? If that's the case, can you add a simple generation test in: https://github.com/huggingface/transformers/blob/main/tests/quantization/autoawq/test_awq.py
Thanks!

casper-hansen · 2024-01-26T11:15:28Z

@younesbelkada The next release will be 0.2.0 🤗. For T4 support, I have not tested it. If AutoGPTQ supports T4 with ExLlama v1 and v2 kernels, AutoAWQ should too as the kernels are the same.

EDIT: To answer the timeline question. There is no set-in-stone plan for the next release. PRs to be merged before release include AMD support, Marlin support, Qwen2 support, and hopefully PEFT support. I expect this could be done in <1-2 weeks.

younesbelkada · 2024-01-26T11:21:51Z

Awesome! Per my understanding ex-llama + AutoGPTQ should be supported on T4 so it should be all good !
Let me know whenever you have some progress for the PEFT support so that I'll dive in to add AWQ + PEFT support directly in PEFT

younesbelkada

@IlyasMoutawwakil - now that #26610 is merged, would you be happy to move the logic into the post-processing method of transformers/src/quantizers/quantizer_awq.py?

younesbelkada

Thanks very much @IlyasMoutawwakil ! I left one suggestion - what do you think?

younesbelkada · 2024-01-31T13:41:11Z

src/transformers/integrations/awq.py

+
+    # default values for exllamav2 from
+    # https://github.com/AutoGPTQ/AutoGPTQ/blob/6ba14f17ef73c161c2c4707cbf0b41e569a9c6dd/auto_gptq/nn_modules/qlinear/qlinear_exllamav2.py#L171
+    model = exllamav2_post_init(model, max_input_len=2048, max_batch_size=8)


couldn't we make max_input_len configurable through AwqConfig - wdyt?

IlyasMoutawwakil

Marc suggested we leave it as is for now #28634 (comment)

younesbelkada

Oh okay! I think it would make sense to directly expose an exllama_config - wdyt @SunMarc?

SunMarc

Yes, it would make more sense to expose it in an exllama_config!

IlyasMoutawwakil

I guess in another PR, right?

younesbelkada

Hmmm, I think it would be better to add it now and not leave the main branch with hardcoded config values. It shouldn't be super complex, as you can just copy over the existing logic in GptqConfig, right?
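
The suggestion in this thread, exposing the ExLlamaV2 buffer sizes through an exllama_config instead of hardcoding them, could look roughly like the sketch below. `SketchAwqConfig` is a hypothetical stand-in and not the actual `AwqConfig`; only the default values (`max_input_len=2048`, `max_batch_size=8`) come from the diff above, while the key names and the validation rules are assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Optional

# Defaults taken from the hardcoded call in the diff under review.
DEFAULT_EXLLAMA_CONFIG = {"max_input_len": 2048, "max_batch_size": 8}


@dataclass
class SketchAwqConfig:
    """Hypothetical stand-in for a quantization config; not the real AwqConfig."""

    version: str = "gemm"
    exllama_config: Optional[dict] = None

    def __post_init__(self):
        # The extra settings only matter when the exllama kernels are selected.
        if self.version != "exllama":
            return
        # Apply user overrides on top of the defaults.
        merged = dict(DEFAULT_EXLLAMA_CONFIG)
        merged.update(self.exllama_config or {})
        unknown = set(merged) - set(DEFAULT_EXLLAMA_CONFIG)
        if unknown:
            raise ValueError(f"Unknown exllama_config keys: {sorted(unknown)}")
        for key, value in merged.items():
            if not isinstance(value, int) or value <= 0:
                raise ValueError(f"{key} must be a positive integer, got {value!r}")
        self.exllama_config = merged


# Override one value and keep the default for the other.
config = SketchAwqConfig(version="exllama", exllama_config={"max_input_len": 4096})
print(config.exllama_config)
```

With this shape, callers who never pass an exllama_config still get the values that were previously hardcoded, so the change would be backward compatible under these assumptions.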

ArthurZucker

Looks already really nice thanks to the integration refactor!

younesbelkada

Thanks, clean ! Let's merge this PR right after the next release of autoawq
@casper-hansen do you have any ETA for the next release?

younesbelkada

Thanks again !

SunMarc

I see that @casper-hansen already made the release of awq with exllama kernel. Can you check that everything works fine with the latest release @IlyasMoutawwakil ? Then, we are good to merge !

IlyasMoutawwakil · 2024-02-23T03:26:38Z

@SunMarc on it!

IlyasMoutawwakil · 2024-02-23T04:01:57Z

works on rocm5.6 with torch 2.2

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, AwqConfig

model_id = "TheBloke/Mistral-7B-Instruct-v0.1-AWQ"

# Select the ExLlama kernels for the AWQ linear layers.
quantization_config = AwqConfig(version="exllama")
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=quantization_config,
    # A device_map is required; torch.device("cuda") can be passed instead of "auto".
    device_map="auto",
)

# Forward pass on random token ids.
input_ids = torch.randint(0, 100, (1, 128), dtype=torch.long, device="cuda")
output = model(input_ids)
print(output.logits)

# Sampled generation from a text prompt.
tokenizer = AutoTokenizer.from_pretrained(model_id)
input_ids = tokenizer.encode("How to make a cake", return_tensors="pt").to(model.device)
# Pad with the tokenizer's own EOS id rather than a hardcoded value.
output = model.generate(input_ids, do_sample=True, max_length=50, pad_token_id=tokenizer.eos_token_id)
print(tokenizer.decode(output[0], skip_special_tokens=True))

A device_map is mandatory because exllamav2_post_init allocates its scratch-space tensors per device, so it needs to know which qweights sit on which CUDA devices. I believe GPTQ has the same requirement.
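
To illustrate why the placement has to be known first, here is a minimal sketch of per-device scratch sizing. It assumes that each quantized layer can report the device holding its qweight and a scratch requirement in bytes, and that one scratch buffer per device is shared and sized for the largest layer placed there. `LayerInfo` and `scratch_bytes_per_device` are hypothetical names and the byte counts are made up; this is not the AutoAWQ or AutoGPTQ implementation.

```python
from typing import Dict, List, NamedTuple


class LayerInfo(NamedTuple):
    """Hypothetical record describing one quantized linear layer."""

    name: str
    device: str          # device that holds the layer's qweight
    scratch_bytes: int   # scratch space this layer needs at run time


def scratch_bytes_per_device(layers: List[LayerInfo]) -> Dict[str, int]:
    """Return the largest scratch requirement seen on each device."""
    required: Dict[str, int] = {}
    for layer in layers:
        current = required.get(layer.device, 0)
        required[layer.device] = max(current, layer.scratch_bytes)
    return required


# Illustrative placement across two GPUs (values are invented).
layers = [
    LayerInfo("layers.0.q_proj", "cuda:0", 4096),
    LayerInfo("layers.0.mlp", "cuda:0", 16384),
    LayerInfo("layers.31.mlp", "cuda:1", 8192),
]
print(scratch_bytes_per_device(layers))
```

The grouping only makes sense once every layer has a final device, which is consistent with the commit in this PR that moved the exllama post init to after device dispatching.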

younesbelkada

Thanks LGTM with one nit !

docker/transformers-all-latest-gpu/Dockerfile

ArthurZucker

Thank you for adding exllama support 🔥

added exllama kernels support for awq models

c6528d3

IlyasMoutawwakil requested a review from younesbelkada January 22, 2024 09:20

doc

0ef397f

style

e20c192

SunMarc approved these changes Jan 23, 2024

IlyasMoutawwakil and others added 2 commits January 24, 2024 12:45

Update src/transformers/modeling_utils.py

78c03c1

Co-authored-by: Marc Sun <57196510+SunMarc@users.noreply.github.com>

refactor

30cc4f6

younesbelkada reviewed Jan 26, 2024

IlyasMoutawwakil added 2 commits January 29, 2024 08:54

moved exllama post init to after device dispatching

76cbea1

bump autoawq version

df44597

younesbelkada reviewed Jan 30, 2024

IlyasMoutawwakil and others added 3 commits January 31, 2024 09:52

Merge branch 'main' into awq-exllama-support

6d8e4d5

added exllama test

4d976b9

style

104c0f3

younesbelkada reviewed Jan 31, 2024

ArthurZucker reviewed Jan 31, 2024

configurable exllama kernels

7944f04

younesbelkada approved these changes Feb 2, 2024

Merge branch 'main' into awq-exllama-support

fdd5b2e

younesbelkada approved these changes Feb 22, 2024

SunMarc requested review from younesbelkada and SunMarc February 22, 2024 17:00

SunMarc approved these changes Feb 22, 2024

copy exllama_config from gptq

dd07fc3

IlyasMoutawwakil and others added 2 commits February 23, 2024 06:08

moved exllama version check to post init

10b093f

Merge branch 'main' into awq-exllama-support

faa94d0

younesbelkada approved these changes Mar 4, 2024

docker/transformers-all-latest-gpu/Dockerfile (outdated, resolved)

moved to quantization dockerfile

a7bb24a

IlyasMoutawwakil requested a review from ArthurZucker March 4, 2024 07:56

IlyasMoutawwakil mentioned this pull request Mar 4, 2024

add test configurations for quantization with onnxruntime, awq, bnb (#95) huggingface/optimum-benchmark#144

Merged

ArthurZucker approved these changes Mar 5, 2024

ArthurZucker merged commit 4fc708f into huggingface:main Mar 5, 2024
19 of 21 checks passed

ArthurZucker mentioned this pull request Mar 7, 2024

Exllama v2 Quantization support #29448

Closed
