Is it possible to load a quantized model from Hugging Face? #2458

Closed
pei0033 opened this issue Nov 19, 2024 · 6 comments
@pei0033
Contributor

pei0033 commented Nov 19, 2024

Is there any way to load a quantized model directly from Hugging Face and convert it to a TensorRT-LLM checkpoint (or engine) without calibration?
I could find a script for AutoGPTQ, but I couldn't find one for other quantization methods (like AutoAWQ, CompressedTensors or BNB).

@hello-11 added the question, triaged, and Low Precision labels on Nov 20, 2024
@Tracin
Collaborator

Tracin commented Nov 20, 2024

I think TRT-LLM currently supports loading from AutoGPTQ and QServe (W4A8). AWQ can be applied via nvidia-modelopt.
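For reference, a minimal sketch of the nvidia-modelopt AWQ path, under the assumption that modelopt.torch.quantization exposes mtq.quantize() and an INT4_AWQ_CFG preset; the model name and calibration prompt are placeholders, and the end-to-end flow is normally wrapped by examples/quantization/quantize.py in the TRT-LLM repo:

```python
# Hedged sketch: quantize an HF model to INT4 AWQ with nvidia-modelopt.
# Assumes mtq.quantize() and mtq.INT4_AWQ_CFG exist; names are placeholders.
import modelopt.torch.quantization as mtq
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "meta-llama/Llama-2-7b-hf"  # placeholder
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype="auto", device_map="auto")
tokenizer = AutoTokenizer.from_pretrained(model_id)

def forward_loop(m):
    # AWQ needs a short calibration pass; one prompt is enough for a smoke test.
    for prompt in ["TensorRT-LLM is an inference toolkit."]:
        inputs = tokenizer(prompt, return_tensors="pt").to(m.device)
        m(**inputs)

model = mtq.quantize(model, mtq.INT4_AWQ_CFG, forward_loop)
# Exporting the result to a TRT-LLM checkpoint is handled by the quantize.py example.
```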

@lodm94

lodm94 commented Nov 21, 2024

I think TRT-LLM currently supports loading from AutoGPTQ and QServe (W4A8). AWQ can be applied via nvidia-modelopt.

Even if you can do AWQ quantization with nvidia-modelopt, the results are quite different from AutoAWQ-quantized models!
I am struggling to find a way to serve a single quantized model with several adapters in TRT-LLM.
Currently, it seems you need to start adapter fine-tuning from a GPTQ model. If you fine-tune adapters using AWQ or BNB, the results are completely different once you land on TRT-LLM!!

The problem is that you can't apply the convert_checkpoint.py script to an AutoAWQ checkpoint. You need to apply the quantize.py script, which starts from the unquantized model and goes straight to a TRT checkpoint. Something is broken here, or AutoAWQ and ModelOpt simply use different quantization algorithms, leading to different results once you apply the fine-tuned adapter.
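For context, this is roughly what the AutoAWQ path being discussed looks like; a hedged sketch where the model name, quant_config values, and output directory are placeholders:

```python
# Hedged sketch of AutoAWQ quantization; paths and config values are placeholders.
from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer

model_id = "meta-llama/Llama-2-7b-hf"  # placeholder
quant_config = {"zero_point": True, "q_group_size": 128, "w_bit": 4, "version": "GEMM"}

model = AutoAWQForCausalLM.from_pretrained(model_id)
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)

model.quantize(tokenizer, quant_config=quant_config)  # AWQ calibration + weight quantization
model.save_quantized("llama-2-7b-awq")                # HF-style quantized checkpoint
tokenizer.save_pretrained("llama-2-7b-awq")
```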

@Barry-Delaney
Collaborator

Hi @pei0033 @lodm94, we have just updated the conversion script for LLaMA family checkpoints quantized with AutoAWQ / AutoGPTQ (with desc_act = False); please refer to the LLaMA AWQ and GPTQ documentation.
Please feel free to comment, and thanks for your feedback!
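As a hedged illustration of what desc_act = False refers to on the AutoGPTQ side (the model name, calibration text, and output directory are placeholders, not values from the documentation):

```python
# Hedged sketch: produce an AutoGPTQ checkpoint with desc_act=False so the
# updated TRT-LLM conversion script can consume it. Names are placeholders.
from transformers import AutoTokenizer
from auto_gptq import AutoGPTQForCausalLM, BaseQuantizeConfig

model_id = "meta-llama/Llama-2-7b-hf"  # placeholder
quantize_config = BaseQuantizeConfig(
    bits=4,
    group_size=128,
    desc_act=False,  # activation-order reordering off, as required by the conversion path
)

tokenizer = AutoTokenizer.from_pretrained(model_id)
examples = [tokenizer("TensorRT-LLM converts HF checkpoints into engines.", return_tensors="pt")]

model = AutoGPTQForCausalLM.from_pretrained(model_id, quantize_config)
model.quantize(examples)  # GPTQ calibration over the example batch
model.save_quantized("llama-2-7b-gptq", use_safetensors=True)
tokenizer.save_pretrained("llama-2-7b-gptq")
```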

@pei0033
Contributor Author

pei0033 commented Dec 18, 2024

Thanks for the replies!!

@lodm94

lodm94 commented Jan 9, 2025

Hi @pei0033 @lodm94, we have just updated the conversion script for LLaMA family checkpoints quantized with AutoAWQ / AutoGPTQ (with desc_act = False); please refer to the LLaMA AWQ and GPTQ documentation. Please feel free to comment, and thanks for your feedback!

Hi @Barry-Delaney!
It seems that:

  • Checkpoints from AutoAWQ can be converted and then built in TRT-LLM, and it seems to work. Any performance loss is still being assessed.
  • Checkpoints from AutoGPTQ can be converted and then built in TRT-LLM, and it seems to work, BUT AutoGPTQ does not support quantizing the Llama 3.x family (its last commit is really old). With the Mistral/Mistral-Nemo models I managed to convert and build the checkpoint. Any performance drop is still to be assessed.

So I tried to convert and build a Llama 3.1 8B checkpoint quantized with the GPTQModel library. That project is newer and can quantize models from the Llama 3.x family. But when I build and run it in TRT-LLM, the output is completely out of control:

Output [Text 0 Beam 0]: "<|reserved_special_token_247|><|reserved_special_token_247|><|reserved_special_token_247|> ... <|reserved_special_token_247|>" (the same reserved token repeated for the entire output)

Is it OK to assume that the convert_checkpoint.py script works ONLY with the AutoGPTQ/AutoAWQ libraries, and not with just any GPTQ/AWQ model checkpoint?

I do not want to use NVIDIA ModelOpt for quantization because I need the quantized checkpoint both to build engines AND to fine-tune adapters. At the moment, only AWQ quantization through AutoAWQ allows this approach.

Is that correct?

For clarity, my environment is the Triton Server NGC container, tag 24.12, with TRT-LLM version 0.16.0.

@Barry-Delaney
Collaborator

Is it OK to assume that the convert_checkpoint.py script works ONLY with the AutoGPTQ/AutoAWQ libraries, and not with just any GPTQ/AWQ model checkpoint?
I do not want to use NVIDIA ModelOpt for quantization because I need the quantized checkpoint both to build engines AND to fine-tune adapters. At the moment, only AWQ quantization through AutoAWQ allows this approach.
Is that correct?

@lodm94 the answer is yes. For a linear layer with GEMM shape [M, N, K], we need these components in the TRT-LLM layer:

| Name | Dtype | Shape | Layout |
|------|-------|-------|--------|
| {LAYER_NAME}.weight | float16 | [K, N / 4] | Interleaved and packed INT4 |
| {LAYER_NAME}.weight_scaling_factor | float16/bfloat16 | [K / group_size, N] | Row-major |
| {LAYER_NAME}.zero (for asymmetric quantization) | float16/bfloat16 | [K / group_size, N] | Row-major |
| {LAYER_NAME}.activation_scaling_factor (for ModelOpt AWQ) | float16 | [K] | Row-major |
| {LAYER_NAME}.alpha (for ModelOpt W4A8 AWQ) | float32 | [1] | - |

And here's the layout of AutoGPTQ:

| Name | Dtype | Shape | Layout |
|------|-------|-------|--------|
| {LAYER_NAME}.qweight | int32 | [K / 8, N] | Packed INT4 |
| {LAYER_NAME}.scales | float16 | [K / group_size, N] | Row-major |
| {LAYER_NAME}.qzeros | int32 | [K / group_size, N / 8] | Packed INT4 |

So our conversion script basically unpacks and interleaves the weights, and unpacks and recomputes the zeros. You can check this in postprocess_weight_only_groupwise(). Whether a given kind of checkpoint is usable depends on whether its components can be converted correctly into the TRT-LLM ones listed above.
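To make the unpacking step concrete, here is a small NumPy reference. This is an illustration only, not the TRT-LLM code path: the nibble order and the classic "stored zero minus one" convention are assumptions about the older AutoGPTQ layout, and the TRT-LLM-specific interleaving done in postprocess_weight_only_groupwise() is not reproduced here.

```python
import numpy as np

def unpack_int4_rows(packed: np.ndarray) -> np.ndarray:
    """Unpack int32-packed INT4 values along axis 0 (qweight: [K/8, N] -> [K, N])."""
    shifts = np.arange(0, 32, 4, dtype=np.uint32)  # 8 nibbles per int32, low bits first (assumed)
    nibbles = (packed.astype(np.uint32)[:, None, :] >> shifts[None, :, None]) & 0xF
    return nibbles.reshape(packed.shape[0] * 8, packed.shape[1])

def unpack_int4_cols(packed: np.ndarray) -> np.ndarray:
    """Unpack along axis 1 (qzeros: [K/group_size, N/8] -> [K/group_size, N])."""
    shifts = np.arange(0, 32, 4, dtype=np.uint32)
    nibbles = (packed.astype(np.uint32)[:, :, None] >> shifts[None, None, :]) & 0xF
    return nibbles.reshape(packed.shape[0], packed.shape[1] * 8)

def dequantize_reference(qweight, qzeros, scales, group_size=128):
    """Sanity-check dequantization: w[k, n] = (q[k, n] - zero[g, n]) * scale[g, n]."""
    q = unpack_int4_rows(qweight).astype(np.float32)           # [K, N]
    zeros = unpack_int4_cols(qzeros).astype(np.float32) + 1.0  # older AutoGPTQ stores zero - 1 (assumed)
    g = np.arange(q.shape[0]) // group_size                    # group index per row along K
    return (q - zeros[g]) * scales.astype(np.float32)[g]
```

Comparing such a reference against the dequantized weights of the original HF checkpoint is a quick way to tell whether a non-AutoGPTQ producer (e.g. GPTQModel) uses a different nibble order or zero-point convention.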

It seems your failure comes from an incorrect conversion. You can choose from:

  • Check the layout of your quantized checkpoint and modify postprocess_weight_only_groupwise() manually (a small inspection sketch follows this list).
  • Modify the legacy path load_weights_from_gptq() and set TRTLLM_DISABLE_UNIFIED_CONVERTER=1.
  • Create a separate script to convert your local checkpoints to the AutoAWQ/AutoGPTQ format, and then use the same commands.
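For the first option, a small safetensors walk is usually enough to compare a local checkpoint against the AutoGPTQ layout table above; the file path below is a placeholder:

```python
# Hedged helper: dump the names, dtypes and shapes of the quantization tensors
# in a local checkpoint so they can be compared with the expected AutoGPTQ layout.
from safetensors import safe_open

ckpt_file = "llama-3.1-8b-gptqmodel/model.safetensors"  # placeholder path

with safe_open(ckpt_file, framework="pt") as f:
    for name in sorted(f.keys()):
        if any(key in name for key in ("qweight", "qzeros", "scales", "g_idx")):
            t = f.get_tensor(name)
            print(f"{name:60s} {str(t.dtype):10s} {tuple(t.shape)}")
```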

Also, feature requests for us and for ModelOpt are welcome. ^^
