Is it possible to load a quantized model from Hugging Face? #2458

Closed
pei0033 opened this issue Nov 19, 2024 · 6 comments
@pei0033
Contributor

pei0033 commented Nov 19, 2024

Is there any way to load a quantized model directly from Hugging Face and convert it to a TensorRT-LLM checkpoint (or engine) without calibration?
I could find a script for AutoGPTQ, but I couldn't find one for other quantization methods (like AutoAWQ, CompressedTensors or BNB).

@hello-11 added the question, triaged, and Low Precision labels on Nov 20, 2024
@Tracin
Collaborator

Tracin commented Nov 20, 2024

I think TRT-LLM currently supports loading from AutoGPTQ and QServe (W4A8). AWQ can be applied via nvidia-modelopt.
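For reference, a minimal sketch of the nvidia-modelopt AWQ path, under the assumption that modelopt.torch.quantization exposes mtq.quantize() and an INT4_AWQ_CFG preset; the model name and calibration prompt are placeholders, and the end-to-end flow is normally wrapped by examples/quantization/quantize.py in the TRT-LLM repo:

```python
# Hedged sketch: quantize an HF model to INT4 AWQ with nvidia-modelopt.
# Assumes mtq.quantize() and mtq.INT4_AWQ_CFG exist; names are placeholders.
import modelopt.torch.quantization as mtq
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "meta-llama/Llama-2-7b-hf"  # placeholder
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype="auto", device_map="auto")
tokenizer = AutoTokenizer.from_pretrained(model_id)

def forward_loop(m):
    # AWQ needs a short calibration pass; one prompt is enough for a smoke test.
    for prompt in ["TensorRT-LLM is an inference toolkit."]:
        inputs = tokenizer(prompt, return_tensors="pt").to(m.device)
        m(**inputs)

model = mtq.quantize(model, mtq.INT4_AWQ_CFG, forward_loop)
# Exporting the result to a TRT-LLM checkpoint is handled by the quantize.py example.
```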

@lodm94

lodm94 commented Nov 21, 2024

I think TRT-LLM currently supports loading from AutoGPTQ and QServe (W4A8). AWQ can be applied via nvidia-modelopt.

Even if you can do AWQ quantization with nvidia-modelopt, the results are quite different from AutoAWQ-quantized models!
I am struggling to find a way to serve a single quantized model with several adapters in TRT-LLM.
Currently, it seems you need to start adapter fine-tuning from a GPTQ model. If you fine-tune adapters using AWQ or BNB, the results are completely different once you land on TRT-LLM!!

The problem is that you can't apply the convert_checkpoint.py script to an AutoAWQ checkpoint. You need to apply the quantize.py script, which starts from the unquantized model and goes straight to a TRT checkpoint. Something is broken here, or AutoAWQ and ModelOpt simply use different quantization algorithms, leading to different results once you apply the fine-tuned adapter.
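For context, this is roughly what the AutoAWQ path being discussed looks like; a hedged sketch where the model name, quant_config values, and output directory are placeholders:

```python
# Hedged sketch of AutoAWQ quantization; paths and config values are placeholders.
from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer

model_id = "meta-llama/Llama-2-7b-hf"  # placeholder
quant_config = {"zero_point": True, "q_group_size": 128, "w_bit": 4, "version": "GEMM"}

model = AutoAWQForCausalLM.from_pretrained(model_id)
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)

model.quantize(tokenizer, quant_config=quant_config)  # AWQ calibration + weight quantization
model.save_quantized("llama-2-7b-awq")                # HF-style quantized checkpoint
tokenizer.save_pretrained("llama-2-7b-awq")
```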

@Barry-Delaney
Collaborator

Hi @pei0033 @lodm94, we have just updated the conversion script for LLaMA family checkpoints quantized with AutoAWQ / AutoGPTQ (with desc_act = False); please refer to the LLaMA AWQ and GPTQ documentation.
Please feel free to comment, and thanks for your feedback!
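As a hedged illustration of what desc_act = False refers to on the AutoGPTQ side (the model name, calibration text, and output directory are placeholders, not values from the documentation):

```python
# Hedged sketch: produce an AutoGPTQ checkpoint with desc_act=False so the
# updated TRT-LLM conversion script can consume it. Names are placeholders.
from transformers import AutoTokenizer
from auto_gptq import AutoGPTQForCausalLM, BaseQuantizeConfig

model_id = "meta-llama/Llama-2-7b-hf"  # placeholder
quantize_config = BaseQuantizeConfig(
    bits=4,
    group_size=128,
    desc_act=False,  # activation-order reordering off, as required by the conversion path
)

tokenizer = AutoTokenizer.from_pretrained(model_id)
examples = [tokenizer("TensorRT-LLM converts HF checkpoints into engines.", return_tensors="pt")]

model = AutoGPTQForCausalLM.from_pretrained(model_id, quantize_config)
model.quantize(examples)  # GPTQ calibration over the example batch
model.save_quantized("llama-2-7b-gptq", use_safetensors=True)
tokenizer.save_pretrained("llama-2-7b-gptq")
```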

@pei0033
Contributor Author

pei0033 commented Dec 18, 2024

Thanks for the replies!!

@lodm94

lodm94 commented Jan 9, 2025

Hi @pei0033 @lodm94, we have just updated the conversion script for LLaMA family checkpoints quantized with AutoAWQ / AutoGPTQ (with desc_act = False); please refer to the LLaMA AWQ and GPTQ documentation. Please feel free to comment, and thanks for your feedback!

Hi @Barry-Delaney!
It seems that:

  • Checkpoints from AutoAWQ can be converted and then built in TRT-LLM, and it seems to work. Any performance loss is still being assessed.
  • Checkpoints from AutoGPTQ can be converted and then built in TRT-LLM, and it seems to work, BUT AutoGPTQ does not support quantizing the Llama 3.x family (its last commit is really old). With the Mistral/Mistral-Nemo models I managed to convert and build the checkpoint. Any performance drop is still to be assessed.

So I tried to convert and build a Llama 3.1 8B checkpoint quantized with the GPTQModel library. That project is newer and can quantize models from the Llama 3.x family. But when I build and run it in TRT-LLM, the output is completely out of control:

Output [Text 0 Beam 0]: "<|reserved_special_token_247|><|reserved_special_token_247|><|reserved_special_token_247|> ... <|reserved_special_token_247|>" (the same reserved token repeated for the entire output)

Is it OK to assume that the convert_checkpoint.py script works ONLY with the AutoGPTQ/AutoAWQ libraries, and not with just any GPTQ/AWQ model checkpoint?

I do not want to use NVIDIA ModelOpt for quantization because I need the quantized checkpoint both to build engines AND to fine-tune adapters. At the moment, only AWQ quantization through AutoAWQ allows this approach.

Is that correct?

For clarity, my environment is the Triton Server NGC container, tag 24.12, with TRT-LLM version 0.16.0.

@Barry-Delaney
Collaborator

Is it OK to assume that the convert_checkpoint.py script works ONLY with the AutoGPTQ/AutoAWQ libraries, and not with just any GPTQ/AWQ model checkpoint?
I do not want to use NVIDIA ModelOpt for quantization because I need the quantized checkpoint both to build engines AND to fine-tune adapters. At the moment, only AWQ quantization through AutoAWQ allows this approach.
Is that correct?

@lodm94 the answer is yes. For a linear layer with GEMM shape [M, N, K], we need these components in the TRT-LLM layer:

| Name | Dtype | Shape | Layout |
|------|-------|-------|--------|
| {LAYER_NAME}.weight | float16 | [K, N / 4] | Interleaved and packed INT4 |
| {LAYER_NAME}.weight_scaling_factor | float16/bfloat16 | [K / group_size, N] | Row-major |
| {LAYER_NAME}.zero (for asymmetric quantization) | float16/bfloat16 | [K / group_size, N] | Row-major |
| {LAYER_NAME}.activation_scaling_factor (for ModelOpt AWQ) | float16 | [K] | Row-major |
| {LAYER_NAME}.alpha (for ModelOpt W4A8 AWQ) | float32 | [1] | - |

And here's the layout of AutoGPTQ:

| Name | Dtype | Shape | Layout |
|------|-------|-------|--------|
| {LAYER_NAME}.qweight | int32 | [K / 8, N] | Packed INT4 |
| {LAYER_NAME}.scales | float16 | [K / group_size, N] | Row-major |
| {LAYER_NAME}.qzeros | int32 | [K / group_size, N / 8] | Packed INT4 |

So our conversion script basically unpacks and interleaves the weights, and unpacks and recomputes the zeros. You can check this in postprocess_weight_only_groupwise(). Whether a given kind of checkpoint is usable depends on whether its components can be converted correctly into the TRT-LLM ones listed above.
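To make the unpacking step concrete, here is a small NumPy reference. This is an illustration only, not the TRT-LLM code path: the nibble order and the classic "stored zero minus one" convention are assumptions about the older AutoGPTQ layout, and the TRT-LLM-specific interleaving done in postprocess_weight_only_groupwise() is not reproduced here.

```python
import numpy as np

def unpack_int4_rows(packed: np.ndarray) -> np.ndarray:
    """Unpack int32-packed INT4 values along axis 0 (qweight: [K/8, N] -> [K, N])."""
    shifts = np.arange(0, 32, 4, dtype=np.uint32)  # 8 nibbles per int32, low bits first (assumed)
    nibbles = (packed.astype(np.uint32)[:, None, :] >> shifts[None, :, None]) & 0xF
    return nibbles.reshape(packed.shape[0] * 8, packed.shape[1])

def unpack_int4_cols(packed: np.ndarray) -> np.ndarray:
    """Unpack along axis 1 (qzeros: [K/group_size, N/8] -> [K/group_size, N])."""
    shifts = np.arange(0, 32, 4, dtype=np.uint32)
    nibbles = (packed.astype(np.uint32)[:, :, None] >> shifts[None, None, :]) & 0xF
    return nibbles.reshape(packed.shape[0], packed.shape[1] * 8)

def dequantize_reference(qweight, qzeros, scales, group_size=128):
    """Sanity-check dequantization: w[k, n] = (q[k, n] - zero[g, n]) * scale[g, n]."""
    q = unpack_int4_rows(qweight).astype(np.float32)           # [K, N]
    zeros = unpack_int4_cols(qzeros).astype(np.float32) + 1.0  # older AutoGPTQ stores zero - 1 (assumed)
    g = np.arange(q.shape[0]) // group_size                    # group index per row along K
    return (q - zeros[g]) * scales.astype(np.float32)[g]
```

Comparing such a reference against the dequantized weights of the original HF checkpoint is a quick way to tell whether a non-AutoGPTQ producer (e.g. GPTQModel) uses a different nibble order or zero-point convention.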

It seems your failure comes from an incorrect conversion. You can choose from:

  • Check the layout of your quantized checkpoint and modify postprocess_weight_only_groupwise() manually (a small inspection sketch follows this list).
  • Modify the legacy path load_weights_from_gptq() and set TRTLLM_DISABLE_UNIFIED_CONVERTER=1.
  • Create a separate script to convert your local checkpoints to the AutoAWQ/AutoGPTQ format, and then use the same commands.
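For the first option, a small safetensors walk is usually enough to compare a local checkpoint against the AutoGPTQ layout table above; the file path below is a placeholder:

```python
# Hedged helper: dump the names, dtypes and shapes of the quantization tensors
# in a local checkpoint so they can be compared with the expected AutoGPTQ layout.
from safetensors import safe_open

ckpt_file = "llama-3.1-8b-gptqmodel/model.safetensors"  # placeholder path

with safe_open(ckpt_file, framework="pt") as f:
    for name in sorted(f.keys()):
        if any(key in name for key in ("qweight", "qzeros", "scales", "g_idx")):
            t = f.get_tensor(name)
            print(f"{name:60s} {str(t.dtype):10s} {tuple(t.shape)}")
```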

Also, feature requests for us and for ModelOpt are welcome. ^^
