Is it possible to load a quantized model from Hugging Face? #2458
I think TRT-LLM currently supports loading from AutoGPTQ and QServe (W4A8). AWQ can be applied via nvidia-modelopt.
Even if you can quantize with AWQ using nvidia-modelopt, the results are quite different from AutoAWQ-quantized models! The problem is that you can't apply the convert_checkpoint.py script to an AutoAWQ checkpoint. You have to use the quantize.py script, which starts from the unquantized model and goes straight to a TRT-LLM checkpoint. Either something is broken here, or AutoAWQ and ModelOpt simply use different quantization algorithms, which leads to different results once you apply the fine-tuned adapter.
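For reference, the nvidia-modelopt path that quantize.py wraps looks roughly like the sketch below. This is only an outline under my assumptions: the `INT4_AWQ_CFG` config and the `export_tensorrt_llm_checkpoint` helper are taken from the public ModelOpt examples, and exact names and arguments may differ between versions.

```python
# Rough sketch of the ModelOpt AWQ path (not the exact quantize.py code).
# Assumptions: mtq.INT4_AWQ_CFG and export_tensorrt_llm_checkpoint exist as in
# the ModelOpt examples; the model id below is just an illustrative example.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
import modelopt.torch.quantization as mtq
from modelopt.torch.export import export_tensorrt_llm_checkpoint

model_id = "meta-llama/Llama-3.1-8B-Instruct"  # example model
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.float16).cuda()
tokenizer = AutoTokenizer.from_pretrained(model_id)

def calib_loop(m):
    # Run a small calibration set through the model so AWQ can collect activation stats.
    for text in ["calibration sample 1", "calibration sample 2"]:
        ids = tokenizer(text, return_tensors="pt").input_ids.cuda()
        m(ids)

model = mtq.quantize(model, mtq.INT4_AWQ_CFG, calib_loop)

# Export a TensorRT-LLM checkpoint directory that trtllm-build can consume.
export_tensorrt_llm_checkpoint(model, decoder_type="llama",
                               dtype=torch.float16, export_dir="./tllm_ckpt")
```

Note this path never touches an AutoAWQ checkpoint at all, which is exactly why the two sets of quantized weights can diverge.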
Thanks for the replies!!
Hi @Barry-Delaney!
So, I tried to convert and build a Llama 3.1 8B checkpoint quantized with the GPTQModel library. That project is newer and can quantize models from the Llama 3.x family, but when I build and run the engine in TRT-LLM the output is completely garbled.
Is it okay to assume that the convert_checkpoint.py script works ONLY with AutoGPTQ/AutoAWQ checkpoints and not with just any GPTQ/AWQ model checkpoint? I don't want to use the NVIDIA Model Optimizer for quantization because I need the quantized checkpoint both to build engines AND to fine-tune adapters, and at the moment only AWQ quantization through AutoAWQ allows that workflow. Is that correct? For clarity, my environment is the triton-server NGC container, tag 24.12, with TRT-LLM version 0.16.0.
@lodm94 the answer is yes. For a linear layer with GEMM shape [M, N, K], we need these components in the TRT-LLM layer:
And here's the layout of AutoGPTQ:
So our conversion script is basically unpacking and interleaving the weights, and unpacking and recalculating the zeros. You can check these steps in the conversion code. It seems your failure comes from an incorrect conversion. You can choose from:
Also, creating a feature request for us and ModelOpt is welcome. ^^
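For illustration, here is a rough sketch of the unpacking step, assuming the commonly documented AutoGPTQ layout (qweight packed as 8 x 4-bit values per int32 along the K dimension, qzeros packed along the N dimension, per-group fp16 scales). It ignores the act-order / g_idx reordering and the interleaving that TRT-LLM's kernels expect, and the zero-point offset is a convention that differs between packers, so treat it as a sketch rather than the actual conversion code.

```python
# Sketch of unpacking 4-bit AutoGPTQ-style tensors into plain tensors.
# Assumed shapes (verify against your checkpoint):
#   qweight: int32 [K // 8, N], 8 nibbles per int32 along dim 0
#   qzeros:  int32 [K // group_size, N // 8], 8 nibbles per int32 along dim 1
#   scales:  fp16  [K // group_size, N]
import torch

def unpack_int32_nibbles(packed: torch.Tensor, dim: int) -> torch.Tensor:
    """Unpack 8 x 4-bit values out of each int32 along `dim`."""
    shifts = torch.arange(0, 32, 4, device=packed.device, dtype=torch.int32)
    if dim == 0:
        # [K // 8, N] -> [K // 8, 8, N] -> [K, N]
        unpacked = (packed.unsqueeze(1) >> shifts.view(1, -1, 1)) & 0xF
        return unpacked.reshape(-1, packed.shape[1])
    else:
        # [G, N // 8] -> [G, N // 8, 8] -> [G, N]
        unpacked = (packed.unsqueeze(-1) >> shifts.view(1, 1, -1)) & 0xF
        return unpacked.reshape(packed.shape[0], -1)

def dequantize_gptq(qweight, qzeros, scales, group_size, zero_offset=1):
    """Reconstruct an fp16 [K, N] weight from AutoGPTQ-style tensors.

    zero_offset=1 follows the classic AutoGPTQ convention (zeros stored minus 1);
    set it to 0 if your packer does not apply that offset -- this mismatch is one
    plausible cause of garbage output after conversion.
    """
    w = unpack_int32_nibbles(qweight, dim=0).to(torch.float16)                    # [K, N]
    z = (unpack_int32_nibbles(qzeros, dim=1) + zero_offset).to(torch.float16)     # [G, N]
    # Broadcast per-group zeros and scales over the K dimension.
    z = z.repeat_interleave(group_size, dim=0)                                    # [K, N]
    s = scales.to(torch.float16).repeat_interleave(group_size, dim=0)             # [K, N]
    return (w - z) * s
```

Comparing this dequantized weight against the one TRT-LLM reconstructs is a quick way to tell whether the mismatch comes from the packing convention or from the interleaving step.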
Is there any way to load a quantized model directly from Hugging Face and convert it to a TensorRT-LLM checkpoint (or engine) without calibration?
I could find some scripts for AutoGPTQ, but I couldn't find anything for other quantization methods (like AutoAWQ, CompressedTensors, or BNB).