
How can I quantize a custom model with TensorRT-LLM? Do I need to write C++ code? Any examples? Thank you for your time and help. #2718

Open
DelongYang666 opened this issue Jan 24, 2025 · 3 comments
Labels
triaged Issue has been triaged by maintainers

Comments

@DelongYang666
No description provided.

@nv-guomingz
Collaborator

Step 0. https://nvidia.github.io/TensorRT-LLM/architecture/add-model.html
Step 1. Implement your private op if needed (for C++, refer to the plugin implementation; this is not a must-have).
Step 2. Quantize the model with a recipe such as int4/int8 weight-only or int8 SmoothQuant.
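For context on Step 2, the int8 weight-only recipe boils down to per-channel symmetric quantization of the weight matrix. Below is a minimal pure-Python sketch of that arithmetic only; it is not TensorRT-LLM code (the real flow goes through the library's quantization tooling and fused kernels), just an illustration of what the recipe computes:

```python
# Illustrative sketch of int8 weight-only (per-channel, symmetric) quantization.
# This mirrors the math behind the "int8 weights only" recipe; it is NOT the
# TensorRT-LLM implementation.

def quantize_int8_weight_only(weights):
    """Quantize a 2-D weight matrix (list of rows) per output channel.

    Returns (int8 rows, per-row float scales).
    """
    q_rows, scales = [], []
    for row in weights:
        amax = max(abs(w) for w in row) or 1.0   # guard against an all-zero row
        scale = amax / 127.0                     # symmetric int8 range
        q_rows.append([max(-128, min(127, round(w / scale))) for w in row])
        scales.append(scale)
    return q_rows, scales

def dequantize(q_rows, scales):
    """Recover approximate float weights from int8 values and scales."""
    return [[q * s for q in row] for row, s in zip(q_rows, scales)]

W = [[0.5, -1.27, 0.02], [2.0, 0.1, -0.4]]
Q, S = quantize_int8_weight_only(W)
W_hat = dequantize(Q, S)
```

The point of the per-channel scale is that each output channel's dynamic range is captured independently, so one large weight does not crush the resolution of every other channel.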

@nv-guomingz nv-guomingz added the triaged Issue has been triaged by maintainers label Jan 25, 2025
@DelongYang666
Author

Thank you! I saw the document [https://nvidia.github.io/TensorRT-LLM/architecture/add-model.html] 3 hours ago, and I will try it. Also, a video on Bilibili said that fp8 may perform better than other quantization methods, but that approach depends on ModelOpt, and ModelOpt does not support quantizing my custom model. Is there any way to implement fp8?

@nv-guomingz
Collaborator

> Thank you! I saw the document [https://nvidia.github.io/TensorRT-LLM/architecture/add-model.html] 3 hours ago, and I will try it. Also, a video on Bilibili said that fp8 may perform better than other quantization methods, but that approach depends on ModelOpt, and ModelOpt does not support quantizing my custom model. Is there any way to implement fp8?

@RalphMao do you have any comments on this question?
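While waiting for a maintainer answer, it may help to note what fp8 post-training quantization computes at its core: a per-tensor scaling factor derived from a calibrated absolute maximum, since the E4M3 format's largest finite value is 448. The sketch below is a framework-free illustration of that scale-and-clamp step only (it omits the 3-bit mantissa rounding and is not ModelOpt's or TensorRT-LLM's implementation):

```python
# Sketch of the per-tensor scaling at the heart of fp8 (E4M3) quantization.
# Values are scaled so the calibrated amax maps onto E4M3's representable
# range, then clamped. Illustrative only -- mantissa rounding is omitted,
# and this is not the ModelOpt implementation.

E4M3_MAX = 448.0  # largest finite E4M3 value

def fp8_scale(amax):
    """Scale mapping the observed absolute maximum onto E4M3's range."""
    return amax / E4M3_MAX if amax > 0 else 1.0

def fake_quant_fp8(values, scale):
    """Simulate the fp8 round trip: scale down, clamp, scale back up."""
    out = []
    for v in values:
        x = max(-E4M3_MAX, min(E4M3_MAX, v / scale))
        out.append(x * scale)
    return out

acts = [0.03, -7.5, 112.0, -448.0, 600.0]
scale = fp8_scale(max(abs(a) for a in acts))  # calibration amax = 600.0
q = fake_quant_fp8(acts, scale)
```

If the calibration amax underestimates the true range (e.g. a scale computed from amax = 448 applied to a value of 1000), the out-of-range value saturates at 448 * scale, which is why calibration quality matters so much for fp8.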
