Added documentation of using warmups to initialize lora weights #515

Open
wants to merge 1 commit into base: main

Conversation

TheCodeWrangler

This PR adds documentation for converting LoRA adapters from a Hugging Face checkpoint into a warmup that can be used with the triton-inference-server TensorRT-LLM backend.
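For context, the "warmup" referred to here is Triton's `model_warmup` stanza in `config.pbtxt`, which replays a recorded request at model load time. A minimal sketch is below; the tensor name, dims, and data file name are illustrative assumptions, not the backend's actual schema (the referenced `input_data_file` would live in the model's `warmup/` subdirectory):

```
model_warmup [
  {
    name: "lora_adapter_init"
    batch_size: 1
    inputs {
      key: "lora_weights"
      value {
        data_type: TYPE_BF16
        dims: [ 1, 224, 589824 ]
        input_data_file: "lora_weights"
      }
    }
  }
]
```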

With this approach, the client of the triton-inference-server backend never needs to supply the LoRA weights, and the weights do not have to be loaded or passed through any of the Python backend models (e.g. preprocessing), sidestepping the numpy datatype conversion (numpy does not support bfloat16).
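The bfloat16 limitation mentioned above can be worked around by carrying the raw 16-bit payload as `uint16` when weights must transit numpy. A minimal sketch (a standalone illustration, not the PR's documented procedure):

```python
import numpy as np

def float32_to_bfloat16_bits(x: np.ndarray) -> np.ndarray:
    """Truncate float32 values to their bfloat16 bit pattern (round-toward-zero).

    bfloat16 is the top 16 bits of float32 (same exponent, 7 mantissa bits),
    so a simple right shift of the raw bits produces the bfloat16 payload.
    """
    assert x.dtype == np.float32
    return (x.view(np.uint32) >> 16).astype(np.uint16)

def bfloat16_bits_to_float32(bits: np.ndarray) -> np.ndarray:
    """Expand bfloat16 bit patterns back to exact float32 values."""
    assert bits.dtype == np.uint16
    return (bits.astype(np.uint32) << 16).view(np.float32)

# Stand-in for LoRA adapter weights; real checkpoints would be loaded
# from safetensors and converted before being written to a warmup file.
w = np.array([1.0, -2.5, 3.141592], dtype=np.float32)
bits = float32_to_bfloat16_bits(w)
roundtrip = bfloat16_bits_to_float32(bits)

# Values with short mantissas survive exactly; others lose low mantissa
# bits (bfloat16 keeps only 7), so the round trip is close but inexact.
assert np.allclose(roundtrip, w, atol=1e-2)
```

Since the `uint16` array has the same byte layout as the bfloat16 tensor, it can be serialized as-is and reinterpreted as bfloat16 on the consuming side without any lossy cast.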

@smehta2000

Tagging @kaiyux @byshiue to help triage and/or add to review board, thanks!

@TheCodeWrangler
Author

Curious to get any feedback here.

This update is also related to a performance issue I am seeing.
NVIDIA/TensorRT-LLM#1957

This PR gets results much closer to the expected outputs, but not fully in line with Hugging Face / pre-compiled results. I would love feedback on the process for preparing the adapter weights.
