
[Doc]: Clarify QLoRA (Quantized Model + LoRA) Support in Documentation #13179

Closed
AlexanderZhk opened this issue Feb 12, 2025 · 8 comments
Labels: documentation (Improvements or additions to documentation)

AlexanderZhk commented Feb 12, 2025

📚 The doc issue

Two parts of the documentation appear to contradict each other, especially at first glance.

Here, it is explicitly stated that LoRA inference with a quantized model is not supported:

##### LORA and quantization
Both are not supported yet! Make sure to open an issue and we'll work on this together with the `transformers` team!

However, here, an example is provided for running offline inference with a quantized model and a LoRA adapter:

This example shows how to use LoRA with different quantization techniques
for offline inference.

To resolve this confusion, it would be very helpful to clarify the following points directly (please correct me if I am mistaken):

  1. QLoRA is supported, but only for offline inference. This means you cannot dynamically load LoRA adapters after loading the quantized base model (a minimal sketch follows this list).
  2. QLoRA is not supported with the OpenAI-compatible server, even for a single LoRA-base model pair.
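
For concreteness, here is a minimal offline sketch of what I mean in point 1. The model name, adapter path, and adapter name are placeholders, and I am assuming the `quantization`, `enable_lora`, and `max_lora_rank` arguments behave as described in the vLLM docs (they may differ between versions):

```python
from vllm import LLM, SamplingParams
from vllm.lora.request import LoRARequest

# Placeholders: any quantized base model (AWQ/GPTQ/bitsandbytes) plus a LoRA
# adapter trained against that base model.
llm = LLM(
    model="path/or/hub-id-of-quantized-base-model",
    quantization="awq",   # match whatever scheme the checkpoint uses
    enable_lora=True,
    max_lora_rank=64,
)

sampling_params = SamplingParams(temperature=0.0, max_tokens=64)

# The adapter is supplied per request as LoRARequest(name, id, path).
outputs = llm.generate(
    ["Give me a short introduction to large language models."],
    sampling_params,
    lora_request=LoRARequest("my_adapter", 1, "path/to/lora-adapter"),
)

for output in outputs:
    print(output.outputs[0].text)
```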

Edit:

It's easy to miss on the docs site that ##### LORA and quantization is a subsection of ### Transformers fallback; that's why I was confused.

### Transformers fallback

#### Supported features
##### LORA and quantization

AlexanderZhk added the documentation label Feb 12, 2025

jeejeelee (Collaborator) commented Feb 13, 2025

I think this means that the Transformers fallback doesn't support these two features. For models integrated with vLLM, we support QLoRA.

BTW, after #13166 landed, I think the Transformers fallback can support LoRA directly. cc @Isotr0py @hmellor

AlexanderZhk (Author)

> For models integrated with vLLM, we support QLoRA.

It would be great if you could point me to a more specific example; my understanding of vLLM/transformers isn't too deep.

Take qwen2, for example: it is integrated (if I understand correctly) in vllm/model_executor/models/qwen2.py.
However, running a quantized qwen2 model with vllm serve is not supported.
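
For reference, the kind of invocation I am talking about looks roughly like this (model and adapter paths are placeholders; the flag names reflect my reading of the vLLM CLI docs and may differ between versions):

```bash
vllm serve path/or/hub-id-of-quantized-qwen2-model \
    --quantization awq \
    --enable-lora \
    --lora-modules my_adapter=path/to/lora-adapter
```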

jeejeelee (Collaborator)

Could you please provide more detailed information, such as logs and errors?

AlexanderZhk (Author)

> Could you please provide more detailed information, such as logs and errors?

That's partly why I created the issue: it does load, but why does the documentation state otherwise? Did it just not get updated? Are there issues we need to be aware of when running QLoRA currently?

[screenshot attached]

hmellor (Member) commented Feb 14, 2025

The documentation does not state otherwise.

The documentation explicitly states that quantisation and LoRA are not compatible together with the Transformers fallback.

AlexanderZhk (Author)

I see now, thanks for clarifying. It's easy to miss on the docs site that ##### LORA and quantization is a subsection of ### Transformers fallback:

### Transformers fallback

#### Supported features
##### LORA and quantization

hmellor (Member) commented Feb 15, 2025

Ok, we should make that clearer. Thank you for the feedback!

hmellor (Member) commented Feb 17, 2025

The documentation change in #12960 should help with this.

hmellor closed this as completed Feb 17, 2025