Inference tutorial - Part 3 of e2e series #2343
Conversation
Dr. CI: As of commit ccc2932 with merge base 2898903: 1 new failure, 2 cancelled jobs (please retry the cancelled jobs). See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/ao/2343. Note: links to docs will display an error until the docs builds have completed. This comment updates every 15 minutes.
Hi @jainapurva, by the way I'm adding a [image]
Force-pushed from b93b892 to ce675b8
docs/source/inference.rst (outdated)

.. note::

   For more information on supported quantization and sparsity configurations, see `HF-Torchao Docs <https://huggingface.co/docs/transformers/main/en/quantization/torchao>`_.

Inference with vLLM
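The note above points readers to the supported quantization configurations. As a rough illustration of what a weight-only int8 configuration does to a Linear weight (a plain-PyTorch sketch, not the actual torchao implementation), symmetric per-output-channel scales are computed and the weight is rounded to int8:

```python
import torch

# Illustrative sketch of per-channel weight-only int8 quantization,
# the kind of transform an int8 weight-only config applies to Linear weights.
w = torch.randn(4, 8)  # a Linear weight of shape [out_features, in_features]

# Symmetric per-output-channel scale so the channel max maps to 127.
scale = w.abs().amax(dim=1, keepdim=True) / 127.0
w_int8 = torch.clamp((w / scale).round(), -127, 127).to(torch.int8)

# Dequantize at (or before) matmul time.
w_deq = w_int8.to(torch.float32) * scale
max_err = (w - w_deq).abs().max().item()
print(max_err <= scale.max().item())  # error bounded by one quantization step
```

Rounding to the nearest representable value bounds the per-element error by half a quantization step, which is why weight-only int8 typically preserves accuracy well.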
For this section, can you replace it with https://huggingface.co/pytorch/Qwen3-8B-int4wo-hqq#inference-with-vllm? It might be easier to use the command line compared to code.
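A command-line version could look like the following sketch, assuming vLLM is installed and using the checkpoint name from the linked model card (the flags are illustrative, not taken from the PR):

```shell
# Serve the TorchAO-quantized checkpoint with vLLM's OpenAI-compatible server.
vllm serve pytorch/Qwen3-8B-int4wo-hqq --host 0.0.0.0 --port 8000
```

Once the server is up, any OpenAI-compatible client can send completion requests to it, which keeps the tutorial free of vLLM-specific Python code.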
Looks great. Overall, I feel we should add some more text between the code blocks so it reads more like a tutorial, and remove some duplicate code, which is distracting to readers.
docs/source/serving.rst (outdated)

Step 1: Untie Embedding Weights
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

We want to quantize the embedding and lm_head differently. Since those layers are tied, we first need to untie the model:
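The untying step can be sketched on a toy module (the `TinyLM` class below is hypothetical, not the actual Phi-4 code): weight tying means `lm_head` and the embedding share a single tensor, so giving `lm_head` its own copy breaks the tie:

```python
import torch
import torch.nn as nn

class TinyLM(nn.Module):
    """Hypothetical toy model with tied embedding and lm_head weights."""
    def __init__(self, vocab_size=16, dim=8):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, dim)
        self.lm_head = nn.Linear(dim, vocab_size, bias=False)
        self.lm_head.weight = self.embed.weight  # tied: one shared tensor

model = TinyLM()
print(model.lm_head.weight is model.embed.weight)  # True: tied

# Untie: give lm_head its own independent copy so it can be
# quantized with a different config than the embedding.
model.lm_head.weight = nn.Parameter(model.embed.weight.detach().clone())
print(model.lm_head.weight is model.embed.weight)  # False: untied
```

After untying, applying one quantization config to the embedding no longer affects `lm_head`, since the two parameters are now separate tensors.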
Is this step actually necessary? I don't think I had to do any of this for Llama models, for example. Can you share the source for this?
I'm using the same steps as here: https://huggingface.co/pytorch/Phi-4-mini-instruct-8da4w. In case of any updates, we should update both the model card and the tutorial with the same instructions.
Last tutorial of the 3-part series on using TorchAO in the model lifecycle.