Inference tutorial - Part 3 of e2e series #2343

Merged: 26 commits merged into main from inference_tutorial on Jul 1, 2025

Conversation

Contributor

@jainapurva jainapurva commented Jun 9, 2025

The last tutorial in the three-part series on using TorchAO across the model lifecycle.


pytorch-bot bot commented Jun 9, 2025

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/ao/2343

Note: Links to docs will display an error until the docs builds have been completed.

❌ 1 New Failure, 2 Cancelled Jobs

As of commit ccc2932 with merge base 2898903:

NEW FAILURE - The following job has failed:

CANCELLED JOBS - The following jobs were cancelled. Please retry:

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@facebook-github-bot facebook-github-bot added the CLA Signed label (this label is managed by the Facebook bot; authors need to sign the CLA before a PR can be reviewed) on Jun 9, 2025
@jainapurva jainapurva added the topic: documentation label (use this tag if the PR adds or improves documentation) on Jun 10, 2025
@andrewor14
Contributor

Hi @jainapurva, by the way, I'm adding a serving.rst here: #2394. It uses the same template as parts 1 and 2. After that lands, do you mind updating your PR to use that file instead? Right now it's a blank page with the template:

[Screenshot 2025-06-17 at 5:48 PM: blank page rendered from the template]

@jainapurva jainapurva force-pushed the inference_tutorial branch from b93b892 to ce675b8 Compare June 18, 2025 21:05
.. note::
   For more information on supported quantization and sparsity configurations, see `HF-TorchAO Docs <https://huggingface.co/docs/transformers/main/en/quantization/torchao>`_.

Inference with vLLM

For this section, can you replace it with https://huggingface.co/pytorch/Qwen3-8B-int4wo-hqq#inference-with-vllm?

It might be easier to do this from the command line than in code.
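The linked model card covers serving a pre-quantized checkpoint with vLLM from the command line; for completeness, a hedged sketch of the equivalent offline-inference API, assuming vLLM is installed, a GPU is available, and the checkpoint name matches the model card:

```python
from vllm import LLM, SamplingParams

# Pre-quantized int4 checkpoint from the linked model card; requires a GPU.
llm = LLM(model="pytorch/Qwen3-8B-int4wo-hqq")
params = SamplingParams(temperature=0.6, max_tokens=128)

# Generate a completion from the quantized model.
outputs = llm.generate(["What is int4 weight-only quantization?"], params)
print(outputs[0].outputs[0].text)
```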

@andrewor14 andrewor14 left a comment


Looks great! Overall I feel we should add some more text between the code blocks so it reads more like a tutorial, and remove some duplicate code, which is distracting to readers.

Step 1: Untie Embedding Weights
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

We want to quantize the embedding and lm_head differently. Since those layers are tied, we first need to untie the model:
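The tutorial's actual code block is not captured in this excerpt. As a self-contained toy sketch of what untying means, with a hypothetical `TinyLM` standing in for a real checkpoint that has `tie_word_embeddings=True`:

```python
import torch
import torch.nn as nn

class TinyLM(nn.Module):
    """Toy model with a tied embedding/lm_head, standing in for a real checkpoint."""
    def __init__(self, vocab_size=16, hidden_dim=8):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, hidden_dim)
        self.lm_head = nn.Linear(hidden_dim, vocab_size, bias=False)
        self.lm_head.weight = self.embed.weight  # weight tying: one shared Parameter

def untie_embeddings(model: TinyLM) -> TinyLM:
    # Give lm_head its own copy of the weights so the two layers can later be
    # quantized with different configs without affecting each other.
    model.lm_head.weight = nn.Parameter(model.embed.weight.detach().clone())
    return model

model = untie_embeddings(TinyLM())
assert model.lm_head.weight is not model.embed.weight  # no longer shared
```

For a real Hugging Face model the same idea applies (copy the shared weight and set `tie_word_embeddings=False` in the config); the Phi-4 model card linked below shows the exact steps.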

Is this step actually necessary? I don't think I had to do any of this for Llama models, for example. Can you share the source for this?

Contributor Author

I'm using the same steps as here: https://huggingface.co/pytorch/Phi-4-mini-instruct-8da4w. In case of any updates, we should update both the model card and the tutorial with the same instructions.

@jainapurva jainapurva marked this pull request as ready for review July 1, 2025 19:59
@jainapurva jainapurva changed the title from Inference tutorial - Part 3 of e2e series to [WIP] Inference tutorial - Part 3 of e2e series on Jul 1, 2025
@jainapurva jainapurva merged commit 09f0d6c into main Jul 1, 2025
19 of 21 checks passed