
On Device Sampling #350


Closed · wants to merge 58 commits from on-device-sampling

Conversation

@quic-sanising (Contributor) commented Apr 8, 2025

✨ On Device Sampling

📌 Overview

This PR introduces On Device Sampling for QEFFAutoModelForCausalLM models, enabling sampling operations to be executed directly on the QAIC device rather than on the host CPU. This significantly reduces host-device communication overhead and improves inference throughput and scalability.



🚀 Motivation

Traditionally, sampling (e.g., greedy, top-k, top-p) is performed on the host CPU after logits are returned from the device. This approach incurs:

  • High PCIe traffic due to transferring large logits tensors of shape [batch_size, vocab_size]
  • Latency bottlenecks from CPU-bound sampling logic
  • Limited scalability due to CPU thread constraints

On Device Sampling addresses these issues by:

  • Performing sampling directly on the QAIC device
  • Returning only the selected next tokens, of shape [batch_size, 1]
  • Leveraging the device’s parallelism and optimized compute paths


⚙️ Supported Sampling Strategies

The following sampling techniques are now supported natively on the QAIC device (a host-side reference sketch follows the list):

  1. Repetition Penalty: Penalize tokens that have appeared in the prompt and the generated text so far. Values > 1 encourage the model to use new tokens, while values < 1 encourage it to repeat tokens.

  2. Presence Penalty: Penalize tokens that are present in the generated text so far. Values > 0 encourage the model to use new tokens, while values < 0 encourage it to repeat tokens.

  3. Temperature Scaling: Adjust the sharpness of the logits distribution. Lower values make the model more deterministic, higher values make it more random. Zero means greedy sampling.

  4. Top K: Sample only from the k tokens with the highest logits.

  5. Top P: Sample from the smallest set of tokens whose cumulative probability is greater than or equal to p. Must be in (0, 1]. Set to 1 to consider all tokens.

  6. Min P: The minimum probability for a token to be considered, relative to the probability of the most likely token. Must be in [0, 1]. Set to 0 to disable.

  7. Greedy Sampling: Choose the token with the highest probability.

  8. Random Sampling: Sample a token at random, weighted by its probability.
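
For intuition, here is a minimal host-side NumPy sketch of how these steps compose into a single sampling pipeline. It is illustrative only: the function and parameter names are hypothetical, and this is not the on-device kernel added by this PR.

import numpy as np

def sample_next_token(
    logits,                    # [vocab_size] raw logits for one sequence
    past_ids,                  # token ids seen so far (prompt + generated)
    repetition_penalty=1.2,
    presence_penalty=0.5,
    temperature=0.8,
    top_k=50,
    top_p=0.9,
    min_p=0.05,
):
    logits = logits.astype(np.float64).copy()
    seen = np.unique(past_ids)

    # 1-2. Repetition penalty (divide positive / multiply negative logits)
    #      and presence penalty (flat subtraction) for tokens already seen.
    logits[seen] = np.where(logits[seen] > 0,
                            logits[seen] / repetition_penalty,
                            logits[seen] * repetition_penalty)
    logits[seen] -= presence_penalty

    # 3. Temperature scaling; zero temperature means greedy sampling.
    if temperature == 0:
        return int(np.argmax(logits))
    scaled = logits / temperature
    probs = np.exp(scaled - scaled.max())
    probs /= probs.sum()

    # 4. Top-k: keep only the k most probable tokens.
    probs[probs < np.sort(probs)[-top_k]] = 0.0
    probs /= probs.sum()

    # 5. Top-p: keep the smallest set with cumulative probability >= p.
    order = np.argsort(probs)[::-1]
    keep = np.searchsorted(np.cumsum(probs[order]), top_p) + 1
    probs[order[keep:]] = 0.0

    # 6. Min-p: drop tokens below min_p times the best token's probability.
    probs[probs < min_p * probs.max()] = 0.0

    # 7-8. Random sampling from the renormalized distribution.
    probs /= probs.sum()
    return int(np.random.choice(len(probs), p=probs))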



🛠️ Implementation Details

  • Sampler Integration: Sampling logic is injected via include_sampler=True during model loading. No changes to the model architecture are required.

  • Memory Optimization: Two scratch buffers of shape [batch_size, vocab_size] are used to track token occurrences for applying repetition and presence penalties efficiently on-device (see the sketch after this list).

  • Performance Gains:

    • Reduced PCIe traffic (logits → tokens)
    • Higher throughput via device-level parallelism
    • Scalable to 64+ concurrent inference streams
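
To make the scratch-buffer idea concrete, the sketch below shows one way per-batch occurrence masks could drive the two penalties in PyTorch. It is a minimal illustration under assumed names (repetition_mask, presence_mask, apply_penalties), not the PR's actual sampler code.

import torch

batch_size, vocab_size = 16, 128256

# Scratch buffers persisted on-device across decode steps: True marks a
# token id that has occurred (prompt + output for repetition; output only
# for presence).
repetition_mask = torch.zeros(batch_size, vocab_size, dtype=torch.bool)
presence_mask = torch.zeros(batch_size, vocab_size, dtype=torch.bool)

def apply_penalties(logits, repetition_penalties, presence_penalties):
    # logits: [batch_size, vocab_size]; penalties: [batch_size, 1]
    penalized = torch.where(logits > 0,
                            logits / repetition_penalties,
                            logits * repetition_penalties)
    logits = torch.where(repetition_mask, penalized, logits)
    return logits - presence_penalties * presence_mask

def record_sampled(next_tokens):
    # next_tokens: [batch_size, 1]; mark them in both occurrence buffers.
    repetition_mask.scatter_(1, next_tokens, True)
    presence_mask.scatter_(1, next_tokens, True)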


🧪 Usage

from QEfficient import QEFFAutoModelForCausalLM as AutoModelForCausalLM

# Load model with On Device Sampler enabled. return_pdfs=False returns only
# the sampled next tokens; set it to True to also get the probability
# distributions.
qeff_model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.1-8B",
    include_sampler=True,
    return_pdfs=False,
)

# Compile as usual; no sampler-specific compile options are required.
qeff_model.compile(
    prefill_seq_length=128,
    ctx_len=256,
    full_batch_size=16,        # continuous batching with 16 concurrent sequences
    num_devices=4,
    num_speculative_tokens=0,
    mxint8_kv_cache=True,      # MXINT8-quantized KV cache
    mxfp6_matmul=True,         # MXFP6-quantized matmul weights
)

To disable On Device Sampling and revert to host-side sampling, simply set include_sampler=False.
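
An end-to-end run would then follow the library's usual generate pattern. The snippet below is an assumed continuation (the PR description stops at compile), so treat the exact generate signature as illustrative:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.1-8B")

# With include_sampler=True, sampling runs on the QAIC device and only the
# sampled next tokens ([batch_size, 1]) travel back over PCIe.
qeff_model.generate(prompts=["Hello, my name is"], tokenizer=tokenizer)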

@quic-sanising marked this pull request as ready for review April 9, 2025 04:48
@quic-amitraj marked this pull request as draft April 11, 2025 08:49
@quic-sanising marked this pull request as ready for review June 3, 2025 22:19
@quic-hemagnih (Contributor) commented:

Can you please rebase the code and check for any conflicts? The code mostly looks okay to me. We can plan the merge once all pending comments are addressed.

@quic-sanising requested a review from @ochougul June 5, 2025 16:53
@ochougul (Contributor) left a comment:


I don't see a test case. Everything else looks okay; approving.

@quic-amitraj (Contributor) commented:

@quic-sanising Could you please add the description for this PR?

@quic-amitraj (Contributor) left a review comment:


Please add docstrings to all the functions and classes.

Inline comment on the following signature:

    self,
    example_inputs: Dict[str, torch.Tensor],
    output_names: List[str],
    dynamic_axes: Dict[str, Dict[int, str]],

Could you please add a docstring to enhance clarity for other developers? Additionally, could we relocate this to the sampler folder? This will help keep modeling_auto.py streamlined and avoid unnecessary complexity.

@quic-sanising (Author) replied:

Added the docstring. Cannot relocate this function to the sampler folder, as it is a member function of QEFFAutoModelForCausalLM.

@quic-sanising (Author) commented:

@quic-sanising Could you please add the description for this PR?

Done

@quic-hemagnih (Contributor) commented:

As discussed in the meeting, we will approve and merge this PR. Please follow up with another PR that adds a unit test for this feature, to ensure it is not broken by subsequent check-ins.

@quic-sanising (Author) commented:

Moved to #440

@quic-sanising deleted the on-device-sampling branch June 18, 2025 17:04