
On Device Sampling #350


Closed · wants to merge 58 commits from on-device-sampling

Conversation

@quic-sanising (Contributor) commented Apr 8, 2025

✨ On Device Sampling

📌 Overview

This PR introduces On Device Sampling for QEFFAutoModelForCausalLM models, enabling sampling operations to be executed directly on the QAIC device rather than on the host CPU. This significantly reduces host-device communication overhead and improves inference throughput and scalability.



🚀 Motivation

Traditionally, sampling (e.g., greedy, top-k, top-p) is performed on the host CPU after logits are returned from the device. This approach incurs:

  • High PCIe traffic due to transferring large logits tensors of shape [batch_size, vocab_size]
  • Latency bottlenecks from CPU-bound sampling logic
  • Limited scalability due to CPU thread constraints

On Device Sampling addresses these issues by:

  • Performing sampling directly on the QAIC device
  • Returning only the selected next tokens, of shape [batch_size, 1]
  • Leveraging the device’s parallelism and optimized compute paths


⚙️ Supported Sampling Strategies

The following sampling techniques are now supported natively on the QAIC device (a host-side reference sketch follows the list):

  1. Repetition Penalty: Penalize tokens that have appeared in the prompt and the generated text so far. Values > 1 encourage the model to use new tokens, while values < 1 encourage it to repeat tokens.

  2. Presence Penalty: Penalize tokens that are present in the generated text so far. Values > 0 encourage the model to use new tokens, while values < 0 encourage it to repeat tokens.

  3. Temperature Scaling: Adjust the sharpness of the logits distribution. Lower values make the model more deterministic, higher values make it more random. Zero means greedy sampling.

  4. Top K: Sample only from the k tokens with the highest logits.

  5. Top P: Sample from the smallest set of tokens whose cumulative probability is greater than or equal to p. Must be in (0, 1]. Set to 1 to consider all tokens.

  6. Min P: The minimum probability for a token to be considered, relative to the probability of the most likely token. Must be in [0, 1]. Set to 0 to disable.

  7. Greedy Sampling: Choose the token with the highest probability.

  8. Random Sampling: Sample a token at random, weighted by its probability.
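
For intuition, here is a minimal host-side NumPy sketch of how these steps compose into a single sampling pipeline. It is illustrative only: the function and parameter names are hypothetical, and this is not the on-device kernel added by this PR.

import numpy as np

def sample_next_token(
    logits,                    # [vocab_size] raw logits for one sequence
    past_ids,                  # token ids seen so far (prompt + generated)
    repetition_penalty=1.2,
    presence_penalty=0.5,
    temperature=0.8,
    top_k=50,
    top_p=0.9,
    min_p=0.05,
):
    logits = logits.astype(np.float64).copy()
    seen = np.unique(past_ids)

    # 1-2. Repetition penalty (divide positive / multiply negative logits)
    #      and presence penalty (flat subtraction) for tokens already seen.
    logits[seen] = np.where(logits[seen] > 0,
                            logits[seen] / repetition_penalty,
                            logits[seen] * repetition_penalty)
    logits[seen] -= presence_penalty

    # 3. Temperature scaling; zero temperature means greedy sampling.
    if temperature == 0:
        return int(np.argmax(logits))
    scaled = logits / temperature
    probs = np.exp(scaled - scaled.max())
    probs /= probs.sum()

    # 4. Top-k: keep only the k most probable tokens.
    probs[probs < np.sort(probs)[-top_k]] = 0.0
    probs /= probs.sum()

    # 5. Top-p: keep the smallest set with cumulative probability >= p.
    order = np.argsort(probs)[::-1]
    keep = np.searchsorted(np.cumsum(probs[order]), top_p) + 1
    probs[order[keep:]] = 0.0

    # 6. Min-p: drop tokens below min_p times the best token's probability.
    probs[probs < min_p * probs.max()] = 0.0

    # 7-8. Random sampling from the renormalized distribution.
    probs /= probs.sum()
    return int(np.random.choice(len(probs), p=probs))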



🛠️ Implementation Details

  • Sampler Integration: Sampling logic is injected via include_sampler=True during model loading. No changes to the model architecture are required.

  • Memory Optimization: Two scratch buffers of shape [batch_size, vocab_size] are used to track token occurrences for applying repetition and presence penalties efficiently on-device (see the sketch after this list).

  • Performance Gains:

    • Reduced PCIe traffic (logits → tokens)
    • Higher throughput via device-level parallelism
    • Scalable to 64+ concurrent inference streams
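
To make the scratch-buffer idea concrete, the sketch below shows one way per-batch occurrence masks could drive the two penalties in PyTorch. It is a minimal illustration under assumed names (repetition_mask, presence_mask, apply_penalties), not the PR's actual sampler code.

import torch

batch_size, vocab_size = 16, 128256

# Scratch buffers persisted on-device across decode steps: True marks a
# token id that has occurred (prompt + output for repetition; output only
# for presence).
repetition_mask = torch.zeros(batch_size, vocab_size, dtype=torch.bool)
presence_mask = torch.zeros(batch_size, vocab_size, dtype=torch.bool)

def apply_penalties(logits, repetition_penalties, presence_penalties):
    # logits: [batch_size, vocab_size]; penalties: [batch_size, 1]
    penalized = torch.where(logits > 0,
                            logits / repetition_penalties,
                            logits * repetition_penalties)
    logits = torch.where(repetition_mask, penalized, logits)
    return logits - presence_penalties * presence_mask

def record_sampled(next_tokens):
    # next_tokens: [batch_size, 1]; mark them in both occurrence buffers.
    repetition_mask.scatter_(1, next_tokens, True)
    presence_mask.scatter_(1, next_tokens, True)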


🧪 Usage

from QEfficient import QEFFAutoModelForCausalLM as AutoModelForCausalLM

# Load model with On Device Sampler enabled. return_pdfs=False returns only
# the sampled next tokens; set it to True to also get the probability
# distributions.
qeff_model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.1-8B",
    include_sampler=True,
    return_pdfs=False,
)

# Compile as usual; no sampler-specific compile options are required.
qeff_model.compile(
    prefill_seq_length=128,
    ctx_len=256,
    full_batch_size=16,        # continuous batching with 16 concurrent sequences
    num_devices=4,
    num_speculative_tokens=0,
    mxint8_kv_cache=True,      # MXINT8-quantized KV cache
    mxfp6_matmul=True,         # MXFP6-quantized matmul weights
)

To disable On Device Sampling and revert to host-side sampling, simply set include_sampler=False.
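
An end-to-end run would then follow the library's usual generate pattern. The snippet below is an assumed continuation (the PR description stops at compile), so treat the exact generate signature as illustrative:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.1-8B")

# With include_sampler=True, sampling runs on the QAIC device and only the
# sampled next tokens ([batch_size, 1]) travel back over PCIe.
qeff_model.generate(prompts=["Hello, my name is"], tokenizer=tokenizer)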

@quic-sanising marked this pull request as ready for review April 9, 2025 04:48
@quic-amitraj marked this pull request as draft April 11, 2025 08:49
@quic-sanising marked this pull request as ready for review June 3, 2025 22:19
@quic-hemagnih (Contributor) commented:

Can you please rebase the code and check for any conflicts? The code mostly looks okay to me. We can plan the merge once all pending comments are addressed.

@quic-sanising requested a review from @ochougul June 5, 2025 16:53
@ochougul (Contributor) left a comment:


I don't see a test case. Everything else looks okay; approving.

@quic-amitraj (Contributor) commented:

@quic-sanising Could you please add the description for this PR?

@quic-amitraj (Contributor) left a review comment:


Please add docstrings to all the functions and classes.

Inline comment on the following signature:

    self,
    example_inputs: Dict[str, torch.Tensor],
    output_names: List[str],
    dynamic_axes: Dict[str, Dict[int, str]],

Could you please add a docstring to enhance clarity for other developers? Additionally, could we relocate this to the sampler folder? This will help keep modeling_auto.py streamlined and avoid unnecessary complexity.

@quic-sanising (Author) replied:

Added the docstring. Cannot relocate this function to the sampler folder, as it is a member function of QEFFAutoModelForCausalLM.

@quic-sanising (Author) commented:

@quic-sanising Could you please add the description for this PR?

Done

@quic-hemagnih (Contributor) commented:

As discussed in the meeting, we will approve and merge this PR. Please follow up with another PR that adds a unit test for this feature, to ensure it is not broken by subsequent check-ins.

@quic-sanising (Author) commented:

Moved to #440

@quic-sanising deleted the on-device-sampling branch June 18, 2025 17:04