On Device Sampling #350
Conversation
Can you please rebase the code and check whether there are any conflicts? The code mostly looks okay to me. We can plan the merge once all the pending comments are addressed.
I don't see a test case. Everything else is okay. Approving.
@quic-sanising Could you please add the description for this PR?
Please add docstrings to all the functions and classes.
self,
example_inputs: Dict[str, torch.Tensor],
output_names: List[str],
dynamic_axes: Dict[str, Dict[int, str]],
Could you please add a docstring to enhance clarity for other developers? Additionally, could we relocate this to the sampler folder? This will help keep modeling_auto.py streamlined and avoid unnecessary complexity.
Added the docstring. Cannot relocate this function to the sampler folder, as it is a member function of QEFFAutoModelForCausalLM.
Done
As discussed in the meeting, we will approve and merge this PR. Please open another PR that adds a unit test for this feature, to ensure it is not broken by any subsequent check-ins.
Moved to #440
✨ On Device Sampling
📌 Overview
This PR introduces On Device Sampling for `QEffForCausalLM` models, enabling sampling operations to be executed directly on the QAIC device rather than on the host CPU. This enhancement significantly reduces host-device communication overhead and improves inference throughput and scalability.

🚀 Motivation
Traditionally, sampling (e.g., greedy, top-k, top-p) is performed on the host CPU after logits are returned from the device. This approach incurs:
- The cost of transferring the full logits tensor of shape `[batch_size, vocab_size]` from device to host at every decode step

On Device Sampling addresses these issues by:
- Performing sampling directly on the QAIC device, so that only the sampled tokens of shape `[batch_size, 1]` are returned to the host
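To put those shapes in perspective, here is a back-of-the-envelope comparison; the batch size, vocabulary size, and dtypes below are assumptions for illustration, not figures from this PR:

```python
# Rough per-decode-step device-to-host transfer, under assumed sizes.
batch_size, vocab_size = 4, 32_000          # hypothetical values
logits_bytes = batch_size * vocab_size * 4  # fp32 logits: [batch_size, vocab_size]
token_bytes = batch_size * 1 * 8            # int64 token ids: [batch_size, 1]

print(f"host-side sampling: ~{logits_bytes / 1e6:.1f} MB per step")  # ~0.5 MB
print(f"on-device sampling: {token_bytes} bytes per step")           # 32 bytes
```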
⚙️ Supported Sampling Strategies
The following sampling techniques are now supported natively on the QAIC device:
- Repetition Penalty: Penalize tokens that have appeared in the prompt and the generated text so far. Values > 1 encourage the model to use new tokens, while values < 1 encourage the model to repeat tokens.
- Presence Penalty: Penalize tokens that are present in the generated text so far. Values > 0 encourage the model to use new tokens, while values < 0 encourage the model to repeat tokens.
- Temperature Scaling: Adjust the sharpness of the logits distribution. Lower values make the model more deterministic, while higher values make it more random. Zero means greedy sampling.
- Top K: Sample from the `k` tokens with the largest values.
- Top P: Sample from the smallest set of tokens whose cumulative probability is greater than or equal to `p`. Must be in (0, 1]; set to 1 to consider all tokens.
- Min P: The minimum probability for a token to be considered, relative to the probability of the most likely token. Must be in [0, 1]; set to 0 to disable it.
- Greedy Sampling: Choose the token with the highest value.
- Random Sampling: Choose a token randomly, with its probability of being chosen given by its value.
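For readers who want to see how these strategies compose, below is a minimal host-side PyTorch sketch of the conventional sampling pipeline for a single sequence. It is illustrative only: it is not code from this PR, and the on-device sampler presumably expresses equivalent logic as operations in the exported graph.

```python
import torch

def reference_sample(logits, seen_ids, output_ids, repetition_penalty=1.2,
                     presence_penalty=0.5, temperature=0.8, top_k=50,
                     top_p=0.9, min_p=0.0):
    """Illustrative sampling pipeline for one sequence.

    logits: [vocab_size]; seen_ids: prompt + generated token ids;
    output_ids: generated token ids only.
    """
    logits = logits.clone()

    # Repetition penalty over prompt + generated tokens:
    # shrink positive logits, amplify negative ones.
    rep = torch.unique(seen_ids)
    logits[rep] = torch.where(logits[rep] > 0,
                              logits[rep] / repetition_penalty,
                              logits[rep] * repetition_penalty)

    # Presence penalty over generated tokens only: flat subtraction.
    logits[torch.unique(output_ids)] -= presence_penalty

    # Temperature scaling; zero temperature means greedy sampling.
    if temperature == 0:
        return int(torch.argmax(logits))
    probs = torch.softmax(logits / temperature, dim=-1)

    # Top K: keep only the k most probable tokens.
    if top_k > 0:
        kth_value = torch.topk(probs, min(top_k, probs.numel())).values[-1]
        probs[probs < kth_value] = 0.0

    # Top P: keep the smallest set of tokens whose cumulative probability >= p.
    sorted_probs, sorted_idx = torch.sort(probs, descending=True)
    exceeded = torch.cumsum(sorted_probs, dim=-1) - sorted_probs >= top_p
    probs[sorted_idx[exceeded]] = 0.0

    # Min P: drop tokens far less likely than the most likely token.
    probs[probs < min_p * probs.max()] = 0.0

    # Random sampling from the renormalized distribution.
    return int(torch.multinomial(probs / probs.sum(), num_samples=1))
```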
🛠️ Implementation Details
- Sampler Integration: Sampling logic is injected via `include_sampler=True` during model loading. No changes to the model architecture are required.
- Memory Optimization: Two scratch buffers of shape `[batch_size, vocab_size]` are used to track token occurrences, so that the repetition and presence penalties can be applied efficiently on-device.
- Performance Gains:
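As a rough illustration of the memory-optimization point (the buffer and function names below are made up for this sketch and are not taken from the PR), the two `[batch_size, vocab_size]` buffers can be thought of as occurrence masks: one covering prompt plus generated tokens (for the repetition penalty) and one covering generated tokens only (for the presence penalty).

```python
import torch

batch_size, vocab_size = 4, 32_000  # hypothetical sizes

# Two scratch buffers: occurrences in prompt + generated text (repetition
# penalty) and in generated text only (presence penalty).
seen_in_prompt_or_output = torch.zeros(batch_size, vocab_size, dtype=torch.bool)
seen_in_output = torch.zeros(batch_size, vocab_size, dtype=torch.bool)

def record_prompt(prompt_ids: torch.Tensor) -> None:
    """Mark prompt tokens once during prefill; prompt_ids: [batch_size, prompt_len]."""
    seen_in_prompt_or_output.scatter_(1, prompt_ids, True)

def record_sampled(next_ids: torch.Tensor) -> None:
    """Mark each newly sampled token; next_ids: [batch_size, 1]."""
    seen_in_prompt_or_output.scatter_(1, next_ids, True)
    seen_in_output.scatter_(1, next_ids, True)

def apply_penalties(logits: torch.Tensor, repetition_penalty=1.2, presence_penalty=0.5):
    """Penalize logits ([batch_size, vocab_size]) using the two occurrence masks."""
    penalized = torch.where(logits > 0, logits / repetition_penalty, logits * repetition_penalty)
    logits = torch.where(seen_in_prompt_or_output, penalized, logits)
    return logits - presence_penalty * seen_in_output.float()
```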
🧪 Usage
To disable On Device Sampling and revert to host-side sampling, simply set `include_sampler=False`.
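A minimal usage sketch, assuming the standard `QEFFAutoModelForCausalLM` from_pretrained / compile / generate flow; the model card and compile arguments below are placeholders, and the exact keyword plumbing for `include_sampler` may differ from this sketch.

```python
from transformers import AutoTokenizer
from QEfficient import QEFFAutoModelForCausalLM

model_name = "meta-llama/Llama-3.1-8B"  # placeholder model card

# Load with the on-device sampler injected into the exported graph;
# set include_sampler=False to fall back to host-side sampling.
model = QEFFAutoModelForCausalLM.from_pretrained(model_name, include_sampler=True)

model.compile(num_cores=16)  # compile for the QAIC device as usual

tokenizer = AutoTokenizer.from_pretrained(model_name)
model.generate(prompts=["Hello, my name is"], tokenizer=tokenizer)
```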