speculative decoding #630
Conversation
Overall, the implementation looks good to me and is well-documented 👍 The quality can be further improved once the following minor issues are fixed; I think we are only one question away from approving this PR.

src/lmflow/pipeline/inferencer.py

- [Feature] line 331: better to treat `temperature=0.0` as `use_argmax=True`.
- ⚠️ [Bug or Question] line 344: I think the denominator should be the maximum non-zero cumulative probability, not the sum of those cumulative probabilities?
- [Bug] line 359: not a bug when `num_sample=1`, but note that `torch.multinomial` samples without replacement by default (see this link), so `replacement=True` should be specified. (A short sketch of this and the `temperature=0.0` point follows this list.)
- [Style] line 455: comment typo "x1,...,γ" -> "x1,...,xγ".
- [Style] lines 458-459, 484: better to use `logger.debug` instead of `print`.
- [Feature] line 465: given `ThreadPoolExecutor(max_workers=num_model_for_verify)`, better to expose the `num_model_for_verify` argument (default=1) to users, since for very large models GPU memory can become the bottleneck when multiple large models run in parallel for verification. A better implementation could verify batch by batch and let the user specify the batch size.
- [Style] lines 499, 502: typo "flnal" -> "final".
- [Style] lines 507-508, 512-513: use `logger.debug` instead of `print`.
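To make the `temperature=0.0` and `replacement=True` points concrete, here is a minimal sketch of a sampling helper; the function and argument names are illustrative, not the actual API in `inferencer.py`:

```python
# Illustrative only -- names are hypothetical, not the PR's actual API.
import torch

def sample_next_token(probs: torch.Tensor,
                      temperature: float = 1.0,
                      num_samples: int = 1) -> torch.Tensor:
    """Sample token id(s) from a 1-D probability tensor."""
    if temperature == 0.0:
        # temperature == 0 degenerates to greedy decoding, i.e. argmax.
        return probs.argmax(dim=-1, keepdim=True)
    # torch.multinomial samples WITHOUT replacement by default; for i.i.d.
    # draws from the same distribution we need replacement=True (the
    # difference only shows up when num_samples > 1, so num_sample=1 is fine).
    return torch.multinomial(probs, num_samples=num_samples, replacement=True)

# Usage: probs = logits.softmax(dim=-1); tok = sample_next_token(probs, 0.7)
```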
tests/pipeline/test_spec_inf.py

- The `tests` folder is used for unit tests; better to rewrite this part following the standard unittest format, or move it to `examples/*.py` later.
- [Bug] lines 465-466: we should use one forward pass over the whole sequence (to utilize GPU parallelism) instead of thread-level parallelism for the large model M_p, otherwise there will be no acceleration.
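As a reference for this point, a minimal sketch of batched verification, assuming HuggingFace-style causal LMs (variable names are illustrative, not the PR's code): one forward pass over the prefix plus the γ draft tokens yields M_p's distribution at every draft position at once, so no thread pool is needed.

```python
# Sketch only: assumes HuggingFace-style causal LMs; names are illustrative.
import torch

@torch.no_grad()
def target_probs_for_drafts(target_model, input_ids, draft_tokens):
    """One forward pass of the large model M_p over prefix + draft tokens.

    input_ids:    (1, n)      accepted prefix
    draft_tokens: (1, gamma)  tokens proposed by the small draft model M_q
    returns:      (gamma + 1, vocab) -- M_p's distribution at each draft
                  position, plus the position after the last draft token.
    """
    full = torch.cat([input_ids, draft_tokens], dim=-1)   # (1, n + gamma)
    logits = target_model(full).logits                    # (1, n + gamma, vocab)
    # The logits at position n-1+i give M_p's distribution for the token at
    # position n+i, i.e. for draft_tokens[:, i]; a single forward pass
    # therefore scores all gamma drafts at once on the GPU, instead of
    # gamma separate forward passes dispatched from a thread pool.
    gamma = draft_tokens.shape[-1]
    return logits[0, -gamma - 1:].softmax(dim=-1)
```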
LGTM, thanks 👍
inferencer now supports speculative decoding via `SpeculativeInferencer`. Tested with gpt2 (draft model) and gpt2-large (target model); see /tests/pipeline/test_spec_inf.py. Only functionality testing is finished; performance testing is still needed.

Not sure if my implementation of STEP 2 in speculative sampling (running the target model in parallel) is correct, please review & revise. Thanks a lot!
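For reference while reviewing, here is a minimal end-to-end sketch of one speculative-decoding step with the standard accept/reject rule, assuming HuggingFace `transformers` and the same gpt2 / gpt2-large pair used in the test; it is not the PR's `SpeculativeInferencer` API.

```python
# Minimal sketch of one speculative-decoding step; NOT the PR's API.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
draft = AutoModelForCausalLM.from_pretrained("gpt2").eval()          # M_q
target = AutoModelForCausalLM.from_pretrained("gpt2-large").eval()   # M_p

@torch.no_grad()
def speculative_step(input_ids, gamma=4):
    # 1) Draft gamma tokens autoregressively with the small model M_q.
    q_probs, drafted = [], input_ids
    for _ in range(gamma):
        q = draft(drafted).logits[0, -1].softmax(-1)
        tok = torch.multinomial(q, 1, replacement=True)
        q_probs.append(q)
        drafted = torch.cat([drafted, tok.view(1, 1)], dim=-1)
    # 2) One forward pass of M_p over prefix + drafts (verification).
    p_all = target(drafted).logits[0].softmax(-1)
    n = input_ids.shape[-1]
    # 3) Accept draft i with probability min(1, p(x_i) / q(x_i)).
    out = input_ids
    for i in range(gamma):
        x_i = drafted[0, n + i]
        p_i, q_i = p_all[n + i - 1, x_i], q_probs[i][x_i]
        if torch.rand(()) < torch.clamp(p_i / q_i, max=1.0):
            out = torch.cat([out, x_i.view(1, 1)], dim=-1)
        else:
            # Rejected: resample from the residual max(p - q, 0) distribution.
            resid = (p_all[n + i - 1] - q_probs[i]).clamp(min=0)
            tok = torch.multinomial(resid / resid.sum(), 1)
            return torch.cat([out, tok.view(1, 1)], dim=-1)
    # All drafts accepted: sample one bonus token from M_p's next distribution.
    tok = torch.multinomial(p_all[n + gamma - 1], 1)
    return torch.cat([out, tok.view(1, 1)], dim=-1)

ids = tokenizer("Speculative decoding", return_tensors="pt").input_ids
ids = speculative_step(ids, gamma=4)
print(tokenizer.decode(ids[0]))
```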