speculative decoding #630

Merged (4 commits) on Sep 6, 2023
Conversation

wheresmyhair (Collaborator) commented:

inferencer now supports speculative decoding via SpeculativeInferencer. Tested with gpt2 (draft model) and gpt2-large (target model); see /tests/pipeline/test_spec_inf.py. Only functionality testing is finished; performance testing is still needed.
I'm not sure whether my implementation of STEP 2 of speculative sampling (running the target model in parallel) is correct, please review and revise. Thanks a lot!
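For reviewers who want to try it locally, a minimal usage sketch is below. The import paths and constructor/method signatures are assumptions, not the exact API confirmed by this PR; tests/pipeline/test_spec_inf.py is the authoritative example.

```python
# Minimal usage sketch (assumed API; tests/pipeline/test_spec_inf.py is authoritative).
from lmflow.args import ModelArguments, DatasetArguments, InferencerArguments
from lmflow.models.hf_decoder_model import HFDecoderModel
from lmflow.pipeline.inferencer import SpeculativeInferencer

# The draft model proposes a few tokens cheaply; the target model verifies them.
draft_model = HFDecoderModel(ModelArguments(model_name_or_path="gpt2"))        # assumed constructor
target_model = HFDecoderModel(ModelArguments(model_name_or_path="gpt2-large"))  # assumed constructor

inferencer = SpeculativeInferencer(            # assumed signature
    ModelArguments(model_name_or_path="gpt2-large"),
    DatasetArguments(),
    InferencerArguments(),
)
print(inferencer.inference(target_model, draft_model, input="Hello, my name is"))  # assumed signature
```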

@research4pan (Contributor) left a comment:

Overall, the implementation looks good to me and is well documented 👍 The quality can be further improved by fixing the following minor problems. I think only one question stands between this PR and approval.

src/lmflow/pipeline/inferencer.py

  • [Feature] line 331: better to treat temperature=0.0 as use_argmax=True (a sketch of these sampling points follows this list).
  • ⚠️ [Bug or Question] line 344: I think the denominator should be the maximum non-zero cumulative probability, not the sum of those cumulative probabilities?
  • [Bug] line 359: no bug when num_sample=1, but note that torch.multinomial samples without replacement by default (see the torch.multinomial documentation), so replacement=True should be specified.
  • [Style] line 455: comment typo "x1,...,γ" -> "x1,...,xγ"
  • [Style] lines 458-459, 484: better to use logger.debug instead of print.
  • [Feature] line 465: assuming ThreadPoolExecutor(max_workers=num_model_for_verify), better to expose the num_model_for_verify argument (default=1) to users, since for very large models GPU memory can become the bottleneck when multiple large models run in parallel for verification. A better implementation could verify batch by batch and let the user specify the batch size.
  • [Style] lines 499, 502: typo: "flnal" -> "final"
  • [Style] lines 507-508, 512-513: use logger.debug instead of print.
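To make the sampling-related comments above concrete, here is a minimal illustrative sketch. It is not the code at those line numbers: sample_next_token is a made-up name, and it assumes line 344 refers to renormalizing a nucleus (top-p) filtered distribution.

```python
import torch

def sample_next_token(logits: torch.Tensor,
                      temperature: float = 1.0,
                      top_p: float = 1.0,
                      num_samples: int = 1) -> torch.Tensor:
    """Illustrative sampler (not the PR's code) covering the review points above."""
    # [Feature] treat temperature == 0.0 as greedy decoding (use_argmax=True).
    if temperature == 0.0:
        return logits.argmax(dim=-1, keepdim=True)

    probs = torch.softmax(logits / temperature, dim=-1)

    if top_p < 1.0:
        sorted_probs, sorted_idx = probs.sort(dim=-1, descending=True)
        cum_probs = sorted_probs.cumsum(dim=-1)
        # Drop tokens outside the nucleus, always keeping at least the top-1 token.
        keep = cum_probs - sorted_probs < top_p
        sorted_probs = sorted_probs * keep
        # [Bug or Question] renormalize by the sum of the *kept* probabilities,
        # which equals the largest kept cumulative probability, not by the sum
        # of the cumulative probabilities themselves.
        sorted_probs = sorted_probs / sorted_probs.sum(dim=-1, keepdim=True)
        probs = torch.zeros_like(probs).scatter_(-1, sorted_idx, sorted_probs)

    # [Bug] torch.multinomial draws WITHOUT replacement by default, so pass
    # replacement=True when num_samples > 1.
    return torch.multinomial(probs, num_samples=num_samples, replacement=True)
```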

tests/pipeline/test_spec_inf.py

  • The tests folder is used for unit tests; better to rewrite this part in the standard unittest format (a skeleton is sketched below), or move it to examples/*.py later.
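For reference, a conventional unittest skeleton for this test might look like the following; the class and test names are placeholders, not the contents of test_spec_inf.py.

```python
import unittest

class TestSpeculativeInferencer(unittest.TestCase):
    """Placeholder skeleton illustrating the standard unittest layout."""

    def test_inference_produces_text(self):
        # Build the draft/target models and the SpeculativeInferencer here,
        # then assert properties of the decoded output.
        self.assertTrue(True)  # replace with real assertions

if __name__ == "__main__":
    unittest.main()
```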

@research4pan (Contributor) left a comment:

  • [Bug] lines 465-466: we should use one forward pass over the whole sequence (to utilize GPU parallelism) instead of thread-parallelism for the large model M_p; otherwise there will be no acceleration.

@wheresmyhair (Collaborator, Author) commented:

src/lmflow/pipeline/inferencer.py

  • [Feature] line 331: better to treat temperature=0.0 as use_argmax=True.
  • ⚠️ [Bug or Question] line 344: I think the denominator should be the maximum non-zero cumulative probability, not the sum of those cumulative probabilities?
  • [Bug] line 359: no bug when num_sample=1, but note that torch.multinomial samples without replacement by default, so replacement=True should be specified.
  • [Style] line 455: comment typo "x1,...,γ" -> "x1,...,xγ"
  • [Style] lines 458-459, 484: better to use logger.debug instead of print.
  • [Style] lines 499, 502: typo: "flnal" -> "final"
  • [Style] lines 507-508, 512-513: use logger.debug instead of print.

Fixed.

  • [Feature] line 465: assuming ThreadPoolExecutor(max_workers=num_model_for_verify), better to expose the num_model_for_verify argument (default=1) to users, since for very large models GPU memory can become the bottleneck when multiple large models run in parallel for verification. A better implementation could verify batch by batch and let the user specify the batch size.
  • [Bug] lines 465-466: we should use one forward pass over the whole sequence (to utilize GPU parallelism) instead of thread-parallelism for the large model M_p; otherwise there will be no acceleration.

Instead of using threading + pred_next_token(), SpeculativeInferencer now uses the get_backend_model() method of the HFDecoderModel object to get the logits and do the calculations.
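A minimal sketch of that idea, assuming get_backend_model() returns the underlying Hugging Face causal LM (the function name below is made up): one forward pass over the prefix plus all drafted tokens yields the target-model logits at every position at once, so no thread pool is needed.

```python
import torch

@torch.no_grad()
def target_logits_for_draft(target_model, input_ids: torch.Tensor) -> torch.Tensor:
    """Score the whole drafted sequence with a single forward pass.

    target_model is assumed to be an HFDecoderModel; get_backend_model()
    (mentioned above) is assumed to return the underlying Hugging Face
    causal LM, whose output carries a .logits tensor of shape
    [batch, seq_len, vocab].
    """
    hf_model = target_model.get_backend_model()
    outputs = hf_model(input_ids=input_ids)
    # logits[:, t, :] is the target distribution for the token at position t+1,
    # so one call verifies all gamma drafted tokens in parallel on the GPU.
    return outputs.logits
```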

@research4pan (Contributor) left a comment:

LGTM, thanks 👍

@research4pan merged commit 7f2711a into OptimalScale:main on Sep 6, 2023.