
[Feature]: MLPSpeculator Tensor Parallel support #5809

Closed
njhill opened this issue Jun 25, 2024 · 3 comments · Fixed by #6050
Labels: feature request

Comments

njhill (Member) commented Jun 25, 2024

🚀 The feature, motivation and pitch

MLPSpeculator-based speculative decoding was recently added in #4947, but the initial integration only covers single GPU usage.

There will soon be "speculator" models available for larger target models that require multiple GPUs, so we would like to ensure that TP can be used.

The first part of this issue would be testing it out in conjunction with #5414 and making any necessary adjustments so that it works with TP=1 for the speculator and TP=N for the target model.
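To make the target-TP/draft-TP split concrete, here is a minimal launch sketch. The model names are placeholders, and `speculative_draft_tensor_parallel_size` is the knob proposed in #5414, so the final argument name may differ:

```python
from vllm import LLM, SamplingParams

# Sketch: target model sharded across 4 GPUs (TP=4) while the MLPSpeculator
# draft runs unsharded on a single rank (TP=1). Model names are placeholders;
# the draft-TP argument is the one proposed in #5414 and may change.
llm = LLM(
    model="meta-llama/Llama-2-13b-chat-hf",
    tensor_parallel_size=4,
    speculative_model="ibm-fms/llama-13b-accelerator",  # MLPSpeculator weights
    num_speculative_tokens=4,
    speculative_draft_tensor_parallel_size=1,
    use_v2_block_manager=True,  # speculative decoding currently requires this
)

outputs = llm.generate(["The capital of France is"], SamplingParams(max_tokens=32))
print(outputs[0].outputs[0].text)
```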

Following this we can look at having the speculator itself run with TP>1, but that may be more involved since it will require some distributed coordination of the sampling of each speculated token in the MLPSpeculator loop. It might be possible to avoid additional communication here by having the sampler used by the speculator model use a dedicated torch.Generator for its sampling, and doing this sampling in tandem across the ranks.
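A rough illustration of that lockstep idea (purely a sketch, not vLLM's sampler API, and the seed agreement is hand-waved): if every TP rank seeds an identical dedicated generator and issues the same sampling calls in the same order, each rank independently draws the same speculated token, so nothing has to be broadcast:

```python
import torch

SPECULATOR_SEED = 1234  # hypothetical; would be agreed on once at startup

def make_speculator_generator(device: torch.device) -> torch.Generator:
    # Every TP rank builds its own generator with the same seed, so the
    # generators stay in lockstep as long as ranks sample in the same order.
    gen = torch.Generator(device=device)
    gen.manual_seed(SPECULATOR_SEED)
    return gen

def sample_speculated_token(logits: torch.Tensor, gen: torch.Generator) -> torch.Tensor:
    # Assuming the logits are already identical on every rank, multinomial
    # with the shared-seed generator yields the same token id everywhere,
    # with no extra collective communication.
    probs = torch.softmax(logits, dim=-1)
    return torch.multinomial(probs, num_samples=1, generator=gen)
```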

@JRosenkranz already used VocabParallelEmbedding in the implementation, so the model layers themselves should work fine.
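For context, the Megatron-style technique behind VocabParallelEmbedding is roughly the following (a conceptual sketch, not vLLM's actual implementation): each rank owns a contiguous vocab shard, embeds only the token ids that fall in its shard, and an all-reduce sums the partial results:

```python
import torch
import torch.distributed as dist

class NaiveVocabParallelEmbedding(torch.nn.Module):
    # Conceptual sketch of Megatron-style vocab parallelism; vLLM's
    # VocabParallelEmbedding is the real implementation.
    def __init__(self, vocab_size: int, hidden_size: int, rank: int, world_size: int):
        super().__init__()
        shard = vocab_size // world_size  # assume the vocab divides evenly
        self.start, self.end = rank * shard, (rank + 1) * shard
        self.weight = torch.nn.Parameter(torch.randn(shard, hidden_size) * 0.02)

    def forward(self, token_ids: torch.Tensor) -> torch.Tensor:
        # Embed only the ids in this rank's shard, zero out the rest, then
        # all-reduce so every rank gets the full embedding lookup.
        in_shard = (token_ids >= self.start) & (token_ids < self.end)
        local_ids = torch.where(in_shard, token_ids - self.start, torch.zeros_like(token_ids))
        out = torch.nn.functional.embedding(local_ids, self.weight)
        out = out * in_shard.unsqueeze(-1).to(out.dtype)
        dist.all_reduce(out)  # sum of per-rank partials == full lookup
        return out
```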

cc @cadedaniel @sirejdua @JRosenkranz @tdoublep

njhill added the feature request label on Jun 25, 2024
cadedaniel (Collaborator) commented:

initial thought: we can start today with a small model (don't have to wait for new MLPSpeculator), the result should generalize to larger target models.

njhill (Member, Author) commented Jun 25, 2024

> we can start today with a small model (don't have to wait for new MLPSpeculator), the result should generalize to larger target models.

Yes, sorry, I should have made that clear; the large models are more the motivation, but it can be developed and tested with existing ones.

sirejdua (Contributor) commented:

Thanks for writing this up @njhill, I can start working on it.
