[DeepSeek R1] Qwen2.5 Distillations #2236

Open · wants to merge 6 commits into base: master

Conversation

DavidLandup0 (Collaborator) commented Apr 29, 2025

This PR adds a distinct family of Qwen2.5 models, distilled from DeepSeek-R1:

While these models are technically distillations, DeepSeek's configurations change the tokenizer config and the preprocessing flow. To avoid the flag-based slippery slope of adding overriding configs to the existing Qwen models, and to complement #2171, we keep the tokenizer and preprocessor separate, adding the distinct changes introduced with DeepSeek-R1's distillation as separate classes and files.

Example Usage

Google Colab

2-line setup/prompt on Google Colab:

[Screenshot: two-line setup and prompt in a Colab notebook]

KerasHub

Python 3.9.9 (v3.9.9:ccb0e6a345, Nov 15 2021, 13:06:05) 
[Clang 13.0.0 (clang-1300.0.29.3)] on darwin
>>> import keras_hub
>>> hf_preset = "deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B"
>>> keras_hub_model = keras_hub.models.DeepSeekR1QwenCausalLM.from_preset(f"hf://{hf_preset}")
>>> keras_hub_model.generate("What is Keras?", max_length=24)
'What is Keras? Explain its applications?\nWhat is TensorFlow? Explain its applications?\nAlso Explain TensorFlow.js Applications.\n'

HuggingFace Equivalent

>>> from transformers import AutoModelForCausalLM, AutoTokenizer
>>> hf_preset = "deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B"
>>> deepseek_qwen = AutoModelForCausalLM.from_pretrained(hf_preset)
>>> deepseek_qwen_tokenizer = AutoTokenizer.from_pretrained(hf_preset)

>>> inputs = deepseek_qwen_tokenizer(["What is Keras?"], return_tensors="pt")
>>> outputs = deepseek_qwen.generate(**inputs, max_new_tokens=24)
>>> deepseek_qwen_tokenizer.decode(outputs[0])

'<|begin▁of▁sentence|>What is Keras? What is its purpose? What is Keras used for? What is Keras used for in practice? What is K'

Numerical Equivalency

Currently, there seems to be some noise in the numerics/weights when converting naively; I'm still looking into why this happens. That said, the models are generally comparable. For example, taking the mean across the first axis of the lm_head (called token_embedding in KerasHub), we see a fairly similar profile, but not numerical equivalency:

>>> import matplotlib.pyplot as plt
>>> fig, ax = plt.subplots()
>>> ax.plot(keras_hub_model.backbone.token_embedding.get_weights()[0].mean(axis=0), label='KH', alpha=0.5)
>>> ax.plot(deepseek_qwen.lm_head.weight.mean(axis=0).detach().numpy(), label='HF', alpha=0.5)
[Plot: per-dimension mean of the LM head weights, KerasHub (KH) vs. HuggingFace (HF); the curves largely overlap but are not identical]

This doesn't seem to affect the outputs much, as seen in the responses above. I'll investigate further into why these discrepancies arise, since the weights should be loaded directly, as they are, into the corresponding components of the model.
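
For a more direct check than eyeballing the plot, something along these lines quantifies the gap (a sketch, assuming the keras_hub_model and deepseek_qwen objects loaded in the snippets above):

import numpy as np

# Directly compare the LM head / output embedding weights loaded by each library.
kh_weights = keras_hub_model.backbone.token_embedding.get_weights()[0]
hf_weights = deepseek_qwen.lm_head.weight.detach().numpy()

# Transpose if the two libraries store the matrix in opposite orientations.
if kh_weights.shape != hf_weights.shape:
    hf_weights = hf_weights.T

print("max abs diff: ", np.abs(kh_weights - hf_weights).max())
print("mean abs diff:", np.abs(kh_weights - hf_weights).mean())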

pass-lin (Contributor) commented May 3, 2025

Here's a little suggestion: you could try testing performance on a math dataset. If the final results are comparable to those achieved with vLLM, we can ignore this error.
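
For instance, a quick spot check could look roughly like the following (a sketch with toy placeholder questions; a real comparison would run an actual math benchmark such as GSM8K, with the same sampling settings as the vLLM run, using the keras_hub_model loaded above):

# Toy spot check: compare generations against reference answers.
samples = [
    {"question": "What is 12 * 7? Answer with just the number.", "answer": "84"},
    {"question": "What is 15 + 27? Answer with just the number.", "answer": "42"},
]

correct = 0
for sample in samples:
    output = keras_hub_model.generate(sample["question"], max_length=64)
    if sample["answer"] in output:
        correct += 1

print(f"matched {correct}/{len(samples)} reference answers")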

mattdangerw (Member) commented

@DavidLandup0 sounds like what we really need here is the ability to combine a QwenBackbone with a DeepSeek tokenizer? If so, I think we might be able to relax our requirements so a high-level task (e.g. QwenCausalLM) could use a tokenizer from DeepSeek. This is something I have thought we probably need anyway.

I'll try to make a PR showing the basic loading changes, but lmk what you think conceptually!

DavidLandup0 (Collaborator, Author) commented

> @DavidLandup0 sounds like what we really need here is the ability to combine a QwenBackbone with a DeepSeek tokenizer? If so, I think we might be able to relax our requirements so a high-level task (e.g. QwenCausalLM) could use a tokenizer from DeepSeek. This is something I have thought we probably need anyway.
>
> I'll try to make a PR showing the basic loading changes, but lmk what you think conceptually!

Fundamentally, yes. We're looking to switch up the tokenizer for an existing workflow/backbone. Having a general builder where you can combine arbitrary tokenizers and backbones would be beneficial across the board, since it's not uncommon for people to mix and match tokenizers with models.

To unblock this PR: what do you think about going forward with the API we have right now, and then updating it when we allow mix-and-match? As-is, we only have a single file duplicated (i.e. the DeepSeekQwen2 model), so it will be easy to switch once we support the new feature.
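
Conceptually, the mix-and-match could end up looking something like this (a sketch only: DeepSeekR1QwenTokenizer and converter support for the distill repo are assumptions, while QwenBackbone, QwenCausalLM, and QwenCausalLMPreprocessor follow the existing keras_hub naming):

import keras_hub

# Rough sketch of pairing a Qwen backbone/task with the DeepSeek-R1 tokenizer.
# Not the current state of this PR; class availability is assumed as noted above.
hf_preset = "hf://deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B"
backbone = keras_hub.models.QwenBackbone.from_preset(hf_preset)
tokenizer = keras_hub.models.DeepSeekR1QwenTokenizer.from_preset(hf_preset)
preprocessor = keras_hub.models.QwenCausalLMPreprocessor(tokenizer=tokenizer)
causal_lm = keras_hub.models.QwenCausalLM(
    backbone=backbone, preprocessor=preprocessor
)
print(causal_lm.generate("What is Keras?", max_length=24))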
