[DeepSeek R1] Qwen2.5 Distillations #2236

Open · wants to merge 6 commits into base: master

Conversation

DavidLandup0 (Collaborator) commented Apr 29, 2025

This PR adds a distinct family of Qwen2.5 models, distilled from DeepSeek-R1:

While these models are technically distillations, DeepSeek's configurations change the tokenizer config and the preprocessing flow. To avoid the flag-based slippery slope of adding overriding configs to the existing Qwen models, and to complement #2171, we keep the tokenizer and preprocessor separate, adding the distinct changes introduced with DeepSeek-R1's distillation as separate classes and files.

Example Usage

Google Colab

2-line setup/prompt on Google Colab:

[Screenshot: two-line setup and prompt in a Colab notebook]

KerasHub

Python 3.9.9 (v3.9.9:ccb0e6a345, Nov 15 2021, 13:06:05) 
[Clang 13.0.0 (clang-1300.0.29.3)] on darwin
>>> import keras_hub
>>> hf_preset = "deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B"
>>> keras_hub_model = keras_hub.models.DeepSeekR1QwenCausalLM.from_preset(f"hf://{hf_preset}")
>>> keras_hub_model.generate("What is Keras?", max_length=24)
'What is Keras? Explain its applications?\nWhat is TensorFlow? Explain its applications?\nAlso Explain TensorFlow.js Applications.\n'

HuggingFace Equivalent

>>> from transformers import AutoModelForCausalLM, AutoTokenizer
>>> hf_preset = "deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B"
>>> deepseek_qwen = AutoModelForCausalLM.from_pretrained(hf_preset)
>>> deepseek_qwen_tokenizer = AutoTokenizer.from_pretrained(hf_preset)

>>> inputs = deepseek_qwen_tokenizer(["What is Keras?"], return_tensors="pt")
>>> outputs = deepseek_qwen.generate(**inputs, max_new_tokens=24)
>>> deepseek_qwen_tokenizer.decode(outputs[0])

'<|begin▁of▁sentence|>What is Keras? What is its purpose? What is Keras used for? What is Keras used for in practice? What is K'

Numerical Equivalency

Currently, there seems to be some noise in the numerics/weights when converting naively; I'm still looking into why this happens. That said, the models are generally comparable. For example, taking the mean across the first axis of the lm_head (called token_embedding in KerasHub), we see a fairly similar profile, but not numerical equivalency:

>>> import matplotlib.pyplot as plt
>>> fig, ax = plt.subplots()
>>> ax.plot(keras_hub_model.backbone.token_embedding.get_weights()[0].mean(axis=0), label='KH', alpha=0.5)
>>> ax.plot(deepseek_qwen.lm_head.weight.mean(axis=0).detach().numpy(), label='HF', alpha=0.5)
[Plot: per-dimension mean of the LM head weights, KerasHub (KH) vs. HuggingFace (HF); the curves largely overlap but are not identical]

This doesn't seem to affect the outputs much, as seen in the responses above. I'll investigate further into why these discrepancies arise, since the weights should be loaded directly, as they are, into the corresponding components of the model.
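
For a more direct check than eyeballing the plot, something along these lines quantifies the gap (a sketch, assuming the keras_hub_model and deepseek_qwen objects loaded in the snippets above):

import numpy as np

# Directly compare the LM head / output embedding weights loaded by each library.
kh_weights = keras_hub_model.backbone.token_embedding.get_weights()[0]
hf_weights = deepseek_qwen.lm_head.weight.detach().numpy()

# Transpose if the two libraries store the matrix in opposite orientations.
if kh_weights.shape != hf_weights.shape:
    hf_weights = hf_weights.T

print("max abs diff: ", np.abs(kh_weights - hf_weights).max())
print("mean abs diff:", np.abs(kh_weights - hf_weights).mean())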

pass-lin (Contributor) commented May 3, 2025

Here's a little suggestion: you could try testing performance on a math dataset. If the final results are comparable to those achieved with vLLM, we can ignore this error.
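
For instance, a quick spot check could look roughly like the following (a sketch with toy placeholder questions; a real comparison would run an actual math benchmark such as GSM8K, with the same sampling settings as the vLLM run, using the keras_hub_model loaded above):

# Toy spot check: compare generations against reference answers.
samples = [
    {"question": "What is 12 * 7? Answer with just the number.", "answer": "84"},
    {"question": "What is 15 + 27? Answer with just the number.", "answer": "42"},
]

correct = 0
for sample in samples:
    output = keras_hub_model.generate(sample["question"], max_length=64)
    if sample["answer"] in output:
        correct += 1

print(f"matched {correct}/{len(samples)} reference answers")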

mattdangerw (Member) commented

@DavidLandup0 sounds like what we really need here is the ability to combine a QwenBackbone with a DeepSeek tokenizer? If so, I think we might be able to relax our requirements so a high-level task (e.g. QwenCausalLM) could use a tokenizer from DeepSeek. This is something I have thought we probably need anyway.

I'll try to make a PR showing the basic loading changes, but lmk what you think conceptually!

DavidLandup0 (Collaborator, Author) commented

> @DavidLandup0 sounds like what we really need here is the ability to combine a QwenBackbone with a DeepSeek tokenizer? If so, I think we might be able to relax our requirements so a high-level task (e.g. QwenCausalLM) could use a tokenizer from DeepSeek. This is something I have thought we probably need anyway.
>
> I'll try to make a PR showing the basic loading changes, but lmk what you think conceptually!

Fundamentally, yes. We're looking to switch up the tokenizer for an existing workflow/backbone. Having a general builder where you can combine arbitrary tokenizers and backbones would be beneficial across the board, since it's not uncommon for people to mix and match tokenizers with models.

To unblock this PR: what do you think about going forward with the API we have right now, and then updating it when we allow mix-and-match? As-is, we only have a single file duplicated (i.e. the DeepSeekQwen2 model), so it will be easy to switch once we support the new feature.
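
Conceptually, the mix-and-match could end up looking something like this (a sketch only: DeepSeekR1QwenTokenizer and converter support for the distill repo are assumptions, while QwenBackbone, QwenCausalLM, and QwenCausalLMPreprocessor follow the existing keras_hub naming):

import keras_hub

# Rough sketch of pairing a Qwen backbone/task with the DeepSeek-R1 tokenizer.
# Not the current state of this PR; class availability is assumed as noted above.
hf_preset = "hf://deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B"
backbone = keras_hub.models.QwenBackbone.from_preset(hf_preset)
tokenizer = keras_hub.models.DeepSeekR1QwenTokenizer.from_preset(hf_preset)
preprocessor = keras_hub.models.QwenCausalLMPreprocessor(tokenizer=tokenizer)
causal_lm = keras_hub.models.QwenCausalLM(
    backbone=backbone, preprocessor=preprocessor
)
print(causal_lm.generate("What is Keras?", max_length=24))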
