
Fix deepseek awq v3 #3450

Merged: 19 commits merged into main from fix-dpsk-v3-awq on Feb 12, 2025
Conversation

@hnyls2002 (Collaborator) commented Feb 10, 2025

python -m sglang.launch_server --model-path cognitivecomputations/DeepSeek-V3-AWQ --tp-size 8 --trust-remote --disable-mla

@hnyls2002 hnyls2002 marked this pull request as draft February 10, 2025 04:43
@halexan commented Feb 10, 2025

After this PR is merged, can sglang run cognitivecomputations/DeepSeek-V3-AWQ?

@chenchunhui97

After this PR is merged, can sglang run cognitivecomputations/DeepSeek-V3-AWQ?

I am giving it a try...

@Xu-Chen (Contributor) commented Feb 10, 2025

We should also introduce a Triton fused MoE kernel like moe_wna16.
The AWQ Marlin kernel may only get around 10 token/s on 8*A100.
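For readers unfamiliar with moe_wna16, here is a purely illustrative PyTorch reference of what such a W4A16 fused-MoE kernel computes (my sketch, with an assumed, simplified int4 packing layout, not the actual kernel or its memory format); the Triton kernel fuses the per-expert dequantization and GEMM into a single launch instead of looping in Python:

import torch

def moe_wna16_reference(x, topk_ids, topk_weights, qweight, scales, zeros, group_size=128):
    """Reference math only. Assumed (hypothetical) shapes:
    x: [tokens, hidden], topk_ids/topk_weights: [tokens, topk],
    qweight: [experts, hidden // 8, out] int32 (8 sequential int4 values per word),
    scales/zeros: [experts, hidden // group_size, out]."""
    out = torch.zeros(x.shape[0], qweight.shape[-1], dtype=x.dtype, device=x.device)
    shifts = torch.arange(0, 32, 4, device=x.device)
    for e in range(qweight.shape[0]):
        token_idx, k_idx = (topk_ids == e).nonzero(as_tuple=True)
        if token_idx.numel() == 0:
            continue
        # Unpack 8 int4 values per int32 word, then dequantize group-wise.
        w_int = (qweight[e].unsqueeze(1) >> shifts.view(1, -1, 1)) & 0xF
        w_int = w_int.reshape(-1, qweight.shape[-1]).to(x.dtype)      # [hidden, out]
        s = scales[e].repeat_interleave(group_size, dim=0)
        z = zeros[e].repeat_interleave(group_size, dim=0)
        w = (w_int - z) * s
        # Weight each token's contribution by its router probability for this expert.
        out[token_idx] += topk_weights[token_idx, k_idx].unsqueeze(1) * (x[token_idx] @ w)
    return out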

@hnyls2002 (Collaborator, Author)

After this PR is merged, can sglang run cognitivecomputations/DeepSeek-V3-AWQ?

Yes, this PR is exactly for that.

@hnyls2002 hnyls2002 marked this pull request as ready for review February 10, 2025 11:47
@hnyls2002 hnyls2002 changed the title Fix deepseek awq v3 [DO NOT MERGE] Fix deepseek awq v3 Feb 10, 2025
@hnyls2002 hnyls2002 changed the title [DO NOT MERGE] Fix deepseek awq v3 Fix deepseek awq v3 Feb 10, 2025
@pachinko

After this PR is merged, can sglang run cognitivecomputations/DeepSeek-V3-AWQ?

Yes, this PR is exactly for that.

I still have a problem. I am running cognitivecomputations/DeepSeek-V3-AWQ and get:

[2025-02-11 14:42:20 TP6] Scheduler hit an exception: Traceback (most recent call last):
  File "/WORK/sglang/python/sglang/srt/managers/scheduler.py", line 1816, in run_scheduler_process
    scheduler = Scheduler(server_args, port_args, gpu_id, tp_rank, dp_rank)
                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/WORK/sglang/python/sglang/srt/managers/scheduler.py", line 240, in __init__
    self.tp_worker = TpWorkerClass(
                     ^^^^^^^^^^^^^^
  File "/WORK/sglang/python/sglang/srt/managers/tp_worker_overlap_thread.py", line 63, in __init__
    self.worker = TpModelWorker(server_args, gpu_id, tp_rank, dp_rank, nccl_port)
                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/WORK/sglang/python/sglang/srt/managers/tp_worker.py", line 68, in __init__
    self.model_runner = ModelRunner(
                        ^^^^^^^^^^^^
  File "/WORK/sglang/python/sglang/srt/model_executor/model_runner.py", line 186, in __init__
    self.load_model()
  File "/WORK/sglang/python/sglang/srt/model_executor/model_runner.py", line 307, in load_model
    self.model = get_model(
                 ^^^^^^^^^^
  File "/WORK/sglang/python/sglang/srt/model_loader/__init__.py", line 22, in get_model
    return loader.load_model(
           ^^^^^^^^^^^^^^^^^^
  File "/WORK/sglang/python/sglang/srt/model_loader/loader.py", line 362, in load_model
    model.load_weights(self._get_all_weights(model_config, model))
  File "/WORK/sglang/python/sglang/srt/models/deepseek_v2.py", line 924, in load_weights
    param = params_dict[name]
            ~~~~~~~~~~~^^^^^^
KeyError: 'model.layers.6.mlp.experts.w2_weight'

[2025-02-11 14:42:20] Received sigquit from a child proces. It usually means the child failed.
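For context, a minimal sketch of the kind of expert-weight name remapping that load_weights performs (my hypothetical illustration, not the actual deepseek_v2.py code): each per-expert checkpoint tensor is renamed onto a fused MoE parameter, and the KeyError above fires at the params_dict lookup when the remapped name (experts.w2_weight) does not match what the quantized FusedMoE layer actually registered (an AWQ checkpoint carries qweight/scales/qzeros rather than a plain weight):

from typing import Dict, List, Tuple

import torch

def expert_params_mapping(num_experts: int) -> List[Tuple[str, str]]:
    # (per-expert checkpoint sub-name, fused parameter sub-name) pairs.
    mapping = []
    for eid in range(num_experts):
        mapping.append((f"experts.{eid}.gate_proj.weight", "experts.w13_weight"))
        mapping.append((f"experts.{eid}.up_proj.weight", "experts.w13_weight"))
        mapping.append((f"experts.{eid}.down_proj.weight", "experts.w2_weight"))
    return mapping

def resolve_expert_param(name: str,
                         params_dict: Dict[str, torch.nn.Parameter],
                         mapping: List[Tuple[str, str]]):
    """Return the fused parameter a checkpoint tensor should be loaded into."""
    for ckpt_sub, fused_sub in mapping:
        if ckpt_sub not in name:
            continue
        fused_name = name.replace(ckpt_sub, fused_sub)
        # This lookup is where a KeyError like 'model.layers.6.mlp.experts.w2_weight'
        # comes from when the layer registered w2_qweight / w2_scales instead.
        return params_dict[fused_name]
    return None

The real loader then hands the tensor to a shard-aware weight_loader together with the expert id rather than copying it directly; this PR presumably adjusts the mapping (or the registered parameter names) so quantized expert checkpoints resolve correctly.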

@halexan commented Feb 11, 2025

@pachinko

What is your launch command?

@pachinko

@halexan

python3 -m sglang.launch_server \
    --model-path /home/model/DeepSeek-R1 \
    --tp 8 \
    --dist-init-addr 10.10.0.1:6000 \
    --nnodes 1 \
    --node-rank 0 \
    --trust-remote-code \
    --disable-radix-cache  \
    --disable-outlines-disk-cache \
    --host 0.0.0.0 \
    --port 40000

@halexan commented Feb 11, 2025

We should also introduce a Triton fused MoE kernel like moe_wna16. The AWQ Marlin kernel may only get around 10 token/s on 8*A100.

So, does this PR still use the AWQ Marlin kernel?

@pachinko

@halexan

python3 -m sglang.launch_server \
    --model-path /home/model/DeepSeek-R1 \
    --tp 8 \
    --dist-init-addr 10.10.0.1:6000 \
    --nnodes 1 \
    --node-rank 0 \
    --trust-remote-code \
    --disable-radix-cache  \
    --disable-outlines-disk-cache \
    --host 0.0.0.0 \
    --port 40000

I replaced the config.json with the AWQ version.

@hnyls2002 (Collaborator, Author) commented Feb 11, 2025

@halexan

python3 -m sglang.launch_server \
    --model-path /home/model/DeepSeek-R1 \
    --tp 8 \
    --dist-init-addr 10.10.0.1:6000 \
    --nnodes 1 \
    --node-rank 0 \
    --trust-remote-code \
    --disable-radix-cache  \
    --disable-outlines-disk-cache \
    --host 0.0.0.0 \
    --port 40000

I replaced the config.json with the AWQ version.

R1 and MLA are not supported for now, due to some unknown accuracy issues. You can use V3-AWQ with this command:

 python -m sglang.launch_server --model-path cognitivecomputations/DeepSeek-V3-AWQ --tp-size 8 --trust-remote --disable-mla

@chenchunhui97

After this PR is merged, can sglang run cognitivecomputations/DeepSeek-V3-AWQ?

I succeeded in deploying the model on 8*A800 by building a Docker image from the fix-dpsk-v3-awq branch.

@Xu-Chen (Contributor) commented Feb 12, 2025

Could you share some benchmarks?

@Zachary-ai-engineer

We tested V3-AWQ on the latest code and found that metrics such as TPOT (time per output token) were relatively poor. How should we solve this problem?
[screenshot of benchmark results]

@halexan commented Feb 12, 2025

@chenchunhui97 How about the benchmarks?

@zhyncs (Member) left a comment

This fix is a bit tricky; I'll merge it first to unblock AWQ usage. Refactoring is on its way.

@zhyncs zhyncs merged commit 8616357 into main Feb 12, 2025
21 checks passed
@zhyncs zhyncs deleted the fix-dpsk-v3-awq branch February 12, 2025 14:09
chongli-uw pushed a commit to chongli-uw/sglang that referenced this pull request Feb 15, 2025
@luweizheng

My launch script on 8*A800 80G is below. This model has been successfully deployed with vLLM with a smaller context length, but it seems vLLM does not optimize MLA well at the moment.

python3 -m sglang.launch_server --model-path /path/to/DeepSeek-R1-awq/DeepSeek-R1-awq --tp 8 --host 0.0.0.0 --port 11434 --trust-remote-code

Error:

File "/fs/fast/u20247643/envs/sglang/lib/python3.12/site-packages/sglang/srt/model_loader/__init__.py", line 22, in get_model
    return loader.load_model(
           ^^^^^^^^^^^^^^^^^^
  File "/fs/fast/u20247643/envs/sglang/lib/python3.12/site-packages/sglang/srt/model_loader/loader.py", line 362, in load_model
    model.load_weights(self._get_all_weights(model_config, model))
  File "/fs/fast/u20247643/envs/sglang/lib/python3.12/site-packages/sglang/srt/models/deepseek_v2.py", line 962, in load_weights
    w = ops.awq_dequantize(
        ^^^^^^^^^^^^^^^^^^^
  File "/fs/fast/u20247643/envs/sglang/lib/python3.12/site-packages/vllm/_custom_ops.py", line 222, in awq_dequantize
    return torch.ops._C.awq_dequantize(qweight, scales, zeros, split_k_iters,
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/fs/fast/u20247643/envs/sglang/lib/python3.12/site-packages/torch/_ops.py", line 1116, in __call__
    return self._op(*args, **(kwargs or {}))
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
RuntimeError: expected scalar type Half but found BFloat16

@chenchunhui97 @zhyncs Any suggestions?
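A possible workaround sketch (an assumption on my side, not a confirmed fix): the vLLM AWQ dequantization kernel is compiled for float16, so a checkpoint whose scales are stored in bfloat16 hits exactly this error; casting the AWQ scales to half before dequantization (or loading the model in float16 overall, for example through a --dtype half style option if your build exposes one) avoids the mismatch:

import torch
from vllm import _custom_ops as ops  # same vLLM build as in the traceback above

def awq_dequantize_fp16(qweight: torch.Tensor, scales: torch.Tensor,
                        qzeros: torch.Tensor) -> torch.Tensor:
    # The kernel expects Half; a bfloat16 checkpoint trips
    # "expected scalar type Half but found BFloat16".
    if scales.dtype == torch.bfloat16:
        scales = scales.to(torch.float16)
    # Zero values for split_k_iters / thx / thy mirror vLLM's default usage.
    return ops.awq_dequantize(qweight, scales, qzeros, 0, 0, 0)

Note the dequantized output is then float16, so downstream modules would also need to run in half precision rather than bfloat16.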

@zjp-shadow zjp-shadow mentioned this pull request Feb 22, 2025