
[Bug] AWQ scalar type error #3780

Closed

zjp-shadow opened this issue Feb 22, 2025 · 5 comments

Labels
help wanted (Extra attention is needed)

Comments

@zjp-shadow

Checklist

  • 1. I have searched related issues but cannot get the expected help.
  • 2. The bug has not been fixed in the latest version.
  • 3. Please note that if the bug-related issue you submitted lacks corresponding environment info and a minimal reproducible demo, it will be challenging for us to reproduce and resolve the issue, reducing the likelihood of receiving feedback.
  • 4. If the issue you raised is not a bug but a question, please raise a discussion at https://github.com/sgl-project/sglang/discussions/new/choose. Otherwise, it will be closed.
  • 5. Please use English, otherwise it will be closed.

Describe the bug

When I run DeepSeek-R1-AWQ, I hit a scalar type bug, the same one as in PR #3450. @hnyls2002

Loading safetensors checkpoint shards:  97% Completed | 72/74 [00:44<00:01,  1.55it/s]
Loading safetensors checkpoint shards:  99% Completed | 73/74 [00:45<00:00,  1.72it/s]
[2025-02-22 12:19:24 TP3] Scheduler hit an exception: Traceback (most recent call last):
  File "/mnt/Shared_h0/zjp/anaconda3/envs/deepseek/lib/python3.9/site-packages/sglang/srt/managers/scheduler.py", line 1816, in run_scheduler_process
    scheduler = Scheduler(server_args, port_args, gpu_id, tp_rank, dp_rank)
  File "/mnt/Shared_h0/zjp/anaconda3/envs/deepseek/lib/python3.9/site-packages/sglang/srt/managers/scheduler.py", line 240, in __init__
    self.tp_worker = TpWorkerClass(
  File "/mnt/Shared_h0/zjp/anaconda3/envs/deepseek/lib/python3.9/site-packages/sglang/srt/managers/tp_worker_overlap_thread.py", line 63, in __init__
    self.worker = TpModelWorker(server_args, gpu_id, tp_rank, dp_rank, nccl_port)
  File "/mnt/Shared_h0/zjp/anaconda3/envs/deepseek/lib/python3.9/site-packages/sglang/srt/managers/tp_worker.py", line 68, in __init__
    self.model_runner = ModelRunner(
  File "/mnt/Shared_h0/zjp/anaconda3/envs/deepseek/lib/python3.9/site-packages/sglang/srt/model_executor/model_runner.py", line 195, in __init__
    self.load_model()
  File "/mnt/Shared_h0/zjp/anaconda3/envs/deepseek/lib/python3.9/site-packages/sglang/srt/model_executor/model_runner.py", line 318, in load_model
    self.model = get_model(
  File "/mnt/Shared_h0/zjp/anaconda3/envs/deepseek/lib/python3.9/site-packages/sglang/srt/model_loader/__init__.py", line 22, in get_model
    return loader.load_model(
  File "/mnt/Shared_h0/zjp/anaconda3/envs/deepseek/lib/python3.9/site-packages/sglang/srt/model_loader/loader.py", line 362, in load_model
    model.load_weights(self._get_all_weights(model_config, model))
  File "/mnt/Shared_h0/zjp/anaconda3/envs/deepseek/lib/python3.9/site-packages/sglang/srt/models/deepseek_v2.py", line 962, in load_weights
    w = ops.awq_dequantize(
  File "/mnt/Shared_h0/zjp/anaconda3/envs/deepseek/lib/python3.9/site-packages/vllm/_custom_ops.py", line 222, in awq_dequantize
    return torch.ops._C.awq_dequantize(qweight, scales, zeros, split_k_iters,
  File "/mnt/Shared_h0/zjp/anaconda3/envs/deepseek/lib/python3.9/site-packages/torch/_ops.py", line 1116, in __call__
    return self._op(*args, **(kwargs or {}))
RuntimeError: expected scalar type Half but found BFloat16
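
For context on the failure: vllm's CUDA awq_dequantize kernel only accepts fp16 (Half) scales, while the DeepSeek-R1 config defaults to bf16. Below is a minimal sketch of the mismatch, assuming the vllm 0.7.2 ops.awq_dequantize(qweight, scales, zeros, split_k_iters, thx, thy) signature from the traceback and illustrative AWQ tensor shapes (4-bit packing, group size 128); it is not the sglang code path itself.

import torch
from vllm import _custom_ops as ops  # same module as in the traceback

# illustrative shapes for K=128, N=128, group_size=128
qweight = torch.zeros(128, 16, dtype=torch.int32, device="cuda")  # K x N/8, packed int4
qzeros = torch.zeros(1, 16, dtype=torch.int32, device="cuda")     # K/group x N/8, packed int4
scales = torch.ones(1, 128, dtype=torch.bfloat16, device="cuda")  # K/group x N, bf16 on purpose

ops.awq_dequantize(qweight, scales, qzeros, 0, 0, 0)
# RuntimeError: expected scalar type Half but found BFloat16

ops.awq_dequantize(qweight, scales.to(torch.float16), qzeros, 0, 0, 0)
# fp16 scales dequantize fine, which is effectively what --dtype half achieves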

Reproduction

I used the command recommended by the instructions:

python3 -m sglang.launch_server --model cognitivecomputations/DeepSeek-R1-AWQ --tp 8 --trust-remote-code

But it produces the same error as above.

RuntimeError: expected scalar type Half but found BFloat16

Environment

INFO 02-22 12:35:34 __init__.py:190] Automatically detected platform cuda.
Python: 3.9.21 (main, Dec 11 2024, 16:24:11) [GCC 11.2.0]
CUDA available: True
GPU 0,1,2,3,4,5,6,7: NVIDIA H100 80GB HBM3
GPU 0,1,2,3,4,5,6,7 Compute Capability: 9.0
CUDA_HOME: /usr/local/cuda
NVCC: Cuda compilation tools, release 12.4, V12.4.131
CUDA Driver Version: 570.86.15
PyTorch: 2.5.1+cu124
sglang: 0.4.3.post2
sgl_kernel: 0.0.3.post6
flashinfer: 0.2.1.post2+cu124torch2.5
triton: 3.1.0
transformers: 4.48.3
torchao: 0.8.0
numpy: 1.26.4
aiohttp: 3.11.12
fastapi: 0.115.8
hf_transfer: 0.1.9
huggingface_hub: 0.29.1
interegular: 0.3.3
modelscope: 1.23.0
orjson: 3.10.15
packaging: 24.2
psutil: 7.0.0
pydantic: 2.10.6
multipart: 0.0.20
zmq: 26.2.1
uvicorn: 0.34.0
uvloop: 0.21.0
vllm: 0.7.2
openai: 1.63.2
tiktoken: 0.9.0
anthropic: 0.46.0
decord: 0.6.0
NVIDIA Topology: 
        GPU0    GPU1    GPU2    GPU3    GPU4    GPU5    GPU6    GPU7    CPU Affinity    NUMA Affinity   GPU NUMA ID
GPU0     X      NV18    NODE    NODE    SYS     SYS     SYS     SYS     0-127,256-383   0               N/A
GPU1    NV18     X      NODE    NODE    SYS     SYS     SYS     SYS     0-127,256-383   0               N/A
GPU2    NODE    NODE     X      NV17    SYS     SYS     SYS     SYS     0-127,256-383   0               N/A
GPU3    NODE    NODE    NV17     X      SYS     SYS     SYS     SYS     0-127,256-383   0               N/A
GPU4    SYS     SYS     SYS     SYS      X      NV18    NODE    NODE    128-255,384-511 1               N/A
GPU5    SYS     SYS     SYS     SYS     NV18     X      NODE    NODE    128-255,384-511 1               N/A
GPU6    SYS     SYS     SYS     SYS     NODE    NODE     X      NV18    128-255,384-511 1               N/A
GPU7    SYS     SYS     SYS     SYS     NODE    NODE    NV18     X      128-255,384-511 1               N/A

Legend:

  X    = Self
  SYS  = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
  NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
  PHB  = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
  PXB  = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
  PIX  = Connection traversing at most a single PCIe bridge
  NV#  = Connection traversing a bonded set of # NVLinks

ulimit soft: 1024
@FrankLeeeee
Collaborator

Hi, let me try to reproduce it.

@FrankLeeeee
Collaborator

FrankLeeeee commented Feb 22, 2025

I managed to run this model with the following command by adding the argument --dtype half.

python3 -m sglang.launch_server --model cognitivecomputations/DeepSeek-R1-AWQ --tp 8 --trust-remote-code --dtype half
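
Once the server is up, a quick smoke test (a sketch assuming sglang's native /generate endpoint on the default port 30000; the sampling parameters are illustrative):

import requests

# minimal request against the server started above
resp = requests.post(
    "http://localhost:30000/generate",
    json={
        "text": "List 3 countries and their capitals.",
        "sampling_params": {"max_new_tokens": 64, "temperature": 0},
    },
)
print(resp.json()["text"])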

FrankLeeeee added the help wanted label Feb 22, 2025
@zjp-shadow
Author

> I managed to run this model with the following command by adding the argument --dtype half.
>
> python3 -m sglang.launch_server --model cognitivecomputations/DeepSeek-R1-AWQ --tp 8 --trust-remote-code --dtype half

This does indeed start, but I'm not sure whether the issue is caused by the precision of the different dtypes. Running R1-AWQ in fp16 tends to produce gibberish, which I didn't encounter with the 1.58-bit quantization I deployed earlier via Ollama. The R1-AWQ answer is shown in the screenshot below.

[Image: screenshot of the R1-AWQ response]
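
One plausible reason fp16 can degrade a bf16-trained checkpoint is fp16's much narrower dynamic range: its largest finite value is 65504, versus roughly 3.4e38 for bf16, so large weights or activations overflow to inf when cast. An illustrative snippet:

import torch

x = torch.tensor([70000.0], dtype=torch.bfloat16)  # finite in bf16
print(x.to(torch.float16))  # tensor([inf], dtype=torch.float16): exceeds fp16's 65504 max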

@fridayL

fridayL commented Feb 24, 2025


I get the same problem. How did you solve it?

@FrankLeeeee
Collaborator

FrankLeeeee commented Feb 25, 2025

@zjp-shadow @fridayL

I figured it out: AWQ does not work with MLA (DeepSeek's multi-head latent attention) yet. With this command, which adds --disable-mla, you can get the model running and generating the expected output.

python3 -m sglang.launch_server --model cognitivecomputations/DeepSeek-R1-AWQ --tp 8 --trust-remote-code --dtype half --disable-mla

My input is like:

messages=[
        {"role": "user", "content": "List 3 countries and their capitals."},
    ],
    temperature=...

This is my output:

"<think>\nOkay, let's see. The user is asking for a list of three countries and their capitals. Hmm, I need to make sure I pick countries that are well-known so the answer is useful. Maybe start with some obvious ones. United States, Canada, Mexico? Wait, but maybe some people might"
