
FEAT: support deepseek-r1-distill-qwen #2781

Merged: 8 commits into xorbitsai:main from feat/deepseek-distill (Jan 24, 2025)

Conversation

Contributor

qinxuye commented Jan 23, 2025

No description provided.

XprobeBot added this to the v1.x milestone Jan 23, 2025
ChengjieLi28 merged commit a57b99b into xorbitsai:main Jan 24, 2025
13 checks passed
qinxuye deleted the feat/deepseek-distill branch January 24, 2025 08:53

worm128 commented Feb 4, 2025

Both DeepSeek-R1-Distill-Qwen-14B-GGUF and deepseek-r1-distill-qwen-14b-awq fail to load.
The error message for DeepSeek-R1-Distill-Qwen-14B-GGUF:
2025-02-04 20:37:52 2025-02-04 04:37:52,016 xinference.core.worker 45 INFO [request d977912c-e2f4-11ef-bd47-0242ac110003] Enter launch_builtin_model, args: <xinference.core.worker.WorkerActor object at 0x7f7baf8f8270>, kwargs: model_uid=DeepSeek-R1-Distill-Qwen-14B-GGUF-0,model_name=DeepSeek-R1-Distill-Qwen-14B-GGUF,model_size_in_billions=14,model_format=ggufv2,quantization=Q6_K,model_engine=llama.cpp,model_type=LLM,n_gpu=auto,request_limits=None,peft_model_config=None,gpu_idx=[0],download_hub=None,model_path=None,xavier_config=None
2025-02-04 20:37:52 2025-02-04 04:37:52,017 xinference.core.worker 45 INFO You specify to launch the model: DeepSeek-R1-Distill-Qwen-14B-GGUF on GPU index: [0] of the worker: 0.0.0.0:52371, xinference will automatically ignore the n_gpu option.
2025-02-04 20:37:52 2025-02-04 04:37:52,581 xinference.model.llm.llm_family 45 INFO Caching from URI: /data
2025-02-04 20:37:52 2025-02-04 04:37:52,586 xinference.model.llm.llm_family 45 INFO Cache /data exists
2025-02-04 20:37:55 WARNING 02-04 04:37:55 cuda.py:81] Detected different devices in the system:
2025-02-04 20:37:55 WARNING 02-04 04:37:55 cuda.py:81] NVIDIA GeForce RTX 2080 Ti
2025-02-04 20:37:55 WARNING 02-04 04:37:55 cuda.py:81] NVIDIA GeForce RTX 3090 Ti
2025-02-04 20:37:55 WARNING 02-04 04:37:55 cuda.py:81] Please make sure to set CUDA_DEVICE_ORDER=PCI_BUS_ID to avoid unexpected behavior.
2025-02-04 20:38:42 2025-02-04 04:38:42,145 xinference.core.model 66 INFO Start requests handler.
2025-02-04 20:38:42 ggml_cuda_init: GGML_CUDA_FORCE_MMQ: yes
2025-02-04 20:38:42 ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
2025-02-04 20:38:42 ggml_cuda_init: found 1 CUDA devices:
2025-02-04 20:38:42 Device 0: NVIDIA GeForce RTX 3090 Ti, compute capability 8.6, VMM: yes
2025-02-04 20:38:42 llama_load_model_from_file: using device CUDA0 (NVIDIA GeForce RTX 3090 Ti) - 23287 MiB free
2025-02-04 20:38:42 gguf_init_from_file: invalid magic characters ''
2025-02-04 20:38:42 llama_model_load: error loading model: llama_model_loader: failed to load model from /data/DeepSeek-R1-Distill-Qwen-14B-GGUF
2025-02-04 20:38:42
2025-02-04 20:38:42 llama_load_model_from_file: failed to load model
2025-02-04 20:38:42 2025-02-04 04:38:42,417 xinference.core.worker 45 ERROR Failed to load model DeepSeek-R1-Distill-Qwen-14B-GGUF-0
2025-02-04 20:38:42 Traceback (most recent call last):
2025-02-04 20:38:42 File "/usr/local/lib/python3.10/dist-packages/xinference/core/worker.py", line 908, in launch_builtin_model
2025-02-04 20:38:42 await model_ref.load()
2025-02-04 20:38:42 File "/usr/local/lib/python3.10/dist-packages/xoscar/backends/context.py", line 231, in send
2025-02-04 20:38:42 return self._process_result_message(result)
2025-02-04 20:38:42 File "/usr/local/lib/python3.10/dist-packages/xoscar/backends/context.py", line 102, in _process_result_message
2025-02-04 20:38:42 raise message.as_instanceof_cause()
2025-02-04 20:38:42 File "/usr/local/lib/python3.10/dist-packages/xoscar/backends/pool.py", line 667, in send
2025-02-04 20:38:42 result = await self._run_coro(message.message_id, coro)
2025-02-04 20:38:42 File "/usr/local/lib/python3.10/dist-packages/xoscar/backends/pool.py", line 370, in _run_coro
2025-02-04 20:38:42 return await coro
2025-02-04 20:38:42 File "/usr/local/lib/python3.10/dist-packages/xoscar/api.py", line 384, in on_receive
2025-02-04 20:38:42 return await super().on_receive(message) # type: ignore
2025-02-04 20:38:42 File "xoscar/core.pyx", line 558, in on_receive
2025-02-04 20:38:42 raise ex
2025-02-04 20:38:42 File "xoscar/core.pyx", line 520, in xoscar.core._BaseActor.on_receive
2025-02-04 20:38:42 async with self._lock:
2025-02-04 20:38:42 File "xoscar/core.pyx", line 521, in xoscar.core._BaseActor.on_receive
2025-02-04 20:38:42 with debug_async_timeout('actor_lock_timeout',
2025-02-04 20:38:42 File "xoscar/core.pyx", line 526, in xoscar.core._BaseActor.on_receive
2025-02-04 20:38:42 result = await result
2025-02-04 20:38:42 File "/usr/local/lib/python3.10/dist-packages/xinference/core/model.py", line 457, in load
2025-02-04 20:38:42 self._model.load()
2025-02-04 20:38:42 File "/usr/local/lib/python3.10/dist-packages/xinference/model/llm/llama_cpp/core.py", line 140, in load
2025-02-04 20:38:42 self._llm = Llama(
2025-02-04 20:38:42 File "/usr/local/lib/python3.10/dist-packages/llama_cpp/llama.py", line 369, in init
2025-02-04 20:38:42 internals.LlamaModel(
2025-02-04 20:38:42 File "/usr/local/lib/python3.10/dist-packages/llama_cpp/_internals.py", line 56, in init
2025-02-04 20:38:42 raise ValueError(f"Failed to load model from file: {path_model}")
2025-02-04 20:38:42 ValueError: [address=0.0.0.0:40445, pid=66] Failed to load model from file: /data/DeepSeek-R1-Distill-Qwen-14B-GGUF
2025-02-04 20:38:42 2025-02-04 04:38:42,479 xinference.core.worker 45 ERROR [request d977912c-e2f4-11ef-bd47-0242ac110003] Leave launch_builtin_model, error: [address=0.0.0.0:40445, pid=66] Failed to load model from file: /data/DeepSeek-R1-Distill-Qwen-14B-GGUF, elapsed time: 50 s
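
For reference (not part of the original report): the `gguf_init_from_file: invalid magic characters ''` line above means llama.cpp read an empty header where the four ASCII bytes `GGUF` should be, which typically happens when `model_path` resolves to a directory or a non-GGUF file instead of the `.gguf` file itself. A minimal, hypothetical sketch for checking this locally:

```python
import os
import struct

GGUF_MAGIC = b"GGUF"  # the first four bytes of every valid GGUF file


def check_gguf(path: str) -> None:
    """Print whether `path` looks like a loadable GGUF file."""
    if os.path.isdir(path):
        # llama.cpp needs the .gguf file itself, not its containing directory
        print(f"{path} is a directory; point model_path at the .gguf file inside it")
        return
    with open(path, "rb") as f:
        header = f.read(8)
    if header[:4] != GGUF_MAGIC:
        print(f"{path} does not start with the GGUF magic (got {header[:4]!r})")
    else:
        version, = struct.unpack("<I", header[4:8])  # uint32 version follows the magic
        print(f"{path} looks like a GGUF file (version {version})")


# Path taken from the log above
check_gguf("/data/DeepSeek-R1-Distill-Qwen-14B-GGUF")
```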


Zephyr69 commented Feb 6, 2025

I'm encountering a weird problem with deepseek-r1-distill-qwen 32b awq. I loaded the model with the vLLM backend. With each request, the model seems to stop generating after outputting 1000+ tokens. There are no warnings or errors from Xinference or vLLM.

Contributor Author

qinxuye commented Feb 6, 2025

> I'm encountering a weird problem with deepseek-r1-distill-qwen 32b awq. I loaded the model with the vLLM backend. With each request, the model seems to stop generating after outputting 1000+ tokens. There are no warnings or errors from Xinference or vLLM.

What's the stop reason?


Zephyr69 commented Feb 6, 2025

> > I'm encountering a weird problem with deepseek-r1-distill-qwen 32b awq. I loaded the model with the vLLM backend. With each request, the model seems to stop generating after outputting 1000+ tokens. There are no warnings or errors from Xinference or vLLM.
>
> What's the stop reason?

There isn't an apparent stop reason other than "finished request xxx".


Zephyr69 commented Feb 7, 2025

> > I'm encountering a weird problem with deepseek-r1-distill-qwen 32b awq. I loaded the model with the vLLM backend. With each request, the model seems to stop generating after outputting 1000+ tokens. There are no warnings or errors from Xinference or vLLM.
>
> What's the stop reason?

I think I found the problem. It seems Xinference is not passing max_tokens through to vLLM's sampling parameters, so vLLM falls back to its default of 1024.
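
As a workaround until that is addressed, max_tokens can be set explicitly on each request. The sketch below uses the OpenAI-compatible endpoint that Xinference serves; the host/port (localhost:9997) and the model UID are assumptions to adjust for your deployment, and finish_reason is printed to confirm whether generation stopped on the token budget ("length") or on a stop token ("stop"):

```python
from openai import OpenAI

# Assumed local Xinference endpoint and model UID; adjust both to your deployment.
client = OpenAI(base_url="http://localhost:9997/v1", api_key="not-needed")

resp = client.chat.completions.create(
    model="deepseek-r1-distill-qwen",  # model UID used when launching the model
    messages=[{"role": "user", "content": "Explain the halting problem."}],
    max_tokens=4096,  # set explicitly so the backend default is not applied
)

choice = resp.choices[0]
print(choice.finish_reason)   # "length" means the max_tokens budget was exhausted
print(choice.message.content)
```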
