
[Bug]: Could not run '_C::rms_norm' with arguments from the 'CUDA' backend. #12441

Closed
1 task done
851780266 opened this issue Jan 26, 2025 · 12 comments
Labels
bug Something isn't working

Comments

@851780266

Your current environment

env

Collecting environment information...
PyTorch version: 2.5.1+cu124
Is debug build: False
CUDA used to build PyTorch: 12.4
ROCM used to build PyTorch: N/A

OS: CentOS Linux 7 (Core) (x86_64)
GCC version: (conda-forge gcc 14.2.0-1) 14.2.0
Clang version: Could not collect
CMake version: version 3.31.4
Libc version: glibc-2.17

Python version: 3.10.16 | packaged by conda-forge | (main, Dec  5 2024, 14:16:10) [GCC 13.3.0] (64-bit runtime)
Python platform: Linux-3.10.0-1160.119.1.el7.x86_64-x86_64-with-glibc2.17
Is CUDA available: True
CUDA runtime version: 12.4.131
CUDA_MODULE_LOADING set to: LAZY
GPU models and configuration: 
GPU 0: Tesla V100-PCIE-32GB
GPU 1: Tesla V100-PCIE-32GB
GPU 2: Tesla V100-PCIE-32GB
GPU 3: Tesla V100-PCIE-32GB

Nvidia driver version: Could not collect
cuDNN version: Probably one of the following:
/usr/local/cuda-12.4/targets/x86_64-linux/lib/libcudnn.so.9.2.0
/usr/local/cuda-12.4/targets/x86_64-linux/lib/libcudnn_adv.so.9.2.0
/usr/local/cuda-12.4/targets/x86_64-linux/lib/libcudnn_cnn.so.9.2.0
/usr/local/cuda-12.4/targets/x86_64-linux/lib/libcudnn_engines_precompiled.so.9.2.0
/usr/local/cuda-12.4/targets/x86_64-linux/lib/libcudnn_engines_runtime_compiled.so.9.2.0
/usr/local/cuda-12.4/targets/x86_64-linux/lib/libcudnn_graph.so.9.2.0
/usr/local/cuda-12.4/targets/x86_64-linux/lib/libcudnn_heuristic.so.9.2.0
/usr/local/cuda-12.4/targets/x86_64-linux/lib/libcudnn_ops.so.9.2.0
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True

CPU:
Architecture:          x86_64
CPU op-mode(s):        32-bit, 64-bit
Byte Order:            Little Endian
CPU(s):                32
On-line CPU(s) list:   0-31
Thread(s) per core:    2
Core(s) per socket:    1
Socket(s):             16
NUMA node(s):          1
Vendor ID:             GenuineIntel
CPU family:            6
Model:                 85
Model name:            Intel Xeon Processor (Cascadelake)
Stepping:              5
CPU MHz:               3099.998
BogoMIPS:              6199.99
Hypervisor vendor:     KVM
Virtualization type:   full
L1d cache:             32K
L1i cache:             32K
L2 cache:              4096K
L3 cache:              16384K
NUMA node0 CPU(s):     0-31
Flags:                 fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ss ht syscall nx pdpe1gb rdtscp lm constant_tsc rep_good nopl xtopology eagerfpu pni pclmulqdq ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm abm 3dnowprefetch invpcid_single ssbd rsb_ctxsw ibrs ibpb stibp ibrs_enhanced fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid mpx avx512f avx512dq rdseed adx smap clflushopt clwb avx512cd avx512bw avx512vl xsaveopt xsavec xgetbv1 arat umip pku ospke avx512_vnni md_clear spec_ctrl intel_stibp arch_capabilities

Versions of relevant libraries:
[pip3] numpy==1.26.4
[pip3] nvidia-cublas-cu12==12.1.3.1
[pip3] nvidia-cuda-cupti-cu12==12.1.105
[pip3] nvidia-cuda-nvrtc-cu12==12.1.105
[pip3] nvidia-cuda-runtime-cu12==12.1.105
[pip3] nvidia-cudnn-cu12==9.1.0.70
[pip3] nvidia-cufft-cu12==11.0.2.54
[pip3] nvidia-curand-cu12==10.3.2.106
[pip3] nvidia-cusolver-cu12==11.4.5.107
[pip3] nvidia-cusparse-cu12==12.1.0.106
[pip3] nvidia-ml-py==12.560.30
[pip3] nvidia-nccl-cu12==2.20.5
[pip3] nvidia-nvjitlink-cu12==12.4.127
[pip3] nvidia-nvtx-cu12==12.1.105
[pip3] pyzmq==26.2.0
[pip3] torch==2.4.0
[pip3] torchaudio==2.1.1
[pip3] torchvision==0.19.0
[pip3] transformers==4.48.0
[pip3] triton==3.0.0
[conda] blas                      1.0                         mkl    https://mirrors.tuna.tsinghua.edu.cn/anaconda/cloud/conda-forge
[conda] cuda-cudart               11.8.89                       0    nvidia
[conda] cuda-cupti                11.8.87                       0    nvidia
[conda] cuda-libraries            11.8.0                        0    nvidia
[conda] cuda-nvrtc                11.8.89                       0    nvidia
[conda] cuda-nvtx                 11.8.86                       0    nvidia
[conda] cuda-runtime              11.8.0                        0    nvidia
[conda] cuda-version              12.6                          3    nvidia
[conda] libblas                   3.9.0            16_linux64_mkl    https://mirrors.tuna.tsinghua.edu.cn/anaconda/cloud/conda-forge
[conda] libcblas                  3.9.0            16_linux64_mkl    https://mirrors.tuna.tsinghua.edu.cn/anaconda/cloud/conda-forge
[conda] libcublas                 11.11.3.6                     0    nvidia
[conda] libcufft                  10.9.0.58                     0    nvidia
[conda] libcufile                 1.11.1.6                      0    nvidia
[conda] libcurand                 10.3.7.77                     0    nvidia
[conda] libcusolver               11.4.1.48                     0    nvidia
[conda] libcusparse               11.7.5.86                     0    nvidia
[conda] liblapack                 3.9.0            16_linux64_mkl    https://mirrors.tuna.tsinghua.edu.cn/anaconda/cloud/conda-forge
[conda] libnpp                    11.8.0.86                     0    nvidia
[conda] libnvjpeg                 11.9.0.86                     0    nvidia
[conda] libopenvino-pytorch-frontend 2024.6.0             h5888daf_3    https://mirrors.tuna.tsinghua.edu.cn/anaconda/cloud/conda-forge
[conda] mkl                       2022.1.0           hc2b9512_224    https://mirrors.ustc.edu.cn/anaconda/pkgs/main
[conda] numpy                     1.26.4                   pypi_0    pypi
[conda] nvidia-cublas-cu12        12.1.3.1                 pypi_0    pypi
[conda] nvidia-cuda-cupti-cu12    12.1.105                 pypi_0    pypi
[conda] nvidia-cuda-nvrtc-cu12    12.1.105                 pypi_0    pypi
[conda] nvidia-cuda-runtime-cu12  12.1.105                 pypi_0    pypi
[conda] nvidia-cudnn-cu12         9.1.0.70                 pypi_0    pypi
[conda] nvidia-cufft-cu12         11.0.2.54                pypi_0    pypi
[conda] nvidia-curand-cu12        10.3.2.106               pypi_0    pypi
[conda] nvidia-cusolver-cu12      11.4.5.107               pypi_0    pypi
[conda] nvidia-cusparse-cu12      12.1.0.106               pypi_0    pypi
[conda] nvidia-ml-py              12.560.30                pypi_0    pypi
[conda] nvidia-nccl-cu12          2.20.5                   pypi_0    pypi
[conda] nvidia-nvjitlink-cu12     12.4.127                 pypi_0    pypi
[conda] nvidia-nvtx-cu12          12.1.105                 pypi_0    pypi
[conda] pytorch-cuda              11.8                 h7e8668a_6    pytorch
[conda] pytorch-mutex             1.0                        cuda    pytorch
[conda] pyzmq                     26.2.0                   pypi_0    pypi
[conda] torch                     2.5.1                    pypi_0    pypi
[conda] torchaudio                2.1.1               py310_cu118    pytorch
[conda] torchvision               0.19.0                   pypi_0    pypi
[conda] transformers              4.48.0                   pypi_0    pypi
[conda] triton                    3.1.0                    pypi_0    pypi
ROCM Version: Could not collect
Neuron SDK Version: N/A
vLLM Version: 0.6.3.post1
vLLM Build Flags:
CUDA Archs: Not Set; ROCm: Disabled; Neuron: Disabled
GPU Topology:
GPU0    GPU1    GPU2    GPU3    CPU Affinity    NUMA Affinity   GPU NUMA ID
GPU0     X      PHB     PHB     PHB     0-31    0               N/A
GPU1    PHB      X      PHB     PHB     0-31    0               N/A
GPU2    PHB     PHB      X      PHB     0-31    0               N/A
GPU3    PHB     PHB     PHB      X      0-31    0               N/A

Legend:

  X    = Self
  SYS  = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
  NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
  PHB  = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
  PXB  = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
  PIX  = Connection traversing at most a single PCIe bridge
  NV#  = Connection traversing a bonded set of # NVLinks

LD_LIBRARY_PATH=/data/miniconda3/envs/vllm/lib/python3.10/site-packages/cv2/../../lib64:/usr/local/cuda/lib64:
MKL_INTERFACE_LAYER=LP64,GNU
CUDA_MODULE_LOADING=LAZY

Model Input Dumps

No response

🐛 Describe the bug

shell

python -m vllm.entrypoints.openai.api_server --served-model-name TableGPT2-7B --port 12233 --trust-remote-code --gpu-memory-utilization 0.9 --model ./TableGPT2-7B/ --dtype=half

error log


INFO 01-26 14:50:23 model_runner.py:1067] Loading model weights took 14.2487 GB
ERROR 01-26 14:50:23 _custom_ops.py:53] Error in calling custom op rms_norm: Could not run '_C::rms_norm' with arguments from the 'CUDA' backend. This could be because the operator doesn't exist for this backend, or was omitted during the selective/custom build process (if using custom build). If you are a Facebook employee using PyTorch on mobile, please visit https://fburl.com/ptmfixes for possible resolutions. '_C::rms_norm' is only available for these backends: [CPU, Meta, BackendSelect, Python, FuncTorchDynamicLayerBackMode, Functionalize, Named, Conjugate, Negative, ZeroTensor, ADInplaceOrView, AutogradOther, AutogradCPU, AutogradCUDA, AutogradXLA, AutogradMPS, AutogradXPU, AutogradHPU, AutogradLazy, AutogradMeta, Tracer, AutocastCPU, AutocastXPU, AutocastMPS, AutocastCUDA, FuncTorchBatched, BatchedNestedTensor, FuncTorchVmapMode, Batched, VmapMode, FuncTorchGradWrapper, PythonTLSSnapshot, FuncTorchDynamicLayerFrontMode, PreDispatch, PythonDispatcher].
ERROR 01-26 14:50:23 _custom_ops.py:53] 
ERROR 01-26 14:50:23 _custom_ops.py:53] CPU: registered at /workspace/csrc/torch_bindings.cpp:18 [kernel]
ERROR 01-26 14:50:23 _custom_ops.py:53] Meta: registered at ../aten/src/ATen/core/MetaFallbackKernel.cpp:23 [backend fallback]
ERROR 01-26 14:50:23 _custom_ops.py:53] BackendSelect: fallthrough registered at ../aten/src/ATen/core/BackendSelectFallbackKernel.cpp:3 [backend fallback]
ERROR 01-26 14:50:23 _custom_ops.py:53] Python: registered at ../aten/src/ATen/core/PythonFallbackKernel.cpp:153 [backend fallback]
ERROR 01-26 14:50:23 _custom_ops.py:53] FuncTorchDynamicLayerBackMode: registered at ../aten/src/ATen/functorch/DynamicLayer.cpp:497 [backend fallback]
ERROR 01-26 14:50:23 _custom_ops.py:53] Functionalize: registered at ../aten/src/ATen/FunctionalizeFallbackKernel.cpp:349 [backend fallback]
ERROR 01-26 14:50:23 _custom_ops.py:53] Named: registered at ../aten/src/ATen/core/NamedRegistrations.cpp:7 [backend fallback]
ERROR 01-26 14:50:23 _custom_ops.py:53] Conjugate: registered at ../aten/src/ATen/ConjugateFallback.cpp:17 [backend fallback]
ERROR 01-26 14:50:23 _custom_ops.py:53] Negative: registered at ../aten/src/ATen/native/NegateFallback.cpp:18 [backend fallback]
ERROR 01-26 14:50:23 _custom_ops.py:53] ZeroTensor: registered at ../aten/src/ATen/ZeroTensorFallback.cpp:86 [backend fallback]
ERROR 01-26 14:50:23 _custom_ops.py:53] ADInplaceOrView: fallthrough registered at ../aten/src/ATen/core/VariableFallbackKernel.cpp:96 [backend fallback]
ERROR 01-26 14:50:23 _custom_ops.py:53] AutogradOther: registered at ../aten/src/ATen/core/VariableFallbackKernel.cpp:63 [backend fallback]
ERROR 01-26 14:50:23 _custom_ops.py:53] AutogradCPU: registered at ../aten/src/ATen/core/VariableFallbackKernel.cpp:67 [backend fallback]
ERROR 01-26 14:50:23 _custom_ops.py:53] AutogradCUDA: registered at ../aten/src/ATen/core/VariableFallbackKernel.cpp:75 [backend fallback]
ERROR 01-26 14:50:23 _custom_ops.py:53] AutogradXLA: registered at ../aten/src/ATen/core/VariableFallbackKernel.cpp:79 [backend fallback]
ERROR 01-26 14:50:23 _custom_ops.py:53] AutogradMPS: registered at ../aten/src/ATen/core/VariableFallbackKernel.cpp:87 [backend fallback]
ERROR 01-26 14:50:23 _custom_ops.py:53] AutogradXPU: registered at ../aten/src/ATen/core/VariableFallbackKernel.cpp:71 [backend fallback]
ERROR 01-26 14:50:23 _custom_ops.py:53] AutogradHPU: registered at ../aten/src/ATen/core/VariableFallbackKernel.cpp:100 [backend fallback]
ERROR 01-26 14:50:23 _custom_ops.py:53] AutogradLazy: registered at ../aten/src/ATen/core/VariableFallbackKernel.cpp:83 [backend fallback]
ERROR 01-26 14:50:23 _custom_ops.py:53] AutogradMeta: registered at ../aten/src/ATen/core/VariableFallbackKernel.cpp:91 [backend fallback]
ERROR 01-26 14:50:23 _custom_ops.py:53] Tracer: registered at ../torch/csrc/autograd/TraceTypeManual.cpp:294 [backend fallback]
ERROR 01-26 14:50:23 _custom_ops.py:53] AutocastCPU: fallthrough registered at ../aten/src/ATen/autocast_mode.cpp:321 [backend fallback]
ERROR 01-26 14:50:23 _custom_ops.py:53] AutocastXPU: fallthrough registered at ../aten/src/ATen/autocast_mode.cpp:463 [backend fallback]
ERROR 01-26 14:50:23 _custom_ops.py:53] AutocastMPS: fallthrough registered at ../aten/src/ATen/autocast_mode.cpp:209 [backend fallback]
ERROR 01-26 14:50:23 _custom_ops.py:53] AutocastCUDA: fallthrough registered at ../aten/src/ATen/autocast_mode.cpp:165 [backend fallback]
ERROR 01-26 14:50:23 _custom_ops.py:53] FuncTorchBatched: registered at ../aten/src/ATen/functorch/LegacyBatchingRegistrations.cpp:731 [backend fallback]
ERROR 01-26 14:50:23 _custom_ops.py:53] BatchedNestedTensor: registered at ../aten/src/ATen/functorch/LegacyBatchingRegistrations.cpp:758 [backend fallback]
ERROR 01-26 14:50:23 _custom_ops.py:53] FuncTorchVmapMode: fallthrough registered at ../aten/src/ATen/functorch/VmapModeRegistrations.cpp:27 [backend fallback]
ERROR 01-26 14:50:23 _custom_ops.py:53] Batched: registered at ../aten/src/ATen/LegacyBatchingRegistrations.cpp:1075 [backend fallback]
ERROR 01-26 14:50:23 _custom_ops.py:53] VmapMode: fallthrough registered at ../aten/src/ATen/VmapModeRegistrations.cpp:33 [backend fallback]
ERROR 01-26 14:50:23 _custom_ops.py:53] FuncTorchGradWrapper: registered at ../aten/src/ATen/functorch/TensorWrapper.cpp:207 [backend fallback]
ERROR 01-26 14:50:23 _custom_ops.py:53] PythonTLSSnapshot: registered at ../aten/src/ATen/core/PythonFallbackKernel.cpp:161 [backend fallback]
ERROR 01-26 14:50:23 _custom_ops.py:53] FuncTorchDynamicLayerFrontMode: registered at ../aten/src/ATen/functorch/DynamicLayer.cpp:493 [backend fallback]
ERROR 01-26 14:50:23 _custom_ops.py:53] PreDispatch: registered at ../aten/src/ATen/core/PythonFallbackKernel.cpp:165 [backend fallback]
ERROR 01-26 14:50:23 _custom_ops.py:53] PythonDispatcher: registered at ../aten/src/ATen/core/PythonFallbackKernel.cpp:157 [backend fallback]
ERROR 01-26 14:50:23 _custom_ops.py:53] 
ERROR 01-26 14:50:23 _custom_ops.py:53] Not implemented or built, mostly likely because the current current device does not support this kernel (less likely TORCH_CUDA_ARCH_LIST was set incorrectly while building)
INFO 01-26 14:50:23 model_runner_base.py:120] Writing input of failed execution to /tmp/err_execute_model_input_20250126-145023.pkl...
INFO 01-26 14:50:23 model_runner_base.py:149] Completed writing input of failed execution to /tmp/err_execute_model_input_20250126-145023.pkl.
Process SpawnProcess-1:
Traceback (most recent call last):
  File "/data/miniconda3/envs/vllm/lib/python3.10/site-packages/vllm/_custom_ops.py", line 45, in wrapper
    return fn(*args, **kwargs)
  File "/data/miniconda3/envs/vllm/lib/python3.10/site-packages/vllm/_custom_ops.py", line 207, in rms_norm
    torch.ops._C.rms_norm(out, input, weight, epsilon)
  File "/data/miniconda3/envs/vllm/lib/python3.10/site-packages/torch/_ops.py", line 1116, in __call__
    return self._op(*args, **(kwargs or {}))
NotImplementedError: Could not run '_C::rms_norm' with arguments from the 'CUDA' backend. This could be because the operator doesn't exist for this backend, or was omitted during the selective/custom build process (if using custom build). If you are a Facebook employee using PyTorch on mobile, please visit https://fburl.com/ptmfixes for possible resolutions. '_C::rms_norm' is only available for these backends: [CPU, Meta, BackendSelect, Python, FuncTorchDynamicLayerBackMode, Functionalize, Named, Conjugate, Negative, ZeroTensor, ADInplaceOrView, AutogradOther, AutogradCPU, AutogradCUDA, AutogradXLA, AutogradMPS, AutogradXPU, AutogradHPU, AutogradLazy, AutogradMeta, Tracer, AutocastCPU, AutocastXPU, AutocastMPS, AutocastCUDA, FuncTorchBatched, BatchedNestedTensor, FuncTorchVmapMode, Batched, VmapMode, FuncTorchGradWrapper, PythonTLSSnapshot, FuncTorchDynamicLayerFrontMode, PreDispatch, PythonDispatcher].

CPU: registered at /workspace/csrc/torch_bindings.cpp:18 [kernel]
Meta: registered at ../aten/src/ATen/core/MetaFallbackKernel.cpp:23 [backend fallback]
BackendSelect: fallthrough registered at ../aten/src/ATen/core/BackendSelectFallbackKernel.cpp:3 [backend fallback]
Python: registered at ../aten/src/ATen/core/PythonFallbackKernel.cpp:153 [backend fallback]
FuncTorchDynamicLayerBackMode: registered at ../aten/src/ATen/functorch/DynamicLayer.cpp:497 [backend fallback]
Functionalize: registered at ../aten/src/ATen/FunctionalizeFallbackKernel.cpp:349 [backend fallback]
Named: registered at ../aten/src/ATen/core/NamedRegistrations.cpp:7 [backend fallback]
Conjugate: registered at ../aten/src/ATen/ConjugateFallback.cpp:17 [backend fallback]
Negative: registered at ../aten/src/ATen/native/NegateFallback.cpp:18 [backend fallback]
ZeroTensor: registered at ../aten/src/ATen/ZeroTensorFallback.cpp:86 [backend fallback]
ADInplaceOrView: fallthrough registered at ../aten/src/ATen/core/VariableFallbackKernel.cpp:96 [backend fallback]
AutogradOther: registered at ../aten/src/ATen/core/VariableFallbackKernel.cpp:63 [backend fallback]
AutogradCPU: registered at ../aten/src/ATen/core/VariableFallbackKernel.cpp:67 [backend fallback]
AutogradCUDA: registered at ../aten/src/ATen/core/VariableFallbackKernel.cpp:75 [backend fallback]
AutogradXLA: registered at ../aten/src/ATen/core/VariableFallbackKernel.cpp:79 [backend fallback]
AutogradMPS: registered at ../aten/src/ATen/core/VariableFallbackKernel.cpp:87 [backend fallback]
AutogradXPU: registered at ../aten/src/ATen/core/VariableFallbackKernel.cpp:71 [backend fallback]
AutogradHPU: registered at ../aten/src/ATen/core/VariableFallbackKernel.cpp:100 [backend fallback]
AutogradLazy: registered at ../aten/src/ATen/core/VariableFallbackKernel.cpp:83 [backend fallback]
AutogradMeta: registered at ../aten/src/ATen/core/VariableFallbackKernel.cpp:91 [backend fallback]
Tracer: registered at ../torch/csrc/autograd/TraceTypeManual.cpp:294 [backend fallback]
AutocastCPU: fallthrough registered at ../aten/src/ATen/autocast_mode.cpp:321 [backend fallback]
AutocastXPU: fallthrough registered at ../aten/src/ATen/autocast_mode.cpp:463 [backend fallback]
AutocastMPS: fallthrough registered at ../aten/src/ATen/autocast_mode.cpp:209 [backend fallback]
AutocastCUDA: fallthrough registered at ../aten/src/ATen/autocast_mode.cpp:165 [backend fallback]
FuncTorchBatched: registered at ../aten/src/ATen/functorch/LegacyBatchingRegistrations.cpp:731 [backend fallback]
BatchedNestedTensor: registered at ../aten/src/ATen/functorch/LegacyBatchingRegistrations.cpp:758 [backend fallback]
FuncTorchVmapMode: fallthrough registered at ../aten/src/ATen/functorch/VmapModeRegistrations.cpp:27 [backend fallback]
Batched: registered at ../aten/src/ATen/LegacyBatchingRegistrations.cpp:1075 [backend fallback]
VmapMode: fallthrough registered at ../aten/src/ATen/VmapModeRegistrations.cpp:33 [backend fallback]
FuncTorchGradWrapper: registered at ../aten/src/ATen/functorch/TensorWrapper.cpp:207 [backend fallback]
PythonTLSSnapshot: registered at ../aten/src/ATen/core/PythonFallbackKernel.cpp:161 [backend fallback]
FuncTorchDynamicLayerFrontMode: registered at ../aten/src/ATen/functorch/DynamicLayer.cpp:493 [backend fallback]
PreDispatch: registered at ../aten/src/ATen/core/PythonFallbackKernel.cpp:165 [backend fallback]
PythonDispatcher: registered at ../aten/src/ATen/core/PythonFallbackKernel.cpp:157 [backend fallback]


The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/data/miniconda3/envs/vllm/lib/python3.10/site-packages/vllm/worker/model_runner_base.py", line 116, in _wrapper
    return func(*args, **kwargs)
  File "/data/miniconda3/envs/vllm/lib/python3.10/site-packages/vllm/worker/model_runner.py", line 1658, in execute_model
    hidden_or_intermediate_states = model_executable(
  File "/data/miniconda3/envs/vllm/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/data/miniconda3/envs/vllm/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1747, in _call_impl
    return forward_call(*args, **kwargs)
  File "/data/miniconda3/envs/vllm/lib/python3.10/site-packages/vllm/model_executor/models/qwen2.py", line 415, in forward
    hidden_states = self.model(input_ids, positions, kv_caches,
  File "/data/miniconda3/envs/vllm/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/data/miniconda3/envs/vllm/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1747, in _call_impl
    return forward_call(*args, **kwargs)
  File "/data/miniconda3/envs/vllm/lib/python3.10/site-packages/vllm/model_executor/models/qwen2.py", line 288, in forward
    hidden_states, residual = layer(
  File "/data/miniconda3/envs/vllm/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/data/miniconda3/envs/vllm/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1747, in _call_impl
    return forward_call(*args, **kwargs)
  File "/data/miniconda3/envs/vllm/lib/python3.10/site-packages/vllm/model_executor/models/qwen2.py", line 206, in forward
    hidden_states = self.input_layernorm(hidden_states)
  File "/data/miniconda3/envs/vllm/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/data/miniconda3/envs/vllm/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1747, in _call_impl
    return forward_call(*args, **kwargs)
  File "/data/miniconda3/envs/vllm/lib/python3.10/site-packages/vllm/model_executor/custom_op.py", line 16, in forward
    return self._forward_method(*args, **kwargs)
  File "/data/miniconda3/envs/vllm/lib/python3.10/site-packages/vllm/model_executor/layers/layernorm.py", line 87, in forward_cuda
    ops.rms_norm(
  File "/data/miniconda3/envs/vllm/lib/python3.10/site-packages/vllm/_custom_ops.py", line 54, in wrapper
    raise NotImplementedError(msg % (fn.__name__, e)) from e
NotImplementedError: Error in calling custom op rms_norm: Could not run '_C::rms_norm' with arguments from the 'CUDA' backend. This could be because the operator doesn't exist for this backend, or was omitted during the selective/custom build process (if using custom build). If you are a Facebook employee using PyTorch on mobile, please visit https://fburl.com/ptmfixes for possible resolutions. '_C::rms_norm' is only available for these backends: [CPU, Meta, BackendSelect, Python, FuncTorchDynamicLayerBackMode, Functionalize, Named, Conjugate, Negative, ZeroTensor, ADInplaceOrView, AutogradOther, AutogradCPU, AutogradCUDA, AutogradXLA, AutogradMPS, AutogradXPU, AutogradHPU, AutogradLazy, AutogradMeta, Tracer, AutocastCPU, AutocastXPU, AutocastMPS, AutocastCUDA, FuncTorchBatched, BatchedNestedTensor, FuncTorchVmapMode, Batched, VmapMode, FuncTorchGradWrapper, PythonTLSSnapshot, FuncTorchDynamicLayerFrontMode, PreDispatch, PythonDispatcher].

CPU: registered at /workspace/csrc/torch_bindings.cpp:18 [kernel]
Meta: registered at ../aten/src/ATen/core/MetaFallbackKernel.cpp:23 [backend fallback]
BackendSelect: fallthrough registered at ../aten/src/ATen/core/BackendSelectFallbackKernel.cpp:3 [backend fallback]
Python: registered at ../aten/src/ATen/core/PythonFallbackKernel.cpp:153 [backend fallback]
FuncTorchDynamicLayerBackMode: registered at ../aten/src/ATen/functorch/DynamicLayer.cpp:497 [backend fallback]
Functionalize: registered at ../aten/src/ATen/FunctionalizeFallbackKernel.cpp:349 [backend fallback]
Named: registered at ../aten/src/ATen/core/NamedRegistrations.cpp:7 [backend fallback]
Conjugate: registered at ../aten/src/ATen/ConjugateFallback.cpp:17 [backend fallback]
Negative: registered at ../aten/src/ATen/native/NegateFallback.cpp:18 [backend fallback]
ZeroTensor: registered at ../aten/src/ATen/ZeroTensorFallback.cpp:86 [backend fallback]
ADInplaceOrView: fallthrough registered at ../aten/src/ATen/core/VariableFallbackKernel.cpp:96 [backend fallback]
AutogradOther: registered at ../aten/src/ATen/core/VariableFallbackKernel.cpp:63 [backend fallback]
AutogradCPU: registered at ../aten/src/ATen/core/VariableFallbackKernel.cpp:67 [backend fallback]
AutogradCUDA: registered at ../aten/src/ATen/core/VariableFallbackKernel.cpp:75 [backend fallback]
AutogradXLA: registered at ../aten/src/ATen/core/VariableFallbackKernel.cpp:79 [backend fallback]
AutogradMPS: registered at ../aten/src/ATen/core/VariableFallbackKernel.cpp:87 [backend fallback]
AutogradXPU: registered at ../aten/src/ATen/core/VariableFallbackKernel.cpp:71 [backend fallback]
AutogradHPU: registered at ../aten/src/ATen/core/VariableFallbackKernel.cpp:100 [backend fallback]
AutogradLazy: registered at ../aten/src/ATen/core/VariableFallbackKernel.cpp:83 [backend fallback]
AutogradMeta: registered at ../aten/src/ATen/core/VariableFallbackKernel.cpp:91 [backend fallback]
Tracer: registered at ../torch/csrc/autograd/TraceTypeManual.cpp:294 [backend fallback]
AutocastCPU: fallthrough registered at ../aten/src/ATen/autocast_mode.cpp:321 [backend fallback]
AutocastXPU: fallthrough registered at ../aten/src/ATen/autocast_mode.cpp:463 [backend fallback]
AutocastMPS: fallthrough registered at ../aten/src/ATen/autocast_mode.cpp:209 [backend fallback]
AutocastCUDA: fallthrough registered at ../aten/src/ATen/autocast_mode.cpp:165 [backend fallback]
FuncTorchBatched: registered at ../aten/src/ATen/functorch/LegacyBatchingRegistrations.cpp:731 [backend fallback]
BatchedNestedTensor: registered at ../aten/src/ATen/functorch/LegacyBatchingRegistrations.cpp:758 [backend fallback]
FuncTorchVmapMode: fallthrough registered at ../aten/src/ATen/functorch/VmapModeRegistrations.cpp:27 [backend fallback]
Batched: registered at ../aten/src/ATen/LegacyBatchingRegistrations.cpp:1075 [backend fallback]
VmapMode: fallthrough registered at ../aten/src/ATen/VmapModeRegistrations.cpp:33 [backend fallback]
FuncTorchGradWrapper: registered at ../aten/src/ATen/functorch/TensorWrapper.cpp:207 [backend fallback]
PythonTLSSnapshot: registered at ../aten/src/ATen/core/PythonFallbackKernel.cpp:161 [backend fallback]
FuncTorchDynamicLayerFrontMode: registered at ../aten/src/ATen/functorch/DynamicLayer.cpp:493 [backend fallback]
PreDispatch: registered at ../aten/src/ATen/core/PythonFallbackKernel.cpp:165 [backend fallback]
PythonDispatcher: registered at ../aten/src/ATen/core/PythonFallbackKernel.cpp:157 [backend fallback]

Not implemented or built, mostly likely because the current current device does not support this kernel (less likely TORCH_CUDA_ARCH_LIST was set incorrectly while building)

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/data/miniconda3/envs/vllm/lib/python3.10/multiprocessing/process.py", line 314, in _bootstrap
    self.run()
  File "/data/miniconda3/envs/vllm/lib/python3.10/multiprocessing/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
  File "/data/miniconda3/envs/vllm/lib/python3.10/site-packages/vllm/engine/multiprocessing/engine.py", line 390, in run_mp_engine
    engine = MQLLMEngine.from_engine_args(engine_args=engine_args,
  File "/data/miniconda3/envs/vllm/lib/python3.10/site-packages/vllm/engine/multiprocessing/engine.py", line 139, in from_engine_args
    return cls(
  File "/data/miniconda3/envs/vllm/lib/python3.10/site-packages/vllm/engine/multiprocessing/engine.py", line 78, in __init__
    self.engine = LLMEngine(*args, **kwargs)
  File "/data/miniconda3/envs/vllm/lib/python3.10/site-packages/vllm/engine/llm_engine.py", line 348, in __init__
    self._initialize_kv_caches()
  File "/data/miniconda3/envs/vllm/lib/python3.10/site-packages/vllm/engine/llm_engine.py", line 483, in _initialize_kv_caches
    self.model_executor.determine_num_available_blocks())
  File "/data/miniconda3/envs/vllm/lib/python3.10/site-packages/vllm/executor/gpu_executor.py", line 114, in determine_num_available_blocks
    return self.driver_worker.determine_num_available_blocks()
  File "/data/miniconda3/envs/vllm/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
    return func(*args, **kwargs)
  File "/data/miniconda3/envs/vllm/lib/python3.10/site-packages/vllm/worker/worker.py", line 223, in determine_num_available_blocks
    self.model_runner.profile_run()
  File "/data/miniconda3/envs/vllm/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
    return func(*args, **kwargs)
  File "/data/miniconda3/envs/vllm/lib/python3.10/site-packages/vllm/worker/model_runner.py", line 1305, in profile_run
    self.execute_model(model_input, kv_caches, intermediate_tensors)
  File "/data/miniconda3/envs/vllm/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
    return func(*args, **kwargs)
  File "/data/miniconda3/envs/vllm/lib/python3.10/site-packages/vllm/worker/model_runner_base.py", line 152, in _wrapper
    raise type(err)(
NotImplementedError: Error in model execution (input dumped to /tmp/err_execute_model_input_20250126-145023.pkl): Error in calling custom op rms_norm: Could not run '_C::rms_norm' with arguments from the 'CUDA' backend. This could be because the operator doesn't exist for this backend, or was omitted during the selective/custom build process (if using custom build). If you are a Facebook employee using PyTorch on mobile, please visit https://fburl.com/ptmfixes for possible resolutions. '_C::rms_norm' is only available for these backends: [CPU, Meta, BackendSelect, Python, FuncTorchDynamicLayerBackMode, Functionalize, Named, Conjugate, Negative, ZeroTensor, ADInplaceOrView, AutogradOther, AutogradCPU, AutogradCUDA, AutogradXLA, AutogradMPS, AutogradXPU, AutogradHPU, AutogradLazy, AutogradMeta, Tracer, AutocastCPU, AutocastXPU, AutocastMPS, AutocastCUDA, FuncTorchBatched, BatchedNestedTensor, FuncTorchVmapMode, Batched, VmapMode, FuncTorchGradWrapper, PythonTLSSnapshot, FuncTorchDynamicLayerFrontMode, PreDispatch, PythonDispatcher].

CPU: registered at /workspace/csrc/torch_bindings.cpp:18 [kernel]
Meta: registered at ../aten/src/ATen/core/MetaFallbackKernel.cpp:23 [backend fallback]
BackendSelect: fallthrough registered at ../aten/src/ATen/core/BackendSelectFallbackKernel.cpp:3 [backend fallback]
Python: registered at ../aten/src/ATen/core/PythonFallbackKernel.cpp:153 [backend fallback]
FuncTorchDynamicLayerBackMode: registered at ../aten/src/ATen/functorch/DynamicLayer.cpp:497 [backend fallback]
Functionalize: registered at ../aten/src/ATen/FunctionalizeFallbackKernel.cpp:349 [backend fallback]
Named: registered at ../aten/src/ATen/core/NamedRegistrations.cpp:7 [backend fallback]
Conjugate: registered at ../aten/src/ATen/ConjugateFallback.cpp:17 [backend fallback]
Negative: registered at ../aten/src/ATen/native/NegateFallback.cpp:18 [backend fallback]
ZeroTensor: registered at ../aten/src/ATen/ZeroTensorFallback.cpp:86 [backend fallback]
ADInplaceOrView: fallthrough registered at ../aten/src/ATen/core/VariableFallbackKernel.cpp:96 [backend fallback]
AutogradOther: registered at ../aten/src/ATen/core/VariableFallbackKernel.cpp:63 [backend fallback]
AutogradCPU: registered at ../aten/src/ATen/core/VariableFallbackKernel.cpp:67 [backend fallback]
AutogradCUDA: registered at ../aten/src/ATen/core/VariableFallbackKernel.cpp:75 [backend fallback]
AutogradXLA: registered at ../aten/src/ATen/core/VariableFallbackKernel.cpp:79 [backend fallback]
AutogradMPS: registered at ../aten/src/ATen/core/VariableFallbackKernel.cpp:87 [backend fallback]
AutogradXPU: registered at ../aten/src/ATen/core/VariableFallbackKernel.cpp:71 [backend fallback]
AutogradHPU: registered at ../aten/src/ATen/core/VariableFallbackKernel.cpp:100 [backend fallback]
AutogradLazy: registered at ../aten/src/ATen/core/VariableFallbackKernel.cpp:83 [backend fallback]
AutogradMeta: registered at ../aten/src/ATen/core/VariableFallbackKernel.cpp:91 [backend fallback]
Tracer: registered at ../torch/csrc/autograd/TraceTypeManual.cpp:294 [backend fallback]
AutocastCPU: fallthrough registered at ../aten/src/ATen/autocast_mode.cpp:321 [backend fallback]
AutocastXPU: fallthrough registered at ../aten/src/ATen/autocast_mode.cpp:463 [backend fallback]
AutocastMPS: fallthrough registered at ../aten/src/ATen/autocast_mode.cpp:209 [backend fallback]
AutocastCUDA: fallthrough registered at ../aten/src/ATen/autocast_mode.cpp:165 [backend fallback]
FuncTorchBatched: registered at ../aten/src/ATen/functorch/LegacyBatchingRegistrations.cpp:731 [backend fallback]
BatchedNestedTensor: registered at ../aten/src/ATen/functorch/LegacyBatchingRegistrations.cpp:758 [backend fallback]
FuncTorchVmapMode: fallthrough registered at ../aten/src/ATen/functorch/VmapModeRegistrations.cpp:27 [backend fallback]
Batched: registered at ../aten/src/ATen/LegacyBatchingRegistrations.cpp:1075 [backend fallback]
VmapMode: fallthrough registered at ../aten/src/ATen/VmapModeRegistrations.cpp:33 [backend fallback]
FuncTorchGradWrapper: registered at ../aten/src/ATen/functorch/TensorWrapper.cpp:207 [backend fallback]
PythonTLSSnapshot: registered at ../aten/src/ATen/core/PythonFallbackKernel.cpp:161 [backend fallback]
FuncTorchDynamicLayerFrontMode: registered at ../aten/src/ATen/functorch/DynamicLayer.cpp:493 [backend fallback]
PreDispatch: registered at ../aten/src/ATen/core/PythonFallbackKernel.cpp:165 [backend fallback]
PythonDispatcher: registered at ../aten/src/ATen/core/PythonFallbackKernel.cpp:157 [backend fallback]

Not implemented or built, mostly likely because the current current device does not support this kernel (less likely TORCH_CUDA_ARCH_LIST was set incorrectly while building)
[rank0]:[W126 14:50:24.964326689 ProcessGroupNCCL.cpp:1250] Warning: WARNING: process group has NOT been destroyed before we destruct ProcessGroupNCCL. On normal program exit, the application should call destroy_process_group to ensure that any pending NCCL operations have finished in this process. In rare cases this process can exit before this point and block the progress of another member of the process group. This constraint has always been present,  but this warning has only been added since PyTorch 2.4 (function operator())
Traceback (most recent call last):
  File "/data/miniconda3/envs/vllm/lib/python3.10/runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/data/miniconda3/envs/vllm/lib/python3.10/runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "/data/miniconda3/envs/vllm/lib/python3.10/site-packages/vllm/entrypoints/openai/api_server.py", line 585, in <module>
    uvloop.run(run_server(args))
  File "/data/miniconda3/envs/vllm/lib/python3.10/site-packages/uvloop/__init__.py", line 82, in run
    return loop.run_until_complete(wrapper())
  File "uvloop/loop.pyx", line 1518, in uvloop.loop.Loop.run_until_complete
  File "/data/miniconda3/envs/vllm/lib/python3.10/site-packages/uvloop/__init__.py", line 61, in wrapper
    return await main
  File "/data/miniconda3/envs/vllm/lib/python3.10/site-packages/vllm/entrypoints/openai/api_server.py", line 552, in run_server
    async with build_async_engine_client(args) as engine_client:
  File "/data/miniconda3/envs/vllm/lib/python3.10/contextlib.py", line 199, in __aenter__
    return await anext(self.gen)
  File "/data/miniconda3/envs/vllm/lib/python3.10/site-packages/vllm/entrypoints/openai/api_server.py", line 107, in build_async_engine_client
    async with build_async_engine_client_from_engine_args(
  File "/data/miniconda3/envs/vllm/lib/python3.10/contextlib.py", line 199, in __aenter__
    return await anext(self.gen)
  File "/data/miniconda3/envs/vllm/lib/python3.10/site-packages/vllm/entrypoints/openai/api_server.py", line 194, in build_async_engine_client_from_engine_args
    raise RuntimeError(
RuntimeError: Engine process failed to start

Before submitting a new issue...

  • Make sure you already searched for relevant issues, and asked the chatbot living at the bottom right corner of the documentation page, which can answer lots of frequently asked questions.
@851780266 851780266 added the bug Something isn't working label Jan 26, 2025
@DarkLight1337 DarkLight1337 changed the title [Bug]: [Bug]: Could not run '_C::rms_norm' with arguments from the 'CUDA' backend. Jan 26, 2025
@NickLucche
Contributor

duplicate #12440

@2catycm

2catycm commented Feb 7, 2025

NotImplementedError: Could not run '_C::rms_norm' with arguments from the 'CUDA' backend. This could be because the operator doesn't exist for this backend, or was omitted during the selective/custom build process (if using custom build). If you are a Facebook employee using PyTorch on mobile, please visit https://fburl.com/ptmfixes for possible resolutions. '_C::rms_norm' is only available for these backends: [HIP, Meta, BackendSelect, Python, FuncTorchDynamicLayerBackMode, Functionalize, Named, Conjugate, Negative, ZeroTensor, ADInplaceOrView, AutogradOther, AutogradCPU, AutogradCUDA, AutogradXLA, AutogradMPS, AutogradXPU, AutogradHPU, AutogradLazy, AutogradMeta, Tracer, AutocastCPU, AutocastXPU, AutocastCUDA, FuncTorchBatched, BatchedNestedTensor, FuncTorchVmapMode, Batched, VmapMode, FuncTorchGradWrapper, PythonTLSSnapshot, FuncTorchDynamicLayerFrontMode, PreDispatch, PythonDispatcher].

@DC-Lin

DC-Lin commented Feb 8, 2025

NotImplementedError: Could not run '_C::rms_norm' with arguments from the 'CUDA' backend. This could be because the operator doesn't exist for this backend, or was omitted during the selective/custom build process (if using custom build). If you are a Facebook employee using PyTorch on mobile, please visit https://fburl.com/ptmfixes for possible resolutions. '_C::rms_norm' is only available for these backends: [HIP, Meta, BackendSelect, Python, FuncTorchDynamicLayerBackMode, Functionalize, Named, Conjugate, Negative, ZeroTensor, ADInplaceOrView, AutogradOther, AutogradCPU, AutogradCUDA, AutogradXLA, AutogradMPS, AutogradXPU, AutogradHPU, AutogradLazy, AutogradMeta, Tracer, AutocastCPU, AutocastXPU, AutocastCUDA, FuncTorchBatched, BatchedNestedTensor, FuncTorchVmapMode, Batched, VmapMode, FuncTorchGradWrapper, PythonTLSSnapshot, FuncTorchDynamicLayerFrontMode, PreDispatch, PythonDispatcher].

same problem

@SuperBruceJia

SuperBruceJia commented Feb 9, 2025

This is caused by a torch version mismatch:

Installing vllm==0.6.3 with torch==2.4.0 solved my problem!
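
For anyone else hitting this, a minimal sketch of rebuilding a clean environment. The assumption, based on the environment dump above (which lists torch 2.4.0 under pip and torch 2.5.1 under conda at the same time), is that a second torch build shadowed the one vLLM's compiled _C extension was built against:

# start from a fresh env so no second torch build can shadow the right one
conda create -n vllm-clean python=3.10 -y
conda activate vllm-clean
# the vllm wheel pins the torch it was compiled against
# (torch 2.4.0 for vllm 0.6.3, per the comment above)
pip install vllm==0.6.3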

@2catycm

2catycm commented Feb 10, 2025 via email

@hmellor hmellor marked this as a duplicate of #12440 Feb 11, 2025
@hmellor hmellor marked this as a duplicate of #13075 Feb 11, 2025
@hmellor
Member

hmellor commented Feb 11, 2025

Am I right in saying that this is an environment issue when incompatible versions of vLLM and torch are installed?
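
For later readers, one quick way to check for such a mismatch is to list every torch build visible to the environment and confirm which one actually gets imported; this sketch uses only standard pip/conda/torch commands:

# the dump above shows torch 2.4.0 under pip and 2.5.1 under conda at once
pip list | grep -Ei "torch|vllm"
conda list | grep -Ei "torch|vllm"

# confirm the build that is actually imported, and the GPU it sees
python -c "import torch, vllm; print(torch.__version__, torch.version.cuda, vllm.__version__, torch.cuda.get_device_capability(0))"

If the two listings disagree, the compiled _C extension in the vllm wheel may be loading against a torch it was not built for, which can produce exactly the missing-CUDA-backend error above.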

@851780266
Author

851780266 commented Feb 28, 2025 via email

@1556900941lizerui

Am I right in saying that this is an environment issue when incompatible versions of vLLM and torch are installed?

I compiled vLLM on the ARM architecture with torch 2.7, then reverted to torch 2.6 due to compatibility issues. Will this problem also occur there?

@hmellor
Member

hmellor commented Feb 28, 2025

Yes, at some point we will be updating the torch version. See #12721 for the upgrade to 2.6

@hmellor hmellor closed this as completed Feb 28, 2025
@1556900941lizerui

Yes, at some point we will be updating the torch version. See #12721 for the upgrade to 2.6

Thank you, I successfully ran it with torch 2.6, but I have two questions and hope you can help me answer them:
1. It seems that xFormers does not yet support the ARM architecture, yet vLLM ran without any related warnings or errors. I also noticed some FlashAttention-related files being compiled during the build. May I ask whether FlashAttention is used directly, rather than indirectly through xFormers?
2. When enforce_eager=True is not set, will it affect the inference results of the model?

@hmellor
Member

hmellor commented Feb 28, 2025

May I ask whether FlashAttention is used directly, rather than indirectly through xFormers?

I believe so, yes; xFormers installs its own version of FlashAttention.

When enforce_eager=True is not set, will it affect the inference results of the model?

Anything that changes the order of floating point operations will affect the inference results in some way. Sometimes it's not noticeable, sometimes it is.
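
For reference, eager mode can be forced on the server used in this issue with the standard --enforce-eager engine flag (same command as in the report above, with CUDA graphs disabled):

# --enforce-eager skips CUDA graph capture; outputs can differ slightly from
# graph mode because the order of floating point operations changes
python -m vllm.entrypoints.openai.api_server --model ./TableGPT2-7B/ --dtype=half --enforce-eager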

@1556900941lizerui

Thank you very much
