
[Installation]: Nvidia runtime issue? On new VLLM 0.7.0 #12505

Closed
1 task done
Playerrrrr opened this issue Jan 28, 2025 · 17 comments
Labels
installation Installation problems

Comments

@Playerrrrr

Your current environment

The output of `python collect_env.py`

docker run --runtime nvidia --gpus all \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  -p 8000:8000 --ipc=host \
  -e VLLM_ENABLE_PREFIX_CACHING=true \
  --name qwen2.5_20250128 \
  vllm/vllm-openai:v0.7.0 \
  --model Qwen/Qwen2.5-72B-Instruct \
  --tensor-parallel-size=4 \
  --gpu-memory-utilization=0.90 \
  --enforce-eager \
  --rope-scaling '{"type": "yarn","factor": 4,"original_max_position_embeddings": 32768}'
error:
/usr/bin/ld: cannot find -lcuda: No such file or directory

How you are installing vllm

docker
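As a first diagnostic (a sketch, not from the report above), one can check whether the NVIDIA container runtime actually mounts the driver's libcuda into this image; if the command prints nothing, the linker inside the container has no way to resolve -lcuda:

```shell
# List the libcuda entries visible to the loader inside the vLLM container.
docker run --rm --runtime nvidia --gpus all --entrypoint /bin/sh \
  vllm/vllm-openai:v0.7.0 -c 'ldconfig -p | grep libcuda'
```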

Before submitting a new issue...

  • Make sure you already searched for relevant issues, and asked the chatbot living at the bottom right corner of the documentation page, which can answer lots of frequently asked questions.
@Playerrrrr Playerrrrr added the installation Installation problems label Jan 28, 2025
@DarkLight1337 DarkLight1337 changed the title [Installation]: Nvidia runtime issue? On new VLLM 7.0 [Installation]: Nvidia runtime issue? On new VLLM 0.7.0 Jan 28, 2025
@jamesbraza

jamesbraza commented Jan 28, 2025

I hit this as well during vllm serve --tensor-parallel-size 2 today with vllm==0.7.0:

INFO 01-28 13:54:38 weight_utils.py:251] Using model weights format ['*.safetensors']
(VllmWorkerProcess pid=1241396) INFO 01-28 13:54:38 weight_utils.py:251] Using model weights format ['*.safetensors']
Loading safetensors checkpoint shards:   0% Completed | 0/2 [00:00<?, ?it/s]
Loading safetensors checkpoint shards:  50% Completed | 1/2 [00:00<00:00,  1.48it/s]
Loading safetensors checkpoint shards: 100% Completed | 2/2 [00:01<00:00,  1.14it/s]
Loading safetensors checkpoint shards: 100% Completed | 2/2 [00:01<00:00,  1.18it/s]

INFO 01-28 13:54:40 model_runner.py:1115] Loading model weights took 7.1441 GB
(VllmWorkerProcess pid=1241396) INFO 01-28 13:54:41 model_runner.py:1115] Loading model weights took 7.1441 GB
/usr/bin/ld: cannot find -lcuda: No such file or directory
collect2: error: ld returned 1 exit status
/usr/bin/ld: cannot find -lcuda: No such file or directory
collect2: error: ld returned 1 exit status
/usr/bin/ld: cannot find -lcuda: No such file or directory
/usr/bin/ld: cannot find -lcuda: No such file or directory
collect2: error: ld returned 1 exit status
collect2: error: ld returned 1 exit status

@mgoin
Member

mgoin commented Jan 29, 2025

@tlrmchlsmth would you have an idea? This seems related to #12424

@robertgshaw2-redhat
Collaborator

@russellb this looks similar to what you were helping dan with

@tlrmchlsmth
Collaborator

> @tlrmchlsmth would you have an idea? This seems related to #12424

Yep, does seem suspicious. Not sure what's going wrong though

@dhuangnm
Contributor

I hit a similar issue on my build instance (Ubuntu 20.04). Here is what I did to work around the error:

  1. Find where libcuda.so is installed on the instance. For example, on my machine with CUDA 12.4 installed, it is located under:
    /usr/local/cuda-12.4/targets/x86_64-linux/lib/stubs/libcuda.so

  2. Symlink /usr/lib64/libcuda.so to the libcuda.so found above:
    sudo ln -s /usr/local/cuda-12.4/targets/x86_64-linux/lib/stubs/libcuda.so /usr/lib64/libcuda.so

The ld command only searches for libraries under certain locations; since libcuda.so was not in any of them, the link failed. After creating the symlink, vllm builds and runs successfully.
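The two steps above can be sketched as follows; the paths are examples from this comment, and the source of the symlink should be whatever the find command prints on your machine:

```shell
# 1. Locate libcuda.so (driver library or CUDA stub) on the host.
find /usr/local/cuda* /usr/lib* -name 'libcuda.so*' 2>/dev/null

# 2. Symlink it into a directory ld searches by default (source path from step 1).
sudo ln -s /usr/local/cuda-12.4/targets/x86_64-linux/lib/stubs/libcuda.so /usr/lib64/libcuda.so
```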

@tlrmchlsmth
Collaborator

I've put up #12552 to revert #12424.

For those having issues with vLLM 0.7.0, the easiest solution is to add the directory containing libcuda.so to your LD_LIBRARY_PATH environment variable.
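A minimal sketch of that workaround; the stub directory below is an assumption taken from the CUDA 12.4 example earlier in the thread, and the model and parallelism flags are illustrative:

```shell
# Assumed stub location; substitute wherever libcuda.so lives on your system.
export LD_LIBRARY_PATH=/usr/local/cuda-12.4/targets/x86_64-linux/lib/stubs:$LD_LIBRARY_PATH
vllm serve Qwen/Qwen2.5-72B-Instruct --tensor-parallel-size 4
```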

@dhuangnm
Contributor

I tried setting LD_LIBRARY_PATH initially, but it didn't work for me for some reason: the ld command still complained that it could not find -lcuda, and I had to use the symlink.
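As background (a generic sketch, not specific to vLLM): GNU ld resolves -l&lt;name&gt; at link time from -L paths and its built-in search directories, while LD_LIBRARY_PATH is consulted by the dynamic loader at run time, which may explain why exporting it alone does not always silence this linker error. A toy demonstration with a hypothetical libfoo:

```shell
# Toy demo: link-time (-L) vs run-time (LD_LIBRARY_PATH) library search.
mkdir -p /tmp/libdemo && cd /tmp/libdemo
printf 'int answer(void) { return 42; }\n' > foo.c
gcc -shared -fPIC foo.c -o libfoo.so
printf 'int answer(void); int main(void) { return answer() == 42 ? 0 : 1; }\n' > main.c

# GNU ld does not consult LD_LIBRARY_PATH when resolving -lfoo, so this still fails:
LD_LIBRARY_PATH=/tmp/libdemo gcc main.c -lfoo -o app || echo "link failed as expected"

# -L (or a symlink into a default search directory, as above) fixes the link:
gcc main.c -L/tmp/libdemo -lfoo -o app && LD_LIBRARY_PATH=/tmp/libdemo ./app
```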

@gargnipungarg

+1

@mgoin
Member

mgoin commented Jan 30, 2025

This should have been fixed with #12552 so please wait for the next release to include the revert

@stefanobranco

I assume this only happens on 0.7.0 for everyone here, since the reverted change is relatively recent? I'm asking because I've been having this issue ever since 0.6.5, which suggests a different root cause (or a different issue altogether), as also mentioned in #11643.

@gargnipungarg

+1
The issue has been happening since 0.6.5. I also built the latest main changes; that didn't work for me either.

@Playerrrrr
Author

+1
@mgoin @dhuangnm

  1. Tried setting LD_LIBRARY_PATH -> didn't work
  2. Tried the softlink -> also didn't work

@Playerrrrr
Author

Thanks, it started working normally again in v0.7.1.
@mgoin

@gargnipungarg

Did it work for anyone else?
Not for me, even with 0.7.1.

@OswaldoBornemann

+1

@Playerrrrr
Author

It works normally again for me since 0.7.1


9 participants