[Bug] Can not run vLLM with tensor parallel #1354
Comments
Hi, the issue appears to be due to vLLM's inability to run the Mixtral model internally, rather than an issue with OpenCompass. I suggest creating a minimal reproducible script that excludes OpenCompass components: write a simple Python file that runs this model with vLLM directly and see if it can be loaded successfully.
Thank you for your reply. I created a minimal reproducible script:

```python
from vllm import LLM, SamplingParams

# Sample prompts.
prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]

# Create a sampling params object.
sampling_params = SamplingParams(temperature=0.8, top_p=0.95)

# Create an LLM.
llm = LLM(
    model="mistralai/Mixtral-8x7B-v0.1",
    tensor_parallel_size=8,
    download_dir="/home/data/huggingface",
    gpu_memory_utilization=0.9,
)

# Generate texts from the prompts. The output is a list of RequestOutput objects
# that contain the prompt, generated text, and other information.
outputs = llm.generate(prompts, sampling_params)

# Print the outputs.
for output in outputs:
    prompt = output.prompt
    generated_text = output.outputs[0].text
    print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")
```

and ran it from the command line.
After trying more models, it appears that this issue is related to tensor parallelism, which shows up when adjusting the configuration file.
Thank you for reporting the issue. To resolve this, try modifying the tensor parallel parameter in the configuration file.
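As a minimal sketch of what that adjustment might look like, assuming the model is defined through OpenCompass's `VLLM` wrapper and that `tensor_parallel_size` is forwarded via `model_kwargs` (the exact file name, field names, and values of the shipped `vllm_mixtral_8x7b_v0_1` config may differ):

```python
# Hypothetical OpenCompass model config sketch; field layout follows the
# usual OpenCompass convention but is not copied from the actual repo config.
from opencompass.models import VLLM

models = [
    dict(
        type=VLLM,
        abbr='mixtral-8x7b-v0.1-vllm',
        path='mistralai/Mixtral-8x7B-v0.1',
        # tensor_parallel_size should match the number of visible GPUs,
        # e.g. 2 for CUDA_VISIBLE_DEVICES=4,5.
        model_kwargs=dict(tensor_parallel_size=2),
        max_out_len=100,
        max_seq_len=2048,
        batch_size=32,
        run_cfg=dict(num_gpus=2, num_procs=1),
    )
]
```

The key point is that the tensor parallel degree in the config should match the number of GPUs actually exposed to the run.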
After modifying the configuration file, the issue remains. Is there a specific environment version that can successfully run with tensor parallel? Are there any vLLM, torch, or OpenCompass version requirements?
After some searching, this appears to be caused by a behavior change in vLLM since vllm-0.5.1, as mentioned here: vllm-project/vllm#5669 (comment).
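As a hedged illustration only (this is an assumption, not something confirmed by the linked comment): one behavior that changed around vLLM 0.5.x involves how tensor parallel worker processes are launched, and in such cases a common workaround is to pin the worker start method and the distributed executor backend explicitly. Both knobs below exist in recent vLLM releases, but whether they address this particular issue is an assumption.

```python
# Hypothetical workaround sketch, assuming the regression is related to the
# default distributed executor / worker start behavior in vLLM >= 0.5.1.
import os

# Spawn worker processes instead of forking them, which avoids re-initializing
# CUDA inside a forked child process.
os.environ["VLLM_WORKER_MULTIPROC_METHOD"] = "spawn"

from vllm import LLM

llm = LLM(
    model="mistralai/Mixtral-8x7B-v0.1",
    tensor_parallel_size=2,
    # Select the executor backend explicitly instead of relying on the
    # default ("ray" requires `pip install ray`).
    distributed_executor_backend="ray",
)
```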
Prerequisites
Problem type
I am evaluating with officially supported tasks/models/datasets.
Environment
Reproducing the problem - code/configuration example
Just the built-in `run.py` file.
Reproducing the problem - command or script
```bash
CUDA_VISIBLE_DEVICES=4,5 python run.py --models vllm_mixtral_8x7b_v0_1 --datasets mmlu_gen -m infer --max-num-workers 1 --debug
```
Reproducing the problem - error message
Other information
No response