Support distributed ray for vllm #453
base: main
Conversation
JingofXin commented Mar 3, 2025
lazyllm/components/deploy/vllm.py
Outdated
return {"decode_unicode": False, "delimiter": b"\0"}
if Vllm.vllm_version is None:
    Vllm.vllm_version = importlib.import_module('vllm').__version__
if Vllm.vllm_version <= "0.5.0":
This check is incorrect: it is a string comparison, so for example '0.5.0' < '0.11.0' evaluates to False.
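A minimal sketch of the problem and a possible fix, assuming the `packaging` package is available (the literals below are illustrative, not taken from the diff):

```python
from packaging.version import Version

# Lexicographic string comparison gets version ordering wrong:
print('0.5.0' < '0.11.0')                    # False, because '5' > '1' as characters

# Parsing the versions first gives the expected ordering:
print(Version('0.5.0') < Version('0.11.0'))  # True
```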
lazyllm/components/deploy/ray.py
Outdated
lazyllm.config.add('num_gpus_per_node', int, 8, 'NUM_GPUS_PER_NODE')


def reallocate_launcher(launcher):
    if not isinstance(launcher, (launchers.ScoLauncher, launchers.SlurmLauncher, launchers.RemoteLauncher)):
RemoteLauncher currently has no instances.
lazyllm/components/deploy/ray.py
Outdated
f"limit{(lazyllm.config['num_gpus_per_node'])}. Please check the actual " | ||
'number of GPUs in a single node and set the environment variable: LAZYLLM_NUM_GPUS_PER_NODE. ' | ||
'Now LazyLLM will reconfigure the number of nodes and GPUs') | ||
nnode = nnode if nnode > 0 else 1 # avoid 0 |
Just use an assert for this check here.
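A minimal sketch of the assert-based check being suggested; check_nnode is a hypothetical helper and the message text is an assumption, not anything from the PR:

```python
def check_nnode(nnode: int) -> int:
    # Fail fast on an invalid node count instead of silently coercing it to 1.
    assert nnode > 0, f'invalid node count: {nnode}'
    return nnode

print(check_nnode(2))   # 2; check_nnode(0) would raise AssertionError
```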
if not isinstance(launcher, (launchers.ScoLauncher, launchers.SlurmLauncher, launchers.RemoteLauncher)):
    return [], launcher
nnode = launcher.nnode
ngpus = launcher.ngpus
In the multi-node case, is ngpus defined as the total number of GPUs, or the number per node?
If it is per machine, shouldn't it be named ngpus_per_node, and shouldn't exceeding the limit raise an error directly instead of recalculating and adding nodes for the user?
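A hypothetical sketch of that suggestion: treat the parameter as per-node (ngpus_per_node) and fail immediately when it exceeds the single-node limit instead of reallocating nodes behind the user's back. validate_gpu_request and its message are illustrative, not LazyLLM APIs:

```python
def validate_gpu_request(ngpus_per_node: int, num_gpus_per_node: int = 8) -> None:
    # Reject over-subscription explicitly rather than adding nodes automatically.
    if ngpus_per_node > num_gpus_per_node:
        raise ValueError(
            f'ngpus_per_node={ngpus_per_node} exceeds the {num_gpus_per_node} GPUs available '
            'on a single node; set LAZYLLM_NUM_GPUS_PER_NODE or request fewer GPUs per node.')

validate_gpu_request(4)     # ok
# validate_gpu_request(16)  # raises ValueError
```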
lazyllm/components/deploy/vllm.py
Outdated
master_ip = ''
for launcher in self.launcher_list:
    m = Distributed(launcher=launcher, master_ip=master_ip)
    m()
Starting the jobs here, while the cmd is still being computed, is problematic; they should instead be started together when the inference job is launched.
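A generic, hypothetical illustration of that deferred-start idea; none of the names below are LazyLLM APIs. The point is to collect the per-node steps while the command is being composed and only run them when the inference job itself starts:

```python
from typing import Callable, List

def prepare_ray_workers(nodes: List[str]) -> List[Callable[[], None]]:
    # Return callables instead of launching anything at cmd-construction time.
    return [lambda node=node: print(f'starting ray worker on {node}') for node in nodes]

def launch_inference(nodes: List[str]) -> None:
    steps = prepare_ray_workers(nodes)  # nothing has started yet
    for step in steps:                  # workers start together with the inference job
        step()
    print('starting vllm inference server')

launch_inference(['node-0', 'node-1'])
```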
lazyllm/components/deploy/vllm.py
Outdated
@@ -53,7 +53,13 @@ def __init__(self, trust_remote_code=True, launcher=launchers.remote(ngpus=1), s
        self.temp_folder = make_log_dir(log_path, 'vllm') if log_path else None
        if self.launcher_list:
            ray_launcher = [Distributed(launcher=launcher) for launcher in self.launcher_list]
            self._prepare_deploy = pipeline(*ray_launcher)
            with lazyllm.pipeline() as ppl:
For this case, you could try post_action:
parall_launcher = [lazyllm.pipeline(sleep_moment, launcher) for launcher in ray_launcher[1:]]
self._prepare_deploy = lazyllm.pipeline(ray_launcher[0], post_action=(lazyllm.parallel(*parall_launcher) if len(parall_launcher) else None))
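In this sketch, the first ray launcher forms the pipeline body and the remaining launchers are wrapped in a parallel that runs as the pipeline's post_action, with sleep_moment presumably giving the head node a moment to come up before the other workers join; this reading of post_action is an assumption, not something confirmed in the thread.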