
Support distributed ray for vllm #453

Open
wants to merge 7 commits into base: main
Conversation

JingofXin
Contributor


return {"decode_unicode": False, "delimiter": b"\0"}
if Vllm.vllm_version is None:
Vllm.vllm_version = importlib.import_module('vllm').__version__
if Vllm.vllm_version <= "0.5.0":
Contributor

This check is incorrect: versions are compared as strings, so for example '0.5.0' < '0.11.0' evaluates to False.
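The string-comparison pitfall the reviewer points out can be avoided by comparing parsed integer tuples. A minimal sketch (the `parse_version` helper is illustrative, not part of vllm or lazyllm; a robust solution would use `packaging.version.Version`, which also handles suffixes like `.post1`):

```python
# Lexicographic comparison of version strings is unreliable:
# "0.5.0" <= "0.11.0" is False because "5" > "1" character-wise.
def parse_version(version: str) -> tuple:
    # Illustrative helper: split into integer components for correct ordering.
    return tuple(int(part) for part in version.split('.'))

# String comparison gives the wrong ordering:
assert ("0.5.0" <= "0.11.0") is False
# Tuple comparison orders versions correctly:
assert parse_version("0.5.0") <= parse_version("0.11.0")
```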

lazyllm.config.add('num_gpus_per_node', int, 8, 'NUM_GPUS_PER_NODE')

def reallocate_launcher(launcher):
if not isinstance(launcher, (launchers.ScoLauncher, launchers.SlurmLauncher, launchers.RemoteLauncher)):
Contributor

RemoteLauncher is never instantiated at the moment.

f"limit{(lazyllm.config['num_gpus_per_node'])}. Please check the actual "
'number of GPUs in a single node and set the environment variable: LAZYLLM_NUM_GPUS_PER_NODE. '
'Now LazyLLM will reconfigure the number of nodes and GPUs')
nnode = nnode if nnode > 0 else 1 # avoid 0
Contributor

Use an assert for this check here instead.
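The assert-based check the reviewer suggests, replacing the silent `nnode = nnode if nnode > 0 else 1` fallback with a fail-fast assertion, might look like this minimal sketch (`check_nnode` is a hypothetical helper name):

```python
def check_nnode(nnode: int) -> int:
    # Fail fast on an invalid node count instead of silently coercing it to 1.
    assert nnode > 0, f'nnode must be positive, got {nnode}'
    return nnode
```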

if not isinstance(launcher, (launchers.ScoLauncher, launchers.SlurmLauncher, launchers.RemoteLauncher)):
return [], launcher
nnode = launcher.nnode
ngpus = launcher.ngpus
Contributor

In the multi-node case, is ngpus defined as the total number of GPUs, or the number per node?
If it is per node, shouldn't it be named ngpus_per_node, and shouldn't exceeding the limit raise an error directly, instead of recalculating and adding nodes?
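A sketch of the reviewer's suggestion, assuming the GPU count is renamed to a per-node value and out-of-range values fail fast rather than triggering node reallocation (all names here are illustrative, not the PR's actual API):

```python
# Hypothetical validation: treat the launcher's GPU count as per-node
# (ngpus_per_node) and raise when it exceeds the single-node limit,
# instead of silently recomputing node counts.
def validate_ngpus_per_node(ngpus_per_node: int, limit: int = 8) -> None:
    if ngpus_per_node > limit:
        raise ValueError(
            f'ngpus_per_node={ngpus_per_node} exceeds the per-node limit {limit}; '
            'set LAZYLLM_NUM_GPUS_PER_NODE to the actual GPU count per node')
```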

master_ip = ''
for launcher in self.launcher_list:
m = Distributed(launcher=launcher, master_ip=master_ip)
m()
Contributor

Starting the tasks already while the cmd is being computed is a problem; they should all be deferred and started together when the inference task starts.
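The deferred-start pattern the reviewer asks for, record launch work at build time but run nothing until inference startup, can be sketched generically, independent of LazyLLM's actual API (class and method names are illustrative):

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical sketch: record launch callables when commands are built,
# but execute nothing until deploy() is called, so all workers start together.
class DeferredLaunch:
    def __init__(self):
        self._jobs = []

    def add(self, fn, *args):
        # Build time: only record the callable, do not execute it.
        self._jobs.append((fn, args))

    def deploy(self):
        # Inference time: start every recorded job concurrently.
        with ThreadPoolExecutor() as pool:
            futures = [pool.submit(fn, *args) for fn, args in self._jobs]
            return [f.result() for f in futures]
```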

@@ -53,7 +53,13 @@ def __init__(self, trust_remote_code=True, launcher=launchers.remote(ngpus=1), s
self.temp_folder = make_log_dir(log_path, 'vllm') if log_path else None
if self.launcher_list:
ray_launcher = [Distributed(launcher=launcher) for launcher in self.launcher_list]
self._prepare_deploy = pipeline(*ray_launcher)
with lazyllm.pipeline() as ppl:
Contributor

You could try post_action for this case.

parall_launcher = [lazyllm.pipeline(sleep_moment, launcher) for launcher in ray_launcher[1:]]
self._prepare_deploy = lazyllm.pipeline(ray_launcher[0], post_action=(lazyllm.parallel(*parall_launcher) if len(parall_launcher) else None))

lwj-st added a commit to LazyAGI/LazyLLM-Env that referenced this pull request Mar 6, 2025
lwj-st added a commit to LazyAGI/LazyLLM-Env that referenced this pull request Mar 12, 2025
lwj-st added a commit to LazyAGI/LazyLLM-Env that referenced this pull request Mar 12, 2025
3 participants