
Integrate vLLM Evaluator #23

Open
adivekar-utexas opened this issue Feb 15, 2025 · 2 comments
Assignees: adivekar-utexas
Labels: enhancement 🚀 New feature or request

Comments

@adivekar-utexas
Contributor

vLLM is a high-throughput LLM inference engine that runs HuggingFace models, performing various kinds of model sharding across GPUs using a Ray backend.
Even in its basic form, vLLM is a large speedup over AccelerateEvaluator, which is quite slow.

Basic requirements:

  1. Should be compatible with RayEvaluator (and GenerativeLM if needed).
  2. Should support only models that fit on a single node; scaling to larger models should mean using a node with more GPUs (a design choice that favors execution speed).
  3. Should integrate with all HF transformers LLMs.
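
To make these requirements concrete, here is a rough sketch of the surface a vLLMEvaluator could expose (class and method names are hypothetical, not a final design):

import ray
from vllm import LLM
from vllm.sampling_params import SamplingParams

@ray.remote
class VLLMEvaluator:  # hypothetical name; final API TBD
    def __init__(self, model_name: str, tensor_parallel_size: int = 1):
        # Requirement 2: tensor_parallel_size must not exceed the GPUs
        # available on a single node; bigger models need bigger nodes.
        self.llm = LLM(model=model_name, tensor_parallel_size=tensor_parallel_size)

    def evaluate_batch(self, prompts: list, **sampling_kwargs) -> list:
        # Requirement 3: should work for any HF transformers LLM that vLLM supports.
        params = SamplingParams(**sampling_kwargs)
        return self.llm.generate(prompts, sampling_params=params)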
@adivekar-utexas adivekar-utexas self-assigned this Feb 15, 2025
@adivekar-utexas adivekar-utexas added the enhancement 🚀 New feature or request label Feb 15, 2025
@adivekar-utexas
Contributor Author

Initial exploration

Seems like vLLM can run inside a Ray cluster just fine.

Basic working code example

import ray
from vllm import LLM

# num_gpus must match the tensor_parallel_size passed at construction
# (2 in this exploration).
@ray.remote(num_gpus=2, num_cpus=2)
class VLLMActor:
    def __init__(self, model_name: str, tensor_parallel_size: int):
        # Pre-download the model weights, then create a vLLM instance
        # that shards the model across `tensor_parallel_size` GPUs.
        from huggingface_hub import snapshot_download
        snapshot_download(
            model_name,
            token="hf_YOUR_KEY_HERE",  # replace with your HuggingFace token
        )
        self.llm = LLM(
            model=model_name,
            tensor_parallel_size=tensor_parallel_size,
            max_model_len=4000,
        )
        print(f"Loaded model {model_name} across {tensor_parallel_size} GPUs.")

    def generate_text(self, prompt: str, **kwargs) -> list:
        # Generate text using the vLLM instance. Returns a list of
        # vllm.RequestOutput objects; the generated text for each
        # request is at .outputs[0].text.
        return self.llm.generate(prompt, **kwargs)

Usage:

# Create two VLLM actors, each loading DeepSeek R1 Qwen 14B sharded
# across 2 GPUs (4 GPUs total, e.g. one g5.12xlarge).
actors = [
    VLLMActor.remote(
        model_name="deepseek-ai/DeepSeek-R1-Distill-Qwen-14B",
        tensor_parallel_size=2,
    )
    for _ in range(2)
]

from bears.util import accumulate
from vllm.sampling_params import SamplingParams

prompts = [
    "Explain the theory of relativity in simple terms.",
    "Explain the theory of evolution in simple terms.",
]

# Kick off generation on both actors concurrently; each call returns a Ray ObjectRef.
res = actors[0].generate_text.remote(
    prompts[0], sampling_params=SamplingParams(max_tokens=3000, temperature=0.5)
)
res2 = actors[1].generate_text.remote(
    prompts[1], sampling_params=SamplingParams(max_tokens=3000, temperature=0.5)
)

# accumulate() resolves the Ray references; each result is a list of RequestOutput.
print(accumulate(res)[0].outputs[0].text)
print(accumulate(res2)[0].outputs[0].text)

@adivekar-utexas
Contributor Author

Notes on initial exploration:

  • Downloading the model is pretty slow (took an hour using snapshot_download, writing to the local SSD of a g5.12xlarge, not EFS). Can we speed this up somehow? One idea is sketched below.
  • vLLM itself works very smoothly: it ran DeepSeek R1 Qwen 14B across 2 GPUs at about 90% vRAM usage per GPU (~20 GB used), with GPU utilization at ~100%.
  • Token generation throughput was decent (~30 tokens/sec). Since this is a reasoning model, it generated 933 tokens for the first prompt above.
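
One possible download speedup, untested here: huggingface_hub can use the hf_transfer downloader (pip install hf_transfer), which tends to saturate the network interface much better than the default Python downloader:

import os
# Must be set before importing huggingface_hub.
os.environ["HF_HUB_ENABLE_HF_TRANSFER"] = "1"

from huggingface_hub import snapshot_download
snapshot_download(
    "deepseek-ai/DeepSeek-R1-Distill-Qwen-14B",
    token="hf_YOUR_KEY_HERE",  # same placeholder token as above
    max_workers=16,  # download more files in parallel
)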

I think this approach is good enough to use. vLLMEvaluator can be a thin wrapper over vllm, but it will need some adapters for sampling parameters and for returning logprobs; a sketch of the logprobs side follows.
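
A minimal sketch of what the logprobs adapter could look like, assuming vLLM's SamplingParams(logprobs=...) API (the helper name is hypothetical):

from vllm import LLM
from vllm.sampling_params import SamplingParams

def generate_with_logprobs(llm: LLM, prompt: str, top_k: int = 5):
    # Ask vLLM for the top-k logprobs at each generated token position.
    params = SamplingParams(max_tokens=100, temperature=0.5, logprobs=top_k)
    completion = llm.generate(prompt, sampling_params=params)[0].outputs[0]
    # completion.logprobs is a list with one entry per generated token: a dict
    # mapping token id -> Logprob (which carries .logprob and .decoded_token).
    token_logprobs = [
        {lp.decoded_token: lp.logprob for lp in position.values()}
        for position in completion.logprobs
    ]
    return completion.text, token_logprobs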
