Self-host Pixtral 12B 2409 with vLLM and BentoML

This is a BentoML example project, showing you how to serve and deploy Pixtral 12B 2409 using vLLM, a high-throughput and memory-efficient inference engine.

See here for a full list of BentoML example projects.

💡 This example serves as a basis for advanced code customization, such as a custom model, inference logic, or vLLM options. For simple LLM hosting with an OpenAI-compatible endpoint and no code to write, see OpenLLM.

Prerequisites

  • You have gained access to mistralai/Pixtral-12B-2409 on Hugging Face.
  • If you want to test the Service locally, we recommend you use an Nvidia GPU with at least 16 GB of VRAM.

Install dependencies

git clone https://github.com/bentoml/BentoVLLM.git
cd BentoVLLM/pixtral-12b-2409

# Python 3.11 is recommended

pip install -r requirements.txt

export HF_TOKEN=<your-api-key>
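
If you want to confirm that your token can reach the gated repository before serving, a quick check like the one below works. This is a minimal sketch, assuming huggingface_hub is available (vLLM pulls it in transitively); it is not part of the example project itself.

import os

from huggingface_hub import HfApi

# Raises an error (e.g. a gated-repo error) if the token has not been granted access.
HfApi(token=os.environ["HF_TOKEN"]).model_info("mistralai/Pixtral-12B-2409")
print("Access to mistralai/Pixtral-12B-2409 confirmed")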

Run the BentoML Service

We have defined a BentoML Service in service.py. To run the Service, use the following command:

$ bentoml serve service.py:VLLM

The server is now active at http://localhost:3000. You can interact with it using the Swagger UI or in other ways.
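
For orientation, the sketch below shows the general shape of a BentoML Service that streams text from vLLM's AsyncLLMEngine. It is illustrative only: engine arguments, the OpenAI-compatible routes, and the /sights image endpoint are simplified or omitted, so refer to service.py in the repository for the actual implementation.

import uuid
from typing import AsyncGenerator

import bentoml

MODEL_ID = "mistralai/Pixtral-12B-2409"


@bentoml.service(resources={"gpu": 1}, traffic={"timeout": 300})
class VLLM:
    def __init__(self) -> None:
        from vllm import AsyncEngineArgs, AsyncLLMEngine

        # Illustrative engine setup; the real service.py tunes these options
        # (e.g. max_model_len from MAX_MODEL_LEN, the tokenizer mode for Pixtral).
        self.engine = AsyncLLMEngine.from_engine_args(AsyncEngineArgs(model=MODEL_ID))

    @bentoml.api
    async def generate(
        self, prompt: str = "Who are you? Please respond in pirate speak!"
    ) -> AsyncGenerator[str, None]:
        from vllm import SamplingParams

        # Stream only the newly generated text back to the caller.
        stream = self.engine.generate(prompt, SamplingParams(max_tokens=1024), uuid.uuid4().hex)
        cursor = 0
        async for request_output in stream:
            text = request_output.outputs[0].text
            yield text[cursor:]
            cursor = len(text)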

Note

This example ships with a default max_model_len=32768. If you wish to change this value, set MAX_MODEL_LEN=<target_context_len> and make sure you have enough VRAM for that context length. By default, BentoVLLM only sets a conservative value based on this model's configuration.

OpenAI-compatible endpoints
from openai import OpenAI

client = OpenAI(base_url='http://localhost:3000/v1', api_key='na')

# List the available models
client.models.list()

chat_completion = client.chat.completions.create(
    model="mistralai/Pixtral-12B-2409",
    messages=[
        {
            "role": "user",
            "content": "Who are you? Please respond in pirate speak!"
        }
    ],
    stream=True,
)
for chunk in chat_completion:
    # Extract and print the content of the model's reply
    print(chunk.choices[0].delta.content or "", end="")

These OpenAI-compatible endpoints also support vLLM extra parameters. For example, you can force the chat completion to output a JSON object by using the guided_json parameter:

from openai import OpenAI

client = OpenAI(base_url='http://localhost:3000/v1', api_key='na')

# List the available models
client.models.list()

json_schema = {
    "type": "object",
    "properties": {
        "city": {"type": "string"}
    }
}

chat_completion = client.chat.completions.create(
    model="mistralai/Pixtral-12B-2409",
    messages=[
        {
            "role": "user",
            "content": "What is the capital of France?"
        }
    ],
    extra_body=dict(guided_json=json_schema),
)
print(chat_completion.choices[0].message.content)  # will return something like: {"city": "Paris"}

All supported extra parameters are listed in the vLLM documentation.
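
For instance, guided_choice constrains the output to one of a fixed set of strings. The snippet below is a hedged variation of the example above; whether a given parameter is available depends on your vLLM version.

chat_completion = client.chat.completions.create(
    model="mistralai/Pixtral-12B-2409",
    messages=[
        {
            "role": "user",
            "content": "Is the sentiment of 'I love this!' positive or negative?"
        }
    ],
    extra_body=dict(guided_choice=["positive", "negative"]),
)
print(chat_completion.choices[0].message.content)  # e.g. "positive"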

Note: If your Service is deployed with protected endpoints on BentoCloud, you need to set the environment variable OPENAI_API_KEY to your BentoCloud API key first.

export OPENAI_API_KEY={YOUR_BENTOCLOUD_API_TOKEN}

You can then use the following line to replace the client in the above code snippet. Refer to Obtain the endpoint URL to retrieve the endpoint URL.

client = OpenAI(base_url='your_bentocloud_deployment_endpoint_url/v1')
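
Equivalently, you can pass the key explicitly instead of relying on the environment variable. The base URL below is a placeholder for your own Deployment endpoint.

import os

from openai import OpenAI

client = OpenAI(
    base_url="your_bentocloud_deployment_endpoint_url/v1",
    api_key=os.environ["OPENAI_API_KEY"],  # your BentoCloud API token
)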
cURL
curl -X 'POST' \
  'http://localhost:3000/generate' \
  -H 'accept: text/event-stream' \
  -H 'Content-Type: application/json' \
  -d '{
  "prompt": "Who are you? Please respond in pirate speak!"
}'

Pixtral is also a vision language model, so the Service exposes a /sights endpoint for image inputs:

curl -X 'POST' \
  'http://localhost:3000/sights' \
  -H 'accept: text/event-stream' \
  -H 'Content-Type: multipart/form-data' \
  -F 'prompt=Describe this image' \
  -F 'image=@file.jpeg;type=image/jpeg'
Python SDK
import bentoml

with bentoml.SyncHTTPClient("http://localhost:3000") as client:
    response_generator = client.generate(
        prompt="Who are you? Please respond in pirate speak!",
    )
    for response in response_generator:
        print(response, end='')

The /sights vision endpoint is also available through the Python client:

import bentoml

with bentoml.SyncHTTPClient("http://localhost:3000") as client:
    response_generator = client.sights(
        prompt="Describe this image",
        image="./file.jpeg",
    )
    for response in response_generator:
        print(response, end='')

For detailed explanations of the Service code, see vLLM inference.

Deploy to BentoCloud

After the Service is ready, you can deploy the application to BentoCloud for better management and scalability. Sign up if you haven't got a BentoCloud account.

Make sure you have logged in to BentoCloud.

bentoml cloud login

Create a BentoCloud secret to store the required environment variable and reference it for deployment.

bentoml secret create huggingface HF_TOKEN=$HF_TOKEN

bentoml deploy service:VLLM --secret huggingface

Once the application is up and running on BentoCloud, you can access it via the exposed URL.

Note: For custom deployment in your own infrastructure, use BentoML to generate an OCI-compliant image.