This repository contains a Helm chart for deploying Large Language Models (LLMs) on Kubernetes. It is developed primarily for use as a pre-packaged application within Azimuth but is structured such that it can, in principle, be deployed on any Kubernetes cluster with at least 1 GPU node.
This app is provided as part of a standard Azimuth deployment, so no specific steps are required to use it other than access to an up-to-date Azimuth deployment.
Alternatively, to set up the Helm repository and manually install this chart on an existing Kubernetes cluster, run
```sh
helm repo add <chosen-repo-name> https://stackhpc.github.io/azimuth-llm/
helm repo update
helm install <installation-name> <chosen-repo-name>/azimuth-llm --version <version>
```
where `<version>` is the full name of the published version for the specified commit (e.g. `0.1.0-dev.0.main.125`). To see the latest published version, see this page.
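For instance, with the placeholders filled in (the repository and release names below are arbitrary examples, and the version is the one quoted above), the install might look like:

```sh
# Example invocation - repo/release names are illustrative only
helm repo add azimuth-llm https://stackhpc.github.io/azimuth-llm/
helm repo update
helm install llm azimuth-llm/azimuth-llm --version 0.1.0-dev.0.main.125
```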
The `chart/values.yaml` file documents the various customisation options which are available. To access the LLM from outside the Kubernetes cluster, the API and/or UI service types may be changed to
```yaml
api:
  service:
    type: LoadBalancer
    zenith:
      enabled: false
ui:
  service:
    type: LoadBalancer
    zenith:
      enabled: false
```
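Overrides like these could, for example, be placed in a local values file and applied with a standard `helm upgrade` (the file name here is arbitrary):

```sh
# Apply the service type overrides from a local values file,
# keeping any values set at install time
helm upgrade <installation-name> <chosen-repo-name>/azimuth-llm \
  --reuse-values -f overrides.yaml
```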
> [!WARNING]
> Exposing the services in this way provides no authentication mechanism and anyone with access to the load balancer IPs will be able to query the language model. It is up to you to secure the running service as appropriate for your use case. In contrast, when deployed via Azimuth, authentication is provided via the standard Azimuth Identity Provider mechanisms and the authenticated services are exposed via Zenith.
Both the web-based interface and the backend OpenAI-compatible vLLM API server can also optionally be exposed using Kubernetes Ingress. See the `ingress` section in `values.yaml` for the available config options.
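As a rough sketch only, an ingress override might look something like the following; the key names below are illustrative assumptions rather than the chart's actual schema, so consult the `ingress` section of `chart/values.yaml` for the real option names:

```yaml
# Hypothetical example - key names are assumptions; check the
# ingress section of chart/values.yaml for the actual schema
ingress:
  host: llm.example.com
  annotations:
    cert-manager.io/cluster-issuer: letsencrypt
```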
The application uses vLLM for model serving, so any of the vLLM-supported models should work. Since vLLM pulls the model files directly from HuggingFace, it is likely that some other models will also be compatible with vLLM, but mileage may vary between models and model architectures. If a model is incompatible with vLLM then the API pod will likely enter a `CrashLoopBackOff` state and any relevant error information will be found in the API pod logs. These logs can be viewed with

```sh
kubectl (-n <helm-release-namespace>) logs deploy/<helm-release-name>-api
```
If you suspect that a given error is caused by a problem with this Helm chart rather than by upstream vLLM's model support, please open an issue.
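For further diagnosis, the standard Kubernetes tooling applies; for example:

```sh
# Check the status of the API pod (look for CrashLoopBackOff or OOMKilled)
kubectl -n <helm-release-namespace> get pods

# Inspect recent events such as failed probes or scheduling problems
kubectl -n <helm-release-namespace> describe pod <api-pod-name>
```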
The LLM chart integrates with kube-prometheus-stack by creating a `ServiceMonitor` resource and installing two custom Grafana dashboards as Kubernetes `ConfigMap`s. If the target cluster has an existing kube-prometheus-stack deployment which is appropriately configured to watch all namespaces for new Grafana dashboards, the LLM dashboards will automatically appear in Grafana's dashboard list.
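To confirm that these monitoring resources were created in a given release namespace, standard kubectl can be used (exact resource names will vary by release):

```sh
kubectl -n <helm-release-namespace> get servicemonitors,configmaps
```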
To disable the monitoring integrations, set the `api.monitoring.enabled` value to `false`.
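For instance, this can be done at upgrade time with a `--set` flag:

```sh
# Disable the ServiceMonitor and Grafana dashboard resources
helm upgrade <installation-name> <chosen-repo-name>/azimuth-llm \
  --reuse-values --set api.monitoring.enabled=false
```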
The Helm chart consists of the following components:
- A backend web API which runs vLLM's OpenAI-compatible web server.
- A choice of frontend web apps built using Gradio (see web-apps). Each web interface is available as a pre-built container image hosted on ghcr.io and can be configured for each Helm release by changing the `ui.image` section of the chart values (see the sketch below).
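As a sketch, selecting a different frontend image might look like the following; the field names under `ui.image` and the image reference are assumptions for illustration, so check the `ui.image` section of `chart/values.yaml` for the actual key names:

```yaml
# Hypothetical override - field names and image reference are
# assumptions; check the ui.image section of chart/values.yaml
ui:
  image:
    repository: ghcr.io/<org>/<web-app-image>
    tag: <image-tag>
```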