Production Metrics
==================

vLLM exposes a number of metrics that can be used to monitor the health of the
system. These metrics are exposed via the ``/metrics`` endpoint on the vLLM
OpenAI compatible API server.

You can start the server using Python, or using :doc:`Docker <deploying_with_docker>`:

.. code-block:: console

    $ vllm serve unsloth/Llama-3.2-1B-Instruct

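Alternatively, a rough sketch of the Docker route is shown below; the ``vllm/vllm-openai`` image name, GPU flag, and cache mount here are assumptions that may need adjusting for your environment, so see the Docker deployment guide for the full invocation:

.. code-block:: console

    $ docker run --gpus all \
          -v ~/.cache/huggingface:/root/.cache/huggingface \
          -p 8000:8000 \
          vllm/vllm-openai:latest \
          --model unsloth/Llama-3.2-1B-Instruct
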
Then query the endpoint to get the latest metrics from the server:

.. code-block:: console

    $ curl http://0.0.0.0:8000/metrics

    # HELP vllm:iteration_tokens_total Histogram of number of tokens per engine_step.
    # TYPE vllm:iteration_tokens_total histogram
    vllm:iteration_tokens_total_sum{model_name="unsloth/Llama-3.2-1B-Instruct"} 0.0
    vllm:iteration_tokens_total_bucket{le="1.0",model_name="unsloth/Llama-3.2-1B-Instruct"} 3.0
    vllm:iteration_tokens_total_bucket{le="8.0",model_name="unsloth/Llama-3.2-1B-Instruct"} 3.0
    vllm:iteration_tokens_total_bucket{le="16.0",model_name="unsloth/Llama-3.2-1B-Instruct"} 3.0
    vllm:iteration_tokens_total_bucket{le="32.0",model_name="unsloth/Llama-3.2-1B-Instruct"} 3.0
    vllm:iteration_tokens_total_bucket{le="64.0",model_name="unsloth/Llama-3.2-1B-Instruct"} 3.0
    vllm:iteration_tokens_total_bucket{le="128.0",model_name="unsloth/Llama-3.2-1B-Instruct"} 3.0
    vllm:iteration_tokens_total_bucket{le="256.0",model_name="unsloth/Llama-3.2-1B-Instruct"} 3.0
    vllm:iteration_tokens_total_bucket{le="512.0",model_name="unsloth/Llama-3.2-1B-Instruct"} 3.0
    ...

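The output uses the standard Prometheus text exposition format, so any Prometheus-compatible scraper or client library can consume it. As a minimal sketch, assuming the ``requests`` and ``prometheus_client`` packages are installed and the server is running on the default host and port used above, the metrics can also be read programmatically:

.. code-block:: python

    import requests
    from prometheus_client.parser import text_string_to_metric_families

    # Fetch the raw exposition text from the vLLM server.
    response = requests.get("http://0.0.0.0:8000/metrics")
    response.raise_for_status()

    # Parse the text into metric families and print every sample.
    for family in text_string_to_metric_families(response.text):
        for sample in family.samples:
            print(sample.name, sample.labels, sample.value)

In production you would more typically point a Prometheus server or another scraper at the same endpoint rather than polling it by hand.
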
The following metrics are exposed:

.. literalinclude:: ../../../vllm/engine/metrics.py