From b3eced56b4d9d5d3b8597aa506a0bcf954d291bc Mon Sep 17 00:00:00 2001
From: sekyondaMeta <127536312+sekyondaMeta@users.noreply.github.com>
Date: Fri, 15 Sep 2023 15:06:37 -0400
Subject: [PATCH] Metrics Docs Updates (#2560)

* Metrics Docs Updates

Updates to metrics docs to make them easier to follow

* Update metrics.md

* Update README.md

Something weird with this link. It is possible the site itself rejects the lint check

* Update index.rst

* Metric docs updates

Metrics docs updates based off comments.

* Metrics Updates

Updates per Naman's comments

* Update metrics.md

---------

Co-authored-by: Ankith Gunapal
---
 docs/index.rst           |   8 ++
 docs/metrics.md          | 269 +++++++++++++++++++++------------------
 docs/metrics_api.md      |   7 +-
 kubernetes/AKS/README.md |   2 +-
 4 files changed, 159 insertions(+), 127 deletions(-)

diff --git a/docs/index.rst b/docs/index.rst
index d8ee4ee63c..f16037417e 100644
--- a/docs/index.rst
+++ b/docs/index.rst
@@ -56,6 +56,14 @@ What's going on in TorchServe?
    :link: performance_guide.html
    :tags: Performance,Troubleshooting
 
+.. customcarditem::
+   :header: Metrics
+   :card_description: Collecting and viewing TorchServe metrics
+   :image: https://user-images.githubusercontent.com/5276346/234725829-7f60e0d8-c76d-4019-ac8f-7d60069c4e58.png
+   :link: metrics.html
+   :tags: Metrics,Performance,Troubleshooting
+
+
 .. customcarditem::
    :header: Large Model Inference
    :card_description: Serving Large Models with TorchServe
diff --git a/docs/metrics.md b/docs/metrics.md
index 48b2065feb..5ae60b5cbd 100644
--- a/docs/metrics.md
+++ b/docs/metrics.md
@@ -3,35 +3,145 @@
 ## Contents of this document
 
 * [Introduction](#introduction)
-* [System metrics](#system-metrics)
-* [Formatting](#formatting)
+* [Getting Started](#getting-started-with-torchserve-metrics)
 * [Metric Types](#metric-types)
-* [Central metrics yaml file definition](#central-metrics-yaml-file-definition)
+* [Metrics Formatting](#metrics-formatting)
 * [Custom Metrics API](#custom-metrics-api)
 * [Logging custom metrics](#log-custom-metrics)
 * [Metrics YAML Parsing and Metrics API example](#Metrics-YAML-File-Parsing-and-Metrics-API-Custom-Handler-Example)
 * [Backwards compatibility warnings and upgrade guide](#backwards-compatibility-warnings-and-upgrade-guide)
 
+
 ## Introduction
 
 Torchserve metrics can be broadly classified into frontend and backend metrics.
+
 Frontend metrics include system level metrics. The host resource utilization frontend metrics are collected at regular intervals(default: every minute).
-Torchserve provides an API to collect custom backend metrics. Metrics defined by a custom service or handler code can be collected per request or per a batch of requests.
-Two metric modes are supported, i.e `log` and `prometheus`. The default mode is `log`.
-Metrics mode can be configured using the `metrics_mode` configuration option in `config.properties` or `TS_METRICS_MODE` environment variable.
+
+Torchserve provides [an API](#custom-metrics-api) to collect custom backend metrics. Metrics defined by a custom service or handler code can be collected per request or per batch of requests.
+
+Two metric modes are supported, i.e. `log` and `prometheus`, with the default mode being `log`.
+The metrics mode can be configured using the `metrics_mode` configuration option in `config.properties` or the `TS_METRICS_MODE` environment variable.
 For further details on `config.properties` and environment variable based configuration, refer [Torchserve config](configuration.md) docs.
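+
+For example, to make metrics available in the Prometheus format instead of log files, one would set the following in `config.properties` (a minimal sketch; only the relevant key is shown):
+
+```properties
+# Select how metrics are surfaced: "log" (default) or "prometheus"
+metrics_mode=prometheus
+```
+
+Alternatively, the same can be done via the environment: `export TS_METRICS_MODE=prometheus`.
+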
-In `log` mode, Metrics are logged and can be aggregated by metric agents.
+**Log Mode**
+
+In `log` mode, metrics are logged and can be aggregated by metric agents.
 Metrics are collected by default at the following locations in `log` mode:
 
 * Frontend metrics - `log_directory/ts_metrics.log`
-* Backend metrics - `log directory/model_metrics.log`
+* Backend metrics - `log_directory/model_metrics.log`
 
 The location of log files and metric files can be configured in the [log4j2.xml](https://github.com/pytorch/serve/blob/master/frontend/server/src/main/resources/log4j2.xml) file
 
-In `prometheus` mode, all metrics are made available in prometheus format via the [metrics API endpoint](https://github.com/pytorch/serve/blob/master/docs/metrics_api.md).
+**Prometheus Mode**
+
+In `prometheus` mode, metrics defined in the metrics configuration file are made available in Prometheus format via the [metrics API endpoint](metrics_api.md).
+
+## Getting Started with TorchServe Metrics
+
+TorchServe defines metrics in a [yaml](https://github.com/pytorch/serve/blob/master/ts/configs/metrics.yaml) file, including both frontend metrics (i.e. `ts_metrics`) and backend metrics (i.e. `model_metrics`).
+When TorchServe is started, the metrics definition is loaded in the frontend and backend caches separately.
+The backend emits metric logs as they are updated. The frontend parses these logs and makes the corresponding metrics available either as logs or via the [metrics API endpoint](metrics_api.md) based on the `metrics_mode` configuration.
+
+Dynamic updates to the metrics configuration file are currently not supported. To pick up changes made to the metrics configuration file, Torchserve needs to be restarted.
+
+The `metrics.yaml` is formatted with Prometheus metric type terminology:
+
+```yaml
+dimensions: # dimension aliases
+  - &model_name "ModelName"
+  - &level "Level"
+
+ts_metrics: # frontend metrics
+  counter: # metric type
+    - name: NameOfCounterMetric # name of metric
+      unit: ms # unit of metric
+      dimensions: [*model_name, *level] # dimension names of metric (referenced from the above dimensions dict)
+  gauge:
+    - name: NameOfGaugeMetric
+      unit: ms
+      dimensions: [*model_name, *level]
+  histogram:
+    - name: NameOfHistogramMetric
+      unit: ms
+      dimensions: [*model_name, *level]
+
+model_metrics: # backend metrics
+  counter: # metric type
+    - name: InferenceTimeInMS # name of metric
+      unit: ms # unit of metric
+      dimensions: [*model_name, *level] # dimension names of metric (referenced from the above dimensions dict)
+    - name: NumberOfMetrics
+      unit: count
+      dimensions: [*model_name]
+  gauge:
+    - name: GaugeModelMetricNameExample
+      unit: ms
+      dimensions: [*model_name, *level]
+  histogram:
+    - name: HistogramModelMetricNameExample
+      unit: ms
+      dimensions: [*model_name, *level]
+```
 
-## Frontend Metrics
+
+Note that **only** the metrics defined in the **metrics configuration file** can be emitted to `model_metrics.log` or made available via the [metrics API endpoint](metrics_api.md). This is done to ensure that the metrics configuration file serves as a central inventory of all the metrics that Torchserve can emit.
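+
+As an illustration, a custom configuration can be sanity-checked before starting TorchServe. The snippet below is a standalone sketch (not part of TorchServe, and it assumes the PyYAML package) that verifies every metric only references declared dimension aliases:
+
+```python
+import yaml
+
+with open("metrics.yaml") as f:
+    config = yaml.safe_load(f)
+
+# YAML anchors such as &model_name resolve to plain strings on load
+declared = set(config.get("dimensions") or [])
+for section in ("ts_metrics", "model_metrics"):
+    for metric_type, metric_list in (config.get(section) or {}).items():
+        for metric in metric_list or []:
+            undeclared = set(metric.get("dimensions") or []) - declared
+            assert not undeclared, f"{metric['name']}: undeclared dimensions {undeclared}"
+```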
+
+Default metrics are provided in the [metrics.yaml](https://github.com/pytorch/serve/blob/master/ts/configs/metrics.yaml) file, but the user can delete or ignore them altogether, since these metrics will not be emitted unless they are updated.\
+When adding custom `model_metrics` in the metrics configuration file, make sure to include the `ModelName` and `Level` dimension names towards the end of the list of dimensions, since they are included by default by the following custom metrics APIs:
+[add_metric](#function-api-to-add-generic-metrics-with-default-dimensions), [add_counter](#add-counter-based-metrics),
+[add_time](#add-time-based-metrics), [add_size](#add-size-based-metrics) or [add_percent](#add-percentage-based-metrics).
+
+### Starting TorchServe Metrics
+
+Whenever torchserve starts, the [backend worker](https://github.com/pytorch/serve/blob/master/ts/model_service_worker.py) initializes `service.context.metrics` with the [MetricsCache](https://github.com/pytorch/serve/blob/master/ts/metrics/metric_cache_yaml_impl.py) object. The `model_metrics` (backend metrics) section within the specified yaml file will be parsed, and Metric objects will be created based on the parsed section and added to the cache.
+
+This is all done internally, so the user does not have to do anything other than specify the desired yaml file.
+
+*Users have the ability to parse other sections of the yaml file manually, but the primary purpose of this functionality is to
+parse the backend metrics from the yaml file.*
+
+***How It Works***
+
+1. Create a `metrics.yaml` file to parse metrics from ***OR*** utilize the default [metrics.yaml](https://github.com/pytorch/serve/blob/master/ts/configs/metrics.yaml)
+
+2. Set the `metrics_config` argument equal to the yaml file path in the `config.properties` being used:
+   ```properties
+   ...
+   ...
+   workflow_store=../archive/src/test/resources/workflows
+   metrics_config=/////metrics.yaml
+   ...
+   ...
+   ```
+
+   If a `metrics_config` argument is not specified, the default yaml file will be used.
+
+3. Set the desired metrics mode using the `metrics_mode` configuration option in `config.properties` or the `TS_METRICS_MODE` environment variable. If not set, `log` mode is used by default.
+
+4. Run torchserve and specify the path of the `config.properties` after the `ts-config` flag (example using [Huggingface_Transformers](https://github.com/pytorch/serve/tree/master/examples/Huggingface_Transformers)):
+
+   ```torchserve --start --model-store model_store --models my_tc=BERTSeqClassification.mar --ncs --ts-config /////config.properties```
+
+5. Collect metrics depending on the mode chosen (a minimal handler sketch tying these steps together follows this list).
+
+   If using `log` mode, check:
+   * Frontend metrics - `log_directory/ts_metrics.log`
+   * Backend metrics - `log_directory/model_metrics.log`
+
+   If using `prometheus` mode, use the [Metrics API](metrics_api.md).
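+
+The sketch below is illustrative only: it assumes a gauge-typed `HandlerLatencyInMS` metric (unit: `ms`) has been added under `model_metrics` in the metrics configuration file, and it uses the `add_time` API described later in this document:
+
+```python
+import time
+
+
+class ExampleHandler:
+    """Hypothetical handler; the class and metric names are placeholders."""
+
+    def handle(self, data, context):
+        # context.metrics is backed by the MetricsCache described above
+        metrics = context.metrics
+        start = time.time()
+        output = data  # placeholder for real preprocessing/inference/postprocessing
+        # Assumes "HandlerLatencyInMS" is defined under model_metrics in metrics.yaml;
+        # it is emitted to model_metrics.log or the metrics endpoint per metrics_mode
+        metrics.add_time("HandlerLatencyInMS", round((time.time() - start) * 1000, 2), None, "ms")
+        return output
+```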
+
+
+## Metric Types
+
+The metrics collected include:
+
+### Frontend Metrics
 
 | Metric Name | Type | Unit | Dimensions | Semantics |
 |-----------------------------------|---------|--------------|-------------------------------------|-----------------------------------------------------------------------------|
@@ -55,14 +165,30 @@
 | GPUMemoryUsed | gauge | Megabytes | Level, DeviceId, Hostname | GPU memory used on host, DeviceId |
 | GPUUtilization | gauge | Percent | Level, DeviceId, Hostname | GPU utilization on host, DeviceId |
 
-## Backend Metrics
+### Backend Metrics
 
 | Metric Name | Type | Unit | Dimensions | Semantics |
 |-----------------------------------|-------|------|----------------------------|-------------------------------|
 | HandlerTime | gauge | ms | ModelName, Level, Hostname | Time spent in backend handler |
 | PredictionTime | gauge | ms | ModelName, Level, Hostname | Backend prediction time |
 
-## Formatting
+
+### Metric Types Enum
+
+TorchServe metrics use [metric types](https://github.com/pytorch/serve/blob/master/ts/metrics/metric_type_enum.py)
+that are in line with the [Prometheus API](https://github.com/prometheus/client_python) metric types.
+
+Metric types are an attribute of Metric objects.
+Users are restricted to the existing metric types when adding metrics via the Metrics API.
+
+```python
+class MetricTypes(enum.Enum):
+    COUNTER = "counter"
+    GAUGE = "gauge"
+    HISTOGRAM = "histogram"
+```
+
+## Metrics Formatting
 
 TorchServe emits metrics to log files by default.
 The metrics are formatted in a [StatsD](https://github.com/etsy/statsd) like format.
@@ -124,113 +250,9 @@
 EOE
 ```
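+
+As a rough illustration (this parser is not part of TorchServe), a log line in the StatsD-like shape shown above, e.g. `Name.Unit:Value|#dim1:val1,...|#hostname:host,timestamp`, can be broken apart with a few lines of Python:
+
+```python
+import re
+
+METRIC_LINE = re.compile(
+    r"(?P<name>\w+)\.(?P<unit>\w+):(?P<value>[^|]+)\|#(?P<dims>[^|]*)\|#hostname:(?P<rest>.+)"
+)
+
+def parse_metric(line):
+    """Best-effort parse of one metric log line; returns None on mismatch."""
+    match = METRIC_LINE.search(line)
+    if not match:
+        return None
+    dims = dict(d.split(":", 1) for d in match["dims"].split(",") if ":" in d)
+    return {"name": match["name"], "unit": match["unit"],
+            "value": float(match["value"]), "dimensions": dims}
+```
+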
- -The `metrics.yaml` is formatted with Prometheus metric type terminology: - -```yaml -dimensions: # dimension aliases - - &model_name "ModelName" - - &level "Level" - -ts_metrics: # frontend metrics - counter: # metric type - - name: NameOfCounterMetric # name of metric - unit: ms # unit of metric - dimensions: [*model_name, *level] # dimension names of metric (referenced from the above dimensions dict) - gauge: - - name: NameOfGaugeMetric - unit: ms - dimensions: [*model_name, *level] - histogram: - - name: NameOfHistogramMetric - unit: ms - dimensions: [*model_name, *level] - -model_metrics: # backend metrics - counter: # metric type - - name: InferenceTimeInMS # name of metric - unit: ms # unit of metric - dimensions: [*model_name, *level] # dimension names of metric (referenced from the above dimensions dict) - - name: NumberOfMetrics - unit: count - dimensions: [*model_name] - gauge: - - name: GaugeModelMetricNameExample - unit: ms - dimensions: [*model_name, *level] - histogram: - - name: HistogramModelMetricNameExample - unit: ms - dimensions: [*model_name, *level] -``` - - -Note that **only** the metrics defined in the **metrics configuration file** can be emitted to logs or made available via the metrics API endpoint. This is done to ensure that the metrics configuration file serves as a central inventory of all the metrics that Torchserve can emit. - -Default metrics are provided in the [metrics.yaml](https://github.com/pytorch/serve/blob/master/ts/configs/metrics.yaml) file, but the user can either delete them to their liking / ignore them altogether, because these metrics will not be emitted unless they are edited.\ -When adding custom `model_metrics` in the metrics configuration file, ensure to include `ModelName` and `Level` dimension names towards the end of the list of dimensions since they are included by default by the following custom metrics APIs: -[add_metric](#function-api-to-add-generic-metrics-with-default-dimensions), [add_counter](#add-counter-based-metrics), -[add_time](#add-time-based-metrics), [add_size](#add-size-based-metrics) or [add_percent](#add-percentage-based-metrics). - - -### How it works - -Whenever torchserve starts, the [backend worker](https://github.com/pytorch/serve/blob/master/ts/model_service_worker.py) initializes `service.context.metrics` with the [MetricsCache](https://github.com/pytorch/serve/blob/master/ts/metrics/metric_cache_yaml_impl.py) object. The `model_metrics` (backend metrics) section within the specified yaml file will be parsed, and Metric objects will be created based on the parsed section and added to the cache. - -This is all done internally, so the user does not have to do anything other than specifying the desired yaml file. - -*Users have the ability to parse other sections of the yaml file manually, but the primary purpose of this functionality is to -parse the backend metrics from the yaml file.* - -### User Manual - starting TorchServe with a yaml file specified - -1. Create a `metrics.yaml` file to parse metrics from OR utilize default [metrics.yaml](https://github.com/pytorch/serve/blob/master/ts/configs/metrics.yaml) - - -2. Set `metrics_config` argument equal to the yaml file path in the `config.properties` being used: - ```properties - ... - ... - workflow_store=../archive/src/test/resources/workflows - metrics_config=/////metrics.yaml - ... - ... - ``` - - If a `metrics_config` argument is not specified, the default yaml file will be used. - - -3. 
Run torchserve and specify the path of the `config.properties` after the `ts-config` flag: (example using [Huggingface_Transformers](https://github.com/pytorch/serve/tree/master/examples/Huggingface_Transformers))
-
-   ```torchserve --start --model-store model_store --models my_tc=BERTSeqClassification.mar --ncs --ts-config /////config.properties```
-
-
 ## Custom Metrics API
 
-TorchServe enables the custom service code to emit metrics that are then made available based on the configured `metrics_mode`.
+This is the API used in the backend handler to emit metrics. TorchServe enables the custom service code to emit metrics that are then made available based on the configured `metrics_mode`.
 
 The custom service code is provided with a [context](https://github.com/pytorch/serve/blob/master/ts/context.py) of the current request with a metrics object:
 
@@ -242,6 +264,7 @@ metrics = context.metrics
 ```
 
 All metrics are collected within the context.
 
+**Note:** The custom metrics API is not to be confused with the [metrics API endpoint](metrics_api.md), which is an HTTP API used to fetch metrics in the Prometheus format.
 
 ### Specifying Metric Types
 
@@ -249,8 +272,7 @@ When adding any metric via Metrics API, users have the ability to override the metric type by specifying the optional
 `metric_type=MetricTypes.[COUNTER/GAUGE/HISTOGRAM]`.
 
 ```python
-metric1 = metrics.add_metric_to_cache("GenericMetric", unit=unit, dimension_names=["name1", "name2", ...], metric_type=MetricTypes.GAUGE)
-metric.add_or_update(value, dimension_values=["value1", "value2", ...])
+metrics.add_metric("GenericMetric", value, unit=unit, dimensions=[Dimension("name1", "value1"), ...], metric_type=MetricTypes.GAUGE)
 
-# Backwards compatible, combines the above two method calls
+# Backwards compatible, metric type specific API
 metrics.add_counter("CounterMetric", value=1, dimensions=[Dimension("name", "value"), ...])
@@ -415,7 +437,6 @@ Note that calling `add_metric_to_cache` will not emit the metric, `add_or_update` will need to be called on the metric as shown above.
 
 ```python
 metric = metrics.add_metric('DistanceInKM', value=10, unit='km', dimensions=[...])
 ```
-
 ### Add time-based metrics
 
 **Time-based metrics are defaulted to a `GAUGE` metric type**
@@ -622,7 +643,7 @@
 context.metrics.add_counter(...)
 ```
 
 This custom metrics information is logged in the `model_metrics.log` file configured through [log4j2.xml](https://github.com/pytorch/serve/blob/master/frontend/server/src/main/resources/log4j2.xml) file
-or made available via the [metrics](https://github.com/pytorch/serve/blob/master/docs/metrics_api.md) API endpoint based on the `metrics_mode` configuration.
+or made available via the [metrics](metrics_api.md) API endpoint based on the `metrics_mode` configuration.
 
 ## Metrics YAML File Parsing and Metrics API Custom Handler Example
@@ -683,7 +704,7 @@ class CustomHandlerExample:
    In versions greater than v0.8.1 the `add_metric` API signature was updated to support backwards compatibility:\
   from: [add_metric(metric_name, unit, dimension_names=None, metric_type=MetricTypes.COUNTER)](https://github.com/pytorch/serve/blob/35ef00f9e62bb7fcec9cec92630ae757f9fb0db0/ts/metrics/metric_cache_abstract.py#L272)\
   to: `add_metric(name, value, unit, idx=None, dimensions=[], metric_type=MetricTypes.COUNTER)`\
-  Usage of the new API is shown [above](#specifying-metric-types).\
+  Usage of the new API is shown [above](#specifying-metric-types).
   **Upgrade paths**:
   - **[< v0.6.1] to [v0.6.1 - v0.8.1]**\
     There are two approaches available when migrating to the new custom metrics API:
   - **[v0.6.1 - v0.8.1] to [> v0.8.1]**\
     Replace the call to `add_metric` with `add_metric_to_cache`.
 2. Starting [v0.8.0](https://github.com/pytorch/serve/releases/tag/v0.8.0), only metrics that are defined in the metrics config file(default: [metrics.yaml](https://github.com/pytorch/serve/blob/master/ts/configs/metrics.yaml))
-   are either all logged to `ts_metrics.log` and `model_metrics.log` or made available via the [metrics API endpoint](https://github.com/pytorch/serve/blob/master/docs/metrics_api.md)
+   are either all logged to `ts_metrics.log` and `model_metrics.log` or made available via the [metrics API endpoint](metrics_api.md)
    based on the `metrics_mode` configuration as described [above](#introduction).\
    The default `metrics_mode` is `log` mode.\
    This is unlike in previous versions where all metrics were only logged to `ts_metrics.log` and `model_metrics.log` except for
    `ts_inference_requests_total`, `ts_inference_latency_microseconds` and `ts_queue_latency_microseconds` which were only available via the metrics API endpoint.\
    **Upgrade paths**:
    - **[< v0.8.0] to [>= v0.8.0]**\
-     Specify all the custom metrics added to the custom handler in the metrics configuration file as shown [above](#central-metrics-yaml-file-definition).
+     Specify all the custom metrics added to the custom handler in the metrics configuration file as shown [above](#getting-started-with-torchserve-metrics).
diff --git a/docs/metrics_api.md b/docs/metrics_api.md
index c1ca10c2d5..6028dc5b1b 100644
--- a/docs/metrics_api.md
+++ b/docs/metrics_api.md
@@ -1,10 +1,13 @@
 # Metrics API
 
-Metrics API is listening on port 8082 and only accessible from localhost by default. To change the default setting, see [TorchServe Configuration](configuration.md). The default metrics endpoint returns Prometheus formatted metrics when [metrics_mode](https://github.com/pytorch/serve/blob/master/docs/metrics.md) configuration is set to `prometheus`. You can query metrics using curl requests or point a [Prometheus Server](#prometheus-server) to the endpoint and use [Grafana](#grafana) for dashboards.
+The Metrics API is an HTTP API used to fetch metrics in the Prometheus format. It listens on port 8082 and is only accessible from localhost by default. To change the default setting, see [TorchServe Configuration](configuration.md). The metrics endpoint is enabled by default and returns Prometheus formatted metrics when the [metrics_mode](https://github.com/pytorch/serve/blob/master/docs/metrics.md) configuration is set to `prometheus`. You can query metrics using curl requests or point a [Prometheus Server](#prometheus-server) to the endpoint and use [Grafana](#grafana) for dashboards.
 
-By default these APIs are enabled however same can be disabled by setting `enable_metrics_api=false` in torchserve config.properties file.
+This API is enabled by default but can be disabled by setting `enable_metrics_api=false` in the torchserve config.properties file.
 For details refer [Torchserve config](configuration.md) docs.
 
+**Note:** This is not to be confused with TorchServe's [custom metrics API](metrics.md), which is used by handler code to emit custom backend metrics that are then collected based on the configured `metrics_mode` (`log` or `prometheus`). More information on that API can be found [here](metrics.md).
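+
+In addition to `curl` queries like the one shown below, the endpoint can be scraped programmatically. The following is a minimal sketch assuming the third-party `requests` package and the default port; `ts_inference_requests_total` and the related counters are built-in frontend metrics:
+
+```python
+import requests
+
+# Scrape the TorchServe metrics endpoint and print the inference counters
+response = requests.get("http://127.0.0.1:8082/metrics", timeout=5)
+response.raise_for_status()
+for line in response.text.splitlines():
+    if line.startswith("ts_inference"):
+        print(line)
+```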
+
+
 ```console
 curl http://127.0.0.1:8082/metrics
 
diff --git a/kubernetes/AKS/README.md b/kubernetes/AKS/README.md
index 99b6074fe4..323f7df568 100644
--- a/kubernetes/AKS/README.md
+++ b/kubernetes/AKS/README.md
@@ -302,7 +302,7 @@ az group delete --name myResourceGroup --yes --no-wait
 
 **Troubleshooting Azure resource for AKS cluster creation**
 
- * Check AKS available region, https://azure.microsoft.com/en-us/explore/global-infrastructure/products-by-region/
+ * Check AKS region availability, https://azure.microsoft.com/en-us/explore/global-infrastructure/products-by-region/?products=kubernetes-service
  * Check AKS quota and VM size limitation, https://docs.microsoft.com/en-us/azure/aks/quotas-skus-regions
  * Check whether your subscription has enough quota to create AKS cluster, https://docs.microsoft.com/en-us/azure/networking/check-usage-against-limits
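+
+For example, current vCPU usage against quota in a target region can be checked with the Azure CLI before creating the cluster (a sketch; `eastus` is a placeholder region):
+
+```bash
+# List VM quota usage for the region where the AKS cluster will be created
+az vm list-usage --location eastus --output table
+```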