Support Openmetrics metrics collection (#10752)
* Add logic for Envoy Openmetricsv2

* Add label remapper

* Add new metrics

* Finish adding other metrics

* reorganize metrics that should be transformed

* Introduce openmetrics_endpoint config option

* Add watchdog metrics transformers

* Add some more label extraction metrics

* refactor tests to move to legacy

* Add legacy and non legacy fixtures to test files

* Update readme

* Bump base req

* Fix style

* Mark legacy metrics

* Fix watchdog counter name

* document prometheus metrics in metadata csv

* Fix metadata csv

* Add e2e test

* Fix style

* Fix metadata format for validation

* Flaky metrics

* Only support openmetrics in latest api v3

* Fix test imports

* Enable Openmetrics option by default

* Fix import

* Fix style

* Update readme

* Update config stats_url wording

* Fix envoy import

* Remove py27 for openmetrics version

* Openmetrics endpoint should be optional

* Account for flaky metrics

* Document service checks

* Use unique name

* Update envoy/tests/legacy/test_bench.py

Co-authored-by: Ofek Lev <ofekmeister@gmail.com>

* Move metrics map to metrics.py

* Update with feedback

* Use lambda

* simplify match

* Refactor metadata utils

* Support metadata collection in V2

* Use urlunparse

* Reintroduce legacy config options as hidden

Co-authored-by: Ofek Lev <ofekmeister@gmail.com>
ChristineTChen and ofek authored Dec 8, 2021
1 parent 57c4cc6 commit 3114930
Showing 26 changed files with 2,636 additions and 1,175 deletions.
48 changes: 12 additions & 36 deletions envoy/README.md
@@ -12,7 +12,12 @@ The Envoy check is included in the [Datadog Agent][2] package, so you don't need

#### Istio

If you are using Envoy as part of [Istio][3], be sure to use the appropriate [Envoy admin endpoint][4] for the `stats_url`.
If you are using Envoy as part of [Istio][3], configure the Envoy integration to collect metrics from the Istio proxy metrics endpoint.

```yaml
instances:
- openmetrics_endpoint: localhost:15090/stats/prometheus
```

#### Standard

@@ -100,45 +105,16 @@ To configure this check for an Agent running on a host:
init_config:
instances:
## @param stats_url - string - required
## The admin endpoint to connect to. It must be accessible:
## https://www.envoyproxy.io/docs/envoy/latest/operations/admin
## Add a `?usedonly` on the end if you wish to ignore
## unused metrics instead of reporting them as `0`.
#
- stats_url: http://localhost:80/stats
## @param openmetrics_endpoint - string - required
## The URL exposing metrics in the OpenMetrics format.
#
- openmetrics_endpoint: http://localhost:8001/stats/prometheus
```

2. Check if the Datadog Agent can access Envoy's [admin endpoint][5].
3. [Restart the Agent][9].

###### Metric filtering
Metrics can be filtered with the parameters `included_metrics` or `excluded_metrics` using regular expressions. If both parameters are used, `included_metrics` is applied first, then `excluded_metrics` is applied on the resulting set.

The filtering occurs before tag extraction, so you have the option to have certain tags decide whether or not to keep or ignore metrics. An exhaustive list of all metrics and tags can be found in [metrics.py][10]. Let's walk through an example of Envoy metric tagging!

```python
...
'cluster.grpc.success': {
'tags': (
('<CLUSTER_NAME>', ),
('<GRPC_SERVICE>', '<GRPC_METHOD>', ),
(),
),
...
},
...
```

Here there are `3` tag sequences: `('<CLUSTER_NAME>')`, `('<GRPC_SERVICE>', '<GRPC_METHOD>')`, and empty `()`. The number of sequences corresponds exactly to how many metric parts there are. For this metric, there are `3` parts: `cluster`, `grpc`, and `success`. Envoy separates everything with a `.`, hence the final metric name would be:

`cluster.<CLUSTER_NAME>.grpc.<GRPC_SERVICE>.<GRPC_METHOD>.success`

If you care only about the cluster name and grpc service, you would add this to your `included_metrics`:

`^cluster\.<CLUSTER_NAME>\.grpc\.<GRPC_SERVICE>\.`
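
The include-then-exclude ordering described in this section can be sketched in Python. This is a minimal illustration, not the check's actual code: `filter_metrics` is a hypothetical helper, the metric names are made up, and the sketch assumes patterns are applied with `re.search` and that an empty include list keeps every metric.

```python
import re


def filter_metrics(names, included=(), excluded=()):
    """Hypothetical helper: apply include patterns first, then exclude patterns."""
    include_res = [re.compile(p) for p in included]
    exclude_res = [re.compile(p) for p in excluded]

    kept = []
    for name in names:
        # Assumption: with no include patterns, every metric passes the first stage.
        if include_res and not any(r.search(name) for r in include_res):
            continue
        # Exclusion runs on whatever survived the include stage.
        if any(r.search(name) for r in exclude_res):
            continue
        kept.append(name)
    return kept


# Illustrative raw Envoy metric names (tags are still embedded in the name).
raw_names = [
    "cluster.in.grpc.users.Get.success",
    "cluster.out.grpc.users.Get.success",
    "http.admin.downstream_cx_active",
]
kept = filter_metrics(
    raw_names,
    included=[r"^cluster\.(in|out)\."],
    excluded=[r"^cluster\.out\."],
)
```

Because the patterns run against the raw name before tag extraction, a tag value such as the cluster name (`in` or `out` here) can decide whether a metric is kept.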

##### Log collection

<!-- partial
@@ -180,7 +156,7 @@ For containerized environments, see the [Autodiscovery Integration Templates][11
| -------------------- | ------------------------------------------- |
| `<INTEGRATION_NAME>` | `envoy` |
| `<INIT_CONFIG>` | blank or `{}` |
| `<INSTANCE_CONFIG>` | `{"stats_url": "http://%%host%%:80/stats"}` |
| `<INSTANCE_CONFIG>` | `{"openmetrics_endpoint": "http://%%host%%:80/stats/prometheus"}` |

##### Log collection

Empty file added envoy/__init__.py
Empty file.
42 changes: 18 additions & 24 deletions envoy/assets/configuration/spec.yaml
@@ -8,27 +8,35 @@ files:
- template: init_config/default
- template: instances
options:
- template: instances/openmetrics
overrides:
openmetrics_endpoint.value.example: http://localhost:80/stats/prometheus
openmetrics_endpoint.display_priority: 1
openmetrics_endpoint.required: false
openmetrics_endpoint.enabled: true
- name: stats_url
required: true
display_priority: 3
display_priority: 1
description: |
The admin endpoint to connect to. It must be accessible:
The check will collect and parse metrics from the admin /stats/ endpoint.
It must be accessible:
https://www.envoyproxy.io/docs/envoy/latest/operations/admin
Add a `?usedonly` on the end if you wish to ignore
unused metrics instead of reporting them as `0`.
Note: see the configuration options specific to this option here,
https://github.com/DataDog/integrations-core/blob/7.33.x/envoy/datadog_checks/envoy/data/conf.yaml.example
value:
example: http://localhost:80/stats
type: string
- name: included_metrics
hidden: true
description: |
Includes metrics using regular expressions.
The filtering occurs before tag extraction, so you have the option
to have certain tags decide whether or not to keep or ignore metrics.
For an exhaustive list of all metrics and tags, see:
https://github.com/DataDog/integrations-core/blob/master/envoy/datadog_checks/envoy/metrics.py
If you surround patterns by quotes, be sure to escape backslashes with an extra backslash.
The example list below will include:
- cluster.in.0000.lb_subsets_active
- cluster.out.alerting-event-evaluator-test.datadog.svc.cluster.local
@@ -39,15 +47,14 @@ files:
example:
- cluster\.(in|out)\..*
- name: excluded_metrics
hidden: true
description: |
Excludes metrics using regular expressions.
The filtering occurs before tag extraction, so you have the option
to have certain tags decide whether or not to keep or ignore metrics.
For an exhaustive list of all metrics and tags, see:
https://github.com/DataDog/integrations-core/blob/master/envoy/datadog_checks/envoy/metrics.py
If you surround patterns by quotes, be sure to escape backslashes with an extra backslash.
The example list below will exclude:
- http.admin.downstream_cx_active
- http.http.rds.0000.control_plane.rate_limit_enforced
@@ -58,26 +65,30 @@ files:
example:
- ^http\..*
- name: cache_metrics
hidden: true
description: |
Results are cached by default to decrease CPU utilization, at
the expense of some memory. Disable by setting this to false.
value:
type: boolean
example: true
- name: parse_unknown_metrics
hidden: true
description: |
Attempt parsing of metrics that are unknown and will otherwise be skipped.
value:
type: boolean
example: false
- name: collect_server_info
hidden: true
description: |
Collect Envoy version by accessing the `/server_info` endpoint.
Disable this if this endpoint is not reachable by the agent.
value:
type: boolean
example: true
- name: disable_legacy_cluster_tag
hidden: true
description: |
Enable to stop submitting the tags `cluster_name` and `virtual_cluster_name`,
which have been renamed to `envoy_cluster` and `virtual_envoy_cluster`.
@@ -86,23 +97,6 @@ files:
type: boolean
display_default: false
example: true
- template: instances/default
- template: instances/http
overrides:
username.description: |
The username to use if services are behind basic auth.
Note: The Envoy admin endpoint does not support auth until:
https://github.com/envoyproxy/envoy/issues/2763
For an alternative, see:
https://gist.github.com/ofek/6051508cd0dfa98fc6c13153b647c6f8
username.display_priority: 2
password.description: |
The password to use if services are behind basic or NTLM auth.
Note: The Envoy admin endpoint does not support auth until:
https://github.com/envoyproxy/envoy/issues/2763
For an alternative, see:
https://gist.github.com/ofek/6051508cd0dfa98fc6c13153b647c6f8
password.display_priority: 1
- template: logs
example:
- type: file
14 changes: 14 additions & 0 deletions envoy/assets/service_checks.json
@@ -12,5 +12,19 @@
],
"name": "Can Connect",
"description": "Returns `CRITICAL` if the agent can't connect to Envoy to collect metrics, otherwise `OK`."
},
{
"agent_version": "7.34.0",
"integration": "Envoy",
"check": "envoy.openmetrics.health",
"statuses": [
"ok",
"critical"
],
"groups": [
"endpoint"
],
"name": "Openmetrics Can Connect",
"description": "Returns `CRITICAL` if the agent can't connect to Envoy to collect metrics, otherwise `OK`."
}
]
156 changes: 156 additions & 0 deletions envoy/datadog_checks/envoy/check.py
@@ -0,0 +1,156 @@
# (C) Datadog, Inc. 2021-present
# All rights reserved
# Licensed under a 3-clause BSD style license (see LICENSE)
import re
from collections import defaultdict

from six.moves.urllib.parse import urljoin, urlparse, urlunparse

from datadog_checks.base import AgentCheck, OpenMetricsBaseCheckV2

from .metrics import PROMETHEUS_METRICS_MAP
from .utils import _get_server_info

ENVOY_VERSION = {'istio_build': {'type': 'metadata', 'label': 'tag', 'name': 'version'}}

LABEL_MAP = {
    'cluster_name': 'envoy_cluster',
    'envoy_cluster_name': 'envoy_cluster',
    'envoy_http_conn_manager_prefix': 'stat_prefix',  # tracing
    'envoy_listener_address': 'address',  # listener
    'envoy_virtual_cluster': 'virtual_envoy_cluster',  # vhost
    'envoy_virtual_host': 'virtual_host_name',  # vhost
}


METRIC_WITH_LABEL_NAME = {
    r'^envoy_server_(.+\_.+)_watchdog_miss$': {
        'label_name': 'thread_name',
        'metric_type': 'monotonic_count',
        'new_name': 'server.watchdog_miss.count',
    },
    r'^envoy_server_(.+\_.+)_watchdog_mega_miss$': {
        'label_name': 'thread_name',
        'metric_type': 'monotonic_count',
        'new_name': 'server.watchdog_mega_miss.count',
    },
    r'^envoy_(.+\_.+)_watchdog_miss$': {
        'label_name': 'thread_name',
        'metric_type': 'monotonic_count',
        'new_name': 'watchdog_miss.count',
    },
    r'^envoy_(.+\_.+)_watchdog_mega_miss$': {
        'label_name': 'thread_name',
        'metric_type': 'monotonic_count',
        'new_name': 'watchdog_mega_miss.count',
    },
    r'^envoy_cluster_circuit_breakers_(\w+)_cx_open$': {
        'label_name': 'priority',
        'metric_type': 'gauge',
        'new_name': 'cluster.circuit_breakers.cx_open',
    },
    r'^envoy_cluster_circuit_breakers_(\w+)_cx_pool_open$': {
        'label_name': 'priority',
        'metric_type': 'gauge',
        'new_name': 'cluster.circuit_breakers.cx_pool_open',
    },
    r'^envoy_cluster_circuit_breakers_(\w+)_rq_open$': {
        'label_name': 'priority',
        'metric_type': 'gauge',
        'new_name': 'cluster.circuit_breakers.rq_open',
    },
    r'^envoy_cluster_circuit_breakers_(\w+)_rq_pending_open$': {
        'label_name': 'priority',
        'metric_type': 'gauge',
        'new_name': 'cluster.circuit_breakers.rq_pending_open',
    },
    r'^envoy_cluster_circuit_breakers_(\w+)_rq_retry_open$': {
        'label_name': 'priority',
        'metric_type': 'gauge',
        'new_name': 'cluster.circuit_breakers.rq_retry_open',
    },
    r'^envoy_listener_admin_(.+\_.+)_downstream_cx_active$': {
        'label_name': 'handler',
        'metric_type': 'gauge',
        'new_name': 'listener.admin.downstream_cx_active',
    },
    r'^envoy_listener_(.+\_.+)_downstream_cx_active$': {
        'label_name': 'handler',
        'metric_type': 'gauge',
        'new_name': 'listener.downstream_cx_active',
    },
    r'^envoy_listener_admin_(.+\_.+)_downstream_cx$': {
        'label_name': 'handler',
        'metric_type': 'monotonic_count',
        'new_name': 'listener.admin.downstream_cx.count',
    },
    r'^envoy_listener_(.+)_downstream_cx$': {
        'label_name': 'handler',
        'metric_type': 'monotonic_count',
        'new_name': 'listener.downstream_cx.count',
    },
}


class EnvoyCheckV2(OpenMetricsBaseCheckV2):
    __NAMESPACE__ = 'envoy'

    DEFAULT_METRIC_LIMIT = 0

    def __init__(self, name, init_config, instances):
        super().__init__(name, init_config, instances)
        self.check_initializations.append(self.configure_additional_transformers)
        openmetrics_endpoint = self.instance.get('openmetrics_endpoint')
        self.base_url = None
        try:
            parts = urlparse(openmetrics_endpoint)
            self.base_url = urlunparse(parts[:2] + ('', '', None, None))

        except Exception as e:
            self.log.debug("Unable to determine the base url for version collection: %s", str(e))

    def check(self, _):
        self._collect_metadata()
        super(EnvoyCheckV2, self).check(None)

    def get_default_config(self):
        return {
            'metrics': [PROMETHEUS_METRICS_MAP],
            'rename_labels': LABEL_MAP,
        }

    def configure_transformer_label_in_name(self, metric_pattern, new_name, label_name, metric_type):
        method = getattr(self, metric_type)
        cached_patterns = defaultdict(lambda: re.compile(metric_pattern))

        def transform(metric, sample_data, runtime_data):
            for sample, tags, hostname in sample_data:
                parsed_sample_name = sample.name
                if sample.name.endswith("_total"):
                    parsed_sample_name = re.match("(.*)_total$", sample.name).groups()[0]
                label_value = cached_patterns[metric_pattern].match(parsed_sample_name).groups()[0]

                tags.append('{}:{}'.format(label_name, label_value))
                method(new_name, sample.value, tags=tags, hostname=hostname)

        return transform

    def configure_additional_transformers(self):
        for metric, data in METRIC_WITH_LABEL_NAME.items():
            self.scrapers[self.instance['openmetrics_endpoint']].metric_transformer.add_custom_transformer(
                metric, self.configure_transformer_label_in_name(metric, **data), pattern=True
            )

    @AgentCheck.metadata_entrypoint
    def _collect_metadata(self):
        # Replace in favor of built-in Openmetrics metadata when PR is available
        # https://github.com/envoyproxy/envoy/pull/18991
        if not self.base_url:
            self.log.debug("Skipping server info collection due to malformed url: %s", self.base_url)
            return
        # base_url keeps only the scheme and host, so this goes
        # from http://domain/stats/prometheus to http://domain/server_info
        server_info_url = urljoin(self.base_url, 'server_info')
        raw_version = _get_server_info(server_info_url, self.log, self.http)

        if raw_version:
            self.set_metadata('version', raw_version)
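
The label-in-name transformers in `check.py` can be hard to picture from the pattern table alone, so here is a minimal standalone sketch of the same idea. It is not the integration's code: `extract_label` and `PATTERNS` are hypothetical names, only two patterns are copied from `METRIC_WITH_LABEL_NAME`, and the sample metric names are illustrative rather than taken from a real Envoy endpoint.

```python
import re

# Hypothetical subset of the pattern table, copied from METRIC_WITH_LABEL_NAME.
PATTERNS = {
    r'^envoy_cluster_circuit_breakers_(\w+)_cx_open$': ('priority', 'cluster.circuit_breakers.cx_open'),
    r'^envoy_server_(.+\_.+)_watchdog_miss$': ('thread_name', 'server.watchdog_miss.count'),
}
COMPILED = {re.compile(pattern): target for pattern, target in PATTERNS.items()}


def extract_label(sample_name):
    """Return (new_name, tag) for a raw sample name, or None if no pattern matches."""
    # Counter samples carry a `_total` suffix; strip it before matching,
    # as the transformer in check.py does.
    if sample_name.endswith('_total'):
        sample_name = sample_name[: -len('_total')]
    for pattern, (label_name, new_name) in COMPILED.items():
        match = pattern.match(sample_name)
        if match:
            # The captured part of the metric name becomes the tag value.
            return new_name, '{}:{}'.format(label_name, match.group(1))
    return None


gauge = extract_label('envoy_cluster_circuit_breakers_default_cx_open')
counter = extract_label('envoy_server_main_thread_watchdog_miss_total')
unmatched = extract_label('envoy_cluster_upstream_rq')
```

In the check itself, the captured value is appended to the sample's existing tags and submitted through the method named by `metric_type`. The sketch returns the pair instead so it can run without the Agent.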