Improve Operator Pod Mutation Observability #3702

Open
cjp421 opened this issue Feb 7, 2025 · 2 comments
Labels: enhancement, needs triage

Comments

cjp421 commented Feb 7, 2025

Component(s)

auto-instrumentation

Is your feature request related to a problem? Please describe.

I'm currently working on rolling out the OpenTelemetry Operator across all of the Kubernetes (OpenShift) clusters in our environment. The ability to auto-instrument our application workloads will be crucial to our ability to support our systems. If something happens to the operator that results in pods NOT getting auto-instrumented, we'd potentially be "flying blind".

I'd like finer-grained insight into the counts of auto-instrumentation attempts and failures, so we can build the proper alerting (SLOs).

Describe the solution you'd like

Instrument the pod mutator to create/increment metrics indicating that a pod contains the instrumentation annotation and is therefore subject to auto-instrumentation. Some initial ideas on the types of scenarios/metrics to expose (a rough sketch of the wiring follows this list):

  • pod contained an instrumentation/sidecar annotation (may or may not be a valid config) -> increment a counter saying "the pod mutator will attempt to process this pod"
  • pod contained an invalid "inject" type -> pod mutation didn't happen; increment a counter to reflect this scenario
  • pod contained an invalid instrumentation or sidecar reference in the annotation value -> pod mutation didn't happen; increment a counter to reflect this scenario
  • pod contained a valid instrumentation or sidecar annotation/reference, but an unexpected error occurred -> pod mutation failed; increment a counter to reflect this scenario

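A minimal sketch of what the wiring might look like, assuming the operator registers the counter with the controller-runtime Prometheus registry (which backs the operator's `/metrics` endpoint). The metric name and the `outcome` label values below are placeholders I made up, not agreed-upon names:

```go
package podmutation

import (
	"github.com/prometheus/client_golang/prometheus"
	"sigs.k8s.io/controller-runtime/pkg/metrics"
)

// Hypothetical metric/label names -- placeholders until naming is agreed on.
var podMutationAttempts = prometheus.NewCounterVec(
	prometheus.CounterOpts{
		Name: "opentelemetry_operator_pod_mutations_total",
		Help: "Pod mutation attempts by the operator webhook, partitioned by outcome.",
	},
	// One outcome per scenario above, e.g. "attempted", "invalid_inject_type",
	// "invalid_reference", "error".
	[]string{"outcome"},
)

func init() {
	// metrics.Registry is the registry controller-runtime serves on /metrics.
	metrics.Registry.MustRegister(podMutationAttempts)
}
```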
I know some of these scenarios may be visible in container or Kubernetes logs, but managing a fleet of operators across multiple clusters is much easier with aggregate metrics we can feed into our alerting infrastructure.

Describe alternatives you've considered

I'm currently leveraging the metrics provided by the Kubernetes API server's admission controller to see the counts of webhook invocations sent to mpod.kb.io. That provides some insight, but not all pod creations are eligible for OTel instrumentation (i.e. they may or may not have the instrumentation.opentelemetry.io annotations).

Additional context

No response

cjp421 added the enhancement and needs triage labels on Feb 7, 2025
@iblancasa (Contributor) commented:

I think we need to take a look at this, because it's not the first time we've talked about it. I'll add a note so it gets discussed during the next SIG.

iblancasa added the discuss-at-sig label on Feb 10, 2025
swiatekm removed the discuss-at-sig label on Feb 13, 2025
@swiatekm (Contributor) commented:

We discussed this during the SIG meeting on 13.02.2025 and agreed that it would be a desirable feature. There are some performance issues related to reporting the number of Pods that should be instrumented but aren't, but simply counting errors as they happen should be fine.

What we need to do next is propose names for the new metrics and attributes. If anyone has suggestions, feel free to post them in this issue.
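To make the "counting errors as they happen" part concrete, here's one possible shape for classifying mutation failures into outcome labels, reusing the counter sketched in the issue description. The sentinel errors and the outcome values are hypothetical; the operator's real error types may differ:

```go
package podmutation

import "errors"

// Illustrative sentinel errors -- stand-ins for whatever the mutator
// actually returns in these cases.
var (
	errInvalidInjectValue = errors.New("invalid inject annotation value")
	errInvalidReference   = errors.New("referenced Instrumentation not found")
)

// recordMutationOutcome maps a mutation result onto the proposed outcome
// labels and increments the counter from the earlier sketch.
func recordMutationOutcome(err error) {
	switch {
	case err == nil:
		podMutationAttempts.WithLabelValues("success").Inc()
	case errors.Is(err, errInvalidInjectValue):
		podMutationAttempts.WithLabelValues("invalid_inject_type").Inc()
	case errors.Is(err, errInvalidReference):
		podMutationAttempts.WithLabelValues("invalid_reference").Inc()
	default:
		podMutationAttempts.WithLabelValues("error").Inc()
	}
}
```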
