Improve Operator Pod Mutation Observability #3702

Open
cjp421 opened this issue Feb 7, 2025 · 2 comments
Labels: enhancement, needs triage

Comments

cjp421 commented Feb 7, 2025

Component(s)

auto-instrumentation

Is your feature request related to a problem? Please describe.

I'm currently working on rolling out the OpenTelemetry Operator across all of the Kubernetes (OpenShift) clusters in our environment. The ability to auto-instrument our application workloads will be crucial to our ability to support our systems. If something happens to the operator that results in pods NOT getting auto-instrumented, we'd potentially be "flying blind".

I'd like finer-grained insight into the counts of auto-instrumentation attempts and failures, so we can build the proper alerting (SLOs).

Describe the solution you'd like

Instrument the pod mutator to create/increment metrics indicating that a pod contains the instrumentation annotation and is therefore subject to auto-instrumentation. Some initial ideas on the types of scenarios/metrics to expose (a rough sketch of the wiring follows this list):

  • pod contained an instrumentation/sidecar annotation (may or may not be a valid config) -> increment a counter saying "the pod mutator will attempt to process this pod"
  • pod contained an invalid "inject" type -> pod mutation didn't happen; increment a counter to reflect this scenario
  • pod contained an invalid instrumentation or sidecar reference in the annotation value -> pod mutation didn't happen; increment a counter to reflect this scenario
  • pod contained a valid instrumentation or sidecar annotation/reference, but an unexpected error occurred -> pod mutation failed; increment a counter to reflect this scenario

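A minimal sketch of what the wiring might look like, assuming the operator registers the counter with the controller-runtime Prometheus registry (which backs the operator's `/metrics` endpoint). The metric name and the `outcome` label values below are placeholders I made up, not agreed-upon names:

```go
package podmutation

import (
	"github.com/prometheus/client_golang/prometheus"
	"sigs.k8s.io/controller-runtime/pkg/metrics"
)

// Hypothetical metric/label names -- placeholders until naming is agreed on.
var podMutationAttempts = prometheus.NewCounterVec(
	prometheus.CounterOpts{
		Name: "opentelemetry_operator_pod_mutations_total",
		Help: "Pod mutation attempts by the operator webhook, partitioned by outcome.",
	},
	// One outcome per scenario above, e.g. "attempted", "invalid_inject_type",
	// "invalid_reference", "error".
	[]string{"outcome"},
)

func init() {
	// metrics.Registry is the registry controller-runtime serves on /metrics.
	metrics.Registry.MustRegister(podMutationAttempts)
}
```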
I know some of these scenarios may be visible in container or Kubernetes logs, but managing a fleet of operators across multiple clusters is much easier with aggregate metrics we can feed into our alerting infrastructure.

Describe alternatives you've considered

I'm currently leveraging the metrics provided by the Kubernetes API server's admission controller to see the counts of webhook invocations sent to mpod.kb.io. That provides some insight, but not all pod creations are eligible for OTel instrumentation (i.e. they may or may not have the instrumentation.opentelemetry.io annotations).

Additional context

No response

cjp421 added the enhancement and needs triage labels on Feb 7, 2025
@iblancasa (Contributor) commented:

I think we need to take a look at this, because it's not the first time we've talked about it. I'll add a note so it gets discussed during the next SIG.

iblancasa added the discuss-at-sig label on Feb 10, 2025
swiatekm removed the discuss-at-sig label on Feb 13, 2025
@swiatekm (Contributor) commented:

We discussed this during the SIG meeting on 13.02.2025 and agreed that it would be a desirable feature. There are some performance issues related to reporting the number of Pods that should be instrumented but aren't, but simply counting errors as they happen should be fine.

What we need to do next is propose names for the new metrics and attributes. If anyone has suggestions, feel free to post them in this issue.
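To make the "counting errors as they happen" part concrete, here's one possible shape for classifying mutation failures into outcome labels, reusing the counter sketched in the issue description. The sentinel errors and the outcome values are hypothetical; the operator's real error types may differ:

```go
package podmutation

import "errors"

// Illustrative sentinel errors -- stand-ins for whatever the mutator
// actually returns in these cases.
var (
	errInvalidInjectValue = errors.New("invalid inject annotation value")
	errInvalidReference   = errors.New("referenced Instrumentation not found")
)

// recordMutationOutcome maps a mutation result onto the proposed outcome
// labels and increments the counter from the earlier sketch.
func recordMutationOutcome(err error) {
	switch {
	case err == nil:
		podMutationAttempts.WithLabelValues("success").Inc()
	case errors.Is(err, errInvalidInjectValue):
		podMutationAttempts.WithLabelValues("invalid_inject_type").Inc()
	case errors.Is(err, errInvalidReference):
		podMutationAttempts.WithLabelValues("invalid_reference").Inc()
	default:
		podMutationAttempts.WithLabelValues("error").Inc()
	}
}
```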
