
spanmetricsprocessor doesn't prune histograms when metric cache is pruned #27080

Closed
nijave opened this issue Sep 22, 2023 · 6 comments

Labels
bug (Something isn't working) · priority:p1 (High) · processor/spanmetrics (Span Metrics processor)

Comments

@nijave
Contributor

nijave commented Sep 22, 2023

Component(s)

processor/spanmetrics

What happened?

Description

The span metrics processor doesn't drop old histograms.
Graphs are in grafana/agent#5271.

Steps to Reproduce

Leave the collector running for a while and watch the exported metric count grow indefinitely.

Expected Result

Metric series should be pruned if they haven't been updated in a while.

Actual Result

The metric series dimension cache is pruned, but the histograms are not.

Collector version

v0.80.0

Environment information

Environment

OS: (e.g., "Ubuntu 20.04")
Compiler(if manually compiled): (e.g., "go 14.2")

OpenTelemetry Collector configuration

The config is automatically generated by Grafana Agent. See https://github.com/grafana/agent/blob/main/pkg/traces/config.go#L647

Log output

N/A

Additional context

It looks like the `histograms` map should have been pruned/LRU'd in addition to `metricsKeyToDimensions` #2179

I think this is the same/similar issue as #17306 (comment), but it's closed, so I figured I'd collect everything into a bug report.
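
For readers less familiar with the processor internals, here is a minimal, self-contained sketch of the failure mode described here. It is not the actual processor code; the names only mirror the ones mentioned in this thread. The dimensions cache is capped at a fixed size, but nothing removes the matching entries from the histograms map, so the number of series keeps climbing:

```go
package main

import "fmt"

// Illustrative stand-ins for the two structures discussed above: a size-capped
// dimensions cache and the histograms map the processor aggregates into.
const dimensionsCacheSize = 100

type histogram struct{ count uint64 }

func main() {
	metricKeyToDimensions := map[string]struct{}{} // pruned once it exceeds the cap
	histograms := map[string]*histogram{}          // never pruned (the bug)

	for i := 0; i < 10_000; i++ {
		key := fmt.Sprintf("service-a|span-%d", i) // high-cardinality key, e.g. one per build
		metricKeyToDimensions[key] = struct{}{}
		histograms[key] = &histogram{count: 1}

		// Crude stand-in for LRU eviction: keep the dimensions cache at its cap.
		// Note that nothing deletes the matching histograms entry.
		if len(metricKeyToDimensions) > dimensionsCacheSize {
			for k := range metricKeyToDimensions {
				if k != key {
					delete(metricKeyToDimensions, k)
					break
				}
			}
		}
	}

	fmt.Println("dimensions cache:", len(metricKeyToDimensions)) // stays at 100
	fmt.Println("histograms:      ", len(histograms))            // 10000, unbounded
}
```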

nijave added the bug (Something isn't working) and needs triage (New item requiring triage) labels on Sep 22, 2023
github-actions bot added the processor/spanmetrics (Span Metrics processor) label on Sep 22, 2023
@github-actions
Contributor

Pinging code owners:

See Adding Labels via Comments if you do not have permissions to add labels yourself.

@nijave
Contributor Author

nijave commented Sep 23, 2023

Took a crack at a PR #27083

Frapschen removed the needs triage (New item requiring triage) label on Sep 25, 2023
@mfilipe

mfilipe commented Sep 25, 2023

Hello @nijave, I can confirm the issue in my environment:

[Screenshot 2023-09-25 at 17:48:46: exposed metric count in /metrics growing over time]

As you can see, the metrics exposed in /metrics only grow over time.

There are two details about your issue that don't match my environment: I'm on spanmetricsconnector and v0.85.0. Could you consider using those in your work? spanmetricsprocessor is deprecated.

@mfilipe

mfilipe commented Sep 25, 2023

My current config:

receivers:
  otlp:
    protocols:
      grpc:
exporters:
  prometheus:
    endpoint: 0.0.0.0:8889
    metric_expiration: 60s
connectors:
  spanmetrics:
    histogram:
      unit: "ms"
      explicit:
        buckets: []
    metrics_flush_interval: 15s
    dimensions:
      - name: build_name
      - name: build_number
    exclude_dimensions:
      - span.kind
    dimensions_cache_size: 100
processors:
  batch:
  attributes/spanmetrics:
    actions:
      - action: extract
        key: host.name
        pattern: ^(?P<kubernetes_cluster>.+)-jenkins-(?P<organization>tantofaz|whatever-org)-(?P<build_name>.+)-(?P<build_number>[0-9]+)(?P<build_id>(?:-[^-]+){2}|--.*?)$
  filter/spanmetrics:
    error_mode: ignore
    metrics:
      metric:
        - 'resource.attributes["service.name"] != "jenkins"'
service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [attributes/spanmetrics, batch]
      exporters: [spanmetrics]
    metrics:
      receivers: [spanmetrics]
      processors: [filter/spanmetrics]
      exporters: [prometheus]
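
If you want to watch the growth yourself, a small stand-alone program like the sketch below can poll the Prometheus exporter endpoint from the config above (the port comes from the `prometheus` exporter settings; the counting logic is only an illustration and is not part of the collector) and print a rough count of exposed series:

```go
package main

import (
	"bufio"
	"fmt"
	"net/http"
	"strings"
	"time"
)

// Polls the exporter's /metrics endpoint (port 8889 as configured above) once a
// minute and prints a rough count of exposed series, which makes the unbounded
// growth described in this issue easy to observe.
func main() {
	for {
		resp, err := http.Get("http://localhost:8889/metrics")
		if err != nil {
			fmt.Println("scrape failed:", err)
			time.Sleep(time.Minute)
			continue
		}
		series := 0
		sc := bufio.NewScanner(resp.Body)
		sc.Buffer(make([]byte, 0, 64*1024), 1024*1024) // allow unusually long lines (large label sets)
		for sc.Scan() {
			line := sc.Text()
			if line != "" && !strings.HasPrefix(line, "#") {
				series++ // every non-comment line is one exposed sample
			}
		}
		resp.Body.Close()
		fmt.Printf("%s exposed series: %d\n", time.Now().Format(time.RFC3339), series)
		time.Sleep(time.Minute)
	}
}
```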

@mfilipe

mfilipe commented Sep 29, 2023

This issue should be considered critical: /metrics grows in a way that makes the metric backend unstable over time when scraping, and there isn't a workaround. Basically, /metrics starts at a few kilobytes and, after a few days, reaches hundreds of megabytes, making the backend unstable. The environment where I have the problem is a common one: the spanmetrics connector exporting metrics to Prometheus, with many repositories generating metrics.

Based on this article, Prometheus only supports cumulative metrics, so I cannot use delta metrics to avoid the issue.

If there is a workaround for the issue, please let me know. AFAIK no workaround exists, which makes this issue critical.

MovieStoreGuy added a commit that referenced this issue Oct 4, 2023
Prune histograms when the dimension cache evictions are removed

**Description:**
Prunes histograms when the dimension cache is pruned. This prevents
metric series from growing indefinitely

**Link to tracking Issue:**
 #27080

**Testing:**
I modified the existing test to check `histograms` length instead of the
dimensions cache length. This required simulating ticks to hit the
`exportMetrics` function.


Co-authored-by: Sean Marciniak <30928402+MovieStoreGuy@users.noreply.github.com>
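
To make the change described above concrete, here is a minimal hand-rolled sketch of the eviction-hook idea (illustrative names only; the real implementation is in #27083): when the dimensions cache evicts a key, the same key is deleted from the histograms map, so both structures stay bounded by the cache size.

```go
package main

import (
	"container/list"
	"fmt"
)

type histogram struct{ count uint64 }

// lruCache is a tiny illustrative LRU keyed by metric key. The onEvict hook is
// the important part: it lets eviction also prune side tables, here the
// histograms map, so both stay bounded by the same cache size.
type lruCache struct {
	size    int
	order   *list.List
	items   map[string]*list.Element
	onEvict func(key string)
}

func newLRU(size int, onEvict func(string)) *lruCache {
	return &lruCache{size: size, order: list.New(), items: map[string]*list.Element{}, onEvict: onEvict}
}

func (c *lruCache) Add(key string) {
	if el, ok := c.items[key]; ok {
		c.order.MoveToFront(el)
		return
	}
	c.items[key] = c.order.PushFront(key)
	if c.order.Len() > c.size {
		oldest := c.order.Back()
		c.order.Remove(oldest)
		evicted := oldest.Value.(string)
		delete(c.items, evicted)
		if c.onEvict != nil {
			c.onEvict(evicted) // prune the histogram together with the dimensions entry
		}
	}
}

func main() {
	histograms := map[string]*histogram{}
	dims := newLRU(2, func(key string) { delete(histograms, key) })

	for _, key := range []string{"svc-a", "svc-b", "svc-c"} {
		dims.Add(key)
		histograms[key] = &histogram{count: 1}
	}

	// "svc-a" was evicted and its histogram was pruned along with it.
	fmt.Println(len(dims.items), len(histograms)) // 2 2
}
```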
nijave added a commit to nijave/opentelemetry-collector-contrib that referenced this issue Oct 4, 2023
Prune histograms when the dimension cache evictions are removed

@crobert-1
Member

crobert-1 commented Oct 12, 2023

@nijave From #27083 and your results shared here, it looks like this has been fixed. Is that correct? If so, we can close this issue.

Thanks for your help here!

nijave closed this as completed Oct 13, 2023
jmsnll pushed a commit to jmsnll/opentelemetry-collector-contrib that referenced this issue Nov 12, 2023
Prune histograms when the dimension cache evictions are removed
