otelcol_processor_tail_sampling_sampling_traces_on_memory is only incrementing, it is not a gauge #23648

Open
utezduyar opened this issue Jun 23, 2023 · 25 comments
Assignees
Labels
enhancement (New feature or request), processor/tailsampling (Tail sampling processor), Stale

Comments

@utezduyar
Contributor

Component(s)

processor/tailsampling

What happened?

Description

The help text of the metric indicates that it is a gauge, but the metric only ever increases, as if it were just a count of the spans processed.

Steps to Reproduce

Run the collector with the configuration below and send 300 spans. Wait 2 minutes. The metric does not go down:

# HELP otelcol_processor_tail_sampling_sampling_traces_on_memory Tracks the number of traces current on memory
# TYPE otelcol_processor_tail_sampling_sampling_traces_on_memory gauge
otelcol_processor_tail_sampling_sampling_traces_on_memory{service_instance_id="490bb5de-9af9-4d47-8b16-afb69583fbc7",service_name="otelcol-contrib",service_version="0.79.0"} 300

Expected Result

Actual Result

Collector version

0.79.0

Environment information

otelcol-contrib_0.79.0_darwin_arm64

OpenTelemetry Collector configuration

receivers:
  otlp:
    protocols:
      http:

exporters:
  logging:
    verbosity: detailed
    sampling_initial: 1000
    sampling_thereafter: 1000

processors:
  batch:
  tail_sampling:
    decision_wait: 60s
    policies:
      [
        {
          name: composite-policy,
          type: composite,
          composite:
            {
              max_total_spans_per_second: 10,
              policy_order: [composite-policy-errors, test-composite-always],
              composite_sub_policy:
                [
                  {
                    name: composite-policy-errors,
                    type: status_code,
                    status_code: {status_codes: [ERROR]}
                  },
                ],
              rate_allocation:
                [
                  {
                    policy: composite-policy-errors,
                    percent: 100
                  },
                ]
            }
        },
        {
          name: test-policy-8,
          type: rate_limiting,
          rate_limiting: {spans_per_second: 2}
        },
#        {
#          name: test-policy-1,
#          type: always_sample
#        },
#        {
#          name: test-policy-5,
#          type: status_code,
#          status_code: {status_codes: [ERROR]}
#        },
#        {
#          name: test-policy-4,
#          type: probabilistic,
#          probabilistic: {sampling_percentage: 50}
#        },
      ]

service:
  telemetry:
    logs:
      level: debug
    metrics:
      level: detailed
      address: ":8888"
  pipelines:
    traces:
      receivers: [otlp]
      #processors: [batch, tail_sampling]
      processors: [tail_sampling]
      exporters: [logging]

Log output

No response

Additional context

No response

@utezduyar utezduyar added bug Something isn't working needs triage New item requiring triage labels Jun 23, 2023
@github-actions github-actions bot added the processor/tailsampling Tail sampling processor label Jun 23, 2023
@github-actions
Contributor

Pinging code owners:

See Adding Labels via Comments if you do not have permissions to add labels yourself.

@jiekun
Member

jiekun commented Jun 25, 2023

Thanks. I'm trying to locate the issue.

processor/tailsamplingprocessor/processor.go

func (tsp *tailSamplingSpanProcessor) processTraces(resourceSpans ptrace.ResourceSpans) {
    ...
    if loaded {
        ...
    } else {
        newTraceIDs++
        tsp.decisionBatcher.AddToCurrentBatch(id)

        // Here we increase the gauge value
        tsp.numTracesOnMap.Add(1)
        postDeletion := false
        currTime := time.Now()
        for !postDeletion {
            select {
            case tsp.deleteChan <- id:
                postDeletion = true
            default:
                traceKeyToDrop := <-tsp.deleteChan

                // Here we decrease the gauge value
                tsp.dropTrace(traceKeyToDrop, currTime)
            }
        }
    }
    ...

As tsp.deleteChan is a buffered channel (with capacity 10, for example):

  1. We can keep sending trace IDs to the channel until it's full; the default case won't be triggered before we have 10 traces in memory.
  2. Once we have 10 traces in the channel, one has to be retrieved before another can be sent.
  3. If no more traces need to be handled, the metric never drops.

So the sampling_traces_on_memory gauge will keep increasing until it reaches the num_traces value in the config (default 50000), as illustrated by the sketch after the following config excerpt:

	// NumTraces is the number of traces kept on memory. Typically most of the data
	// of a trace is released after a sampling decision is taken.
	NumTraces uint64 `mapstructure:"num_traces"`
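
To make the plateau easier to see, here is a small self-contained sketch of the same buffered-channel eviction pattern (hypothetical names, not the processor's actual code); the counter only starts being decremented once the channel is full, which is exactly when the gauge stops at num_traces:

package main

import "fmt"

func main() {
    const numTraces = 3 // stands in for the num_traces setting
    deleteChan := make(chan int, numTraces)
    tracesOnMemory := 0 // stands in for the sampling_traces_on_memory gauge

    for id := 1; id <= 5; id++ {
        tracesOnMemory++ // gauge +1 for every new trace ID
        postDeletion := false
        for !postDeletion {
            select {
            case deleteChan <- id:
                postDeletion = true // room left: nothing is evicted
            default:
                evicted := <-deleteChan // channel full: evict the oldest trace
                tracesOnMemory--        // gauge -1 for the dropped trace
                fmt.Println("evicted trace", evicted)
            }
        }
        fmt.Println("after trace", id, "traces on memory =", tracesOnMemory)
    }
    // The counter climbs 1, 2, 3 and then stays at 3 (the channel capacity),
    // which is exactly the plateau reported in this issue.
}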

@utezduyar To verify, you could set num_traces to a lower value and check whether the sampling_traces_on_memory gauge reaches that value and then stays stable.

Ping @jpkrohling for help. I am not sure whether this is by design. I will be happy to follow up with a fix/improvement if necessary. Possible solutions:

  1. Execute dropTrace in another goroutine to remove the data from the sync.Map (a rough sketch follows below).
  2. Add more documentation about the metric and its memory usage, so that people know the tailsamplingprocessor keeps num_traces traces in memory and that the gauge is correct: it keeps increasing until it reaches the configured num_traces.
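
For reference, a rough self-contained sketch of what option 1 could look like (hypothetical names such as traceEntry and sweepDecidedTraces; not the processor's actual code): a background goroutine periodically sweeps the map and evicts traces whose decision has already been made, which is the point where the gauge would finally go down.

package sketch

import (
    "sync"
    "sync/atomic"
    "time"
)

// traceEntry stands in for sampling.TraceData: decisionMade would correspond
// to FinalDecision being set, arrival to ArrivalTime.
type traceEntry struct {
    decisionMade bool
    arrival      time.Time
}

// sweepDecidedTraces periodically deletes decided, expired traces from the map
// and decrements the gauge, instead of waiting for the channel-based eviction.
func sweepDecidedTraces(idToTrace *sync.Map, tracesOnMemory *atomic.Int64, decisionWait time.Duration, stop <-chan struct{}) {
    ticker := time.NewTicker(decisionWait)
    defer ticker.Stop()
    for {
        select {
        case <-stop:
            return
        case <-ticker.C:
            now := time.Now()
            idToTrace.Range(func(key, value any) bool {
                e := value.(*traceEntry)
                if e.decisionMade && now.Sub(e.arrival) > decisionWait {
                    idToTrace.Delete(key)
                    tracesOnMemory.Add(-1) // the gauge goes down here
                }
                return true
            })
        }
    }
}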

@Frapschen Frapschen added enhancement New feature or request and removed bug Something isn't working labels Jun 27, 2023
@Frapschen
Contributor

@jiekun I support option 2. Should I assign this issue to you?

@jiekun
Member

jiekun commented Jun 28, 2023

@Frapschen Sure, I'd like to optimize it. Do we need more discussion, or can I just go ahead now?

@utezduyar
Contributor Author

@jiekun I have tried your suggestion of lowering the number of traces; the metric capped at num_traces and stayed at that number.

I believe this metric is still not a gauge: it goes from 0 to num_traces and then neither goes up nor down. Am I missing something?

@jiekun
Member

jiekun commented Jun 28, 2023

@jiekun I have tried your suggestion of lowering the number of traces; the metric capped at num_traces and stayed at that number.

I believe this metric is still not a gauge: it goes from 0 to num_traces and then neither goes up nor down. Am I missing something?

Thanks for validating it.

This metric is a gauge. It refers to how many traces are stored in an in-memory sync.Map. The current implementation won't drop those traces and release the memory if no other traces are coming in. In other words:

  • Do we have enough space?
    • Yes -> Keep collecting new traces; the gauge keeps increasing.
    • No -> Drop 1 old trace and collect 1 new trace; the gauge goes -1 and then +1, which looks stable.

That's why the gauge looks more like a counter before reaching the memory limit.

@utezduyar
Contributor Author

Do you know why these traces are kept in memory if the decision_wait has passed for them? They have already expired. Shouldn't they be thrown away and lower the otelcol_processor_tail_sampling_sampling_traces_on_memory accordingly?

@jiekun
Member

jiekun commented Jun 28, 2023

I think decision_wait controls the workflow before the code block I posted above, so it's not relevant here.

Shouldn't they be thrown away and lower the otelcol_processor_tail_sampling_sampling_traces_on_memory accordingly?

I think you are correct. To me, that sounds more like the expected behavior than the current one.

We may just need to set up a goroutine to drop that stale in-memory trace data, but it is currently not implemented that way, so we may need a PR to optimize it.

I suggest we could have a little discussion on the Collector SIG meeting tomorrow. Feel free to attend and comment: https://docs.google.com/document/d/1r2JC5MB7GupCE7N32EwGEXs9V_YIsPgoFiLP4VWVMkE/edit#heading=h.rbf22rxu3mij

@utezduyar
Contributor Author

I have opened a feature request and linked it to this issue. If that one is resolved, I believe we can close this one; otherwise, maybe we should change the title to something like "documentation improvement".

@utezduyar
Contributor Author

@jiekun I mixed up the SIG time and therefore missed yesterday's SIG meeting. I believe you also said you were going to miss it. Do you know whether this item was discussed? If not, can we put it on the agenda for the next SIG?

@mizhexiaoxiao
Contributor

mizhexiaoxiao commented Jul 3, 2023

Do you know why these traces are kept in memory if the decision_wait has passed for them? They have already expired. Shouldn't they be thrown away and lower the otelcol_processor_tail_sampling_sampling_traces_on_memory accordingly?

The trace span data has already been deleted after decision_wait; the sync.Map idToTrace then retains all trace data except for ReceivedBatches. The following is the relevant code:

// Sampled or not, remove the batches
trace.Lock()
allSpans := trace.ReceivedBatches
trace.FinalDecision = decision
trace.ReceivedBatches = ptrace.NewTraces()
trace.Unlock()

I don't think the trace ID in memory should be deleted when decision_wait expires, because trace data can still be received after decision_wait, and those late spans are sampled or dropped according to the FinalDecision stored with the trace ID retained in memory.

d, loaded := tsp.idToTrace.Load(id)
if !loaded {
	d, loaded = tsp.idToTrace.LoadOrStore(id, &sampling.TraceData{
		Decisions:       initialDecisions,
		ArrivalTime:     time.Now(),
		SpanCount:       atomic.NewInt64(lenSpans),
		ReceivedBatches: ptrace.NewTraces(),
	})
}
if loaded {
	actualData.SpanCount.Add(lenSpans)
} else {
	...
}

// The only thing we really care about here is the final decision.
actualData.Lock()
finalDecision := actualData.FinalDecision

num_traces must be set large enough to ensure that the trace ID is still in idToTrace when the decision is made; the otelcol_processor_tail_sampling_sampling_trace_dropped_too_early metric shows whether trace IDs have been dropped too early.

statDroppedTooEarlyCount = stats.Int64("sampling_trace_dropped_too_early", "Count of traces that needed to be dropped the configured wait time", stats.UnitDimensionless)

for _, id := range batch {
    d, ok := tsp.idToTrace.Load(id)
    if !ok {
	    metrics.idNotFoundOnMapCount++
	    continue
    }

statDroppedTooEarlyCount.M(metrics.idNotFoundOnMapCount),

@github-actions
Contributor

github-actions bot commented Sep 4, 2023

This issue has been inactive for 60 days. It will be closed in 60 days if there is no activity. To ping code owners by adding a component label, see Adding Labels via Comments, or if you are unsure of which component this issue relates to, please ping @open-telemetry/collector-contrib-triagers. If this issue is still relevant, please ping the code owners or leave a comment explaining why it is still relevant. Otherwise, please close it.

Pinging code owners:

See Adding Labels via Comments if you do not have permissions to add labels yourself.

@github-actions github-actions bot added the Stale label Sep 4, 2023
@jpkrohling jpkrohling removed the Stale label Sep 5, 2023
@jiekun
Member

jiekun commented Oct 11, 2023

touch #23648

I assume we should add some documentation on this, based on the previous discussion?

@utezduyar
Contributor Author

@jpkrohling and I chatted about this, and he also confirmed that there is a bug.

@github-actions github-actions bot added the Stale label Feb 12, 2024
@jpkrohling jpkrohling removed the Stale label Feb 14, 2024
@github-actions github-actions bot added the Stale label Apr 15, 2024
@jpkrohling jpkrohling removed the Stale label Apr 30, 2024
@jpkrohling jpkrohling assigned jpkrohling and unassigned jiekun Apr 30, 2024
@github-actions github-actions bot added the Stale label Jul 1, 2024
@jpkrohling jpkrohling removed the Stale label Jul 8, 2024
@jpkrohling
Member

I believe this was fixed by #33426.

@roandudel

Hello @jpkrohling, I recently configured the Tail Sampling Processor and encountered this exact issue. I now understand that the metric is functioning correctly and that it is not a bug. My question is, what is the purpose of this metric? If the num_traces value is reached, does the metric become meaningless? Or am I missing something important? I came across this metric because I wanted to determine the number of spans currently in the decision loop.

@jpkrohling
Member

jpkrohling commented Jul 9, 2024

Upon further review, I see that the behavior is still the same, so I'm reopening.

To you as a user, the behavior is meaningless. Here's what happens behind the scenes:

  1. a trace is kept in memory indefinitely until the internal map is full
  2. once the map is full, the oldest entry would be removed in favor of a newer one
  3. on the happy path, a sampling decision for the entry being removed would have long been made
  4. under stress, an older trace might be dropped without a decision having been made

Only the situation in item 4 would be relevant to you, and it should be visible through the processor_tail_sampling_sampling_trace_dropped_too_early metric.

I believe we should just remove this metric; it's confusing.

@jamesmoessis, what do you think?

@jpkrohling jpkrohling reopened this Jul 9, 2024
@tommyallieri

My 2 cents on the matter, @jpkrohling, as I'm exploring this metric: I understand that processor_tail_sampling_sampling_trace_dropped_too_early helps you see when the boat is sinking, but at that point data loss has already occurred.

A metric showing how much is "in memory and active" would help in choosing an appropriate value for num_traces, and would make it possible to resize num_traces before data loss occurs. I too was hoping otelcol_processor_tail_sampling_sampling_traces_on_memory could be this metric.

Or perhaps a new metric is needed for active traces only, e.g. otelcol_processor_tail_sampling_sampling_active_traces_on_memory.
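
For illustration only, a minimal sketch (hypothetical type and method names, not part of the processor) of what such an active-traces gauge could track: traces still waiting for a decision, incremented when a trace ID is first stored and decremented as soon as a final decision is recorded, rather than when the entry is evicted from the map.

package sketch

import "sync/atomic"

// ActiveTraceTracker counts traces that are still inside the decision window,
// as opposed to traces merely retained in idToTrace after a decision.
type ActiveTraceTracker struct {
    active atomic.Int64
}

// OnNewTrace would be called when a trace ID is first stored in idToTrace.
func (t *ActiveTraceTracker) OnNewTrace() { t.active.Add(1) }

// OnDecision would be called when policy evaluation records a final decision.
func (t *ActiveTraceTracker) OnDecision() { t.active.Add(-1) }

// Value is what a hypothetical
// otelcol_processor_tail_sampling_sampling_active_traces_on_memory gauge
// would export.
func (t *ActiveTraceTracker) Value() int64 { return t.active.Load() }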

@jpkrohling
Member

I agree with you that we should have a better metric for that. This does require a refactor of the tail-sampling processor's internal cache, which should be done as part of #31580.

@github-actions github-actions bot added the Stale label Jan 28, 2025