otelcol_processor_tail_sampling_sampling_traces_on_memory is only incrementing, it is not a gauge #23648

Open
utezduyar opened this issue Jun 23, 2023 · 25 comments
Assignees
Labels
enhancement (New feature or request), processor/tailsampling (Tail sampling processor), Stale

Comments

@utezduyar
Contributor

Component(s)

processor/tailsampling

What happened?

Description

The help text of the metric indicates that it is a gauge, but the metric only ever increases, as if it were just a count of the spans processed.

Steps to Reproduce

Run the collector with the configuration below and send 300 spans. Wait 2 minutes. The metric does not go down:

# HELP otelcol_processor_tail_sampling_sampling_traces_on_memory Tracks the number of traces current on memory
# TYPE otelcol_processor_tail_sampling_sampling_traces_on_memory gauge
otelcol_processor_tail_sampling_sampling_traces_on_memory{service_instance_id="490bb5de-9af9-4d47-8b16-afb69583fbc7",service_name="otelcol-contrib",service_version="0.79.0"} 300

Expected Result

Actual Result

Collector version

0.79.0

Environment information

otelcol-contrib_0.79.0_darwin_arm64

OpenTelemetry Collector configuration

receivers:
  otlp:
    protocols:
      http:

exporters:
  logging:
    verbosity: detailed
    sampling_initial: 1000
    sampling_thereafter: 1000

processors:
  batch:
  tail_sampling:
    decision_wait: 60s
    policies:
      [
        {
          name: composite-policy,
          type: composite,
          composite:
            {
              max_total_spans_per_second: 10,
              policy_order: [composite-policy-errors, test-composite-always],
              composite_sub_policy:
                [
                  {
                    name: composite-policy-errors,
                    type: status_code,
                    status_code: {status_codes: [ERROR]}
                  },
                ],
              rate_allocation:
                [
                  {
                    policy: composite-policy-errors,
                    percent: 100
                  },
                ]
            }
        },
        {
          name: test-policy-8,
          type: rate_limiting,
          rate_limiting: {spans_per_second: 2}
        },
#        {
#          name: test-policy-1,
#          type: always_sample
#        },
#        {
#          name: test-policy-5,
#          type: status_code,
#          status_code: {status_codes: [ERROR]}
#        },
#        {
#          name: test-policy-4,
#          type: probabilistic,
#          probabilistic: {sampling_percentage: 50}
#        },
      ]

service:
  telemetry:
    logs:
      level: debug
    metrics:
      level: detailed
      address: ":8888"
  pipelines:
    traces:
      receivers: [otlp]
      #processors: [batch, tail_sampling]
      processors: [tail_sampling]
      exporters: [logging]

Log output

No response

Additional context

No response

@utezduyar utezduyar added bug Something isn't working needs triage New item requiring triage labels Jun 23, 2023
@github-actions github-actions bot added the processor/tailsampling Tail sampling processor label Jun 23, 2023
@github-actions
Contributor

Pinging code owners:

See Adding Labels via Comments if you do not have permissions to add labels yourself.

@jiekun
Member

jiekun commented Jun 25, 2023

Thanks. I'm trying to locate the issue.

processor/tailsamplingprocessor/processor.go

func (tsp *tailSamplingSpanProcessor) processTraces(resourceSpans ptrace.ResourceSpans) {
    ...
    if loaded {
        ...
    } else {
        newTraceIDs++
        tsp.decisionBatcher.AddToCurrentBatch(id)

        // Here we increase the gauge value
        tsp.numTracesOnMap.Add(1)
        postDeletion := false
        currTime := time.Now()
        for !postDeletion {
            select {
            case tsp.deleteChan <- id:
                postDeletion = true
            default:
                traceKeyToDrop := <-tsp.deleteChan

                // Here we decrease the gauge value
                tsp.dropTrace(traceKeyToDrop, currTime)
            }
        }
    }
    ...

As tsp.deleteChan is a buffered channel (with capacity 10, for example):

  1. We can keep sending trace IDs to the channel until it's full; the default case won't be triggered before we have 10 traces in memory.
  2. Once we have 10 traces in the channel, one has to be retrieved before another can be sent.
  3. If no more traces need to be handled, the metric never drops.

So the sampling_traces_on_memory gauge will keep increasing until it reaches the num_traces value in the config (default 50000), as illustrated by the sketch after the following config excerpt:

	// NumTraces is the number of traces kept on memory. Typically most of the data
	// of a trace is released after a sampling decision is taken.
	NumTraces uint64 `mapstructure:"num_traces"`
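
To make the plateau easier to see, here is a small self-contained sketch of the same buffered-channel eviction pattern (hypothetical names, not the processor's actual code); the counter only starts being decremented once the channel is full, which is exactly when the gauge stops at num_traces:

package main

import "fmt"

func main() {
    const numTraces = 3 // stands in for the num_traces setting
    deleteChan := make(chan int, numTraces)
    tracesOnMemory := 0 // stands in for the sampling_traces_on_memory gauge

    for id := 1; id <= 5; id++ {
        tracesOnMemory++ // gauge +1 for every new trace ID
        postDeletion := false
        for !postDeletion {
            select {
            case deleteChan <- id:
                postDeletion = true // room left: nothing is evicted
            default:
                evicted := <-deleteChan // channel full: evict the oldest trace
                tracesOnMemory--        // gauge -1 for the dropped trace
                fmt.Println("evicted trace", evicted)
            }
        }
        fmt.Println("after trace", id, "traces on memory =", tracesOnMemory)
    }
    // The counter climbs 1, 2, 3 and then stays at 3 (the channel capacity),
    // which is exactly the plateau reported in this issue.
}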

@utezduyar To verify, you could set num_traces to a lower value and check whether the sampling_traces_on_memory gauge reaches that value and then stays stable.

Ping @jpkrohling for help. I am not sure whether this is by design. I will be happy to follow up with a fix/improvement if necessary. Possible solutions:

  1. Execute dropTrace in another goroutine to remove the data from the sync.Map (a rough sketch follows below).
  2. Add more documentation about the metric and its memory usage, so that people know the tailsamplingprocessor keeps num_traces traces in memory and that the gauge is correct: it keeps increasing until it reaches the configured num_traces.
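
For reference, a rough self-contained sketch of what option 1 could look like (hypothetical names such as traceEntry and sweepDecidedTraces; not the processor's actual code): a background goroutine periodically sweeps the map and evicts traces whose decision has already been made, which is the point where the gauge would finally go down.

package sketch

import (
    "sync"
    "sync/atomic"
    "time"
)

// traceEntry stands in for sampling.TraceData: decisionMade would correspond
// to FinalDecision being set, arrival to ArrivalTime.
type traceEntry struct {
    decisionMade bool
    arrival      time.Time
}

// sweepDecidedTraces periodically deletes decided, expired traces from the map
// and decrements the gauge, instead of waiting for the channel-based eviction.
func sweepDecidedTraces(idToTrace *sync.Map, tracesOnMemory *atomic.Int64, decisionWait time.Duration, stop <-chan struct{}) {
    ticker := time.NewTicker(decisionWait)
    defer ticker.Stop()
    for {
        select {
        case <-stop:
            return
        case <-ticker.C:
            now := time.Now()
            idToTrace.Range(func(key, value any) bool {
                e := value.(*traceEntry)
                if e.decisionMade && now.Sub(e.arrival) > decisionWait {
                    idToTrace.Delete(key)
                    tracesOnMemory.Add(-1) // the gauge goes down here
                }
                return true
            })
        }
    }
}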

@Frapschen Frapschen added enhancement New feature or request and removed bug Something isn't working labels Jun 27, 2023
@Frapschen
Contributor

@jiekun I support option 2. Should I assign this issue to you?

@jiekun
Member

jiekun commented Jun 28, 2023

@Frapschen Sure, I'd like to optimize it. Do we need more discussion, or can I just go ahead now?

@utezduyar
Contributor Author

@jiekun I have tried your suggestion of lowering the number of traces; the metric capped at num_traces and stayed at that number.

I believe this metric is still not a gauge: it goes from 0 to num_traces and then neither goes up nor down. Am I missing something?

@jiekun
Member

jiekun commented Jun 28, 2023

@jiekun I have tried your suggestion of lowering the number of traces; the metric capped at num_traces and stayed at that number.

I believe this metric is still not a gauge: it goes from 0 to num_traces and then neither goes up nor down. Am I missing something?

Thanks for validating it.

This metric is a gauge. It refers to how many traces are stored in an in-memory sync.Map. The current implementation won't drop those traces and release the memory if no other traces are coming in. In other words:

  • Do we have enough space?
    • Yes -> Keep collecting new traces; the gauge keeps increasing.
    • No -> Drop 1 old trace and collect 1 new trace; the gauge goes -1 and then +1, which looks stable.

That's why the gauge looks more like a counter before reaching the memory limit.

@utezduyar
Contributor Author

Do you know why these traces are kept in memory if the decision_wait has passed for them? They have already expired. Shouldn't they be thrown away and lower the otelcol_processor_tail_sampling_sampling_traces_on_memory accordingly?

@jiekun
Member

jiekun commented Jun 28, 2023

I think decision_wait controls the workflow before the code block I posted above, so it's not relevant here.

Shouldn't they be thrown away and lower the otelcol_processor_tail_sampling_sampling_traces_on_memory accordingly?

I think you are correct. To me, that sounds more like the expected behavior than the current one.

We may just need to set up a goroutine to drop that stale in-memory trace data, but it is currently not implemented that way, so we may need a PR to optimize it.

I suggest we could have a little discussion on the Collector SIG meeting tomorrow. Feel free to attend and comment: https://docs.google.com/document/d/1r2JC5MB7GupCE7N32EwGEXs9V_YIsPgoFiLP4VWVMkE/edit#heading=h.rbf22rxu3mij

@utezduyar
Contributor Author

I have opened a feature request and linked it to this issue. If that one is resolved, I believe we can close this one; otherwise, maybe we should change the title to something like "documentation improvement".

@utezduyar
Contributor Author

@jiekun I mixed up the SIG time and therefore missed yesterday's SIG meeting. I believe you also said you were going to miss it. Do you know whether this item was discussed? If not, can we put it on the agenda for the next SIG?

@mizhexiaoxiao
Contributor

mizhexiaoxiao commented Jul 3, 2023

Do you know why these traces are kept in memory if the decision_wait has passed for them? They have already expired. Shouldn't they be thrown away and lower the otelcol_processor_tail_sampling_sampling_traces_on_memory accordingly?

The trace span data has already been deleted after decision_wait; the sync.Map idToTrace then retains all trace data except for ReceivedBatches. The following is the relevant code:

// Sampled or not, remove the batches
trace.Lock()
allSpans := trace.ReceivedBatches
trace.FinalDecision = decision
trace.ReceivedBatches = ptrace.NewTraces()
trace.Unlock()

I don't think the trace ID in memory should be deleted when decision_wait expires, because trace data can still be received after decision_wait, and those late spans are sampled or dropped according to the FinalDecision stored with the trace ID retained in memory.

d, loaded := tsp.idToTrace.Load(id)
if !loaded {
	d, loaded = tsp.idToTrace.LoadOrStore(id, &sampling.TraceData{
		Decisions:       initialDecisions,
		ArrivalTime:     time.Now(),
		SpanCount:       atomic.NewInt64(lenSpans),
		ReceivedBatches: ptrace.NewTraces(),
	})
}
if loaded {
	actualData.SpanCount.Add(lenSpans)
} else {
	...
}

// The only thing we really care about here is the final decision.
actualData.Lock()
finalDecision := actualData.FinalDecision

num_traces must be set large enough to ensure that the trace ID is still in idToTrace when the decision is made; the otelcol_processor_tail_sampling_sampling_trace_dropped_too_early metric shows whether trace IDs have been dropped too early.

statDroppedTooEarlyCount = stats.Int64("sampling_trace_dropped_too_early", "Count of traces that needed to be dropped the configured wait time", stats.UnitDimensionless)

for _, id := range batch {
    d, ok := tsp.idToTrace.Load(id)
    if !ok {
	    metrics.idNotFoundOnMapCount++
	    continue
    }

statDroppedTooEarlyCount.M(metrics.idNotFoundOnMapCount),

@github-actions
Contributor

github-actions bot commented Sep 4, 2023

This issue has been inactive for 60 days. It will be closed in 60 days if there is no activity. To ping code owners by adding a component label, see Adding Labels via Comments, or if you are unsure of which component this issue relates to, please ping @open-telemetry/collector-contrib-triagers. If this issue is still relevant, please ping the code owners or leave a comment explaining why it is still relevant. Otherwise, please close it.

Pinging code owners:

See Adding Labels via Comments if you do not have permissions to add labels yourself.

@github-actions github-actions bot added the Stale label Sep 4, 2023
@jpkrohling jpkrohling removed the Stale label Sep 5, 2023
@jiekun
Member

jiekun commented Oct 11, 2023

touch #23648

I assume we should add some documentation on this, based on the previous discussion?

@utezduyar
Contributor Author

@jpkrohling and I chatted about this, and he also confirmed that there is a bug.

@github-actions github-actions bot added the Stale label Feb 12, 2024
@jpkrohling jpkrohling removed the Stale label Feb 14, 2024
@github-actions github-actions bot added the Stale label Apr 15, 2024
@jpkrohling jpkrohling removed the Stale label Apr 30, 2024
@jpkrohling jpkrohling assigned jpkrohling and unassigned jiekun Apr 30, 2024
@github-actions github-actions bot added the Stale label Jul 1, 2024
@jpkrohling jpkrohling removed the Stale label Jul 8, 2024
@jpkrohling
Member

I believe this was fixed by #33426.

@roandudel

Hello @jpkrohling, I recently configured the Tail Sampling Processor and encountered this exact issue. I now understand that the metric is functioning correctly and that it is not a bug. My question is, what is the purpose of this metric? If the num_traces value is reached, does the metric become meaningless? Or am I missing something important? I came across this metric because I wanted to determine the number of spans currently in the decision loop.

@jpkrohling
Member

jpkrohling commented Jul 9, 2024

Upon further review, I see that the behavior is still the same, so I'm reopening.

To you as a user, the behavior is meaningless. Here's what happens behind the scenes:

  1. a trace is kept in memory indefinitely until the internal map is full
  2. once the map is full, the oldest entry would be removed in favor of a newer one
  3. on the happy path, a sampling decision for the entry being removed would have long been made
  4. under stress, an older trace might be dropped without a decision having been made

Only the situation in item 4 would be relevant to you, and it should be visible through the processor_tail_sampling_sampling_trace_dropped_too_early metric.

I believe we should just remove this metric; it's confusing.

@jamesmoessis, what do you think?

@jpkrohling jpkrohling reopened this Jul 9, 2024
@tommyallieri

My 2 cents on the matter, @jpkrohling, as I'm exploring this metric: I understand that processor_tail_sampling_sampling_trace_dropped_too_early helps you see when the boat is sinking, but at that point data loss has already occurred.

A metric showing how much is "in memory and active" would help in choosing an appropriate value for num_traces, and would make it possible to resize num_traces before data loss occurs. I too was hoping otelcol_processor_tail_sampling_sampling_traces_on_memory could be this metric.

Or perhaps a new metric is needed for active traces only, e.g. otelcol_processor_tail_sampling_sampling_active_traces_on_memory.
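
For illustration only, a minimal sketch (hypothetical type and method names, not part of the processor) of what such an active-traces gauge could track: traces still waiting for a decision, incremented when a trace ID is first stored and decremented as soon as a final decision is recorded, rather than when the entry is evicted from the map.

package sketch

import "sync/atomic"

// ActiveTraceTracker counts traces that are still inside the decision window,
// as opposed to traces merely retained in idToTrace after a decision.
type ActiveTraceTracker struct {
    active atomic.Int64
}

// OnNewTrace would be called when a trace ID is first stored in idToTrace.
func (t *ActiveTraceTracker) OnNewTrace() { t.active.Add(1) }

// OnDecision would be called when policy evaluation records a final decision.
func (t *ActiveTraceTracker) OnDecision() { t.active.Add(-1) }

// Value is what a hypothetical
// otelcol_processor_tail_sampling_sampling_active_traces_on_memory gauge
// would export.
func (t *ActiveTraceTracker) Value() int64 { return t.active.Load() }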

@jpkrohling
Member

I agree with you that we should have a better metric for that. This does require a refactor of the tail-sampling processor's internal cache, which should be done as part of #31580.

@github-actions github-actions bot added the Stale label Jan 28, 2025