otelcol_processor_tail_sampling_sampling_traces_on_memory is only incrementing, it is not a gauge #23648
Comments
Thanks. I'm trying to locate the issue.
```go
func (tsp *tailSamplingSpanProcessor) processTraces(resourceSpans ptrace.ResourceSpans) {
...
if loaded {
...
} else {
newTraceIDs++
tsp.decisionBatcher.AddToCurrentBatch(id)
// Here we increase the gauge value
tsp.numTracesOnMap.Add(1)
postDeletion := false
currTime := time.Now()
for !postDeletion {
select {
case tsp.deleteChan <- id:
postDeletion = true
default:
traceKeyToDrop := <-tsp.deleteChan
// Here we decrease the gauge value
tsp.dropTrace(traceKeyToDrop, currTime)
}
}
}
	...
}
```
As the snippet shows, the gauge goes up for each new trace and comes down only when an old trace is evicted to make room. So the number of traces kept on the map is bounded by the num_traces setting:
```go
// NumTraces is the number of traces kept on memory. Typically most of the data
// of a trace is released after a sampling decision is taken.
NumTraces uint64 `mapstructure:"num_traces"`
```
@utezduyar To verify, you could configure a smaller num_traces and check whether the metric caps at that value. Ping @jpkrohling for help. I am not sure if it's written by design. I will be happy to follow up with the fix/improvement if necessary. Possible solution:
|
@jiekun I support option 2. Should I assign this issue to you? |
@Frapschen Sure, I want to optimize it. Do we need more discussion, or can I just go ahead now? |
@jiekun I have tried your suggestion of lowering the number of traces, and the metric capped at num_traces and stayed at that number. I believe this metric is still not behaving like a gauge: it goes from 0 to num_traces and then never moves up or down. Am I missing something? |
Thanks for validating it. This metric is a gauge: it refers to how many traces are stored in the in-memory map, and an entry is only evicted when a new trace needs its slot.
That's why the gauge looks more like a counter before reaching the memory limit. |
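To make that concrete, here is a small, self-contained Go sketch (not the processor's code; the names and the fixed cap of 5 are illustrative) of the bounded-channel pattern from the snippet above. The gauge climbs one-for-one with new traces until the channel is full; after that every insert evicts an old ID, so the value stays pinned at the cap.
```go
package main

import (
	"fmt"
	"sync/atomic"
)

func main() {
	const numTraces = 5 // stand-in for the num_traces setting

	deleteChan := make(chan int, numTraces) // bounded buffer of trace IDs
	var tracesOnMap atomic.Int64            // stand-in for the sampling_traces_on_memory gauge

	for id := 0; id < 12; id++ {
		tracesOnMap.Add(1) // new trace seen: gauge goes up
		select {
		case deleteChan <- id: // still room, nothing to evict
		default:
			<-deleteChan        // buffer full: evict the oldest ID...
			tracesOnMap.Add(-1) // ...and the gauge goes down again
			deleteChan <- id
		}
		fmt.Printf("after trace %2d: gauge=%d\n", id, tracesOnMap.Load())
	}
	// The printed value climbs 1..5 and then stays at 5: a gauge that looks
	// like a counter until the limit is reached.
}
```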
Do you know why these traces are kept in memory if the sampling decision has already been taken? I think |
I think you are correct. To me, that sounds more like the expected behavior than what we have today. We may just need to set up a goroutine to drop that useless in-memory trace data, but it is not currently implemented like that, so we may need a PR to optimize it. I suggest we have a short discussion at the Collector SIG meeting tomorrow. Feel free to attend and comment: https://docs.google.com/document/d/1r2JC5MB7GupCE7N32EwGEXs9V_YIsPgoFiLP4VWVMkE/edit#heading=h.rbf22rxu3mij |
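For illustration, the cleanup goroutine mentioned above could look roughly like the sketch below. This is not existing processor code: the traceEntry type, the keepFor parameter, and the choice to delete the whole entry (rather than only its buffered spans) are assumptions made here for the sake of the example.
```go
package sketch

import (
	"sync"
	"sync/atomic"
	"time"
)

// traceEntry is a simplified stand-in for the processor's per-trace state.
type traceEntry struct {
	decisionMade bool
	decisionTime time.Time
}

// startCleanup periodically walks the trace map and removes entries whose
// sampling decision is older than keepFor, keeping the on-memory gauge in sync.
func startCleanup(idToTrace *sync.Map, tracesOnMap *atomic.Int64, keepFor time.Duration) {
	go func() {
		ticker := time.NewTicker(keepFor)
		defer ticker.Stop()
		for range ticker.C {
			idToTrace.Range(func(key, value any) bool {
				e := value.(*traceEntry)
				if e.decisionMade && time.Since(e.decisionTime) > keepFor {
					idToTrace.Delete(key)
					tracesOnMap.Add(-1) // mirror the gauge decrement done on eviction
				}
				return true // keep iterating
			})
		}
	}()
}
```
Whether the whole entry should go is exactly the question raised in the next comment, which argues the trace ID itself needs to stay on the map.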
I have opened a feature request and linked it to this one. If that one is fixed, I believe we can close this issue; otherwise, maybe we should change the title to "documentation improvement" somehow. |
@jiekun I mixed up the SIG time, so I missed yesterday's SIG. I believe you also said you were going to miss it. Do you know if this item was discussed? If not, can we put it on the agenda for the next SIG? |
Trace span data has already been deleted after the sampling decision is taken:
```go
// Sampled or not, remove the batches
trace.Lock()
allSpans := trace.ReceivedBatches
trace.FinalDecision = decision
trace.ReceivedBatches = ptrace.NewTraces()
trace.Unlock()
```
I don't think the trace ID in memory should be deleted, as the following code still relies on finding it when late spans for the same trace arrive:
```go
d, loaded := tsp.idToTrace.Load(id)
if !loaded {
	d, loaded = tsp.idToTrace.LoadOrStore(id, &sampling.TraceData{
		Decisions:       initialDecisions,
		ArrivalTime:     time.Now(),
		SpanCount:       atomic.NewInt64(lenSpans),
		ReceivedBatches: ptrace.NewTraces(),
	})
}
if loaded {
	actualData.SpanCount.Add(lenSpans)
} else {
	...
}

// The only thing we really care about here is the final decision.
actualData.Lock()
finalDecision := actualData.FinalDecision
```
If the ID were removed from the map, a late span would instead be counted by this metric:
```go
statDroppedTooEarlyCount = stats.Int64("sampling_trace_dropped_too_early", "Count of traces that needed to be dropped before the configured wait time", stats.UnitDimensionless)
```
```go
for _, id := range batch {
	d, ok := tsp.idToTrace.Load(id)
	if !ok {
		metrics.idNotFoundOnMapCount++
		continue
	}
	...
}
...
statDroppedTooEarlyCount.M(metrics.idNotFoundOnMapCount),
```
|
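A self-contained sketch of the behavior that keeping the ID enables (the types and the handleLateSpan function are simplified stand-ins, not the processor's real ones): a span arriving after the decision can still look up the stored final decision, while a span whose trace ID has already been evicted can only be counted as dropped too early.
```go
package main

import (
	"fmt"
	"sync"
)

type decision int

const (
	pending decision = iota
	sampled
	notSampled
)

// traceData is a simplified stand-in for the per-trace state kept on the map.
type traceData struct {
	sync.Mutex
	finalDecision decision
}

// handleLateSpan shows what a span arriving after the decision can do,
// depending on whether its trace ID is still on the map.
func handleLateSpan(idToTrace *sync.Map, traceID string) string {
	d, ok := idToTrace.Load(traceID)
	if !ok {
		// Mirrors the quoted policy path: idNotFoundOnMapCount feeds the
		// sampling_trace_dropped_too_early stat.
		return "dropped too early"
	}
	td := d.(*traceData)
	td.Lock()
	defer td.Unlock()
	switch td.finalDecision {
	case sampled:
		return "forward immediately"
	case notSampled:
		return "drop"
	default:
		return "buffer until the decision is made"
	}
}

func main() {
	m := &sync.Map{}
	m.Store("trace-a", &traceData{finalDecision: sampled})
	fmt.Println(handleLateSpan(m, "trace-a")) // forward immediately
	fmt.Println(handleLateSpan(m, "trace-b")) // dropped too early
}
```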
I assume we should add some doc on this based on previous discussion? |
@jpkrohling and I chatted about this, and he also confirmed that there is a bug. |
I believe this was fixed in #33426 |
Hello @jpkrohling, I recently configured the Tail Sampling Processor and encountered this exact issue. I now understand that the metric is functioning correctly and that it is not a bug. My question is, what is the purpose of this metric? If the |
Upon further review, I see that the behavior is still the same, so I'm reopening. To you as a user, the behavior is meaningless. Here's what happens behind the scenes:
Only the situation in 4 would be relevant to you, and it should be visible through the metric. I believe we should just remove this metric; it's confusing. @jamesmoessis, what do you think? |
My 2 cents on the matter, @jpkrohling, as I'm exploring this metric. I understand that it tracks everything currently held on the map. A metric showing how much is "in memory and active" would be helpful to better understand what an appropriate value for num_traces is. Or perhaps a new metric is needed for active traces only. |
I agree with you that we should have a better metric for that. This does require a refactor of the tail-sampling processor's internal cache, which should be done as part of #31580. |
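As a rough illustration of the split discussed here (the names and hook points below are hypothetical, not an existing API in the processor): one value for everything held on the map, which is what the current gauge reports, and a second for traces still awaiting a decision, which is the number that would actually help when sizing num_traces.
```go
package main

import (
	"fmt"
	"sync/atomic"
)

// tailSamplingGauges is a hypothetical pair of values: one for everything
// held on the map, one only for traces still awaiting a sampling decision.
type tailSamplingGauges struct {
	tracesOnMap  atomic.Int64 // what sampling_traces_on_memory reports today
	activeTraces atomic.Int64 // traces without a final decision yet
}

func (g *tailSamplingGauges) onNewTrace() {
	g.tracesOnMap.Add(1)
	g.activeTraces.Add(1)
}

func (g *tailSamplingGauges) onDecisionMade() {
	g.activeTraces.Add(-1) // decision taken, but the entry stays on the map
}

func (g *tailSamplingGauges) onTraceEvicted() {
	g.tracesOnMap.Add(-1) // entry pushed out to make room for a new trace
}

func main() {
	var g tailSamplingGauges
	g.onNewTrace()
	g.onNewTrace()
	g.onDecisionMade()
	fmt.Println("on map:", g.tracesOnMap.Load(), "active:", g.activeTraces.Load())
	// Prints: on map: 2 active: 1
}
```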
Component(s)
processor/tailsampling
What happened?
Description
The help text of the metric indicates that it is a gauge. This metric is only increasing, as if it is just the count of spans processed.
Steps to Reproduce
Run this collector, send 300 spans, and wait 2 minutes. Observe that the metric does not go down.
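For reference, a minimal sender sketch (not part of the original report; it assumes the collector's OTLP gRPC receiver is listening on the default localhost:4317) that produces 300 single-span traces:
```go
package main

import (
	"context"
	"fmt"

	"go.opentelemetry.io/otel/exporters/otlp/otlptrace/otlptracegrpc"
	sdktrace "go.opentelemetry.io/otel/sdk/trace"
)

func main() {
	ctx := context.Background()

	// Export over OTLP/gRPC to a local collector (assumed on localhost:4317).
	exp, err := otlptracegrpc.New(ctx, otlptracegrpc.WithInsecure())
	if err != nil {
		panic(err)
	}
	tp := sdktrace.NewTracerProvider(sdktrace.WithBatcher(exp))

	// Each Start/End pair below is its own root span, so this creates 300 traces.
	tracer := tp.Tracer("tail-sampling-repro")
	for i := 0; i < 300; i++ {
		_, span := tracer.Start(ctx, fmt.Sprintf("span-%d", i))
		span.End()
	}

	// Shutdown flushes the batcher; afterwards, watch
	// otelcol_processor_tail_sampling_sampling_traces_on_memory on the
	// collector's metrics endpoint.
	_ = tp.Shutdown(ctx)
}
```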
Expected Result
Actual Result
Collector version
0.79.0
Environment information
otelcol-contrib_0.79.0_darwin_arm64
OpenTelemetry Collector configuration
Log output
No response
Additional context
No response