
bugfix: native histogram: data race in addExemplar #1609

Closed
wants to merge 4 commits

Conversation

krajorama
Member

@krajorama krajorama commented Aug 31, 2024

Fixes: #1605

The slice cap function is not thread safe. Instead, reuse the ttl field, stored atomically, with -1 indicating that exemplars are turned off.

Includes the fix for #1607 from #1608.

Signed-off-by: György Krajcsovits <gyorgy.krajcsovits@grafana.com>
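For illustration, here is a minimal sketch of the approach described above; it is not the actual client_golang code, and all names are hypothetical. The exemplar TTL is kept in an atomically accessed integer with -1 as the "exemplars disabled" sentinel, so the hot path never has to read cap(exemplars), which is not synchronized with concurrent appends.

```go
package main

import (
	"fmt"
	"sync/atomic"
	"time"
)

// exemplarState is a hypothetical stand-in for the native histogram's
// exemplar bookkeeping: the TTL doubles as the on/off switch.
type exemplarState struct {
	ttl atomic.Int64 // TTL in nanoseconds; -1 means exemplars are turned off.
}

func newExemplarState(ttl time.Duration, enabled bool) *exemplarState {
	s := &exemplarState{}
	if !enabled {
		s.ttl.Store(-1)
	} else {
		s.ttl.Store(int64(ttl))
	}
	return s
}

// enabled is the check the hot path would use instead of cap(exemplars).
func (s *exemplarState) enabled() bool {
	return s.ttl.Load() != -1
}

func main() {
	on := newExemplarState(5*time.Minute, true)
	off := newExemplarState(0, false)
	fmt.Println(on.enabled(), off.enabled()) // true false
}
```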
Instead of checking the non-thread-safe slice capacity

Signed-off-by: György Krajcsovits <gyorgy.krajcsovits@grafana.com>
mdIdx was redundant when len(exemplars) > 1, so it was removed; rIdx
is enough.

Don't compare the timestamp of the incoming exemplar to the timestamp
of the minimal-distance exemplar. Most of the time the incoming
exemplar will be newer. And if not, the previous code just replaced an
exemplar one index after the minimal-distance exemplar, which had an
index-out-of-range bug and is essentially random anyway.

Signed-off-by: György Krajcsovits <gyorgy.krajcsovits@grafana.com>

# Conflicts:
#	prometheus/histogram.go
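The replacement rule described in the commit message above can be sketched roughly as follows. This is a simplified, hypothetical illustration, not the code in prometheus/histogram.go: among neighbouring exemplars sorted by value, find the pair whose values are closest on a log scale and replace the older exemplar of that pair with the incoming one.

```go
package main

import (
	"fmt"
	"math"
	"time"
)

type exemplar struct {
	value float64
	ts    time.Time
}

// indexToReplace assumes exs is sorted by value and has at least two
// entries; it returns the index of the older exemplar within the
// closest pair, i.e. the slot the new exemplar should take.
func indexToReplace(exs []exemplar) int {
	md := math.Inf(1) // smallest log-delta seen so far
	rIdx := -1        // index of the older exemplar within the closest pair
	pLog := math.Log(exs[0].value)
	for i := 1; i < len(exs); i++ {
		cLog := math.Log(exs[i].value)
		if cLog-pLog < md {
			md = cLog - pLog
			if exs[i].ts.Before(exs[i-1].ts) {
				rIdx = i // the right-hand exemplar of the pair is older
			} else {
				rIdx = i - 1 // the left-hand exemplar of the pair is older
			}
		}
		pLog = cLog
	}
	return rIdx
}

func main() {
	now := time.Now()
	exs := []exemplar{
		{value: 1.0, ts: now.Add(-3 * time.Minute)},
		{value: 2.0, ts: now.Add(-2 * time.Minute)},
		{value: 2.1, ts: now.Add(-1 * time.Minute)}, // closest pair: indices 1 and 2
	}
	fmt.Println(indexToReplace(exs)) // 1: the older exemplar of the closest pair
}
```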
@krajorama krajorama marked this pull request as ready for review August 31, 2024 11:08
@krajorama krajorama requested a review from beorn7 August 31, 2024 11:08
@krajorama
Member Author

cc @fatsheep9146

@beorn7
Member

beorn7 commented Sep 1, 2024

Code looks good, but I'm not quite sure if we should do this kind of optimization via ttl as an atomic int. Those atomic operations still have a cost (and it could be significant on systems with many cores). The usual code path is to call ObserveWithExemplar on a histogram with configured exemplars, and just call Observe on a histogram where exemplars are not configured. The former case becomes even slower with this change here, while the latter case isn't affected at all.
I'm wondering if there is a legitimate use case for calling ObserveWithExemplar on a histogram with exemplars disabled. And even if so, if it is common enough to make this PR a net gain.
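To put numbers on that concern, one could run a micro-benchmark along these lines; this is a hypothetical sketch (names are not from client_golang) comparing a plain int64 read against an atomic load on a hot path, run with `go test -bench=.` from a *_test.go file.

```go
package ttlbench

import (
	"sync/atomic"
	"testing"
)

var (
	plainTTL  int64 = 5
	atomicTTL atomic.Int64
	sink      int64 // prevents the compiler from eliminating the loops
)

func BenchmarkPlainTTLRead(b *testing.B) {
	var local int64
	for i := 0; i < b.N; i++ {
		local += plainTTL
	}
	sink = local
}

func BenchmarkAtomicTTLRead(b *testing.B) {
	atomicTTL.Store(5)
	var local int64
	for i := 0; i < b.N; i++ {
		local += atomicTTL.Load()
	}
	sink = local
}
```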

@krajorama
Member Author

> Code looks good, but I'm not quite sure if we should do this kind of optimization via ttl as an atomic int. Those atomic operations still have a cost (and it could be significant on systems with many cores). The usual code path is to call ObserveWithExemplar on a histogram with configured exemplars, and just call Observe on a histogram where exemplars are not configured. The former case becomes even slower with this change here, while the latter case isn't affected at all. I'm wondering if there is a legitimate use case for calling ObserveWithExemplar on a histogram with exemplars disabled. And even if so, if it is common enough to make this PR a net gain.

Well, TTL is constant, so in theory we don't need to use atomics. Pushed that version.
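Here is a rough sketch of that variant (hypothetical names, not the actual implementation): ttl is assigned once in the constructor and never written afterwards, so as long as construction happens-before any observation, every goroutine can read it without atomics; -1 still marks "exemplars disabled".

```go
package main

import (
	"fmt"
	"time"
)

// nativeExemplarsSketch is a hypothetical stand-in: ttl is written
// exactly once at construction time and only read afterwards, so the
// hot path needs neither atomics nor the unsynchronized cap() check.
type nativeExemplarsSketch struct {
	ttl time.Duration // -1 means exemplars are turned off.
}

func newNativeExemplarsSketch(ttl time.Duration, maxExemplars int) nativeExemplarsSketch {
	if maxExemplars <= 0 {
		return nativeExemplarsSketch{ttl: -1} // exemplars disabled
	}
	return nativeExemplarsSketch{ttl: ttl}
}

func (n nativeExemplarsSketch) enabled() bool { return n.ttl != -1 }

func main() {
	fmt.Println(newNativeExemplarsSketch(5*time.Minute, 10).enabled()) // true
	fmt.Println(newNativeExemplarsSketch(5*time.Minute, 0).enabled())  // false
}
```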

Signed-off-by: György Krajcsovits <gyorgy.krajcsovits@grafana.com>
Member

@ArthurSens ArthurSens left a comment


Thanks for updating the test in the first commit to show the bug. Helps a lot with the review :)

I do understand that using ttl instead of cap(n.exemplars) solves the data race condition, so LGTM now that tests pass.

About the index-out-of-range problem: I'm having a hard time understanding the codebase, but I believe in Beorn's expertise here. Out of scope for this PR, but I think more descriptive variable names could help.
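For context, the kind of race being removed can be reproduced with a toy program like the one below (hypothetical, not the client_golang code): one goroutine appends to a slice under a mutex while another reads cap() of the same slice without taking the lock. Running it with `go run -race` typically reports a data race on the slice header.

```go
package main

import (
	"fmt"
	"sync"
)

func main() {
	var (
		mu sync.Mutex
		s  []int
	)
	done := make(chan struct{})
	go func() {
		defer close(done)
		for i := 0; i < 100000; i++ {
			mu.Lock()
			s = append(s, i) // writes the slice header under the lock
			mu.Unlock()
		}
	}()
	nonEmpty := 0
	for i := 0; i < 100000; i++ {
		if cap(s) > 0 { // unsynchronized read of the same slice header: data race
			nonEmpty++
		}
	}
	<-done // channel receive synchronizes with the goroutine's writes
	fmt.Println("len:", len(s), "non-empty reads:", nonEmpty)
}
```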

Comment on lines +1722 to +1725
md = -1.0 // Logarithm of the delta of the closest pair of exemplars.
rIdx = -1 // Index of the older exemplar within the closest pair and where we need to insert the new exemplar.
cLog float64 // Logarithm of the current exemplar.
pLog float64 // Logarithm of the previous exemplar.
Member


Hmmm, totally out of scope for this PR, but more descriptive variable names would remove the necessity for commenting on their meaning 😕

I'm not sure if I'm missing something, but is there a reason to name these variables rIdx, cLog, pLog instead of removeIndex, currentLog or previousLog?

Member


Urgh, reading again, replaceIndex is probably a better name for rIdx. Does it make nIdx obsolete? (Do I understand correctly that nIdx is newIndex? 😅)

Member


Historically, we had very different "tastes" playing out between various Prometheus developers, all the way from very long and descriptive names to "use one letter names whenever possible" (slightly exaggerated).

My observation is that Go developers tend towards shorter variable names (have a look at the standard library, for example). But we never explicitly stated a preference for the Prometheus project. In the absence of a specific guideline, I would go with generally good programming practices like "the wider the scope of a variable, the more descriptive its name".

In this case, the variables are local to a (fairly long) function. So I wouldn't be too demanding about the descriptiveness of their names. Furthermore, even previousLog wouldn't really be self explanatory. (Previous log line?) And a name like logarithmOfPreviousExemplar would be clearly too clunky. Given that the comment is needed anyway, I actually like the names as they are. Still very short, but better than just c, p, and r (which others might even prefer).

@ArthurSens
Member

Should we close #1608 in favor of this one?

@krajorama
Member Author

> Should we close #1608 in favor of this one?

I'd like to keep the discussion of the data race topic here. If Beorn says OK, I'll close this PR and just update the solution in #1608 with the one agreed here, and we can discuss the algorithm there.

@ArthurSens
Member

> > Should we close #1608 in favor of this one?
>
> I'd like to keep the discussion of the data race topic here. If Beorn says OK, I'll close this PR and just update the solution in #1608 with the one agreed here, and we can discuss the algorithm there.

ttl doesn't receive any writes after its creation. Looks safe and efficient to me :)

@krajorama
Member Author

> > > Should we close #1608 in favor of this one?
> >
> > I'd like to keep the discussion of the data race topic here. If Beorn says OK, I'll close this PR and just update the solution in #1608 with the one agreed here, and we can discuss the algorithm there.
>
> ttl doesn't receive any writes after its creation. Looks safe and efficient to me :)

Ack, I'll update the other PR and we can just close this if we merge that.

krajorama added a commit to krajorama/client_golang that referenced this pull request Sep 2, 2024
Signed-off-by: György Krajcsovits <gyorgy.krajcsovits@grafana.com>
@krajorama
Member Author

krajorama commented Sep 3, 2024

Discussed on chat with @beorn7: using ttl == -1 is OK. Closing this as concluded. Solution delivered in #1607.

@krajorama krajorama closed this Sep 3, 2024
krajorama added a commit to krajorama/client_golang that referenced this pull request Sep 4, 2024
Signed-off-by: György Krajcsovits <gyorgy.krajcsovits@grafana.com>