Skip to content
New issue

Have a question about this project? # for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “#”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? # to your account

Fix SaltScanner race condition on spans maps #1651

Merged

Conversation

neilfordyce
Copy link
Contributor

I believe this fixes one of the race condition issues described here #823

For context, we have 3 read only m5.2xlarge nodes, each serving query traffic at a rate of around 1,000 requests per minute. Several times per day, we observed a TSD process becoming stuck, maxing out 1 core on the machine. A closer look revealed one thread spending all of its time inside SaltScanner.mergeDataPoints() in a TreeMap.get(). I think there is a race condition which can trigger if validateAndTriggerCallback() and a handleException() occur concurrently.

We have a health check process which restarts the TSD process if it starts to fail. Since we have deployed this fix, we have noticed a significant decrease in restarts.

@johann8384 johann8384 merged commit b5133bd into OpenTSDB:master May 29, 2019
johann8384 pushed a commit that referenced this pull request Oct 26, 2020
* Fix SaltScanner race condition on spans maps

* Fix 1.6 compatibility
johann8384 pushed a commit that referenced this pull request Feb 24, 2021
#1458)

* For branch next, add an expression function named FirstDifference, which calculates the first difference of a time series. I noticed there is MovingAverage calculation, so I thought maybe I can enrich the mathematics functions into that.

* add some unit tests for FirstDifference

Bump version to 2.5.0-SNAPSHOT.

Fix a compilation error about missing FirstDifference (#1471)

Signed-off-by: Chris Larsen <clarsen@yahoo-inc.com>

Bugfix of FsckOptions. (#1464)

Signed-off-by: Chris Larsen <clarsen@yahoo-inc.com>

CORE: (#1472)

- Add RpcResponder for handling callbacks asynchronously
UTILS:
 - Add two convenient methods in Config

Signed-off-by: Chris Larsen <clarsen@verizonmedia.com>

fix #1581 by correcting an edge case in TsdbQuery.getScanEndTimeSeconds() (#1582)

Dockerfile that works without a script. (#1739)

replace FOREVER with a valid value in table creation (#1967)

Co-authored-by: Ion DULGHERU <ion.dulgheru@gmail.com>

Jackson has a serious security problem in 2.9.5, which will cause RCE (#2034)

* Jackson has a serious security problem in 2.9.5, which will cause RCE

FasterXML/jackson-databind#2295

* Jackson has a serious security problem in 2.9.5, which will cause RCE

FasterXML/jackson-databind#2295

Co-authored-by: chi-chi weng <949409306@qq.com>

Pr 1663 (#1966)

* Make UniqueIdRpc aware of the mode

* Update javadoc on new method and rename test methods to be more descriptive

Co-authored-by: Simon Matic Langford <simon@exemel.co.uk>

Re-introduce query timeouts. (#2035)

Co-authored-by: Itamar Turner-Trauring <itamar@itamarst.org>

Updating maven central urls and versions to match what is available now (#2039)

Fixes #1899
Fixes #1941

always write cli tools to stdout (#1488)

Signed-off-by: Chris Larsen <clarsen@yahoo-inc.com>

Fix #1632 (#1634)

Add "check_tsd_v2" script (#1567)

Enhanced check_tsd script evaluates each individual metric group separately when given a filter

Collect stats from meta cache plugin if configured (#1649)

Fix SaltScanner race condition on spans maps (#1651)

* Fix SaltScanner race condition on spans maps

* Fix 1.6 compatibility

Synchronise the KVs list for scanner results

Synchronises the list that holds the KeyValues that have been produced
by the scanner callbacks. The list is accessed from multiple threads at
a time and wasn't thread-safe, causing inconsistent results and partial
loss of data in the response.

Relates to: #1753
Resolves: #1760

Allow rollup downsample and series aggregator to be different

Fix TestSaltScannerHistogram, looks like the method was renamed and the
UTs were not adjusted.

ExplicitTags filtering with FuzzyFilters

Fix PR 1896 with the fuzzy filter list so that it will honor the regex filter
and properly ignore rows that don't match the explicit filter. Also sort the
fuzzy filter list in ascending order and implement a static comparator instead
of instantiating one on each call.

Test rollup filter fix for #1083

Fix concurrent result reporting from scanners

Fixes a concurrency bug where scanners report their results into a map
and would overwrite each other's results

Resolves: #1753

Update Maven jars URLs with HTTPS access

Remove excess param in javadoc for RpcHandler

Fix check_tsd_v2 (#1937)

* renamed instancename of logger

The previous name was copied from another script, cosmetic change only

* Change behaviour of --ignore-recent option

Previous option would fetch data from opentsdb from --duration seconds
ago to time.now(), and then try to remove timestamps that was inside the
--ignore-recent seconds ago, however the logic was flawed and it
actually only included these seconds. Furthermore opentsdb supports
setting an "end" parameter, so we use this to only get the data we want.

for example -d 180 -I 80, would render a query parameter that looks like
`?start=180s-ago&end=80s-ago`. Keeps it simple.

Also added debuglogging to output the actual query sent to OpenTSDB if
--debug option is enabled.

* fixed logic of --percent-over parameter

Previous behaviour didn't work due to wrong logic, would set "crit" or
"warn" to True regardless. This change fixes that.

* better output from logging

Add logmessages to be consistent across alerting-scenarios, and changed
format of some floats. Fixed a log messaged that displayed "crit" value
where it should have been "warn" value.

* Fixed bug in logic that parses results

Removed an if statement that `continue`:ed the for-loop if a result was
neither a `crit` or `warn` already, however this check also made the logic
skip the test to see if no values were returned by opentsdb and -A flag was
specified to alert in such scenarios.

* changed check for timestamps type

Previous behaviour was to check if a timestamp could be cast as a float,
which is a bit weird, because opentsdb will return integers.

I do doubt that opentsdb would return a timestamp that is not an integer
to begin with, so i suspect this check is redundant, but leaving it in
for now regardless, as per discussion in PR.

Rename maxScannerUidtoStringTime into maxScannerUidToStringTime (#1875)

Fix the missing index from #1754 in the salt scanner.

Force Sunday as first day of week.

Tweak TestTsdbQueryQueries to pass in older java versions.

Fix the min case for doubles in AggregationIterator.

Fix the Screw Driver config.

Fix UT for JDK8

PR for SD config.

Fix: Rollup queries with count aggregator produce unexpected results (#1895)

Co-authored-by: Tony Di Nucci <tony.dinucci@skyscanner.net>

Fixed function description Fixes #841 (#2040)

Added tracking of metrics which are null due to auto_metric being disabled Fixes #786 (#2042)

Add support for splitting rollup queries (#1853)

* Add an SLA config flag for rollup intervals

Adds a configuration option for rollup intervals to specify their
maximum acceptable delay. Queries that cover a time between now and that
maximum delay will need to query other tables for that time interval.

* Add global config flag to enable splitting queries

Adds a global config flag to enable splitting queries that would hit the
rollup table, but the rollup table has a delay SLA configured.
In that case, this feature allows splitting a query into to; one that
gets the data from the rollups table until the time where it's
guaranteed to be available, and the rest from the raw table.

* Add a new SplitRollupQuery

Adds a SplitRollupQuery class that suports splitting a rollup query into
two separate queries.
This is useful for when a rollup table is filled by e.g. a batch job
that processes the data from the previous day on a daily basis. Rollup
data for yesterday will then only be available some time today. This
delay SLA can be configured on a per-table basis. The delay would
specify by how much time the table can be behind real time.

If a query comes in that would query data from that blackout period
where data is only available in the raw table, but not yet guaranteed to
be in the rollup table, the incoming query can be split into two using
the SplitRollupQuery class. It wraps a query that queries the rollup
table until the last guaranteed to be available timestamp based on the
SLA; and one that gets the remaining data from the raw table.

* Extract an AbstractQuery

Extracts an AbstractQuery from the TsdbQuery implementation since we'd
like to reuse some parts of it in other Query classes (in this case
SplitRollupQuery)

* Extract an AbstractSpanGroup

* Avoid NullPointerException when setting start time

Avoids a NullPointerException that happened when we were trying to set
the start time on a query that would be eligible to split, but due to
the SLA config only hit the raw table anyway.

* Scale timestamps to milliseconds for split queries

Scales all timestamps for split queries to milliseconds. It's important
to maintain consistent units between all the partial queries that make
up the bigger one.

* Fix starting time error for split queries

Fixes a bug that would happen when the start time of a query aligns
perfectly with the time configured in the SLA for the delay of a rollup
table.
For a defined SLA, e.g. 1 day, if the start time of the query was
exactly 1 day ago, the end time of the rollups part of the query would
be updated and then be equal to its start time. That isn't allowed and
causes a query exception.
bshakur8 pushed a commit to bshakur8/opentsdb that referenced this pull request Oct 27, 2021
OpenTSDB#1458)

* For branch next, add an expression function named FirstDifference, which calculates the first difference of a time series. I noticed there is MovingAverage calculation, so I thought maybe I can enrich the mathematics functions into that.

* add some unit tests for FirstDifference

Bump version to 2.5.0-SNAPSHOT.

Fix a compilation error about missing FirstDifference (OpenTSDB#1471)

Signed-off-by: Chris Larsen <clarsen@yahoo-inc.com>

Bugfix of FsckOptions. (OpenTSDB#1464)

Signed-off-by: Chris Larsen <clarsen@yahoo-inc.com>

CORE: (OpenTSDB#1472)

- Add RpcResponder for handling callbacks asynchronously
UTILS:
 - Add two convenient methods in Config

Signed-off-by: Chris Larsen <clarsen@verizonmedia.com>

fix OpenTSDB#1581 by correcting an edge case in TsdbQuery.getScanEndTimeSeconds() (OpenTSDB#1582)

Dockerfile that works without a script. (OpenTSDB#1739)

replace FOREVER with a valid value in table creation (OpenTSDB#1967)

Co-authored-by: Ion DULGHERU <ion.dulgheru@gmail.com>

Jackson has a serious security problem in 2.9.5, which will cause RCE (OpenTSDB#2034)

* Jackson has a serious security problem in 2.9.5, which will cause RCE

FasterXML/jackson-databind#2295

* Jackson has a serious security problem in 2.9.5, which will cause RCE

FasterXML/jackson-databind#2295

Co-authored-by: chi-chi weng <949409306@qq.com>

Pr 1663 (OpenTSDB#1966)

* Make UniqueIdRpc aware of the mode

* Update javadoc on new method and rename test methods to be more descriptive

Co-authored-by: Simon Matic Langford <simon@exemel.co.uk>

Re-introduce query timeouts. (OpenTSDB#2035)

Co-authored-by: Itamar Turner-Trauring <itamar@itamarst.org>

Updating maven central urls and versions to match what is available now (OpenTSDB#2039)

Fixes OpenTSDB#1899
Fixes OpenTSDB#1941

always write cli tools to stdout (OpenTSDB#1488)

Signed-off-by: Chris Larsen <clarsen@yahoo-inc.com>

Fix OpenTSDB#1632 (OpenTSDB#1634)

Add "check_tsd_v2" script (OpenTSDB#1567)

Enhanced check_tsd script evaluates each individual metric group separately when given a filter

Collect stats from meta cache plugin if configured (OpenTSDB#1649)

Fix SaltScanner race condition on spans maps (OpenTSDB#1651)

* Fix SaltScanner race condition on spans maps

* Fix 1.6 compatibility

Synchronise the KVs list for scanner results

Synchronises the list that holds the KeyValues that have been produced
by the scanner callbacks. The list is accessed from multiple threads at
a time and wasn't thread-safe, causing inconsistent results and partial
loss of data in the response.

Relates to: OpenTSDB#1753
Resolves: OpenTSDB#1760

Allow rollup downsample and series aggregator to be different

Fix TestSaltScannerHistogram, looks like the method was renamed and the
UTs were not adjusted.

ExplicitTags filtering with FuzzyFilters

Fix PR 1896 with the fuzzy filter list so that it will honor the regex filter
and properly ignore rows that don't match the explicit filter. Also sort the
fuzzy filter list in ascending order and implement a static comparator instead
of instantiating one on each call.

Test rollup filter fix for OpenTSDB#1083

Fix concurrent result reporting from scanners

Fixes a concurrency bug where scanners report their results into a map
and would overwrite each other's results

Resolves: OpenTSDB#1753

Update Maven jars URLs with HTTPS access

Remove excess param in javadoc for RpcHandler

Fix check_tsd_v2 (OpenTSDB#1937)

* renamed instancename of logger

The previous name was copied from another script, cosmetic change only

* Change behaviour of --ignore-recent option

Previous option would fetch data from opentsdb from --duration seconds
ago to time.now(), and then try to remove timestamps that was inside the
--ignore-recent seconds ago, however the logic was flawed and it
actually only included these seconds. Furthermore opentsdb supports
setting an "end" parameter, so we use this to only get the data we want.

for example -d 180 -I 80, would render a query parameter that looks like
`?start=180s-ago&end=80s-ago`. Keeps it simple.

Also added debuglogging to output the actual query sent to OpenTSDB if
--debug option is enabled.

* fixed logic of --percent-over parameter

Previous behaviour didn't work due to wrong logic, would set "crit" or
"warn" to True regardless. This change fixes that.

* better output from logging

Add logmessages to be consistent across alerting-scenarios, and changed
format of some floats. Fixed a log messaged that displayed "crit" value
where it should have been "warn" value.

* Fixed bug in logic that parses results

Removed an if statement that `continue`:ed the for-loop if a result was
neither a `crit` or `warn` already, however this check also made the logic
skip the test to see if no values were returned by opentsdb and -A flag was
specified to alert in such scenarios.

* changed check for timestamps type

Previous behaviour was to check if a timestamp could be cast as a float,
which is a bit weird, because opentsdb will return integers.

I do doubt that opentsdb would return a timestamp that is not an integer
to begin with, so i suspect this check is redundant, but leaving it in
for now regardless, as per discussion in PR.

Rename maxScannerUidtoStringTime into maxScannerUidToStringTime (OpenTSDB#1875)

Fix the missing index from OpenTSDB#1754 in the salt scanner.

Force Sunday as first day of week.

Tweak TestTsdbQueryQueries to pass in older java versions.

Fix the min case for doubles in AggregationIterator.

Fix the Screw Driver config.

Fix UT for JDK8

PR for SD config.

Fix: Rollup queries with count aggregator produce unexpected results (OpenTSDB#1895)

Co-authored-by: Tony Di Nucci <tony.dinucci@skyscanner.net>

Fixed function description Fixes OpenTSDB#841 (OpenTSDB#2040)

Added tracking of metrics which are null due to auto_metric being disabled Fixes OpenTSDB#786 (OpenTSDB#2042)

Add support for splitting rollup queries (OpenTSDB#1853)

* Add an SLA config flag for rollup intervals

Adds a configuration option for rollup intervals to specify their
maximum acceptable delay. Queries that cover a time between now and that
maximum delay will need to query other tables for that time interval.

* Add global config flag to enable splitting queries

Adds a global config flag to enable splitting queries that would hit the
rollup table, but the rollup table has a delay SLA configured.
In that case, this feature allows splitting a query into to; one that
gets the data from the rollups table until the time where it's
guaranteed to be available, and the rest from the raw table.

* Add a new SplitRollupQuery

Adds a SplitRollupQuery class that suports splitting a rollup query into
two separate queries.
This is useful for when a rollup table is filled by e.g. a batch job
that processes the data from the previous day on a daily basis. Rollup
data for yesterday will then only be available some time today. This
delay SLA can be configured on a per-table basis. The delay would
specify by how much time the table can be behind real time.

If a query comes in that would query data from that blackout period
where data is only available in the raw table, but not yet guaranteed to
be in the rollup table, the incoming query can be split into two using
the SplitRollupQuery class. It wraps a query that queries the rollup
table until the last guaranteed to be available timestamp based on the
SLA; and one that gets the remaining data from the raw table.

* Extract an AbstractQuery

Extracts an AbstractQuery from the TsdbQuery implementation since we'd
like to reuse some parts of it in other Query classes (in this case
SplitRollupQuery)

* Extract an AbstractSpanGroup

* Avoid NullPointerException when setting start time

Avoids a NullPointerException that happened when we were trying to set
the start time on a query that would be eligible to split, but due to
the SLA config only hit the raw table anyway.

* Scale timestamps to milliseconds for split queries

Scales all timestamps for split queries to milliseconds. It's important
to maintain consistent units between all the partial queries that make
up the bigger one.

* Fix starting time error for split queries

Fixes a bug that would happen when the start time of a query aligns
perfectly with the time configured in the SLA for the delay of a rollup
table.
For a defined SLA, e.g. 1 day, if the start time of the query was
exactly 1 day ago, the end time of the rollups part of the query would
be updated and then be equal to its start time. That isn't allowed and
causes a query exception.
rstropoli pushed a commit to MadDogTechnology/opentsdb that referenced this pull request Mar 22, 2024
* Fix SaltScanner race condition on spans maps

* Fix 1.6 compatibility
# for free to join this conversation on GitHub. Already have an account? # to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants