-
Notifications
You must be signed in to change notification settings - Fork 1.2k
New issue
Have a question about this project? # for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “#”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? # to your account
Fix SaltScanner race condition on spans maps #1651
Merged
johann8384
merged 2 commits into
OpenTSDB:master
from
neilfordyce:fix-salt-scanner-race-condition
May 29, 2019
Merged
Fix SaltScanner race condition on spans maps #1651
johann8384
merged 2 commits into
OpenTSDB:master
from
neilfordyce:fix-salt-scanner-race-condition
May 29, 2019
Conversation
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
johann8384
pushed a commit
that referenced
this pull request
Oct 26, 2020
* Fix SaltScanner race condition on spans maps * Fix 1.6 compatibility
johann8384
pushed a commit
that referenced
this pull request
Feb 24, 2021
#1458) * For branch next, add an expression function named FirstDifference, which calculates the first difference of a time series. I noticed there is MovingAverage calculation, so I thought maybe I can enrich the mathematics functions into that. * add some unit tests for FirstDifference Bump version to 2.5.0-SNAPSHOT. Fix a compilation error about missing FirstDifference (#1471) Signed-off-by: Chris Larsen <clarsen@yahoo-inc.com> Bugfix of FsckOptions. (#1464) Signed-off-by: Chris Larsen <clarsen@yahoo-inc.com> CORE: (#1472) - Add RpcResponder for handling callbacks asynchronously UTILS: - Add two convenient methods in Config Signed-off-by: Chris Larsen <clarsen@verizonmedia.com> fix #1581 by correcting an edge case in TsdbQuery.getScanEndTimeSeconds() (#1582) Dockerfile that works without a script. (#1739) replace FOREVER with a valid value in table creation (#1967) Co-authored-by: Ion DULGHERU <ion.dulgheru@gmail.com> Jackson has a serious security problem in 2.9.5, which will cause RCE (#2034) * Jackson has a serious security problem in 2.9.5, which will cause RCE FasterXML/jackson-databind#2295 * Jackson has a serious security problem in 2.9.5, which will cause RCE FasterXML/jackson-databind#2295 Co-authored-by: chi-chi weng <949409306@qq.com> Pr 1663 (#1966) * Make UniqueIdRpc aware of the mode * Update javadoc on new method and rename test methods to be more descriptive Co-authored-by: Simon Matic Langford <simon@exemel.co.uk> Re-introduce query timeouts. (#2035) Co-authored-by: Itamar Turner-Trauring <itamar@itamarst.org> Updating maven central urls and versions to match what is available now (#2039) Fixes #1899 Fixes #1941 always write cli tools to stdout (#1488) Signed-off-by: Chris Larsen <clarsen@yahoo-inc.com> Fix #1632 (#1634) Add "check_tsd_v2" script (#1567) Enhanced check_tsd script evaluates each individual metric group separately when given a filter Collect stats from meta cache plugin if configured (#1649) Fix SaltScanner race condition on spans maps (#1651) * Fix SaltScanner race condition on spans maps * Fix 1.6 compatibility Synchronise the KVs list for scanner results Synchronises the list that holds the KeyValues that have been produced by the scanner callbacks. The list is accessed from multiple threads at a time and wasn't thread-safe, causing inconsistent results and partial loss of data in the response. Relates to: #1753 Resolves: #1760 Allow rollup downsample and series aggregator to be different Fix TestSaltScannerHistogram, looks like the method was renamed and the UTs were not adjusted. ExplicitTags filtering with FuzzyFilters Fix PR 1896 with the fuzzy filter list so that it will honor the regex filter and properly ignore rows that don't match the explicit filter. Also sort the fuzzy filter list in ascending order and implement a static comparator instead of instantiating one on each call. Test rollup filter fix for #1083 Fix concurrent result reporting from scanners Fixes a concurrency bug where scanners report their results into a map and would overwrite each other's results Resolves: #1753 Update Maven jars URLs with HTTPS access Remove excess param in javadoc for RpcHandler Fix check_tsd_v2 (#1937) * renamed instancename of logger The previous name was copied from another script, cosmetic change only * Change behaviour of --ignore-recent option Previous option would fetch data from opentsdb from --duration seconds ago to time.now(), and then try to remove timestamps that was inside the --ignore-recent seconds ago, however the logic was flawed and it actually only included these seconds. Furthermore opentsdb supports setting an "end" parameter, so we use this to only get the data we want. for example -d 180 -I 80, would render a query parameter that looks like `?start=180s-ago&end=80s-ago`. Keeps it simple. Also added debuglogging to output the actual query sent to OpenTSDB if --debug option is enabled. * fixed logic of --percent-over parameter Previous behaviour didn't work due to wrong logic, would set "crit" or "warn" to True regardless. This change fixes that. * better output from logging Add logmessages to be consistent across alerting-scenarios, and changed format of some floats. Fixed a log messaged that displayed "crit" value where it should have been "warn" value. * Fixed bug in logic that parses results Removed an if statement that `continue`:ed the for-loop if a result was neither a `crit` or `warn` already, however this check also made the logic skip the test to see if no values were returned by opentsdb and -A flag was specified to alert in such scenarios. * changed check for timestamps type Previous behaviour was to check if a timestamp could be cast as a float, which is a bit weird, because opentsdb will return integers. I do doubt that opentsdb would return a timestamp that is not an integer to begin with, so i suspect this check is redundant, but leaving it in for now regardless, as per discussion in PR. Rename maxScannerUidtoStringTime into maxScannerUidToStringTime (#1875) Fix the missing index from #1754 in the salt scanner. Force Sunday as first day of week. Tweak TestTsdbQueryQueries to pass in older java versions. Fix the min case for doubles in AggregationIterator. Fix the Screw Driver config. Fix UT for JDK8 PR for SD config. Fix: Rollup queries with count aggregator produce unexpected results (#1895) Co-authored-by: Tony Di Nucci <tony.dinucci@skyscanner.net> Fixed function description Fixes #841 (#2040) Added tracking of metrics which are null due to auto_metric being disabled Fixes #786 (#2042) Add support for splitting rollup queries (#1853) * Add an SLA config flag for rollup intervals Adds a configuration option for rollup intervals to specify their maximum acceptable delay. Queries that cover a time between now and that maximum delay will need to query other tables for that time interval. * Add global config flag to enable splitting queries Adds a global config flag to enable splitting queries that would hit the rollup table, but the rollup table has a delay SLA configured. In that case, this feature allows splitting a query into to; one that gets the data from the rollups table until the time where it's guaranteed to be available, and the rest from the raw table. * Add a new SplitRollupQuery Adds a SplitRollupQuery class that suports splitting a rollup query into two separate queries. This is useful for when a rollup table is filled by e.g. a batch job that processes the data from the previous day on a daily basis. Rollup data for yesterday will then only be available some time today. This delay SLA can be configured on a per-table basis. The delay would specify by how much time the table can be behind real time. If a query comes in that would query data from that blackout period where data is only available in the raw table, but not yet guaranteed to be in the rollup table, the incoming query can be split into two using the SplitRollupQuery class. It wraps a query that queries the rollup table until the last guaranteed to be available timestamp based on the SLA; and one that gets the remaining data from the raw table. * Extract an AbstractQuery Extracts an AbstractQuery from the TsdbQuery implementation since we'd like to reuse some parts of it in other Query classes (in this case SplitRollupQuery) * Extract an AbstractSpanGroup * Avoid NullPointerException when setting start time Avoids a NullPointerException that happened when we were trying to set the start time on a query that would be eligible to split, but due to the SLA config only hit the raw table anyway. * Scale timestamps to milliseconds for split queries Scales all timestamps for split queries to milliseconds. It's important to maintain consistent units between all the partial queries that make up the bigger one. * Fix starting time error for split queries Fixes a bug that would happen when the start time of a query aligns perfectly with the time configured in the SLA for the delay of a rollup table. For a defined SLA, e.g. 1 day, if the start time of the query was exactly 1 day ago, the end time of the rollups part of the query would be updated and then be equal to its start time. That isn't allowed and causes a query exception.
bshakur8
pushed a commit
to bshakur8/opentsdb
that referenced
this pull request
Oct 27, 2021
OpenTSDB#1458) * For branch next, add an expression function named FirstDifference, which calculates the first difference of a time series. I noticed there is MovingAverage calculation, so I thought maybe I can enrich the mathematics functions into that. * add some unit tests for FirstDifference Bump version to 2.5.0-SNAPSHOT. Fix a compilation error about missing FirstDifference (OpenTSDB#1471) Signed-off-by: Chris Larsen <clarsen@yahoo-inc.com> Bugfix of FsckOptions. (OpenTSDB#1464) Signed-off-by: Chris Larsen <clarsen@yahoo-inc.com> CORE: (OpenTSDB#1472) - Add RpcResponder for handling callbacks asynchronously UTILS: - Add two convenient methods in Config Signed-off-by: Chris Larsen <clarsen@verizonmedia.com> fix OpenTSDB#1581 by correcting an edge case in TsdbQuery.getScanEndTimeSeconds() (OpenTSDB#1582) Dockerfile that works without a script. (OpenTSDB#1739) replace FOREVER with a valid value in table creation (OpenTSDB#1967) Co-authored-by: Ion DULGHERU <ion.dulgheru@gmail.com> Jackson has a serious security problem in 2.9.5, which will cause RCE (OpenTSDB#2034) * Jackson has a serious security problem in 2.9.5, which will cause RCE FasterXML/jackson-databind#2295 * Jackson has a serious security problem in 2.9.5, which will cause RCE FasterXML/jackson-databind#2295 Co-authored-by: chi-chi weng <949409306@qq.com> Pr 1663 (OpenTSDB#1966) * Make UniqueIdRpc aware of the mode * Update javadoc on new method and rename test methods to be more descriptive Co-authored-by: Simon Matic Langford <simon@exemel.co.uk> Re-introduce query timeouts. (OpenTSDB#2035) Co-authored-by: Itamar Turner-Trauring <itamar@itamarst.org> Updating maven central urls and versions to match what is available now (OpenTSDB#2039) Fixes OpenTSDB#1899 Fixes OpenTSDB#1941 always write cli tools to stdout (OpenTSDB#1488) Signed-off-by: Chris Larsen <clarsen@yahoo-inc.com> Fix OpenTSDB#1632 (OpenTSDB#1634) Add "check_tsd_v2" script (OpenTSDB#1567) Enhanced check_tsd script evaluates each individual metric group separately when given a filter Collect stats from meta cache plugin if configured (OpenTSDB#1649) Fix SaltScanner race condition on spans maps (OpenTSDB#1651) * Fix SaltScanner race condition on spans maps * Fix 1.6 compatibility Synchronise the KVs list for scanner results Synchronises the list that holds the KeyValues that have been produced by the scanner callbacks. The list is accessed from multiple threads at a time and wasn't thread-safe, causing inconsistent results and partial loss of data in the response. Relates to: OpenTSDB#1753 Resolves: OpenTSDB#1760 Allow rollup downsample and series aggregator to be different Fix TestSaltScannerHistogram, looks like the method was renamed and the UTs were not adjusted. ExplicitTags filtering with FuzzyFilters Fix PR 1896 with the fuzzy filter list so that it will honor the regex filter and properly ignore rows that don't match the explicit filter. Also sort the fuzzy filter list in ascending order and implement a static comparator instead of instantiating one on each call. Test rollup filter fix for OpenTSDB#1083 Fix concurrent result reporting from scanners Fixes a concurrency bug where scanners report their results into a map and would overwrite each other's results Resolves: OpenTSDB#1753 Update Maven jars URLs with HTTPS access Remove excess param in javadoc for RpcHandler Fix check_tsd_v2 (OpenTSDB#1937) * renamed instancename of logger The previous name was copied from another script, cosmetic change only * Change behaviour of --ignore-recent option Previous option would fetch data from opentsdb from --duration seconds ago to time.now(), and then try to remove timestamps that was inside the --ignore-recent seconds ago, however the logic was flawed and it actually only included these seconds. Furthermore opentsdb supports setting an "end" parameter, so we use this to only get the data we want. for example -d 180 -I 80, would render a query parameter that looks like `?start=180s-ago&end=80s-ago`. Keeps it simple. Also added debuglogging to output the actual query sent to OpenTSDB if --debug option is enabled. * fixed logic of --percent-over parameter Previous behaviour didn't work due to wrong logic, would set "crit" or "warn" to True regardless. This change fixes that. * better output from logging Add logmessages to be consistent across alerting-scenarios, and changed format of some floats. Fixed a log messaged that displayed "crit" value where it should have been "warn" value. * Fixed bug in logic that parses results Removed an if statement that `continue`:ed the for-loop if a result was neither a `crit` or `warn` already, however this check also made the logic skip the test to see if no values were returned by opentsdb and -A flag was specified to alert in such scenarios. * changed check for timestamps type Previous behaviour was to check if a timestamp could be cast as a float, which is a bit weird, because opentsdb will return integers. I do doubt that opentsdb would return a timestamp that is not an integer to begin with, so i suspect this check is redundant, but leaving it in for now regardless, as per discussion in PR. Rename maxScannerUidtoStringTime into maxScannerUidToStringTime (OpenTSDB#1875) Fix the missing index from OpenTSDB#1754 in the salt scanner. Force Sunday as first day of week. Tweak TestTsdbQueryQueries to pass in older java versions. Fix the min case for doubles in AggregationIterator. Fix the Screw Driver config. Fix UT for JDK8 PR for SD config. Fix: Rollup queries with count aggregator produce unexpected results (OpenTSDB#1895) Co-authored-by: Tony Di Nucci <tony.dinucci@skyscanner.net> Fixed function description Fixes OpenTSDB#841 (OpenTSDB#2040) Added tracking of metrics which are null due to auto_metric being disabled Fixes OpenTSDB#786 (OpenTSDB#2042) Add support for splitting rollup queries (OpenTSDB#1853) * Add an SLA config flag for rollup intervals Adds a configuration option for rollup intervals to specify their maximum acceptable delay. Queries that cover a time between now and that maximum delay will need to query other tables for that time interval. * Add global config flag to enable splitting queries Adds a global config flag to enable splitting queries that would hit the rollup table, but the rollup table has a delay SLA configured. In that case, this feature allows splitting a query into to; one that gets the data from the rollups table until the time where it's guaranteed to be available, and the rest from the raw table. * Add a new SplitRollupQuery Adds a SplitRollupQuery class that suports splitting a rollup query into two separate queries. This is useful for when a rollup table is filled by e.g. a batch job that processes the data from the previous day on a daily basis. Rollup data for yesterday will then only be available some time today. This delay SLA can be configured on a per-table basis. The delay would specify by how much time the table can be behind real time. If a query comes in that would query data from that blackout period where data is only available in the raw table, but not yet guaranteed to be in the rollup table, the incoming query can be split into two using the SplitRollupQuery class. It wraps a query that queries the rollup table until the last guaranteed to be available timestamp based on the SLA; and one that gets the remaining data from the raw table. * Extract an AbstractQuery Extracts an AbstractQuery from the TsdbQuery implementation since we'd like to reuse some parts of it in other Query classes (in this case SplitRollupQuery) * Extract an AbstractSpanGroup * Avoid NullPointerException when setting start time Avoids a NullPointerException that happened when we were trying to set the start time on a query that would be eligible to split, but due to the SLA config only hit the raw table anyway. * Scale timestamps to milliseconds for split queries Scales all timestamps for split queries to milliseconds. It's important to maintain consistent units between all the partial queries that make up the bigger one. * Fix starting time error for split queries Fixes a bug that would happen when the start time of a query aligns perfectly with the time configured in the SLA for the delay of a rollup table. For a defined SLA, e.g. 1 day, if the start time of the query was exactly 1 day ago, the end time of the rollups part of the query would be updated and then be equal to its start time. That isn't allowed and causes a query exception.
rstropoli
pushed a commit
to MadDogTechnology/opentsdb
that referenced
this pull request
Mar 22, 2024
* Fix SaltScanner race condition on spans maps * Fix 1.6 compatibility
# for free
to join this conversation on GitHub.
Already have an account?
# to comment
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
I believe this fixes one of the race condition issues described here #823
For context, we have 3 read only m5.2xlarge nodes, each serving query traffic at a rate of around 1,000 requests per minute. Several times per day, we observed a TSD process becoming stuck, maxing out 1 core on the machine. A closer look revealed one thread spending all of its time inside SaltScanner.mergeDataPoints() in a TreeMap.get(). I think there is a race condition which can trigger if validateAndTriggerCallback() and a handleException() occur concurrently.
We have a health check process which restarts the TSD process if it starts to fail. Since we have deployed this fix, we have noticed a significant decrease in restarts.