Potential infinite loop from multithreaded use of HashMap by way of "skip" HashSet in SaltScanner? #823
There are 128 pool threads because MapRThreadPool.java in mapr-asynchbase-1.7.0.201603301412 (and in 1.6.0), from http://repository.mapr.com/nexus/content/groups/mapr-public/org/hbase/asynchbase/ , contains:
In version 1.5.0 https://github.com/mapr/asynchbase/blob/v1.5.0-mapr/src/MapRThreadPool.java I see:
so maybe with fewer threads in the pool a race condition was less likely.
This is the minimal change I went with:
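The diff itself is not reproduced here; as a rough sketch of the kind of change being described (assuming the `skips` and `keepers` sets in `SaltScanner`, with the element type shown as `String` purely for illustration), the non-thread-safe `HashSet` can be swapped for a `Set` view backed by a `ConcurrentHashMap`:

```java
import java.util.Collections;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

public class ConcurrentSetSketch {
  // Before: plain HashSet fields, not safe for concurrent add() calls from scanner callbacks.
  //   private final Set<String> skips = new HashSet<String>();
  //   private final Set<String> keepers = new HashSet<String>();

  // After (sketch): Set views backed by a ConcurrentHashMap, safe for concurrent writers.
  private final Set<String> skips =
      Collections.newSetFromMap(new ConcurrentHashMap<String, Boolean>());
  private final Set<String> keepers =
      Collections.newSetFromMap(new ConcurrentHashMap<String, Boolean>());

  public static void main(String[] args) {
    ConcurrentSetSketch s = new ConcurrentSetSketch();
    s.skips.add("000001");   // safe to call from multiple callback threads
    s.keepers.add("000002");
    System.out.println(s.skips + " " + s.keepers);
  }
}
```

`Collections.newSetFromMap` has been available since Java 6, so it fits the older-JDK constraint discussed further down the thread.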
Do you want to submit a PR?
The minimal change in #823 (comment) looks to have stopped threads becoming stuck in CPU loops in our opentsdb 2.2.0 instances. We do not have plans to upgrade the opentsdb version at the moment, so I am sorry I am unable to test the change against more recent versions. I could make a PR against https://github.com/OpenTSDB/opentsdb/tree/v2.2.0 ; would a v2.2.1 release with the change be appropriate? Or would such a change only go into the master branch and next release? I will make the same change to
9197b63 added the
I did see a few other
I was also wondering about the code needing to support older versions of Java, in particular the parameterless constructor for ConcurrentHashMap, and whether the default configuration would be suitable when using Java 6: https://ria101.wordpress.com/2011/12/12/concurrenthashmap-avoid-a-common-misuse/
I also read about the Java 8 ConcurrentHashMap changes via netty/netty#1052 ("Consider replace ConcurrentHashMap with ConcurrentHashMapV8 backport").
I read http://opentsdb.net/docs/build/html/development/index.html#guidelines and note that: ".) Bug fixes should be done in the master branch"
In order to make a PR I will need my employer to agree to waive any rights they have to contributions before I can sign the CLA. I will start the internal process to obtain this agreement, but please note that it can take some time (normally weeks). I am happy for anyone else to apply a fix. Thanks.
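On the constructor question: on Java 6/7 the parameterless ConcurrentHashMap constructor uses an initial capacity of 16, a load factor of 0.75 and a concurrency level of 16, and the linked post discusses when those defaults are a poor fit. A small sketch of the two forms (the explicit numbers below are illustrative only, not values taken from OpenTSDB):

```java
import java.util.concurrent.ConcurrentHashMap;

public class ChmSizingSketch {
  public static void main(String[] args) {
    // Parameterless constructor: Java 6/7 defaults of initial capacity 16,
    // load factor 0.75 and concurrency level 16.
    ConcurrentHashMap<String, Boolean> defaults =
        new ConcurrentHashMap<String, Boolean>();

    // Explicitly sized variant, as discussed in the linked post;
    // 64 / 0.75f / 8 here are made-up tuning values.
    ConcurrentHashMap<String, Boolean> sized =
        new ConcurrentHashMap<String, Boolean>(64, 0.75f, 8);

    defaults.put("a", Boolean.TRUE);
    sized.put("b", Boolean.TRUE);
    System.out.println(defaults.size() + " " + sized.size());
  }
}
```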
It would be most helpful if the changes are made against the
My employer agreed to waive any rights they have to contributions to this project, and I have signed the CLA. Pull request #835 is just for the HashSet seen in the thread stack of threads in infinite loops; I can
If the "nid" from the jstack is converted to a decimal:
It would line up with the thread id from ps output on the pid of the opentsdb instance:
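For clarity, the jstack "nid" is the native thread id printed in hex; converting it to decimal gives the LWP number that ps (for example `ps -eLf`) reports. A tiny sketch with a made-up nid value:

```java
public class NidToLwp {
  public static void main(String[] args) {
    String nid = "0x1f4b";                        // hypothetical nid from a jstack line
    long lwp = Long.parseLong(nid.substring(2), 16);
    System.out.println(lwp);                      // prints 8011, comparable to the ps LWP column
  }
}
```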
A Java Poor Man's Profiler (
Formatted to not be on one line:
I was wondering if this patch could also be applied to a 2.2.1 release? Thanks.
On an instance of opentsdb 2.2.0 with the patch from #835 we have a thread stuck in a different infinite loop:
This is https://github.com/OpenTSDB/opentsdb/blob/v2.2.0/src/core/SaltScanner.java#L580 , where https://github.com/OpenTSDB/opentsdb/blob/v2.2.0/src/core/SaltScanner.java#L307
For OpenTSDB#823 Concurrent use of a HashSet can trigger race conditions and an infinite loop(s), use the thread-safe ConcurrentHashMap to avoid this. Making the change just for the thread stack seen looping.
For #823 Concurrent use of a HashSet can trigger race conditions and an infinite loop(s), use the thread-safe ConcurrentHashMap to avoid this. Making the change just for the thread stack seen looping. Signed-off-by: Chris Larsen <clarsen@yahoo-inc.com>
The patch looks good for the first one. For regex and wildcard queries there is the possibility that the sets could be modified from different threads at the same time due to UID resolution. The patch should solve that with concurrent maps and I'll merge that upstream. For the second issue, we'll try out manolama@8b1368a to see whether just wrapping the TreeMap with a synchronized collection works properly.
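For context, a minimal sketch of the "wrap the TreeMap in a synchronized collection" idea being tried; the key and value types here are placeholders, not the actual SaltScanner types:

```java
import java.util.Collections;
import java.util.Map;
import java.util.SortedMap;
import java.util.TreeMap;

public class SynchronizedTreeMapSketch {
  public static void main(String[] args) {
    // Wrap the TreeMap so individual get/put calls are synchronized.
    SortedMap<Long, String> annotations =
        Collections.synchronizedSortedMap(new TreeMap<Long, String>());

    annotations.put(1L, "note-1");  // safe from multiple threads

    // Iteration still needs explicit synchronization on the wrapper,
    // otherwise concurrent modification during traversal is possible.
    synchronized (annotations) {
      for (Map.Entry<Long, String> e : annotations.entrySet()) {
        System.out.println(e.getKey() + " " + e.getValue());
      }
    }
  }
}
```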
I applied the first ("skips" and "keepers") and second ("annotations") patch to the 2.2.0 tagged release and deployed them with a staggered rollout to 10 read-only dev opentsdb instances over 5 dev hosts. The dev hosts have been under a very light load; total test time so far is 18 hours ( 2 read-only instances x ( ( 1 host x 3hrs ) + ( 2 x 2hrs ) + ( 2 x 1hr ) ) ), with no problems so far. We will test these opentsdb instances under a higher load either tomorrow or next week (probably next week) and report back. Thanks again for the patch for the second issue and for merging the first pull request.
The 10 read-only opentsdb instances with the first and second patch applied have had no problems and no CPUs stuck in infinite loops while operating under a light load. We just switched them over to service a heavier load of queries. Weekend workload is very light, so I will update again on Wednesday next week. Thanks.
I noticed one thread (running with the first and second patch against 2.2.0) that has remained at high CPU utilization since the restart:
Note that line numbers differ from manolama@8b1368a because I don't have manolama@050d507#diff-4d6a61187451e2070d4b688cd3fb6726.
Just to update, the first and second patches have been running on opentsdb instances under a heavy load and we have seen high user-time CPU; when we do, the threads are seen in these stacks:
We are currently working on disabling support for annotations, either in the queries sent from Grafana or in OpenTSDB itself.
I got a little overzealous moving stuff out of 2.3.0; I should probably make sure this is resolved for the 2.3.0 RC2.
Just to update, we continued (and continue) to see the threads spinning in code to do with annotations. A proxy was implemented to disable annotations in queries, but we found some access (requesting annotations) was bypassing the proxy, so I think the latest plan is to disable this bypass route and force the use of the proxy that strips annotation requests from queries.
Thanks, I am not sure if the annotation queries we see are arriving over the "annotation" RPC. Sorry, I am working on this from the inside out. My hack with annotations support disabled has these two diffs: the first has already been applied, and the second initialises some variables (I read that an uninitialised boolean field in Java defaults to false, so I set
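On the uninitialised-boolean point from the previous comment: in Java an instance or static boolean field defaults to false, whereas a local boolean variable must be assigned before use or the code will not compile. A small illustration:

```java
public class BooleanDefaultSketch {
  static boolean fieldFlag;          // field: defaults to false, no initialiser needed

  public static void main(String[] args) {
    System.out.println(fieldFlag);   // prints "false"

    boolean localFlag = false;       // local: no default value, must assign explicitly
    System.out.println(localFlag);
  }
}
```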
Hi @thatsafunnyname, are you still seeing the spinning threads? And do you actually have some Annotations stored in the data that's scanned?
Yes, we still see spinning threads in the annotations code.
Just to update. The query "load" sent to an instance running with the hack from my previous comment (to disable annotations support) was unable to cause threads to get stuck spinning in annotations code. We have rolled out the .jar file to other development hosts and plan to start rolling it out to production in around a week from now. We see threads stuck spinning in annotations code reasonably frequently in production, so I expect to be able to confirm that disabling annotations (in the manner done in #823 (comment)) resolves the problem for us.
As a workaround for the problem I can make it possible to disable annotations cleanly from the configuration. Of course, we'll still want to find the cause of the problem in the annotation code.
Thanks @johann8384, that would be great. Being able to disable (and enable) annotations cleanly in the config would help us.
Just to provide an update. The changes to disable annotations from #823 (comment) were deployed to a production environment; they helped but did not completely eradicate opentsdb processes seen spinning CPU in code supporting annotations. The chart below, covering 3 days, shows the reduction in average percent user CPU time; the hosts have 24 CPUs. After the deployment our monit logs still showed our monit CPU threshold being hit by opentsdb processes on occasion; the monit threshold is based on usr+sys CPU, so it is not perfect and the events could well be false positives as far as infinite loops are concerned. The chart below, also covering 3 days, shows the percent user CPU time for the 24 CPUs on one of the hosts from the previous chart. We are also logging the HTTP responses from the api/stats/threads URL for the opentsdb instance that has had the CPU spinning. The log of thread stacks shows that when the opentsdb processes are using the most CPU they are in annotations code; in particular we have logged 100 threads in:
These are the following lines in
I removed this code using the
then it will have helped. A jar file with these changes has been deployed to our development environment. If testing is without incident then a production deployment will follow. I will report back.
The tests needed these changes to pass:
The jar file with the diff from #823 (comment) (which removes some code supporting annotations) is in the process of being rolled out to a production environment where user CPU time spikes had been seen at times when many threads were in the annotations code. So far, the user CPU time spikes have not been seen once the jar file has been deployed, and the monit actions for the opentsdb processes exceeding CPU thresholds have not fired. This chart is for 3 hosts (24 CPUs each); it is the sum of 1-minute averages of user CPU time. So being able to cleanly disable annotations in the opentsdb config (#873) will help. I just wanted to also mention that we are in the process of making some changes to our systems that could change the number of threads, which may mean that we will not hit the race condition as frequently. So our ability to easily test a 'proper' fix may be reduced.
We are eagerly awaiting this fix; when is this change expected to be merged? Do we have a new expected release date for v2.3.0?
For OpenTSDB#823 Concurrent use of a HashSet can trigger race conditions and an infinite loop(s), use the thread-safe ConcurrentHashMap to avoid this. Making the change just for the thread stack seen looping. Signed-off-by: Chris Larsen <clarsen@yahoo-inc.com>
… map were not synchronized, potentially leading to a race condition and stuck threads. Now we'll send a fresh list to the compaction code if notes were found, synchronize on the local byte map before making any modifications. Thanks to @thatsafunnyname.
manolama@ee1bf44 should handle this by synchronizing on the map before making modifications. Will merge before cutting 2.3.0.
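A rough sketch of the pattern that commit message describes, synchronizing on the per-row map before mutating it and handing a fresh copy to downstream (compaction) code; the class and field names here are illustrative, not the actual SaltScanner code:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeMap;

public class SyncOnMapSketch {
  private final TreeMap<Long, String> annotations = new TreeMap<Long, String>();

  void addAnnotation(long timestamp, String note) {
    // All modifications guarded by the same monitor.
    synchronized (annotations) {
      annotations.put(timestamp, note);
    }
  }

  List<String> snapshotForCompaction() {
    // Hand a fresh copy downstream so later writes cannot race with the reader.
    synchronized (annotations) {
      return new ArrayList<String>(annotations.values());
    }
  }

  public static void main(String[] args) {
    SyncOnMapSketch s = new SyncOnMapSketch();
    s.addAnnotation(1461000000L, "example note");
    System.out.println(s.snapshotForCompaction());
  }
}
```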
were not synchronized, potentially leading to a race condition and stuck threads. Now we'll send a fresh list to the compaction code if notes were found, synchronize on the local byte map before making any modifications. Thanks to @thatsafunnyname. Signed-off-by: Chris Larsen <clarsen@yahoo-inc.com>
Hello and thank you for opentsdb,
We are using opentsdb-2.2.0 with mapr-hadoop-core-2.7.0.37549.GA-1.x86_64, both using java-1.7.0-openjdk-1.7.0.101.x86_64, on RHEL 6.6 (kernel 2.6.32-573.12.1.el6.x86_64).
We see the opentsdb process occasionally spinning CPU in user time; logs from querying the api/stats/threads HTTP interface show a thread resizing a HashMap at these times. We do not have a reproducer. We have a monit process that detects the high CPU and restarts opentsdb.
I read http://mailinator.blogspot.co.uk/2009/06/beautiful-race-condition.html and was wondering if we are hitting this race, and if this might explain what we see.
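For readers unfamiliar with the failure mode described in that post: unsynchronized writes from several threads can corrupt a HashMap's bucket table during a resize, and on Java 6/7 a corrupted bucket chain can become circular, so a later get() or resize loops forever. A deliberately unsafe sketch (it may or may not misbehave on any particular run):

```java
import java.util.HashSet;
import java.util.Set;

public class HashSetRaceSketch {
  public static void main(String[] args) throws InterruptedException {
    final Set<Integer> skips = new HashSet<Integer>();  // intentionally not thread-safe

    Runnable writer = new Runnable() {
      public void run() {
        for (int i = 0; i < 1000000; i++) {
          skips.add(i);  // concurrent resizes can corrupt the backing HashMap
        }
      }
    };

    Thread t1 = new Thread(writer);
    Thread t2 = new Thread(writer);
    t1.start();
    t2.start();
    t1.join();
    t2.join();

    // If the race hit, the size is often wrong; in the worst case the
    // writer threads never finish and this line is never reached.
    System.out.println("size=" + skips.size());
  }
}
```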
I will attach the api/stats/threads log files. Some process stats for the opentsdb process (the host has 12 vCPUs):
Thanks in advance for any help you can provide.
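For anyone wanting to capture the same data, a minimal sketch of fetching the thread dump over HTTP; the host and port are assumptions (4242 is the default TSD HTTP port) and the response is simply printed:

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;

public class ThreadStatsFetch {
  public static void main(String[] args) throws Exception {
    // Assumed host/port; adjust to the TSD instance being inspected.
    URL url = new URL("http://localhost:4242/api/stats/threads");
    HttpURLConnection conn = (HttpURLConnection) url.openConnection();
    conn.setRequestMethod("GET");

    BufferedReader in = new BufferedReader(
        new InputStreamReader(conn.getInputStream(), "UTF-8"));
    String line;
    while ((line = in.readLine()) != null) {
      System.out.println(line);  // JSON describing thread states and stack traces
    }
    in.close();
    conn.disconnect();
  }
}
```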
(Whilst looking at the stacks I also noticed that opentsdb is using version 1.4.0 of stumbleupon/async; see opentsdb/third_party/suasync/include.mk line 16 in e2efe9a.)
Thread seen resizing HashMap:
opentsdb/src/core/SaltScanner.java line 309 in e2efe9a
opentsdb/src/core/SaltScanner.java line 420 in e2efe9a
Log files from api/stats/threads:
monster4_3dow_16h_57m_log.txt
monster4_3dow_16h_58m_log.txt
monster4_3dow_16h_59m_log.txt
monster4_3dow_17h_00m_log.txt
The config (some names changed):