Fix thread deadlocks when running metasync for more than 5 minutes. (#283) #284

jmangs · 2014-02-21T19:36:27Z

Fix for Issue: #283

If you run metasync against a large enough dataset with OpenTSDB 2.0, it eventually hangs and stops making progress. A thread dump taken afterwards shows an I/O worker thread waiting for tree_lock to be released:

"New I/O worker #3" prio=10 tid=0x00007f61a0e9e000 nid=0x747 waiting on condition [0x00007f6165e85000]
java.lang.Thread.State: WAITING (parking)
at sun.misc.Unsafe.park(Native Method)
- parking to wait for (a java.util.concurrent.locks.ReentrantLock$NonfairSync)
at java.util.concurrent.locks.LockSupport.park(LockSupport.java:186)
at java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:834)
at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireQueued(AbstractQueuedSynchronizer.java:867)
at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquire(AbstractQueuedSynchronizer.java:1197)
at java.util.concurrent.locks.ReentrantLock$NonfairSync.lock(ReentrantLock.java:214)
at java.util.concurrent.locks.ReentrantLock.lock(ReentrantLock.java:290)
at net.opentsdb.tree.TreeBuilder.processAllTrees(TreeBuilder.java:535)
at net.opentsdb.core.TSDB.processTSMetaThroughTrees(TSDB.java:961)
at net.opentsdb.meta.TSMeta$1TSMetaCB$1FetchNewCB.call(TSMeta.java:574)
at net.opentsdb.meta.TSMeta$1TSMetaCB$1FetchNewCB.call(TSMeta.java:565)
at com.stumbleupon.async.Deferred.doCall(Deferred.java:1278)
at com.stumbleupon.async.Deferred.runCallbacks(Deferred.java:1257)
at com.stumbleupon.async.Deferred.handleContinuation(Deferred.java:1313)
at com.stumbleupon.async.Deferred.doCall(Deferred.java:1284)
at com.stumbleupon.async.Deferred.runCallbacks(Deferred.java:1257)
at com.stumbleupon.async.Deferred.handleContinuation(Deferred.java:1313)
at com.stumbleupon.async.Deferred.doCall(Deferred.java:1284)
at com.stumbleupon.async.Deferred.runCallbacks(Deferred.java:1257)
at com.stumbleupon.async.Deferred.callback(Deferred.java:1005)
at org.hbase.async.HBaseRpc.callback(HBaseRpc.java:450)
at org.hbase.async.RegionClient.decode(RegionClient.java:1185)
at org.hbase.async.RegionClient.decode(RegionClient.java:82)
at org.jboss.netty.handler.codec.replay.ReplayingDecoder.callDecode(ReplayingDecoder.java:500)
at org.jboss.netty.handler.codec.replay.ReplayingDecoder.messageReceived(ReplayingDecoder.java:435)
at org.jboss.netty.channel.SimpleChannelUpstreamHandler.handleUpstream(SimpleChannelUpstreamHandler.java:70)
at org.hbase.async.RegionClient.handleUpstream(RegionClient.java:1008)
at org.jboss.netty.channel.DefaultChannelPipeline.sendUpstream(DefaultChannelPipeline.java:560)
at org.jboss.netty.channel.DefaultChannelPipeline.sendUpstream(DefaultChannelPipeline.java:555)
at org.hbase.async.HBaseClient$RegionClientPipeline.sendUpstream(HBaseClient.java:2431)
at org.jboss.netty.channel.Channels.fireMessageReceived(Channels.java:268)
at org.jboss.netty.channel.Channels.fireMessageReceived(Channels.java:255)
at org.jboss.netty.channel.socket.nio.NioWorker.read(NioWorker.java:88)
at org.jboss.netty.channel.socket.nio.AbstractNioWorker.process(AbstractNioWorker.java:107)
at org.jboss.netty.channel.socket.nio.AbstractNioSelector.run(AbstractNioSelector.java:312)
at org.jboss.netty.channel.socket.nio.AbstractNioWorker.run(AbstractNioWorker.java:88)
at org.jboss.netty.channel.socket.nio.NioWorker.run(NioWorker.java:178)
at org.jboss.netty.util.ThreadRenamingRunnable.run(ThreadRenamingRunnable.java:108)
at org.jboss.netty.util.internal.DeadLockProofWorker$1.run(DeadLockProofWorker.java:42)

Looking at the code immediately following the lock call, once more than 300 seconds have passed since the last tree load, it reloads the tree data through a deferred call:

    // if we haven't loaded our trees in a while or we've just started, load
    if (((System.currentTimeMillis() / 1000) - last_tree_load) > 300) {
      final Deferred<List<Tree>> load_deferred = Tree.fetchAllTrees(tsdb)
        .addCallback(new FetchedTreesCB()).addErrback(new ErrorCB());
      last_tree_load = (System.currentTimeMillis() / 1000);
      return load_deferred.addCallbackDeferring(new ProcessTreesCB());
    }

However, only ErrorCB contains an unlock call, and it is only executed if an exception is thrown in FetchedTreesCB. On a successful reload the lock is never released. The fix is to call unlock() before returning local_trees in FetchedTreesCB.

I've also added an unlock for the case where the trees are empty, since that path returns without giving up the lock.
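
To illustrate the pattern, here is a minimal, self-contained sketch of a lock that is taken before an asynchronous reload and released only in the error callback. It is not the TreeBuilder code: it assumes java.util.concurrent.CompletableFuture as a stand-in for com.stumbleupon.async.Deferred, and the class TreeLockSketch, its methods, and the unlockOnSuccess flag are hypothetical names used only for this example. The future is already completed, so the callbacks run on the calling thread, which is the thread that owns the lock.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.CompletionException;
import java.util.concurrent.locks.ReentrantLock;

// Hypothetical stand-in for the tree reload path; not the OpenTSDB classes.
class TreeLockSketch {
    final ReentrantLock treeLock = new ReentrantLock();

    // Models FetchedTreesCB: copies the fetched trees and, in the fixed
    // variant, releases the lock before returning them.
    private List<String> onFetched(List<String> fetched, boolean unlockOnSuccess) {
        List<String> localTrees = new ArrayList<>(fetched);
        if (unlockOnSuccess) {
            treeLock.unlock();
        }
        return localTrees;
    }

    // Models ErrorCB: the only place the lock was released before the fix.
    private List<String> onError(Throwable e) {
        treeLock.unlock();
        throw new CompletionException(e);
    }

    // Takes the lock, then chains the success and error callbacks.
    // completedFuture(...) is an assumption that keeps every callback on
    // the calling thread, so unlock() is called by the lock's owner.
    CompletableFuture<List<String>> reloadTrees(List<String> stored,
                                                boolean unlockOnSuccess) {
        treeLock.lock();
        return CompletableFuture.completedFuture(stored)
            .thenApply(trees -> onFetched(trees, unlockOnSuccess))
            .exceptionally(this::onError);
    }

    public static void main(String[] args) {
        List<String> stored = Arrays.asList("tree-1");

        // Before the fix: the success path returns with the lock still held.
        TreeLockSketch buggy = new TreeLockSketch();
        buggy.reloadTrees(stored, false).join();
        System.out.println("buggy variant still locked: " + buggy.treeLock.isLocked());

        // After the fix: the success path releases the lock.
        TreeLockSketch fixed = new TreeLockSketch();
        fixed.reloadTrees(stored, true).join();
        System.out.println("fixed variant still locked: " + fixed.treeLock.isLocked());
    }
}
```

In the first variant the lock stays held after a successful reload, so any other thread that later calls lock() parks indefinitely, which matches the state shown in the thread dump. The early return for an empty tree list needs the same treatment.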

With this code change on our local OpenTSDB 2.0 installation, we're now able to run metasync for more than 5 minutes without it deadlocking on us.

manolama · 2014-02-26T03:25:28Z

Pushed, thanks!

Fix thread deadlocks when running metasync for more than 5 minutes.

06238bc

manolama closed this Feb 26, 2014

manolama mentioned this pull request Feb 26, 2014

Metasync from RC2.0 hangs after running for 5 minutes #283

Closed

jmangs deleted the tree-builder-deadlock branch February 6, 2015 13:30
