Fix thread deadlocks when running metasync for more than 5 minutes. (#283) #284

Closed
wants to merge 1 commit into from

Conversation

jmangs
Contributor

@jmangs jmangs commented Feb 21, 2014

Fix for Issue: #283

If you have a large enough dataset when running metasync with OpenTSDB 2.0, it will eventually hang and simply stop executing. Taking a thread dump afterwards, you can see there's an I/O worker thread waiting for tree_lock to become free:

"New I/O worker #3" prio=10 tid=0x00007f61a0e9e000 nid=0x747 waiting on condition [0x00007f6165e85000]
java.lang.Thread.State: WAITING (parking)
at sun.misc.Unsafe.park(Native Method)
- parking to wait for (a java.util.concurrent.locks.ReentrantLock$NonfairSync)
at java.util.concurrent.locks.LockSupport.park(LockSupport.java:186)
at java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:834)
at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireQueued(AbstractQueuedSynchronizer.java:867)
at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquire(AbstractQueuedSynchronizer.java:1197)
at java.util.concurrent.locks.ReentrantLock$NonfairSync.lock(ReentrantLock.java:214)
at java.util.concurrent.locks.ReentrantLock.lock(ReentrantLock.java:290)
at net.opentsdb.tree.TreeBuilder.processAllTrees(TreeBuilder.java:535)
at net.opentsdb.core.TSDB.processTSMetaThroughTrees(TSDB.java:961)
at net.opentsdb.meta.TSMeta$1TSMetaCB$1FetchNewCB.call(TSMeta.java:574)
at net.opentsdb.meta.TSMeta$1TSMetaCB$1FetchNewCB.call(TSMeta.java:565)
at com.stumbleupon.async.Deferred.doCall(Deferred.java:1278)
at com.stumbleupon.async.Deferred.runCallbacks(Deferred.java:1257)
at com.stumbleupon.async.Deferred.handleContinuation(Deferred.java:1313)
at com.stumbleupon.async.Deferred.doCall(Deferred.java:1284)
at com.stumbleupon.async.Deferred.runCallbacks(Deferred.java:1257)
at com.stumbleupon.async.Deferred.handleContinuation(Deferred.java:1313)
at com.stumbleupon.async.Deferred.doCall(Deferred.java:1284)
at com.stumbleupon.async.Deferred.runCallbacks(Deferred.java:1257)
at com.stumbleupon.async.Deferred.callback(Deferred.java:1005)
at org.hbase.async.HBaseRpc.callback(HBaseRpc.java:450)
at org.hbase.async.RegionClient.decode(RegionClient.java:1185)
at org.hbase.async.RegionClient.decode(RegionClient.java:82)
at org.jboss.netty.handler.codec.replay.ReplayingDecoder.callDecode(ReplayingDecoder.java:500)
at org.jboss.netty.handler.codec.replay.ReplayingDecoder.messageReceived(ReplayingDecoder.java:435)
at org.jboss.netty.channel.SimpleChannelUpstreamHandler.handleUpstream(SimpleChannelUpstreamHandler.java:70)
at org.hbase.async.RegionClient.handleUpstream(RegionClient.java:1008)
at org.jboss.netty.channel.DefaultChannelPipeline.sendUpstream(DefaultChannelPipeline.java:560)
at org.jboss.netty.channel.DefaultChannelPipeline.sendUpstream(DefaultChannelPipeline.java:555)
at org.hbase.async.HBaseClient$RegionClientPipeline.sendUpstream(HBaseClient.java:2431)
at org.jboss.netty.channel.Channels.fireMessageReceived(Channels.java:268)
at org.jboss.netty.channel.Channels.fireMessageReceived(Channels.java:255)
at org.jboss.netty.channel.socket.nio.NioWorker.read(NioWorker.java:88)
at org.jboss.netty.channel.socket.nio.AbstractNioWorker.process(AbstractNioWorker.java:107)
at org.jboss.netty.channel.socket.nio.AbstractNioSelector.run(AbstractNioSelector.java:312)
at org.jboss.netty.channel.socket.nio.AbstractNioWorker.run(AbstractNioWorker.java:88)
at org.jboss.netty.channel.socket.nio.NioWorker.run(NioWorker.java:178)
at org.jboss.netty.util.ThreadRenamingRunnable.run(ThreadRenamingRunnable.java:108)
at org.jboss.netty.util.internal.DeadLockProofWorker$1.run(DeadLockProofWorker.java:42)

Looking at the code immediately following the lock acquisition, whenever more than 300 seconds have passed since the last tree load it attempts to reload the tree data through a deferred call:

    // if we haven't loaded our trees in a while or we've just started, load
    if (((System.currentTimeMillis() / 1000) - last_tree_load) > 300) {
      final Deferred<List<Tree>> load_deferred = Tree.fetchAllTrees(tsdb)
        .addCallback(new FetchedTreesCB()).addErrback(new ErrorCB());
      last_tree_load = (System.currentTimeMillis() / 1000);
      return load_deferred.addCallbackDeferring(new ProcessTreesCB());
    }

However, only ErrorCB has an unlock call, and that errback only runs if FetchedTreesCB throws an exception. The fix is to call unlock() before returning local_trees in FetchedTreesCB.

I've also added an unlock for the case where the tree list is empty, since that path returned without giving up the lock.
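
Roughly, the change looks like the sketch below. It's an illustration of the two unlock paths rather than the exact diff; it assumes the code sits inside TreeBuilder where tree_lock is the ReentrantLock from the thread dump and trees is the cached tree list, and the callback's generic types are written from memory, so they may not match the committed code line for line.

    final class FetchedTreesCB implements Callback<List<Tree>, List<Tree>> {
      @Override
      public List<Tree> call(final List<Tree> local_trees) throws Exception {
        // Release tree_lock on the success path; previously only ErrorCB
        // unlocked, and only when this callback threw an exception.
        tree_lock.unlock();
        return local_trees;
      }
    }

    // Later in processAllTrees(): the early return for an empty tree list
    // also has to give up the lock before bailing out.
    if (trees == null || trees.isEmpty()) {
      tree_lock.unlock();
      return Deferred.fromResult(true);
    }

Either way, every path out of processAllTrees() that acquired tree_lock now releases it, so the next I/O worker that calls into the tree builder no longer parks forever.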

With this code change on our local OpenTSDB 2.0 installation, we're now able to run metasync for more than 5 minutes without it deadlocking on us.

@manolama
Member

Pushed, thanks!

@manolama manolama closed this Feb 26, 2014
@jmangs jmangs deleted the tree-builder-deadlock branch February 6, 2015 13:30