Fix thread deadlocks when running metasync for more than 5 minutes. (#283) #284
Fix for Issue: #283
With a large enough dataset, running metasync against OpenTSDB 2.0 eventually hangs and stops making progress. A thread dump taken afterwards shows an I/O worker thread waiting for `tree_lock` to become free.
Looking at the code immediately following the lock, once 300 seconds have passed since the last tree load, it attempts to reload the tree data through a deferred call.
However, only `ErrorCB` has an unlock call, and it is only executed if an exception is thrown in `FetchedTreesCB`. The fix is to call `unlock()` before returning `local_trees` in `FetchedTreesCB`. I've also added an unlock for the case where the trees are empty, since that path returns without giving up the lock.
With this code change on our local OpenTSDB 2.0 installation, we're now able to run metasync for more than 5 minutes without it deadlocking on us.
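
For reviewers who want the locking pattern in isolation, here is a minimal, synchronous sketch of the idea behind the fix: every path out of the success callback releases the lock, not just the error path. It is not the OpenTSDB code. The class `TreeReloadSketch`, the methods `reloadTrees`, `onTreesFetched` and `onError`, and the use of `List<String>` in place of real tree objects are all hypothetical stand-ins, and the deferred call is replaced by a plain method call on a single thread.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.locks.ReentrantLock;

public class TreeReloadSketch {
    // Stand-in for the tree_lock mentioned in the description (assumed to be a ReentrantLock).
    static final ReentrantLock treeLock = new ReentrantLock();

    // Hypothetical stand-in for FetchedTreesCB; runs while treeLock is held.
    static List<String> onTreesFetched(List<String> fetched) {
        if (fetched.isEmpty()) {
            treeLock.unlock(); // the early return must give up the lock as well
            return fetched;
        }
        List<String> localTrees = new ArrayList<>(fetched);
        treeLock.unlock();     // release before returning the loaded trees
        return localTrees;
    }

    // Hypothetical stand-in for ErrorCB; before the fix this was the only unlocking path.
    static List<String> onError(RuntimeException e) {
        treeLock.unlock();
        return new ArrayList<>();
    }

    // Takes the lock, then hands off to the success or error callback.
    static List<String> reloadTrees(List<String> source) {
        treeLock.lock();
        try {
            return onTreesFetched(source);
        } catch (RuntimeException e) {
            return onError(e);
        }
    }

    public static void main(String[] args) {
        List<String> trees = new ArrayList<>();
        trees.add("tree-1");
        reloadTrees(trees);             // success path
        reloadTrees(new ArrayList<>()); // empty path
        reloadTrees(null);              // error path (NullPointerException inside the callback)
        System.out.println("lock still held: " + treeLock.isLocked());
    }
}
```

If either `unlock()` call in `onTreesFetched` is removed, the lock's hold count never returns to zero after that path runs, and any other thread calling `lock()` blocks indefinitely, which matches the hang described above. In purely synchronous code a `try`/`finally` around the critical section would be the more usual way to guarantee the release; the explicit calls are kept here only to mirror the shape of the change in this PR.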