Skip to content
New issue

Have a question about this project? # for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “#”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? # to your account

Recovery fails to commit dangling transactions if there are no in-flight transactions #53

Open
maxant opened this issue Oct 2, 2015 · 3 comments

Comments

@maxant
Copy link

maxant commented Oct 2, 2015

Test Case (BTM 2.1.4):

  • kill database before commit command can be sent to it (using debugger).
  • let thread continue to run
  • restart database
  • during recovery database reports transaction ID which is incomplete
  • Recoverer calls TransactionManagerServices.getTransactionManager().getOldestInFlightTransactionTimestamp() because transaction manager is running (line 137)
  • There are no in-flight transactions and so the TM returns Long.MIN_VALUE
  • Because the oldestInFlightTransactionTimestamp is now so low, the code on line 289 skips the dangling transaction which still needs to be committed.

The result is that the resource never has "commit" called again, unless BTM is restarted, or randomly a transaction is in-flight during recovery.

@maxant
Copy link
Author

maxant commented Oct 2, 2015

I think a possible patch for this could be to change the value which BitronixTransactionManager#getOldestInFlightTransactionTimestamp() returns when there are no in-flight transactions. Instead of returning Long.MIN_VALUE, it should return Long.MAX_VALUE, because you want to recovery all dangling transactions.

public long getOldestInFlightTransactionTimestamp() {
    if (inFlightTransactions.isEmpty()) {
        if (log.isDebugEnabled()) log.debug("oldest in-flight transaction's timestamp: " + Long.MIN_VALUE);
        return Long.MIN_VALUE; // CHANGE THIS VALUE?
    }

@maxant
Copy link
Author

maxant commented Oct 2, 2015

@lorban
Copy link
Contributor

lorban commented Oct 4, 2015

This specific check in BitronixTransactionManager.getOldestInFlightTransactionTimestamp() that makes it return Long.MIN_VALUE was added for a purpose. It took me time to remember why, but I think I found the original reason back.

The Recoverer is the only consumer of this API, and it uses this timestamp to avoid stepping on the toes of the 2PC engine while it's still running on in-flight transactions. Basically, any transaction with a timestamp older than or equal to the oldest in-flight transaction's timestamp is considered in-flight and ignored by the recoverer.

There's of course a special case: when there is no in-flight transaction. If we blindly returned Long.MAX_VALUE then all transaction would become a target for the recoverer, including the ones that started right after getOldestInFlightTransactionTimestamp() is read by the recoverer but before the recoverer started its job. This would create a race condition where the recoverer could conflict with in-flight transactions. This explains why Long.MIN_VALUE is returned in such case: to avoid that race condition.

As you noticed, there is that ill-effect that if there isn't any activity, the recoverer will postpone its job forever. An easy fix would be to return MonotonicClock.currentTimeMillis() instead of Long.MIN_VALUE and explicitly state that in getOldestInFlightTransactionTimestamp()'s javadoc.

# for free to join this conversation on GitHub. Already have an account? # to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants