fix(bft/abci): remove unnecessary mutex locks in QueryAsync and QuerySync methods #3746
Conversation
We need a test for this somehow. Probably a bit complex, but we need weird tests that check for weird conditions.
Also, I'm unsure about this as a whole. Many of the mutex wraps are unnecessary (especially on the databases); I think we have a more general problem: due to a lack of "transactionality" support, the state of a running block execution can affect the state of queryAsync.
Can you make a test that works through maybe two blocks, trying to send in transactions and queries in a random order and with random timing, and see if there's something unexpected?
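A rough sketch of the kind of interleaved test being asked for, under stated assumptions: `sendTx` and `query` are hypothetical stand-in hooks, not existing tm2 helpers, and a real test would wire them to an in-process node (e.g. broadcast_tx_commit and abci_query).

```go
package abcitest

import (
	"math/rand"
	"sync"
	"testing"
	"time"
)

// stressTxAndQuery fires transactions and queries in random order with
// random timing, spread over roughly two block times, and fails the test
// if any call returns an error.
func stressTxAndQuery(t *testing.T, sendTx, query func() error) {
	t.Helper()

	var wg sync.WaitGroup
	for i := 0; i < 20; i++ {
		wg.Add(1)
		go func(i int) {
			defer wg.Done()
			// Random delay so calls land before, during, and after block execution.
			time.Sleep(time.Duration(rand.Intn(2000)) * time.Millisecond)
			var err error
			if i%2 == 0 {
				err = sendTx()
			} else {
				err = query()
			}
			if err != nil {
				t.Errorf("call %d failed: %v", i, err)
			}
		}(i)
	}
	wg.Wait()
}
```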
tm2/pkg/db/boltdb/boltdb.go
Outdated
Is it necessary for boltdb?
Yes, right, nice catch ^^ Not necessary for DBs that use transactions ^^
Done here: 2f5aa5e
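For context, a minimal sketch of why an extra mutex is redundant for boltdb, assuming the `go.etcd.io/bbolt` backend and a hypothetical bucket name: bbolt serializes write transactions internally and allows concurrent read transactions.

```go
package main

import (
	"log"

	bolt "go.etcd.io/bbolt"
)

func main() {
	// Assumed import path and file name; the point is that bbolt wraps every
	// access in its own transaction, so the adapter needs no extra mutex.
	db, err := bolt.Open("example.db", 0o600, nil)
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	// View starts a read-only transaction; many can run concurrently while
	// bbolt serializes Update (write) transactions on its own.
	err = db.View(func(tx *bolt.Tx) error {
		if b := tx.Bucket([]byte("example-bucket")); b != nil {
			_ = b.Get([]byte("key"))
		}
		return nil
	})
	if err != nil {
		log.Fatal(err)
	}
}
```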
tm2/pkg/db/goleveldb/go_level_db.go
Outdated
@@ -25,7 +26,8 @@ func init() {
var _ db.DB = (*GoLevelDB)(nil)

type GoLevelDB struct {
	db *leveldb.DB
This should also be safe, no?
// The returned DB instance is safe for concurrent use. Which mean that all
// DB's methods may be called concurrently from multiple goroutine.
yes nice catch ^^
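A small sketch of what that documentation guarantee means in practice, assuming the upstream `github.com/syndtr/goleveldb/leveldb` package (illustrative only, not the tm2 wrapper):

```go
package main

import (
	"fmt"
	"sync"

	"github.com/syndtr/goleveldb/leveldb"
)

func main() {
	db, err := leveldb.OpenFile("example_leveldb", nil)
	if err != nil {
		panic(err)
	}
	defer db.Close()

	// The *leveldb.DB returned by OpenFile is documented as safe for
	// concurrent use, so no external mutex is needed around Put/Get.
	var wg sync.WaitGroup
	for i := 0; i < 4; i++ {
		wg.Add(1)
		go func(i int) {
			defer wg.Done()
			key := []byte(fmt.Sprintf("k%d", i))
			_ = db.Put(key, []byte("v"), nil)
			_, _ = db.Get(key, nil)
		}(i)
	}
	wg.Wait()
}
```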
tm2/pkg/store/cache/store.go
Outdated
store.mtx.Lock()
defer store.mtx.Unlock()
This doesn't access anything from the underlying store, so the mutex wrap is not necessary here.
nice catch ^^
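To illustrate the point (this is not the actual tm2 cache store, just a sketch of the principle): a method that neither reads nor writes state shared with other goroutines gains nothing from holding the store mutex.

```go
package cache

import "sync"

// Store is a simplified stand-in for a cache-wrapping store.
type Store struct {
	mtx    sync.Mutex
	cache  map[string][]byte
	parent map[string][]byte // stands in for the underlying store
}

// NewOverlay builds a fresh, empty overlay. It does not touch s.cache or
// s.parent, so taking s.mtx here would only serialize callers for no benefit.
func (s *Store) NewOverlay() *Store {
	return &Store{
		cache: make(map[string][]byte),
	}
}
```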
While investigating this issue / fix, I realized that the
The fix seems to make the issue worse at low intervals (10 ms), to improve the error/success ratio at the default script interval (100 ms), and to completely fix the problem at higher intervals (500 ms). This is probably because at low intervals the bottleneck is no longer the mutexes but the CPU, which, being overloaded, takes too long to respond. I'm not quite sure how to conclude from this data; I'll take another look at the race condition part this morning to see if I spot an issue.
@omarsy are you willing to take up the test system I mentioned?
Yes, I am trying to write a test, but it is complicated. If you know how to do it, I am open to help.
I asked @zivkovicmilos to take a look at this, because I think there's an underlying issue also relating to how we connect to and use the database:
I've been digging into this, and I concluded it's just too touchy of a change to merge. Newer comet versions (>38) expose an [...]. The SDK's keepers, caches, and the IAVL implementation we have were written under the assumption of single-threaded access, and just removing the outer mutex without a full audit of every mutated path introduces more unknowns and surprises that we can't see immediately.
Description
This PR addresses the issue where the RPC server becomes unresponsive due to a global mutex in the query function.
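A minimal sketch of the blocking pattern being described, with illustrative type and method names (the real code lives under tm2/pkg/bft/abci and differs in detail):

```go
package abcisketch

import "sync"

// Simplified stand-ins for the real ABCI types.
type (
	RequestQuery  struct{ Path string }
	ResponseQuery struct{ Value []byte }
	Application   interface {
		Query(RequestQuery) ResponseQuery
	}
)

// localClient holds one mutex around every ABCI call, mirroring the
// situation described above.
type localClient struct {
	mtx sync.Mutex
	app Application
}

// QuerySync blocks on the shared mutex. If app.Query never returns (for
// example while simulating a transaction that spins in `for {}`), every
// later QuerySync call parks on c.mtx and the whole RPC server appears hung.
func (c *localClient) QuerySync(req RequestQuery) ResponseQuery {
	c.mtx.Lock()
	defer c.mtx.Unlock()
	return c.app.Query(req)
}
```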
Reproduction
To reproduce the issue:
1. Start `gnodev`.
2. main.js:
3. package.json:
Running this script (e.g., with `npm run run`) will repeatedly send a transaction that contains an infinite loop (`for {}`) in the `main` package. Over time, this causes the RPC server to become blocked as the `QuerySync` function's mutex prevents other operations from executing.
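The actual main.js and package.json used in the report are not reproduced here; as a rough illustration, the offending on-chain code could be as simple as a Gno package like the following (hypothetical names):

```go
// Hypothetical Gno package used only to illustrate the reproduction.
package main

// Spin never returns. Submitting a transaction that calls it (or simulating
// such a transaction) keeps the VM busy indefinitely, and with a global
// mutex around queries this stalls every other RPC request.
func Spin() {
	for {
	}
}
```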
Fix

This PR removes the mutex from the `QuerySync` function, thereby preventing it from blocking the entire RPC server during simulation. With the mutex removed, the RPC server should remain responsive even if a simulation enters an infinite loop.