Skip to content

fix(bft/abci): remove unnecessary mutex locks in QueryAsync and QuerySync methods #3746

New issue

Have a question about this project? # for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “#”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? # to your account

Closed
wants to merge 16 commits into from

Conversation

omarsy
Copy link
Member

@omarsy omarsy commented Feb 13, 2025

Description

This PR addresses the issue where the RPC server becomes unresponsive due to a global mutex in the query function

Reproduction

To reproduce the issue:

  1. Run: gnodev
  2. use the following example:

main.js:

import { GnoWallet, GnoJSONRPCProvider, decodeTxMessages } from "@gnolang/gno-js-client";
import { MsgRun } from "@gnolang/gno-js-client/bin/proto/gno/vm.js";
import Long from 'long';

const wallet = await GnoWallet.fromMnemonic("source bonus chronic canvas draft south burst lottery vacant surface solve popular case indicate oppose farm nothing bullet exhibit title speed wink action roast");

await wallet.connect(new GnoJSONRPCProvider("http://127.0.0.1:26657"));
const tx = {
    "messages": [
        {
            typeUrl: "/vm.m_run",
            value: MsgRun.encode({
                caller: await wallet.getAddress(),
                send: "",
                package: {
                    "name": "main", "path": "",
                    "files": [{
                        "name": "inf.gno", "body": `package main
                    func main() {
                        for {}
                    }` }]
                }
            }).finish()
        }
    ],
    "fee": { "gasWanted": new Long(3000000000), "gasFee": "1000000ugnot" },
    signatures: []
};

setInterval(async () => {
    try {
        const signedTx = await wallet.signTransaction(tx, decodeTxMessages);
        const res = await wallet.estimateGas(signedTx);
        console.log(res);
    } catch (e) {
        console.error(e);
    }
}, 100);

package.json:

{
    "main": "main.js",
    "type": "module",
    "scripts": {
        "run": "node main.js"
    },
    "dependencies": {
        "@gnolang/gno-js-client": "^1.3.2",
        "ts-proto": "^2.2.0"
    }
}

Running this script (e.g., with npm run run) will repeatedly send a transaction that contains an infinite loop (for {}) in the main package. Over time, this causes the RPC server to become blocked as the querysync function’s mutex prevents other operations from executing.

Fix

This PR removes the mutex from the querysync function, thereby preventing it from blocking the entire RPC server during simulation. With the mutex removed, the RPC server should remain responsive even if a simulation enters an infinite loop.

@github-actions github-actions bot added the 📦 🌐 tendermint v2 Issues or PRs tm2 related label Feb 13, 2025
@Gno2D2 Gno2D2 requested a review from a team February 13, 2025 20:13
@Gno2D2
Copy link
Collaborator

Gno2D2 commented Feb 13, 2025

🛠 PR Checks Summary

All Automated Checks passed. ✅

Manual Checks (for Reviewers):
  • IGNORE the bot requirements for this PR (force green CI check)
Read More

🤖 This bot helps streamline PR reviews by verifying automated checks and providing guidance for contributors and reviewers.

✅ Automated Checks (for Contributors):

🟢 Maintainers must be able to edit this pull request (more info)
🟢 Pending initial approval by a review team member, or review from tech-staff

☑️ Contributor Actions:
  1. Fix any issues flagged by automated checks.
  2. Follow the Contributor Checklist to ensure your PR is ready for review.
    • Add new tests, or document why they are unnecessary.
    • Provide clear examples/screenshots, if necessary.
    • Update documentation, if required.
    • Ensure no breaking changes, or include BREAKING CHANGE notes.
    • Link related issues/PRs, where applicable.
☑️ Reviewer Actions:
  1. Complete manual checks for the PR, including the guidelines and additional checks if applicable.
📚 Resources:
Debug
Automated Checks
Maintainers must be able to edit this pull request (more info)

If

🟢 Condition met
└── 🟢 And
    ├── 🟢 The base branch matches this pattern: ^master$
    └── 🟢 The pull request was created from a fork (head branch repo: omarsy/gno)

Then

🟢 Requirement satisfied
└── 🟢 Maintainer can modify this pull request

Pending initial approval by a review team member, or review from tech-staff

If

🟢 Condition met
└── 🟢 And
    ├── 🟢 The base branch matches this pattern: ^master$
    └── 🟢 Not (🔴 Pull request author is a member of the team: tech-staff)

Then

🟢 Requirement satisfied
└── 🟢 If
    ├── 🟢 Condition
    │   └── 🟢 Or
    │       ├── 🔴 At least 1 user(s) of the organization reviewed the pull request (with state "APPROVED")
    │       ├── 🟢 At least 1 user(s) of the team tech-staff reviewed pull request
    │       └── 🔴 This pull request is a draft
    └── 🟢 Then
        └── 🟢 And
            ├── 🟢 Not (🔴 This label is applied to pull request: review/triage-pending)
            └── 🟢 At least 1 user(s) of the team tech-staff reviewed pull request

Manual Checks
**IGNORE** the bot requirements for this PR (force green CI check)

If

🟢 Condition met
└── 🟢 On every pull request

Can be checked by

  • Any user with comment edit permission

Copy link

codecov bot commented Feb 13, 2025

Codecov Report

Attention: Patch coverage is 72.13115% with 17 lines in your changes missing coverage. Please review.

Files with missing lines Patch % Lines
tm2/pkg/store/iavl/store.go 78.12% 5 Missing and 2 partials ⚠️
tm2/pkg/store/rootmulti/store.go 69.56% 5 Missing and 2 partials ⚠️
tm2/pkg/store/gas/store.go 50.00% 3 Missing ⚠️

📢 Thoughts on this report? Let us know!

@github-actions github-actions bot added the 📦 ⛰️ gno.land Issues or PRs gno.land package related label Feb 16, 2025
@zivkovicmilos zivkovicmilos self-requested a review February 17, 2025 17:42
@omarsy omarsy force-pushed the not-use-mutex branch 2 times, most recently from a860145 to 57142dc Compare March 1, 2025 20:22
@omarsy omarsy requested a review from sw360cab March 11, 2025 19:10
@omarsy omarsy marked this pull request as ready for review March 12, 2025 20:03
@Gno2D2 Gno2D2 added the review/triage-pending PRs opened by external contributors that are waiting for the 1st review label Mar 12, 2025
@omarsy omarsy requested a review from thehowl March 20, 2025 09:54
Copy link
Member

@thehowl thehowl left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we need a test for this somehow. probably a bit complex, but we need weird tests that check for weird conditions.

also i'm unsure about this as a whole. many of the mutex wraps are unnecessary (esp on the databases); I think we have a more general problem that due to a lack of "transactionality" support that the state of a running block execution can affect the state of queryAsync.

can you make a test that works through maybe two blocks, trying to send in transactions and queries in a random order and with random timing, and see if there's something unexpected?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it necessary for boltdb?

Copy link
Member Author

@omarsy omarsy Apr 2, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes right, nice catch ^^ not necessary for db that use transaction ^^
Done here: 2f5aa5e

@@ -25,7 +26,8 @@ func init() {
var _ db.DB = (*GoLevelDB)(nil)

type GoLevelDB struct {
db *leveldb.DB
db *leveldb.DB
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should also be safe, no?

// The returned DB instance is safe for concurrent use. Which mean that all
// DB's methods may be called concurrently from multiple goroutine.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes nice catch ^^

Comment on lines 147 to 148
store.mtx.Lock()
defer store.mtx.Unlock()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

doesn't access anything from the underlying store, so the mutex wrap is not necessary here

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nice catch ^^

@Gno2D2 Gno2D2 removed the review/triage-pending PRs opened by external contributors that are waiting for the 1st review label Mar 21, 2025
@aeddi
Copy link
Contributor

aeddi commented Mar 24, 2025

While investigating this issue / fix, I realized that the ⁠socket hang up error still occurred with this fix after a few seconds, so I did some more benchs and here's the results FYI:

Branch (Req. interval) Master (10ms) With fix (10ms) Master (100ms) With fix (100ms) Master (500ms) With fix (500ms)
Success / Error (1000 req.) 20 / 980 2 / 998 45 / 955 71 / 929 442 / 558 1000 / 0
Duration (1000 req.) 29.075 33.348 2:15.653 3:22.665 8:32.523 8:22.026

The fix seems to make the issue worse at low intervals (10 ms), to improve the error / success ratio at the default script interval (100 ms), and seems to completely fix the problem at higher interval (500 ms).

Probably because at low intervals, the bottleneck is no longer due to the mutexes but to the CPU, which, being overloaded, takes too long to respond.

I'm not quite sure how to conclude with this data, I'll take another look at the race condition part this morning to see if I spot an issue.

@Kouteki Kouteki moved this from Triage to In Review in 🧙‍♂️gno.land core team Mar 25, 2025
@kristovatlas kristovatlas added the security Security-sensitive issue label Mar 28, 2025
@thehowl
Copy link
Member

thehowl commented Apr 3, 2025

@omarsy are you willing to take up the test system I mentioned?

@omarsy
Copy link
Member Author

omarsy commented Apr 3, 2025

@omarsy are you willing to take up the test system I mentioned?

Yes I am trying to do a test but complicated to do. If you know how to do it I am open for help

@thehowl
Copy link
Member

thehowl commented Apr 3, 2025

I asked @zivkovicmilos to take a look at this, because I think there's an underlying issue relating also to how we connect and use the database:

my fear is that because our DB usage is not remotely transactional, this leads to many inconsistent states
and that until we implement proper DB transactionality (so that earlier connections don't see results of later connections, leading to inconsistent states) we cannot remove the lock on the app
and if anything, we can look to make it a RWMutex for some marginal improvements

@zivkovicmilos
Copy link
Member

zivkovicmilos commented May 26, 2025

I've been digging into this, and I concluded it's just too touchy of a change to merge. Newer comet versions (>38) expose an UnsynchronizedLocalClient that chains can opt into (it drops the mux entirely from the client), and forces the application to deal with concurrency

The SDK’s keepers, caches, and the IAVL implementation we have were written under the assumption of single-threaded access, and just removing the outer mux without a full audit of every mutated path introduces more unknowns and kinder surprises that we can't see immediately

# for free to join this conversation on GitHub. Already have an account? # to comment
Labels
📦 🌐 tendermint v2 Issues or PRs tm2 related 📦 ⛰️ gno.land Issues or PRs gno.land package related security Security-sensitive issue
Projects
Status: Done
Development

Successfully merging this pull request may close these issues.

6 participants