fix(bft/abci): remove unnecessary mutex locks in QueryAsync and QuerySync methods #3746

omarsy · 2025-02-13T20:12:30Z

Description

This PR addresses the issue where the RPC server becomes unresponsive due to a global mutex in the query function

Reproduction

To reproduce the issue:

Run: gnodev
use the following example:

main.js:

import { GnoWallet, GnoJSONRPCProvider, decodeTxMessages } from "@gnolang/gno-js-client";
import { MsgRun } from "@gnolang/gno-js-client/bin/proto/gno/vm.js";
import Long from 'long';

const wallet = await GnoWallet.fromMnemonic("source bonus chronic canvas draft south burst lottery vacant surface solve popular case indicate oppose farm nothing bullet exhibit title speed wink action roast");

await wallet.connect(new GnoJSONRPCProvider("http://127.0.0.1:26657"));
const tx = {
    "messages": [
        {
            typeUrl: "/vm.m_run",
            value: MsgRun.encode({
                caller: await wallet.getAddress(),
                send: "",
                package: {
                    "name": "main", "path": "",
                    "files": [{
                        "name": "inf.gno", "body": `package main
                    func main() {
                        for {}
                    }` }]
                }
            }).finish()
        }
    ],
    "fee": { "gasWanted": new Long(3000000000), "gasFee": "1000000ugnot" },
    signatures: []
};

setInterval(async () => {
    try {
        const signedTx = await wallet.signTransaction(tx, decodeTxMessages);
        const res = await wallet.estimateGas(signedTx);
        console.log(res);
    } catch (e) {
        console.error(e);
    }
}, 100);

package.json:

{
    "main": "main.js",
    "type": "module",
    "scripts": {
        "run": "node main.js"
    },
    "dependencies": {
        "@gnolang/gno-js-client": "^1.3.2",
        "ts-proto": "^2.2.0"
    }
}

Running this script (e.g., with npm run run) will repeatedly send a transaction that contains an infinite loop (for {}) in the main package. Over time, this causes the RPC server to become blocked as the querysync function’s mutex prevents other operations from executing.

Fix

This PR removes the mutex from the querysync function, thereby preventing it from blocking the entire RPC server during simulation. With the mutex removed, the RPC server should remain responsive even if a simulation enters an infinite loop.

…Sync methods

Gno2D2 · 2025-02-13T20:13:34Z

🛠 PR Checks Summary

All Automated Checks passed. ✅

Manual Checks (for Reviewers):

IGNORE the bot requirements for this PR (force green CI check)

✅ Automated Checks (for Contributors):

🟢 Maintainers must be able to edit this pull request (more info)
🟢 Pending initial approval by a review team member, or review from tech-staff

☑️ Contributor Actions:

Fix any issues flagged by automated checks.
Follow the Contributor Checklist to ensure your PR is ready for review.
- Add new tests, or document why they are unnecessary.
- Provide clear examples/screenshots, if necessary.
- Update documentation, if required.
- Ensure no breaking changes, or include BREAKING CHANGE notes.
- Link related issues/PRs, where applicable.

☑️ Reviewer Actions:

Complete manual checks for the PR, including the guidelines and additional checks if applicable.

📚 Resources:

Debug

Automated Checks

Maintainers must be able to edit this pull request (more info)

If

🟢 Condition met
└── 🟢 And
    ├── 🟢 The base branch matches this pattern: ^master$
    └── 🟢 The pull request was created from a fork (head branch repo: omarsy/gno)

Then

🟢 Requirement satisfied
└── 🟢 Maintainer can modify this pull request

Pending initial approval by a review team member, or review from tech-staff

If

🟢 Condition met
└── 🟢 And
    ├── 🟢 The base branch matches this pattern: ^master$
    └── 🟢 Not (🔴 Pull request author is a member of the team: tech-staff)

Then

🟢 Requirement satisfied
└── 🟢 If
    ├── 🟢 Condition
    │   └── 🟢 Or
    │       ├── 🔴 At least 1 user(s) of the organization reviewed the pull request (with state "APPROVED")
    │       ├── 🟢 At least 1 user(s) of the team tech-staff reviewed pull request
    │       └── 🔴 This pull request is a draft
    └── 🟢 Then
        └── 🟢 And
            ├── 🟢 Not (🔴 This label is applied to pull request: review/triage-pending)
            └── 🟢 At least 1 user(s) of the team tech-staff reviewed pull request

Manual Checks

**IGNORE** the bot requirements for this PR (force green CI check)
If
🟢 Condition met
└── 🟢 On every pull request
Can be checked by

Any user with comment edit permission

codecov · 2025-02-13T20:15:51Z

Codecov Report

Attention: Patch coverage is 72.13115% with 17 lines in your changes missing coverage. Please review.

Files with missing lines	Patch %	Lines
tm2/pkg/store/iavl/store.go	78.12%	5 Missing and 2 partials ⚠️
tm2/pkg/store/rootmulti/store.go	69.56%	5 Missing and 2 partials ⚠️
tm2/pkg/store/gas/store.go	50.00%	3 Missing ⚠️

📢 Thoughts on this report? Let us know!

…tations

thehowl

we need a test for this somehow. probably a bit complex, but we need weird tests that check for weird conditions.

also i'm unsure about this as a whole. many of the mutex wraps are unnecessary (esp on the databases); I think we have a more general problem that due to a lack of "transactionality" support that the state of a running block execution can affect the state of queryAsync.

can you make a test that works through maybe two blocks, trying to send in transactions and queries in a random order and with random timing, and see if there's something unexpected?

thehowl · 2025-03-21T20:13:02Z

tm2/pkg/db/boltdb/boltdb.go

Is it necessary for boltdb?

yes right, nice catch ^^ not necessary for db that use transaction ^^
Done here: 2f5aa5e

thehowl · 2025-03-21T20:15:38Z

tm2/pkg/db/goleveldb/go_level_db.go

@@ -25,7 +26,8 @@ func init() {
 var _ db.DB = (*GoLevelDB)(nil)

 type GoLevelDB struct {
-	db *leveldb.DB
+	db  *leveldb.DB


This should also be safe, no?

// The returned DB instance is safe for concurrent use. Which mean that all // DB's methods may be called concurrently from multiple goroutine.

yes nice catch ^^

thehowl · 2025-03-21T20:16:45Z

tm2/pkg/store/cache/store.go

+	store.mtx.Lock()
+	defer store.mtx.Unlock()


doesn't access anything from the underlying store, so the mutex wrap is not necessary here

nice catch ^^

aeddi · 2025-03-24T09:08:48Z

While investigating this issue / fix, I realized that the ⁠socket hang up error still occurred with this fix after a few seconds, so I did some more benchs and here's the results FYI:

Branch (Req. interval)	Master (10ms)	With fix (10ms)	Master (100ms)	With fix (100ms)	Master (500ms)	With fix (500ms)
Success / Error (1000 req.)	20 / 980	2 / 998	45 / 955	71 / 929	442 / 558	1000 / 0
Duration (1000 req.)	29.075	33.348	2:15.653	3:22.665	8:32.523	8:22.026

The fix seems to make the issue worse at low intervals (10 ms), to improve the error / success ratio at the default script interval (100 ms), and seems to completely fix the problem at higher interval (500 ms).

Probably because at low intervals, the bottleneck is no longer due to the mutexes but to the CPU, which, being overloaded, takes too long to respond.

I'm not quite sure how to conclude with this data, I'll take another look at the race condition part this morning to see if I spot an issue.

thehowl · 2025-04-03T08:54:51Z

@omarsy are you willing to take up the test system I mentioned?

omarsy · 2025-04-03T08:59:36Z

@omarsy are you willing to take up the test system I mentioned?

Yes I am trying to do a test but complicated to do. If you know how to do it I am open for help

thehowl · 2025-04-03T10:46:24Z

I asked @zivkovicmilos to take a look at this, because I think there's an underlying issue relating also to how we connect and use the database:

my fear is that because our DB usage is not remotely transactional, this leads to many inconsistent states
and that until we implement proper DB transactionality (so that earlier connections don't see results of later connections, leading to inconsistent states) we cannot remove the lock on the app
and if anything, we can look to make it a RWMutex for some marginal improvements

zivkovicmilos · 2025-05-26T13:24:41Z

I've been digging into this, and I concluded it's just too touchy of a change to merge. Newer comet versions (>38) expose an UnsynchronizedLocalClient that chains can opt into (it drops the mux entirely from the client), and forces the application to deal with concurrency

The SDK’s keepers, caches, and the IAVL implementation we have were written under the assumption of single-threaded access, and just removing the outer mux without a full audit of every mutated path introduces more unknowns and kinder surprises that we can't see immediately

fix(bft/abci): remove unnecessary mutex locks in QueryAsync and Query…

223bd35

…Sync methods

github-actions bot assigned omarsy Feb 13, 2025

github-actions bot added the 📦 🌐 tendermint v2 Issues or PRs tm2 related label Feb 13, 2025

Gno2D2 requested a review from a team February 13, 2025 20:13

Merge branch 'master' into not-use-mutex

2246f4d

omarsy force-pushed the not-use-mutex branch from 6be82cc to 60eeba4 Compare February 16, 2025 16:42

wip

f9b2dd5

omarsy force-pushed the not-use-mutex branch from 60eeba4 to f9b2dd5 Compare February 16, 2025 22:27

github-actions bot added the 📦 ⛰️ gno.land Issues or PRs gno.land package related label Feb 16, 2025

zivkovicmilos self-requested a review February 17, 2025 17:42

zxxma mentioned this pull request Feb 17, 2025

Teritori's Builder Journey gnolang/hackerspace#7

Open

Merge branch 'master' into not-use-mutex

b5cf93f

Kouteki mentioned this pull request Feb 19, 2025

Minutes: Core Staff Weekly Syncs [every Monday] gnolang/meetings#36

Open

Merge branch 'master' into not-use-mutex

6bb207f

omarsy force-pushed the not-use-mutex branch 2 times, most recently from a860145 to 57142dc Compare March 1, 2025 20:22

to remove

d0bb109

omarsy force-pushed the not-use-mutex branch from 57142dc to d0bb109 Compare March 2, 2025 21:38

Merge branch 'master' into not-use-mutex

3477385

github-project-automation bot added this to 💪 Bounties & Worx Mar 11, 2025

remove debug print

5a5c579

omarsy requested a review from sw360cab March 11, 2025 19:10

omarsy and others added 3 commits March 11, 2025 20:24

Merge branch 'master' into not-use-mutex

374fcbe

feat(store): add mutex for thread-safe access in Store and multiStore

9c7fad2

implement mutex for thread-safe access across various store implemen…

5004c3d

…tations

omarsy marked this pull request as ready for review March 12, 2025 20:03

Gno2D2 added the review/triage-pending PRs opened by external contributors that are waiting for the 1st review label Mar 12, 2025

omarsy requested a review from thehowl March 20, 2025 09:54

Merge branch 'master' into not-use-mutex

ccbfbb1

thehowl reviewed Mar 21, 2025

View reviewed changes

Gno2D2 removed the review/triage-pending PRs opened by external contributors that are waiting for the 1st review label Mar 21, 2025

Kouteki moved this from Triage to In Review in 🧙‍♂️gno.land core team Mar 25, 2025

kristovatlas added the security Security-sensitive issue label Mar 28, 2025

omarsy and others added 2 commits April 2, 2025 22:04

refactor: remove unnecessary mutex locks from database implementations

2f5aa5e

Merge branch 'master' into not-use-mutex

0986e27

Merge branch 'master' into not-use-mutex

e4dccac

jaekwon force-pushed the master branch from d3e0886 to 705e0df Compare May 3, 2025 20:18

Merge branch 'master' into not-use-mutex

8546dd5

zivkovicmilos closed this May 26, 2025

github-project-automation bot moved this from In Review to Done in 🧙‍♂️gno.land core team May 26, 2025

github-project-automation bot moved this to Done in 💪 Bounties & Worx May 26, 2025

fix(bft/abci): remove unnecessary mutex locks in QueryAsync and QuerySync methods #3746

fix(bft/abci): remove unnecessary mutex locks in QueryAsync and QuerySync methods #3746

Uh oh!

Conversation

omarsy commented Feb 13, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Description

Reproduction

Fix

Uh oh!

Gno2D2 commented Feb 13, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

🛠 PR Checks Summary

Manual Checks (for Reviewers):

✅ Automated Checks (for Contributors):

☑️ Contributor Actions:

☑️ Reviewer Actions:

📚 Resources:

If

Then

If

Then

If

Can be checked by

Uh oh!

codecov bot commented Feb 13, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Codecov Report

Uh oh!

thehowl left a comment

Choose a reason for hiding this comment

Uh oh!

thehowl Mar 21, 2025

Choose a reason for hiding this comment

Uh oh!

omarsy Apr 2, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

thehowl Mar 21, 2025

Choose a reason for hiding this comment

Uh oh!

omarsy Apr 2, 2025

Choose a reason for hiding this comment

Uh oh!

thehowl Mar 21, 2025

Choose a reason for hiding this comment

Uh oh!

omarsy Apr 2, 2025

Choose a reason for hiding this comment

Uh oh!

aeddi commented Mar 24, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

thehowl commented Apr 3, 2025

Uh oh!

omarsy commented Apr 3, 2025

Uh oh!

thehowl commented Apr 3, 2025

Uh oh!

zivkovicmilos commented May 26, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Uh oh!

omarsy commented Feb 13, 2025 •

edited

Loading

Gno2D2 commented Feb 13, 2025 •

edited

Loading

codecov bot commented Feb 13, 2025 •

edited

Loading

omarsy Apr 2, 2025 •

edited

Loading

aeddi commented Mar 24, 2025 •

edited

Loading

zivkovicmilos commented May 26, 2025 •

edited

Loading