Node Recovery
In the event of a consensus failure or app hash mismatch that cannot be recovered by simply restarting the node, there are a few possible resolutions.
If there are other healthy nodes on the network, the most straightforward resolution is just to delete all data: (1) recreate the PostgreSQL database, and (2) delete the data folders under ~/.kwild, but not the private key or configuration files. Then start up the node and have it sync blocks from genesis from other nodes.
This approach may require considerable time and compute resources, including network bandwidth.
Also, this is only applicable if the cause of corruption is not a bug in the code that affected the whole network.
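As a rough illustration of that full resync, assuming the default ~/.kwild root directory, a PostgreSQL database named kwild owned by the kwild role, and psql connection options suitable for your setup (all of these may differ on your deployment):
# stop the kwild process before touching its data, then recreate the database
psql -c 'DROP DATABASE IF EXISTS kwild;' -c 'CREATE DATABASE kwild OWNER kwild;'
# delete the data folders under ~/.kwild (e.g. abci and signing, described below),
# keeping the private key and configuration files in place
rm -rf ~/.kwild/abci ~/.kwild/signing
# restart kwild and let it sync blocks from genesis from the other nodes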
kwild has different heights:
- block height -- the block store and index internal to CometBFT
- consensus engine state height -- CometBFT's internal state
- application height -- our application's height that pertains to which blocks we have executed and committed our own state changes
It is possible to reset the application height to zero while keeping CometBFT's data unchanged, which will signal to CometBFT to "reapply" all of the blocks with the application. This is different from "catch up" mode, in which all data is reset and resynchronized from network peers.
To do this (a consolidated sketch of the commands follows the list):
- recreate the postgres database. Either drop/create like psql -c 'DROP DATABASE IF EXISTS kwild;' -c 'CREATE DATABASE kwild OWNER kwild;' or delete the docker volume that contains the postgres database cluster.
- delete the signing folder from the kwild "root directory". This is usually ~/.kwild/signing.
- Optionally, delete abci/last_commit_info.json and abci/data/cs.wal to prevent some error logs on startup, but do NOT delete all of abci.
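Putting these steps together, a rough sketch, again assuming the default ~/.kwild root directory and a postgres instance that psql can reach with sufficient privileges:
# recreate the application database (or instead delete the docker volume holding the cluster)
psql -c 'DROP DATABASE IF EXISTS kwild;' -c 'CREATE DATABASE kwild OWNER kwild;'
# delete the signing folder, leaving the rest of abci intact
rm -rf ~/.kwild/signing
# optional: remove these to prevent some error logs on startup
rm -f ~/.kwild/abci/last_commit_info.json ~/.kwild/abci/data/cs.wal
# restart kwild; CometBFT will reapply its stored blocks to the application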
If the recovery actions above fail to fix the node, it may be necessary to roll back CometBFT's blocks and state by one or more blocks. The docs for the cometbft rollback
command explain this:
$ cometbft rollback -h
A state rollback is performed to recover from an incorrect application state transition,
when CometBFT has persisted an incorrect app hash and is thus unable to make
progress. Rollback overwrites a state at height n with the state at height n - 1.
The application should also roll back to height n - 1. If the --hard flag is not used,
no blocks will be removed so upon restarting CometBFT the transactions in block n will be
re-executed against the application. Using --hard will also remove block n. This can
be done multiple times.
Usage:
cometbft rollback [flags]
Flags:
--hard remove last block as well as state
-h, --help help for rollback
Global Flags:
--home string directory for config and data (default "/home/jon/.cometbft")
--log_level string log level (default "info")
--trace print out full stack trace on errors
If this is needed, it is likely that there is a bug that must be fixed first. It is also likely that many other nodes in the network are affected by, say, a determinism bug. In that event, recovering the network would require the following:
- fix the bug
- rollback one or more blocks on the affected nodes
- reset the application state (PostgreSQL)
- delete the signing folder in the root directory
- deploy the new version of kwild
- have it reapply all the block data up to the point before the bug forked (and halted) the network.
To perform a rollback, we may use the cometbft
command line application. Install to $GOPATH/bin
as follows:
go install -v github.com/cometbft/cometbft/cmd/cometbft@v0.38.12
Assuming $GOPATH/bin
is on your PATH
, it may then be used from any folder on the command line.
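A quick way to confirm that the expected build is the one found on your PATH (output details may vary by release):
cometbft version   # should report 0.38.12 for the install command above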
Using the rollback
command as documented requires setting the --home
folder to CometBFT's root, which is the abci
subfolder in kwild
's root directory:
$ cometbft rollback --hard --home ~/.kwild/abci
Rolled back both state and block to height 823 and hash 5C9824172FF1717C32671420A02E76B41779687A7642F3A19D9B5A56ACF3278F
Note that it is not possible to keep the application state intact while resetting or partially rolling back CometBFT's data. In this case, an error such as the following will be received:
error on replay: app block height (824) is higher than core (823)
The CometBFT data may be ahead of the application, but not the reverse.
While not a recovery process, it is helpful for debugging to use CometBFT's RPC service to inspect the databases of a stopped node:
$ cometbft inspect -h
inspect runs a subset of CometBFT's RPC endpoints that are useful for debugging
issues with CometBFT.
When the CometBFT detects inconsistent state, it will crash the
CometBFT process. CometBFT will not start up while in this inconsistent state.
The inspect command can be used to query the block and state store using CometBFT
RPC calls to debug issues of inconsistent state.
Usage:
cometbft inspect [flags]
As with cometbft rollback
, you specify the path to the abci
subfolder using the --home
flag.
$ cometbft inspect --home ~/.kwild/abci
I[2024-03-14|16:44:58.830] starting inspect server module=main
I[2024-03-14|16:44:58.831] RPC HTTP server starting address=tcp://127.0.0.1:26657
I[2024-03-14|16:44:58.831] serve msg="Starting RPC HTTP server on 127.0.0.1:26657"
This starts the CometBFT RPC server (not ours), which provides a different set of RPCs that are blind to the existence of the Kwil DB application. They are documented thoroughly with examples here.
For instance, to use the /block
endpoint to get the best block height:
$ curl -s --insecure http://localhost:26657/block | jq '.result.block.header.height'
"823"
Try jq '{hash: .result.block_id.hash, app_hash: .result.block.header.app_hash}'
for both the block ID hash and the app hash from the above endpoint.
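For example, combining that filter with the curl command shown above:
curl -s --insecure http://localhost:26657/block | jq '{hash: .result.block_id.hash, app_hash: .result.block.header.app_hash}'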
To view a formatted and colorized summary of block number 5, use jq and less
:
curl --insecure 'http://localhost:26657/block?height=5' | jq -C | less -R
To list all of the transactions in block number 43241:
curl -s --insecure 'http://localhost:26657/tx_search?query="tx.height=43241"'
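To extract just the transaction hashes from that response (assuming the usual tx_search response layout, where matches are listed under .result.txs), pipe it through jq:
curl -s --insecure 'http://localhost:26657/tx_search?query="tx.height=43241"' | jq '.result.txs[].hash'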