
[OPS] FEATURE REQUEST: Pruning system - sign and forget #3637

Open
albttx opened this issue Jan 29, 2025 · 5 comments
@albttx
Member

albttx commented Jan 29, 2025

Description

Currently, it is not possible to prune a node; all nodes store all blocks since block 1.

Once mainnet arrives, there will be more transactions, and storing all state since block 1 might create storage issues on the nodes.

There are multiple advantages to node pruning:

  • reduced start-up time
  • smaller node snapshots, which make it easier and faster to sync / recover / migrate a node
  • less disk space used
  • improved node performance

Another feature I would love to see implemented, which was already discussed once during a meeting with @jaekwon (a long time ago), is the "sign and forget" system.

Tendermint's main bottleneck is disk usage: large chains must use NVMe drives for efficiency, and disks without pruning grow really fast!

I believe a validator node doesn't need to store more than one block; only the latest state should be required.

I believe this feature would improve performance by a lot, because we could almost get rid of disk usage and do everything in memory. If it's not done in code, it could be done by putting the data/ dir on tmpfs, which is an in-memory file system.
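As a rough sketch of the tmpfs idea (the path and size below are hypothetical; adapt them to your node's layout, and note that tmpfs contents are lost on reboot, so this only makes sense combined with a "sign and forget" model or external snapshots):

```shell
# Mount an in-memory filesystem over the node's data directory
# (hypothetical path; stop the node before doing this).
sudo mount -t tmpfs -o size=32G,mode=0755 tmpfs /path/to/node/data

# Or make the mount persistent across reboots via an /etc/fstab entry:
# tmpfs  /path/to/node/data  tmpfs  size=32G,mode=0755  0  0
```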

FYI: on Injective, with a ~1s block time and the amount of txs, there is no possibility to enable pruning because it causes a lot of missed block signatures, and a node grows to a couple hundred GB in a couple of days...
Resyncing from a pruned snapshot is required every 1-2 weeks.

Of course, we might not have the same amount of txs from day 1, but if oracles start writing to gno.land on every block, we could hit node storage issues faster than we expect.

cc: @moul @zivkovicmilos @gnolang/devops wdyt ?

ps: this issue will be mentioned in the gnops.io article I'm writing about node snapshots.

@n2p5
Contributor

n2p5 commented Jan 29, 2025

Thanks for writing this up @albttx.

We should also make sure to outline the disadvantages and tradeoffs of node pruning.
Also, do we have documentation on how the current pruning process works? Is this a "stop the world" operation, or can it be done concurrently on live nodes? Again, what are the tradeoffs, and what block height thresholds do we want to shoot for?

FYI: on Injective, with a ~1s block time and the amount of txs, there is no possibility to enable pruning because it causes a lot of missed block signatures, and a node grows to a couple hundred GB in a couple of days...

What are the constraints here, and what causes the misses?

I believe this feature would improve performance by a lot, because we could almost get rid of disk usage and do everything in memory. If it's not done in code, it could be done by putting the data/ dir on tmpfs, which is an in-memory file system.

It would be cool to set up a small experiment to quantify what "A LOT" looks like. I also wonder if we could do some sort of WAL pattern that goes RAMdisk > NVMe > block store, with some kind of graceful degradation. Sorting out what we can handle in churn and recovery modes would be really interesting.

For instance, in the Kubernetes world, I could see working with local PVs where we mount a RAMdisk and NVMe as part of the configuration, with replication rules. Again, all of this has tradeoffs, so it would be important to formulate small experiments as well as perform degraded-state testing (cascading failures in assumptions, etc.).

I love working on these types of problems and it could lead to some really nice generalization if we approach it correctly.

@albttx
Member Author

albttx commented Jan 29, 2025

Good to add to this thread: the cosmos-sdk pruning configuration

```toml
# default: the last 362880 states are kept, pruning at 10 block intervals
# nothing: all historic states will be saved, nothing will be deleted (i.e. archiving node)
# everything: 2 latest states will be kept; pruning at 10 block intervals.
# custom: allow pruning options to be manually specified through 'pruning-keep-recent', and 'pruning-interval'
pruning = "default"

# These are applied if and only if the pruning strategy is custom.
pruning-keep-recent = "0"
pruning-interval = "0"
```
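For comparison, a hypothetical custom setup keeping only a small recent window of states (the values are illustrative, not a recommendation; roughly the last hour at 1s blocks):

```toml
pruning = "custom"
pruning-keep-recent = "3600"
pruning-interval = "10"
```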

@albttx
Member Author

albttx commented Jan 29, 2025

We should also make sure to outline the disadvantages and tradeoffs of node pruning.

The only tradeoff is that if nobody runs a full node, it becomes impossible to recover the state of a previous block.

It's quite a complex system to run. FYI: a Cosmos Hub full node (i.e. archive node) is over 13 TB of data, and only one company is providing something: https://quicksync.io/cosmos

Beyond that, there are always explorers that store block information in standard databases, where it should be possible to verify blocks with signatures.

What are the constraints here? and what is the cause of the misses?

The low block time (i.e. reduced timeout_commit) + the amount of txs per block.

gno.land isn't out of trouble: as you can see in the Slack channel #gno-infra-alerts, we had a lot of issues on test5 when gnoswap was doing a lot of txs; validators were missing up to 500 blocks in a row...

Imagine multiple projects like gnoswap running at the same time! The network was probably one of the reasons, and milos's PR #2852 might help a lot.

Test6 will run the latest version with @zivkovicmilos's fix. Let's see how the network reacts under high load.

@n2p5
Contributor

n2p5 commented Jan 29, 2025

The only tradeoff is that if nobody runs a full node, it becomes impossible to recover the state of a previous block.

"only" 😂 .

This sounds like an interesting area for research, in that it would be useful to have a "hot", "warm", and "cold" storage pattern that keeps a progressively longer block history, with the ability to completely reconstruct the full block history from cold (and cheap) storage for all to use.
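The tiering idea above can be sketched as a minimal Go toy (all type and function names are hypothetical; a cold map stands in for disk or object storage, and a real implementation would need persistence, batching, and concurrency control):

```go
package main

import "fmt"

// Block is a stand-in for a full block (hypothetical, for illustration).
type Block struct {
	Height int
	Data   string
}

// TieredStore keeps the most recent blocks in a "hot" in-memory tier and
// demotes older ones to a "cold" tier once the hot tier exceeds its limit.
type TieredStore struct {
	hotLimit int
	hot      map[int]Block
	cold     map[int]Block
	oldest   int // lowest height still in the hot tier
}

func NewTieredStore(hotLimit int) *TieredStore {
	return &TieredStore{
		hotLimit: hotLimit,
		hot:      make(map[int]Block),
		cold:     make(map[int]Block),
		oldest:   1,
	}
}

// Put stores a block in the hot tier, evicting the oldest hot block
// to cold storage whenever the hot tier is over its limit.
func (s *TieredStore) Put(b Block) {
	s.hot[b.Height] = b
	for len(s.hot) > s.hotLimit {
		s.cold[s.oldest] = s.hot[s.oldest]
		delete(s.hot, s.oldest)
		s.oldest++
	}
}

// Get checks the hot tier first, then falls back to cold storage.
func (s *TieredStore) Get(height int) (Block, string) {
	if b, ok := s.hot[height]; ok {
		return b, "hot"
	}
	if b, ok := s.cold[height]; ok {
		return b, "cold"
	}
	return Block{}, "missing"
}

func main() {
	s := NewTieredStore(3)
	for h := 1; h <= 5; h++ {
		s.Put(Block{Height: h, Data: fmt.Sprintf("block-%d", h)})
	}
	_, tier := s.Get(5)
	fmt.Println(tier) // recent block served from the hot tier: "hot"
	_, tier = s.Get(1)
	fmt.Println(tier) // old block demoted to the cold tier: "cold"
}
```

A "warm" middle tier (e.g. NVMe) would slot in the same way, with eviction cascading hot → warm → cold.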

@n0izn0iz
Contributor

It is a "stop the world" operation in cosmos-sdk 0.47 (not sure about the latest versions).

Status: Triage

No branches or pull requests

3 participants