
Suggestion for running BSC nodes #875

Closed
forcodedancing opened this issue Apr 26, 2022 · 25 comments
Labels
documentation (Improvements or additions to documentation), good first issue (Good for newcomers)

Comments

@forcodedancing
Contributor

forcodedancing commented Apr 26, 2022

The transaction volume on BSC is huge, which can make it challenging to run BSC nodes with good performance. This issue collects and summarizes information for running BSC nodes. We hope it is useful, and any suggestions or discussion are welcome.

Binary

All clients are advised to upgrade to the latest release, which is expected to be more stable and to perform better.

Spec for running nodes

The following are the recommended specs for running a validator and a fullnode.

Running validator

  • 3 TB of free disk space, solid-state drive (SSD), gp3, 8k IOPS, 250 MB/s throughput, read latency < 1 ms.
  • 12 cores of CPU and 48 gigabytes of memory (RAM).
  • m5zn.3xlarge instance type on AWS, or c2-standard-8 on Google Cloud.
  • A broadband Internet connection with upload/download speeds of 10 megabytes per second.

Running fullnode

  • 2 TB of free disk space, solid-state drive (SSD), gp3, 3k IOPS, 125 MB/s throughput, read latency < 1 ms. (If starting with snap/fast sync, an NVMe SSD is needed.)
  • 8 cores of CPU and 32 gigabytes of memory (RAM).
  • c5.4xlarge instance type on AWS, or c2-standard-8 on Google Cloud.
  • A broadband Internet connection with upload/download speeds of 5 megabytes per second.

Storage optimization

Block prune

If you do not care about historical blocks/transactions, e.g., transactions in old blocks, you can take the following steps to prune blocks (a sketch of the full cycle follows the list).

  1. Stop the BSC node gracefully.
  2. Run nohup geth snapshot prune-block --datadir {the data dir of your bsc node} --datadir.ancient {the ancient data dir of your bsc node} --block-amount-reserved 1024 &. It will take 3-5 hours to finish.
  3. Start the node once the prune is done.
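
A minimal sketch of the cycle, assuming the node runs as a systemd service named geth (the service name and paths below are illustrative; adapt them to your setup):

	# 1. stop the node gracefully
	sudo systemctl stop geth
	# 2. prune old block data in the background and watch the log (paths are illustrative)
	nohup geth snapshot prune-block --datadir /data/bsc --datadir.ancient /data/bsc/geth/chaindata/ancient --block-amount-reserved 1024 > prune-block.log 2>&1 &
	tail -f prune-block.log
	# 3. restart once the prune has finished
	sudo systemctl start geth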

State prune

According to our tests, the performance of a fullnode degrades when its storage size exceeds 1.5 TB. We suggest that fullnodes keep their storage light by pruning the state storage (a sketch follows the steps below).

  1. Stop the BSC node gracefully.
  2. Run nohup geth snapshot prune-state --datadir {the data dir of your bsc node} &. It will take 3-5 hours to finish.
  3. Start the node once the prune is done.
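
The same stop/prune/restart pattern applies. A minimal sketch, again assuming a systemd service named geth and illustrative paths:

	sudo systemctl stop geth
	du -sh /data/bsc/geth/chaindata        # optional: note the on-disk size before pruning
	nohup geth snapshot prune-state --datadir /data/bsc > prune-state.log 2>&1 &
	tail -f prune-state.log                # wait until the prune reports completion
	du -sh /data/bsc/geth/chaindata        # compare the size after pruning
	sudo systemctl start geth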

Notice:

  • Because pruning takes a few hours, maintainers should keep a few backup nodes so that traffic can be switched to a backup while one node is being pruned.
  • Pruning should be performed periodically, e.g., every month, to maintain good performance.

Sync mode

Pipecommit

The pipecommit feature was introduced in release v1.1.8 for full sync. You can enable it by adding --pipecommit to the start command when running full sync, as in the example below.
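
An illustrative start command (only --pipecommit is the point here; the other flags depend on your own setup):

	geth --config ./config.toml --datadir /data/bsc --syncmode full --pipecommit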

Light storage

When the node crashes or is force-killed, it will resync from a block that is a few minutes or a few hours old. This is because the in-memory state is not persisted to the database in real time, so the node needs to replay blocks from the last checkpoint. The replay time depends on the TrieTimeout setting in config.toml. We suggest raising it if you can tolerate a longer replay time, so that the node keeps its storage light.
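
As a rough illustration only (the exact location and format of the key depend on your BSC version; here it is assumed to sit under [Eth] in config.toml as a Go duration expressed in nanoseconds):

	# edit config.toml (illustrative value):
	#   [Eth]
	#   TrieTimeout = 3600000000000   # ~1 hour; larger = lighter storage, but longer replay after a crash
	# then restart the node against the same config file
	geth --config ./config.toml --datadir /data/bsc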

Performance monitoring

For block importing, you can monitor the following key metrics with Prometheus/Grafana by adding --metrics to your start command.

	blockInsertTimer     = metrics.NewRegisteredTimer("chain/inserts", nil) // chain_inserts in Prometheus
	blockValidationTimer = metrics.NewRegisteredTimer("chain/validation", nil) // chain_validation in Prometheus
	blockExecutionTimer  = metrics.NewRegisteredTimer("chain/execution", nil) // chain_execution in Prometheus
	blockWriteTimer      = metrics.NewRegisteredTimer("chain/write", nil) // chain_write in Prometheus

As shown in the example above, you can find more metrics of interest in the source code and monitor them.
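
A minimal sketch of exposing these metrics for Prometheus to scrape (flags other than --metrics are illustrative, and the endpoint path assumes a recent geth-based build):

	# start the node with the metrics server enabled
	geth --config ./config.toml --datadir /data/bsc --metrics --metrics.addr 127.0.0.1 --metrics.port 6060
	# quick check: the timers above show up as chain_inserts, chain_validation, chain_execution, chain_write
	curl -s http://127.0.0.1:6060/debug/metrics/prometheus | grep chain_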

Performance tuning

  • In the logs, mgasps indicates the block-processing capacity of the fullnode; make sure the value stays above 50.
  • You can enable profiling by adding --pprof to the start command. A profile can be captured with curl -sK -v http://127.0.0.1:6060/debug/pprof/profile?seconds=60 > profile_60s.out, and the dev community can help analyze it (see the sketch below).
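
If you want a first look at the profile yourself, here is a minimal sketch using Go's standard pprof tooling (assumes a Go toolchain is installed locally):

	# print the functions consuming the most CPU time
	go tool pprof -top profile_60s.out
	# or open an interactive web UI with flame graphs
	go tool pprof -http=:8080 profile_60s.out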

Snapshot for new node

If you want to set up a new BSC node, please fetch a snapshot from bsc-snapshots.
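
A rough sketch of bootstrapping from a snapshot (the download URL and archive name are placeholders; take the current link from the bsc-snapshots repository):

	# placeholder: set SNAPSHOT_URL to the link published in bsc-snapshots
	wget -O geth.tar.lz4 "$SNAPSHOT_URL"
	# decompress and unpack into the node's data directory (requires an lz4 CLI)
	lz4 -cd geth.tar.lz4 | tar -x -f - -C /data/bsc
	# start the node against the restored data directory
	geth --config ./config.toml --datadir /data/bsc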

Improvement suggestions

Feel free to raise pull requests or submit BEPs for your ideas.


@unclezoro unclezoro pinned this issue Apr 27, 2022
@bert2002

Any suggestions for running an archive node in the cloud?

@deblanco

@forcodedancing Thanks for this content, it is really useful!

Could you add the commands for running the node in different ways? (archive, light, ...)
and maybe optimization tips?

@James19903

What is faster? Diffsync or Pipecommit?

@kugimiya530

kugimiya530 commented Apr 28, 2022

I already pruned the state, and there is only 600 GB in my node folder,

but my node still lags a little in performance (compared with a server in the same location and with the same spec).

I can't figure it out lol

@nathanhopp
Contributor

You've outlined two pruning methods. For the minimal size possible, should we be running nohup geth snapshot prune-state --datadir {the data dir of your bsc node} followed by nohup geth snapshot prune-block --datadir {the data dir of your bsc node} --datadir.ancient {the ancient data dir of your bsc node} --block-amount-reserved 1024

Is there any dependency between these two commands? For example, if we run prune-block, will we run into errors trying prune-state after?

@forcodedancing
Contributor Author

Any suggestions for running an archive node in the cloud?

The disk requirement is very high. I believe a 10 TB ~ 15 TB disk is required. If you have such a disk, you can give it a try.

@forcodedancing
Contributor Author

@forcodedancing Thanks for this content, it is really useful!

Could you add the commands for running the node in different ways? (archive, light, ...) and maybe optimization tips?

Sure, I will add more detail on this.

@forcodedancing
Contributor Author

What is faster? Diffsync or Pipecommit?

Pipecommit is suggested; use the latest release.

@forcodedancing
Contributor Author

I already pruned the state, and there is only 600 GB in my node folder,

but my node still lags a little in performance (compared with a server in the same location and with the same spec).

I can't figure it out lol

Can you check your disk? It is usually the bottleneck now.

@forcodedancing
Contributor Author

forcodedancing commented Apr 29, 2022

You've outlined two pruning methods. For the minimal size possible, should we be running nohup geth snapshot prune-state --datadir {the data dir of your bsc node} followed by nohup geth snapshot prune-block --datadir {the data dir of your bsc node} --datadir.ancient {the ancient data dir of your bsc node} --block-amount-reserved 1024

Is there any dependency between these two commands? For example, if we run prune-block, will we run into errors trying prune-state after?

There is no dependency, and the order does not affect the final performance. You can run them one by one, but not in parallel.

@nathanhopp
Contributor

I found errors when running prune-state after prune-block had run. Are there no known issues around this? I could double-check and raise an issue.

@tntwist

tntwist commented Apr 30, 2022

Does anyone know a good VPS/dedicated server host for hosting a full node in the US?
AWS and Google Cloud are quite expensive.

@cruzerol

cruzerol commented May 1, 2022

@tntwist vultr

@tntwist

tntwist commented May 2, 2022

@cruzerol Thanks. What instance would you suggest there?

@forcodedancing
Contributor Author

I found errors when running prune-state after prune-block had run. Are there no known issues around this? I could double-check and raise an issue.

Sure, please submit an issue and we can analyze it further. Thanks.

@cruzerol

@tntwist Bare Metal - $350

@owen-reorg owen-reorg added the documentation Improvements or additions to documentation label Jun 9, 2022
@haumanto

haumanto commented Jul 2, 2022

Hi @forcodedancing, my node keeps getting the "Synchronisation failed, dropping peer" issue and stops syncing... The only solution is to restart, and it happens very often. Attached is the performance profile; please help review it.
profile_60s.out.zip

@unclezoro unclezoro added the good first issue Good for newcomers label Jul 12, 2022
@forcodedancing
Contributor Author

Hi @forcodedancing, my node keeps getting the "Synchronisation failed, dropping peer" issue and stops syncing... The only solution is to restart, and it happens very often. Attached is the performance profile; please help review it. profile_60s.out.zip
@haumanto can you try the latest release https://github.com/bnb-chain/bsc/releases ? The "Synchronisation failed, dropping peer" log entries should just be warnings.

@Thrisul-K

@forcodedancing or anyone:
Could you suggest the recommended specs to run a node considering the current data size? I tried an AWS m5zn.3xlarge with 10k IOPS and it doesn't seem to catch up with the current block even after 3 days.
It is just stuck at "State heal in progress" (log below for reference):

lvl=info msg="State heal in progress" accounts=2,797,558@164.39MiB slots=1,961,139@143.59MiB codes=1875@16.24MiB nodes=30,732,105@10.01GiB pending=161,822

I observed the log below during state sync:
lvl=info msg="State sync in progress" synced=100.00% state="518.41 GiB" accounts=135,781,417@27.48GiB slots=2,371,626,607@476.15GiB codes=1,864,926@14.77GiB eta=-7m31.069s

Does this mean the heal phase should run until the accounts in this phase reach 135,781,417?

@forcodedancing
Contributor Author

I tried an AWS m5zn.3xlarge with 10k IOPS and it doesn't seem to catch up with the current block even after 3 days.
It is just stuck at "State heal in progress" (log below for reference).

The spec should be fine for a fullnode. Did you use the snapshot? I also suggest running as a fastnode.

@jacobpake

I tried an AWS m5zn.3xlarge with 10k IOPS and it doesn't seem to catch up with the current block even after 3 days.
It is just stuck at "State heal in progress" (log below for reference).

The spec should be fine for a fullnode. Did you use the snapshot? I also suggest running as a fastnode.

#1198

@DaveWK

DaveWK commented Dec 3, 2022

Just chiming in here since I opened a related issue: #1198. There is an issue with syncing from scratch where it gets caught in "state heal" forever. I had the same problem on go-ethereum, which was solved by some recent commits, and I now have an open PR for bsc: #1226

These should fix/improve performance for syncing from scratch. I noticed on a c6a.8xlarge with 9k IOPS that it seemed to finish its initial sync after 8 hours, then go into the "state heal" loop, so hopefully the improvements will mean it finishes in roughly that amount of time.

In the meantime, if you do not care about historical/archive data, I was able to start a node from a snapshot on the same specs. I had to download the archive and wait for it to finish (2-3 hours), then unzip it (another 3 hours), then wait for it to start up and do the initial catch-up (another 1-2 hours), which meant having to monitor it for the next step. One thing I noticed was that the Go implementation of lz4 seems to be way faster on the CLI, I think because the C implementation is not using threads but the Go implementation is. The Go lz4 implementation at https://github.com/pierrec/lz4 has a CLI, is included in the default Fedora repos, and reduced the archive extraction by about an hour.

I didn't need to use fastnode with these specs when using a snapshot, and am now using full sync. Once the PR merges, I will try from scratch again, but I estimate I will be able to reduce the specs to 3k IOPS and a c6a.4xlarge (16 vCPU, 32 GB RAM) based on the current bsc node's (from snapshot) resource consumption and the specs of my go-ethereum node (note: this is a "full node", not a validator).

I prefer syncing from scratch over using a snapshot due to supply-chain-attack concerns, and because the snapshot route involves manual steps every few hours rather than just starting the node and waiting for the initial sync.

One thing I was unsure about: is there a configuration to allow prune to take place concurrently while the node is running, like it does for go-ethereum/nethermind rather than having to manually stop the node and run prune?

@forcodedancing
Contributor Author

@DaveWK Thanks for sharing your useful experience and suggestions. There is no in-place or online prune for now.

@coozebra

Hi!

I installed and fully synced a full node. It is working well.
The problem is: can I speed up catching pending transactions?
What I did: I tried upgrading RAM, SSD/NVMe, and IOPS, and none of it helped.
My thought: could I modify the geth code to sync only with validators on the BSC network? It might be much faster and more efficient.

Thanks!

@zzzckck zzzckck mentioned this issue Dec 14, 2023
@zzzckck
Collaborator

zzzckck commented Dec 14, 2023

put it in FAQ: #1947

@zzzckck zzzckck closed this as completed Dec 14, 2023
@zzzckck zzzckck unpinned this issue Dec 14, 2023