Lotus splitstore not discarding - chain keeps growing #9840
Comments
Chain log from the last hour:
Chain log (node) from the last 5 hours:
Try setting HotStoreFullGCFrequency to something like 1 to force a moving GC on Badger. It happens once a week by default. You can set it to 3 if you want it to occur daily.
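For reference, that knob lives in the repo's config.toml under the Splitstore section; a minimal sketch, mirroring the config layout quoted later in this thread and assuming the discard cold store most people here are running:
[Chainstore]
EnableSplitstore = true
[Chainstore.Splitstore]
ColdStoreType = "discard"
# 1 = run a full (moving) Badger GC on every compaction; the default is weekly
HotStoreFullGCFrequency = 1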
I'm experiencing the same issue. The splitstore seems to keep growing. I have applied the HotStoreFullGCFrequency=1 configuration, but it hasn't helped.
Current size:
docker run command in use:
I also have a second server running with the following config (same issue):
lotus/datastore
Using essentially the same config, but with pruning enabled and without the HOTSTOREFULLGCFREQUENCY=1 setting.
P2P_ANNOUNCE_IP=$(wget -qO- ifconfig.me/ip)
I have been experimenting with different configurations for weeks without success. Any guidance would be greatly appreciated.
Did you delete the chain folder contents and clean the splitstore folder before importing the lightweight snapshot @NodeKing?
I did not see downloading a new minimal snapshot help with this issue, even after deleting the cold store (/chain), clearing the hot store ( ./lotus/lotus-shed splitstore clear --repo=/.lotus) and setting HOTSTOREFULLGCFREQUENCY=1. What does seem to work is downloading a minimal snapshot and then, before importing, renaming .lotus/datastore to .lotus/datastore.old. The import process then rebuilds the entire directory, possibly eliminating any corrupt or old entries.
However, it also removes the metadata, client and staging data. BTW, it is now fully obvious to me that lotus client retrievals (legacy deals from Evergreen) cause all sorts of chain issues. That process uses GraphSync and causes the chain to lose sync.
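A rough sketch of that rename-and-reimport workflow, assuming the repo lives at /.lotus as in the commands above, the daemon is stopped first, and the snapshot path is a placeholder (note the caveat above about losing metadata, client and staging data):
mv /.lotus/datastore /.lotus/datastore.old
lotus daemon --import-snapshot /path/to/minimal-snapshot.car --halt-after-import
# once the node is back up and fully synced, reclaim the space
rm -rf /.lotus/datastore.old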
Those folders have no impact on the splitstore size.
Very interesting!! Thanks Stu, I will add that to our current monitoring. 🙏
I created a new server when starting to use the HOTSTOREFULLGCFREQUENCY=1 setting @TippyFlitsUK. I first ran the container with the config from my first post, but with this final line instead to import from snapshot:
I am still seeing the chain grow even with this setting.
BTW - I continue to see the chain sync get stuck and fall behind several times a day. I restart the service every 12 hours to keep the chain synced. I hear this is a common problem.
I was forced to download a new snapshot again two days ago as the size of the hot store was at 1 TiB
I'm creating a new server too, as the server is now using 1.5TB of space. |
Have you tried adjusting the splitstore settings at all @NodeKing?
Have tried variants of the below env vars without any success. Can you suggest a config to try @TippyFlitsUK?
LOTUS_CHAINSTORE_ENABLESPLITSTORE=true
I had a miner that cleaned up after itself for a while, but now just keeps growing and growing. The discard process doesn't seem to happen. How does that part work? Once stuff is moved to coldstore, when does it get deleted? (It seems that "prune" means something different in this instance?) |
I'm under the impression that the pruning never occurs because the cold store is meant to be discarding, so the autoPrune env var is probably redundant config. Would be good to get some clarity on this.
lotus-shed splitstore info
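For anyone else wanting to check their node, that command can be pointed at the daemon repo; a small sketch, assuming it takes the same --repo flag as the splitstore clear subcommand used earlier in this thread, paired with du for a quick size check:
lotus-shed splitstore info --repo=/.lotus
du -sch /.lotus/datastore/splitstore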
What version of lotus are you running @NodeKing? |
Still using the same version as in the docker run commands above @TippyFlitsUK: filecoin/lotus:v1.18.2
Thanks @NodeKing! Would you be able to upgrade to v1.19.0 with a fresh snapshot and a fully clean datastore?
Please note that the AutoPrune feature has been retired. Many thanks!!
I have 2 miners on splitstore with discard now, both on 1.19.0. My test miner (mainnet) did discard once, going from 500ish to 350ish GB, but after that it kept growing until it was over 750GB, then it fell out of sync completely. My production miner on splitstore (still sealing) grew to about 650GB, but did go down to 400GB overnight. The process seems to be hit or miss, and troubleshooting is like peering into a mysterious black box.
Are there any splitstore logs?
It's right there -- warm up is erroring with a missing ref; cc @ZenGround0
My daemon ran into a similar issue. I changed the hotstore GC frequency to 20, then to 10, and it is now running at 1. It did not help. The hotstore size is now at 1.7 TB out of a 1.9 TB drive. I need to delete the datastore folder and re-import the snapshot to clear the disk space.
Some more discussion about this issue: https://filecoinproject.slack.com/archives/CPFTWMY7N/p1673472134663959
Same issue across 10 nodes :/ Can we bump this to a higher priority somehow?
I think we should -- pinging @jennijuju @ZenGround0. The issue is that compaction fails with some unreachable object. The root cause may be changes in what is reachable with the new FVM machinery, or some other change that makes the compactor try to traverse something unreachable.
Haha great timing! https://filecoinproject.slack.com/archives/CP50PPW2X/p1676297112125189
@vyzo Where? |
One of the top logs...
Do you mean this one?
from stuberman's original comment? I thought so at first too, but it's just wrapping a splitstore closure, which is expected. I saw a proper COMPACTION ERROR like this when digging through f08399's logs too, and they were all splitstore closures. Let me know if it was a different one; if you can point me to an unexpected compaction failure, it would obviously be very helpful.
High-level possible reasons why state grows without bound in discard mode: somewhere, somehow, some expected GC work is not getting done.
Whatever the root cause, it also seems somewhat non-deterministic, because some configurations can get it to work.
My next step is to modify debug logs to closely track data movement during compaction and badger GC. This should help us classify the scenario into one of the options above. Once I've got a decent set of logs merged somewhere, I'll request that impacted operators run this patch and we can consolidate measurements here. In the meantime, if anyone has logs with evidence that compaction is not running as often as expected, or that compaction is failing unexpectedly, please post them here. I can't rule these options out yet, but I also haven't found evidence of either behavior in the logs provided in this thread.
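In the meantime, operators who want more detail can raise splitstore log verbosity on a running node; a small sketch, assuming the go-log subsystem is named splitstore (confirm the exact name with lotus log list):
lotus log list | grep -i splitstore
lotus log set-level --system splitstore debug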
Daemon log can be found here
Having the same issue, but only on one of our daemons. My logs at the time filled up with these messages until the disk ran out of space:
Has this issue been addressed in the latest release, 1.20.0?
Not yet. It is prevalent with those running lotus daemon with systemd. |
(( |
@ZenGround0 One thing we noted when using systemd is that the default service profile for lotus is not suitable.
Lotus default:
This improves splitstore compaction:
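The actual unit settings were not captured above, so purely as an illustration of the mechanism (the unit name and directive values are placeholders, not the poster's): a systemd drop-in override is the usual way to change a service's profile without editing the packaged unit.
# open an editor for a drop-in override of the (assumed) lotus daemon unit
sudo systemctl edit lotus-daemon.service
# placeholder [Service] directives an operator might add in the override, e.g.:
#   LimitNOFILE=1000000     # raise the open-files limit
#   TimeoutStopSec=600      # give a long-running compaction time to finish on stop
sudo systemctl restart lotus-daemon.service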
I have the same problem. My configuration is as follows:
[Chainstore]
EnableSplitstore = true
[Chainstore.Splitstore]
ColdStoreType = "discard"
HotStoreType = "badger"
MarkSetType = "badger"
HotStoreMessageRetention = 0
HotStoreFullGCFrequency = 3
The software version is 1.20.1
Have the same issue :((( |
We still have the issue and are rebuilding our servers every couple of weeks. |
Version 1.23.0 seems to have resolved this issue for us. |
Thank you for the update @clinta 🙏 |
v1.23.0 resolved our issue too. Our servers were pretty flaky on this version for some time, getting out of sync with the network and needing frequent reboots to get back in sync. Everything eventually stabilised on its own, and we ended up with massive disk space savings. Our nodes were getting up to 2TB of used space before we would rebuild them; now they're using ~280GB.
Can't say the same as @NodeKing, but apparently the volume is not growing and the drives are not overflowing.
Can you run a lotus chain prune hot-moving @Shekelme? My most recent run took me all the way down to 126 GiB.
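For anyone trying this, the command is issued against the running daemon; a small sketch, assuming the moving GC needs enough free disk to copy the live hotstore while it works:
lotus chain prune hot-moving
# afterwards, confirm the reclaimed space
du -sch /.lotus/datastore/splitstore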
I have not had any problems with compaction since I changed the sync from systemd to a local task. |
Checklist
Latest release, or the most recent RC (release candidate) for the upcoming release, or the dev branch (master), or have an issue updating to any of these.
Lotus component
Lotus Version
Describe the Bug
Splitstore does not seem to discard old blocks (this is after I imported a new snapshot three days ago)
du -sch /.lotus/datastore/* | sort -rh
392G total
284G /.lotus/datastore/splitstore
109G /.lotus/datastore/chain
8.1M /.lotus/datastore/metadata
20K /.lotus/datastore/staging
20K /.lotus/datastore/client
I noticed that the lotus daemon will also quit randomly - see logs below
Logging Information
Repo Steps
...