
Lotus splitstore not discarding - chain keeps growing #9840

Closed
stuberman opened this issue Dec 12, 2022 · 64 comments
Labels
kind/bug Kind: Bug need/analysis Hint: Needs Analysis splitstore

Comments

@stuberman

Checklist

  • This is not a security-related bug/issue. If it is, please follow the security policy.
  • This is not a question or a support request. If you have any lotus-related questions, please ask in the lotus forum.
  • This is not a new feature request. If it is, please file a feature request instead.
  • This is not an enhancement request. If it is, please file an improvement suggestion instead.
  • I have searched on the issue tracker and the lotus forum, and there is no existing related issue or discussion.
  • I am running the latest release, the most recent RC (release candidate) for the upcoming release, or the dev branch (master), or have an issue updating to any of these.
  • I did not make any code changes to lotus.

Lotus component

  • lotus daemon - chain sync
  • lotus miner - mining and block production
  • lotus miner/worker - sealing
  • lotus miner - proving (WindowPoSt)
  • lotus miner/market - storage deal
  • lotus miner/market - retrieval deal
  • lotus miner/market - data transfer
  • lotus client
  • lotus JSON-RPC API
  • lotus message management (mpool)
  • Other

Lotus Version

Daemon:  1.19.0-rc2+mainnet+git.2520b1644+api1.5.0
Local: lotus-miner version 1.19.0-rc2+mainnet+git.2520b1644

[Chainstore]
  # type: bool
  # env var: LOTUS_CHAINSTORE_ENABLESPLITSTORE
  EnableSplitstore = true

 [Chainstore.Splitstore]
   # ColdStoreType specifies the type of the coldstore.
   # It can be "messages" (default) to store only messages, "universal" to store all chain state or "discard" for discarding cold blocks.
   #
   # type: string
   # env var: LOTUS_CHAINSTORE_SPLITSTORE_COLDSTORETYPE
   ColdStoreType = "discard"
   # EnableColdStoreAutoPrune = false

   # HotStoreType specifies the type of the hotstore.
   # Only currently supported value is "badger".
   #
   # type: string
   # env var: LOTUS_CHAINSTORE_SPLITSTORE_HOTSTORETYPE
   #HotStoreType = "badger"

   # MarkSetType specifies the type of the markset.
   # It can be "map" for in memory marking or "badger" (default) for on-disk marking.
   #
   # type: string
   # env var: LOTUS_CHAINSTORE_SPLITSTORE_MARKSETTYPE
   #MarkSetType = "badger"

   # HotStoreMessageRetention specifies the retention policy for messages, in finalities beyond
   # the compaction boundary; default is 0.
   #
   # type: uint64
   # env var: LOTUS_CHAINSTORE_SPLITSTORE_HOTSTOREMESSAGERETENTION
   #HotStoreMessageRetention = 0

   # HotStoreFullGCFrequency specifies how often to perform a full (moving) GC on the hotstore.
   # A value of 0 disables, while a value 1 will do full GC in every compaction.
   # Default is 20 (about once a week).
   #
   # type: uint64
   # env var: LOTUS_CHAINSTORE_SPLITSTORE_HOTSTOREFULLGCFREQUENCY
   #HotStoreFullGCFrequency = 20

Describe the Bug

Splitstore does not seem to discard old blocks (this is after I imported a new snapshot three days ago).

du -sch /.lotus/datastore/* | sort -rh
392G total
284G /.lotus/datastore/splitstore
109G /.lotus/datastore/chain
8.1M /.lotus/datastore/metadata
20K /.lotus/datastore/staging
20K /.lotus/datastore/client

I noticed that the lotus daemon will also quit randomly - see logs below

Logging Information

2022-11-24T04:49:42.692 INFO filcrypto::util::types > fvm_machine_execute_message: end
2022-11-24T04:49:42.794 INFO filcrypto::util::types > fvm_machine_flush: start
2022-11-24T04:49:43.008 INFO filcrypto::util::types > fvm_machine_flush: end
2022-11-24T04:49:56.902Z	WARN	builder	node/shutdown.go:33	received shutdown
2022-11-24T04:49:56.902Z	WARN	builder	node/shutdown.go:36	Shutting down...
2022-11-24T04:49:56.903Z	INFO	builder	node/shutdown.go:44	rpc server shut down successfully 
2022-11-24T04:49:56.903Z	INFO	dt-impl	impl/impl.go:170	stop data-transfer module
2022-11-24T04:49:56.903Z	WARN	events	events/observer.go:61	listenHeadChanges quit
2022-11-24T04:49:56.903Z	WARN	events	events/observer.go:66	not restarting listenHeadChanges: context error: context canceled
2022-11-24T04:49:56.903Z	WARN	events	events/observer.go:61	listenHeadChanges quit
2022-11-24T04:49:56.903Z	WARN	events	events/observer.go:66	not restarting listenHeadChanges: context error: context canceled
2022-11-24T04:49:56.904Z	WARN	sub	sub/incoming.go:438	error from message subscription: context canceled
2022-11-24T04:49:56.904Z	WARN	sub	sub/incoming.go:440	quitting HandleIncomingMessages loop
2022-11-24T04:49:56.904Z	WARN	sub	sub/incoming.go:55	quitting HandleIncomingBlocks loop
2022-11-24T04:49:56.904Z	WARN	peermgr	peermgr/peermgr.go:154	closing peermgr done
2022-11-24T04:49:56.904Z	WARN	peermgr	peermgr/peermgr.go:172	exiting peermgr run
2022-11-24T04:49:56.904Z	WARN	chainstore	store/store.go:616	reorgWorker quit
2022-11-24T04:49:56.904Z	WARN	splitstore	splitstore/splitstore.go:779	close with ongoing compaction in progress; waiting for it to finish...
2022-11-24T04:49:56.908Z	INFO	pubsub	go-libp2p-pubsub@v0.8.0/pubsub.go:640	pubsub processloop shutting down
2022-11-24T04:49:57.891Z	ERROR	splitstore	splitstore/splitstore_warmup.go:40	error warming up hotstore: error walking block (cid: bafy2bzacea4aeq6v2gdvre2txt3mmc5hsnz2j4ksrhheovmvn3uxbapuxx4cw): error walking state root (cid: bafy2bzacedi5adkbupv3cu5ichsa776jvgochor4or3k6n7cxarm4dnbza2qs): error walking link (cid: bafy2bzacectp2gnv2e5zzmlbhcxryggvpt3okboxfv3kacguz4ezthi4m4lje): error walking link (cid: bafy2bzacecl6tsngoa47mpboat65k6v6hwu7yagclr5bhxfsgoetpdtxzatwg): error walking link (cid: bafy2bzacedvrnrtom4fadhpoyvqmmoac5bxvzucorzwckr3qhtf5wlqs6gdzq): error walking link (cid: bafy2bzacecqx5xs4uk232v2xe7euwbj5s47zo4z64t75yfowjgjiz54bu42ya): error walking link (cid: bafy2bzacecs4vfmtttelwagv7stdw777mkrcgyatqv253uthilay3hst6qv5w): error walking link (cid: bafy2bzaceasb4hymydff2eej7jr4vsfiqedpcyajzotanrgjb3np3wqymqd46): error walking link (cid: bafy2bzacecmhi4kfcbmo2efxp677yifh7yz2pnhski7pxf42xunt3dqs4qoy2): error walking link (cid: bafy2bzacecouc4r5sc7pc2hmg67g3oe6ne64t7nhtdmzalevbrfzxw3d6bwuk): error walking link (cid: bafy2bzacebmdcqvc4plfhi5ufnqr4npov42tt5gnwd45bnghssq43yshknf4u): error walking link (cid: bafy2bzaceb2cs5w5z75qqa7bppnfh53b5xkjd2helbifwmavvhmsbkgyu3jsi): splitstore is closing
2022-11-24T04:49:58.230Z	INFO	badgerbs	v2@v2.2007.3/db.go:1031	Storing value log head: {Fid:66 Len:34 Offset:1041111137}

2022-11-24T04:49:58.438Z	INFO	badgerbs	v2@v2.2007.3/levels.go:1000	[Compactor: 173] Running compaction: {level:0 score:1.73 dropPrefixes:[]} for level: 0

2022-11-24T04:49:59.778Z	INFO	engine	decision/engine.go:657	aborting message processingshutting down
2022-11-24T04:49:59.898Z	INFO	engine	decision/engine.go:657	aborting message processingshutting down
2022-11-24T04:50:00.048Z	INFO	engine	decision/engine.go:657	aborting message processingshutting down
2022-11-24T04:50:00.171Z	INFO	engine	decision/engine.go:657	aborting message processingshutting down
2022-11-24T04:50:00.205Z	INFO	badgerbs	v2@v2.2007.3/levels.go:962	LOG Compact 0->1, del 7 tables, add 6 tables, took 1.766679993s

2022-11-24T04:50:00.205Z	INFO	badgerbs	v2@v2.2007.3/levels.go:1010	[Compactor: 173] Compaction for level: 0 DONE
2022-11-24T04:50:00.205Z	INFO	badgerbs	v2@v2.2007.3/db.go:554	Force compaction on level 0 done
2022-11-24T04:50:00.378Z	INFO	builder	node/shutdown.go:44	node shut down successfully 
2022-11-24T04:50:00.378Z	WARN	builder	node/shutdown.go:47	Graceful shutdown successful

Repo Steps

  1. Run '...'
  2. Do '...'
  3. See error '...'
    ...
@stuberman
Author

Chain log from last hour:

chain.txt

@stuberman
Author

Chainlog (node) last 5 hours

chain2.txt

@stuberman
Author

du -sch /.lotus/datastore/* | sort -rh
544G total
435G /.lotus/datastore/splitstore
109G /.lotus/datastore/chain
9.5M /.lotus/datastore/metadata
20K /.lotus/datastore/staging
20K /.lotus/datastore/client

@vyzo
Contributor

vyzo commented Dec 14, 2022

Try setting HotStoreFullGCFrequency to something like 1, so as to force a moving GC on badger at every compaction.

It happens about once a week by default (the default value is 20). You can set it to 3 if you want it to occur roughly daily.
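
For reference, a minimal sketch of applying that setting (the repo path is an assumption; pick either the environment variable or the config file, then restart the daemon so it takes effect):

# Option A: environment variable, as also used in the docker setups later in this thread
export LOTUS_CHAINSTORE_SPLITSTORE_HOTSTOREFULLGCFREQUENCY=1

# Option B: edit ~/.lotus/config.toml under [Chainstore.Splitstore]
#   HotStoreFullGCFrequency = 1

# Restart the daemon
lotus daemon stop
lotus daemon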

@NodeKing

I'm experiencing the same issues. The splitstore seems to keep growing. I have implemented the HotStoreFullGCFrequency=1 configuration, but it hasn't helped:

current size:
lotus/datastore
476.5 GiB /splitstore
110.2 GiB /chain

docker run command in use:
P2P_ANNOUNCE_IP=$(wget -qO- ifconfig.me/ip)
docker run -d --name lotus \
  --user 532:532 \
  --network host \
  -e LOTUS_API_LISTENADDRESS=/ip4/0.0.0.0/tcp/8545/http \
  -e LOTUS_LIBP2P_LISTENADDRESSES=/ip4/0.0.0.0/tcp/6665 \
  -e LOTUS_LIBP2P_ANNOUNCEADDRESSES=/ip4/$P2P_ANNOUNCE_IP/tcp/6665 \
  -e LOTUS_LIBP2P_DISABLENATPORTMAP=true \
  -e LOTUS_CHAINSTORE_ENABLESPLITSTORE=true \
  -e LOTUS_CHAINSTORE_SPLITSTORE_COLDSTORETYPE=discard \
  -e LOTUS_CHAINSTORE_SPLITSTORE_ENABLECOLDSTOREAUTOPRUNE=false \
  -e LOTUS_CHAINSTORE_SPLITSTORE_HOTSTOREFULLGCFREQUENCY=1 \
  -v /blockchain/lotus:/var/lib/lotus \
  -v /blockchain/lotus-tmp/filecoin-proof-parameters:/var/tmp/filecoin-proof-parameters \
  filecoin/lotus:v1.18.2 daemon
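
For reference, with that bind mount the splitstore growth can be watched from the host side (a sketch; the host path comes from the -v flag above):

# The container's /var/lib/lotus repo lives at /blockchain/lotus on the host
du -sch /blockchain/lotus/datastore/* | sort -rh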

@NodeKing

I also have a second server running with the following config (same issue)

lotus/datastore
593.4 GiB /splitstore
104.7 GiB /chain

It uses essentially the same config, but it has auto-prune enabled and is missing the HOTSTOREFULLGCFREQUENCY=1 setting.

P2P_ANNOUNCE_IP=$(wget -qO- ifconfig.me/ip)
docker run -d --name lotus \
  --user 532:532 \
  --network host \
  -e LOTUS_API_LISTENADDRESS=/ip4/0.0.0.0/tcp/8545/http \
  -e LOTUS_LIBP2P_LISTENADDRESSES=/ip4/0.0.0.0/tcp/6665 \
  -e LOTUS_LIBP2P_ANNOUNCEADDRESSES=/ip4/$P2P_ANNOUNCE_IP/tcp/6665 \
  -e LOTUS_LIBP2P_DISABLENATPORTMAP=true \
  -e LOTUS_CHAINSTORE_ENABLESPLITSTORE=true \
  -e LOTUS_CHAINSTORE_SPLITSTORE_COLDSTORETYPE=discard \
  -e LOTUS_CHAINSTORE_SPLITSTORE_ENABLECOLDSTOREAUTOPRUNE=true \
  -v /blockchain/lotus:/var/lib/lotus \
  -v /blockchain/lotus-tmp/filecoin-proof-parameters:/var/tmp/filecoin-proof-parameters \
  filecoin/lotus:v1.18.2 daemon

I have been experimenting with different configurations for weeks without success. Any guidance would be greatly appreciated.

@TippyFlitsUK
Contributor

Did you delete the chain folder contents and clean the splitstore folder before importing the lightweight snapshot, @NodeKing?

@stuberman
Author

Downloading a new minimal snapshot after deleting the cold store (/chain), clearing the hot store (./lotus/lotus-shed splitstore clear --repo=/.lotus), and setting HOTSTOREFULLGCFREQUENCY=1 did not help with this issue.

What does seem to work is downloading a minimal snapshot and, before importing, renaming .lotus/datastore to .lotus/datastore.old. The import process then rebuilds the entire directory, possibly eliminating any corrupt or stale entries; a sketch of that workflow follows below.
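
A minimal sketch of that workaround, assuming a default repo at /.lotus and the filops minimal snapshot URL quoted later in this thread:

# Stop the daemon before touching the repo
lotus daemon stop

# Move the old datastore aside so the import rebuilds it from scratch
mv /.lotus/datastore /.lotus/datastore.old

# Re-import a minimal snapshot, then start the daemon again
lotus daemon --import-snapshot https://snapshots.mainnet.filops.net/minimal/latest --halt-after-import
lotus daemon

# Once the node is synced and healthy, reclaim the space
rm -rf /.lotus/datastore.old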

@TippyFlitsUK
Contributor

./lotus/lotus-shed splitstore clear is essentially doing exactly the same thing, Stu. HOTSTOREFULLGCFREQUENCY can also be set to a lower number if hotstore space is limited.

@stuberman
Author

./lotus/lotus-shed splitstore clear is essentially doing exactly the same thing, Stu. HOTSTOREFULLGCFREQUENCY can also be set to a lower number if hotstore space is limited.

However, it also removes the metadata, client and staging data.

BTW - It is now fully obvious to me that lotus client retrievals (legacy deals from Evergreen) cause all sorts of chain issues. That process uses GraphSync and causes the chain to lose sync.

@TippyFlitsUK
Contributor

However, it also removes the metadata, client and staging data.

Those folders have no impact on the splitstore size. ./lotus/lotus-shed splitstore clear is the recommended method of clearing the splitstore for a re-import.

BTW - It is now fully obvious to me that lotus client retrievals (legacy deals from Evergreen) cause all sorts of chain issues. That process uses GraphSync and causes the chain to lose sync.

Very interesting!! Thanks Stu, I will add that to our current monitoring. 🙏

@NodeKing

I created a new server when I started using the HOTSTOREFULLGCFREQUENCY=1 setting, @TippyFlitsUK.

I first ran the container with the config from my first post, but with this final line instead, to import from a snapshot:
filecoin/lotus:v1.18.2 daemon --import-snapshot https://snapshots.mainnet.filops.net/minimal/latest --halt-after-import true

@stuberman
Author

I am still seeing the chain grow even with HotStoreFullGCFrequency = 1

du -sch /.lotus/datastore/* | sort -rh

715G total
604G /.lotus/datastore/splitstore
112G /.lotus/datastore/chain
9.5M /.lotus/datastore/metadata
20K /.lotus/datastore/staging
20K /.lotus/datastore/client

@stuberman
Author

BTW - I continue to see the chain sync get stuck and fall behind several times a day. I restart the service every 12 hours to keep the chain synced. I hear this is a common problem.

@stuberman
Author

./lotus-shed splitstore info

warmup epoch: 2.43576e+06
base epoch: 2.43576e+06
compacting: true
compactions: 0
hotstore size: 8.02180837946e+11
prunes: 0

~/lotus$ du -sch /.lotus/datastore/* | sort -rh

862G total
751G /.lotus/datastore/splitstore
112G /.lotus/datastore/chain
12M /.lotus/datastore/metadata
20K /.lotus/datastore/staging
20K /.lotus/datastore/client
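
If compaction were completing, the compactions counter should increase and the splitstore directory should shrink after a full GC. A rough sketch for watching that over time (paths follow the commands above):

# Log splitstore status and on-disk size every 10 minutes
while true; do
  date
  ./lotus-shed splitstore info
  du -sh /.lotus/datastore/splitstore
  sleep 600
done | tee -a splitstore-watch.log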

@stuberman
Author

I was forced to download a new snapshot again two days ago, as the size of the hot store had reached 1 TiB even with HotStoreFullGCFrequency = 1.

./lotus-shed splitstore info

base epoch: 2.46768e+06
compacting: true
compactions: 0
hotstore size: 4.30981607577e+11
prunes: 0
warmup epoch: 0

du -sch /.lotus/datastore/* | sort -rh

541G total
426G /.lotus/datastore/splitstore
115G /.lotus/datastore/chain
7.1M /.lotus/datastore/metadata
20K /.lotus/datastore/staging
20K /.lotus/datastore/client

@Reiers added need/analysis Hint: Needs Analysis splitstore and removed need/triage labels Jan 3, 2023
@NodeKing

NodeKing commented Jan 4, 2023

I'm creating a new server too, as the server is now using 1.5TB of space.

@TippyFlitsUK
Contributor

Have you tried adjusting the splitstore settings at all, @NodeKing?

@NodeKing

NodeKing commented Jan 10, 2023

I have tried variants of the env vars below without any success. Can you suggest a config to try, @TippyFlitsUK?

LOTUS_CHAINSTORE_ENABLESPLITSTORE=true
LOTUS_CHAINSTORE_SPLITSTORE_COLDSTORETYPE=discard
LOTUS_CHAINSTORE_SPLITSTORE_ENABLECOLDSTOREAUTOPRUNE=true
LOTUS_CHAINSTORE_SPLITSTORE_HOTSTOREFULLGCFREQUENCY=1

@shawnp0wers

I had a miner that cleaned up after itself for a while, but now it just keeps growing and growing. The discard process doesn't seem to happen. How does that part work? Once data is moved to the coldstore, when does it get deleted? (It seems that "prune" means something different in this instance?)

@NodeKing

I'm under the impression that pruning never occurs because the cold store is set to discard, so the autoPrune env var is probably a redundant config. It would be good to get some clarity on this.

lotus-shed splitstore info
prunes: 0
warmup epoch: 0
base epoch: 2.492838e+06
compacting: true
compactions: 4
hotstore size: 1.75710129714e+11

@TippyFlitsUK
Contributor

What version of lotus are you running @NodeKing?

@NodeKing

Still using the same version as in the docker run commands above, @TippyFlitsUK: filecoin/lotus:v1.18.2

@TippyFlitsUK
Contributor

Thanks @NodeKing!

Would you be able to upgrade to v1.19.0 with a fresh snapshot and fully clean chain and datastore folders?

1.19.0 includes SplitStore updates that may help with the issues that you are seeing. I have been using this version for weeks with the settings below and am not experiencing any issues at all.

Please note that the AutoPrune feature has been retired in 1.19.0; you can see all the updated docs here.

Many thanks!!

LOTUS_CHAINSTORE_ENABLESPLITSTORE=true
LOTUS_CHAINSTORE_SPLITSTORE_COLDSTORETYPE=discard
LOTUS_CHAINSTORE_SPLITSTORE_HOTSTOREFULLGCFREQUENCY=10
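
Applied to the docker setup from earlier in this thread, that would look roughly like the following (a sketch; the v1.19.0 image tag and host paths are assumptions carried over from the earlier commands):

docker run -d --name lotus \
  --user 532:532 \
  --network host \
  -e LOTUS_CHAINSTORE_ENABLESPLITSTORE=true \
  -e LOTUS_CHAINSTORE_SPLITSTORE_COLDSTORETYPE=discard \
  -e LOTUS_CHAINSTORE_SPLITSTORE_HOTSTOREFULLGCFREQUENCY=10 \
  -v /blockchain/lotus:/var/lib/lotus \
  -v /blockchain/lotus-tmp/filecoin-proof-parameters:/var/tmp/filecoin-proof-parameters \
  filecoin/lotus:v1.19.0 daemon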

@shawnp0wers

I have 2 miners on splitstore with discard now. Both 1.19.0.

My test miner (mainnet) did discard once, going from roughly 500 GB to 350 GB, but after that it kept growing until it was over 750 GB, and then it fell out of sync completely.

My production miner on splitstore (still sealing) grew to about 650 GB, but did go down to 400 GB overnight. The process seems to be hit or miss, and troubleshooting is like peering into a mysterious black box.

@vyzo
Contributor

vyzo commented Jan 25, 2023

Are there any splitstore logs?
It sounds to me like compaction is not running to completion.

@vyzo
Contributor

vyzo commented Jan 25, 2023

It's right there -- warmup is erroring with a missing ref; cc @ZenGround0

@William8Work

My daemon ran into a similar issue. I changed the hotstore GC frequency to 20, then to 10, and it is now running at 1. It did not help. The hotstore size is now at 1.7 TB on a 1.9 TB drive. I need to delete the datastore folder and re-import the snapshot to reclaim the disk space.
daemon-f08399-2023.zip

@RobQuistNL
Contributor

Some more discussion about this issue: https://filecoinproject.slack.com/archives/CPFTWMY7N/p1673472134663959

@dd45e640b42e6da7da96faee3996ef7c
Contributor

Same issues across 10 nodes :/

Can we bump this to a higher priority somehow?

@vyzo
Contributor

vyzo commented Feb 13, 2023

I think we should -- pinging @jennijuju @ZenGround0

The issue is that compaction fails with some unreachable object. The root cause may be changes in what is reachable with the new FVM, or some other change that makes the compactor try to traverse something unreachable.

@jennijuju
Member

I think we should -- pinging @jennijuju @ZenGround0

The issue is that compaction fails with some unreachable object. The root cause may be changes in what is reachable with the new FVM, or some other change that makes the compactor try to traverse something unreachable.

Haha great timing! https://filecoinproject.slack.com/archives/CP50PPW2X/p1676297112125189

@ZenGround0
Contributor

It's right there -- warmup is erroring with a missing ref

@vyzo Where?

@vyzo
Contributor

vyzo commented Feb 13, 2023

One of the top logs...

@ZenGround0
Contributor

Do you mean this one?

2022-11-24T04:49:57.891Z	ERROR	splitstore	splitstore/splitstore_warmup.go:40	error warming up hotstore: error walking block (cid: bafy2bzacea4aeq6v2gdvre2txt3mmc5hsnz2j4ksrhheovmvn3uxbapuxx4cw): error walking state root (cid: bafy2bzacedi5adkbupv3cu5ichsa776jvgochor4or3k6n7cxarm4dnbza2qs): error walking link (cid: bafy2bzacectp2gnv2e5zzmlbhcxryggvpt3okboxfv3kacguz4ezthi4m4lje): error walking link (cid: bafy2bzacecl6tsngoa47mpboat65k6v6hwu7yagclr5bhxfsgoetpdtxzatwg): error walking link (cid: bafy2bzacedvrnrtom4fadhpoyvqmmoac5bxvzucorzwckr3qhtf5wlqs6gdzq): error walking link (cid: bafy2bzacecqx5xs4uk232v2xe7euwbj5s47zo4z64t75yfowjgjiz54bu42ya): error walking link (cid: bafy2bzacecs4vfmtttelwagv7stdw777mkrcgyatqv253uthilay3hst6qv5w): error walking link (cid: bafy2bzaceasb4hymydff2eej7jr4vsfiqedpcyajzotanrgjb3np3wqymqd46): error walking link (cid: bafy2bzacecmhi4kfcbmo2efxp677yifh7yz2pnhski7pxf42xunt3dqs4qoy2): error walking link (cid: bafy2bzacecouc4r5sc7pc2hmg67g3oe6ne64t7nhtdmzalevbrfzxw3d6bwuk): error walking link (cid: bafy2bzacebmdcqvc4plfhi5ufnqr4npov42tt5gnwd45bnghssq43yshknf4u): error walking link (cid: bafy2bzaceb2cs5w5z75qqa7bppnfh53b5xkjd2helbifwmavvhmsbkgyu3jsi): splitstore is closing

from stuberman's original comment? I thought so at first too, but it's just wrapping a splitstore closure, which is expected. I saw proper COMPACTION ERROR entries like this when digging through f08399's logs too, and they were all splitstore closures.

Let me know if it was a different one; if you can point me to an unexpected compaction failure, that would obviously be very helpful.

@ZenGround0
Contributor

High-level possible reasons why state grows without bound in discard mode -- somewhere, somehow, some expected GC work is not getting done:

  1. Compaction is erroring. We will see COMPACTION ERROR in the logs (note we'll also see this during lotus shutdown when compacting, which is not an error).
  2. Compaction isn't getting called.
  3. Compaction is getting called but skips work, leaving dead state hanging.
  4. CID-level compaction is working fine, but badger-level GC leaves dead state hanging even after forcing badger compaction.

Whatever the root cause, it also seems somewhat non-deterministic, because some configurations can get it to work.

@ZenGround0
Contributor

My next step is to add debug logging to closely track data movement during compaction and badger GC. This should help us classify the scenario into one of the four options above. Once I've got a decent set of logging changes merged somewhere, I'll ask impacted operators to run that patch and we can consolidate measurements here.

In the meantime, if anyone has logs with evidence that compaction is not running as often as expected, or that compaction is failing unexpectedly, please post them here (a quick way to check is sketched below). I can't rule these options out yet, but I also haven't found evidence of either behavior in the logs provided in this thread.
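
A rough way to check for that evidence (a sketch; the log path and systemd unit name are assumptions taken from the service setup discussed later in this thread):

# If the daemon logs to a file via GOLOG_FILE:
grep -c "COMPACTION ERROR" /var/log/lotus/daemon.log
grep -iE "splitstore|compact" /var/log/lotus/daemon.log | tail -n 100

# If the daemon runs under systemd:
journalctl -u lotus-daemon --since "7 days ago" | grep -iE "splitstore|compact"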

@stuberman
Author

Daemon log can be found here

@SBudo

SBudo commented Feb 14, 2023

Having the same issue, but only on one of our daemons.
All the other daemons, which are identical (hardware and software), are fine.
I tried to clean up the chain folder as well as the splitstore and re-imported a fresh snapshot multiple times, but it keeps filling up the disk and eventually runs out of space.

My logs at the time were filled with these messages until the disk ran out of space:

2023-02-13T10:06:42.416+1100    INFO    badgerbs        v2@v2.2007.3/stream.go:269      doCopy Created batch of size: 4.2 MB in 13.622917ms.

2023-02-13T10:06:42.423+1100    INFO    badgerbs        v2@v2.2007.3/stream.go:312      doCopy Time elapsed: 24m02s, bytes sent: 126 GB, speed: 87 MB/sec

2023-02-13T10:06:42.484+1100    INFO    badgerbs        v2@v2.2007.3/stream.go:269      doCopy Created batch of size: 4.2 MB in 11.483893ms.

2023-02-13T10:06:42.516+1100    INFO    badgerbs        v2@v2.2007.3/stream.go:269      doCopy Created batch of size: 4.2 MB in 12.40986ms.

2023-02-13T10:06:42.644+1100    INFO    badgerbs        v2@v2.2007.3/stream.go:269      doCopy Created batch of size: 4.2 MB in 14.66298ms.

2023-02-13T10:06:42.677+1100    INFO    badgerbs        v2@v2.2007.3/db.go:1031 Storing value log head: {Fid:118 Len:33 Offset:138878067}

2023-02-13T10:06:42.686+1100    INFO    badgerbs        v2@v2.2007.3/stream.go:269      doCopy Created batch of size: 4.2 MB in 19.17112ms.

2023-02-13T10:06:42.702+1100    INFO    badgerbs        v2@v2.2007.3/stream.go:269      doCopy Created batch of size: 4.2 MB in 13.53692ms.

2023-02-13T10:06:42.746+1100    INFO    badgerbs        v2@v2.2007.3/stream.go:269      doCopy Created batch of size: 4.2 MB in 11.257801ms.

2023-02-13T10:06:42.763+1100    INFO    badgerbs        v2@v2.2007.3/stream.go:269      doCopy Created batch of size: 4.2 MB in 12.67925ms.

2023-02-13T10:06:42.819+1100    INFO    badgerbs        v2@v2.2007.3/stream.go:269      doCopy Created batch of size: 4.2 MB in 11.2709ms.

2023-02-13T10:06:42.841+1100    INFO    badgerbs        v2@v2.2007.3/stream.go:269      doCopy Created batch of size: 4.2 MB in 10.089043ms.

2023-02-13T10:06:42.872+1100    INFO    badgerbs        v2@v2.2007.3/stream.go:269      doCopy Created batch of size: 4.2 MB in 12.389161ms.

2023-02-13T10:06:42.905+1100    INFO    badgerbs        v2@v2.2007.3/stream.go:269      doCopy Created batch of size: 4.2 MB in 13.105536ms.

2023-02-13T10:06:42.957+1100    INFO    badgerbs        v2@v2.2007.3/stream.go:269      doCopy Created batch of size: 4.2 MB in 49.271054ms.

2023-02-13T10:06:43.030+1100    INFO    badgerbs        v2@v2.2007.3/stream.go:269      doCopy Created batch of size: 4.2 MB in 17.803969ms.

2023-02-13T10:06:43.101+1100    INFO    badgerbs        v2@v2.2007.3/stream.go:269      doCopy Created batch of size: 4.2 MB in 16.758906ms.

2023-02-13T10:06:43.128+1100    INFO    badgerbs        v2@v2.2007.3/stream.go:269      doCopy Created batch of size: 4.2 MB in 16.381419ms.

2023-02-13T10:06:43.206+1100    INFO    badgerbs        v2@v2.2007.3/stream.go:269      doCopy Created batch of size: 13 MB in 54.552666ms.

2023-02-13T10:06:43.324+1100    INFO    badgerbs        v2@v2.2007.3/stream.go:269      doCopy Created batch of size: 4.2 MB in 12.052173ms.

2023-02-13T10:06:43.340+1100    INFO    badgerbs        v2@v2.2007.3/stream.go:269      doCopy Created batch of size: 4.2 MB in 12.98394ms.

2023-02-13T10:06:43.394+1100    INFO    badgerbs        v2@v2.2007.3/stream.go:269      doCopy Created batch of size: 4.2 MB in 11.929277ms.

2023-02-13T10:06:43.441+1100    INFO    badgerbs        v2@v2.2007.3/stream.go:269      doCopy Created batch of size: 13 MB in 35.111055ms.

2023-02-13T10:06:43.441+1100    INFO    badgerbs        v2@v2.2007.3/stream.go:312      doCopy Time elapsed: 24m03s, bytes sent: 126 GB, speed: 87 MB/sec

@NodeKing

NodeKing commented Mar 1, 2023

Has this issue been addressed in the latest release, 1.20.0?

@stuberman
Author

Has this issue been addressed in the latest release, 1.20.0?

Not yet. It is prevalent among those running the lotus daemon under systemd.

@Shekelme

Shekelme commented Mar 2, 2023

rabinovitch@lotus:~$ du -h -s /home/rabinovitch/.lotus/datastore/splitstore && du -h -s /home/rabinovitch/.lotus/datastore/chain
1,5T /home/rabinovitch/.lotus/datastore/splitstore
217G /home/rabinovitch/.lotus/datastore/chain

((

@stuberman
Author

@ZenGround0 one thing we noted when using systemd is that the default service profile for lotus is not suitable.

Lotus default:

MemoryAccounting=true
MemoryHigh=8G
MemoryMax=10G
LimitNOFILE=8192:10240

This improves splitstore compaction:

MemoryAccounting=false
MemoryHigh=16G
MemoryMax=infinity
LimitNOFILE=1024000:1024000

cat lotus/scripts/lotus-daemon.service

[Unit]
Description=Lotus Daemon
After=network-online.target
Requires=network-online.target

[Service]
Environment=GOLOG_FILE="/var/log/lotus/daemon.log"
Environment=GOLOG_LOG_FMT="json"
ExecStart=/usr/local/bin/lotus daemon
Restart=always
RestartSec=10

MemoryAccounting=true
MemoryHigh=8G
MemoryMax=10G
LimitNOFILE=8192:10240

[Install]
WantedBy=multi-user.target
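
A sketch of applying the relaxed limits without editing the packaged unit file, assuming the service is installed as lotus-daemon.service (a systemd drop-in survives upgrades):

# Write a drop-in override with the values suggested above
sudo mkdir -p /etc/systemd/system/lotus-daemon.service.d
sudo tee /etc/systemd/system/lotus-daemon.service.d/override.conf <<'EOF'
[Service]
MemoryAccounting=false
MemoryHigh=16G
MemoryMax=infinity
LimitNOFILE=1024000:1024000
EOF

# Reload systemd and restart the daemon so the new limits apply
sudo systemctl daemon-reload
sudo systemctl restart lotus-daemon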

@YuXiaoCoder

I have the same problem; my configuration is as follows:

[Chainstore]
  EnableSplitstore = true
  [Chainstore.Splitstore]
    ColdStoreType = "discard"
    HotStoreType = "badger"
    MarkSetType = "badger"
    HotStoreMessageRetention = 0
    HotStoreFullGCFrequency = 3

The software version is 1.20.1

@froid1911

Have the same issue :(((

@RobQuistNL
Contributor

Related: #10712 #10711 #10710

@NodeKing

We still have the issue and are rebuilding our servers every couple of weeks.

@clinta
Contributor

clinta commented May 23, 2023

Version 1.23.0 seems to have resolved this issue for us.

@TippyFlitsUK
Contributor

Thank you for the update @clinta 🙏

@NodeKing

NodeKing commented May 23, 2023

v1.23.0 resolved our issue too. Our servers were pretty flaky on this version for some time, getting out of sync with the network; they needed frequent rebooting to get back in sync. Everything eventually stabilised on its own, and we ended up with massive disk space savings.

Our nodes were getting up to 2 TB of used space before we would rebuild them; now they're using ~280 GB.

@Shekelme


I can't say the same as @NodeKing, but apparently the volume is not growing and the drives are not overflowing.

@TippyFlitsUK
Contributor

Can you run a lotus chain prune hot-moving, @Shekelme? My most recent run took me all the way down to 126 GiB.
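
For anyone following along, a sketch of that check (the prune subcommand is quoted from the comment above; the repo path is an assumption):

# Compare splitstore size before and after a moving hotstore prune
du -sh /.lotus/datastore/splitstore
lotus chain prune hot-moving
du -sh /.lotus/datastore/splitstore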

@stuberman
Author

I have not had any problems with compaction since I moved the daemon from systemd to a local task.
My chain has been running for months now without needing to re-import a minimal snapshot.
