Lotus splitstore not discarding - chain keeps growing #9840
Comments
Chain log from the last hour:
Chain log (node) from the last 5 hours:
Try setting HotStoreFullGCFrequency to something like 1 to force a moving GC on Badger. It happens once a week by default. You can set it to 3 if you want it to occur daily.
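For reference, that knob lives in the repo's config.toml under the Splitstore section; a minimal sketch, mirroring the config layout quoted later in this thread and assuming the discard cold store most people here are running:
[Chainstore]
EnableSplitstore = true
[Chainstore.Splitstore]
ColdStoreType = "discard"
# 1 = run a full (moving) Badger GC on every compaction; the default is weekly
HotStoreFullGCFrequency = 1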
I'm experiencing the same issue. The splitstore seems to keep growing. I have applied the HotStoreFullGCFrequency=1 configuration, but it hasn't helped.
Current size:
docker run command in use:
I also have a second server running with the following config (same issue):
lotus/datastore
Using essentially the same config, but with pruning enabled and without the HOTSTOREFULLGCFREQUENCY=1 setting.
P2P_ANNOUNCE_IP=$(wget -qO- ifconfig.me/ip)
I have been experimenting with different configurations for weeks without success. Any guidance would be greatly appreciated.
Did you delete the chain folder contents and clean the splitstore folder before importing the lightweight snapshot @NodeKing?
I did not see downloading a new minimal snapshot help with this issue, even after deleting the cold store (/chain), clearing the hot store ( ./lotus/lotus-shed splitstore clear --repo=/.lotus) and setting HOTSTOREFULLGCFREQUENCY=1. What does seem to work is downloading a minimal snapshot and then, before importing, renaming .lotus/datastore to .lotus/datastore.old. The import process then rebuilds the entire directory, possibly eliminating any corrupt or old entries.
However, it also removes the metadata, client and staging data. BTW, it is now fully obvious to me that lotus client retrievals (legacy deals from Evergreen) cause all sorts of chain issues. That process uses GraphSync and causes the chain to lose sync.
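A rough sketch of that rename-and-reimport workflow, assuming the repo lives at /.lotus as in the commands above, the daemon is stopped first, and the snapshot path is a placeholder (note the caveat above about losing metadata, client and staging data):
mv /.lotus/datastore /.lotus/datastore.old
lotus daemon --import-snapshot /path/to/minimal-snapshot.car --halt-after-import
# once the node is back up and fully synced, reclaim the space
rm -rf /.lotus/datastore.old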
Those folders have no impact on the splitstore size.
Very interesting!! Thanks Stu, I will add that to our current monitoring. 🙏
I created a new server when starting to use the HOTSTOREFULLGCFREQUENCY=1 setting @TippyFlitsUK. I first ran the container with the config from my first post, but with this final line instead to import from snapshot:
I am still seeing the chain grow even with this setting.
BTW - I continue to see the chain sync get stuck and fall behind several times a day. I restart the service every 12 hours to keep the chain synced. I hear this is a common problem.
I was forced to download a new snapshot again two days ago as the size of the hot store was at 1 TiB
I'm creating a new server too, as the server is now using 1.5TB of space. |
Have you tried adjusting the splitstore settings at all @NodeKing?
Have tried variants of the below env vars without any success. Can you suggest a config to try @TippyFlitsUK?
LOTUS_CHAINSTORE_ENABLESPLITSTORE=true
I had a miner that cleaned up after itself for a while, but now just keeps growing and growing. The discard process doesn't seem to happen. How does that part work? Once stuff is moved to coldstore, when does it get deleted? (It seems that "prune" means something different in this instance?) |
I'm under the impression that the pruning never occurs because the cold store is meant to be discarding, so the autoPrune env var is probably redundant config. Would be good to get some clarity on this.
lotus-shed splitstore info
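For anyone else wanting to check their node, that command can be pointed at the daemon repo; a small sketch, assuming it takes the same --repo flag as the splitstore clear subcommand used earlier in this thread, paired with du for a quick size check:
lotus-shed splitstore info --repo=/.lotus
du -sch /.lotus/datastore/splitstore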
What version of lotus are you running @NodeKing? |
Still using the same version as in the docker run commands above @TippyFlitsUK: filecoin/lotus:v1.18.2
Thanks @NodeKing! Would you be able to upgrade to v1.19.0 with a fresh snapshot and a fully clean datastore?
Please note that the AutoPrune feature has been retired. Many thanks!!
I have 2 miners on splitstore with discard now, both on 1.19.0. My test miner (mainnet) did discard once, going from 500ish to 350ish GB, but after that it kept growing until it was over 750GB, then it fell out of sync completely. My production miner on splitstore (still sealing) grew to about 650GB, but did go down to 400GB overnight. The process seems to be hit or miss, and troubleshooting is like peering into a mysterious black box.
Are there any splitstore logs?
It's right there -- warm up is erroring with a missing ref; cc @ZenGround0
My daemon ran into a similar issue. I changed the hotstore GC frequency to 20, then to 10, and it is now running at 1. It did not help. The hotstore size is now at 1.7 TB out of a 1.9 TB drive. I need to delete the datastore folder and re-import the snapshot to clear the disk space.
Some more discussion about this issue: https://filecoinproject.slack.com/archives/CPFTWMY7N/p1673472134663959
Same issue across 10 nodes :/ Can we bump this to a higher priority somehow?
I think we should -- pinging @jennijuju @ZenGround0. The issue is that compaction fails with some unreachable object. The root cause may be changes in what is reachable with the new FVM machinery, or some other change that makes the compactor try to traverse something unreachable.
Haha great timing! https://filecoinproject.slack.com/archives/CP50PPW2X/p1676297112125189
@vyzo Where? |
One of the top logs...
Do you mean this one?
from stuberman's original comment? I thought so at first too, but it's just wrapping a splitstore closure, which is expected. I saw a proper COMPACTION ERROR like this when digging through f08399's logs too, and they were all splitstore closures. Let me know if it was a different one; if you can point me to an unexpected compaction failure, it would obviously be very helpful.
High-level possible reasons why state grows without bound in discard mode: somewhere, somehow, some expected GC work is not getting done.
Whatever the root cause, it also seems somewhat non-deterministic, because some configurations can get it to work.
My next step is to modify debug logs to closely track data movement during compaction and badger GC. This should help us classify the scenario into one of the options above. Once I've got a decent set of logs merged somewhere, I'll request that impacted operators run this patch and we can consolidate measurements here. In the meantime, if anyone has logs with evidence that compaction is not running as often as expected, or that compaction is failing unexpectedly, please post them here. I can't rule these options out yet, but I also haven't found evidence of either behavior in the logs provided in this thread.
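In the meantime, operators who want more detail can raise splitstore log verbosity on a running node; a small sketch, assuming the go-log subsystem is named splitstore (confirm the exact name with lotus log list):
lotus log list | grep -i splitstore
lotus log set-level --system splitstore debug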
Daemon log can be found here
Having the same issue, but only on one of our daemons. My logs at the time filled up with these messages until the disk ran out of space:
Has this issue been addressed in the latest release, 1.20.0?
Not yet. It is prevalent with those running lotus daemon with systemd. |
(( |
@ZenGround0 One thing we noted when using systemd is that the default service profile for lotus is not suitable.
Lotus default:
This improves splitstore compaction:
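The actual unit settings were not captured above, so purely as an illustration of the mechanism (the unit name and directive values are placeholders, not the poster's): a systemd drop-in override is the usual way to change a service's profile without editing the packaged unit.
# open an editor for a drop-in override of the (assumed) lotus daemon unit
sudo systemctl edit lotus-daemon.service
# placeholder [Service] directives an operator might add in the override, e.g.:
#   LimitNOFILE=1000000     # raise the open-files limit
#   TimeoutStopSec=600      # give a long-running compaction time to finish on stop
sudo systemctl restart lotus-daemon.service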
I have the same problem. My configuration is as follows:
[Chainstore]
EnableSplitstore = true
[Chainstore.Splitstore]
ColdStoreType = "discard"
HotStoreType = "badger"
MarkSetType = "badger"
HotStoreMessageRetention = 0
HotStoreFullGCFrequency = 3
The software version is 1.20.1
Have the same issue :((( |
We still have the issue and are rebuilding our servers every couple of weeks. |
Version 1.23.0 seems to have resolved this issue for us. |
Thank you for the update @clinta 🙏 |
v1.23.0 resolved our issue too. Our servers were pretty flaky on this version for some time, getting out of sync with the network and needing frequent reboots to get back in sync. Everything eventually stabilised on its own, and we ended up with massive disk space savings. Our nodes were getting up to 2TB of used space before we would rebuild them; now they're using ~280GB.
Can't say the same as @NodeKing, but apparently the volume is not growing and the drives are not overflowing.
Can you run a lotus chain prune hot-moving @Shekelme? My most recent run took me all the way down to 126 GiB.
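For anyone trying this, the command is issued against the running daemon; a small sketch, assuming the moving GC needs enough free disk to copy the live hotstore while it works:
lotus chain prune hot-moving
# afterwards, confirm the reclaimed space
du -sch /.lotus/datastore/splitstore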
I have not had any problems with compaction since I changed the sync from systemd to a local task. |
Checklist
Latest release, or the most recent RC (release candidate) for the upcoming release, or the dev branch (master), or have an issue updating to any of these.
Lotus component
Lotus Version
Describe the Bug
Splitstore does not seem to discard old blocks (this is after I imported a new snapshot three days ago)
du -sch /.lotus/datastore/* | sort -rh
392G total
284G /.lotus/datastore/splitstore
109G /.lotus/datastore/chain
8.1M /.lotus/datastore/metadata
20K /.lotus/datastore/staging
20K /.lotus/datastore/client
I noticed that the lotus daemon will also quit randomly - see logs below
Logging Information
Repo Steps
...