Ability to truncate raft.db on nomad server nodes #4477
Can you please put some actual numbers to these definitions? :)
@jippi In our case the Nomad servers were on t2.micro instances, and we launched Spark jobs with 2000 executors (i.e. 2000 allocations). After the cluster hung due to insufficient hardware we upgraded the servers to t2.xlarge, but for now …
@tantra35 That seems quite underpowered; see the recommended production requirements, especially given the burstable capacity of the t2 family. Personally we use m4.xlarge and m4.2xlarge in prod.
@jippi I understand this (and we have already built a new, more powerful cluster), but it looks strange that after we dramatically reduced the allocation count (i.e. stopped all Spark jobs), Nomad doesn't return the occupied memory, and doesn't do so even after a server restart. It's not a problem to replace all server nodes (in that case, as I mentioned above, a new server takes the expected amount of memory), but sometimes this is not so easy (for example on static dedicated hardware).
Did you try to force a GC?
@jippi That was the first thing we did.
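(For context: a forced GC is triggered through Nomad's `/v1/system/gc` endpoint, e.g. with `nomad system gc`. Below is a minimal sketch of the same call using the official Go API client, assuming a reachable local agent:)

```go
package main

import (
	"log"

	"github.com/hashicorp/nomad/api"
)

func main() {
	// DefaultConfig honors NOMAD_ADDR and falls back to http://127.0.0.1:4646.
	client, err := api.NewClient(api.DefaultConfig())
	if err != nil {
		log.Fatalf("failed to create Nomad client: %v", err)
	}
	// PUT /v1/system/gc: asks the servers to garbage-collect stopped jobs,
	// evaluations, allocations, and dead nodes.
	if err := client.System().GarbageCollect(); err != nil {
		log.Fatalf("forced GC failed: %v", err)
	}
	log.Println("forced GC triggered")
}
```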
@burdandrei Can you share how you solved this in your case?
@tantra35 Until Nomad takes a snapshot, the Raft DB is not compacted. Currently the threshold for when we take a snapshot is not tunable: it snapshots after 8192 Raft writes. What's the output of …? We could make this value tunable, but that's a tradeoff between disk I/O (since compacting the Raft DB via snapshots is expensive) and disk utilization. I would recommend increasing disk capacity on your servers.
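(As a side note, the 8192-write threshold mentioned above appears to come from the `hashicorp/raft` library defaults that Nomad inherits; a small sketch printing them, assuming the library's `DefaultConfig`:)

```go
package main

import (
	"fmt"

	"github.com/hashicorp/raft"
)

func main() {
	// Nomad uses hashicorp/raft's defaults here; the snapshot threshold is
	// not exposed through Nomad's own server configuration.
	cfg := raft.DefaultConfig()
	fmt.Printf("SnapshotThreshold: %d log entries\n", cfg.SnapshotThreshold) // 8192
	fmt.Printf("SnapshotInterval:  %s\n", cfg.SnapshotInterval)              // 120s
}
```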
Well @tantra35, in the meantime I changed the servers to i3.xlarge to have 32 GB RAM and NVMe for the data dir. But to be honest, the raft.db file hardly reaches 200 MB. @preetapan It shows that deep inside, Nomad is not foolproof. =(
Here are our values:
Since the cluster is now in a static state (no new allocations are added or removed), we have no chance to move the case off the dead point. But it would help if we could send some signal to Nomad, for example USR1, or if Nomad did some cleanup at start (that seems reasonable), or if there were a standalone tool to maintain …
@preetapan A few seconds ago on one of our clusters the diff between …
but the snapshot (…
In the logs we see the following: …
It's very strange and looks like a bug that from (…
in Nomad's case …
I think in Nomad … After writing a little program that emulates …:

```go
package main

import (
	"fmt"

	raftboltdb "github.com/hashicorp/raft-boltdb"
)

func main() {
	// Open the Raft log store (the BoltDB file used by Nomad servers).
	db, err := raftboltdb.NewBoltStore("./raft.db")
	if err != nil {
		fmt.Printf("[ERR] failed to open raft.db: %v\n", err)
		return
	}
	defer db.Close()

	// Find the range of log entries currently held in the store.
	first, _ := db.FirstIndex()
	last, _ := db.LastIndex()
	fmt.Printf("[INFO] raft: Compacting logs from %d to %d\n", first, last)

	// Delete the whole range, as the raft library does after a snapshot.
	if err := db.DeleteRange(first, last); err != nil {
		fmt.Printf("[ERR] log compaction failed: %v\n", err)
	} else {
		fmt.Printf("[INFO] log compaction ok\n")
	}
}
```

We found that raft.db was not being truncated, so we performed the compaction manually through …; after that the Nomad servers stopped eating too much memory.
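(This matches BoltDB's behavior: `DeleteRange` frees pages inside the file, but Bolt never shrinks the file on disk. To actually reclaim the space, the database has to be rewritten into a fresh file, for example with `bbolt`'s `Compact` helper. A hedged sketch, assuming the server is stopped and that `go.etcd.io/bbolt` can read the file, which is format-compatible with the `boltdb/bolt` files `raft-boltdb` writes; the destination file name is illustrative:)

```go
package main

import (
	"log"

	bolt "go.etcd.io/bbolt"
)

func main() {
	// Open the bloated raft.db read-only; the Nomad server must be stopped.
	src, err := bolt.Open("./raft.db", 0600, &bolt.Options{ReadOnly: true})
	if err != nil {
		log.Fatalf("open raft.db: %v", err)
	}
	defer src.Close()

	// Fresh destination file (hypothetical name) for the live pages only.
	dst, err := bolt.Open("./raft-compacted.db", 0600, nil)
	if err != nil {
		log.Fatalf("open destination: %v", err)
	}
	defer dst.Close()

	// Copy all live key/value pairs; with txMaxSize 0 this runs in a single
	// transaction, and the new file shrinks to the size of the live data.
	if err := bolt.Compact(dst, src, 0); err != nil {
		log.Fatalf("compact failed: %v", err)
	}
	log.Println("wrote compacted copy to raft-compacted.db; swap it in for raft.db")
}
```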
Nomad version
0.8.4
After some launches of huge jobs we have very big `raft.db` files on the server nodes, so a restart of a Nomad server takes a huge amount of time and the Nomad server eats too much memory. When we replace a server that has this huge `raft.db` with a new one, on that node `raft.db` is not so big and Nomad eats the expected amount of memory. So it would be very useful to be able to manually truncate `raft.db`.
I found some mention of the same problem in the Consul issue tracker:
hashicorp/consul#866
But it is not clear to me whether Nomad has the same behavior.