
Growing CPU usage on server agents #687

Closed
andreygolev opened this issue Feb 25, 2020 · 11 comments

Comments

@andreygolev
Contributor

Describe the bug
Hi,
We're trying Dkron in production workloads, and we currently have 30 jobs. Most of them run every 1-3 seconds.
During normal operation, CPU usage on the server agents grows over time; after 24-48 h it is consuming more than 100% of a CPU.

I've compiled Dkron with CPU profiling enabled and ran it for some time under production workload, and it looks like the issue is in listTxnFunc in dkron/store.go.

Please take a look at the attached PDF with profiling info.
profile001.pdf

To mitigate and investigate the growing CPU usage, I tried creating the iterator in listTxnFunc with different options:

		opts := badger.DefaultIteratorOptions
		opts.PrefetchSize = 0
		opts.PrefetchValues = false
		it := txn.NewIterator(opts)

It looks like this helps a bit: CPU usage now grows much more slowly, but it still never stops growing. At around 15:00 you can see that CPU usage started growing much more slowly than before because of the PrefetchSize change. PrefetchSize = 5 with PrefetchValues = true has the same positive effect on CPU usage.
Screenshot 2020-02-25 at 20 25 56
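
For context on why those options matter: with PrefetchValues = false the iterator walks keys only and values are read on demand. A minimal sketch of such a prefix scan, assuming a hypothetical executions:job1: key prefix and an open *badger.DB named db; this is not the exact listTxnFunc code:

	err := db.View(func(txn *badger.Txn) error {
		opts := badger.DefaultIteratorOptions
		opts.PrefetchValues = false // iterate keys only; read values on demand
		it := txn.NewIterator(opts)
		defer it.Close()

		prefix := []byte("executions:job1:")
		for it.Seek(prefix); it.ValidForPrefix(prefix); it.Next() {
			err := it.Item().Value(func(v []byte) error {
				// decode v only for the entries that are actually needed
				return nil
			})
			if err != nil {
				return err
			}
		}
		return nil
	})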

I suspected that CPU usage would stop increasing once the MaxExecutions value was reached, since old executions are then cleaned up according to the code. I changed the default MaxExecutions value from 100 to 5 and noticed that in the Dkron web UI the number of executions really did stay at 5 or fewer, and RAM usage became much smaller. CPU usage, however, kept growing in the same way.
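
To make the cleanup I'm describing concrete: it amounts to deleting the oldest execution keys once a job has more than MaxExecutions of them. A rough sketch of that idea, assuming github.com/dgraph-io/badger/v2 imported as badger and chronologically ordered keys; this is not the actual Dkron code:

	// Keep only the newest maxExecutions entries under a job's key prefix.
	func trimExecutions(db *badger.DB, prefix []byte, maxExecutions int) error {
		return db.Update(func(txn *badger.Txn) error {
			opts := badger.DefaultIteratorOptions
			opts.PrefetchValues = false // keys only
			it := txn.NewIterator(opts)

			var keys [][]byte
			for it.Seek(prefix); it.ValidForPrefix(prefix); it.Next() {
				keys = append(keys, it.Item().KeyCopy(nil))
			}
			it.Close() // close the iterator before writing in the same transaction

			// With chronologically ordered keys, everything before the last
			// maxExecutions entries is old and can be deleted.
			for i := 0; i+maxExecutions < len(keys); i++ {
				if err := txn.Delete(keys[i]); err != nil {
					return err
				}
			}
			return nil
		})
	}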

More info

  • Removing jobs and re-adding them helps: CPU usage drops, then slowly begins to increase again.
  • Restarting Dkron also fixes it temporarily.
  • I noticed in the logs that Badger is backed up from time to time, and the number of keys it backs up keeps growing:
time="2020-02-25T13:07:48Z" level=info msg="DB.Backup Sent 35855 keys\n" node=dkron-server-2
time="2020-02-25T13:24:32Z" level=info msg="DB.Backup Sent 46964 keys\n" node=dkron-server-2
time="2020-02-25T13:36:44Z" level=info msg="DB.Backup Sent 55259 keys\n" node=dkron-server-2
time="2020-02-25T13:48:49Z" level=info msg="DB.Backup Sent 63484 keys\n" node=dkron-server-2
time="2020-02-25T14:03:52Z" level=info msg="DB.Backup Sent 73724 keys\n" node=dkron-server-2
time="2020-02-25T14:17:05Z" level=info msg="DB.Backup Sent 82742 keys\n" node=dkron-server-2
time="2020-02-25T14:30:51Z" level=info msg="DB.Backup Sent 92166 keys\n" node=dkron-server-2
time="2020-02-25T14:43:15Z" level=info msg="DB.Backup Sent 100561 keys\n" node=dkron-server-2
time="2020-02-25T14:55:59Z" level=info msg="DB.Backup Sent 109236 keys\n" node=dkron-server-2
time="2020-02-25T15:10:26Z" level=info msg="DB.Backup Sent 119040 keys\n" node=dkron-server-2
time="2020-02-25T15:22:26Z" level=info msg="DB.Backup Sent 127211 keys\n" node=dkron-server-2
time="2020-02-25T15:34:52Z" level=info msg="DB.Backup Sent 135704 keys\n" node=dkron-server-2
time="2020-02-25T15:46:41Z" level=info msg="DB.Backup Sent 143739 keys\n" node=dkron-server-2
time="2020-02-25T16:00:53Z" level=info msg="DB.Backup Sent 153395 keys\n" node=dkron-server-2
time="2020-02-25T16:15:27Z" level=info msg="DB.Backup Sent 163276 keys\n" node=dkron-server-2
time="2020-02-25T16:28:55Z" level=info msg="DB.Backup Sent 172474 keys\n" node=dkron-server-2
time="2020-02-25T16:40:45Z" level=info msg="DB.Backup Sent 180520 keys\n" node=dkron-server-2
time="2020-02-25T16:54:33Z" level=info msg="DB.Backup Sent 189931 keys\n" node=dkron-server-2
time="2020-02-25T17:07:35Z" level=info msg="DB.Backup Sent 198811 keys\n" node=dkron-server-2
time="2020-02-25T17:21:57Z" level=info msg="DB.Backup Sent 208570 keys\n" node=dkron-server-2
time="2020-02-25T17:34:36Z" level=info msg="DB.Backup Sent 217170 keys\n" node=dkron-server-2
time="2020-02-25T17:46:52Z" level=info msg="DB.Backup Sent 225595 keys\n" node=dkron-server-2
time="2020-02-25T17:58:48Z" level=info msg="DB.Backup Sent 233728 keys\n" node=dkron-server-2
time="2020-02-25T18:12:18Z" level=info msg="DB.Backup Sent 242888 keys\n" node=dkron-server-2
time="2020-02-25T18:24:21Z" level=info msg="DB.Backup Sent 251110 keys\n" node=dkron-server-2
time="2020-02-25T18:37:31Z" level=info msg="DB.Backup Sent 260100 keys\n" node=dkron-server-2
time="2020-02-25T18:50:30Z" level=info msg="DB.Backup Sent 268918 keys\n" node=dkron-server-2
  • It feels like the issue is that Badger is not actually cleaned up fully and some structures grow forever, causing the full scan by execution prefix to get slower and slower and to consume more CPU as it iterates over an ever bigger dataset. Restarting Dkron fixes the issue temporarily because, according to the code, Badger is fully wiped on start.
  • I also noticed that the Badger vlog file in the data dir keeps growing and is never shrunk.
    The same is visible in file cache memory usage: it grows steadily along with CPU, which makes me think it's all related.

Screenshot 2020-02-25 at 20 26 06

I have no ideas so far on how to fix it.
Has anyone encountered the same behaviour?

Expected behavior
CPU usage does not grow over time when no jobs are being added to Dkron.

Specifications:

  • OS: Linux

Additional context
It's all run in Kubernetes

@andreygolev
Contributor Author

A small update here.
I compiled Dkron with the following BadgerDB settings, and RAM and CPU usage dropped significantly: around 50 MB now instead of a few GB.

	opts := badger.DefaultOptions(dir).
		WithValueLogLoadingMode(options.FileIO). // read the value log via file I/O instead of mmap
		WithTableLoadingMode(options.FileIO).    // read LSM tables via file I/O instead of mmap
		WithLogger(log).
		WithCompression(options.None).           // no block compression
		WithMaxCacheSize(10).                    // shrink the block cache to almost nothing
		WithNumMemtables(1).
		WithNumLevelZeroTables(1).
		WithNumLevelZeroTablesStall(2).
		WithValueLogMaxEntries(10000).           // roll over value log files more often
		WithKeepL0InMemory(true).
		WithMaxTableSize(1 << 20).               // 1 MB tables
		WithLevelSizeMultiplier(1)

It's still visible, though, that not everything is actually deleted from the DB, as CPU and RAM usage are still growing, just much more slowly.

@vcastellm
Member

Thanks @andreygolev, I'm following this issue closely; thanks for the investigation. I need to check what you're showing here.

I'll update here

@vcastellm
Member

vcastellm commented Mar 4, 2020

This is what my CPU metrics look like:
image

It's a pretty low load, idle is > 90%, and I have almost 40 jobs.

You need to look at the logs; they should say something about what is not working well.

I can tell you that 1 s jobs are not going to work well; jobs of > 3 s should be OK, but I hope this isn't related to the CPU problem.

Badger could be the reason, as you investigated, but there must be something specific to your use case, as I have never observed the CPU behaviour you are experiencing.

Can you please provide more info about your job setup?

@andreygolev
Contributor Author

I also spent some time investigating what really happens behind the scenes, and I'm pretty confident that execution cleanup after MaxExecutions is reached works completely fine, and that it.Seek() through BadgerDB also iterates over no more than MaxExecutions elements. So there are no problems with the logic in the Dkron code, but there are definitely problems with BadgerDB itself.
Maybe you don't see problems because your workload doesn't delete many executions, since the MaxExecutions cap isn't reached. My delete rate is quite high :)

I also saw some similar issues in the BadgerDB repo. Users were reporting degrading performance with deleted keys and growing storage. The Badger maintainers suggested running ValueLogGC more often, but that didn't help at all for me either.
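
For completeness, the suggested mitigation boils down to a periodic GC loop roughly like this (the pattern from the Badger docs, not Dkron code; db is an open *badger.DB):

	ticker := time.NewTicker(5 * time.Minute)
	defer ticker.Stop()
	for range ticker.C {
		// Rewrite value log files until Badger reports there is nothing left to GC.
		for {
			if err := db.RunValueLogGC(0.5); err != nil {
				break // typically badger.ErrNoRewrite
			}
		}
	}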

So, after some trial and error with no luck, I decided to try BoltDB instead of BadgerDB, and it's rock solid now. Please take a look at the attached screenshots.
CPU usage never exceeds 4% with exactly the same workload.
The memory footprint is also around 35 MB, which is pretty awesome compared to 1 GB+ and growing with BadgerDB.
I'm pretty sure it's BadgerDB's fault :)

Performance is slower than Badger, though, but I'm pretty sure that's easy to solve by using an in-memory DB. In my case I just moved the BoltDB database file to tmpfs.
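
For reference, "moving the file to tmpfs" just means pointing the store at a path on a RAM-backed filesystem. A minimal sketch, assuming go.etcd.io/bbolt imported as bolt and a tmpfs mount at /dev/shm (the path is illustrative):

	// Open the BoltDB file on tmpfs so reads and writes stay in RAM.
	db, err := bolt.Open("/dev/shm/dkron/dkron.db", 0600, &bolt.Options{Timeout: time.Second})
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()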

Screenshot 2020-03-05 at 10 52 34
Screenshot 2020-03-05 at 10 52 38
Screenshot 2020-03-05 at 10 59 19

This is how I implemented it: andreygolev@c098107

@vcastellm
Member

I also spent some time investigating what really happens behind the scenes, and I'm pretty confident that execution cleanup after MaxExecutions is reached works completely fine, and that it.Seek() through BadgerDB also iterates over no more than MaxExecutions elements. So there are no problems with the logic in the Dkron code, but there are definitely problems with BadgerDB itself.
Maybe you don't see problems because your workload doesn't delete many executions, since the MaxExecutions cap isn't reached. My delete rate is quite high :)

Yes and probably :)

I also saw some similar issues in the BadgerDB repo. Users were reporting degrading performance with deleted keys and growing storage. The Badger maintainers suggested running ValueLogGC more often, but that didn't help at all for me either.

I'm following them too; this is clearly related to dgraph-io/badger#718.

So, after some trial and error with no luck, I decided to try BoltDB instead of BadgerDB, and it's rock solid now. Please take a look at the attached screenshots.
CPU usage never exceeds 4% with exactly the same workload.
The memory footprint is also around 35 MB, which is pretty awesome compared to 1 GB+ and growing with BadgerDB.
I'm pretty sure it's BadgerDB's fault :)

I have some test branches using different engines. @andreygolev, given your particular use case, may I ask you to try this? d235916#diff-58f7996e41c5c348c30c7362170178d1L465

That would be helpful. I initially went with Badger because I liked the interface and the features. As you said, you need to use tmpfs with BoltDB, which I don't like, but BuntDB has an almost identical interface to Badger, which makes it very easy to switch between them.
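
To illustrate the similarity: BuntDB also exposes read/write transactions and pattern-based key iteration, so a Badger-style prefix scan maps over almost directly. A minimal sketch, assuming github.com/tidwall/buntdb and illustrative keys:

	db, _ := buntdb.Open(":memory:")
	defer db.Close()

	// Write inside an update transaction, much like badger's db.Update.
	_ = db.Update(func(tx *buntdb.Tx) error {
		_, _, err := tx.Set("executions:job1:001", `{"success":true}`, nil)
		return err
	})

	// Read with a key-pattern scan, analogous to a Badger prefix iterator.
	_ = db.View(func(tx *buntdb.Tx) error {
		return tx.AscendKeys("executions:job1:*", func(key, value string) bool {
			fmt.Println(key, value)
			return true // keep iterating
		})
	})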

Thanks

@andreygolev
Contributor Author

Totally agree, tmpfs is not the way :)
Thanks, I'll try BuntDB and come back with results within a few days.

By the way, the issue was easily reproducible by running 50 jobs @every 3s with MaxExecutions = 5.
Within 2 h the CPU usage growth trend was easy to see on a local machine.
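
In case it helps with reproduction, this is roughly what that setup looks like when the jobs are created through the REST API. A sketch assuming a local agent on the default HTTP port 8080 and the shell executor; endpoint and field names follow the Dkron v1 API:

	// Create 50 jobs that each run every 3 seconds.
	for i := 0; i < 50; i++ {
		job := fmt.Sprintf(`{
			"name": "stress-%d",
			"schedule": "@every 3s",
			"executor": "shell",
			"executor_config": {"command": "true"}
		}`, i)
		resp, err := http.Post("http://localhost:8080/v1/jobs", "application/json", strings.NewReader(job))
		if err != nil {
			log.Fatal(err)
		}
		resp.Body.Close()
	}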

@andreygolev
Contributor Author

So, with BuntDB things look good and stable after a day of our workload.
I also added the DeleteExecutions method, which was missing in the buntdb branch.
If you're OK with the results, I can create a PR based on your buntdb branch.
Screenshot 2020-03-10 at 10 29 40

@vcastellm
Member

Thanks for the test @andreygolev, yes, go ahead with the PR.

@vcastellm
Member

Fixed in #702

@xtbwqtq

xtbwqtq commented Jun 27, 2024

(Quotes the full original issue description above.)

I encountered a similar situation. I was frequently deleting and inserting key-value pairs, and CPU usage kept increasing; it was mainly caused by iterators. I happened to notice the Badger warning in the logs, "Block cache might be too small", so I increased that parameter. Now it seems normal; I need to keep observing for a while.
	func initOps() badger.Options {
		ops := badger.DefaultOptions(storagePath)
		// Raise the block cache from the default to 1 GiB.
		return ops.WithBlockCacheSize(1024 << 20)
	}

@vcastellm
Member

Hi @xtbwqtq, and thanks for the deep investigation, but this issue is about Badger, which was removed from Dkron a long time ago due to these kinds of issues 🫤 As you can see in the issue that fixed this, we moved to BuntDB, which has proven much smoother and more efficient.

Is it a possibility for you to upgrade Dkron?
