
Growing CPU usage on server agents #687

Closed
andreygolev opened this issue Feb 25, 2020 · 11 comments

Comments

@andreygolev
Contributor

Describe the bug
Hi,
We're trying Dkron in production workloads, and we currently have 30 jobs. Most of them run every 1-3 seconds.
During normal operation, CPU usage on the server agents grows over time; after 24-48 h it is consuming more than 100% of a CPU.

I've compiled Dkron with CPU profiling enabled and ran it for some time under production workload, and it looks like the issue is in listTxnFunc in dkron/store.go.

Please take a look at the attached PDF with profiling info.
profile001.pdf

To mitigate and investigate the growing CPU usage, I tried creating the iterator in listTxnFunc with different options:

		opts := badger.DefaultIteratorOptions
		opts.PrefetchSize = 0
		opts.PrefetchValues = false
		it := txn.NewIterator(opts)

It looks like this helps a bit: CPU usage now grows much more slowly, but it still never stops growing. At around 15:00 you can see that CPU usage started growing much more slowly than before because of the PrefetchSize change. PrefetchSize = 5 with PrefetchValues = true has the same positive effect on CPU usage.
Screenshot 2020-02-25 at 20 25 56
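
For context on why those options matter: with PrefetchValues = false the iterator walks keys only and values are read on demand. A minimal sketch of such a prefix scan, assuming a hypothetical executions:job1: key prefix and an open *badger.DB named db; this is not the exact listTxnFunc code:

	err := db.View(func(txn *badger.Txn) error {
		opts := badger.DefaultIteratorOptions
		opts.PrefetchValues = false // iterate keys only; read values on demand
		it := txn.NewIterator(opts)
		defer it.Close()

		prefix := []byte("executions:job1:")
		for it.Seek(prefix); it.ValidForPrefix(prefix); it.Next() {
			err := it.Item().Value(func(v []byte) error {
				// decode v only for the entries that are actually needed
				return nil
			})
			if err != nil {
				return err
			}
		}
		return nil
	})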

I suspected that CPU usage would stop increasing once the MaxExecutions value was reached, since old executions are then cleaned up according to the code. I changed the default MaxExecutions value from 100 to 5 and noticed that in the Dkron web UI the number of executions really did stay at 5 or fewer, and RAM usage became much smaller. CPU usage, however, kept growing in the same way.
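
To make the cleanup I'm describing concrete: it amounts to deleting the oldest execution keys once a job has more than MaxExecutions of them. A rough sketch of that idea, assuming github.com/dgraph-io/badger/v2 imported as badger and chronologically ordered keys; this is not the actual Dkron code:

	// Keep only the newest maxExecutions entries under a job's key prefix.
	func trimExecutions(db *badger.DB, prefix []byte, maxExecutions int) error {
		return db.Update(func(txn *badger.Txn) error {
			opts := badger.DefaultIteratorOptions
			opts.PrefetchValues = false // keys only
			it := txn.NewIterator(opts)

			var keys [][]byte
			for it.Seek(prefix); it.ValidForPrefix(prefix); it.Next() {
				keys = append(keys, it.Item().KeyCopy(nil))
			}
			it.Close() // close the iterator before writing in the same transaction

			// With chronologically ordered keys, everything before the last
			// maxExecutions entries is old and can be deleted.
			for i := 0; i+maxExecutions < len(keys); i++ {
				if err := txn.Delete(keys[i]); err != nil {
					return err
				}
			}
			return nil
		})
	}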

More info

  • Removing jobs and re-adding them helps: CPU usage drops, then slowly begins to increase again.
  • Restarting Dkron also fixes it temporarily.
  • I noticed in the logs that Badger is backed up from time to time, and the number of keys it backs up keeps growing:
time="2020-02-25T13:07:48Z" level=info msg="DB.Backup Sent 35855 keys\n" node=dkron-server-2
time="2020-02-25T13:24:32Z" level=info msg="DB.Backup Sent 46964 keys\n" node=dkron-server-2
time="2020-02-25T13:36:44Z" level=info msg="DB.Backup Sent 55259 keys\n" node=dkron-server-2
time="2020-02-25T13:48:49Z" level=info msg="DB.Backup Sent 63484 keys\n" node=dkron-server-2
time="2020-02-25T14:03:52Z" level=info msg="DB.Backup Sent 73724 keys\n" node=dkron-server-2
time="2020-02-25T14:17:05Z" level=info msg="DB.Backup Sent 82742 keys\n" node=dkron-server-2
time="2020-02-25T14:30:51Z" level=info msg="DB.Backup Sent 92166 keys\n" node=dkron-server-2
time="2020-02-25T14:43:15Z" level=info msg="DB.Backup Sent 100561 keys\n" node=dkron-server-2
time="2020-02-25T14:55:59Z" level=info msg="DB.Backup Sent 109236 keys\n" node=dkron-server-2
time="2020-02-25T15:10:26Z" level=info msg="DB.Backup Sent 119040 keys\n" node=dkron-server-2
time="2020-02-25T15:22:26Z" level=info msg="DB.Backup Sent 127211 keys\n" node=dkron-server-2
time="2020-02-25T15:34:52Z" level=info msg="DB.Backup Sent 135704 keys\n" node=dkron-server-2
time="2020-02-25T15:46:41Z" level=info msg="DB.Backup Sent 143739 keys\n" node=dkron-server-2
time="2020-02-25T16:00:53Z" level=info msg="DB.Backup Sent 153395 keys\n" node=dkron-server-2
time="2020-02-25T16:15:27Z" level=info msg="DB.Backup Sent 163276 keys\n" node=dkron-server-2
time="2020-02-25T16:28:55Z" level=info msg="DB.Backup Sent 172474 keys\n" node=dkron-server-2
time="2020-02-25T16:40:45Z" level=info msg="DB.Backup Sent 180520 keys\n" node=dkron-server-2
time="2020-02-25T16:54:33Z" level=info msg="DB.Backup Sent 189931 keys\n" node=dkron-server-2
time="2020-02-25T17:07:35Z" level=info msg="DB.Backup Sent 198811 keys\n" node=dkron-server-2
time="2020-02-25T17:21:57Z" level=info msg="DB.Backup Sent 208570 keys\n" node=dkron-server-2
time="2020-02-25T17:34:36Z" level=info msg="DB.Backup Sent 217170 keys\n" node=dkron-server-2
time="2020-02-25T17:46:52Z" level=info msg="DB.Backup Sent 225595 keys\n" node=dkron-server-2
time="2020-02-25T17:58:48Z" level=info msg="DB.Backup Sent 233728 keys\n" node=dkron-server-2
time="2020-02-25T18:12:18Z" level=info msg="DB.Backup Sent 242888 keys\n" node=dkron-server-2
time="2020-02-25T18:24:21Z" level=info msg="DB.Backup Sent 251110 keys\n" node=dkron-server-2
time="2020-02-25T18:37:31Z" level=info msg="DB.Backup Sent 260100 keys\n" node=dkron-server-2
time="2020-02-25T18:50:30Z" level=info msg="DB.Backup Sent 268918 keys\n" node=dkron-server-2
  • It feels like the issue is that Badger is not actually cleaned up fully and some structures grow forever, causing the full scan by execution prefix to get slower and slower and to consume more CPU as it iterates over an ever bigger dataset. Restarting Dkron fixes the issue temporarily because, according to the code, Badger is fully wiped on start.
  • I also noticed that the Badger vlog file in the data dir keeps growing and is never shrunk.
    The same is visible in file cache memory usage: it grows steadily along with CPU, which makes me think it's all related.

Screenshot 2020-02-25 at 20 26 06

I have no ideas so far on how to fix it.
Has anyone encountered the same behaviour?

Expected behavior
CPU usage does not grow over time when no jobs are being added to Dkron.

Specifications:

  • OS: Linux

Additional context
It's all run in Kubernetes

@andreygolev
Contributor Author

A small update here.
I compiled Dkron with the following BadgerDB settings, and RAM and CPU usage dropped significantly: around 50 MB now instead of a few GB.

	opts := badger.DefaultOptions(dir).
		WithValueLogLoadingMode(options.FileIO). // read the value log via file I/O instead of mmap
		WithTableLoadingMode(options.FileIO).    // read LSM tables via file I/O instead of mmap
		WithLogger(log).
		WithCompression(options.None).           // no block compression
		WithMaxCacheSize(10).                    // shrink the block cache to almost nothing
		WithNumMemtables(1).
		WithNumLevelZeroTables(1).
		WithNumLevelZeroTablesStall(2).
		WithValueLogMaxEntries(10000).           // roll over value log files more often
		WithKeepL0InMemory(true).
		WithMaxTableSize(1 << 20).               // 1 MB tables
		WithLevelSizeMultiplier(1)

It's still visible, though, that not everything is actually deleted from the DB, as CPU and RAM usage are still growing, just much more slowly.

@vcastellm
Member

Thanks @andreygolev, I'm following this issue closely; thanks for the investigation. I need to check what you're showing here.

I'll update here

@vcastellm
Member

vcastellm commented Mar 4, 2020

This is what my CPU metrics look like:
image

It's a pretty low load, idle is > 90%, and I have almost 40 jobs.

You need to look at the logs; they should say something about what is not working well.

I can tell you that 1 s jobs are not going to work well; jobs of > 3 s should be OK, but I hope this isn't related to the CPU problem.

Badger could be the reason, as you investigated, but there must be something specific to your use case, as I have never observed the CPU behaviour you are experiencing.

Can you please provide more info about your job setup?

@andreygolev
Contributor Author

I also spent some time investigating what really happens behind the scenes, and I'm pretty confident that execution cleanup after MaxExecutions is reached works completely fine, and that it.Seek() through BadgerDB also iterates over no more than MaxExecutions elements. So there are no problems with the logic in the Dkron code, but there are definitely problems with BadgerDB itself.
Maybe you don't see problems because your workload doesn't delete many executions, since the MaxExecutions cap isn't reached. My delete rate is quite high :)

I also saw some similar issues in the BadgerDB repo. Users were reporting degrading performance with deleted keys and growing storage. The Badger maintainers suggested running ValueLogGC more often, but that didn't help at all for me either.
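
For completeness, the suggested mitigation boils down to a periodic GC loop roughly like this (the pattern from the Badger docs, not Dkron code; db is an open *badger.DB):

	ticker := time.NewTicker(5 * time.Minute)
	defer ticker.Stop()
	for range ticker.C {
		// Rewrite value log files until Badger reports there is nothing left to GC.
		for {
			if err := db.RunValueLogGC(0.5); err != nil {
				break // typically badger.ErrNoRewrite
			}
		}
	}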

So, after some trial and error with no luck, I decided to try BoltDB instead of BadgerDB, and it's rock solid now. Please take a look at the attached screenshots.
CPU usage never exceeds 4% with exactly the same workload.
The memory footprint is also around 35 MB, which is pretty awesome compared to 1 GB+ and growing with BadgerDB.
I'm pretty sure it's BadgerDB's fault :)

Performance is slower than Badger, though, but I'm pretty sure that's easy to solve by using an in-memory DB. In my case I just moved the BoltDB database file to tmpfs.
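
For reference, "moving the file to tmpfs" just means pointing the store at a path on a RAM-backed filesystem. A minimal sketch, assuming go.etcd.io/bbolt imported as bolt and a tmpfs mount at /dev/shm (the path is illustrative):

	// Open the BoltDB file on tmpfs so reads and writes stay in RAM.
	db, err := bolt.Open("/dev/shm/dkron/dkron.db", 0600, &bolt.Options{Timeout: time.Second})
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()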

Screenshot 2020-03-05 at 10 52 34
Screenshot 2020-03-05 at 10 52 38
Screenshot 2020-03-05 at 10 59 19

This is how I implemented it: andreygolev@c098107

@vcastellm
Member

I also spent some time investigating what really happens behind the scenes, and I'm pretty confident that execution cleanup after MaxExecutions is reached works completely fine, and that it.Seek() through BadgerDB also iterates over no more than MaxExecutions elements. So there are no problems with the logic in the Dkron code, but there are definitely problems with BadgerDB itself.
Maybe you don't see problems because your workload doesn't delete many executions, since the MaxExecutions cap isn't reached. My delete rate is quite high :)

Yes and probably :)

I also saw some similar issues in the BadgerDB repo. Users were reporting degrading performance with deleted keys and growing storage. The Badger maintainers suggested running ValueLogGC more often, but that didn't help at all for me either.

I'm following them too; this is clearly related to dgraph-io/badger#718.

So, after some trial and error with no luck, I decided to try BoltDB instead of BadgerDB, and it's rock solid now. Please take a look at the attached screenshots.
CPU usage never exceeds 4% with exactly the same workload.
The memory footprint is also around 35 MB, which is pretty awesome compared to 1 GB+ and growing with BadgerDB.
I'm pretty sure it's BadgerDB's fault :)

I have some test branches using different engines. @andreygolev, given your particular use case, may I ask you to try this? d235916#diff-58f7996e41c5c348c30c7362170178d1L465

That would be helpful. I initially went with Badger because I liked the interface and the features. As you said, you need to use tmpfs with BoltDB, which I don't like, but BuntDB has an almost identical interface to Badger, which makes it very easy to switch between them.
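
To illustrate the similarity: BuntDB also exposes read/write transactions and pattern-based key iteration, so a Badger-style prefix scan maps over almost directly. A minimal sketch, assuming github.com/tidwall/buntdb and illustrative keys:

	db, _ := buntdb.Open(":memory:")
	defer db.Close()

	// Write inside an update transaction, much like badger's db.Update.
	_ = db.Update(func(tx *buntdb.Tx) error {
		_, _, err := tx.Set("executions:job1:001", `{"success":true}`, nil)
		return err
	})

	// Read with a key-pattern scan, analogous to a Badger prefix iterator.
	_ = db.View(func(tx *buntdb.Tx) error {
		return tx.AscendKeys("executions:job1:*", func(key, value string) bool {
			fmt.Println(key, value)
			return true // keep iterating
		})
	})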

Thanks

@andreygolev
Contributor Author

Totally agree, tmpfs is not the way :)
Thanks, I'll try BuntDB and come back with results within a few days.

By the way, the issue was easily reproducible by running 50 jobs @every 3s with MaxExecutions = 5.
Within 2 h the CPU usage growth trend was easy to see on a local machine.
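
In case it helps with reproduction, this is roughly what that setup looks like when the jobs are created through the REST API. A sketch assuming a local agent on the default HTTP port 8080 and the shell executor; endpoint and field names follow the Dkron v1 API:

	// Create 50 jobs that each run every 3 seconds.
	for i := 0; i < 50; i++ {
		job := fmt.Sprintf(`{
			"name": "stress-%d",
			"schedule": "@every 3s",
			"executor": "shell",
			"executor_config": {"command": "true"}
		}`, i)
		resp, err := http.Post("http://localhost:8080/v1/jobs", "application/json", strings.NewReader(job))
		if err != nil {
			log.Fatal(err)
		}
		resp.Body.Close()
	}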

@andreygolev
Contributor Author

So, with BuntDB things look good and stable after a day of our workload.
I also added the DeleteExecutions method, which was missing in the buntdb branch.
If you're OK with the results, I can create a PR based on your buntdb branch.
Screenshot 2020-03-10 at 10 29 40

@vcastellm
Member

Thanks for the test @andreygolev, yes, go ahead with the PR.

@vcastellm
Member

Fixed in #702

@xtbwqtq

xtbwqtq commented Jun 27, 2024

(Quotes the full original issue description above.)

I encountered a similar situation. I was frequently deleting and inserting key-value pairs, and CPU usage kept increasing; it was mainly caused by iterators. I happened to notice the Badger warning in the logs, "Block cache might be too small", so I increased that parameter. Now it seems normal; I need to keep observing for a while.
	func initOps() badger.Options {
		ops := badger.DefaultOptions(storagePath)
		// Raise the block cache from the default to 1 GiB.
		return ops.WithBlockCacheSize(1024 << 20)
	}

@vcastellm
Member

Hi @xtbwqtq, and thanks for the deep investigation, but this issue is about Badger, which was removed from Dkron a long time ago due to these kinds of issues 🫤 As you can see in the issue that fixed this, we moved to BuntDB, which has proven much smoother and more efficient.

Is it a possibility for you to upgrade Dkron?
