High CPU usage while using KafkaStreamsMetrics #2801
@JorgenRingen I checked out your project but unfortunately I was not able to reproduce the issue; the Prometheus overhead I'm seeing is somewhere between 0 and 1%. Also, the client fails shortly after I start the service:
Is it possible to make the project a little more minimalistic? It still references config files I don't have. Btw, have you taken a look at Spring Integration? It has nice Kafka support with a much simpler API.
@jonatan-ivanov Edit: I deleted the default branch and made the "spring-boot, spring-kafka, java" (java-branch) sample the default main branch. Also updated the links in the original post. Sorry for any confusion. https://github.com/JorgenRingen/micrometer_1_7_4_high_cpu_usage
@JorgenRingen Thank you very much, it is much better this way, but unfortunately I'm still not able to repro your issue. My environment: macOS 11.6 (Big Sur); Intel Core i9-9980HK CPU @ 2.40GHz; 32 GB RAM. I tried it with Java 11 (
Some additional details:
So I did not see any difference between 1.7.3 and 1.7.4, but this is a lot of data (5000+ lines, 1MiB) and the endpoint feels slow. With this amount of metrics, I think this is somewhat expected, and ~6.5% CPU peaks feel normal during scraping. Additionally, your experience can be worse than this if you add custom tags. Could you please check if you see similar numbers on your end? I definitely don't see the 100+% CPU spikes that you experienced (slack thread screenshot).
Could you please tell me how frequently you scrape? If I remember correctly, the default scraping interval for Prometheus is 60s, and going down to the few-seconds zone is not recommended. With this response size and latency it can happen that a scrape hasn't finished before the next request arrives if your interval is small.
Very interesting! I sent the sample app to a couple of colleagues so they can test it as well. But I've replicated it on 2 different machines and in our k8s production environment, which is where the screenshots in the Slack thread were taken. The k8s pods are not very powerful, so that's why the CPU spikes are so high there. I just ran it on my local machine: macOS 11.6 (Big Sur); 3.8 GHz Quad-Core Intel Core i5; 40 GB RAM; Java 17. Some runs with
It takes some time (about a minute) before it stabilizes on a number, as you said. I don't know what causes this slow increase in the number of metrics before it stabilizes; it might be some scheduled thing in kafka-streams that generates a lot of metrics. Micrometer 1.7.3:
Some runs with
Really interesting. Always the same result on 1.7.3. I also had to wait quite a bit of time (1+ min) before it stabilized on 5188. The CPU trend is consistent regardless of the number of metrics: 15-20% on 1.7.4 versus 1-2% on 1.7.3. Also, the endpoint itself is far more responsive on 1.7.3 (~0.015s) than on 1.7.4 (~1.6s).
Will definitely look into filtering out some of the metrics we don't use. We just use the default setup and kafka-streams provides a lot of metrics by default.
We use a datadog-agent in kubernetes which scrapes /prometheus and forwards the metrics to datadog every 15sec by default: https://docs.datadoghq.com/integrations/prometheus/
Note: IntelliJ can be pretty slow to refresh dependencies when you change the version number in the pom, so make sure which version is actually in use by checking the output in the run console (I made this mistake myself 😆). I also saw you used mvn, but just double-checking :-)
Similar experience here. Local machine: macOS Big Sur 11.6; 2.6 GHz 6-Core Intel Core i7; 16 GB 2667 MHz DDR4. Running the following command (in zsh):
I also ran: and for 1.7.3 I get:
Running the same time-taker I get 0.024868s, which might explain the differences. Polling every second seems to overwhelm the endpoint; just reloading localhost:8080/actuator/prometheus every second (not very scientific) yields the following graph on 1.7.4: So I guess reducing the poll frequency should (naturally) result in a drop in CPU usage, but I'm still curious what could've caused the response time to jump from 0.025s to 2.5s (100x slower) 🤔
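In case it helps others reproduce the timing numbers, here is a rough Java 11+ sketch that times the endpoint in a loop (the URL and iteration count are assumptions taken from this thread; `time curl` works just as well):

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class ScrapeTimer {

    public static void main(String[] args) throws Exception {
        // Assumes the sample app is running locally and exposing the Prometheus endpoint.
        HttpClient client = HttpClient.newHttpClient();
        HttpRequest request = HttpRequest.newBuilder(URI.create("http://localhost:8080/actuator/prometheus"))
                .GET()
                .build();

        for (int i = 0; i < 10; i++) {
            long start = System.nanoTime();
            HttpResponse<String> response = client.send(request, HttpResponse.BodyHandlers.ofString());
            long elapsedMs = (System.nanoTime() - start) / 1_000_000;
            System.out.printf("scrape %d: status=%d, %d bytes, %d ms%n",
                    i, response.statusCode(), response.body().length(), elapsedMs);
        }
    }
}
```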
Here, for every metric, we call org.apache.kafka.KafkaStreams->metrics; this was introduced by #2770.
Yes, as @ghmulti points out, it seems to be this change: f78fcb0#diff-7056fc4a7f15d24878096caa96ae9852ac8c9b2afcf11609faead6b9123efbb0R197 Instead of retrieving the value of the metric directly, the metric supplier is passed in, and calling that supplier retrieves all kafka-streams metrics for each individual metric. This means that org.apache.kafka.streams.KafkaStreams#metrics is called once per meter, for all 4-5000 kafka-streams metrics.
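To make the cost concrete, here is a simplified, hypothetical sketch of the two binding styles (class and method names are made up for illustration; this is not the actual Micrometer source):

```java
import java.util.Map;
import java.util.function.Supplier;

import io.micrometer.core.instrument.Gauge;
import io.micrometer.core.instrument.MeterRegistry;
import org.apache.kafka.common.Metric;
import org.apache.kafka.common.MetricName;
import org.apache.kafka.streams.KafkaStreams;

class SupplierCostSketch {

    // 1.7.3-style: the gauge reads from a metrics map captured when the meter was bound.
    static void bindFromSnapshot(MeterRegistry registry, KafkaStreams kafkaStreams, MetricName name) {
        Map<MetricName, ? extends Metric> captured = kafkaStreams.metrics();
        Gauge.builder(name.name(), () -> toDouble(captured.get(name))).register(registry);
    }

    // 1.7.4-style: every read goes through the supplier, and KafkaStreams#metrics()
    // assembles the map of *all* client metrics, so one scrape of ~5000 meters
    // triggers ~5000 full metrics() calls.
    static void bindFromSupplier(MeterRegistry registry, KafkaStreams kafkaStreams, MetricName name) {
        Supplier<Map<MetricName, ? extends Metric>> supplier = kafkaStreams::metrics;
        Gauge.builder(name.name(), () -> toDouble(supplier.get().get(name))).register(registry);
    }

    private static double toDouble(Metric metric) {
        Object value = metric == null ? null : metric.metricValue();
        return value instanceof Number ? ((Number) value).doubleValue() : Double.NaN;
    }
}
```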
First of all, thank you so much, all of you, for looking into this and providing data points. As a maintainer, it's so good to see this.
Kafka creates
The change you are referring to (#2770) is somewhat related to this. It was introduced because:
You might be able to see this behavior in 1.7.3 (seeing a bunch of
Right now, based on this, I think there are two things that could affect the performance:
This might explain why I'm not seeing any difference: my theory is that 1.7.3 might have always been slow for me if Kafka did not replace its
I need to verify these, but does this make sense to you? Can you see a bunch of
If my theory about this is right, since the "slow" behavior is due to a bug (
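If I read the comment above right, the staleness problem that motivated #2770 is roughly this (a simplified, hypothetical sketch, not Micrometer's actual code):

```java
import java.util.Map;

import org.apache.kafka.common.Metric;
import org.apache.kafka.common.MetricName;
import org.apache.kafka.streams.KafkaStreams;

class StaleMetricSketch {

    // Capturing the Metric object once makes later reads cheap, but if Kafka replaces
    // the Metric instance behind the same MetricName, this reference keeps pointing at
    // the old object and the reported value goes stale.
    static Metric captureOnce(KafkaStreams kafkaStreams, MetricName name) {
        Map<MetricName, ? extends Metric> snapshot = kafkaStreams.metrics();
        return snapshot.get(name);
    }
}
```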
Updates:
If I do this in
If I do this (no Map lookup and no supplier call, as you pointed out), the CPU utilization is negligible:
Right now I'm not sure how to fix this, but I'm going to look into it more deeply.
Thanks for the detailed response @jonatan-ivanov! Glad you were able to reproduce it, Maven caches can be a hassle! 😆 We'll work around this by adding a MeterFilter for kafka-related metrics, which should be a good thing regardless of the CPU issues. I've measured with ~200-300 metrics (
We will also try to look deeper into the root cause and maybe come up with a suggestion if time permits.
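For what it's worth, a sketch of such a workaround as a Spring Boot MeterFilter bean (the kept metric names below are placeholders, not a recommendation; Spring Boot applies MeterFilter beans to its auto-configured registries):

```java
import io.micrometer.core.instrument.Meter;
import io.micrometer.core.instrument.config.MeterFilter;
import io.micrometer.core.instrument.config.MeterFilterReply;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;

@Configuration
public class KafkaMeterFilterConfig {

    @Bean
    public MeterFilter kafkaMeterFilter() {
        return new MeterFilter() {
            @Override
            public MeterFilterReply accept(Meter.Id id) {
                String name = id.getName();
                if (!name.startsWith("kafka.")) {
                    return MeterFilterReply.NEUTRAL; // leave non-Kafka meters alone
                }
                // Keep only the Kafka meters we actually use; drop the rest.
                return name.startsWith("kafka.stream.thread") || name.startsWith("kafka.consumer.fetch.manager")
                        ? MeterFilterReply.ACCEPT
                        : MeterFilterReply.DENY;
            }
        };
    }
}
```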
I investigated this further, and if I replace the ToDoubleFunction with this:

```java
private ToDoubleFunction<AtomicReference<Map<MetricName, ? extends Metric>>> toMetricValue(MetricName metricName) {
    // reads from a cached metrics map instead of calling the supplier on every poll
    return metrics -> toDouble(metrics.get().get(metricName));
}
```

it seems it is fast again. The problem with this is the same if you just update the reference, but Kafka will replace its
(It seems that the supplier get call is a bottleneck in itself.)
@JorgenRingen Here's a potential fix: jonatan-ivanov@63e489c I need to go through this and think about where it can break, and also figure out how we can test it.
@JorgenRingen I merged the fix in; it should go out in the next patch releases (today).
Released: 1.7.5 should fix this, please give it a try.
Great to hear!
Describe the bug
We run spring-boot + kafka-streams with micrometer and KafkaStreamsMetrics. Metrics are exposed using micrometer-registry-prometheus.
After upgrading from micrometer 1.7.3 to 1.7.4 we noticed high CPU usage. If we disable KafkaStreamsMetrics, the high CPU usage disappears. Downgrading to 1.7.3 also makes it disappear.
The CPU usage increases while calling actuator/prometheus. Our monitoring system polls this endpoint frequently, so the CPU is therefore constantly high. I have provided a small sample app that reproduces the issue: https://github.com/JorgenRingen/micrometer_1_7_4_high_cpu_usage
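For context, binding KafkaStreamsMetrics manually looks roughly like this (a sketch; in the sample app the binder may instead be wired up by Spring Boot / spring-kafka auto-configuration):

```java
import io.micrometer.core.instrument.MeterRegistry;
import io.micrometer.core.instrument.binder.kafka.KafkaStreamsMetrics;
import org.apache.kafka.streams.KafkaStreams;

class KafkaStreamsMetricsBinding {

    // Registers one meter per Kafka Streams metric (several thousand with the default config).
    static KafkaStreamsMetrics bind(MeterRegistry registry, KafkaStreams kafkaStreams) {
        KafkaStreamsMetrics kafkaStreamsMetrics = new KafkaStreamsMetrics(kafkaStreams);
        kafkaStreamsMetrics.bindTo(registry);
        return kafkaStreamsMetrics;
    }
}
```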
Environment
Kubernetes and locally.
To Reproduce
How to reproduce the bug:
https://github.com/JorgenRingen/micrometer_1_7_4_high_cpu_usage
The java branch has a very basic spring-boot, kafka-streams, micrometer setup.
Expected behavior
While polling localhost:8080/acutator/prometheus the CPU usage is very high on 1.7.4. Expected behavior is that this should barely be noticeable.
Additional context
Slack thread: https://micrometer-metrics.slack.com/archives/C662HUJC9/p1632729969102400