
High CPU usage while using KafkaStreamsMetrics #2801

Closed
JorgenRingen opened this issue Oct 4, 2021 · 16 comments
Labels: bug (A general bug), module: micrometer-core (An issue that is related to our core module), performance (Issues related to general performance), regression (A regression from a previous release)

@JorgenRingen

JorgenRingen commented Oct 4, 2021

Describe the bug
We run Spring Boot + Kafka Streams with Micrometer and KafkaStreamsMetrics. Metrics are exposed using micrometer-registry-prometheus.

After upgrading from Micrometer 1.7.3 to 1.7.4 we noticed high CPU usage. If we disable KafkaStreamsMetrics, the high CPU usage disappears. Downgrading to 1.7.3 also makes it disappear.

The CPU usage increases while calling /actuator/prometheus. Our monitoring system polls this endpoint frequently, so the CPU is always high.

I have provided a small sample app that reproduces the issue: https://github.com/JorgenRingen/micrometer_1_7_4_high_cpu_usage

Environment
Kubernetes and locally.

  • Micrometer version: 1.7.4
  • Micrometer registry: prometheus
  • OS: macOS, Linux
  • Java version: openjdk version "11.0.7"

To Reproduce
How to reproduce the bug:
https://github.com/JorgenRingen/micrometer_1_7_4_high_cpu_usage
The java branch has a very basic Spring Boot, Kafka Streams, and Micrometer setup.

Expected behavior
While polling localhost:8080/actuator/prometheus, the CPU usage is very high on 1.7.4. Expected behavior is that the overhead should barely be noticeable.

Additional context
Slack thread: https://micrometer-metrics.slack.com/archives/C662HUJC9/p1632729969102400

@JorgenRingen JorgenRingen added the bug A general bug label Oct 4, 2021
@shakuzen shakuzen added the module: micrometer-core An issue that is related to our core module label Oct 5, 2021
@jonatan-ivanov
Member

@JorgenRingen I checked out your project, but unfortunately I was not able to reproduce the issue; the Prometheus overhead I'm seeing is somewhere between 0 and 1%.

Also, the client fails shortly after I start the service:

2021-10-05 15:20:55.297 ERROR 7963 --- [-StreamThread-2] org.apache.kafka.streams.KafkaStreams    : stream-client [sample-app-4340437c-e603-4bed-b5c7-4845a9d78561] Encountered the following exception during processing and the registered exception handler opted to SHUTDOWN_CLIENT. The streams client is going to shut down now. 

org.apache.kafka.streams.errors.StreamsException: Could not create topic sample-app-foo-repartiton-1-repartition.
...
Caused by: org.apache.kafka.common.errors.InvalidReplicationFactorException: Replication factor: 3 larger than available brokers: 1.

Is it possible to make the project a little more minimalistic? It still references config files I don't have (file:/etc/config/sb-config.yaml), and in order to rule out language/SDK-related things it would be great to have a Java reproducer. I'm also not sure all the classes are needed to repro this; the project contains rather a lot of code for such a simple reproducer.

By the way, have you taken a look at Spring Integration? It has nice Kafka support with a much simpler API.

@jonatan-ivanov jonatan-ivanov added waiting for feedback We need additional information before we can continue and removed bug A general bug labels Oct 5, 2021
@JorgenRingen
Author

JorgenRingen commented Oct 6, 2021

@jonatan-ivanov : Yes, I linked to the java-branch on the sample project. It's very minimalistic and uses out-of-the-box spring integration: https://github.com/JorgenRingen/micrometer_1_7_4_high_cpu_usage/tree/java

Edit: I deleted the default branch and made the "spring-boot, spring-kafka, java" (java branch) sample the default main branch. I also updated the links in the original post. Sorry for any confusion. https://github.com/JorgenRingen/micrometer_1_7_4_high_cpu_usage

@jonatan-ivanov
Member

@JorgenRingen Thank you very much, it is much better this way, but unfortunately I'm still not able to repro your issue.

My environment: macOS 11.6 (Big Sur); Intel Core i9-9980HK CPU @ 2.40GHz; 32 GB RAM

I tried it with Java 11 (Temurin 11.0.12 2021-07-20) and Java 17 (Temurin 17 2021-09-14), using ./mvnw spring-boot:run and running it from IntelliJ. The results (measured with VisualVM) were always the same: the CPU peaked around 6.5%, and the peaks correlate with the Prometheus calls:

Micrometer 1.7.3: [VisualVM screenshot: micrometer-1.7.3]

Micrometer 1.7.4: [VisualVM screenshot: micrometer-1.7.4]

Some additional details: curl -s localhost:8080/actuator/prometheus | wc -l gives the following results (number of lines in the response).
You need to wait a little before measuring so the Kafka metrics are populated; your results could be a little different but should be close.

  • 1.7.3 + Java 17
    • 4036 lines (3893 are kafka-related)
  • 1.7.4 + Java 17
    • 4145 lines (4005 are kafka-related)
  • 1.7.3 + Java 11
    • 4976 lines (4836 are kafka-related)
  • 1.7.4 + Java 11
    • 5188 lines (5048 are kafka-related)
    • Content-Length is ~1 MiB

So I did not see any difference between 1.7.3 and 1.7.4, but this is a lot of data (5000+ lines, ~1 MiB) and the endpoint feels slow. With this amount of metrics, I think this is somewhat expected, and ~6.5% CPU peaks feel normal during scraping. Additionally, your experience can be worse than this if you add custom tags.

Could you please check if you see similar numbers on your end? I definitely don't see the 100+% CPU spikes that you experienced (Slack thread screenshot).
Could you please also check whether you need all 5000+ metrics? You can filter them out using a MeterFilter (quick sketch below).
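For reference, a minimal sketch of such a filter (the allow-listed meter name is a placeholder I made up; adjust it to what you actually use):

import java.util.Set;

import io.micrometer.core.instrument.MeterRegistry;
import io.micrometer.core.instrument.config.MeterFilter;
import io.micrometer.core.instrument.simple.SimpleMeterRegistry;

// Minimal MeterFilter sketch: deny every kafka.* meter except a small allow-list.
// The name in "keep" is an illustrative placeholder, not a recommendation.
public class KafkaMeterFilterSketch {

    public static void main(String[] args) {
        Set<String> keep = Set.of("kafka.consumer.fetch.manager.records.consumed.total");

        MeterRegistry registry = new SimpleMeterRegistry();
        // Filters only apply to meters registered after they are added,
        // so configure this before binding KafkaStreamsMetrics.
        registry.config().meterFilter(
                MeterFilter.deny(id -> id.getName().startsWith("kafka") && !keep.contains(id.getName())));
    }
}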
Also, this caught my eye:

Our monitoring system polls this endpoint frequently

Could you please tell me how frequently? If I remember correctly, the default scrape interval for Prometheus is 60s, and going down into the few-seconds range is not recommended. With this response size and latency, it can happen that a scrape hasn't finished when the next request arrives if your interval is small.

@JorgenRingen
Author

JorgenRingen commented Oct 12, 2021

Very interesting! I sent the sample app to a couple of colleagues so they can test it as well. But I've replicated it on 2 different machines and in our k8s production environment, which is where the screenshots in the Slack thread were taken. The k8s pods are not very powerful, so that's why the CPU spikes are so high there.

I just ran it on my local machine: macOS 11.6 (Big Sur); 3.8 GHz Quad-Core Intel Core i5; 40 GB RAM and Java 17 (Temurin 17 2021-09-14), using VisualVM:

Micrometer 1.7.4: [VisualVM screenshot]

Some runs with curl -s localhost:8080/actuator/prometheus | wc -l (stop-start app between each run):

  1. 5188
  2. 4515
  3. 3950
  4. 3945
  5. 5188 (again)

It takes some time (about a minute) before it stabilizes on a number, as you said.
The diff between the total number of metrics and the number of Kafka metrics is always 140, i.e. there are always 140 non-Kafka metrics.

I don't know what causes this or the slow increase in the number of metrics before it stabilizes. It might be some scheduled thing in kafka-streams that generates a lot of metrics.

Micrometer 1.7.3: [VisualVM screenshot]
(last spike is shutdown)

Some runs with curl -s localhost:8080/actuator/prometheus | wc -l (stop-start app between each run):

  1. 5188 (same as 1.7.4)
  2. 5188
  3. 5188

Really interesting: always the same result on 1.7.3. I also had to wait quite a bit of time (1+ min) before it stabilized at 5188.

The CPU trend is consistent for each version regardless of the number of metrics: 15-20% on 1.7.4 and 1-2% on 1.7.3.

Also, the endpoint itself is far more responsive on 1.7.3 (~0.015s) compared to 1.7.4 (~1.6s).

Could you please also check if you need all 5000+ metrics? You can filter them out using a MeterFilter.

Will definitely look into filtering out some of the metrics we don't use. We just use the default setup and kafka-streams provides a lot of metrics by default.

Could you please tell how frequently?

We use a datadog-agent in kubernetes which scrapes /prometheus and forwards the metrics to datadog every 15sec by default: https://docs.datadoghq.com/integrations/prometheus/

using ./mvnw spring-boot:run and running it from IntelliJ

Note: IntelliJ can be pretty slow to refresh dependencies after changing the version number in the pom, so make sure which version is actually used by checking the output in the run console (I made this mistake myself 😆). I also saw you used mvn, but just double-checking :-)

@andeki92

andeki92 commented Oct 12, 2021

Similar experience here. Local machine: macOS Big Sur 11.6; 2.6 GHz 6-Core Intel Core i7; 16 GB 2667 MHz DDR4.

Running the following command (in zsh): repeat 100 { curl -s localhost:8080/actuator/prometheus; sleep 1 | wc -l }, I get the following result for 1.7.4:

[Screenshot 2021-10-12 at 12:21:31]

  1. 3779
  2. 3786 (this repeats forever)

I also ran: curl -s -o /dev/null -w 'Total: %{time_total}s\n' localhost:8080/actuator/prometheus, which yields: 2.444678s

and for 1.7.3 I get:

[Screenshot 2021-10-12 at 13:25:44]

  1. 5118 (this repeats forever)

Running the same timing command, I get: 0.024868s

This might explain the differences. Polling every second seems to overload the endpoint; just reloading localhost:8080/actuator/prometheus every second (not very scientific) yields the following graph on 1.7.4:

[Screenshot 2021-10-12 at 12:34:49]

So I guess reducing the poll frequency should (naturally) result in a drop in CPU usage, but I'm still curious what could've caused the response time to jump from 0.025s to 2.5s (100x slower) 🤔

@ghmulti

ghmulti commented Oct 12, 2021

private ToDoubleFunction<Supplier<Map<MetricName, ? extends Metric>>> toMetricValue(MetricName metricName) {
    // the returned function is evaluated on every gauge read
    return metricsSupplier -> toDouble(metricsSupplier.get().get(metricName));
}

Here, for every metric read we call org.apache.kafka.streams.KafkaStreams#metrics() (public Map<MetricName, ? extends Metric> metrics()), which returns around 3k metrics, and then we use just one of them and ignore the rest.
It seems that this toMetricValue function is invoked for every single Kafka meter through DefaultGauge#value() (value.applyAsDouble(obj)).

It was introduced by #2770.

Here are the CPU profiler results: [profiler screenshot]

@JorgenRingen
Author

Yes, as @ghmulti points out, it seems to be this change: f78fcb0#diff-7056fc4a7f15d24878096caa96ae9852ac8c9b2afcf11609faead6b9123efbb0R197

Instead of retrieving the value of the metric directly, a metric supplier is passed in, and calling that supplier retrieves all Kafka Streams metrics for each individual meter read.

This means that this call, org.apache.kafka.streams.KafkaStreams#metrics, is run once for every one of the 4,000-5,000 Kafka Streams meters.
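To make the cost concrete, here is a small self-contained toy model (hypothetical names, not the Micrometer source): with N meters, every scrape performs N full-map fetches, so roughly N² map entries are built and thrown away per scrape.

import java.util.HashMap;
import java.util.Map;
import java.util.function.Supplier;

// Toy model of the 1.7.4 hot path: each gauge read calls a supplier that rebuilds the
// whole metrics map and then keeps a single entry.
public class QuadraticScrapeSketch {

    public static void main(String[] args) {
        int n = 5000; // roughly the number of Kafka meters seen above

        Supplier<Map<String, Double>> metricsSupplier = () -> {
            Map<String, Double> all = new HashMap<>(); // stands in for KafkaStreams#metrics()
            for (int i = 0; i < n; i++) {
                all.put("metric-" + i, (double) i);
            }
            return all;
        };

        long start = System.nanoTime();
        double sum = 0;
        for (int i = 0; i < n; i++) {
            // one gauge read = one full map rebuild, only one entry is used
            sum += metricsSupplier.get().get("metric-" + i);
        }
        System.out.printf("one simulated scrape built %,d map entries in %d ms (sum=%.0f)%n",
                (long) n * n, (System.nanoTime() - start) / 1_000_000, sum);
    }
}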

@jonatan-ivanov
Member

First of all, thank you so much, all of you, for looking into this and providing data points. As a maintainer, it's great to see.

The diff between total number of metrics and non-kafka metrics is always 140.

I don't know what causes this and the slow increase in number of metrics before it stabilizes. Might be some scheduled thing in kafka-streams that generates a lot of metrics.

Kafka creates Metric objects dynamically; the ~140 others that you see are created by other, non- or less-dynamic parts of the instrumentation. Kafka moves its Metric objects around (creates, deletes, replaces them), which is why you can see different numbers of meters.

The change you are referring to (#2770) is somewhat related to this. It was introduced because:

  • Micrometer sets up Gauges and FunctionCounters against Metric objects created by Kafka
  • When Kafka replaces these objects (I don't know why), Micrometer references the old instance while Kafka starts updating a new one
  • Micrometer often uses a WeakReference to refer to these "state" objects (Metric), which does not prevent the GC from cleaning them up
  • When GC kicks in and the state object (Metric) is removed, Micrometer will report NaN for these Meters
  • See: Kafka Metrics become NaN after a while (possibly after rebalance) #2726

You might be able to see this behavior in 1.7.3 (a bunch of NaN values); the number of metrics also does not change in 1.7.3 because of this: Kafka changes them, but Micrometer does not "follow". In 1.7.4 this was fixed and Micrometer "follows" the Kafka metrics.
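To make the WeakReference point above concrete, here is a tiny self-contained illustration (not Micrometer code) of how a weakly referenced state object ends up reported as NaN once it is collected:

import java.lang.ref.WeakReference;

// A gauge-like read through a WeakReference: while the referent is strongly reachable the
// value can be computed; once it is collected, there is nothing left to read and NaN is
// the only honest answer.
public class WeakGaugeSketch {

    public static void main(String[] args) throws InterruptedException {
        Object state = new Object(); // stands in for Kafka's Metric object
        WeakReference<Object> ref = new WeakReference<>(state);

        System.out.println(read(ref)); // 42.0 while the state is still strongly referenced
        state = null;                  // Kafka "replaces" its Metric, nothing holds the old one
        System.gc();                   // only a hint, but usually enough for this demo
        Thread.sleep(100);
        System.out.println(read(ref)); // typically NaN after collection
    }

    static double read(WeakReference<Object> ref) {
        return (ref.get() != null) ? 42.0 : Double.NaN;
    }
}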

Right now, based on this, I think there are two things that could affect the performance:

  1. The extra look-up in the Map that you also pointed out (I think this is not likely)
  2. In 1.7.3, if the Kafka Metric gets GC'ed, Micrometer will just return NaN immediately (this is very fast since it is just a null check).
    In 1.7.4, Micrometer will "follow" the update and will actually call metric.metricValue(), which can be slower than a null-check + return-NaN combo.

This might explain why I'm not seeing any difference: my theory is that 1.7.3 might have always been slow for me because Kafka did not replace its Metric objects and/or GC did not remove them, so my Meters were never shortcut to NaN (which would have been fast but wrong).

I need to verify these, but does this make sense to you? Can you see a bunch of NaN values in 1.7.3 and significantly fewer NaNs in 1.7.4?

If my theory about this is right, since the "slow" behavior is due to a bug (NaN is a wrong value), I think the solution could be:

  1. You can use a MeterFilter and filter out what you don't need; on the Micrometer side we need to document this and warn users
  2. We can add some kind of pre-filtering mechanism to KafkaMetrics (i.e., a simplified mode where you don't get all the metrics that Kafka creates), but for this we need some feedback from users and from the Kafka maintainers

@jonatan-ivanov
Member

jonatan-ivanov commented Oct 12, 2021

Updates:

  • I was able to repro (I deleted part of my Maven cache and now I can repro it every single time; it seems it was not the GC)
  • It seems the bottleneck is the Map lookup, not the metricValue() call

If I do this in KafkaMetrics (hardcoding a value) instead of calling metricValue(), there is no significant difference:

    private double toDouble(@Nullable Metric metric) {
        return (metric != null) ? 42.0 : Double.NaN;
    }

If I do this (no Map lookup and no supplier call, as you pointed out), the CPU utilization is negligible:

    private ToDoubleFunction<Supplier<Map<MetricName, ? extends Metric>>> toMetricValue(MetricName metricName) {
        return metricsSupplier -> 42.0;
    }

Right now, I'm not sure how to fix this but I'm going to look into this deeper.

@jonatan-ivanov jonatan-ivanov added performance Issues related to general performance and removed waiting for feedback We need additional information before we can continue labels Oct 12, 2021
@jonatan-ivanov jonatan-ivanov added this to the 1.6.x milestone Oct 12, 2021
@shakuzen shakuzen added regression A regression from a previous release bug A general bug labels Oct 13, 2021
@JorgenRingen
Author

Thanks for the detailed response @jonatan-ivanov! Glad you were able to reproduce it; Maven caches can be a hassle! 😆

We'll work around this by adding a MeterFilter for Kafka-related metrics, which should be a good thing regardless of the CPU issues. I've measured with ~200-300 metrics (curl localhost:8080/actuator/prometheus | grep kafka | wc -l) and CPU usage / response time drop significantly. Draft for anyone interested: https://github.com/JorgenRingen/micrometer_1_7_4_high_cpu_usage/blob/filter/src/main/java/com/example/micrometer1_7_4kafkastreamscpuusage/KafkaStreamsConfig.java#L74
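For anyone who wants the general shape without opening the draft, a hedged sketch of the wiring (the denied/kept prefixes are placeholders; the linked KafkaStreamsConfig is the authoritative version):

import io.micrometer.core.instrument.MeterRegistry;
import io.micrometer.core.instrument.config.MeterFilter;
import org.springframework.boot.actuate.autoconfigure.metrics.MeterRegistryCustomizer;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;

// Registers a MeterFilter through Spring Boot's MeterRegistryCustomizer so it is applied
// before the Kafka binders register their meters. The prefixes below are placeholders.
@Configuration
public class KafkaMeterFilterConfig {

    @Bean
    public MeterRegistryCustomizer<MeterRegistry> kafkaMeterFilter() {
        return registry -> registry.config().meterFilter(
                MeterFilter.deny(id -> id.getName().startsWith("kafka.stream")
                        && !id.getName().startsWith("kafka.stream.thread")));
    }
}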

We will also try to look deeper into the root cause and maybe come up with a suggestion if time permits.

@jonatan-ivanov
Member

jonatan-ivanov commented Oct 13, 2021

I investigated this further, and if I replace the Supplier with an AtomicReference and update the AtomicReference with the scheduler in KafkaMetrics (60s by default):

private ToDoubleFunction<AtomicReference<Map<MetricName, ? extends Metric>>> toMetricValue(MetricName metricName) {
    return metrics -> toDouble(metrics.get().get(metricName));
}

It seems it is fast again. The problem with this is the same as before: if the reference was just updated and Kafka then replaces its Metric, you won't get updates (or you will get NaN if GC kicks in) for up to a minute.

(It seems that the supplier get call is a bottleneck in itself.)
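For illustration, a generic sketch of that scheduled-refresh idea (hypothetical wiring, not the KafkaMetrics source): the expensive supplier is polled on a fixed interval and every gauge reads the cached map, at the cost of up to one interval of staleness, as noted above.

import java.util.Map;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicReference;
import java.util.function.Supplier;

// Caches the result of an expensive supplier and refreshes it on a schedule, so callers
// (e.g. gauge reads during a scrape) never trigger the expensive call themselves.
class ScheduledMetricsCache<K, V> {

    private final AtomicReference<Map<K, V>> cache = new AtomicReference<>(Map.of());
    private final ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();

    ScheduledMetricsCache(Supplier<Map<K, V>> expensiveSupplier, long periodSeconds) {
        scheduler.scheduleAtFixedRate(() -> cache.set(expensiveSupplier.get()),
                0, periodSeconds, TimeUnit.SECONDS);
    }

    Map<K, V> current() { // cheap read used by each gauge
        return cache.get();
    }

    void shutdown() {
        scheduler.shutdownNow();
    }
}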

@jonatan-ivanov
Member

jonatan-ivanov commented Oct 14, 2021

@JorgenRingen Here's a potential fix: jonatan-ivanov@63e489c

I need to go through this and think about where it could break, and also figure out how we can test it.

@jonatan-ivanov jonatan-ivanov modified the milestones: 1.6.x, 1.6.12 Oct 14, 2021
jonatan-ivanov added a commit to jonatan-ivanov/micrometer that referenced this issue Oct 14, 2021
@jonatan-ivanov
Member

jonatan-ivanov commented Oct 14, 2021

@JorgenRingen I merged the fix in; it should go out in the next patch releases (today).

@jonatan-ivanov jonatan-ivanov changed the title High CPU usage while using KafkaStreamsMetrics in micrometer 1.7.4 High CPU usage while using KafkaStreamsMetrics Oct 14, 2021
@jonatan-ivanov
Member

Released: 1.7.5 should fix this, please give it a try.

@JorgenRingen
Author

Great! 👍

LGTM, 1.7.4 vs 1.7.5 without the MeterFilter: [VisualVM screenshot]

@jonatan-ivanov
Member

Great to hear!
Thanks everyone for reporting and investigating this!
