Interaction between Autoscaler and Istio Proxy triggering memory leak #8761
Comments
Does istio/istio#25145 sound related?
So basically you're running the Istio sidecar, and the sidecar is not releasing memory?
istio/istio#25145 is interesting, as it might share a similar root cause, but the notable difference is that 25145 is about the ingressgateway, while here the issue is the sidecar of the autoscaler. Unfortunately I don't know enough about the behavior of the autoscaler to know what a common root cause might look like.
Well, it's probably shared Istio code. In any case, Knative itself does nothing with the Istio sidecar, so I think the problem is in Istio, unless they can show that it's Knative that causes their memory to balloon. /cc @tcnghia @ZhiminXiang
Here is the current hypothesis: Knative, by way of trying to scrape metrics from individual pods, is inducing a large number of metrics (think Prometheus) to be generated in Envoy which are never garbage collected.

Reasoning: While the pod scrape logic attempts to remember if scraping pods is possible, the autoscaler keeps logging failed direct pod scrapes, and I graphed the count of those log lines against the istio-proxy memory usage. The query matches log lines like the following:

{
"caller": "metrics/stats_scraper.go:267",
"commit": "d74ecbe",
"level": "info",
"logger": "autoscaler",
"msg": "Pod 10.11.15.78 failed scraping: GET request for URL \"http://10.11.15.78:9090/metrics\" returned HTTP status 502",
"ts": "2020-07-24T00:01:53.379Z"
}

The two graphs are of similar shape. Taking the IP from the sample log line and poking at the /stats endpoint of the Envoy proxy shows lines referencing that IP.
Sampling a few of the recent log lines from above reveals they are all showing up in the /stats output. Sampling a few log lines from hours ago shows the IPs from those failed pod scraping events are still in the /stats endpoint, even though it has been ~3 hours since the attempt to scrape the pod. What might be contributing to this is that the Istio cluster is set up with the REGISTRY_ONLY outbound traffic policy. An easy fix could be to persist podsAddressable.
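For anyone else poking at this, here is a minimal diagnostic sketch (not part of Knative) of the kind of check described above: it pulls the sidecar's Envoy admin /stats output and prints every line that still mentions a given pod IP. The admin address (localhost:15000, reachable via a kubectl port-forward to the autoscaler pod) and the flag names are assumptions for illustration.

```go
// stalestat.go: rough check for whether an old pod IP still shows up in the
// Envoy sidecar's /stats output. Assumes the istio-proxy admin endpoint has
// been port-forwarded locally (15000 is Envoy's default admin port in Istio;
// adjust if your mesh differs).
package main

import (
	"bufio"
	"flag"
	"fmt"
	"log"
	"net/http"
	"strings"
)

func main() {
	admin := flag.String("admin", "http://localhost:15000", "Envoy admin address")
	ip := flag.String("ip", "", "pod IP to look for, e.g. 10.11.15.78")
	flag.Parse()

	resp, err := http.Get(*admin + "/stats")
	if err != nil {
		log.Fatalf("fetching /stats: %v", err)
	}
	defer resp.Body.Close()

	// /stats is plain text, one "name: value" pair per line. Count how many
	// entries still reference the (long gone) pod IP.
	matches := 0
	sc := bufio.NewScanner(resp.Body)
	sc.Buffer(make([]byte, 1024*1024), 1024*1024)
	for sc.Scan() {
		line := sc.Text()
		if strings.Contains(line, *ip) {
			matches++
			fmt.Println(line)
		}
	}
	if err := sc.Err(); err != nil {
		log.Fatalf("reading /stats: %v", err)
	}
	fmt.Printf("%d stat lines still reference %s\n", matches, *ip)
}
```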
podsAddressable is cached for a given revision.
If you mean Knative Revisions, there haven't been explicit changes to those. We have only made use of Knative Services, and they haven't been changed/deployed since July 1. What could be triggering changes to revisions?
So when you send varying traffic, the autoscaler scales the pods up and down.
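To make the caching being discussed concrete, here is a rough illustrative sketch of a per-revision podsAddressable cache: a direct pod scrape is attempted, and on failure the result is remembered only for that revision and only in this process's memory. This is not the actual stats_scraper.go code; the type and field names are hypothetical.

```go
// Illustrative only: a minimal sketch of the per-revision caching discussed
// above. The real logic lives in Knative's stats_scraper.go and differs in
// the details.
package scrape

import (
	"fmt"
	"net/http"
	"sync"
)

// scraper remembers, per revision, whether pods could be scraped directly.
type scraper struct {
	mu              sync.Mutex
	podsAddressable map[string]bool // keyed by revision name (hypothetical)
}

func newScraper() *scraper {
	return &scraper{podsAddressable: map[string]bool{}}
}

// scrape tries a direct pod scrape unless a previous attempt for this
// revision already failed, in which case it goes straight to the service.
func (s *scraper) scrape(revision, podIP, svcURL string) error {
	s.mu.Lock()
	direct, seen := s.podsAddressable[revision]
	s.mu.Unlock()

	if !seen || direct {
		resp, err := http.Get(fmt.Sprintf("http://%s:9090/metrics", podIP))
		if err == nil {
			ok := resp.StatusCode == http.StatusOK
			resp.Body.Close()
			if ok {
				s.setAddressable(revision, true)
				return nil
			}
		}
		// Direct scraping failed (compare the 502s in the autoscaler logs
		// above); remember that, but only in memory and only per revision.
		s.setAddressable(revision, false)
	}

	// Fall back to scraping through the revision's service.
	resp, err := http.Get(svcURL)
	if err != nil {
		return err
	}
	resp.Body.Close()
	return nil
}

func (s *scraper) setAddressable(revision string, ok bool) {
	s.mu.Lock()
	s.podsAddressable[revision] = ok
	s.mu.Unlock()
}
```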
…oscaler. The Knative autoscaler attempts to reach Pods directly on occasion, and when it does, the request is rejected by the REGISTRY_ONLY setting. This is a simplified use case for knative/serving#8761 to present to the Istio folks.
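A stripped-down stand-in for that simplified use case might look like the sketch below: it issues the same kind of direct GET to a pod IP's metrics port that the autoscaler does and prints the status code, so the REGISTRY_ONLY rejection (the 502s seen in the logs above) can be demonstrated to the Istio folks without involving Knative. Port 9090 and the CLI shape are assumptions; it is meant to be run from any sidecar-injected pod in the mesh.

```go
// repro.go: a stand-in for the autoscaler's direct pod scrape, to be run from
// a sidecar-injected pod in a REGISTRY_ONLY mesh.
package main

import (
	"fmt"
	"net/http"
	"os"
	"time"
)

func main() {
	if len(os.Args) < 2 {
		fmt.Fprintln(os.Stderr, "usage: repro <pod-ip> [<pod-ip> ...]")
		os.Exit(1)
	}
	client := &http.Client{Timeout: 3 * time.Second}
	for _, ip := range os.Args[1:] {
		url := fmt.Sprintf("http://%s:9090/metrics", ip)
		resp, err := client.Get(url)
		if err != nil {
			fmt.Printf("%s: error: %v\n", url, err)
			continue
		}
		resp.Body.Close()
		// With a REGISTRY_ONLY outbound traffic policy, direct pod access is
		// expected to be rejected by the sidecar (the issue above observed a
		// 502 for these requests).
		fmt.Printf("%s: HTTP %d\n", url, resp.StatusCode)
	}
}
```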
This issue is stale because it has been open for 90 days with no activity.
I'm just starting to dig into this, but I'm filing the bug to raise visibility and hopefully be able to get some help debugging.
What version of Knative?
Both 0.15.2 and 0.16.0, but not 0.14.0.
Expected Behavior
The memory usage of the autoscaler's Istio proxy stays relatively flat, or at least proportional to some reasonable metric (number of pods, number of ksvcs, ...).
Actual Behavior
The memory usage of the Istio proxy container seems to grow steadily over time.
Steps to Reproduce the Problem
On July 8th, I updated our cluster from 0.14.0 to 0.15.2. About a week later, on July 14th, traffic going to Knative services ended up grinding to a halt. In a quick effort to get things running again, I restarted a number of pods, which quickly got things working again. I didn't spend much time looking through logs and metrics, as I knew at the time there was a 0.16.0 release available. On July 15th, I updated the cluster to 0.16.0. Just a few hours ago (July 22nd), traffic seized up again.
Looking at the Knative pods I saw:
I noticed that the autoscaler only had one of its two containers running. To quickly fix the problem, the autoscaler pod was deleted, and once the new autoscaler pod started, traffic began to flow again. I only noticed this now, but the age of the `activator-6768988647-npxzd` pod roughly aligns with the start of the outage. This cluster is serving low-priority traffic (batch pipelines and whatnot), so the top-line monitoring is quite delayed. The `autoscaler-57cb4c8475-vvtn7` pod was suffering the following events:

Graphs of memory and CPU usage over the last month, which show the change in behavior and the ramping memory usage in the Istio proxy container:

A few notes about this cluster which might be contributing:

- `activator-6768988647-npxzd` has a number of messages like `Websocket connection could not be established`, as well as a bunch of Envoy request logs with `response_code: UH`, which I believe means "No healthy upstream hosts in upstream cluster in addition to 503 response code".

I'm going to let the current autoscaler run a little and then try to see what metrics/stats/(core?) I can grab off it and try to figure out what might be contributing to the problem.
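One low-effort way to collect some of that data could be a small watcher like the sketch below, which periodically counts the lines exposed by the sidecar's Envoy admin /stats endpoint so that growth in the stat count can be lined up against the container's memory graph. The admin address (localhost:15000 via port-forward) and the 10-minute interval are assumptions.

```go
// statwatch.go: logs how many stat lines the Envoy sidecar currently exposes,
// so stat growth can be compared with the istio-proxy memory usage over time.
package main

import (
	"bufio"
	"log"
	"net/http"
	"time"
)

func countStats(admin string) (int, error) {
	resp, err := http.Get(admin + "/stats")
	if err != nil {
		return 0, err
	}
	defer resp.Body.Close()

	n := 0
	sc := bufio.NewScanner(resp.Body)
	sc.Buffer(make([]byte, 1024*1024), 1024*1024)
	for sc.Scan() {
		n++
	}
	return n, sc.Err()
}

func main() {
	const admin = "http://localhost:15000"
	for {
		n, err := countStats(admin)
		if err != nil {
			log.Printf("poll failed: %v", err)
		} else {
			log.Printf("envoy /stats currently exposes %d lines", n)
		}
		time.Sleep(10 * time.Minute)
	}
}
```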
I am fully aware that the problem is manifesting in the Istio container, and there is a high likelihood that the bug is over there. Any help I can get collecting application-specific data, or diagnostics to pin down which system has the problem, would be greatly appreciated. I started on this side as the problem didn't manifest with 0.14.0 + Istio 1.6.4, and only started once upgrading Knative to 0.15.2 (and later 0.16.0). My guess is that there is something that changed on the Knative side which is having a bad interaction with Istio; I just need help finding it.
Feel free to ping me on the Knative slack (nairb774) if some live debugging would help.