Interaction between Autoscaler and Istio Proxy triggering memory leak #8761

Closed
nairb774 opened this issue Jul 22, 2020 · 9 comments
Labels
kind/bug Categorizes issue or PR as related to a bug. lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale.

Comments

@nairb774
Contributor

I'm just starting to dig into this, but I'm filing the bug to raise visibility and hopefully get some help debugging.

What version of Knative?

Both:

0.15.2
0.16.0

but not 0.14.0

Expected Behavior

The memory usage of the autoscaler's Istio proxy stays relatively flat, or at least proportional to some reasonable metric (number of pods, number of ksvcs, ...).

Actual Behavior

The memory usage of the istio-proxy container seems to grow over time.

Steps to Reproduce the Problem

On July 8th, I updated our cluster from 0.14.0 to 0.15.2. About a week later, on July 14th, traffic going to Knative services ground to a halt. In a quick effort to get things running again, I restarted a number of pods, which quickly got things working again. I didn't spend much time looking through logs and metrics, as I knew at the time there was a 0.16.0 release available. On July 15th, I updated the cluster to 0.16.0. Just a few hours ago (July 22nd), traffic seized up again.

Looking at the Knative pods I saw:

NAMESPACE                 NAME                                                             READY   STATUS             RESTARTS   AGE
knative-serving           activator-6768988647-npxzd                                       1/2     CrashLoopBackOff   59         4h24m
knative-serving           autoscaler-57cb4c8475-vvtn7                                      1/2     Running            0          5d23h
knative-serving           controller-5448b975d-k7z7q                                       2/2     Running            0          5d12h
knative-serving           istio-webhook-594cc55456-25sfj                                   2/2     Running            0          7d2h
knative-serving           networking-istio-7b9854496d-h6r56                                1/1     Running            0          7d2h
knative-serving           webhook-756b79fb8d-b2q2j                                         2/2     Running            0          7d2h

I noticed that the autoscaler only had one of the two containers running. To quickly fix the problem, the autoscaler pod was deleted, and once the new autoscaler pod started, traffic began to flow again. I only noticed this now, but the age of activator-6768988647-npxzd roughly aligns with the start of the outage. This cluster serves low-priority traffic (batch pipelines and whatnot), so the top-line monitoring is quite delayed.

The autoscaler-57cb4c8475-vvtn7 pod was suffering the following events:

LAST SEEN   TYPE      REASON      OBJECT                            MESSAGE
20m         Warning   Unhealthy   pod/autoscaler-57cb4c8475-vvtn7   Readiness probe failed: Get http://10.11.9.139:15021/healthz/ready: net/http: request canceled (Client.Timeout exceeded while awaiting headers)

Graphs of memory and CPU usage over the last month, which show the change in behavior and the ramping memory usage in the Istio proxy container:
[image: memory and CPU usage graphs for the Knative pods over the last month]

A few notes about this cluster which might be contributing:

  • Traffic coming into the cluster is unreasonably spiky, which results in lots of scale up/scale down behavior: 1->100->1 in a small number of minutes, over and over.
  • Overall traffic isn't really high, on the order of 1k qps, but several ksvcs have a concurrency of <10, which, combined with the traffic changes, results in lots of pod churn.
  • The cluster has been running EKS 1.16 and Istio 1.6.4 for most of this period.
  • Early logs of activator-6768988647-npxzd have a number of messages like Websocket connection could not be established, as well as a bunch of Envoy request logs with response_code: UH, which I believe means "No healthy upstream hosts in upstream cluster" in addition to the 503 response code.

I'm going to let the current autoscaler run a little and then see what metrics/stats/(core dump?) I can grab off it to figure out what might be contributing to the problem.

I am fully aware that the problem is manifesting in the Istio container, and there is a high likelihood that the bug is over there. Any help I can get collecting application-specific data, or diagnostics to pin down which system has the problem, would be greatly appreciated. I started on this side as the problem didn't manifest with 0.14.0+Istio 1.6.4, and only started once upgrading Knative to 0.15.2 (and later 0.16.0). My guess is that something changed on the Knative side that is having a bad interaction with Istio - I just need help finding it.

Feel free to ping me on the Knative slack (nairb774) if some live debugging would help.

@nairb774 nairb774 added the kind/bug Categorizes issue or PR as related to a bug. label Jul 22, 2020
@vagababov
Contributor

Does istio/istio#25145 sound related?

@vagababov
Contributor

So basically you're running the Istio sidecar and the sidecar is not releasing memory?
IIUC, it's not clear to me what Knative can do here, though.

@nairb774
Contributor Author

I started on this side as the problem didn't manifest with 0.14.0+Istio 1.6.4, and only started once upgrading Knative to 0.15.2 (and later 0.16.0).

istio/istio#25145 is interesting, as it might share a similar root cause, but the notable difference is that 25145 is about the ingressgateway, while here the issue is with the sidecar of the autoscaler. Unfortunately I don't know enough about the behavior of the autoscaler to know what a common root cause might look like.

@vagababov
Contributor

Well, it's probably shared Istio code. In any case, Knative itself does nothing with the Istio sidecar, so I think the problem is in Istio, unless they can show that it's Knative that causes their memory to balloon.

/cc @tcnghia @ZhiminXiang
for more Istio insight.

@nairb774
Contributor Author

Here is the current hypothesis:

Knative, by way of trying to scrape metrics from individual pods, is inducing a large number of metrics (think Prometheus time series) to be generated in Envoy that are never garbage collected.

Reasoning:

While the pod scrape logic attempts to remember whether scraping pods directly is possible, the CreateOrUpdate method of the MetricCollector causes the stats scraper to be replaced. This seems to happen at approximately the same rate at which memory is consumed by the Envoy proxy:

[image: graph of the rate of matching log lines]

That graph is matching log lines like the following:

{
  "caller": "metrics/stats_scraper.go:267",
  "commit": "d74ecbe",
  "level": "info",
  "logger": "autoscaler",
  "msg": "Pod 10.11.15.78 failed scraping: GET request for URL \"http://10.11.15.78:9090/metrics\" returned HTTP status 502",
  "ts": "2020-07-24T00:01:53.379Z"
}

The two graphs are of similar shape. Taking the IP from the sample log line and poking at the /stats endpoint of the Envoy proxy, I see the following lines:

reporter=.=source;.;source_workload=.=autoscaler;.;source_workload_namespace=.=knative-serving;.;source_principal=.=unknown;.;source_app=.=autoscaler;.;source_version=.=unknown;.;source_canonical_service=.=autoscaler;.;source_canonical_revision=.=latest;.;destination_workload=.=unknown;.;destination_workload_namespace=.=unknown;.;destination_principal=.=unknown;.;destination_app=.=unknown;.;destination_version=.=unknown;.;destination_service=.=10.11.15.78:9090;.;destination_service_name=.=BlackHoleCluster;.;destination_service_namespace=.=unknown;.;destination_canonical_service=.=unknown;.;destination_canonical_revision=.=latest;.;request_protocol=.=http;.;response_code=.=502;.;grpc_response_status=.=;.;response_flags=.=-;.;connection_security_policy=.=unknown;.;_istio_requests_total: 2
reporter=.=source;.;source_workload=.=autoscaler;.;source_workload_namespace=.=knative-serving;.;source_principal=.=unknown;.;source_app=.=autoscaler;.;source_version=.=unknown;.;source_canonical_service=.=autoscaler;.;source_canonical_revision=.=latest;.;destination_workload=.=unknown;.;destination_workload_namespace=.=unknown;.;destination_principal=.=unknown;.;destination_app=.=unknown;.;destination_version=.=unknown;.;destination_service=.=10.11.15.78:9090;.;destination_service_name=.=BlackHoleCluster;.;destination_service_namespace=.=unknown;.;destination_canonical_service=.=unknown;.;destination_canonical_revision=.=latest;.;request_protocol=.=http;.;response_code=.=502;.;grpc_response_status=.=;.;response_flags=.=-;.;connection_security_policy=.=unknown;.;_istio_request_bytes: P0(nan,1200.0) P25(nan,1225.0) P50(nan,1250.0) P75(nan,1275.0) P90(nan,1290.0) P95(nan,1295.0) P99(nan,1299.0) P99.5(nan,1299.5) P99.9(nan,1299.9) P100(nan,1300.0)
reporter=.=source;.;source_workload=.=autoscaler;.;source_workload_namespace=.=knative-serving;.;source_principal=.=unknown;.;source_app=.=autoscaler;.;source_version=.=unknown;.;source_canonical_service=.=autoscaler;.;source_canonical_revision=.=latest;.;destination_workload=.=unknown;.;destination_workload_namespace=.=unknown;.;destination_principal=.=unknown;.;destination_app=.=unknown;.;destination_version=.=unknown;.;destination_service=.=10.11.15.78:9090;.;destination_service_name=.=BlackHoleCluster;.;destination_service_namespace=.=unknown;.;destination_canonical_service=.=unknown;.;destination_canonical_revision=.=latest;.;request_protocol=.=http;.;response_code=.=502;.;grpc_response_status=.=;.;response_flags=.=-;.;connection_security_policy=.=unknown;.;_istio_request_duration_milliseconds: P0(nan,0.0) P25(nan,0.0) P50(nan,0.0) P75(nan,0.0) P90(nan,0.0) P95(nan,0.0) P99(nan,0.0) P99.5(nan,0.0) P99.9(nan,0.0) P100(nan,0.0)
reporter=.=source;.;source_workload=.=autoscaler;.;source_workload_namespace=.=knative-serving;.;source_principal=.=unknown;.;source_app=.=autoscaler;.;source_version=.=unknown;.;source_canonical_service=.=autoscaler;.;source_canonical_revision=.=latest;.;destination_workload=.=unknown;.;destination_workload_namespace=.=unknown;.;destination_principal=.=unknown;.;destination_app=.=unknown;.;destination_version=.=unknown;.;destination_service=.=10.11.15.78:9090;.;destination_service_name=.=BlackHoleCluster;.;destination_service_namespace=.=unknown;.;destination_canonical_service=.=unknown;.;destination_canonical_revision=.=latest;.;request_protocol=.=http;.;response_code=.=502;.;grpc_response_status=.=;.;response_flags=.=-;.;connection_security_policy=.=unknown;.;_istio_response_bytes: P0(nan,54.0) P25(nan,54.25) P50(nan,54.5) P75(nan,54.75) P90(nan,54.9) P95(nan,54.95) P99(nan,54.99) P99.5(nan,54.995) P99.9(nan,54.999) P100(nan,55.0)

Sampling a few of the recent log lines from above reveals that they are all showing up in the /stats output. Sampling a few log lines from hours ago shows that the IPs from those failed pod scraping attempts are still in the /stats output, even though it has been ~3 hours since the attempt to scrape the pod.
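To put a rough number on how many of these stale per-pod entries accumulate, a quick throwaway helper like the one below can count distinct destination_service values in a saved dump of the sidecar's /stats output (for example, a dump taken by exec'ing into the istio-proxy container and curling the Envoy admin endpoint). This whole snippet, including the stats.txt filename, is just an illustration on my side and not part of Knative or Istio:

package main

import (
	"bufio"
	"fmt"
	"os"
	"strings"
)

func main() {
	// stats.txt: a saved dump of the sidecar's Envoy /stats output.
	f, err := os.Open("stats.txt")
	if err != nil {
		panic(err)
	}
	defer f.Close()

	// Labels in these stat names are encoded as name=.=value;.;
	const marker = "destination_service=.="
	dests := map[string]struct{}{}
	sc := bufio.NewScanner(f)
	sc.Buffer(make([]byte, 0, 64*1024), 1024*1024) // some stat lines are long
	for sc.Scan() {
		line := sc.Text()
		i := strings.Index(line, marker)
		if i < 0 {
			continue
		}
		val := line[i+len(marker):]
		if j := strings.Index(val, ";.;"); j >= 0 {
			val = val[:j]
		}
		dests[val] = struct{}{}
	}
	if err := sc.Err(); err != nil {
		panic(err)
	}
	fmt.Printf("distinct destination_service values: %d\n", len(dests))
}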

What might be contributing to this is that the Istio mesh is set up with outboundTrafficPolicy set to REGISTRY_ONLY, which is likely why these requests show up under the BlackHoleCluster stats lines.

An easy fix could be to persist podsAddressable in some way across calls to updateScraper. I'm sure there is some nuance there that I don't yet understand, but it would be really helpful, as I don't anticipate that getting Envoy to drop old metrics will happen quickly (still looking to see whether an issue has even been filed yet).
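To make the idea concrete, here is a rough sketch of what persisting that flag could look like: a small cache keyed by revision that CreateOrUpdate could consult when it rebuilds a scraper, so the new scraper doesn't default back to attempting direct pod scrapes. The scraperHints type and its methods are made up by me; only podsAddressable, updateScraper and CreateOrUpdate refer to the actual code, and the real collector almost certainly has locking and lifecycle concerns this ignores:

package metrics

import "sync"

// scraperHints is a hypothetical cache: it remembers, per revision, the
// last known answer to "can we scrape pods directly?" so that a freshly
// created scraper doesn't default back to attempting direct pod scrapes.
type scraperHints struct {
	mu   sync.RWMutex
	hint map[string]bool // revision key -> last known podsAddressable value
}

func newScraperHints() *scraperHints {
	return &scraperHints{hint: map[string]bool{}}
}

func (s *scraperHints) Get(revKey string) (addressable, known bool) {
	s.mu.RLock()
	defer s.mu.RUnlock()
	addressable, known = s.hint[revKey]
	return addressable, known
}

func (s *scraperHints) Set(revKey string, addressable bool) {
	s.mu.Lock()
	defer s.mu.Unlock()
	s.hint[revKey] = addressable
}

// In CreateOrUpdate, before collection.updateScraper(newScraper), the
// collector could seed the new scraper from the cache:
//
//	if addr, ok := hints.Get(revKey); ok && !addr {
//		newScraper.podsAddressable = false // don't retry direct scrapes
//	}
//	collection.updateScraper(newScraper)
//
// and the scraper would call hints.Set(revKey, false) once it learns pods
// are not reachable directly.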

@vagababov
Contributor

podsAddressable is cached for a given revision.
In theory we could try to cache it across revisions, but that (I don't remember why) was not considered safe.

@nairb774
Contributor Author

If you mean Knative Revisions, there haven't been explicit changes to those - we only make use of Knative Services, and they haven't been changed/redeployed since July 1 - so what could be triggering changes to revisions?

@vagababov
Contributor

So when you send varying traffic, the autoscaler scales the pods up and down.
As a consequence, the endpoints list that your sidecar watches changes as those pods come and go.
My guess is that Istio has a bug where it does not purge them from its caches. Hence my suggestion is that you first ask there to see if it's a known issue (e.g. like the one I posted above).
In other words, this is an Istio issue unless proven otherwise (it's the istio-proxy container that has the memory leak, right?)

nairb774 added a commit to nairb774/experimental that referenced this issue Jul 30, 2020
…oscaler.

The Knative autoscaler attempts to reach Pods directly on occasion, and
when it does, the request is rejected by the REGISTRY_ONLY setting. This
is a simplified use case for
knative/serving#8761 to present to the Istio
folks.
@github-actions

This issue is stale because it has been open for 90 days with no
activity. It will automatically close after 30 more days of
inactivity. Reopen the issue with /reopen. Mark the issue as
fresh by adding the comment /remove-lifecycle stale.

@github-actions github-actions bot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Oct 23, 2020