Expose UDP message on /metrics endpoint #2608

fmejia97 · 2018-06-05T18:45:27Z

What this PR does / why we need it:

This PR:

Adds a UDP collector that listens to UDP messages from monitor.lua and exposes them on /metrics endpoint using prometheus client.
Removes VTS collector.
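
For readers unfamiliar with the pattern, here is a minimal Go sketch of a collector-style UDP listener. It is not the code from this PR: `listenUDP` and `sendAndReceive` are hypothetical names, and the loopback address and JSON payload are assumptions chosen for illustration.

```go
package main

import (
	"fmt"
	"net"
	"time"
)

// listenUDP is a hypothetical helper (not the PR's code): it binds a UDP
// socket on addr and passes every datagram to handle until the socket closes.
func listenUDP(addr string, handle func([]byte)) (*net.UDPConn, error) {
	udpAddr, err := net.ResolveUDPAddr("udp", addr)
	if err != nil {
		return nil, err
	}
	conn, err := net.ListenUDP("udp", udpAddr)
	if err != nil {
		return nil, err
	}
	go func() {
		buf := make([]byte, 65535) // large enough for any single UDP datagram
		for {
			n, _, err := conn.ReadFromUDP(buf)
			if err != nil {
				return // socket closed
			}
			// Copy the payload so the handler never sees a reused buffer.
			msg := make([]byte, n)
			copy(msg, buf[:n])
			handle(msg)
		}
	}()
	return conn, nil
}

// sendAndReceive starts a listener on an ephemeral loopback port, sends one
// datagram to it, and returns what the handler received.
func sendAndReceive(payload string) (string, error) {
	got := make(chan string, 1)
	conn, err := listenUDP("127.0.0.1:0", func(b []byte) { got <- string(b) })
	if err != nil {
		return "", err
	}
	defer conn.Close()

	client, err := net.Dial("udp", conn.LocalAddr().String())
	if err != nil {
		return "", err
	}
	defer client.Close()
	if _, err := client.Write([]byte(payload)); err != nil {
		return "", err
	}
	select {
	case msg := <-got:
		return msg, nil
	case <-time.After(2 * time.Second):
		return "", fmt.Errorf("timed out waiting for datagram")
	}
}

func main() {
	msg, err := sendAndReceive(`{"status":"200"}`)
	fmt.Println(msg, err)
}
```

In a real collector the handler would decode the payload and update Prometheus metrics instead of forwarding it to a channel.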

Why do we need this PR?

With the dynamic mode introduced a couple of releases ago, load balancing is handled in Lua. This means the generated nginx.conf has no per-service upstream sections; there is only a single upstream, used to hand off to Lua. In this scenario the VTS module is useless, because its stats only contain information about that one upstream, without any context about the traffic. Additionally, with the VTS module it is not possible to add custom variables to the stats, such as the namespace, ingress rule, and service (all available as NGINX variables).

Special notes for your reviewer:

replaces #2607

fixes #2355
fixes #2128
fixes #557

k8s-ci-robot · 2018-06-05T18:45:30Z

Thanks for your pull request. Before we can look at your pull request, you'll need to sign a Contributor License Agreement (CLA).

📝 Please follow instructions at https://git.k8s.io/community/CLA.md#the-contributor-license-agreement to sign the CLA.

It may take a couple minutes for the CLA signature to be fully registered; after that, please reply here with a new comment and we'll verify. Thanks.

If you've already signed a CLA, it's possible we don't have your GitHub username or you're using a different email address. Check your existing CLA data and verify that your email is set on your git commits.
If you signed the CLA as a corporation, please sign in with your organization's credentials at https://identity.linuxfoundation.org/projects/cncf to be authorized.
If you have done the above and are still having issues with the CLA being reported as unsigned, please email the CNCF helpdesk: helpdesk@rt.linuxfoundation.org

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

aledbf · 2018-06-05T18:51:58Z

You have some lint issues (maybe from one of my commits :P )

./hack/verify-golint.sh
!!! 'golint' problems: 
/go/src/k8s.io/ingress-nginx/internal/ingress/metric/collector/stats.go:53:6: exported type StatsCollector should have comment or be unexported
/go/src/k8s.io/ingress-nginx/internal/ingress/metric/collector/stats.go:64:1: exported function NewInstance should have comment or be unexported
/home/aledbf/go/src/k8s.io/ingress-nginx/internal/ingress/metric/collector/stats.go:179:1: exported method StatsCollector.Run should have comment or be unexported

gianrubio · 2018-06-05T18:53:55Z

Does it make sense to fully remove the current VTS library, or should we have metrics/v1 => vts and metrics/v2 => lua exporter?

Can I have a Docker image of these changes?

aledbf · 2018-06-05T18:56:54Z

@gianrubio I was building that 😉 quay.io/aledbf/nginx-ingress-controller:0.371

aledbf · 2018-06-05T18:57:24Z

Does it make sense to fully remove the current VTS library, or should we have metrics/v1 => vts and metrics/v2 => lua exporter?

Once this PR lands we are removing the VTS module from the nginx image

aledbf · 2018-06-05T19:02:51Z

@fmejia97

I0605 19:00:21.763053       6 stats.go:126] msg: {"requestLength":"1343","namespace":"gitlab","host":"git.192.168.1.29.xip.io","upstreamResponseTime":"-","remoteAddr":"192.168.1.2","ingress":"gitlab-dev","status":"499","protocol":"HTTP\/1.1","upstreamStatus":"-","path":"\/docker-images\/base\/branches","method":"GET","service":"gitlab","requestDuration":"0.522","bytesSent":"0","upstreamIP":"10.2.1.22:80"}
panic: json: cannot unmarshal number - into Go struct field udpData.upstreamResponseTime of type float64

goroutine 66 [running]:
k8s.io/ingress-nginx/internal/ingress/metric/collector.(*StatsCollector).handleMessage(0xc4204ca5f0, 0xc420626000, 0x167, 0x10400)
	/home/aledbf/go/src/k8s.io/ingress-nginx/internal/ingress/metric/collector/stats.go:132 +0xbc7
k8s.io/ingress-nginx/internal/ingress/metric/collector.(*StatsCollector).(k8s.io/ingress-nginx/internal/ingress/metric/collector.handleMessage)-fm(0xc420626000, 0x167, 0x10400)

fmejia97 · 2018-06-05T19:03:53Z

🤔 Looking right now

aledbf · 2018-06-05T19:07:11Z

Example of the metrics https://gist.github.com/aledbf/c68ad149c1292da54d6bd3162bf05740

aledbf · 2018-06-05T19:09:31Z

"upstreamResponseTime":"-"

This could be an issue. In some scenarios, nginx returns - in some fields (usually when status==499)

fmejia97 · 2018-06-05T19:10:40Z

Yes, that is what's crashing Unmarshal. I'm working on a fix and will update the PR soon.
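
One way such a fix could look, as a sketch only (the actual fix in this PR may differ): decode numeric fields through a small type that treats nginx's `-` placeholder as "no value" instead of failing. The names `optionalFloat`, `message`, and `parseMessage` are hypothetical, and the struct covers only a subset of the fields in the logged payload.

```go
package main

import (
	"encoding/json"
	"fmt"
	"strconv"
)

// optionalFloat is a hypothetical type: a float sent as a JSON string, where
// "-" (nginx's placeholder for an unset variable) means "no value".
type optionalFloat struct {
	Value float64
	Valid bool
}

// UnmarshalJSON accepts a quoted number, or "-" / "" for a missing value.
func (f *optionalFloat) UnmarshalJSON(b []byte) error {
	var s string
	if err := json.Unmarshal(b, &s); err != nil {
		return err
	}
	if s == "-" || s == "" {
		*f = optionalFloat{}
		return nil
	}
	v, err := strconv.ParseFloat(s, 64)
	if err != nil {
		return err
	}
	*f = optionalFloat{Value: v, Valid: true}
	return nil
}

// message holds a subset of the fields seen in the logged payload.
type message struct {
	Status               string        `json:"status"`
	RequestDuration      optionalFloat `json:"requestDuration"`
	UpstreamResponseTime optionalFloat `json:"upstreamResponseTime"`
}

// parseMessage returns an error instead of panicking on malformed input.
func parseMessage(raw []byte) (message, error) {
	var m message
	err := json.Unmarshal(raw, &m)
	return m, err
}

func main() {
	raw := []byte(`{"status":"499","requestDuration":"0.522","upstreamResponseTime":"-"}`)
	m, err := parseMessage(raw)
	fmt.Printf("%+v %v\n", m, err)
}
```

A collector using this shape would skip the observation when `Valid` is false rather than recording a zero.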

ElvinEfendi · 2018-06-05T19:39:48Z

I've not looked deep into the PR yet, but can you make another PR for Lua middleware and deletion of VTS? Will make it easier to review.

I'd suggest we first get the Lua middleware merged and then merge this PR. And finally remove VTS and edit nginx.tmpl to switch to the new metric collection.

aledbf · 2018-06-05T19:42:03Z

I'd suggest we first get the Lua middleware merged and then merge this PR.

Both things are in this PR

And finally remove VTS and edit nginx.tmpl to switch to the new metric collection.

That's the plan

aledbf · 2018-06-05T19:42:43Z

@ElvinEfendi by default the vts module is disabled so the prometheus metrics only contain the default status information from nginx

ElvinEfendi · 2018-06-05T19:44:10Z

Both things are in this PR

I know, but for example the Lua middleware lacks unit tests, and having self-contained, separate PRs where possible makes them easier to review and revert.

aledbf · 2018-06-05T19:44:29Z

Will make it easier to review.

Not sure if splitting this PR makes sense (in this case). It will be harder to review from my point of view

SuperQ · 2018-06-06T08:50:11Z

internal/ingress/metric/collector/stats.go

+			Name:      "bytes_sent",
+			Help:      "The number of bytes sent to a client",
+			Namespace: ns,
+			Buckets:   prometheus.LinearBuckets(50, 25, 10), // 10 buckets, starting at 50 bytes, each 25 bytes wide.


I would probably use exponential buckets for bytes sent, from 100 bytes to maybe 10 MiB.
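
To make the suggestion concrete, here is a small sketch of what exponential bucket boundaries for that range could look like. `exponentialBuckets` is a stand-in written for this example with the same shape as `prometheus.ExponentialBuckets(start, factor, count)`, minus its argument validation; the factor of 10 and the count of 6 are assumptions, not values from the PR.

```go
package main

import "fmt"

// exponentialBuckets is a stand-in with the same shape as
// prometheus.ExponentialBuckets(start, factor, count): it returns count upper
// bounds, the first equal to start and each following one multiplied by factor.
func exponentialBuckets(start, factor float64, count int) []float64 {
	buckets := make([]float64, count)
	for i := range buckets {
		buckets[i] = start
		start *= factor
	}
	return buckets
}

func main() {
	// 100 B, 1 kB, 10 kB, 100 kB, 1 MB, 10 MB (decimal units, so the top
	// bound is slightly below 10 MiB).
	fmt.Println(exponentialBuckets(100, 10, 6))
}
```

Compared with the linear buckets in the diff, which stop at a few hundred bytes, six exponential buckets cover five orders of magnitude of response size.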

SuperQ · 2018-06-06T08:59:34Z

Is it possible to use ngx.socket.stream with a unix domain socket rather than UDP? This would make more sense for something communicating locally. It would avoid the whole network stack; even on localhost, large amounts of UDP traffic can be problematic.
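
On the Go side, the receiving end of such a change might look roughly like the sketch below. It assumes newline-delimited messages over a stream socket, which is only one possible framing; `serveUnix`, `roundTripUnix`, and the socket file name are hypothetical and do not come from this PR.

```go
package main

import (
	"bufio"
	"fmt"
	"net"
	"os"
	"path/filepath"
	"time"
)

// serveUnix is a hypothetical helper: it accepts connections on a unix domain
// socket at path and passes each newline-terminated message to handle.
func serveUnix(path string, handle func(string)) (net.Listener, error) {
	ln, err := net.Listen("unix", path)
	if err != nil {
		return nil, err
	}
	go func() {
		for {
			conn, err := ln.Accept()
			if err != nil {
				return // listener closed
			}
			go func(c net.Conn) {
				defer c.Close()
				scanner := bufio.NewScanner(c)
				for scanner.Scan() {
					handle(scanner.Text())
				}
			}(conn)
		}
	}()
	return ln, nil
}

// roundTripUnix sends one message through a temporary socket and returns
// what the handler received.
func roundTripUnix(payload string) (string, error) {
	dir, err := os.MkdirTemp("", "metrics")
	if err != nil {
		return "", err
	}
	defer os.RemoveAll(dir)

	got := make(chan string, 1)
	ln, err := serveUnix(filepath.Join(dir, "metrics.sock"), func(s string) { got <- s })
	if err != nil {
		return "", err
	}
	defer ln.Close()

	client, err := net.Dial("unix", ln.Addr().String())
	if err != nil {
		return "", err
	}
	defer client.Close()
	if _, err := fmt.Fprintln(client, payload); err != nil {
		return "", err
	}
	select {
	case msg := <-got:
		return msg, nil
	case <-time.After(2 * time.Second):
		return "", fmt.Errorf("timed out waiting for message")
	}
}

func main() {
	msg, err := roundTripUnix(`{"status":"200"}`)
	fmt.Println(msg, err)
}
```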

aledbf · 2018-06-06T12:16:22Z

@SuperQ yes but not sure if that change will be done in this PR

aledbf · 2018-06-06T13:11:34Z

@fmejia97 maybe we could add this additional metric #2121 now? (and close that PR)

fmejia97 · 2018-06-06T14:56:48Z

@aledbf Looks like that PR wants to add a metric for the average request time of individual services (upstreams). I'm currently exposing upstream_response_time_seconds_sum{...}, which carries the service labels, and by Prometheus convention aggregation (avg) is usually carried out on the Prometheus server side. We could obtain the additional metric with the query avg_over_time(range-vector): https://prometheus.io/docs/prometheus/latest/querying/functions/#aggregation-_over_time. This query lets us specify how far back in time to consider values for the aggregation. What are your thoughts?

aledbf · 2018-06-06T15:02:37Z

internal/ingress/metric/collector/stats.go

+}
+
+func (sc *StatsCollector) handleMessage(msg []byte) {
+	glog.Infof("msg: %v", string(msg))


Please change this to glog.V(5).Infof("msg: %v", string(msg))

aledbf · 2018-06-12T20:11:52Z

internal/ingress/metric/collector/nginx_status_collector.go

+	nginxStatusCollector struct {
+		scrapeChan     chan scrapeRequest
+		ngxHealthPort  int
+		ngxVtsPath     string


@fmejia97 please remove this field. We don't need this now.

aledbf · 2018-06-12T20:26:51Z

@fmejia97 I forgot to mention that we don't need nginx_status_collector.go. Please remove that file too.
After this PR, there is only one way to get metrics.

fmejia97 · 2018-06-13T04:41:56Z

@aledbf After talking with @ElvinEfendi and @andrewlouis93 , we decided that keeping nginx_status_collector.go is a good idea since it gives us connections and request metrics. Ideally we don't want to emit these metrics through the monitor.lua module since this module only emits data regarding each individual request. Connections metrics are independent of individual requests. As it is right now, the nginx_status collector writes metrics using the prometheus client, therefore both monitor.lua and nginx_status metrics will be visible on the same endpoint: /metrics. If we remove it, we won't have connection metrics. What do you think?

aledbf · 2018-06-13T05:18:42Z

we decided that keeping nginx_status_collector.go is a good idea since it gives us connections and request metrics.

Until now nginx_status_collector.go was used only if the VTS module was disabled. I would prefer to get this info only from one place if possible. Maybe we can leave this until we merge this PR and then see exactly what can be removed (or not)

gianrubio · 2018-06-13T11:22:14Z

I'd like to understand the main differences between the Prometheus Lua collector and nginx VTS.

I noticed that nginx VTS added a feature to expose metrics in Prometheus format. Can we discuss the pros and cons of Lua vs. VTS?

aledbf · 2018-06-13T12:44:17Z

I noticed that nginx VTS added a feature to expose metrics in Prometheus format. Can we discuss the pros and cons of Lua vs. VTS?

With the dynamic mode introduced a couple of releases ago, load balancing is handled in Lua. This means the generated nginx.conf has no per-service upstream sections; there is only a single upstream, used to hand off to Lua. In this scenario the VTS module is useless, because its stats only contain information about that one upstream, without any context about the traffic.
Additionally, with the VTS module it is not possible to add custom variables to the stats, such as the namespace, ingress rule, and service (all available as NGINX variables).

andrewloux · 2018-06-13T19:37:12Z

@fmejia97 It is not clear to everyone why we are switching away from VTS. Could we add more context to this PR description before this ships?

codecov-io · 2018-06-13T19:53:06Z

Codecov Report

Merging #2608 into master will decrease coverage by 0.03%.
The diff coverage is 3.67%.

@@            Coverage Diff            @@
##           master   #2608      +/-   ##
=========================================
- Coverage   40.83%   40.8%   -0.04%     
=========================================
  Files          75      74       -1     
  Lines        5123    5083      -40     
=========================================
- Hits         2092    2074      -18     
+ Misses       2750    2726      -24     
- Partials      281     283       +2

Impacted Files	Coverage Δ
internal/ingress/controller/nginx.go	`11.51% <0%> (+0.24%)`	⬆️
internal/ingress/metric/collector/udp_collector.go	`0% <0%> (ø)`
internal/ingress/controller/util.go	`27.77% <0%> (-13.89%)`	⬇️
...rnal/ingress/metric/collector/process_collector.go	`0% <0%> (ø)`
...ingress/metric/collector/nginx_status_collector.go	`0% <0%> (ø)`
cmd/nginx/main.go	`22.62% <0%> (-6.38%)`	⬇️
internal/ingress/metric/collector/listener.go	`69.23% <69.23%> (ø)`

Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 764bcd5...2cd2da7. Read the comment docs.

aledbf · 2018-06-13T20:08:14Z

@fmejia97 please rollback the last commit. The namespace cannot be a constant. Use a helper to fix the name (like ingress-nginx to ingress_nginx)
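
A helper of the kind requested here could be as small as the following sketch. `sanitizeNamespace` is a hypothetical name, and it only handles hyphens, which are not allowed in Prometheus metric names; other invalid characters are out of scope for this example.

```go
package main

import (
	"fmt"
	"strings"
)

// sanitizeNamespace is a hypothetical helper: it makes a Kubernetes-style
// name usable as a Prometheus metric namespace by replacing hyphens with
// underscores.
func sanitizeNamespace(name string) string {
	return strings.Replace(name, "-", "_", -1)
}

func main() {
	fmt.Println(sanitizeNamespace("ingress-nginx"))
}
```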

fmejia97 · 2018-06-14T01:58:00Z

@aledbf I used a constant since I saw that's how it was used before. Added a new commit where I remove the constant and instead I fix the namespace name 👍

fmejia97 · 2018-06-14T03:27:40Z

@aledbf Rebased + squashed 👍

andrewloux · 2018-06-14T12:52:36Z

@aledbf I'm happy with this as it is right now. We'll do another PR where we try to reduce complexity once this one is shipped 🙏 Great work on this @fmejia97 ❤️

aledbf · 2018-06-14T13:12:32Z

Let's merge this, it's already too big.

aledbf · 2018-06-14T13:12:37Z

/lgtm
/approve

k8s-ci-robot · 2018-06-14T13:12:43Z

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: aledbf, fmejia97

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

~~OWNERS~~ [aledbf]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

aledbf · 2018-06-14T13:13:22Z

@fmejia97 @andrewlouis93 I will work on using unix sockets instead of UDP to send the metrics.

k8s-ci-robot added cncf-cla: no Indicates the PR's author has not signed the CNCF CLA. size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. labels Jun 5, 2018

k8s-ci-robot added cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. and removed cncf-cla: no Indicates the PR's author has not signed the CNCF CLA. labels Jun 5, 2018

aledbf requested a review from gianrubio June 5, 2018 19:54

aledbf assigned aledbf, ElvinEfendi and antoineco Jun 5, 2018

SuperQ reviewed Jun 6, 2018

aledbf mentioned this pull request Jun 6, 2018

VTS Upstream groups not working with --enable-dynamic-configuration #2355

Closed

aledbf reviewed Jun 6, 2018

aledbf reviewed Jun 12, 2018

fmejia97 force-pushed the expose-udp-metrics-updated branch from c145559 to b04a052 Compare June 12, 2018 20:42

fmejia97 force-pushed the expose-udp-metrics-updated branch from b04a052 to 9fb664b Compare June 13, 2018 04:50

fmejia97 force-pushed the expose-udp-metrics-updated branch 3 times, most recently from 8f33578 to f508189 Compare June 14, 2018 01:00

Create UDP collector that listens to UDP messages from monitor.lua and exposes them on /metrics endpoint (commit 2cd2da7)

fmejia97 force-pushed the expose-udp-metrics-updated branch from f508189 to 2cd2da7 Compare June 14, 2018 01:32

aledbf mentioned this pull request Jun 14, 2018

Update nginx to 1.15.0 and remove VTS module #2618

Merged

andrewloux approved these changes Jun 14, 2018

k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Jun 14, 2018

k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Jun 14, 2018

k8s-ci-robot merged commit c9a0c90 into kubernetes:master Jun 14, 2018

This was referenced Jun 14, 2018

Remove VTS from the ingress controller #2643

Merged

Collect metrics of QPS per backend #2116

Closed
