mtr-exporter leaking defunct mtr and mtr-packet processes #24
Comments
@gberche-orange the failsafe mechanism is tracked in #17. You might want to play with https://github.com/krallin/tini?tab=readme-ov-file#using-tini … it is built into Docker already, maybe worth a try.
I can see the potential need to use https://pkg.go.dev/os/exec#CommandContext, killing the "previous" mtr process automatically "before" the new one is about to be launched: something like, if the schedule is "@every 60s", the previous mtr run gets forcibly timed out at around 59s or so. Won't help with the zombies though, I assume.
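For illustration, a minimal sketch of that idea (not mtr-exporter's actual code): the run is bounded to slightly less than the schedule interval via exec.CommandContext, so a hanging mtr is killed before the next launch. The target host and mtr flags are placeholders.

```go
package main

import (
	"context"
	"log"
	"os/exec"
	"time"
)

// runMTR launches one mtr run and kills it if it has not finished
// shortly before the next scheduled run would start.
func runMTR(interval time.Duration, args ...string) error {
	ctx, cancel := context.WithTimeout(context.Background(), interval-time.Second)
	defer cancel()

	cmd := exec.CommandContext(ctx, "mtr", args...)
	out, err := cmd.Output() // the process is killed once ctx expires
	if err != nil {
		return err
	}
	log.Printf("mtr produced %d bytes of output", len(out))
	return nil
}

func main() {
	// with "@every 60s" the run is cut off at ~59s
	if err := runMTR(60*time.Second, "--json", "--no-dns", "example.org"); err != nil {
		log.Printf("mtr run failed: %v", err)
	}
}
```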
Thanks a lot @mgumz for your prompt and detailed answer, and sorry I had missed #17.
Indeed I had modified the whole entrypoint instead of just adding the args to it (in k8s, using the container command field rather than args).
Thanks for the hint related to tini, which I'm currently testing. Reading tini's documentation, and in particular krallin/tini#8 (comment), I'm not sure why zombies would accumulate with mtr-exporter, as lines 57 to 58 in fd2834d rely on https://pkg.go.dev/os/exec#Cmd.Run, which starts the command and waits for it to complete:
https://cs.opensource.google/go/go/+/master:src/os/exec/exec.go;l=622-627

func (c *Cmd) Run() error {
	if err := c.Start(); err != nil {
		return err
	}
	return c.Wait()
}

https://pkg.go.dev/os/exec#Cmd.Wait
I'll report back though if the tini workaround isn't sufficient.
Rather than killing the previous mtr process, could mtr-exporter wait for it to complete? This approach expects that the mtr arguments (timeout related) are properly configured to prevent an infinite hang. It would preserve the parsing of the mtr result data and would handle cases where the default 60s schedule is unexpectedly shorter than the mtr response time (e.g. during network outages).
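A rough sketch of that alternative (again, not mtr-exporter's actual code), assuming mtr's own timeout flags are set so a run cannot hang forever: a single loop runs mtr to completion before honouring the next tick, so runs are serialized instead of killed. The command line is a placeholder.

```go
package main

import (
	"log"
	"os/exec"
	"time"
)

func main() {
	ticker := time.NewTicker(60 * time.Second)
	defer ticker.Stop()

	for range ticker.C {
		// runs synchronously: a slow mtr run delays the next launch
		// instead of overlapping with it; extra ticks are dropped.
		out, err := exec.Command("mtr", "--json", "--no-dns", "example.org").Output()
		if err != nil {
			log.Printf("mtr failed: %v", err)
			continue
		}
		log.Printf("parsed %d bytes of mtr output", len(out))
	}
}
```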
@gberche-orange: via #21, @clavinjune is also working on a k8s deployment (which will end up as a helm chart). In that, he will also tackle the issue of using either …
Yes, doable. The interesting piece then becomes the design of the …
Yes, in our (production) environment we are also utilizing tini in order to eliminate zombie processes, and it is working well.
@gberche-orange all settled?
Thanks @mgumz for your support on this issue. The use of tini as the container command indeed solved the symptom in my environment.
What problem would this documentation …? Can you detail the current behavior of the exporter in the face of parallel overlapping mtr executions N-1 and N, and the responses served on the prometheus endpoint?
Parallel execution: an external process is started and runs until it finishes. Once that is done, the metrics for that run are fed into an internal data structure and presented as-is on the prometheus endpoint.

So, in theory: run-256 is launched, and a network situation occurs which delays / slows down this run-256. Meanwhile, run-257 is triggered and finishes, the collector gets "recent updates", then run-256 finishes and "overwrites" the data of run-257.

Also: if you scrape the prometheus endpoint every 30 minutes … and launch mtr every 60s … then you would see only the last of the 30 runs. The feature-request for …
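One hypothetical way to avoid the "run-256 overwrites run-257" case would be to tag each run with a counter and have the collector ignore stale results. The Collector/Update names below are illustrative, not mtr-exporter's API.

```go
package main

import (
	"fmt"
	"sync"
)

// Collector holds the most recent report; names here are illustrative only.
type Collector struct {
	mu      sync.Mutex
	lastRun uint64
	report  string // placeholder for the parsed mtr report
}

// Update applies a report only if it comes from a newer run than the one
// currently stored, so a delayed run-256 cannot overwrite run-257.
func (c *Collector) Update(run uint64, report string) bool {
	c.mu.Lock()
	defer c.mu.Unlock()
	if run <= c.lastRun {
		return false // stale result, keep the newer data
	}
	c.lastRun = run
	c.report = report
	return true
}

func main() {
	c := &Collector{}
	fmt.Println(c.Update(257, "run-257 report")) // true
	fmt.Println(c.Update(256, "run-256 report")) // false: arrived late, ignored
}
```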
Relates to mgumz#24
Thanks @mgumz for your feedback! I submitted #25 to help keep new users from hitting this issue until #21 or a similar helm chart automates this configuration for k8s users. I believe the feature of a schedule … This might deserve a distinct issue though.
Thanks again for sharing this work with the community. I'm observing that mtr-exporter leaves many defunct (zombie) processes behind. This reproduces with
ghcr.io/mgumz/mtr-exporter:0.4.0
and the command:
/usr/bin/mtr-exporter -bind :8089 -- iperf-public-listener-r4-z1-1.mydomain.org 4 --tcp -P 443 --timeout 2 --gracetime 2 --no-dns
On one system, I'm observing 120k+ such processes.
https://en.m.wikipedia.org/wiki/Zombie_process defines a zombie/defunct process as one that has completed execution but still has an entry in the process table.
This eventually leads to resource exhaustion on the host, with messages such as:
info: "mtr-exporter-cli" failed: fork/exec /usr/sbin/mtr: resource temporarily unavailable
I suspect the root cause in my case is that the default schedule (60s) is too fast, and mtr commands are sometimes created at a higher rate than they terminate, possibly hanging at times:

mtr-exporter/README.md
Lines 67 to 68 in 6e852a7

The options given to mtr (--timeout 2 --gracetime 2 --no-dns) might trigger the mtr process to hang (race condition between timeout and gracetime).

Would it make sense to include a fail-safe mechanism that prevents launching new mtr processes beyond a given number of non-completed commands?
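For what it's worth, a hedged sketch of such a fail-safe, assuming a fixed cap on in-flight runs (the cap value, target host and flags are made up): a counting semaphore tracks not-yet-completed mtr processes, and the scheduler skips a tick once the cap is reached instead of forking yet another mtr.

```go
package main

import (
	"log"
	"os/exec"
	"time"
)

func main() {
	const maxInFlight = 3 // made-up cap on not-yet-completed mtr runs
	slots := make(chan struct{}, maxInFlight)

	for range time.Tick(60 * time.Second) {
		select {
		case slots <- struct{}{}: // a slot is free: launch a new run
			go func() {
				defer func() { <-slots }() // release the slot when mtr exits
				if out, err := exec.Command("mtr", "--json", "example.org").Output(); err != nil {
					log.Printf("mtr failed: %v", err)
				} else {
					log.Printf("mtr produced %d bytes", len(out))
				}
			}()
		default: // cap reached: skip this tick instead of forking another mtr
			log.Printf("%d mtr runs still in flight, skipping this tick", maxInFlight)
		}
	}
}
```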