Infinite recursion on list_windows.go #23984

Open
albertofem-scopely opened this issue Sep 17, 2024 · 1 comment
Labels
hcc/jira stage/accepted Confirmed, and intend to work on. No timeline commitment though. theme/client theme/platform-windows type/bug

Comments

@albertofem-scopely

Nomad version

Nomad v1.8.3
BuildDate 2024-08-13T07:37:30Z
Revision 63b636e

Operating system and Environment details

OS Name: Microsoft Windows Server 2022 Datacenter
OS Version: 10.0.20348 N/A Build 20348

Issue

It seems that we are hitting an infinite recursion issue when running Nomad on Windows as a client. We have a large cluster running only long-running services using raw_exec. When the services come up, everything looks fine, and they can stay stable for quite a few hours.

However, sometimes after a few hours, we start receiving a lot of these errors in the respective allocations:

Exit Code: 0, Exit Message: "executor: error waiting on process: rpc error: code = Unavailable desc = error reading from server: read tcp 127.0.0.1:61056->127.0.0.1:14000: wsarecv: An existing connection was forcibly closed by the remote host."

I pulled the logs from a client on which one of these allocations failed and found this Go stack trace:

2024-09-17T22:07:20.135Z [DEBUG] client.driver_mgr.raw_exec.executor.nomad.exe: runtime: goroutine stack exceeds 1000000000-byte limit: alloc_id=c23109e4-beef-ce8d-a50f-dbf47c2782aa driver=raw_exec task_name=runner
2024-09-17T22:07:20.135Z [DEBUG] client.driver_mgr.raw_exec.executor.nomad.exe: runtime: sp=0xc020d85380 stack=[0xc020d84000, 0xc040d84000]: alloc_id=c23109e4-beef-ce8d-a50f-dbf47c2782aa driver=raw_exec task_name=runner
2024-09-17T22:07:20.135Z [DEBUG] client.driver_mgr.raw_exec.executor.nomad.exe: fatal error: stack overflow: alloc_id=c23109e4-beef-ce8d-a50f-dbf47c2782aa driver=raw_exec task_name=runner
2024-09-17T22:07:20.138Z [DEBUG] client.driver_mgr.raw_exec.executor.nomad.exe: alloc_id=c23109e4-beef-ce8d-a50f-dbf47c2782aa driver=raw_exec task_name=runner
2024-09-17T22:07:20.138Z [DEBUG] client.driver_mgr.raw_exec.executor.nomad.exe: runtime stack:: alloc_id=c23109e4-beef-ce8d-a50f-dbf47c2782aa driver=raw_exec task_name=runner
2024-09-17T22:07:20.138Z [DEBUG] client.driver_mgr.raw_exec.executor.nomad.exe: runtime.throw({0x3863991?, 0x11c65ff701?}): alloc_id=c23109e4-beef-ce8d-a50f-dbf47c2782aa driver=raw_exec task_name=runner
2024-09-17T22:07:20.138Z [DEBUG] client.driver_mgr.raw_exec.executor.nomad.exe:         runtime/panic.go:1023 +0x65 fp=0x11c65ff770 sp=0x11c65ff740 pc=0xc20845: alloc_id=c23109e4-beef-ce8d-a50f-dbf47c2782aa driver=raw_exec task_name=runner
2024-09-17T22:07:20.138Z [DEBUG] client.driver_mgr.raw_exec.executor.nomad.exe: runtime.newstack(): alloc_id=c23109e4-beef-ce8d-a50f-dbf47c2782aa driver=raw_exec task_name=runner
2024-09-17T22:07:20.138Z [DEBUG] client.driver_mgr.raw_exec.executor.nomad.exe:         runtime/stack.go:1103 +0x5cc fp=0x11c65ff920 sp=0x11c65ff770 pc=0xc3a64c: alloc_id=c23109e4-beef-ce8d-a50f-dbf47c2782aa driver=raw_exec task_name=runner
2024-09-17T22:07:20.138Z [DEBUG] client.driver_mgr.raw_exec.executor.nomad.exe: runtime.morestack(): alloc_id=c23109e4-beef-ce8d-a50f-dbf47c2782aa driver=raw_exec task_name=runner
2024-09-17T22:07:20.138Z [DEBUG] client.driver_mgr.raw_exec.executor.nomad.exe:         runtime/asm_amd64.s:616 +0x79 fp=0x11c65ff928 sp=0x11c65ff920 pc=0xc588b9: alloc_id=c23109e4-beef-ce8d-a50f-dbf47c2782aa driver=raw_exec task_name=runner
2024-09-17T22:07:20.138Z [DEBUG] client.driver_mgr.raw_exec.executor.nomad.exe: alloc_id=c23109e4-beef-ce8d-a50f-dbf47c2782aa driver=raw_exec task_name=runner
2024-09-17T22:07:20.138Z [DEBUG] client.driver_mgr.raw_exec.executor.nomad.exe: goroutine 98 gp=0xc000b8dc00 m=10 mp=0xc000680008 [running]:: alloc_id=c23109e4-beef-ce8d-a50f-dbf47c2782aa driver=raw_exec task_name=runner
2024-09-17T22:07:20.138Z [DEBUG] client.driver_mgr.raw_exec.executor.nomad.exe: runtime.mapaccess1_fast64(0x31c6f60, 0xc040d83c60, 0x23dc): alloc_id=c23109e4-beef-ce8d-a50f-dbf47c2782aa driver=raw_exec task_name=runner
2024-09-17T22:07:20.138Z [DEBUG] client.driver_mgr.raw_exec.executor.nomad.exe:         runtime/map_fast64.go:13 +0x173 fp=0xc020d85390 sp=0xc020d85388 pc=0xbf6ef3: alloc_id=c23109e4-beef-ce8d-a50f-dbf47c2782aa driver=raw_exec task_name=runner
2024-09-17T22:07:20.138Z [DEBUG] client.driver_mgr.raw_exec.executor.nomad.exe: github.com/hashicorp/nomad/drivers/shared/executor/procstats.gather(0xc040d83c60, {0x41719b0, 0xc00061c018}, 0x2afc, {0x4142b20, 0xc0006d6e80}): alloc_id=c23109e4-beef-ce8d-a50f-dbf47c2782aa driver=raw_exec task_name=runner
2024-09-17T22:07:20.138Z [DEBUG] client.driver_mgr.raw_exec.executor.nomad.exe:         github.com/hashicorp/nomad/drivers/shared/executor/procstats/list_windows.go:24 +0xa5 fp=0xc020d853e0 sp=0xc020d85390 pc=0x21af7a5: alloc_id=c23109e4-beef-ce8d-a50f-dbf47c2782aa driver=raw_exec task_name=runner
2024-09-17T22:07:20.138Z [DEBUG] client.driver_mgr.raw_exec.executor.nomad.exe: github.com/hashicorp/nomad/drivers/shared/executor/procstats.gather(0xc040d83c60, {0x41719b0, 0xc00061c018}, 0x2afc, {0x4142b20, 0xc0006d7060}): alloc_id=c23109e4-beef-ce8d-a50f-dbf47c2782aa driver=raw_exec task_name=runner
2024-09-17T22:07:20.138Z [DEBUG] client.driver_mgr.raw_exec.executor.nomad.exe:         github.com/hashicorp/nomad/drivers/shared/executor/procstats/list_windows.go:25 +0xc5 fp=0xc020d85430 sp=0xc020d853e0 pc=0x21af7c5: alloc_id=c23109e4-beef-ce8d-a50f-dbf47c2782aa driver=raw_exec task_name=runner
2024-09-17T22:07:20.138Z [DEBUG] client.driver_mgr.raw_exec.executor.nomad.exe: github.com/hashicorp/nomad/drivers/shared/executor/procstats.gather(0xc040d83c60, {0x41719b0, 0xc00061c018}, 0x2afc, {0x4142b20, 0xc0006d6ca0}): alloc_id=c23109e4-beef-ce8d-a50f-dbf47c2782aa driver=raw_exec task_name=runner
2024-09-17T22:07:20.138Z [DEBUG] client.driver_mgr.raw_exec.executor.nomad.exe:         github.com/hashicorp/nomad/drivers/shared/executor/procstats/list_windows.go:25 +0xc5 fp=0xc020d85480 sp=0xc020d85430 pc=0x21af7c5: alloc_id=c23109e4-beef-ce8d-a50f-dbf47c2782aa driver=raw_exec task_name=runner
2024-09-17T22:07:20.138Z [DEBUG] client.driver_mgr.raw_exec.executor.nomad.exe: github.com/hashicorp/nomad/drivers/shared/executor/procstats.gather(0xc040d83c60, {0x41719b0, 0xc00061c018}, 0x2afc, {0x4142b20, 0xc0006d6cc0}): alloc_id=c23109e4-beef-ce8d-a50f-dbf47c2782aa driver=raw_exec task_name=runner
2024-09-17T22:07:20.138Z [DEBUG] client.driver_mgr.raw_exec.executor.nomad.exe:         github.com/hashicorp/nomad/drivers/shared/executor/procstats/list_windows.go:25 +0xc5 fp=0xc020d854d0 sp=0xc020d85480 pc=0x21af7c5: alloc_id=c23109e4-beef-ce8d-a50f-dbf47c2782aa driver=raw_exec task_name=runner
2024-09-17T22:07:20.138Z [DEBUG] client.driver_mgr.raw_exec.executor.nomad.exe: github.com/hashicorp/nomad/drivers/shared/executor/procstats.gather(0xc040d83c60, {0x41719b0, 0xc00061c018}, 0x2afc, {0x4142b20, 0xc0006d6da0}): alloc_id=c23109e4-beef-ce8d-a50f-dbf47c2782aa driver=raw_exec task_name=runner
2024-09-17T22:07:20.138Z [DEBUG] client.driver_mgr.raw_exec.executor.nomad.exe:         github.com/hashicorp/nomad/drivers/shared/executor/procstats/list_windows.go:25 +0xc5 fp=0xc020d85520 sp=0xc020d854d0 pc=0x21af7c5: alloc_id=c23109e4-beef-ce8d-a50f-dbf47c2782aa driver=raw_exec task_name=runner
2024-09-17T22:07:20.138Z [DEBUG] client.driver_mgr.raw_exec.executor.nomad.exe: github.com/hashicorp/nomad/drivers/shared/executor/procstats.gather(0xc040d83c60, {0x41719b0, 0xc00061c018}, 0x2afc, {0x4142b20, 0xc0006d6e80}): alloc_id=c23109e4-beef-ce8d-a50f-dbf47c2782aa driver=raw_exec task_name=runner
2024-09-17T22:07:20.138Z [DEBUG] client.driver_mgr.raw_exec.executor.nomad.exe:         github.com/hashicorp/nomad/drivers/shared/executor/procstats/list_windows.go:25 +0xc5 fp=0xc020d85570 sp=0xc020d85520 pc=0x21af7c5: alloc_id=c23109e4-beef-ce8d-a50f-dbf47c2782aa driver=raw_exec task_name=runner
2024-09-17T22:07:20.138Z [DEBUG] client.driver_mgr.raw_exec.executor.nomad.exe: github.com/hashicorp/nomad/drivers/shared/executor/procstats.gather(0xc040d83c60, {0x41719b0, 0xc00061c018}, 0x2afc, {0x4142b20, 0xc0006d7060}): alloc_id=c23109e4-beef-ce8d-a50f-dbf47c2782aa driver=raw_exec task_name=runner
2024-09-17T22:07:20.138Z [DEBUG] client.driver_mgr.raw_exec.executor.nomad.exe:         github.com/hashicorp/nomad/drivers/shared/executor/procstats/list_windows.go:25 +0xc5 fp=0xc020d855c0 sp=0xc020d85570 pc=0x21af7c5: alloc_id=c23109e4-beef-ce8d-a50f-dbf47c2782aa driver=raw_exec task_name=runner
2024-09-17T22:07:20.138Z [DEBUG] client.driver_mgr.raw_exec.executor.nomad.exe: github.com/hashicorp/nomad/drivers/shared/executor/procstats.gather(0xc040d83c60, {0x41719b0, 0xc00061c018}, 0x2afc, {0x4142b20, 0xc0006d6ca0}): alloc_id=c23109e4-beef-ce8d-a50f-dbf47c2782aa driver=raw_exec task_name=runner
2024-09-17T22:07:20.138Z [DEBUG] client.driver_mgr.raw_exec.executor.nomad.exe:         github.com/hashicorp/nomad/drivers/shared/executor/procstats/list_windows.go:25 +0xc5 fp=0xc020d85610 sp=0xc020d855c0 pc=0x21af7c5: alloc_id=c23109e4-beef-ce8d-a50f-dbf47c2782aa driver=raw_exec task_name=runner
2024-09-17T22:07:20.138Z [DEBUG] client.driver_mgr.raw_exec.executor.nomad.exe: github.com/hashicorp/nomad/drivers/shared/executor/procstats.gather(0xc040d83c60, {0x41719b0, 0xc00061c018}, 0x2afc, {0x4142b20, 0xc0006d6cc0}): alloc_id=c23109e4-beef-ce8d-a50f-dbf47c2782aa driver=raw_exec task_name=runner
2024-09-17T22:07:20.138Z [DEBUG] client.driver_mgr.raw_exec.executor.nomad.exe:         github.com/hashicorp/nomad/drivers/shared/executor/procstats/list_windows.go:25 +0xc5 fp=0xc020d85660 sp=0xc020d85610 pc=0x21af7c5: alloc_id=c23109e4-beef-ce8d-a50f-dbf47c2782aa driver=raw_exec task_name=runner
2024-09-17T22:07:20.138Z [DEBUG] client.driver_mgr.raw_exec.executor.nomad.exe: github.com/hashicorp/nomad/drivers/shared/executor/procstats.gather(0xc040d83c60, {0x41719b0, 0xc00061c018}, 0x2afc, {0x4142b20, 0xc0006d6da0}): alloc_id=c23109e4-beef-ce8d-a50f-dbf47c2782aa driver=raw_exec task_name=runner
2024-09-17T22:07:20.139Z [DEBUG] client.driver_mgr.raw_exec.executor.nomad.exe:         github.com/hashicorp/nomad/drivers/shared/executor/procstats/list_windows.go:25 +0xc5 fp=0xc020d856b0 sp=0xc020d85660 pc=0x21af7c5: alloc_id=c23109e4-beef-ce8d-a50f-dbf47c2782aa driver=raw_exec task_name=runner
2024-09-17T22:07:20.139Z [DEBUG] client.driver_mgr.raw_exec.executor.nomad.exe: github.com/hashicorp/nomad/drivers/shared/executor/procstats.gather(0xc040d83c60, {0x41719b0, 0xc00061c018}, 0x2afc, {0x4142b20, 0xc0006d6e80}): alloc_id=c23109e4-beef-ce8d-a50f-dbf47c2782aa driver=raw_exec task_name=runner
2024-09-17T22:07:20.139Z [DEBUG] client.driver_mgr.raw_exec.executor.nomad.exe:         github.com/hashicorp/nomad/drivers/shared/executor/procstats/list_windows.go:25 +0xc5 fp=0xc020d85700 sp=0xc020d856b0 pc=0x21af7c5: alloc_id=c23109e4-beef-ce8d-a50f-dbf47c2782aa driver=raw_exec task_name=runner
2024-09-17T22:07:20.139Z [DEBUG] client.driver_mgr.raw_exec.executor.nomad.exe: github.com/hashicorp/nomad/drivers/shared/executor/procstats.gather(0xc040d83c60, {0x41719b0, 0xc00061c018}, 0x2afc, {0x4142b20, 0xc0006d7060}): alloc_id=c23109e4-beef-ce8d-a50f-dbf47c2782aa driver=raw_exec task_name=runner
2024-09-17T22:07:20.139Z [DEBUG] client.driver_mgr.raw_exec.executor.nomad.exe:         github.com/hashicorp/nomad/drivers/shared/executor/procstats/list_windows.go:25 +0xc5 fp=0xc020d85750 sp=0xc020d85700 pc=0x21af7c5: alloc_id=c23109e4-beef-ce8d-a50f-dbf47c2782aa driver=raw_exec task_name=runner
2024-09-17T22:07:20.139Z [DEBUG] client.driver_mgr.raw_exec.executor.nomad.exe: github.com/hashicorp/nomad/drivers/shared/executor/procstats.gather(0xc040d83c60, {0x41719b0, 0xc00061c018}, 0x2afc, {0x4142b20, 0xc0006d6ca0}): alloc_id=c23109e4-beef-ce8d-a50f-dbf47c2782aa driver=raw_exec task_name=runner
2024-09-17T22:07:20.139Z [DEBUG] client.driver_mgr.raw_exec.executor.nomad.exe:         github.com/hashicorp/nomad/drivers/shared/executor/procstats/list_windows.go:25 +0xc5 fp=0xc020d857a0 sp=0xc020d85750 pc=0x21af7c5: alloc_id=c23109e4-beef-ce8d-a50f-dbf47c2782aa driver=raw_exec task_name=runner
2024-09-17T22:07:20.139Z [DEBUG] client.driver_mgr.raw_exec.executor.nomad.exe: github.com/hashicorp/nomad/drivers/shared/executor/procstats.gather(0xc040d83c60, {0x41719b0, 0xc00061c018}, 0x2afc, {0x4142b20, 0xc0006d6cc0}): alloc_id=c23109e4-beef-ce8d-a50f-dbf47c2782aa driver=raw_exec task_name=runner
2024-09-17T22:07:20.139Z [DEBUG] client.driver_mgr.raw_exec.executor.nomad.exe:         github.com/hashicorp/nomad/drivers/shared/executor/procstats/list_windows.go:25 +0xc5 fp=0xc020d857f0 sp=0xc020d857a0 pc=0x21af7c5: alloc_id=c23109e4-beef-ce8d-a50f-dbf47c2782aa driver=raw_exec task_name=runner
2024-09-17T22:07:20.139Z [DEBUG] client.driver_mgr.raw_exec.executor.nomad.exe: github.com/hashicorp/nomad/drivers/shared/executor/procstats.gather(0xc040d83c60, {0x41719b0, 0xc00061c018}, 0x2afc, {0x4142b20, 0xc0006d6da0}): alloc_id=c23109e4-beef-ce8d-a50f-dbf47c2782aa driver=raw_exec task_name=runner
2024-09-17T22:07:20.139Z [DEBUG] client.driver_mgr.raw_exec.executor.nomad.exe:         github.com/hashicorp/nomad/drivers/shared/executor/procstats/list_windows.go:25 +0xc5 fp=0xc020d85840 sp=0xc020d857f0 pc=0x21af7c5: alloc_id=c23109e4-beef-ce8d-a50f-dbf47c2782aa driver=raw_exec task_name=runner
2024-09-17T22:07:20.139Z [DEBUG] client.driver_mgr.raw_exec.executor.nomad.exe: github.com/hashicorp/nomad/drivers/shared/executor/procstats.gather(0xc040d83c60, {0x41719b0, 0xc00061c018}, 0x2afc, {0x4142b20, 0xc0006d6e80}): alloc_id=c23109e4-beef-ce8d-a50f-dbf47c2782aa driver=raw_exec task_name=runner
2024-09-17T22:07:20.139Z [DEBUG] client.driver_mgr.raw_exec.executor.nomad.exe:         github.com/hashicorp/nomad/drivers/shared/executor/procstats/list_windows.go:25 +0xc5 fp=0xc020d85890 sp=0xc020d85840 pc=0x21af7c5: alloc_id=c23109e4-beef-ce8d-a50f-dbf47c2782aa driver=raw_exec task_name=runner
2024-09-17T22:07:20.139Z [DEBUG] client.driver_mgr.raw_exec.executor.nomad.exe: github.com/hashicorp/nomad/drivers/shared/executor/procstats.gather(0xc040d83c60, {0x41719b0, 0xc00061c018}, 0x2afc, {0x4142b20, 0xc0006d7060}): alloc_id=c23109e4-beef-ce8d-a50f-dbf47c2782aa driver=raw_exec task_name=runner
2024-09-17T22:07:20.139Z [DEBUG] client.driver_mgr.raw_exec.executor.nomad.exe:         github.com/hashicorp/nomad/drivers/shared/executor/procstats/list_windows.go:25 +0xc5 fp=0xc020d858e0 sp=0xc020d85890 pc=0x21af7c5: alloc_id=c23109e4-beef-ce8d-a50f-dbf47c2782aa driver=raw_exec task_name=runner
2024-09-17T22:07:20.139Z [DEBUG] client.driver_mgr.raw_exec.executor.nomad.exe: github.com/hashicorp/nomad/drivers/shared/executor/procstats.gather(0xc040d83c60, {0x41719b0, 0xc00061c018}, 0x2afc, {0x4142b20, 0xc0006d6ca0}): alloc_id=c23109e4-beef-ce8d-a50f-dbf47c2782aa driver=raw_exec task_name=runner
2024-09-17T22:07:20.139Z [DEBUG] client.driver_mgr.raw_exec.executor.nomad.exe:         github.com/hashicorp/nomad/drivers/shared/executor/procstats/list_windows.go:25 +0xc5 fp=0xc020d85930 sp=0xc020d858e0 pc=0x21af7c5: alloc_id=c23109e4-beef-ce8d-a50f-dbf47c2782aa driver=raw_exec task_name=runner
2024-09-17T22:07:20.139Z [DEBUG] client.driver_mgr.raw_exec.executor.nomad.exe: github.com/hashicorp/nomad/drivers/shared/executor/procstats.gather(0xc040d83c60, {0x41719b0, 0xc00061c018}, 0x2afc, {0x4142b20, 0xc0006d6cc0}): alloc_id=c23109e4-beef-ce8d-a50f-dbf47c2782aa driver=raw_exec task_name=runner
2024-09-17T22:07:20.139Z [DEBUG] client.driver_mgr.raw_exec.executor.nomad.exe:         github.com/hashicorp/nomad/drivers/shared/executor/procstats/list_windows.go:25 +0xc5 fp=0xc020d85980 sp=0xc020d85930 pc=0x21af7c5: alloc_id=c23109e4-beef-ce8d-a50f-dbf47c2782aa driver=raw_exec task_name=runner
2024-09-17T22:07:20.139Z [DEBUG] client.driver_mgr.raw_exec.executor.nomad.exe: github.com/hashicorp/nomad/drivers/shared/executor/procstats.gather(0xc040d83c60, {0x41719b0, 0xc00061c018}, 0x2afc, {0x4142b20, 0xc0006d6da0}): alloc_id=c23109e4-beef-ce8d-a50f-dbf47c2782aa driver=raw_exec task_name=runner
2024-09-17T22:07:20.139Z [DEBUG] client.driver_mgr.raw_exec.executor.nomad.exe:         github.com/hashicorp/nomad/drivers/shared/executor/procstats/list_windows.go:25 +0xc5 fp=0xc020d859d0 sp=0xc020d85980 pc=0x21af7c5: alloc_id=c23109e4-beef-ce8d-a50f-dbf47c2782aa driver=raw_exec task_name=runner
2024-09-17T22:07:20.139Z [DEBUG] client.driver_mgr.raw_exec.executor.nomad.exe: github.com/hashicorp/nomad/drivers/shared/executor/procstats.gather(0xc040d83c60, {0x41719b0, 0xc00061c018}, 0x2afc, {0x4142b20, 0xc0006d6e80}): alloc_id=c23109e4-beef-ce8d-a50f-dbf47c2782aa driver=raw_exec task_name=runner
2024-09-17T22:07:20.139Z [DEBUG] client.driver_mgr.raw_exec.executor.nomad.exe:         github.com/hashicorp/nomad/drivers/shared/executor/procstats/list_windows.go:25 +0xc5 fp=0xc020d85a20 sp=0xc020d859d0 pc=0x21af7c5: alloc_id=c23109e4-beef-ce8d-a50f-dbf47c2782aa driver=raw_exec task_name=runner
2024-09-17T22:07:20.139Z [DEBUG] client.driver_mgr.raw_exec.executor.nomad.exe: github.com/hashicorp/nomad/drivers/shared/executor/procstats.gather(0xc040d83c60, {0x41719b0, 0xc00061c018}, 0x2afc, {0x4142b20, 0xc0006d7060}): alloc_id=c23109e4-beef-ce8d-a50f-dbf47c2782aa driver=raw_exec task_name=runner
2024-09-17T22:07:20.139Z [DEBUG] client.driver_mgr.raw_exec.executor.nomad.exe:         github.com/hashicorp/nomad/drivers/shared/executor/procstats/list_windows.go:25 +0xc5 fp=0xc020d85a70 sp=0xc020d85a20 pc=0x21af7c5: alloc_id=c23109e4-beef-ce8d-a50f-dbf47c2782aa driver=raw_exec task_name=runner
2024-09-17T22:07:20.139Z [DEBUG] client.driver_mgr.raw_exec.executor.nomad.exe: github.com/hashicorp/nomad/drivers/shared/executor/procstats.gather(0xc040d83c60, {0x41719b0, 0xc00061c018}, 0x2afc, {0x4142b20, 0xc0006d6ca0}): alloc_id=c23109e4-beef-ce8d-a50f-dbf47c2782aa driver=raw_exec task_name=runner
2024-09-17T22:07:20.139Z [DEBUG] client.driver_mgr.raw_exec.executor.nomad.exe:         github.com/hashicorp/nomad/drivers/shared/executor/procstats/list_windows.go:25 +0xc5 fp=0xc020d85ac0 sp=0xc020d85a70 pc=0x21af7c5: alloc_id=c23109e4-beef-ce8d-a50f-dbf47c2782aa driver=raw_exec task_name=runner
2024-09-17T22:07:20.139Z [DEBUG] client.driver_mgr.raw_exec.executor.nomad.exe: github.com/hashicorp/nomad/drivers/shared/executor/procstats.gather(0xc040d83c60, {0x41719b0, 0xc00061c018}, 0x2afc, {0x4142b20, 0xc0006d6cc0}): alloc_id=c23109e4-beef-ce8d-a50f-dbf47c2782aa driver=raw_exec task_name=runner
2024-09-17T22:07:20.139Z [DEBUG] client.driver_mgr.raw_exec.executor.nomad.exe:         github.com/hashicorp/nomad/drivers/shared/executor/procstats/list_windows.go:25 +0xc5 fp=0xc020d85b10 sp=0xc020d85ac0 pc=0x21af7c5: alloc_id=c23109e4-beef-ce8d-a50f-dbf47c2782aa driver=raw_exec task_name=runner
2024-09-17T22:07:20.139Z [DEBUG] client.driver_mgr.raw_exec.executor.nomad.exe: github.com/hashicorp/nomad/drivers/shared/executor/procstats.gather(0xc040d83c60, {0x41719b0, 0xc00061c018}, 0x2afc, {0x4142b20, 0xc0006d6da0}): alloc_id=c23109e4-beef-ce8d-a50f-dbf47c2782aa driver=raw_exec task_name=runner
2024-09-17T22:07:20.139Z [DEBUG] client.driver_mgr.raw_exec.executor.nomad.exe:         github.com/hashicorp/nomad/drivers/shared/executor/procstats/list_windows.go:25 +0xc5 fp=0xc020d85b60 sp=0xc020d85b10 pc=0x21af7c5: alloc_id=c23109e4-beef-ce8d-a50f-dbf47c2782aa driver=raw_exec task_name=runner
2024-09-17T22:07:20.139Z [DEBUG] client.driver_mgr.raw_exec.executor.nomad.exe: github.com/hashicorp/nomad/drivers/shared/executor/procstats.gather(0xc040d83c60, {0x41719b0, 0xc00061c018}, 0x2afc, {0x4142b20, 0xc0006d6e80}): alloc_id=c23109e4-beef-ce8d-a50f-dbf47c2782aa driver=raw_exec task_name=runner
2024-09-17T22:07:20.139Z [DEBUG] client.driver_mgr.raw_exec.executor.nomad.exe:         github.com/hashicorp/nomad/drivers/shared/executor/procstats/list_windows.go:25 +0xc5 fp=0xc020d85bb0 sp=0xc020d85b60 pc=0x21af7c5: alloc_id=c23109e4-beef-ce8d-a50f-dbf47c2782aa driver=raw_exec task_name=runner
2024-09-17T22:07:20.139Z [DEBUG] client.driver_mgr.raw_exec.executor.nomad.exe: github.com/hashicorp/nomad/drivers/shared/executor/procstats.gather(0xc040d83c60, {0x41719b0, 0xc00061c018}, 0x2afc, {0x4142b20, 0xc0006d7060}): alloc_id=c23109e4-beef-ce8d-a50f-dbf47c2782aa driver=raw_exec task_name=runner
2024-09-17T22:07:20.139Z [DEBUG] client.driver_mgr.raw_exec.executor.nomad.exe:         github.com/hashicorp/nomad/drivers/shared/executor/procstats/list_windows.go:25 +0xc5 fp=0xc020d85c00 sp=0xc020d85bb0 pc=0x21af7c5: alloc_id=c23109e4-beef-ce8d-a50f-dbf47c2782aa driver=raw_exec task_name=runner
2024-09-17T22:07:20.139Z [DEBUG] client.driver_mgr.raw_exec.executor.nomad.exe: github.com/hashicorp/nomad/drivers/shared/executor/procstats.gather(0xc040d83c60, {0x41719b0, 0xc00061c018}, 0x2afc, {0x4142b20, 0xc0006d6ca0}): alloc_id=c23109e4-beef-ce8d-a50f-dbf47c2782aa driver=raw_exec task_name=runner
2024-09-17T22:07:20.139Z [DEBUG] client.driver_mgr.raw_exec.executor.nomad.exe:         github.com/hashicorp/nomad/drivers/shared/executor/procstats/list_windows.go:25 +0xc5 fp=0xc020d85c50 sp=0xc020d85c00 pc=0x21af7c5: alloc_id=c23109e4-beef-ce8d-a50f-dbf47c2782aa driver=raw_exec task_name=runner
2024-09-17T22:07:20.139Z [DEBUG] client.driver_mgr.raw_exec.executor.nomad.exe: github.com/hashicorp/nomad/drivers/shared/executor/procstats.gather(0xc040d83c60, {0x41719b0, 0xc00061c018}, 0x2afc, {0x4142b20, 0xc0006d6cc0}): alloc_id=c23109e4-beef-ce8d-a50f-dbf47c2782aa driver=raw_exec task_name=runner
2024-09-17T22:07:20.139Z [DEBUG] client.driver_mgr.raw_exec.executor.nomad.exe:         github.com/hashicorp/nomad/drivers/shared/executor/procstats/list_windows.go:25 +0xc5 fp=0xc020d85ca0 sp=0xc020d85c50 pc=0x21af7c5: alloc_id=c23109e4-beef-ce8d-a50f-dbf47c2782aa driver=raw_exec task_name=runner
2024-09-17T22:07:20.139Z [DEBUG] client.driver_mgr.raw_exec.executor.nomad.exe: github.com/hashicorp/nomad/drivers/shared/executor/procstats.gather(0xc040d83c60, {0x41719b0, 0xc00061c018}, 0x2afc, {0x4142b20, 0xc0006d6da0}): alloc_id=c23109e4-beef-ce8d-a50f-dbf47c2782aa driver=raw_exec task_name=runner
2024-09-17T22:07:20.139Z [DEBUG] client.driver_mgr.raw_exec.executor.nomad.exe:         github.com/hashicorp/nomad/drivers/shared/executor/procstats/list_windows.go:25 +0xc5 fp=0xc020d85cf0 sp=0xc020d85ca0 pc=0x21af7c5: alloc_id=c23109e4-beef-ce8d-a50f-dbf47c2782aa driver=raw_exec task_name=runner
2024-09-17T22:07:20.139Z [DEBUG] client.driver_mgr.raw_exec.executor.nomad.exe: github.com/hashicorp/nomad/drivers/shared/executor/procstats.gather(0xc040d83c60, {0x41719b0, 0xc00061c018}, 0x2afc, {0x4142b20, 0xc0006d6e80}): alloc_id=c23109e4-beef-ce8d-a50f-dbf47c2782aa driver=raw_exec task_name=runner
2024-09-17T22:07:20.139Z [DEBUG] client.driver_mgr.raw_exec.executor.nomad.exe:         github.com/hashicorp/nomad/drivers/shared/executor/procstats/list_windows.go:25 +0xc5 fp=0xc020d85d40 sp=0xc020d85cf0 pc=0x21af7c5: alloc_id=c23109e4-beef-ce8d-a50f-dbf47c2782aa driver=raw_exec task_name=runner
2024-09-17T22:07:20.139Z [DEBUG] client.driver_mgr.raw_exec.executor.nomad.exe: github.com/hashicorp/nomad/drivers/shared/executor/procstats.gather(0xc040d83c60, {0x41719b0, 0xc00061c018}, 0x2afc, {0x4142b20, 0xc0006d7060}): alloc_id=c23109e4-beef-ce8d-a50f-dbf47c2782aa driver=raw_exec task_name=runner
2024-09-17T22:07:20.139Z [DEBUG] client.driver_mgr.raw_exec.executor.nomad.exe:         github.com/hashicorp/nomad/drivers/shared/executor/procstats/list_windows.go:25 +0xc5 fp=0xc020d85d90 sp=0xc020d85d40 pc=0x21af7c5: alloc_id=c23109e4-beef-ce8d-a50f-dbf47c2782aa driver=raw_exec task_name=runner
2024-09-17T22:07:20.139Z [DEBUG] client.driver_mgr.raw_exec.executor.nomad.exe: github.com/hashicorp/nomad/drivers/shared/executor/procstats.gather(0xc040d83c60, {0x41719b0, 0xc00061c018}, 0x2afc, {0x4142b20, 0xc0006d6ca0}): alloc_id=c23109e4-beef-ce8d-a50f-dbf47c2782aa driver=raw_exec task_name=runner
2024-09-17T22:07:20.139Z [DEBUG] client.driver_mgr.raw_exec.executor.nomad.exe:         github.com/hashicorp/nomad/drivers/shared/executor/procstats/list_windows.go:25 +0xc5 fp=0xc020d85de0 sp=0xc020d85d90 pc=0x21af7c5: alloc_id=c23109e4-beef-ce8d-a50f-dbf47c2782aa driver=raw_exec task_name=runner
2024-09-17T22:07:20.139Z [DEBUG] client.driver_mgr.raw_exec.executor.nomad.exe: github.com/hashicorp/nomad/drivers/shared/executor/procstats.gather(0xc040d83c60, {0x41719b0, 0xc00061c018}, 0x2afc, {0x4142b20, 0xc0006d6cc0}): alloc_id=c23109e4-beef-ce8d-a50f-dbf47c2782aa driver=raw_exec task_name=runner
2024-09-17T22:07:20.139Z [DEBUG] client.driver_mgr.raw_exec.executor.nomad.exe:         github.com/hashicorp/nomad/drivers/shared/executor/procstats/list_windows.go:25 +0xc5 fp=0xc020d85e30 sp=0xc020d85de0 pc=0x21af7c5: alloc_id=c23109e4-beef-ce8d-a50f-dbf47c2782aa driver=raw_exec task_name=runner
2024-09-17T22:07:20.139Z [DEBUG] client.driver_mgr.raw_exec.executor.nomad.exe: github.com/hashicorp/nomad/drivers/shared/executor/procstats.gather(0xc040d83c60, {0x41719b0, 0xc00061c018}, 0x2afc, {0x4142b20, 0xc0006d6da0}): alloc_id=c23109e4-beef-ce8d-a50f-dbf47c2782aa driver=raw_exec task_name=runner
2024-09-17T22:07:20.139Z [DEBUG] client.driver_mgr.raw_exec.executor.nomad.exe:         github.com/hashicorp/nomad/drivers/shared/executor/procstats/list_windows.go:25 +0xc5 fp=0xc020d85e80 sp=0xc020d85e30 pc=0x21af7c5: 
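
In the frames above, the first three arguments to `gather` stay constant while the last one cycles through the same five pointer values (`0xc0006d6e80`, `0xc0006d7060`, `0xc0006d6ca0`, `0xc0006d6cc0`, `0xc0006d6da0`). That would be consistent with the walk revisiting the same five processes in a loop. The following is only a minimal sketch of how such a loop could arise and be guarded against. It assumes, without having been checked against the Nomad source, that the walk follows parent PIDs upward, and that a stale parent PID (for example after PID reuse on Windows) can make the parent links form a cycle. The `parentOf` map, the `descendsFrom` helper, and all PIDs except the root `0x2afc` from the trace are hypothetical.

```go
package main

import "fmt"

// descendsFrom reports whether pid has root as an ancestor, following the
// parent links in parentOf (a hypothetical pid -> parent pid snapshot).
// The seen set stops the walk if the parent links form a cycle, which an
// unguarded recursive walk would follow until the stack overflows.
func descendsFrom(parentOf map[int]int, root, pid int) bool {
	seen := make(map[int]struct{})
	for {
		if pid == root {
			return true
		}
		if _, dup := seen[pid]; dup {
			return false // cycle: this pid was already visited
		}
		seen[pid] = struct{}{}

		parent, ok := parentOf[pid]
		if !ok {
			return false // parent is not in the snapshot
		}
		pid = parent
	}
}

func main() {
	const root = 0x2afc // the constant third argument in the trace

	// A well-formed chain: 200 -> 100 -> root.
	healthy := map[int]int{200: 100, 100: root}

	// Five processes whose parent links form a loop that never reaches root.
	cyclic := map[int]int{1: 2, 2: 3, 3: 4, 4: 5, 5: 1}

	fmt.Println(descendsFrom(healthy, root, 200)) // true
	fmt.Println(descendsFrom(cyclic, root, 1))    // false
}
```

The loop is iterative on purpose: even without the `seen` guard it could not overflow the goroutine stack, and with the guard it terminates after visiting each process at most once.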

Our system is set to try restarting up to three times before reallocating. If it fails three times in a row, it triggers a new allocation. As you can see in the screenshot, even those new allocations can fail at first, but eventually, one of them sticks and things stabilize. The catch is, after a while, the service gets unstable again for the same reason, and the whole cycle starts over.

Screenshot 2024-09-17 at 3 53 10 PM

Reproduction steps

We don't have a synthetic project in which we can reliably reproduce this, as this is happening exclusively on our production workload.

Here are some guesses and some more information about what this process is actually doing.

The main process executed in this Nomad allocation is a Node application that spawns a GitHub Actions runner configured to execute only one kind of job: a Unity build. Unity is a game engine whose builds can consume a lot of resources and put the machine under significant stress. Although we have beefy machines with far more capacity than we have observed this process to use, this could explain part of this behaviour. Moreover, the actual Unity process runs in a Docker for Windows container.

Nomad runs as a Windows Service under the NT AUTHORITY\SYSTEM user on the Windows machine. This is an example of the process tree that our Nomad client spawns for each allocation:

    [1656 ] [nomad.exe]
      [3544 ] [nomad.exe]
        [9024 ] [conhost.exe]
      [10152] [nomad.exe]
        [8260 ] [conhost.exe]
        [5448 ] [powershell.exe]
          [4512 ] [cmd.exe]
            [9272 ] [node.exe]
              [13904] [cmd.exe]
                [13668] [node.exe]
                  [13464] [cmd.exe]
                    [15220] [Runner.Listener.exe]
                      [12008] [Runner.Worker.exe]
                        [10828] [conhost.exe]
                        [7868 ] [powershell.exe]
                          [17072] [cmd.exe]
                            [13480] [node.exe]
                              [17076] [cmd.exe]
                                [8392 ] [node.exe]
                                  [5468 ] [node.exe]
                                    [15016] [docker.exe]

Moreover, it looks like when this happens, only the Nomad executor dies and the underlying process is left alive. This is a problem because these processes then live outside our Nomad cluster and take up resources on the machines.

Finally, we started to experience this issue after upgrading Nomad from version 1.7.7 to 1.8.3. We are considering downgrading because of this issue, but it would be nice to understand if we can do anything to mitigate at all.

If it helps, these are normal EC2 machines in the AWS Cloud.

Expected Result

The Nomad allocations run normally and don't crash with a Go panic.

Actual Result

The Nomad allocations crash with a Go panic, and the underlying spawned process is left alive, taking up resources on the machine.

@jrasell
Member

jrasell commented Sep 18, 2024

Hi @albertofem-scopely and thanks for raising this issue with the detail included. This looks like a problem with the stats gathering here and something we should look into fixing. I'll mark this for roadmapping and also raise it internally.

understand if we can do anything to mitigate at all

I've given it a little thought, and I am not aware of a workaround for this currently. If I do think of something, I'll be sure to note it here.

@jrasell jrasell added theme/platform-windows theme/client stage/accepted Confirmed, and intend to work on. No timeline commitment though. hcc/jira labels Sep 18, 2024
Projects
Status: Needs Roadmapping
Development

No branches or pull requests

2 participants