Receptor Work unit expired #890
Comments
Ensure the time on the AWX and Receptor nodes is in sync.
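As a quick sanity check, a minimal sketch of how clock sync can be verified on a RHEL execution node (this assumes chrony is the time service, as on a default install):

# Show whether the system clock is reported as synchronized
timedatectl status

# With chrony, show the current offset from the configured NTP source
chronyc tracking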
@Iyappanj
@kurokobo Yes, the time is in sync, but I still see this issue sometimes, and it then resolves by itself for a few nodes.
Ah, sorry, I misread the message as an error about token expiration.
Hi everyone,
Similar to the OP, I'm encountering this issue on an execution node hosted on RHEL servers. We deployed AWX in production a while ago, and this is the first time we've experienced a problem with the execution nodes.

Problem description
I have many errors in my receptor logs:
ERROR 2024/01/11 09:46:21 Error locating unit: IbVMji5u
ERROR 2024/01/11 09:46:21 : unknown work unit IbVMji5u
These errors have been there for a while and never caused any problems, but we now have more jobs running on AWX. After a moment, I experience a timeout between the control node (the awx-task pod) and the execution node:
Wed Jan 10 15:56:24 UTC 2024
0: awx-task-79bc97cd7-mrb5g in 288.899µs
1: XX.XX.XX.XX in 1.808779ms
Wed Jan 10 15:56:25 UTC 2024
0: awx-task-79bc97cd7-mrb5g in 230.033µs
ERROR: 1: Error timeout from in 10.000210552s
I can't figure out what could be causing this timeout. When my execution node switches from the 'ready' to the 'unavailable' state, AWX reports:
Receptor error from XX.XX.XX.XX, detail:
Work unit expired on Thu Jan 10 16:03:12
At this moment, my only workaround is to restart the receptor service. I've already checked some things:
@kurokobo or someone else, do you have an idea, please? I'm running out of ideas here ...

Additional information
Execution node VM information:
NAME="Red Hat Enterprise Linux"
VERSION="8.9 (Ootpa)"
ID="rhel"
ID_LIKE="fedora"
VERSION_ID="8.9"
PLATFORM_ID="platform:el8"
...

AWX information:
Kubernetes version: v1.25.6
AWX version: 23.6.0
AWX Operator: 2.10.0
PostgreSQL version: 15

Receptor information:
receptorctl 1.4.3
receptor v1.4.3

Ansible-runner version:
ansible-runner 2.3.4

Podman information:
host:
arch: amd64
buildahVersion: 1.31.3
cgroupControllers: []
cgroupManager: cgroupfs
cgroupVersion: v1
conmon:
package: conmon-2.1.8-1.module+el8.9.0+20326+387084d0.x86_64
path: /usr/bin/conmon
version: 'conmon version 2.1.8, commit: 579be593361fffcf49b6c5ba4006f2075fd1f52d'
cpuUtilization:
idlePercent: 96.97
systemPercent: 0.4
userPercent: 2.63
cpus: 4
databaseBackend: boltdb
distribution:
distribution: '"rhel"'
version: "8.9"
eventLogger: file
freeLocks: 2046
hostname: XX.XX.XX.XX
idMappings:
gidmap:
- container_id: 0
host_id: 21000
size: 1
- container_id: 1
host_id: 1214112
size: 65536
uidmap:
- container_id: 0
host_id: 12007
size: 1
- container_id: 1
host_id: 1214112
size: 65536
kernel: 4.18.0-513.9.1.el8_9.x86_64
linkmode: dynamic
logDriver: k8s-file
memFree: 11834806272
memTotal: 16480423936
networkBackend: cni
networkBackendInfo:
backend: cni
dns:
package: podman-plugins-4.6.1-4.module+el8.9.0+20326+387084d0.x86_64
path: /usr/libexec/cni/dnsname
version: |-
CNI dnsname plugin
version: 1.3.1
commit: unknown
package: containernetworking-plugins-1.3.0-4.module+el8.9.0+20326+387084d0.x86_64
path: /usr/libexec/cni
ociRuntime:
name: crun
package: crun-1.8.7-1.module+el8.9.0+20326+387084d0.x86_64
path: /usr/bin/crun
version: |-
crun version 1.8.7
commit: 53a9996ce82d1ee818349bdcc64797a1fa0433c4
rundir: /tmp/podman-run-12007/crun
spec: 1.0.0
+SYSTEMD +SELINUX +APPARMOR +CAP +SECCOMP +EBPF +CRIU +YAJL
os: linux
pasta:
executable: ""
package: ""
version: ""
remoteSocket:
path: /tmp/podman-run-12007/podman/podman.sock
security:
apparmorEnabled: false
capabilities: CAP_NET_RAW,CAP_CHOWN,CAP_DAC_OVERRIDE,CAP_FOWNER,CAP_FSETID,CAP_KILL,CAP_NET_BIND_SERVICE,CAP_SETFCAP,CAP_SETGID,CAP_SETPCAP,CAP_SETUID,CAP_SYS_CHROOT
rootless: true
seccompEnabled: true
seccompProfilePath: /usr/share/containers/seccomp.json
selinuxEnabled: false
serviceIsRemote: false
slirp4netns:
executable: /usr/bin/slirp4netns
package: slirp4netns-1.2.1-1.module+el8.9.0+20326+387084d0.x86_64
version: |-
slirp4netns version 1.2.1
commit: 09e31e92fa3d2a1d3ca261adaeb012c8d75a8194
libslirp: 4.4.0
SLIRP_CONFIG_VERSION_MAX: 3
libseccomp: 2.5.2
swapFree: 4294963200
swapTotal: 4294963200
uptime: 18h 13m 25.00s (Approximately 0.75 days)
plugins:
authorization: null
log:
- k8s-file
- none
- passthrough
- journald
network:
- bridge
- macvlan
- ipvlan
volume:
- local
registries:
search:
- docker.io
store:
configFile: /home/awx/.config/containers/storage.conf
containerStore:
number: 2
paused: 0
running: 1
stopped: 1
graphDriverName: overlay
graphOptions:
overlay.mount_program:
Executable: /usr/bin/fuse-overlayfs
Package: fuse-overlayfs-1.12-1.module+el8.9.0+20326+387084d0.x86_64
Version: |-
fusermount3 version: 3.3.0
fuse-overlayfs: version 1.12
FUSE library version 3.3.0
using FUSE kernel interface version 7.26
graphRoot: /home/awx/.local/share/containers/storage
graphRootAllocated: 110867910656
graphRootUsed: 10662989824
graphStatus:
Backing Filesystem: extfs
Native Overlay Diff: "false"
Supports d_type: "true"
Using metacopy: "false"
imageCopyTmpDir: /var/tmp
imageStore:
number: 13
runRoot: /tmp/podman-run-12007/containers
transientStore: false
volumePath: /home/awx/.local/share/containers/storage/volumes
version:
APIVersion: 4.6.1
Built: 1700309421
BuiltTime: Sat Nov 18 12:10:21 2023
GitCommit: ""
GoVersion: go1.19.13
Os: linux
OsArch: linux/amd64
Version: 4.6.1
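As a side note, the ping/traceroute timeout shown above can be reproduced from the control plane with receptorctl; a rough sketch, assuming the awx-task deployment and awx-ee container names of a default AWX install, the default socket path, and a placeholder execution node name:

# Receptor's view of known nodes and connections, from the control-plane container
kubectl exec -it deployment/awx-task -c awx-ee -- \
  receptorctl --socket /var/run/receptor/receptor.sock status

# Round-trip check against the execution node (replace the node name)
kubectl exec -it deployment/awx-task -c awx-ee -- \
  receptorctl --socket /var/run/receptor/receptor.sock ping my-exec-node

# Hop-by-hop route, similar to the dated output pasted above
kubectl exec -it deployment/awx-task -c awx-ee -- \
  receptorctl --socket /var/run/receptor/receptor.sock traceroute my-exec-node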
The issue is not related to:
ERROR 2024/01/11 09:46:21 Error locating unit: IbVMji5u
ERROR 2024/01/11 09:46:21 : unknown work unit IbVMji5u
I tried disabling the cleanup from AWX and doing it on my side, and I no longer see this error, but my execution node continues to time out randomly.
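For reference, a rough sketch of what that manual cleanup can look like with receptorctl on the execution node; the socket path is an assumption (it depends on the receptor.conf used by your install), and the unit ID is just the one from the errors above:

# List work units known to this node, including finished ones
receptorctl --socket /var/run/receptor/receptor.sock work list

# Release (remove) a finished unit by ID so it no longer lingers on disk
receptorctl --socket /var/run/receptor/receptor.sock work release IbVMji5u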
Hi, I have the same issue, but in my case it started after a Red Hat update on the execution node from 8.8 to the latest 8.8 kernel. Is there no solution? Can you advise, @koro? Thanks for your support.
Similar topic: #934
Could anyone here who is facing this issue share your topology view?
+1, have the same issue.
I ran into this on Friday. My last job was one with id
The failed job ran on aap-1 and I see this in the messages at about that time:
However, this is not the only instance of that error. Please find attached the logs:
Please accept my apologies that some log lines are duplicated in the AWX task logs; this is because I can only download them 500 messages at a time from Google's logging console. AWX is running inside a Google Kubernetes Engine cluster, while aap-0 and aap-1 are running on RHEL 9 VMs inside Google Compute Engine. Here is a screen clip of the topology screen for my cluster, per the comment from @kurokobo. Note that this is after restarting the awx-task deployment, so the awx task node has changed its id. The podman version on aap-1 is 3.4.4. I wonder whether, if I upgraded to a version where containers/conmon#440 has been fixed, I would still see this.
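If it helps with that comparison, a small sketch of checking the podman and conmon versions actually in use on the execution node (the --format path assumes podman's current info field names):

# Installed package versions on the RHEL execution node
rpm -q podman conmon

# conmon version as reported by podman itself
podman info --format '{{.Host.Conmon.Version}}'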
I've had similar issues, and downgrading receptor to version 1.4.2 seems to solve it somehow.
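In case it helps others trying the same thing, a rough sketch of a downgrade on an RPM-based execution node; this assumes receptor was installed from a package repository that still provides 1.4.2 (otherwise the binary from the upstream 1.4.2 release would need to be dropped in manually):

# Stop receptor, downgrade the package, start it again
sudo systemctl stop receptor
sudo dnf downgrade receptor-1.4.2
sudo systemctl start receptor

# Confirm the running version
receptor --version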
Recently we saw one of our receptor nodes showing as unavailable on AWX, with the error below:
Receptor error from XX.XX.XX.XX, detail:
Work unit expired on Mon Oct 30 12:04:34
Restarting the receptor service did not fix the issue. Any idea what is causing this?
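Not an answer, but a minimal sketch of what is usually worth collecting on the execution node when it flips to unavailable (standard systemd tooling; the socket path is an assumption):

# Service state and recent receptor log lines around the failure
systemctl status receptor
journalctl -u receptor --since "1 hour ago"

# Receptor's own view of its connections and known nodes
receptorctl --socket /var/run/receptor/receptor.sock status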