Receptor Work unit expired #890
Comments
Ensure the time on the AWX and Receptor nodes is in sync.
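As a quick sanity check, a minimal sketch of how clock sync can be verified on a RHEL execution node (this assumes chrony is the time service, as on a default install):

# Show whether the system clock is reported as synchronized
timedatectl status

# With chrony, show the current offset from the configured NTP source
chronyc tracking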
@Iyappanj
@kurokobo Yes, the time is in sync, but I still see this issue sometimes, and it then resolves by itself for a few nodes.
Ah, sorry, I misread the message as an error about token expiration.
Hi everyone,
Similar to the OP, I'm encountering this issue on an execution node hosted on RHEL servers. We deployed AWX in production a while ago, and this is the first time we've experienced a problem with the execution nodes.

Problem description
I have many errors in my receptor logs:
ERROR 2024/01/11 09:46:21 Error locating unit: IbVMji5u
ERROR 2024/01/11 09:46:21 : unknown work unit IbVMji5u
These errors have been there for a while and never caused any problems, but we now have more jobs running on AWX. After a moment, I experience a timeout between the control node (the awx-task pod) and the execution node:
Wed Jan 10 15:56:24 UTC 2024
0: awx-task-79bc97cd7-mrb5g in 288.899µs
1: XX.XX.XX.XX in 1.808779ms
Wed Jan 10 15:56:25 UTC 2024
0: awx-task-79bc97cd7-mrb5g in 230.033µs
ERROR: 1: Error timeout from in 10.000210552s
I can't figure out what could be causing this timeout. When my execution node switches from the 'ready' to the 'unavailable' state, AWX reports:
Receptor error from XX.XX.XX.XX, detail:
Work unit expired on Thu Jan 10 16:03:12
At this moment, my only workaround is to restart the receptor service. I've already checked some things:
@kurokobo or someone else, do you have an idea, please? I'm running out of ideas here ...

Additional information
Execution node VM information:
NAME="Red Hat Enterprise Linux"
VERSION="8.9 (Ootpa)"
ID="rhel"
ID_LIKE="fedora"
VERSION_ID="8.9"
PLATFORM_ID="platform:el8"
...

AWX information:
Kubernetes version: v1.25.6
AWX version: 23.6.0
AWX Operator: 2.10.0
PostgreSQL version: 15

Receptor information:
receptorctl 1.4.3
receptor v1.4.3

Ansible-runner version:
ansible-runner 2.3.4

Podman information:
host:
arch: amd64
buildahVersion: 1.31.3
cgroupControllers: []
cgroupManager: cgroupfs
cgroupVersion: v1
conmon:
package: conmon-2.1.8-1.module+el8.9.0+20326+387084d0.x86_64
path: /usr/bin/conmon
version: 'conmon version 2.1.8, commit: 579be593361fffcf49b6c5ba4006f2075fd1f52d'
cpuUtilization:
idlePercent: 96.97
systemPercent: 0.4
userPercent: 2.63
cpus: 4
databaseBackend: boltdb
distribution:
distribution: '"rhel"'
version: "8.9"
eventLogger: file
freeLocks: 2046
hostname: XX.XX.XX.XX
idMappings:
gidmap:
- container_id: 0
host_id: 21000
size: 1
- container_id: 1
host_id: 1214112
size: 65536
uidmap:
- container_id: 0
host_id: 12007
size: 1
- container_id: 1
host_id: 1214112
size: 65536
kernel: 4.18.0-513.9.1.el8_9.x86_64
linkmode: dynamic
logDriver: k8s-file
memFree: 11834806272
memTotal: 16480423936
networkBackend: cni
networkBackendInfo:
backend: cni
dns:
package: podman-plugins-4.6.1-4.module+el8.9.0+20326+387084d0.x86_64
path: /usr/libexec/cni/dnsname
version: |-
CNI dnsname plugin
version: 1.3.1
commit: unknown
package: containernetworking-plugins-1.3.0-4.module+el8.9.0+20326+387084d0.x86_64
path: /usr/libexec/cni
ociRuntime:
name: crun
package: crun-1.8.7-1.module+el8.9.0+20326+387084d0.x86_64
path: /usr/bin/crun
version: |-
crun version 1.8.7
commit: 53a9996ce82d1ee818349bdcc64797a1fa0433c4
rundir: /tmp/podman-run-12007/crun
spec: 1.0.0
+SYSTEMD +SELINUX +APPARMOR +CAP +SECCOMP +EBPF +CRIU +YAJL
os: linux
pasta:
executable: ""
package: ""
version: ""
remoteSocket:
path: /tmp/podman-run-12007/podman/podman.sock
security:
apparmorEnabled: false
capabilities: CAP_NET_RAW,CAP_CHOWN,CAP_DAC_OVERRIDE,CAP_FOWNER,CAP_FSETID,CAP_KILL,CAP_NET_BIND_SERVICE,CAP_SETFCAP,CAP_SETGID,CAP_SETPCAP,CAP_SETUID,CAP_SYS_CHROOT
rootless: true
seccompEnabled: true
seccompProfilePath: /usr/share/containers/seccomp.json
selinuxEnabled: false
serviceIsRemote: false
slirp4netns:
executable: /usr/bin/slirp4netns
package: slirp4netns-1.2.1-1.module+el8.9.0+20326+387084d0.x86_64
version: |-
slirp4netns version 1.2.1
commit: 09e31e92fa3d2a1d3ca261adaeb012c8d75a8194
libslirp: 4.4.0
SLIRP_CONFIG_VERSION_MAX: 3
libseccomp: 2.5.2
swapFree: 4294963200
swapTotal: 4294963200
uptime: 18h 13m 25.00s (Approximately 0.75 days)
plugins:
authorization: null
log:
- k8s-file
- none
- passthrough
- journald
network:
- bridge
- macvlan
- ipvlan
volume:
- local
registries:
search:
- docker.io
store:
configFile: /home/awx/.config/containers/storage.conf
containerStore:
number: 2
paused: 0
running: 1
stopped: 1
graphDriverName: overlay
graphOptions:
overlay.mount_program:
Executable: /usr/bin/fuse-overlayfs
Package: fuse-overlayfs-1.12-1.module+el8.9.0+20326+387084d0.x86_64
Version: |-
fusermount3 version: 3.3.0
fuse-overlayfs: version 1.12
FUSE library version 3.3.0
using FUSE kernel interface version 7.26
graphRoot: /home/awx/.local/share/containers/storage
graphRootAllocated: 110867910656
graphRootUsed: 10662989824
graphStatus:
Backing Filesystem: extfs
Native Overlay Diff: "false"
Supports d_type: "true"
Using metacopy: "false"
imageCopyTmpDir: /var/tmp
imageStore:
number: 13
runRoot: /tmp/podman-run-12007/containers
transientStore: false
volumePath: /home/awx/.local/share/containers/storage/volumes
version:
APIVersion: 4.6.1
Built: 1700309421
BuiltTime: Sat Nov 18 12:10:21 2023
GitCommit: ""
GoVersion: go1.19.13
Os: linux
OsArch: linux/amd64
Version: 4.6.1
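As a side note, the ping/traceroute timeout shown above can be reproduced from the control plane with receptorctl; a rough sketch, assuming the awx-task deployment and awx-ee container names of a default AWX install, the default socket path, and a placeholder execution node name:

# Receptor's view of known nodes and connections, from the control-plane container
kubectl exec -it deployment/awx-task -c awx-ee -- \
  receptorctl --socket /var/run/receptor/receptor.sock status

# Round-trip check against the execution node (replace the node name)
kubectl exec -it deployment/awx-task -c awx-ee -- \
  receptorctl --socket /var/run/receptor/receptor.sock ping my-exec-node

# Hop-by-hop route, similar to the dated output pasted above
kubectl exec -it deployment/awx-task -c awx-ee -- \
  receptorctl --socket /var/run/receptor/receptor.sock traceroute my-exec-node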
The issue is not related to:
ERROR 2024/01/11 09:46:21 Error locating unit: IbVMji5u
ERROR 2024/01/11 09:46:21 : unknown work unit IbVMji5u
I tried disabling the cleanup from AWX and doing it on my side, and I no longer see this error, but my execution node continues to time out randomly.
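For reference, a rough sketch of what that manual cleanup can look like with receptorctl on the execution node; the socket path is an assumption (it depends on the receptor.conf used by your install), and the unit ID is just the one from the errors above:

# List work units known to this node, including finished ones
receptorctl --socket /var/run/receptor/receptor.sock work list

# Release (remove) a finished unit by ID so it no longer lingers on disk
receptorctl --socket /var/run/receptor/receptor.sock work release IbVMji5u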
Hi, I have the same issue, but in my case it started after a Red Hat update on the execution node from 8.8 to the latest 8.8 kernel. Is there no solution? Can you advise, @koro? Thanks for your support.
Similar topic: #934
Could anyone here who is facing this issue share your topology view?
+1, have the same issue.
I ran into this on Friday. My last job was one with id
The failed job ran on aap-1 and I see this in the messages at about that time:
However, this is not the only instance of that error. Please find attached the logs:
Please accept my apologies that some log lines are duplicated in the AWX task logs; this is because I can only download them 500 messages at a time from Google's logging console. AWX is running inside a Google Kubernetes Engine cluster, while aap-0 and aap-1 are running on RHEL 9 VMs inside Google Compute Engine. Here is a screen clip of the topology screen for my cluster, per the comment from @kurokobo. Note that this is after restarting the awx-task deployment, so the awx task node has changed its id. The podman version on aap-1 is 3.4.4. I wonder whether, if I upgraded to a version where containers/conmon#440 has been fixed, I would still see this.
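If it helps with that comparison, a small sketch of checking the podman and conmon versions actually in use on the execution node (the --format path assumes podman's current info field names):

# Installed package versions on the RHEL execution node
rpm -q podman conmon

# conmon version as reported by podman itself
podman info --format '{{.Host.Conmon.Version}}'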
I've had similar issues, and downgrading receptor to version 1.4.2 seems to solve it somehow.
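In case it helps others trying the same thing, a rough sketch of a downgrade on an RPM-based execution node; this assumes receptor was installed from a package repository that still provides 1.4.2 (otherwise the binary from the upstream 1.4.2 release would need to be dropped in manually):

# Stop receptor, downgrade the package, start it again
sudo systemctl stop receptor
sudo dnf downgrade receptor-1.4.2
sudo systemctl start receptor

# Confirm the running version
receptor --version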
Recently we saw one of our receptor nodes showing as unavailable on AWX, with the error below:
Receptor error from XX.XX.XX.XX, detail:
Work unit expired on Mon Oct 30 12:04:34
Restarting the receptor service did not fix the issue. Any idea what is causing this?
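Not an answer, but a minimal sketch of what is usually worth collecting on the execution node when it flips to unavailable (standard systemd tooling; the socket path is an assumption):

# Service state and recent receptor log lines around the failure
systemctl status receptor
journalctl -u receptor --since "1 hour ago"

# Receptor's own view of its connections and known nodes
receptorctl --socket /var/run/receptor/receptor.sock status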