Stuck and inconsistent consumer when deleting non-acked messages #6374

Open
roeschter opened this issue Jan 14, 2025 · 1 comment
Labels
defect Suspected defect such as a bug or regression

Comments

@roeschter

Observed behavior

A consumer gets stuck and eventually becomes unstuck after a random time, ranging from a few seconds to an hour. Even before the consumer gets stuck, the statistics reported by nats consumer info appear incorrect: Outstanding Acks are too high, and the redelivery count jumps wildly and sometimes goes backwards.

The effect was observed independently in two separate environments. It is somewhat random, though, and may take a few minutes to appear:

  1. Java, Windows 11, local 3 node cluster
  2. Golang, Cloud Linux, 3 node cluster

Expected behavior

Consistent consumer behavior even when messages marked for redelivery are deleted

Server and client version

NATS server 2.10.24
Latest Java client
Latest Go client

Host environment

Windows 11 as well as Cloud Linux

Steps to reproduce

Conceptually:

  1. A high percentage of messages not acked and being redelivered
  2. A high percentage of messages deleted while pending redelivery
  3. 3 node cluster

Exact steps:

  1. Start a 3 node cluster
  2. Start the Feeder (recreates a replica 3 stream)
  3. Start the Consumer
  4. Wait for "consumer stalled"
  5. It may be necessary to stop and restart the consumer.
  6. Observe the consumer info: the redelivery count may go down, and outstanding acks may grow to unrealistic values.

Reproducer20250114_noack_delete.zip

@roeschter roeschter added the defect Suspected defect such as a bug or regression label Jan 14, 2025
@tehsphinx

tehsphinx commented Jan 14, 2025

@roeschter thank you for reporting this!

We are seeing this behaviour as well on a Linux-based 3 node cluster installation. I was also able to reproduce it on a Docker-based local cluster (macOS).

After a few hundred to a few thousand messages, our consumers have unprocessed messages that are not delivered even though there is 1 pull request waiting per consumer:

(screenshot attachment: 96368b23-6895-4fd0-9392-244bb27679a5)

Also, the Ack Pending count doesn't seem to make sense. The numbers keep increasing although we have limited redeliveries to 4 (with 35s delay).

The consumers seem to wake up randomly and deliver the unprocessed messages, but that happens with increasing delays the longer the consumers run. We have observed delays of several hours.

The consumers are durable in our case. Restarting the subscription on a consumer often does nothing. Unprocessed messages stay unprocessed. Removing the consumers and recreating them with the same config (DeliverAll) on the existing stream helps for a while. As mentioned above: after a few hundred/thousand new messages things start to pile up again.
