Stuck and inconsistent consumer when deleting non-acked messages #6374

roeschter · 2025-01-14T16:09:43Z

Observed behavior

A consumer gets stuck and eventually becomes unstuck after a random time, ranging from a few seconds to an hour. Even before the consumer gets stuck, the statistics reported by nats consumer info appear to be incorrect: Outstanding Acks are too high, and the Redelivery count jumps wildly and sometimes goes backwards.

The effect was observed independently in two separate environments. It is somewhat random, though, and may take a few minutes to appear:

Java, Windows 11, local 3 node cluster
Golang, Cloud Linux, 3 node cluster

Expected behavior

Consistent consumer behavior even when messages marked for redelivery are deleted

Server and client version

NATS server 2.10.24
Latest Java client
Latest Go client

Host environment

Windows 11 as well as Cloud Linux

Steps to reproduce

Conceptually:

A high percentage of messages not acked and being redelivered
A high percentage of messages deleted while pending redelivery
3 node cluster

Exact steps:

Start a 3 node cluster
Start the Feeder (recreates a replica 3 stream)
Start the Consumer
Wait for "consumer stalled"
You may need to stop and restart the consumer.
Observe the consumer info: the Redelivery count may go down, and Outstanding Acks may grow to unrealistic values.
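
The last step can be partly automated. Below is a minimal Go sketch, not part of the attached reproducer, that compares successive consumer info snapshots and flags the symptoms described in this report. It assumes the snapshots come from something like `nats consumer info --json` and that the JSON uses the fields `delivered.consumer_seq`, `num_ack_pending`, `num_redelivered`, `num_waiting` and `num_pending`. The names `Snapshot`, `ParseSnapshot` and `FindSymptoms`, and the `ackPendingLimit` threshold, are hypothetical. Whether a falling redelivery count is abnormal by itself depends on how the server defines that counter, so treat the output as hints rather than proof.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// Snapshot holds the counters of interest from one consumer info JSON document.
// The field names are an assumption about the JSON layout, not taken from the reproducer.
type Snapshot struct {
	Delivered struct {
		ConsumerSeq uint64 `json:"consumer_seq"`
	} `json:"delivered"`
	NumAckPending  int    `json:"num_ack_pending"`
	NumRedelivered int    `json:"num_redelivered"`
	NumWaiting     int    `json:"num_waiting"`
	NumPending     uint64 `json:"num_pending"`
}

// ParseSnapshot decodes one JSON document into a Snapshot.
func ParseSnapshot(data []byte) (Snapshot, error) {
	var s Snapshot
	err := json.Unmarshal(data, &s)
	return s, err
}

// FindSymptoms walks the snapshots in order and reports the symptoms from this issue.
// ackPendingLimit is a threshold chosen by the caller; pass 0 to skip that check.
func FindSymptoms(snaps []Snapshot, ackPendingLimit int) []string {
	var out []string
	for i, cur := range snaps {
		if ackPendingLimit > 0 && cur.NumAckPending > ackPendingLimit {
			out = append(out, fmt.Sprintf("snapshot %d: ack pending %d exceeds limit %d",
				i, cur.NumAckPending, ackPendingLimit))
		}
		if i == 0 {
			continue // the remaining checks need a previous snapshot
		}
		prev := snaps[i-1]
		if cur.NumRedelivered < prev.NumRedelivered {
			out = append(out, fmt.Sprintf("snapshot %d: redelivered count dropped from %d to %d",
				i, prev.NumRedelivered, cur.NumRedelivered))
		}
		// Possible stall: messages are pending, a pull request is waiting,
		// yet nothing new was delivered since the previous snapshot.
		if cur.NumPending > 0 && cur.NumWaiting > 0 &&
			cur.Delivered.ConsumerSeq == prev.Delivered.ConsumerSeq {
			out = append(out, fmt.Sprintf("snapshot %d: no deliveries with %d pending and %d waiting",
				i, cur.NumPending, cur.NumWaiting))
		}
	}
	return out
}

func main() {
	// Made-up sample documents for illustration only.
	raw := []string{
		`{"delivered":{"consumer_seq":100},"num_ack_pending":40,"num_redelivered":30,"num_waiting":1,"num_pending":500}`,
		`{"delivered":{"consumer_seq":100},"num_ack_pending":1200,"num_redelivered":12,"num_waiting":1,"num_pending":500}`,
	}
	var snaps []Snapshot
	for _, r := range raw {
		s, err := ParseSnapshot([]byte(r))
		if err != nil {
			panic(err)
		}
		snaps = append(snaps, s)
	}
	for _, msg := range FindSymptoms(snaps, 1000) {
		fmt.Println(msg)
	}
}
```

Taking a snapshot every few seconds while the Feeder and Consumer run, then passing the collected documents through `FindSymptoms`, gives a timestamped record of when the counters first diverge.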

Reproducer20250114_noack_delete.zip


tehsphinx · 2025-01-14T16:32:43Z

@roeschter thank you for reporting this!

We are seeing this behaviour as well on a Linux-based 3 node cluster installation. I was also able to reproduce it on a Docker-based local cluster (macOS).

After a few hundred to a few thousand messages, our consumers have unprocessed messages that do not get processed even though there is 1 pull request waiting per consumer.

Also, the Ack Pending count doesn't seem to make sense. The numbers keep increasing although we have limited redeliveries to 4 (with a 35s delay).

The consumers seem to wake up randomly and deliver the unprocessed messages, but that happens with increasing delays the longer the consumers run. We have observed delays of several hours.

The consumers are durable in our case. Restarting the subscription on a consumer often does nothing. Unprocessed messages stay unprocessed. Removing the consumers and recreating them with the same config (DeliverAll) on the existing stream helps for a while. As mentioned above: after a few hundred/thousand new messages things start to pile up again.

roeschter added the "defect" label (Suspected defect such as a bug or regression) on Jan 14, 2025
