
Accuracy of inactive threshold for ephemeral consumers #777

Closed

erdemiru opened this issue Oct 25, 2022 · 5 comments

@erdemiru

Defect

Versions of io.nats:jnats and nats-server:

nats-server: 2.9.3
io.nats:jnats: 2.16.1

OS/Container environment:

  • macOS, running local NATS server.
  • Windows, running local NATS server.

Steps or code to reproduce the issue:

NATS documentation indicates that the default value of the inactive threshold is 5 seconds. If I add a processing delay (e.g. 3 s) while handling messages, then after several pull requests nextMessage() always returns null, even though the subscription is still open and there are more messages on the server. (A minimal sketch of the pattern follows the test descriptions below.)

The example project contains two test methods with different test parameters.

The shouldConsumeAllMessagesWithBatchPull method:

  • Publishes 7 messages and tries to pull them with a batch size of 3.
  • Each message is acknowledged synchronously as soon as it is received; the consumer thread then sleeps for 3 seconds.
  • After the 6th message, the consumer becomes inactive and nextMessage always returns null.

shouldConsumeAllMessages is a slightly more complex test method:

  • I parameterized the number of messages, the batch size, and the processing delay so that different behaviours can be investigated more easily.
  • If the batch size is 1 or 2, the consumer pulls all messages without any problem.
  • If the batch size is 3, the consumer becomes inactive and cannot consume all messages.
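
For reference, here is a minimal sketch of the pull-and-sleep pattern both tests use (the URL, subject name, and loop bounds are simplified stand-ins, not the example project's actual code; it assumes a stream bound to test.subject already exists):

import io.nats.client.Connection;
import io.nats.client.JetStream;
import io.nats.client.JetStreamSubscription;
import io.nats.client.Message;
import io.nats.client.Nats;
import io.nats.client.PullSubscribeOptions;
import java.time.Duration;

public class InactiveThresholdRepro {
    public static void main(String[] args) throws Exception {
        try (Connection nc = Nats.connect("nats://localhost:4222")) {
            JetStream js = nc.jetStream();

            // Ephemeral pull consumer: no durable name, so the server's
            // default inactive threshold of 5 seconds applies.
            JetStreamSubscription sub = js.subscribe(
                    "test.subject", PullSubscribeOptions.builder().build());

            int received = 0;
            int emptyBatches = 0;
            while (received < 7 && emptyBatches < 3) {
                sub.pull(3); // request a batch of 3
                Message msg = sub.nextMessage(Duration.ofSeconds(1));
                if (msg == null) {
                    emptyBatches++; // consumer likely removed; pulls have no effect
                    continue;
                }
                while (msg != null) {
                    msg.ackSync(Duration.ofSeconds(1)); // ack immediately
                    received++;
                    Thread.sleep(3000); // simulated per-message processing delay
                    msg = sub.nextMessage(Duration.ofSeconds(1));
                }
            }
            // With a 3s delay per message, a batch of 3 keeps the server waiting
            // roughly 9s between pulls, so the consumer disappears mid-test.
            System.out.println("Received " + received + " of 7 messages");
        }
    }
}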

Some additional information:

  • Acknowledging messages synchronously or asynchronously does not change the result.
  • Acknowledging before or after the processing delay also does not change the result.
  • You can see the timing information in the console output. For example:

0s 7ms: Received message #1 content: 1, consumer information...
3s 38ms: Received message #2 content: 2, consumer information...

Expected result:

  • The consumer should not be removed while it is active within the inactive threshold limit.
  • If the consumer has been removed, pull requests should throw an exception to notify the client.

Actual result:

  • The consumer is removed even though it is active within the inactive threshold limit.
  • Subsequent pull requests have no effect; nextMessage always returns null.
@scottf
Contributor

scottf commented Oct 25, 2022

@erdemiru I will look at this in detail, but off the top of my head:

long processingDelayInMillis = 2000;

subscription.pull(3);
processAndAcknowledgeNextMessage(subscription, processingDelayInMillis);
processAndAcknowledgeNextMessage(subscription, processingDelayInMillis);
processAndAcknowledgeNextMessage(subscription, processingDelayInMillis);

2000 + 2000 + 2000 = 6000, i.e. 6 seconds. The subscription is already inactive because it did not get another pull. I'm pretty sure reading and acking do not reset the threshold (I'll verify). I'm surprised you got the second set of 3.

Either way, it's not exact. The server is doing lots of work, so the subscription may stay active longer than the threshold, but it won't be removed sooner.
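
To illustrate (hypothetical sketch, not the project's code; assumes io.nats.client.* and java.time.Duration imports): keeping the gap between pulls under the threshold is what makes the small batch sizes above work.

// With batch size 1 there is a pull roughly every 3 seconds,
// comfortably under the 5 second default inactive threshold.
static int drainOneAtATime(JetStreamSubscription sub, int total)
        throws InterruptedException {
    int received = 0;
    while (received < total) {
        sub.pull(1); // the server sees pull activity on every iteration
        Message msg = sub.nextMessage(Duration.ofSeconds(1));
        if (msg != null) {
            msg.ack();
            received++;
            Thread.sleep(3000); // 3s of processing, then pull again
        }
    }
    return received;
}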

@scottf scottf removed the 🐞 bug label Oct 25, 2022
@scottf
Contributor

scottf commented Oct 25, 2022

We are working on improvements for handling inactive consumers, but it IS VERY difficult to know. You can ask the server for consumer info, but that's a round trip to the server. There are now heartbeats on pulls, so that's one way we will try to address inactivity.
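
For example, something like this (a sketch; assumes io.nats.client.JetStreamApiException, io.nats.client.api.ConsumerInfo, and java.io.IOException imports):

// Liveness check via a consumer-info round trip: one request to the server.
try {
    ConsumerInfo info = sub.getConsumerInfo();
    System.out.println("consumer alive, pending: " + info.getNumPending());
} catch (JetStreamApiException e) {
    // e.g. "consumer not found" once the server has removed the ephemeral consumer
    System.out.println("consumer is gone: " + e.getMessage());
} catch (IOException e) {
    e.printStackTrace(); // connection problem, not an answer about consumer state
}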

@erdemiru
Author

Hi @scottf,

Thanks for your quick response. It's interesting to hear that acknowledging a message does not reset the threshold, as I would expect it to be a clear indication that the consumer is still active.

I also tried some other examples:

  • numberOfMessages: 9, batch size: 5, processing delay: 2000 -> test OK, it receives 9 out of 9 messages.
  • numberOfMessages: 12, batch size: 5, processing delay: 2000 -> test FAILS, it receives 10 out of 12 messages.

In both cases, the interval between pulls is 10 seconds.

As a workaround, we can set the inactive threshold to something longer than max processing time × batch size.
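
Roughly like this (a sketch, assuming the inactiveThreshold option in ConsumerConfiguration is available in jnats 2.16.1; imports from io.nats.client and io.nats.client.api):

// Workaround sketch: raise the inactive threshold above
// max processing time x batch size (here 3s x 3 = 9s, with headroom).
ConsumerConfiguration cc = ConsumerConfiguration.builder()
        .inactiveThreshold(Duration.ofSeconds(15))
        .build();
PullSubscribeOptions pso = PullSubscribeOptions.builder()
        .configuration(cc)
        .build();
JetStreamSubscription sub = js.subscribe("test.subject", pso);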

@erdemiru
Author

I don't know if this is directly related to the issue above, but I also see duplicate messages in some scenarios. Let's say we set the inactive threshold to 10 minutes so that the inactive-threshold problem is out of the picture.

number of messages: 57, batch size: 100 (greater than the number of available messages), processing delay: 1000

In that case, the consumer receives duplicate messages:

56s 355ms : Received message #57 content: 57
57s 361ms : Received message #58 content: 31 // the 31st message is a duplicate
58s 367ms : Received message #59 content: 32 // the 32nd message is a duplicate
....
82s 530ms : Received message #83 content: 56 // the 56th message is a duplicate
83s 538ms : Received message #84 content: 57 // the 57th message is a duplicate

Calling msg.ack() or msg.ackSync() before the sleep doesn't fix it. It seems related to ack_wait (which is 30 seconds by default): setting it to a larger value avoids the duplicates (see the sketch below).
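
For reference, the larger value I tried looks roughly like this (a sketch; 3 minutes is just a value safely above 100 messages × ~1 s of processing each):

// ack_wait raised above the worst-case time to process a full batch,
// plus the long inactive threshold used to isolate the two issues.
ConsumerConfiguration cc = ConsumerConfiguration.builder()
        .ackWait(Duration.ofMinutes(3))
        .inactiveThreshold(Duration.ofMinutes(10))
        .build();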

Are the acknowledgements somehow delayed for batch pull requests?

@scottf
Contributor

scottf commented Oct 26, 2022

Is it taking you 30 seconds to ack something? Maybe you need to reduce your batch size and/or increase your ack wait. Ack wait and redelivery are fundamental server features; I'd be surprised if they were broken, but I suppose they could be.

I'm moving this issue to a discussion.

@nats-io nats-io locked and limited conversation to collaborators Oct 26, 2022
@scottf scottf converted this issue into discussion #778 Oct 26, 2022

