NATS Push Consumer Client Stuck after NATS Node Failure #997

Closed
stefanLeo opened this issue Oct 3, 2023 · 43 comments
Labels
defect Suspected defect such as a bug or regression

Comments

@stefanLeo

stefanLeo commented Oct 3, 2023

Observed behavior

We re-used the NATS JetStream Producer and PUSH Consumer examples from https://github.com/nats-io/nats.java/blob/main/src/examples/java/io/nats/examples/jetstream/NatsJsPushSubBasicAsync.java
and then killed the NATS leader node of the stream (we forcefully killed the VM hosting the Kube Worker Node).

Setup: In Memory Storage Option, 3 Replicas.

Config of createStream was changed to:
StreamConfiguration sc = StreamConfiguration.builder()
.name(streamName)
.storageType(storageType)
.subjects(subjects)
.replicas(3)
.description("LifeX-Test")
.build();

and the connection builder to:
Options.Builder builder = new Options.Builder()
.server(servers)
.connectionTimeout(Duration.ofMillis(500))
.pingInterval(Duration.ofSeconds(3))
.maxPingsOut(2)
.reconnectWait(Duration.ofMillis(500))
.connectionListener(EXAMPLE_CONNECTION_LISTENER)
.traceConnection()
.errorListener(EXAMPLE_ERROR_LISTENER)
.maxReconnects(-1)
.reconnectDelayHandler(new PsReconnectDelayHandler())
.reconnectJitter(Duration.ofMillis(500));

When connecting, we configure all 3 servers of the cluster and register connection, error and delay handlers (basically just logging the callbacks).
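
For reference, a minimal sketch of what those listeners look like on our side. EXAMPLE_CONNECTION_LISTENER and EXAMPLE_ERROR_LISTENER are our own names from the builder above; the bodies below are hypothetical logging-only implementations, assuming a recent nats.java where the ErrorListener methods have default implementations:

import io.nats.client.Connection;
import io.nats.client.ConnectionListener;
import io.nats.client.ErrorListener;

// Hypothetical logging listeners matching the names used in the Options builder above.
public class ExampleListeners {

    public static final ConnectionListener EXAMPLE_CONNECTION_LISTENER =
        (conn, event) -> System.out.println("Connection Event: " + event);

    public static final ErrorListener EXAMPLE_ERROR_LISTENER = new ErrorListener() {
        @Override
        public void errorOccurred(Connection conn, String error) {
            System.out.println("Server Error: " + error); // -ERR protocol errors
        }

        @Override
        public void exceptionOccurred(Connection conn, Exception exp) {
            System.out.println("Exception: " + exp); // e.g. read channel closed on node failure
        }
    };
}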

Setup Environment: NATS Cluster with 3 NATS Pods on top of RHAT Openshift Kubernetes cluster.

After the failure of the NATS master node the following happens:

  1. Producer detects the failure, reconnects and continues sending after ~15 seconds.
  2. Consumer DOES NOT detect the failure and is stuck. It logs no error or disconnection info, and the message handler receives no new messages. Note that if we restart the consumer, it can consume ALL messages sent by the producer, incl. the ones sent after the producer reconnected.
    Note as well that the consumer aborts once the producer is done and deletes the stream. Only then is some disconnect log printed.

Logs of the NATS nodes are attached... I cannot really add logs of the Java client because there are none; it seems to just remain stuck indefinitely. UPDATE: Added Java client logs with the traceConnection setting and now we see more details.
The client seems to reconnect and resubscribe, but still does NOT get any further messages pushed...

Expected behavior

Producer detects the failure, reconnects and continues sending.
Consumer detects the failure, reconnects and continues consuming.

Server and client version

Server: 2.9.22
Java Client: 2.16.14

Host environment

We used the official NATS container images and the HELM charts for deployment.

Steps to reproduce

Set up a 3-node cluster on RHAT Openshift or any other Kubernetes cluster
Start the producer with settings as above > https://github.com/nats-io/nats.java/blob/main/src/examples/java/io/nats/examples/jetstream/NatsJsPub.java
Start the consumer with settings as above > https://github.com/nats-io/nats.java/blob/main/src/examples/java/io/nats/examples/jetstream/NatsJsPushSubBasicAsync.java
Kill the leader node (find the stream leader using the nats CLI)

Logs:
nats-server-2-logs.txt
nats-server-1-logs.txt
logs.zip

@stefanLeo stefanLeo added the defect Suspected defect such as a bug or regression label Oct 3, 2023
@stefanLeo
Author

stefanLeo commented Oct 3, 2023

What also caught my eye: we added some code to measure the time it takes from publishing until the exception is thrown - you can see this here:
2023-10-03 12:49:28:190 +0000 [main] INFO com.frq.nats.Producer - Published message took [746]us. On subject [lx-subject], stream [lx-stream], seqno [127076]
2023-10-03 12:49:28:191 +0000 [main] INFO com.frq.nats.Producer - Published message took [743]us. On subject [lx-subject], stream [lx-stream], seqno [127077]
2023-10-03 12:49:28:192 +0000 [main] INFO com.frq.nats.Producer - Published message took [795]us. On subject [lx-subject], stream [lx-stream], seqno [127078]
2023-10-03 12:49:28:193 +0000 [main] INFO com.frq.nats.Producer - Wait due to rate limit [0.047]msec. [127079]
2023-10-03 12:49:43:194 +0000 [main] WARN com.frq.nats.Producer - Cannot publish due to NATS unavailable [127079]
2023-10-03 12:49:43:194 +0000 [main] INFO com.frq.nats.Producer - Failover took [15000]msec

And the failover always takes exactly 15 secs... Why is that, and can we improve on that via some setting?

UPDATE: This behaviour can be changed via the request timeout on JetStreamOptions -> Solved.
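
For anyone finding this later, a minimal sketch of that change, assuming the stock JetStreamOptions builder (the 2-second value is just an example, not a recommendation):

import java.time.Duration;

import io.nats.client.Connection;
import io.nats.client.JetStream;
import io.nats.client.JetStreamOptions;

public class JsRequestTimeoutExample {
    // Build a JetStream context whose API calls (including waiting for publish acks)
    // give up after 2 seconds instead of the ~15 seconds we observed above.
    public static JetStream jetStreamWithShortTimeout(Connection nc) throws Exception {
        JetStreamOptions jso = JetStreamOptions.builder()
            .requestTimeout(Duration.ofSeconds(2))
            .build();
        return nc.jetStream(jso);
    }
}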

@stefanLeo
Author

stefanLeo commented Oct 4, 2023

Update: The consumer is really only stuck if the "NATS STREAM Leader" is killed. The failover works if any other node is killed.
This is the case even if the consumer is not directly connected to the stream leader.

@scottf
Contributor

scottf commented Oct 4, 2023


Experiment 1. Kill server with the consumer leader if it's not the same server you are connected to...
------------------------------------------------------------------------------------------------------
Connection Event: nats: discovered servers
Connection Event: nats: connection opened
Sync Exception: Status{code=409, message='Server Shutdown'}
SEVERE: pullStatusError, Connection: 30, Subscription: 1820791985, Consumer Name: SyncConsumer, Status:Status{code=409, message='Server Shutdown'}
SEVERE: pullStatusError, Connection: 30, Subscription: 1648420689, Consumer Name: CallbackConsumer, Status:Status{code=409, message='Server Shutdown'}
Connection Event: nats: connection closed


Experiment 2. Kill server you are connected to if it's not the same as the consumer leader...
------------------------------------------------------------------------------------------------------
CONSUMER IS REMOVED.


Experiment 3. Kill server you are connected to if it is the same as the consumer leader...
------------------------------------------------------------------------------------------------------
Connection Event: nats: discovered servers
Connection Event: nats: connection opened
Sync Exception: Status{code=409, message='Server Shutdown'}
Error Listener: Pull Status Error for 'CallbackConsumer', Status{code=409, message='Server Shutdown'}
Error Listener: Pull Status Error for 'SyncConsumer', Status{code=409, message='Server Shutdown'}
ErrorListenerLoggerImpl: SEVERE: exceptionOccurred, Connection: 15, Exception: java.io.IOException: Read channel closed.
Connection Event: nats: connection disconnected
Connection Event: nats: connection closed



Just reporting some experiment results. Java library version is 2.16.10, 3-node cluster, stream is file-based and R1.
* 1) If I kill the consumer leader server, the subscription becomes invalid and the server closes the pull, or it becomes closed as an effect of the consumer leader being gone.
Depending on how the server came down, the client gets a 409 'Server Shutdown' status message. The user gets an exception on nextMessage, nothing in a handler, and error listeners get a pull status error.
The subscription currently just leaves it up to the user to be aware of the pull status error, which is terminal. I'll look at making the subscription better aware that it is in this state.
* 2) If I kill the server that is not the consumer leader but is just the one the client is connected to, the client reconnects.



--- bootstrap is nats://host:4222, nats://host:5222, nats://host:6222, nats://host:7222, nats://host:8222
--- connection is set to randomize (default) and max retries of 3
--- 4222 is the only server available.

trying nats://host:7222 - tried, marked as failed once & retried 0 times
trying nats://host:8222 - tried, marked as failed once & retried 0 times
trying nats://host:4222 - tried, retries cleared to 0

--- after being connected to 4222, server 4222 was brought down, there are no servers available.

trying nats://host:6222 - tried, marked as failed once & retried 0 times
trying nats://host:5222 - tried, marked as failed once & retried 0 times
trying nats://host:7222 - tried, marked as failed twice & retried 1 time
trying nats://host:8222 - tried, marked as failed twice & retried 1 time
trying nats://host:4222 - tried, marked as failed once & retried 0 times

trying nats://host:6222 - tried, marked as failed twice & retried 1 time
trying nats://host:5222 - tried, marked as failed twice & retried 1 time
trying nats://host:7222 - tried, marked as failed 3 times & retried 2 times
trying nats://host:8222 - tried, marked as failed 3 times & retried 2 times
trying nats://host:4222 - tried, marked as failed twice & retried 1 time

trying nats://host:6222 - tried, marked as failed 3 times & retried 2 times
trying nats://host:5222 - tried, marked as failed 3 times & retried 2 times
trying nats://host:7222 - tried, marked as failed 4 times & retried 3 times
trying nats://host:8222 - tried, marked as failed 4 times & retried 3 times
trying nats://host:4222 - tried, marked as failed 3 times & retried 2 times

trying nats://host:6222 - tried, marked as failed 4 times & retried 3 times
trying nats://host:5222 - tried, marked as failed 4 times & retried 3 times
trying nats://host:4222 - tried, marked as failed 4 times & retried 3 times

done retrying



@stefanLeo
Author

Our tests reveal the following for your test scenarios:

  1. Kill Stream Leader, Consumer connected to other node.
    Result: Client is stuck and we get no callback via any of the handlers. We also get no pullStatusError.
    So from our PoV there is no way to react to this situation at all.
  2. Kill Node Consumer is connected to, which is not the stream leader:
    Client detects this and reconnects and continues consuming as expected.

@scottf
Contributor

scottf commented Oct 5, 2023

Pull status error does not cover that; that's probably something we should document better. Pull status error is for when a pull request results in an error from the server because of some constraint.
As for when the stream leader dies, yeah, this is a tough one. As you can see, I have started looking into this and am trying to find the time to address it, which may be through documentation and examples in addition to code fixes.

@stefanLeo
Author

Great to hear! Let me know if we can help or in case there is something we can test.

@stefanLeo
Author

stefanLeo commented Oct 12, 2023

Just some update from our side - we updated to 2.10 without any change regarding this topic.
A few things caught my eye, though I struggle to understand whether any of the following would help:

  1. Set the number of replicas in consumer options to 3.
    The java lib however enforces this value to be "1" for an ordered consumer - Is that on purpose and if so why?
  2. The ACK Policy is enforced to be set to "none" like above for ordered consumers. - Is that on purpose and if so why?
    And a side question to that: what happens if the consumer still "acks" a message in that case?
  3. The Deno client guys stated that the push consumer is deprecated - https://github.com/nats-io/nats.deno/blob/main/jetstream.md#processing-messages - is that a Deno client thing or a general NATS statement?
  4. Can you explain in details what an "ordered" consumer does compared to any other?

@stefanLeo
Author

And one update - we have now switched from the "ordered" consumer to a custom consumer config with replicas set to 3, and now this also works in case the leader node fails.

@scottf
Contributor

scottf commented Oct 12, 2023

1/2/4 "The java lib however enforces this value to be "1" for an ordered consumer. Ack Policy enforced. Is that on purpose and if so why?"
Yes, these are intentional. The name "ordered consumer" doesn't adequately convey that it is meant to be a fast, order-ensured read consumer. It's designed to fail and restart if there is an error such as an out-of-order message or a connection failure. I can document this better.
3. Sort of yes, for all clients. The pull consumer is much more efficient and allows greater control of message flow. The Simplification API has been built to make using the pull consumer feel like a push consumer, with endless consuming of messages, and is the preferred way of doing things now.
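
For context, a minimal sketch of how an ordered push consumer is created with this library, assuming the standard builder flag (the subject is a placeholder); the library then pins the consumer settings discussed above:

import io.nats.client.Connection;
import io.nats.client.JetStream;
import io.nats.client.JetStreamSubscription;
import io.nats.client.PushSubscribeOptions;

public class OrderedPushExample {
    // The ordered flag is all you set; the library enforces R1, AckPolicy.None, etc.
    // and restarts the consumer internally if it detects a gap or failure.
    public static JetStreamSubscription subscribeOrdered(Connection nc, String subject) throws Exception {
        JetStream js = nc.jetStream();
        PushSubscribeOptions pso = PushSubscribeOptions.builder()
            .ordered(true)
            .build();
        return js.subscribe(subject, pso);
    }
}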

@stefanLeo
Author

stefanLeo commented Oct 17, 2023

@scottf : We found another issue with leader failure that occurs only sporadically:
If we kill the leader node and, let's say, the consumer is connected to it, then the reconnection of the library is as fast as expected (given the ping timeout config), but it is stuck after stating "subscriptions re-established". Then nothing happens for ~22 seconds, and then the next message is consumed.
I managed to drill that down to NatsConnection's deliverMessage method, which I have overridden just for logging purposes > "Deliver Msg called". So this method is not called for ~22 sec after re-subscription was successful.
Logs with the trace option set are attached - as said, this happens in 50% of our test cases. Is this also a "config" or "docu" issue, or a library or NATS server issue?

It's about this timespan:
[08:27:47.068887] connect trace: status updated
2023-10-17 08:27:47:068 +0000 [pool-2-thread-1] INFO com.frq.nats.listener.PsErrorListenerLogger - Connection Event [CONNECTED] [### Connection ###
Reconnects: 1
Requests Sent: 7
Replies Received: 7
Pings Sent: 10
+OKs Received: 0
-Errs Received: 0
Handled Exceptions: 1
Successful Flush Calls: 0
Outstanding Request Futures: 0
Dropped Messages: 0
Reader
Messages in: 57,390
Bytes in: 16,143,590
Writer
Messages out: 57,403
Bytes out: 4,455,018

] [nats: connection reconnected]
2023-10-17 08:27:47:071 +0000 [pool-2-thread-1] INFO com.frq.nats.listener.PsErrorListenerLogger - Connection Event [CONNECTED] [ Connection
Reconnects: 1
Requests Sent: 7
Replies Received: 7
Pings Sent: 11
+OKs Received: 0
-Errs Received: 0
Handled Exceptions: 1
Successful Flush Calls: 1
Outstanding Request Futures: 0
Dropped Messages: 0
Reader
Messages in: 57,390
Bytes in: 16,143,590
Writer
Messages out: 57,406
Bytes out: 4,455,100
] [nats: subscriptions re-established]
2023-10-17 08:28:09:444 +0000 [NATS-LX-5] INFO io.nats.client.impl.FrqNatsConnection - Deliver Msg called [57382] took [55908]ns

nats-java-client.log

2023-10-17 08:27:47:071 vs 2023-10-17 08:28:09:444 = 22 sec delta.

NATS Server Logs - from a different test run - during the client freeze time:
nats_server_log

@stefanLeo
Author

And another update: we tried the pull-based subscriber now as well, with basically the same result:

PullSubscribeOptions so = PullSubscribeOptions.builder()
.bind(false)
.stream(exArgs.stream)
.durable("LX-NATS")
.configuration(ConsumerConfiguration.builder()
.numReplicas(3)
.ackPolicy(AckPolicy.Explicit)
.ackWait(Duration.ofSeconds(5))
.memStorage(true)
.build())
.build();

JetStreamSubscription subscription = js.subscribe(null, so);
nc.flush(Duration.ofSeconds(1));
JetStreamReader reader = subscription.reader(1,1);
PSMessageHandler handler = new PSMessageHandler(totalLatency, msgLatch);
while (msgLatch.getCount() != 0) {
    handler.onMessage(reader.nextMessage(0));
}

Once the stream leader node on the server side is killed, the consumer is stuck. It reconnects and successfully resubscribes, but then no new messages are consumed...
[nats: connection reconnected]
[nats: subscriptions re-established]
And then it is stuck

@scottf
Contributor

scottf commented Oct 19, 2023

Duplicate in nats-io/nats-server#4624

@scottf
Contributor

scottf commented Oct 19, 2023

@stefanLeo As you have noticed, when using the raw pull and the stream leader (or consumer leader node) is killed, the consumer does not recover. For raw pulls, it will never recover; it cannot. You must pay attention to heartbeats and connection events. I will try to make an example.
For push, handling this is very similar.
For simplification, I am in the process of adding error handling to cover as much of the situation as possible, but it won't always be possible.
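
To illustrate, a rough sketch of one way to "pay attention" around a raw pull. The durable name is assumed from the config earlier in this thread, and the recovery trigger (probing consumer info after a few empty fetches) is just an illustration, not the library's built-in behavior:

import java.time.Duration;
import java.util.List;

import io.nats.client.Connection;
import io.nats.client.JetStream;
import io.nats.client.JetStreamManagement;
import io.nats.client.JetStreamSubscription;
import io.nats.client.Message;
import io.nats.client.PullSubscribeOptions;
import io.nats.client.api.ConsumerConfiguration;

public class RawPullRecoverySketch {
    public static void run(Connection nc, String stream, String subject) throws Exception {
        JetStream js = nc.jetStream();
        JetStreamManagement jsm = nc.jetStreamManagement();

        PullSubscribeOptions pso = PullSubscribeOptions.builder()
            .stream(stream)
            .durable("LX-NATS") // durable name assumed from the config earlier in this thread
            .build();
        JetStreamSubscription sub = js.subscribe(subject, pso);

        int emptyFetches = 0;
        while (true) {
            List<Message> batch = sub.fetch(10, Duration.ofSeconds(2));
            if (batch.isEmpty()) {
                // No data: could simply be an idle stream, or the consumer/leader may be gone.
                if (++emptyFetches >= 3) {
                    try {
                        jsm.getConsumerInfo(stream, "LX-NATS"); // probe: does the consumer still exist?
                    }
                    catch (Exception consumerGone) {
                        // Recreate the consumer and re-subscribe, then continue fetching.
                        jsm.addOrUpdateConsumer(stream,
                            ConsumerConfiguration.builder().durable("LX-NATS").build());
                        sub.unsubscribe();
                        sub = js.subscribe(subject, pso);
                    }
                    emptyFetches = 0;
                }
            }
            else {
                emptyFetches = 0;
                for (Message m : batch) {
                    // process the message, then ack
                    m.ack();
                }
            }
        }
    }
}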

@stefanLeo
Author

@scottf : Thx for the feedback. I will try to add a reconnection handler to my pull example. I will try to share with you guys once I have it.
For push: what I do not understand here is that the consumer now continues 100% of the time - so the push consumer is not stuck, as it keeps consuming messages. BUT sometimes it takes 22 secs between the successful resubscribe and the next consumed message.
And imho this is a server-side issue, from what I can tell after investigating the Java client code and adding all sorts of debug logs.

@albionb96

albionb96 commented Oct 31, 2023

@scottf I tested the issue that @stefanLeo is having, using the pull consumer in three "modes": durable, non-durable with consumer replicas, and non-durable with no consumer replicas.

The results are the following:

  1. Durable mode:
    a) The failover time is constant across multiple tests and lies at 8 sec (which is the minimum failover time achievable with the current values for timeouts etc., but these 8 seconds can be further fine-tuned).
    b) The message that is sent exactly at the failure moment is registered as lost and is resent approx. 1.5-3 seconds later.
    c) The avg latency lies between 7-15 msec, which is much higher than the avg latency of the push consumer, which was between 0-1 msec. I don't know if this has to do with the pull consumer in general or specifically with the failover scenario.

  2. Non-durable with consumer replicas:
    a) The failover time is not constant across multiple tests, varying between 8-14 seconds.
    b) No messages lost this time.
    c) The avg latency between multiple tests varies widely, from 2 msec up to 84 msec.

  3. Non-durable with no consumer replicas:
    a) The failover time is not constant across multiple tests, varying between 8-17 seconds.
    b) No messages lost this time.
    c) The avg latency between multiple tests varies widely, from 1 msec up to 32 msec.

We see that the only configuration with a reasonable, constant and predictable failover time is the durable pull consumer. The problem in this case is then the avg latency, which is much higher than when using a push consumer.

Regarding the push consumer:
a) The consumer always resumes the connection, but the failover time was very variable, varying from 8-30 seconds.
b) Sometimes the message that was sent exactly at the failure moment is registered as lost and is re-sent approx. 1.5-3 seconds later.
c) The avg message latency was very small, lying between 0-1 msec, during multiple tests with over 500k messages and a failover in between.

I don't know if there is something that can be improved in this regard, but based on these results (at least), I couldn't find a silver bullet for all the issues.

p.s. I was running an Openshift cluster on-prem (vSphere). A VM on vSphere corresponds in this case to a Kubernetes node, and in order to trigger the failover I would just shut down the VM where the relevant nats server replica pod was living. By "relevant nats server replica pod" I mean that the consumer was connected to this pod and, at the same time, the stream leader was also located on this pod.

@albionb96

@scottf Any update on this? Which one from the 3 tested modes would you recommend for this use-case?

@scottf
Contributor

scottf commented Nov 13, 2023

@scottf Any update on this? Which one from the 3 tested modes would you recommend for this use-case?

I'm doing my best and working on it. I currently have a PR ready with changes to Simplification. I'm working on examples to demonstrate it. Things are moving slower than we all want, but they are moving. Thanks for your patience.

@scottf
Contributor

scottf commented Nov 28, 2023

@albionb96
This is addressed in these PRs: #1043 / #1045

This affects the following consumers.

  1. Durable Push Consumer
  2. Ephemeral Push Consumer with a long enough inactive threshold to survive a server outage or disconnection
  3. Ordered Push Consumer
  4. Durable Pull Consumer when used with endless Simplification API
  5. Ephemeral Pull Consumer with a long enough inactive threshold to survive a server outage or disconnection when used with endless Simplification API
  6. Ordered Pull Consumer with endless Simplification

@scottf
Contributor

scottf commented Nov 28, 2023

2.17.2-SNAPSHOT should be available; Gradle/Maven instructions are in the readme:
https://github.com/nats-io/nats.java#using-gradle

@scottf
Contributor

scottf commented Nov 29, 2023

Fetch isn't going to recover because it's not endless. Once Fetch returns a null, it's done. Time might have expired anyway. I should have made my note refer to endless simplification, sorry for the confusion.

@stefanLeo
Author

@scottf : Can you point me to the endless API? THX

@scottf
Contributor

scottf commented Nov 29, 2023

This is a handler example:

ConsumerContext cc = streamContext.getConsumerContext("myConsumer");

MessageHandler handler = m -> {
    // do your work
    m.ack();
};
        
ConsumeOptions co = ConsumeOptions.builder()
    .batchSize(batchSize)
    .expiresIn(expiresIn)
    .build();

MessageConsumer mc = cc.consume(co, handler); // hold onto the handle for stop / isStopped

@scottf
Contributor

scottf commented Nov 29, 2023

https://github.com/nats-io/nats.java/tree/main/src/examples/java/io/nats/examples/jetstream/simple

@stefanLeo
Author

@scottf : Thx. Tested it with the
try (FetchConsumer consumer = consumerContext.fetch(consumeOptions)) {
option - still there is no auto recover... Is the auto-recover from the library a wrong assumption of mine?
It works again if I hook in my reconnect logic to restart the subscription once fetch returns null, like:
startSeq = handler.getLastSeqNo() + 1;
consumerBuilder.deliverPolicy(DeliverPolicy.ByStartSequence).startSequence(startSeq);
consumerContext = streamContext.createOrUpdateConsumer(consumerBuilder.build());

@scottf
Contributor

scottf commented Nov 30, 2023

Correct, fetch won't auto recover because it's a one-time (not endless) call for an amount of messages or bytes over time. It is possible that the fetch has not yet expired by the time the client has recovered. But fetch is simple enough that once it's done or broken, it will return null, which is enough of a signal for the developer to make another fetch. I'll add an example when I get a chance.
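
A minimal sketch of that pattern, assuming the 2.17.x simplification API: treat null from nextMessage() as the end of that particular fetch and simply issue another one (batch size and expiry are arbitrary examples):

import io.nats.client.ConsumerContext;
import io.nats.client.FetchConsumeOptions;
import io.nats.client.FetchConsumer;
import io.nats.client.Message;

public class RefetchLoopSketch {
    // Endless consumption built from one-shot fetches: each fetch ends on its own
    // (null from nextMessage()), and we just start the next one.
    public static void consumeForever(ConsumerContext cc) throws Exception {
        FetchConsumeOptions fco = FetchConsumeOptions.builder()
            .maxMessages(100)
            .expiresIn(2000) // millis
            .build();

        while (true) {
            try (FetchConsumer fetch = cc.fetch(fco)) {
                Message m;
                while ((m = fetch.nextMessage()) != null) {
                    // process the message, then ack
                    m.ack();
                }
            }
            catch (Exception e) {
                // e.g. 409 Consumer Deleted while the cluster re-elects; back off briefly and retry
                Thread.sleep(100);
            }
        }
    }
}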

@stefanLeo
Author

stefanLeo commented Nov 30, 2023

@scottf : Yes, but a simple refetch does not do the trick. I need to re-create my consumer in that case and tell it to start consuming at X. Otherwise all subsequent fetches fail with an exception.

Sorry - that was too quick! I retested and it works now. Will come back with more info after more tests!

The only thing I changed compared to my previous test: I set numberOfReplicas to 3 now. Is this the reason?
Update: Tests without numberOfReplicas set had the consumer client stuck after failover. With numberOfReplicas set, the tests have passed so far. More tomorrow...

@scottf
Contributor

scottf commented Nov 30, 2023

@scottf scottf closed this as completed Nov 30, 2023
@stefanLeo
Author

stefanLeo commented Dec 1, 2023

@scottf : We ran 60 failover tests overnight and 33 of them failed, with two different failure patterns:

  1. In some tests, after the failover, subsequent fetches always returned NULL - so the consumer was stuck.
  2. In other tests the NATS server reported the error below and the consumer was then stuck, with fetch always returning NULL:

ERROR com.frq.nats.SimplePullConsumer - ERROR fetching message: io.nats.client.JetStreamStatusCheckedException: io.nats.client.JetStreamStatusException: 409 Consumer Deleted
    at io.nats.client.impl.NatsFetchConsumer.nextMessage(NatsFetchConsumer.java:90)
    at com.frq.nats.SimplePullConsumer.main(SimplePullConsumer.java:115)
Caused by: io.nats.client.JetStreamStatusException: 409 Consumer Deleted
    at io.nats.client.impl.NatsJetStreamSubscription._nextUnmanaged(NatsJetStreamSubscription.java:172)
    at io.nats.client.impl.NatsFetchConsumer.nextMessage(NatsFetchConsumer.java:87)
    ... 1 more

ERROR com.frq.nats.listener.PsErrorListenerLogger - pullStatusError, Connection: 65, Subscription: 1833947145, Consumer Name: PS-SIMPLE-NATS, Status:Status{code=409, message='Consumer Deleted'}

Note: There are only a few msecs between the last successfully consumed msg and this exception being thrown on "fetch".

TestSetup:

  1. Kubernetes cluster setup with 3 NATS nodes, producer, consumer.
  2. We kill one node (and do not bring it back).
  3. We expect the producer as well as the consumer to auto-recover.
  4. We bring the failed node back up, wait for it to rejoin the cluster and repeat from 1.

How can we further investigate those issues?
Is it possibly a server side issue?

@scottf
Contributor

scottf commented Dec 1, 2023

@stefanLeo
What client version?
What tests? The fetch example I provided?
Are your consumers durable (infinite inactive threshold) or ephemeral?
Is the stream File or Memory? What is the R on the stream?
Did you give any pause between fetch loops / fails to give server time to come back?

@scottf
Contributor

scottf commented Dec 1, 2023

Regarding the fetch example: there is some text at the top of the example. I followed it to test recovery. It assumes a durable consumer. Ephemeral is a crapshoot based on the inactive threshold; you could try making it long, 1-10 minutes, enough time to recognize and restart a downed server.

/**
 * This example will demonstrate simplified fetch that is resilient
 * HOW TO TEST
 * 1. Set up and run a simple cluster. See https://github.com/nats-io/java-nats-examples/tree/main/example-cluster-config
 * 2. Run this program, watch the output for some time
 * 3. Kill the server that is the leader, let it stay down for a short time or a long time
 * 4. See the output showing things aren't running.
 * 5. Bring the killed server back up and watch the output showing recovery.
 */

@stefanLeo
Author

Client Version = 2.17.2-SNAPSHOT
Server Version = 2.10.2 (will update to 2.10.5 for a retest)
The fetch example you provided, with some changes, as the example seems to target a single-NATS-server environment only.
Non durable.
Memory Storage.
Replication = 3 (if this is what you mean by R?)
YES - the test runs infinitely - it refetches until it has consumed all 500k messages produced, or stops after 15 minutes of no fetched message received.

See my config:

ConsumerConfiguration.Builder consumerBuilder = ConsumerConfiguration.builder()
    .numReplicas(3)
    .ackPolicy(AckPolicy.Explicit)
    .name("PS-SIMPLE-NATS")
    .ackWait(Duration.ofSeconds(5))
    .memStorage(true);
FetchConsumeOptions consumeOptions = FetchConsumeOptions.builder()
    .maxMessages(1)
    .expiresIn(2000)
    .build();
while (ExampleUtils.MSG_LATCH.getCount() > 0) {
    try (FetchConsumer consumer = consumerContext.fetch(consumeOptions)) {
        Message msg = consumer.nextMessage();
        // some measurement code and null handling logic...
    } catch (Exception ex) {
        LOG.error("ERROR fetching message: ", ex);
    }
}

@scottf
Contributor

scottf commented Dec 1, 2023

Non durable

Recovery is completely dependent upon the consumer configuration's inactive threshold being long enough to survive the outage, because if it isn't, the consumer is deleted and there is nothing that can be done.

Memory Storage

I'm not 100% sure, but even with replication, memory storage is never going to recover. I'll add it to my test and figure it out when I get a chance.
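
For the record, a hedged sketch of the consumer settings this implies; the name, threshold and replica count are arbitrary examples taken from the tests above, not recommendations:

import java.time.Duration;

import io.nats.client.api.AckPolicy;
import io.nats.client.api.ConsumerConfiguration;

public class SurvivableConsumerConfig {
    // Either use a durable consumer, or give a named ephemeral consumer an inactive
    // threshold long enough to outlive the server outage; otherwise the server deletes
    // the consumer and the client cannot recover.
    public static ConsumerConfiguration build() {
        return ConsumerConfiguration.builder()
            .name("PS-SIMPLE-NATS")                    // named consumer, as in the tests above
            .inactiveThreshold(Duration.ofMinutes(10)) // must outlive the expected outage
            .numReplicas(3)
            .ackPolicy(AckPolicy.Explicit)
            .ackWait(Duration.ofSeconds(5))
            .build();
    }
}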

@stefanLeo
Author

Thx. I have now set the inactiveThreshold explicitly to a higher value. I think the 409 code is a special case, maybe caused by improper cleanup between the test runs...

What puzzles me more are the cases where the client is just stuck without any indication of "why" and simply does not receive any more msgs.

What do you mean by memory storage is not going to recover? The failover at least worked as expected in 50% of all cases :-)

@scottf scottf reopened this Dec 1, 2023
@scottf
Contributor

scottf commented Dec 1, 2023

Got some feedback on memory storage. With replication, it might be OK; a lot depends on whether it's the stream/consumer leader that goes down or not, on the consumer inactive threshold and on the consumer R. There are several variables.

As far as the 409 goes, do you know which one? These are the ones I know about:

public static String CONSUMER_DELETED               = "Consumer Deleted"; // 409
public static String CONSUMER_IS_PUSH_BASED         = "Consumer is push based"; // 409

public static String MESSAGE_SIZE_EXCEEDS_MAX_BYTES = "Message Size Exceeds MaxBytes"; // 409
public static String EXCEEDED_MAX_WAITING           = "Exceeded MaxWaiting"; // 409
public static String EXCEEDED_MAX_REQUEST_BATCH     = "Exceeded MaxRequestBatch"; // 409
public static String EXCEEDED_MAX_REQUEST_EXPIRES   = "Exceeded MaxRequestExpires"; // 409
public static String EXCEEDED_MAX_REQUEST_MAX_BYTES = "Exceeded MaxRequestMaxBytes"; // 409

public static String BATCH_COMPLETED                = "Batch Completed"; // 409 informational
public static String SERVER_SHUTDOWN                = "Server Shutdown"; // 409 informational with headers
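
A sketch of where those strings surface on the client side, assuming the pull-status callback available on ErrorListener in recent releases (the exact signature may vary slightly by version):

import io.nats.client.Connection;
import io.nats.client.ErrorListener;
import io.nats.client.JetStreamSubscription;
import io.nats.client.support.Status;

// Hedged illustration: flag terminal 409s (like "Consumer Deleted") so the application
// loop can decide to recreate the consumer instead of silently waiting forever.
public class PullStatusAwareListener implements ErrorListener {
    private volatile boolean consumerGone = false;

    @Override
    public void pullStatusError(Connection conn, JetStreamSubscription sub, Status status) {
        if ("Consumer Deleted".equals(status.getMessage())) {
            consumerGone = true; // application should recreate the consumer and resubscribe
        }
    }

    public boolean isConsumerGone() {
        return consumerGone;
    }
}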

@scottf
Contributor

scottf commented Dec 1, 2023

I've been running fetch against our chaos server. It finally had an outage of the server holding the stream leader. It was offline for about 5 minutes. The fetch example, looping and calling fetch, recovered.
If you use the endless "consume" you don't have to manage the loop. You still have to have durable consumers / long enough inactive thresholds, but the behavior in the latest client is solid.

462554253 | MONITOR | Uptime: 3H43M32.929S| Publisher: 583797 (Running) | fc-dur-wRAcQKosbG: 583797
462559254 | MONITOR | Uptime: 3H43M37.93S| Publisher: 584059 (Running) | fc-dur-wRAcQKosbG: 584059
462564254 | MONITOR | Uptime: 3H43M42.93S| Publisher: 584335 (Running) | fc-dur-wRAcQKosbG: 584335
462569255 | MONITOR | Uptime: 3H43M47.931S| Publisher: 584624 (Running) | fc-dur-wRAcQKosbG: 584624
462574255 | MONITOR | Uptime: 3H43M52.931S| Publisher: 584900 (Running) | fc-dur-wRAcQKosbG: 584900
462579256 | MONITOR | Uptime: 3H43M57.932S| Publisher: 585179 (Running) | fc-dur-wRAcQKosbG: 585179
462584260 | MONITOR | Uptime: 3H44M2.936S| Publisher: 585464 (Running) | fc-dur-wRAcQKosbG: 585464
462589273 | MONITOR | Uptime: 3H44M7.949S| Publisher: 585743 (Running) | fc-dur-wRAcQKosbG: 585743
462594274 | MONITOR | Uptime: 3H44M12.95S| Publisher: 586011 (Running) | fc-dur-wRAcQKosbG: 586011
462599275 | MONITOR | Uptime: 3H44M17.951S| Publisher: 586295 (Running) | fc-dur-wRAcQKosbG: 586295
462604275 | MONITOR | Uptime: 3H44M22.951S| Publisher: 586566 (Running) | fc-dur-wRAcQKosbG: 586566
462609276 | MONITOR | Uptime: 3H44M27.952S| Publisher: 586848 (Running) | fc-dur-wRAcQKosbG: 586848
462614276 | MONITOR | Uptime: 3H44M32.952S| Publisher: 587120 (Running) | fc-dur-wRAcQKosbG: 587120
462619277 | MONITOR | Uptime: 3H44M37.953S| Publisher: 587400 (Running) | fc-dur-wRAcQKosbG: 587400
462624277 | MONITOR | Uptime: 3H44M42.953S| Publisher: 587678 (Running) | fc-dur-wRAcQKosbG: 587678
462629280 | MONITOR | Uptime: 3H44M47.956S| Publisher: 587962 (Running) | fc-dur-wRAcQKosbG: 587962
462631968 | PUBLISHER | SEVERE exceptionOccurred, CONN: 48, EX: java.io.IOException: Read channel closed.
462631968 | MONITOR | SEVERE exceptionOccurred, CONN: 49, EX: java.io.IOException: Read channel closed.
462631975 | PUBLISHER | Connection: 4222 nats: connection disconnected
462631978 | MONITOR | Connection: 4222 nats: connection disconnected
462631983 | MONITOR | Connection: 4222 nats: connection reconnected
462631983 | PUBLISHER | Connection: 4222 nats: connection reconnected
462631985 | MONITOR | Connection: 4222 nats: subscriptions re-established
462631985 | PUBLISHER | Connection: 4222 nats: subscriptions re-established
462633970 | PUBLISHER | Timeout or no response waiting for NATS JetStream server
462636281 | MONITOR | Timeout or no response waiting for NATS JetStream server
462643289 | MONITOR | Timeout or no response waiting for NATS JetStream server
462650300 | MONITOR | Timeout or no response waiting for NATS JetStream server
462657313 | MONITOR | Timeout or no response waiting for NATS JetStream server
462664314 | MONITOR | Timeout or no response waiting for NATS JetStream server
462671327 | MONITOR | Timeout or no response waiting for NATS JetStream server
462678329 | MONITOR | Timeout or no response waiting for NATS JetStream server
462685342 | MONITOR | Timeout or no response waiting for NATS JetStream server
462692347 | MONITOR | Timeout or no response waiting for NATS JetStream server
462699349 | MONITOR | Timeout or no response waiting for NATS JetStream server
462706354 | MONITOR | Timeout or no response waiting for NATS JetStream server
462713368 | MONITOR | Timeout or no response waiting for NATS JetStream server
462720376 | MONITOR | Timeout or no response waiting for NATS JetStream server
462727397 | MONITOR | Timeout or no response waiting for NATS JetStream server
462734405 | MONITOR | Timeout or no response waiting for NATS JetStream server
462741406 | MONITOR | Timeout or no response waiting for NATS JetStream server
462748407 | MONITOR | Timeout or no response waiting for NATS JetStream server
462755407 | MONITOR | Timeout or no response waiting for NATS JetStream server
462762425 | MONITOR | Timeout or no response waiting for NATS JetStream server
462769426 | MONITOR | Timeout or no response waiting for NATS JetStream server
462774935 | MONITOR | stream is offline [10118]
462780442 | MONITOR | stream is offline [10118]
462785947 | MONITOR | stream is offline [10118]
462791451 | MONITOR | stream is offline [10118]
462796957 | MONITOR | stream is offline [10118]
462802461 | MONITOR | stream is offline [10118]
462807966 | MONITOR | stream is offline [10118]
462813470 | MONITOR | stream is offline [10118]
462818974 | MONITOR | stream is offline [10118]
462824477 | MONITOR | stream is offline [10118]
462829989 | MONITOR | stream is offline [10118]
462835504 | MONITOR | stream is offline [10118]
462841008 | MONITOR | stream is offline [10118]
462846521 | MONITOR | stream is offline [10118]
462852025 | MONITOR | stream is offline [10118]
462857535 | MONITOR | stream is offline [10118]
462863039 | MONITOR | stream is offline [10118]
462868544 | MONITOR | stream is offline [10118]
462874048 | MONITOR | stream is offline [10118]
462879552 | MONITOR | stream is offline [10118]
462885056 | MONITOR | stream is offline [10118]
462890562 | MONITOR | stream is offline [10118]
462896065 | MONITOR | stream is offline [10118]
462901582 | MONITOR | stream is offline [10118]
462907085 | MONITOR | stream is offline [10118]
462912588 | MONITOR | stream is offline [10118]
462918092 | MONITOR | stream is offline [10118]
462923598 | MONITOR | stream is offline [10118]
462929102 | MONITOR | stream is offline [10118]
462933797 | PUBLISHER | Restarting Publish
462934115 | MONITOR | Stream
    StreamConfiguration{
      "name":"app-stream",
      "subjects":["app-subject"],
      "retention":"limits",
      "storage":"file",
      "num_replicas":1,
      "discard":"old",
      "duplicate_window":120000000000,
      "consumer_limits":{}}
    ClusterInfo{
      name='nats',
      leader='nats-3',
      replicas=null}
462934116 | MONITOR | Uptime: 3H49M52.792S| Publisher: 588137 (Running) | fc-dur-wRAcQKosbG: 588117
462939119 | MONITOR | Uptime: 3H49M57.795S| Publisher: 588428 (Running) | fc-dur-wRAcQKosbG: 588428
462944119 | MONITOR | Uptime: 3H50M2.795S| Publisher: 588714 (Running) | fc-dur-wRAcQKosbG: 588714
462949124 | MONITOR | Uptime: 3H50M7.8S| Publisher: 589000 (Running) | fc-dur-wRAcQKosbG: 589000
462954129 | MONITOR | Uptime: 3H50M12.805S| Publisher: 589280 (Running) | fc-dur-wRAcQKosbG: 589280
462959129 | MONITOR | Uptime: 3H50M17.805S| Publisher: 589548 (Running) | fc-dur-wRAcQKosbG: 589548
462964130 | MONITOR | Uptime: 3H50M22.806S| Publisher: 589834 (Running) | fc-dur-wRAcQKosbG: 589834
462969130 | MONITOR | Uptime: 3H50M27.806S| Publisher: 590122 (Running) | fc-dur-wRAcQKosbG: 590122
462974130 | MONITOR | Uptime: 3H50M32.806S| Publisher: 590403 (Running) | fc-dur-wRAcQKosbG: 590403
462979131 | MONITOR | Uptime: 3H50M37.807S| Publisher: 590669 (Running) | fc-dur-wRAcQKosbG: 590669
462984132 | MONITOR | Uptime: 3H50M42.808S| Publisher: 590961 (Running) | fc-dur-wRAcQKosbG: 590961
462989132 | MONITOR | Uptime: 3H50M47.808S| Publisher: 591244 (Running) | fc-dur-wRAcQKosbG: 591244
462994135 | MONITOR | Uptime: 3H50M52.811S| Publisher: 591514 (Running) | fc-dur-wRAcQKosbG: 591514
462999136 | MONITOR | Uptime: 3H50M57.812S| Publisher: 591786 (Running) | fc-dur-wRAcQKosbG: 591786
463004150 | MONITOR | Uptime: 3H51M2.826S| Publisher: 592053 (Running) | fc-dur-wRAcQKosbG: 592053
463009167 | MONITOR | Uptime: 3H51M7.843S| Publisher: 592314 (Running) | fc-dur-wRAcQKosbG: 592314
463014174 | MONITOR | Uptime: 3H51M12.85S| Publisher: 592590 (Running) | fc-dur-wRAcQKosbG: 592590
463019174 | MONITOR | Uptime: 3H51M17.85S| Publisher: 592859 (Running) | fc-dur-wRAcQKosbG: 592859
463024174 | MONITOR | Uptime: 3H51M22.85S| Publisher: 593125 (Running) | fc-dur-wRAcQKosbG: 593125
463029175 | MONITOR | Uptime: 3H51M27.851S| Publisher: 593410 (Running) | fc-dur-wRAcQKosbG: 593410
463034175 | MONITOR | Uptime: 3H51M32.851S| Publisher: 593674 (Running) | fc-dur-wRAcQKosbG: 593674
463039176 | MONITOR | Uptime: 3H51M37.852S| Publisher: 593939 (Running) | fc-dur-wRAcQKosbG: 593939

@scottf scottf closed this as completed Dec 1, 2023
@albionb96

@scottf I don't know if it was clear from the beginning, but the test case which generates the worst results is the one where the consumer is connected to the leader node and exactly this node is shut down.

Test cases which generate bad results (problematic test cases):

  1. Kill the node the consumer is connected to.
  2. Kill the node where the leader resides and to which the consumer is also connected.
  3. Kill the node where the leader resides and to which both producer and consumer are connected.

Failover test steps:

  1. Deploy the nats cluster using the nats helm charts.

    Clarification: There is a StatefulSet with 3 nats server pods/replicas. Only 3 K8s Nodes are reserved for the nats server replicas, and each nats server replica can live only on one specific K8s Node. If one of these K8s Nodes is shut down, its nats server replica/pod won't start on another K8s Node. Only when this specific K8s Node comes up again can the nats server replica pod be redeployed.

  2. Deploy a producer pod and a consumer pod on separate K8s Nodes (also different from the 3 K8s Nodes used exclusively for the 3 nats server replica pods).

  3. Wait 120 seconds to let some messages flow in.
    (We have a message rate of 1000 msgs/sec and a message size of 1 kB.)

  4. Shut down the VM which hosts the K8s Node where the nats cluster leader resides and to which the consumer is connected.

    (We do not expect the nats replica which is down to be redeployed on another K8s Node, but we do expect the cluster to continue working with the two remaining nats server replicas, one of them to be elected as the new leader, and the consumer to reconnect to one of these two remaining nodes. And exactly here, sometimes the process happens as we expect without too much delay (8 seconds), and sometimes the consumer is stuck and never reconnects to one of the remaining two nats cluster nodes.)

  5. Wait 60 seconds to make sure that the failover is over, and then power on again the VM which was shut down earlier.

  6. Wait an additional 700 seconds.
    (Just to make sure that if the consumer isn't stuck, then all 500k messages were sent and consumed.)

@scottf
Contributor

scottf commented Dec 2, 2023

the test case which generates the worst results is the test case when the consumer is connected on the leader node and exactly this node is shut-down.

Yes, that's exactly what I'm testing for. I run a 3-node cluster on my single dev machine, determine which node is the leader and kill that one.

The output I showed you is from my non-manual server test, running against a cluster of servers in K8s with an external process designed to create chaos/mayhem by killing servers and bringing them back up. This line in the output I posted:

462774935 | MONITOR | stream is offline [10118]

It shows the leader being offline. My "MONITOR" is showing the result of a getStreamInfo management call that threw an exception because the server itself returned an error.

The outage is typically 5 minutes, as you can see where the consumer started getting messages again.

The output also shows the client, using a ConsumerContext and then repeatedly "Fetch"ing, recovering. The Fetch does not recover by itself. It ends. You just call Fetch again.

If you want an endless consume that does this for you, then you can use consume or iterate from a ConsumerContext

If you want Fetch, start with this example: https://github.com/nats-io/nats.java/blob/main/src/examples/java/io/nats/examples/jetstream/simple/FetchResilientExample.java
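
For completeness, a sketch of those two endless variants, assuming a ConsumerContext for a durable (or long-inactive-threshold) consumer; in real code you would pick one or the other:

import io.nats.client.ConsumerContext;
import io.nats.client.IterableConsumer;
import io.nats.client.Message;
import io.nats.client.MessageConsumer;

public class EndlessConsumeSketch {

    // Option A: endless callback-style consume; the library manages the underlying pulls
    // and heartbeats and keeps going across reconnects (as long as the consumer survives).
    public static MessageConsumer startConsume(ConsumerContext cc) throws Exception {
        return cc.consume(msg -> {
            // process the message, then ack
            msg.ack();
        });
    }

    // Option B: endless iterator-style consume.
    public static void iterate(ConsumerContext cc) throws Exception {
        try (IterableConsumer ic = cc.iterate()) {
            while (true) {
                Message m = ic.nextMessage(1000); // null just means nothing arrived within 1 second
                if (m != null) {
                    // process the message, then ack
                    m.ack();
                }
            }
        }
    }
}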

@stefanLeo
Author

@scottf : Thx for the feedback and confirmation that the setup is supported and OK. And good to hear that the client is working as expected!
The 5 minute failover time is definitely too high for our use case. We require a maximum failover time of <= 15 seconds for a node failure.

How can we make progress on improving that? Should I create a NATS server ticket and link our discussion here?

@scottf
Contributor

scottf commented Dec 4, 2023

@stefanLeo The 5 minutes is our chaos engine not bringing up the downed server for 5 minutes. This means that the example client/consumer survived a 5 minute leader server outage. The fetch example is designed to retry every 10 milliseconds. The endless consume implementation relies on the heartbeats set up via the consume options.

@stefanLeo
Author

stefanLeo commented Dec 4, 2023

@scottf : OK, understood. Our expectation is different in that we expect the NATS cluster to continue operating even though a node is down and may remain down for hours or even days. The client reconnects to another node and continues fetching.
In fact it even does so - unfortunately only in 50% of all runs. Note that the other 50% of the tests NEVER recovered - we let them run for 15 mins after the node failure (without restarting the failed node).

Does that scenario make sense for you?

@scottf
Contributor

scottf commented Dec 5, 2023

@stefanLeo On a stream with replicas (i.e. R3) the stream leader will switch to a different node. This is a much shorter outage that the consumer needs to survive.
I have been testing with R1 streams since it is a bigger failure and the stream is actually down.

@stefanLeo
Author

@scottf @derekcollison : Final update from my side on this topic: we managed to run > 200 successful failovers in a row with NATS 2.10.6 and the Java snapshot version discussed here. All passed successfully with a failover time < 10 sec!

So thank you very much for your help and support. Great work!
This really proves that NATS, incl. the Java lib, is rock solid (and we have tested some alternatives as well, which all struggled with failover).

@scottf
Contributor

scottf commented Dec 20, 2023

Thank you for the response and feedback. I'll have a release out this week.
