NATS Push Consumer Client Stuck after NATS Node Failure #997
Comments
What also caught my eye is that we added some code to measure the time it takes from publishing until the Exception is thrown - you can see this here: And the failover always takes exactly 15 secs... Why is that? And can we improve on that via some setting? UPDATE: This behaviour can be changed/set via the Request Timeout on JetStreamOptions -> Solved. |
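For reference, a minimal sketch of how that request timeout can be set on JetStreamOptions - the server URL, subject, and timeout value are illustrative, not taken from the original reproducer:

```java
import io.nats.client.Connection;
import io.nats.client.JetStream;
import io.nats.client.JetStreamOptions;
import io.nats.client.Nats;
import java.time.Duration;

public class RequestTimeoutSketch {
    public static void main(String[] args) throws Exception {
        try (Connection nc = Nats.connect("nats://host:4222")) {
            // Lower the JetStream API request timeout so publish acks (and other JS
            // requests) fail faster than the ~15 second failover observed above.
            JetStreamOptions jso = JetStreamOptions.builder()
                .requestTimeout(Duration.ofSeconds(3)) // illustrative value
                .build();
            JetStream js = nc.jetStream(jso);
            js.publish("example.subject", "hello".getBytes());
        }
    }
}
```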
Update: The consumer is really only stuck if the "NATS STREAM Leader" is killed. The failover works if any other node is killed. |
--- bootstrap is nats://host:4222, nats://host:5222, nats://host:6222, nats://host:7222, nats://host:8222
trying nats://host:7222 - tried, marked as failed once & retried 0 times
--- after being connected to 4222, server 4222 was brought down, there are no servers available.
trying nats://host:6222 - tried, marked as failed once & retried 0 times
trying nats://host:6222 - tried, marked as failed twice & retried 1 time
trying nats://host:6222 - tried, marked as failed 3 times & retried 2 times
trying nats://host:6222 - tried, marked as failed 4 times & retried 3 times
done retrying
|
Our tests reveal the following for your test scenarios:
|
Pull status error does not reflect that, that's probably something we should document better. Pull status error is for when there is a pull request that results in an error from the server because of some constraint. |
Great to hear! Let me know if we can help or in case there is something we can test. |
Just some update from our side - We updated to 2.10 without any change on the topic.
|
And one update - we did now switch from the "ordered" consumer to a custom consumer config with replicas set to 3, and now this works also in case the leader node fails. |
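A rough sketch of what such a custom, replicated push consumer configuration might look like with this client - the server URL, stream subject, and durable name are made-up placeholders:

```java
import io.nats.client.*;
import io.nats.client.api.ConsumerConfiguration;

public class ReplicatedPushConsumerSketch {
    public static void main(String[] args) throws Exception {
        try (Connection nc = Nats.connect("nats://host:4222")) {
            JetStream js = nc.jetStream();

            // Instead of an ordered consumer, use an explicit consumer configuration
            // that is replicated like the stream (R3), so the consumer state can
            // survive the loss of the stream/consumer leader node.
            ConsumerConfiguration cc = ConsumerConfiguration.builder()
                .durable("example-durable")   // illustrative name
                .numReplicas(3)
                .build();

            PushSubscribeOptions pso = PushSubscribeOptions.builder()
                .configuration(cc)
                .build();

            Dispatcher dispatcher = nc.createDispatcher();
            js.subscribe("example.subject", dispatcher, msg -> {
                // handle the message
                msg.ack();
            }, false, pso);

            Thread.sleep(60_000); // keep the example alive briefly
        }
    }
}
```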
|
@scottf : We found another issue with Leader Failure that happens only randomly: It's about the timespan here: ] [nats: connection reconnected] 2023-10-17 08:27:47:071 vs 2023-10-17 08:28:09:444 = 22 sec delta. NATS Server Logs - from a different test run - during the client freeze time: |
And another update: We tried the pull based subscriber now as well, with basically the same result:
once the stream leader node on the server side is killed, the consumer is stuck. It reconnects and successfully resubscribes, but then no new messages are consumed... |
Duplicate in nats-io/nats-server#4624 |
@stefanLeo As you have noticed, when using raw pull and the stream leader (or consumer leader) node is killed, the consumer does not recover. Raw pulls will never recover on their own; they cannot. You must pay attention to heartbeats and connection events. I will try to make an example. |
@scottf : Thx for the feedback. I will try to add a reconnection handler to my pull example and will share it with you guys once I have it. |
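For anyone following along, a minimal sketch of wiring up a connection listener so a raw pull loop can react to disconnect/reconnect events - the listener body, server URL, and sleep are illustrative:

```java
import io.nats.client.Connection;
import io.nats.client.ConnectionListener;
import io.nats.client.Nats;
import io.nats.client.Options;

public class ConnectionEventsSketch {
    public static void main(String[] args) throws Exception {
        // Log connection lifecycle events so a raw pull loop can react to
        // DISCONNECTED / RECONNECTED instead of waiting forever on a dead pull.
        ConnectionListener listener = (Connection conn, ConnectionListener.Events type) ->
            System.out.println("Connection event: " + type);

        Options options = new Options.Builder()
            .server("nats://host:4222")
            .connectionListener(listener)
            .maxReconnects(-1)
            .build();

        try (Connection nc = Nats.connect(options)) {
            // ... create the pull subscription here and, on RECONNECTED, issue a
            // new pull request rather than waiting on the old one.
            Thread.sleep(10_000);
        }
    }
}
```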
@scottf I tested the issue that @stefanLeo is having, using the pull consumer in three "modes": durable, non-durable with consumer replicas, and non-durable with no consumer replicas. The results are the following:
We see that the only time we got a reasonable, constant and predictable failover time is when we use the durable pull consumer. The problem in this case is the avg latency, which is much higher than when using a push consumer. Regarding the push consumer: I don't know if there is something that can be improved in this regard, but based on these results (at least), I couldn't find a silver bullet for all the issues. p.s. I was running an Openshift cluster on-prem (vSphere). A VM on vSphere corresponds in this case to a Kubernetes node, and in order to trigger the failover I would just shut down the VM where the relevant nats server replica pod was living. By "relevant nats server replica pod" I mean that the consumer was connected to this pod and at the same time the stream leader was also located on this pod. |
@scottf Any update on this? Which one from the 3 tested modes would you recommend for this use-case? |
I'm doing my best and working on it. I currently have a pr ready for changes to Simplification. I'm working on examples to demonstrate it. Things are moving slower than we all want but they are moving. Thanks for your patience. |
@albionb96 This affects the following consumers.
|
2.17.2-SNAPSHOT should be available, gradle/maven instructions here in the readme |
Fetch isn't going to recover because it's not endless. Once Fetch returns a null, it's done. Time might have expired anyway. I should have made my note refer to endless simplification, sorry for the confusion. |
@scottf : Can you point me to the endless API? THX |
This is a handler example:
ConsumerContext cc = streamContext.getConsumerContext("myConsumer");
MessageHandler handler = m -> {
    // do your work
    m.ack();
};
ConsumeOptions co = ConsumeOptions.builder()
    .batchSize(batchSize)   // batchSize and expiresIn are values chosen by the application
    .expiresIn(expiresIn)
    .build();
MessageConsumer mc = cc.consume(co, handler); // hold onto the handle for stop / isStopped |
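As a usage note, a self-contained sketch of the surrounding wiring, assuming the 2.17.x simplification API discussed in this thread and a stream named "myStream" (names and option values are illustrative):

```java
import io.nats.client.*;

public class EndlessConsumeSketch {
    public static void main(String[] args) throws Exception {
        try (Connection nc = Nats.connect("nats://host:4222")) {
            // Simplification entry points: stream context, then consumer context.
            StreamContext streamContext = nc.getStreamContext("myStream");       // assumed stream name
            ConsumerContext cc = streamContext.getConsumerContext("myConsumer"); // consumer from the example above

            ConsumeOptions co = ConsumeOptions.builder()
                .batchSize(100)   // illustrative values
                .expiresIn(5000)
                .build();

            MessageConsumer mc = cc.consume(co, m -> m.ack());

            // Let it run; hold onto mc for stop / isStopped as noted above.
            Thread.sleep(60_000);
        }
    }
}
```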
@scottf : Thx. Tested it with the |
Correct, fetch won't auto-recover because it's a one-time (not endless) call for an amount of messages or bytes over time, while it's possible that the fetch may not have expired by the time the client has recovered. But fetch is simple enough that once it's done or broken, it will return null, which is enough of a signal for the developer to make another fetch. I'll add an example when I get a chance. |
@scottf : Yes, but a simple refetch does not do the trick. I need to re-create my consumer in that case and tell it to start consuming at X. Otherwise all subsequent fetches fail with an exception. Sorry - that was too quick! I retested and it worked now. Will come back with more info after more tests! The only thing I changed compared to my previous test: I set the numberOfReplicas to 3 now. Is this the reason? |
Here is a pretty thorough fetch example: https://github.com/nats-io/nats.java/blob/main/src/examples/java/io/nats/examples/jetstream/simple/FetchResilientExample.java |
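The linked example is the authoritative one; as a very condensed sketch of the same retry-on-null idea (stream and consumer names are placeholders, and the 100-message batch is arbitrary):

```java
import io.nats.client.*;

public class FetchRetrySketch {
    public static void main(String[] args) throws Exception {
        try (Connection nc = Nats.connect("nats://host:4222")) {
            ConsumerContext cc = nc.getStreamContext("myStream")       // placeholder names
                                   .getConsumerContext("myConsumer");
            while (true) {
                // Each fetch is a one-time request; when it is done or broken it
                // yields null, which is the signal to simply start another fetch.
                FetchConsumer fc = cc.fetchMessages(100);
                Message m;
                while ((m = fc.nextMessage()) != null) {
                    // do your work
                    m.ack();
                }
                Thread.sleep(10); // brief pause before the next fetch attempt
            }
        }
    }
}
```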
@scottf : We ran 60 failover tests overnight and 33 of them failed, with two different failure patterns:
Note: There are just msecs between the last successfully consumed msg and this exception message thrown on "fetch". TestSetup:
How can we further investigate those issues? |
@stefanLeo |
Regarding the fetch example: there is some text at the top of the example. I followed it to test recovery. It assumes a durable consumer. Ephemeral is a crapshoot based on the inactive threshold; you could try making it long, 1-10 minutes, enough time to recognize and restart a downed server.
|
Client Version = 2.17.2-SNAPSHOT See my config:
|
Recovery is completely dependent upon the consumer configuration's inactive threshold being long enough to survive the outage, because if it isn't, the consumer is deleted and there is nothing that can be done.
I'm not 100% sure, but even with replication, memory storage is never going to recover. I'll add it to my test and figure it out when I get a chance. |
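A small sketch of setting such a threshold on a non-durable consumer configuration - the one-hour value is only an illustration of "long enough to survive the outage":

```java
import io.nats.client.api.ConsumerConfiguration;
import java.time.Duration;

public class InactiveThresholdSketch {
    public static void main(String[] args) {
        // If the inactive threshold is shorter than the outage, the server deletes
        // the consumer and recovery is impossible, as described above.
        ConsumerConfiguration cc = ConsumerConfiguration.builder()
            .numReplicas(3)                          // replicated, as used earlier in this thread
            .inactiveThreshold(Duration.ofHours(1))  // illustrative: longer than any expected outage
            .build();
        System.out.println(cc.toJson());
    }
}
```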
Thx. I set the inactiveThreshold now explicitly to a higher value. I think the 409 code is a special case, maybe caused by improper cleanup between the test runs... What puzzles me more are the cases where the client is just stuck without any indication of "why" and it just does not receive any more msgs. What do you mean by memory storage is not going to recover? The failover at least worked as expected in 50% of all cases :-) |
Got some feedback on memory storage. With replication, it might be ok; a lot depends on whether it's the stream/consumer leader that goes down or not, the consumer threshold, and the consumer R. There are several variables. As far as the 409, do you know which one? These are the ones I know about
|
I've been running fetch against our chaos server. It finally had an outage of the server with the stream leader. It was offline for about 5 minutes. The fetch example, looping calling fetch, recovered.
|
@scottf I don't know if it was clear from the beginning, but the test case that generates the worst results is the one where the consumer is connected to the leader node and exactly this node is shut down. Test cases that generate bad results (problematic test cases):
Failover test steps:
Clarification: There is a statefulset with 3 nats server pods/replicas. There are only 3 K8s nodes reserved for the nats server replicas, and each nats server replica can live only on a specific, separate K8s Node. If one of these K8s Nodes is shut down, its nats server replica/pod won't start on another K8s Node. Only when this specific K8s Node comes up again can the nats server replica pod be redeployed.
(We do not expect the nats replica which is down to be redeployed on another K8s Node, but we expect that the cluster continues to work with two nats server replicas, that one of them is chosen as the new leader, and that the consumer is connected to one of these two remaining nodes. And exactly here, sometimes the process happens as we expect without too much delay (8 seconds), and sometimes the consumer is stuck and cannot be reconnected to one of the remaining two nats cluster nodes.)
|
Yes, that's exactly what I'm testing for. I run a 3-node cluster on my single dev machine, determine which node is the leader, and kill that one. The output I showed you is from my non-manual server test, running against a cluster of servers in K8s with an external process designed to create chaos/mayhem by killing servers and bringing them back up. This is in the output I posted:
It shows the leader being offline. My "MONITOR" is showing the result of a
The outage is typically 5 minutes, as you can see where the consumer started getting messages again. The output also shows the client, using a ConsumerContext and then repeatedly "Fetch"ing, recovering. The Fetch does not recover by itself. It ends. You just call Fetch again. If you want an endless consume that does this for you, then you can use
If you want Fetch, start with this example: https://github.com/nats-io/nats.java/blob/main/src/examples/java/io/nats/examples/jetstream/simple/FetchResilientExample.java |
@scottf : Thx for the feedback and confirmation that the setup is supported and OK. And good to hear that the client is working as expected! How can we progress on improving on that? Should I create a NATS server ticket and link our discussion here? |
@stefanLeo The 5 minutes is our chaos engine not bringing up the downed server for 5 minutes. This means that the example client/consumer survived a 5 minute leader server outage. The fetch example is designed to retry every 10 milliseconds. The endless consume implementation relies on the heartbeats setup with the consume options. |
@scottf : OK, understood - our expectation is different in that we expect the NATS cluster to continue operations even though a node is down and may remain down for hours or even days. The client reconnects to another node and continues fetching. Does that scenario make sense to you? |
@stefanLeo On a stream with replicas (i.e. R3) the stream leader will switch to a different node. This is a much shorter outage that the consumer needs to survive. |
@scottf @derekcollison : Final update from my side on this topic: We managed to run > 200 successful failovers in a row with NATS 2.10.6 and the Java Snapshot version discussed in here. All passed successfully with a failover time < 10 sec! So Thank you very much for your help and support. Great Work! |
Thank you for the response and feedback. I'll have a release out this week. |
Observed behavior
We re-used the NATS JetStream Producer and PUSH Consumer examples from https://github.com/nats-io/nats.java/blob/main/src/examples/java/io/nats/examples/jetstream/NatsJsPushSubBasicAsync.java
and then killed the NATS leader node of the stream (we forcefully killed the VM hosting the Kube Worker Node).
Setup: In Memory Storage Option, 3 Replicas.
Config of createStream was changed to:
StreamConfiguration sc = StreamConfiguration.builder()
.name(streamName)
.storageType(storageType)
.subjects(subjects)
.replicas(3)
.description("LifeX-Test")
.build();
and connection builder to
Options.Builder builder = new Options.Builder()
.server(servers)
.connectionTimeout(Duration.ofMillis(500))
.pingInterval(Duration.ofSeconds(3))
.maxPingsOut(2)
.reconnectWait(Duration.ofMillis(500))
.connectionListener(EXAMPLE_CONNECTION_LISTENER)
.traceConnection()
.errorListener(EXAMPLE_ERROR_LISTENER)
.maxReconnects(-1)
.reconnectDelayHandler(new PsReconnectDelayHandler())
.reconnectJitter(Duration.ofMillis(500));
When connecting, we configure all 3 servers of the cluster and register connection, error and delay handlers (basically just logging the callbacks).
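For completeness, a sketch of how these two builders are presumably wired together - the variable names, stream name, and subject are illustrative, not taken from the reproducer:

```java
import io.nats.client.*;
import io.nats.client.api.StorageType;
import io.nats.client.api.StreamConfiguration;
import java.time.Duration;

public class ReproSetupSketch {
    public static void main(String[] args) throws Exception {
        Options options = new Options.Builder()
            .server("nats://host:4222")  // in the real test, all 3 cluster servers are configured
            .connectionTimeout(Duration.ofMillis(500))
            .maxReconnects(-1)
            .build();

        try (Connection nc = Nats.connect(options)) {
            // Create the replicated, in-memory stream described above.
            StreamConfiguration sc = StreamConfiguration.builder()
                .name("lifex-test-stream")   // illustrative
                .storageType(StorageType.Memory)
                .subjects("lifex.test.>")    // illustrative
                .replicas(3)
                .description("LifeX-Test")
                .build();

            JetStreamManagement jsm = nc.jetStreamManagement();
            jsm.addStream(sc);

            // The producer / consumer examples linked above (NatsJsPub /
            // NatsJsPushSubBasicAsync) are then run against this stream.
        }
    }
}
```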
Setup Environment: NATS Cluster with 3 NATS Pods on top of RHAT Openshift Kubernetes cluster.
After the failure of the NATS leader node, the following happens:
Note as well that the consumer aborts once the producer is done and deletes the stream. Then some disconnect log is printed.
Logs of NATS nodes are attached... I cannot really add logs of the java client as there are none; it seems to just remain stuck indefinitely. UPDATE: Added java client logs with traceConnection settings and now we see more details.
The Client seems to reconnect and resubscribe, but still does NOT get any further messages pushed...
Expected behavior
Producer detects the failure, reconnects and continues sending.
Consumer detects the failure, reconnects and continues consuming.
Server and client version
Server: 2.9.22
Java Client: 2.16.14
Host environment
We used the official NATS container images and the HELM charts for deployment.
Steps to reproduce
Setup 3 node cluster on RHAT Openshift or any other Kubernetes Cluster
Start producer with settings as above > https://github.com/nats-io/nats.java/blob/main/src/examples/java/io/nats/examples/jetstream/NatsJsPub.java
Start consumer with settings as above > https://github.com/nats-io/nats.java/blob/main/src/examples/java/io/nats/examples/jetstream/NatsJsPushSubBasicAsync.java
Kill the leader node (find the stream leader using the nats cli)
Logs:
nats-server-2-logs.txt
nats-server-1-logs.txt
logs.zip