-
Notifications
You must be signed in to change notification settings - Fork 1.2k
New issue
Have a question about this project? # for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “#”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? # to your account
fix: always allow selecting a new node for cluster mode subscriptions when the current one fails #1589
Conversation
… when the current one fails
@luin any thoughts on this fix? |
Hey @headlessme , Thanks for the PR. I think the current code already handles subscriber reselecting: When a node is lost (either it is down or there are network issues), and it happens to be the subscriber, we will perform reselecting: https://github.com/luin/ioredis/blob/main/lib/cluster/ClusterSubscriber.ts#L22. Is there anything not working so we need to catch it as this PR does? |
@luin Thanks for looking into this. Yes I believe there is a case that this doesn't catch that we have seen in production. This happens when a node dies without a clean shutdown (e.g. via docker kill / hardware failure) and then the node fails to come back up for whatever reason. The connection won't get an The existing logic works in the case of a clean shutdown, where the connection is manually closed as the We managed to recreate it consistently as follows:
Another solution might be to define a more appropriate retry strategy for the subscriber connection, which fails quickly so the |
@headlessme Thanks for the details! I just tested locally with Redis's built-in create-cluster script and everything works for me. I kill the subscriber node with Reconnection is disabled for all cluster nodes: https://github.com/luin/ioredis/blob/main/lib/cluster/ConnectionPool.ts#L75, so although the subscriber may try to reconnect, the node will emit a close event and then immediately an end event. So seems to me, that if |
The reconnection is disabled on the cluster connections, but the |
Also note the comment above the connection constructor: |
Yeah, I realized that, but we have a listener for the So here is how it works:
|
Here's some logs with
No matter how long I leave it, the subscriber never recovers, so there's definitely a bug here somewhere it seems, despite 5 of the 6 nodes still being alive. |
Seems it could be caused by the use of |
Hmm yeah, that could be the reason. Nice catch! I don't recall why we rely on |
Yea, sounds like that should work. I'll try it out! |
Seems to work, although one thing to note when removing the |
Yeah I think we should keep the |
@luin I've updated the PR – Does this match what you're expecting from our discussion? |
It does! I'll double check it later this week. |
Tests failed. Pretty strange. Haven't found out the cause. |
Any idea if the failures are specifically related to these changes? |
I'm not sure but I don't think the changes break the functionalities covered by the failing tests. However we have to find out the cause to make sure. |
After some debuggings it turns out the new code creates a dead loop, which caused test failed. The breakdown:
I think a quick fix is to unset |
71adbb8
to
2baccc8
Compare
…r to prevent dead loop
2baccc8
to
ca54b37
Compare
@luin I've changed the logic to remove the Looks like the tests were failing because of some missing |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM. Thanks!
## [5.2.1](v5.2.0...v5.2.1) (2022-07-16) ### Bug Fixes * always allow selecting a new node for cluster mode subscriptions when the current one fails ([#1589](#1589)) ([1c8cb85](1c8cb85))
🎉 This PR is included in version 5.2.1 🎉 The release is available on: Your semantic-release bot 📦🚀 |
## [5.2.1](redis/ioredis@v5.2.0...v5.2.1) (2022-07-16) ### Bug Fixes * always allow selecting a new node for cluster mode subscriptions when the current one fails ([#1589](redis/ioredis#1589)) ([1c8cb85](redis/ioredis@1c8cb85))
We ran into an issue where some of our redis clients would miss messages from the pubsub bus following a network issue / crash in the redis cluster. This PR changes the reconnect logic in the ClusterSubscriber such that it can pick to connect to a different node when the current connection closes. The idea being that if a single node is having issues, the subscriber can connect immediately to another member of the cluster so it doesn't miss as many messages.