Failed to join: Member has conflicting node ID #3070
Hi @eladitzhakian this'll be generally covered by #1580 once that's complete, but there should be a way to get this working in the meantime. Option 4 should work no matter what - are you sure the force-leave is actually working? It's by node name, not IP. To understand this more, what's preserved when you restart or relaunch a machine? Is it the same node name before and after?
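For reference, a minimal sketch of checking membership and force-leaving by node name; `node-a` below is a placeholder, not a name from this thread:

```sh
# List members as this agent sees them; the first column is the node name,
# which is what force-leave keys on (not the IP).
consul members

# Remove the stale/conflicting entry by its node name (placeholder shown).
consul force-leave node-a
```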
Thanks for your reply @slackpad. Usually the same node name is preserved, but as I said it doesn't help when I force a new node ID or ask Consul to disable the host node ID. I'm getting the same error with the new node ID :/
Hmm do you have a simple repro case that'll show the force-leave not working? I'm not sure what's happening with this one.
I'm afraid this is not easily reproduced, but it does happen on both clusters I'm maintaining. Each cluster has 3 Consul server nodes and about 20 nodes running a single agent, each with a registrator container running next to it. Everything is dockerized.
@eladitzhakian I've been seeing a similar issue when deploying with the Kubernetes Helm Chart. There are two places I've noticed possible conflicts:
You didn't mention if you are running on k8s, but I suspect this might be a problem for other cluster managers/orchestration systems with similar concepts of StatefulSets and PersistentVolumes.
@nrvale0 thanks, but I'm not using k8s or any orchestration manager :/
Hi all, the same is observed in 0.8.4; I am clustering Consul in Docker Swarm using a dedicated overlay network. @slackpad, what's the possible workaround for this issue in the current release, and what's the schedule for the fix? Thanks!
@hehailong5 helm/charts#1289 references a config option for Consul: -disable-host-node-id (https://www.consul.io/docs/agent/options.html#_disable_host_node_id). I'm testing the use of that option in the Kubernetes Helm Chart for Consul now and it's been working fine, but I've also not spent a lot of time digging into it to fully understand any downstream consequences. I should also point out that the host node ID gets cached if the container is using a persistent storage volume, so transitioning to -disable-host-node-id likely involves destroying/cleaning the container's previous storage volumes. The OP says that still does not solve it for him 100% of the time, but I've not yet seen the combo of -disable-host-node-id + a fresh volume fail.
@hehailong5, as @nrvale0 points out, you can set disable_host_node_id to true. There's an open ticket for making that the default behavior in the 0.9.0 release: #3171. There shouldn't be any other downstream implications - it will generate a random node ID if that option is set. As long as you don't have any monitoring/Nagios alerts that rely on node ID values in your environment, you should be fine.
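For anyone finding this later, a rough sketch of both ways to set it; the paths here (/etc/consul.d, /var/consul) are assumptions about a typical layout, not anything specific to this thread:

```sh
# Option A: pass the CLI flag when starting the agent.
consul agent -disable-host-node-id -data-dir=/var/consul

# Option B: set the equivalent key in a JSON config file loaded via -config-dir.
cat > /etc/consul.d/node-id.json <<'EOF'
{
  "disable_host_node_id": true
}
EOF
consul agent -config-dir=/etc/consul.d -data-dir=/var/consul
```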
Oops, we already have such an option to prevent this from happening; we'll try that out, thank you guys!
We have a very similar setup to the original issue by @eladitzhakian (Terraform + Consul running as an ECS task). What's different is that we're using an ECS Service to handle task deployment, so when we change the consul agent task definition, no EC2 recreation happens. What's important, the tasks had mounted Still, redeployment of the task causes the master to reject clients with a similar message:
What's weird in this situation is that the error message has nothing to do with the node I'm getting the log from. It looks like node A cannot join the cluster because nodes B and C have conflicting node IDs. The workaround that worked for me was to use the
Also happens in my environment, in version 0.8.5. A conflict between any two nodes blocks agent joins for any other nodes. I pretty much had to clean the data, revert back to 0.7, and restore the KV store from backup. I would really emphasize this behavior in the documentation for the benefit of anyone considering an upgrade.
Same thing here... I think this should be a big warning in the "breaking changes" part of the changelog; the config doc says that in 0.8.5 this is a MAJOR issue when working with AMIs: you need to ensure that when you build the AMI of the client (or even of the Consul servers) you stop the consul agent and remove /var/consul/node-id. Otherwise, any instance launched from that AMI (i.e. an autoscaling group launching instances of the same app when it needs to scale) will break, since all of them will have the same ID. One way to fix this would be to ensure that when the agent starts it removes/replaces the node-id file with a new ID (like it would do with a pidfile or something like that).
Hi @sebamontini, that's an interesting consequence of this. In general it's not a good idea to bake a data-dir into your AMIs, since there are other things in there that could also cause issues, like Raft data, Serf snapshots, etc. I'd definitely recommend that your AMI build process either doesn't start Consul to populate the data-dir, or shuts down Consul at the end and clears it out. There should be no need to keep anything other than Consul's configuration files in the AMI.
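As a concrete sketch of that recommendation, the last step of an AMI build could look something like this; the service name and the /var/consul path are assumptions based on the setup described above:

```sh
# Final AMI-bake step: stop the agent and wipe its data dir so no node-id,
# Raft data, or Serf snapshots are baked into the image. Only Consul's
# configuration files should remain in the AMI.
sudo systemctl stop consul      # or: sudo service consul stop
sudo rm -rf /var/consul/*
```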
@slackpad it wasn't part of the idea: since the automated process that bakes the AMI first provisions the vanilla instance and then installs everything, we wanted to start up Consul to test that everything is running properly. We will add one more step to ensure we clear all the data.
I'm running into this problem as well, trying to set up a 6-node "cluster" on my laptop to test the migration from 0.7.5 to 0.9.2. I've got a script that sets up all 6 Consul instances with unique data dirs, ports, etc. With 0.7.5 I can get a fully working cluster. When I attempt to upgrade one client to 0.9.2 -- even with This only affects my proof-of-concept, not a production workload, but it does mean I'm gonna have to spin up a fleet of VMs, or just provision some real instances. :-(
I think the problem, @blalor, is that in 0.7.5 the default was to use the host node ID, so when you upgrade to 0.9.x all 6 nodes are using the same ID. You could either delete the node-id data before upgrading each node, or create the 0.7.5 cluster with
That option's not supported in 0.7.5.
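A rough per-node sketch of the first path mentioned above (deleting the node-id before upgrading); the data-dir path and service name are assumptions:

```sh
# On each node, before swapping in the newer Consul binary:
sudo systemctl stop consul
sudo rm -f /var/consul/node-id   # the agent generates a new (random, since 0.8.5) ID on next start
# ...install the new consul binary, then:
sudo systemctl start consul
```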
I'm having sort of the same issue. My servers and agents are 0.7.4. All the agents are Docker containers, so all agent IDs are the same. Long story short, there is no Consul version that will generate a random node ID without checking if there is a conflict. This would have allowed me to upgrade in two steps. This is happening in our dev, test, and production environments. EDIT: I've decided to get rid of Consul inside the containers and solve the issue as described here: https://medium.com/zendesk-engineering/making-docker-and-consul-get-along-5fceda1d52b9
@blalor for your use case in dev you can start each old 0.7.5 agent with something like
I'm thinking the best way to make this easier for operators is to only enforce unique host IDs for Consul agents running version 0.8.5 or later (that's when we made host-based IDs opt-in). This would be a small code change, and would help interoperability for folks who are skipping several major versions. If you have large pools of older agents, this gets to be a pain.
Thanks, @slackpad. I got past my issue and have completely upgraded to 0.9.2.
I'm using dockerized Consul agents and servers, version 0.8.1, and I'm also using Terraform to manage the infrastructure.
Whenever I relaunch (destroy+create) a machine, or simply reboot it, the consul agent fails to rejoin the cluster with this error:
The consul servers show the corresponding error:
I tried everything the mailing list suggests:
Nothing. Works.
Lastly I downgraded to 0.7.5 (where uniqueness is not enforced) and the agent was able to rejoin the cluster. What are my options?