emqx clustered base on dns mode, when a node in the cluster is destroyed, other nodes can't restart successfully #9294

Closed
bennyQi opened this issue Nov 3, 2022 · 2 comments

Comments


bennyQi commented Nov 3, 2022

What happened?

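With EMQX configured for DNS-based clustering, when a node in the cluster is destroyed, the EMQX process on the other nodes fails to start when they are restarted.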

What did you expect to happen?

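In a cluster formed from DNS A records, other nodes exiting or being destroyed should not affect restarting the remaining nodes in the cluster.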

How can we reproduce it (as minimally and precisely as possible)?

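Step 1: Configure DNS A records that resolve to two hosts, a and b;
Step 2: Start EMQX on hosts a and b so that they form a cluster;
Step 3: Kill the EMQX node on host a;
Step 4: Restart the EMQX node on host b and observe the process state.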

Anything else we need to know?

No.

EMQX version

EMQX 5.0.9

OS version

Linux 2a22fa71363b 5.13.0-51-generic #58~20.04.1-Ubuntu SMP Tue Jun 14 11:29:12 UTC 2022 x86_64 GNU/Linux

Log files

<details>
root@2a22fa71363b:/home# emqx foreground
!!!!!!
WARNING: Default (insecure) Erlang cookie is in use.
WARNING: Configure node.cookie in /opt/emqx/etc/emqx.conf or override from environment variable EMQX_NODE__COOKIE
NOTE: Use the same config value for all nodes in the cluster.
!!!!!!
log.file_handlers.default.enable = EMQX_LOG__FILE_HANDLERS__DEFAULT__ENABLE = false
log.console_handler.enable = EMQX_LOG__CONSOLE_HANDLER__ENABLE = true
Listener ssl:default on 0.0.0.0:8883 started.
Listener tcp:default on 0.0.0.0:1883 started.
Listener ws:default on 0.0.0.0:8083 started.
Listener wss:default on 0.0.0.0:8084 started.
Listener http:dashboard on :18083 started.
EMQX 5.0.9 is running now!
2022-11-03T09:37:28.542400+00:00 [warning] line: 121, mfa: mria:stop/1, msg: Stopping mria, reason: join
Stop listener http:dashboard on :18083 successfully.
Listener ssl:default on 0.0.0.0:8883 stopped.
Listener tcp:default on 0.0.0.0:1883 stopped.
Listener ws:default on 0.0.0.0:8083 stopped.
Listener wss:default on 0.0.0.0:8084 stopped.
2022-11-03T09:37:35.730801+00:00 [warning] global: 'emqx@172.17.0.6' failed to connect to 'emqx@10.178.40.221'
2022-11-03T09:37:35.734828+00:00 [warning] global: 'emqx@172.17.0.6' failed to connect to 'emqx@10.178.179.42'
2022-11-03T09:37:35.773997+00:00 [error] Mnesia('emqx@172.17.0.6'): ** ERROR ** Mnesia on 'emqx@172.17.0.6' could not connect to node(s) ['emqx@10.178.179.42','emqx@10.178.40.221']
2022-11-03T09:37:42.861866+00:00 [error] Mnesia('emqx@172.17.0.6'): ** ERROR ** Mnesia on 'emqx@172.17.0.6' could not connect to node(s) ['emqx@10.178.179.42','emqx@10.178.40.221']
2022-11-03T09:37:49.950346+00:00 [error] Mnesia('emqx@172.17.0.6'): ** ERROR ** Mnesia on 'emqx@172.17.0.6' could not connect to node(s) ['emqx@10.178.179.42','emqx@10.178.40.221']
2022-11-03T09:37:57.053313+00:00 [error] Mnesia('emqx@172.17.0.6'): ** ERROR ** Mnesia on 'emqx@172.17.0.6' could not connect to node(s) ['emqx@10.178.179.42','emqx@10.178.40.221']
2022-11-03T09:38:04.144287+00:00 [error] Mnesia('emqx@172.17.0.6'): ** ERROR ** Mnesia on 'emqx@172.17.0.6' could not connect to node(s) ['emqx@10.178.179.42','emqx@10.178.40.221']
2022-11-03T09:38:11.236385+00:00 [error] Mnesia('emqx@172.17.0.6'): ** ERROR ** Mnesia on 'emqx@172.17.0.6' could not connect to node(s) ['emqx@10.178.179.42','emqx@10.178.40.221']
2022-11-03T09:38:18.361215+00:00 [error] Mnesia('emqx@172.17.0.6'): ** ERROR ** Mnesia on 'emqx@172.17.0.6' could not connect to node(s) ['emqx@10.178.179.42','emqx@10.178.40.221']
2022-11-03T09:38:25.468491+00:00 [error] Mnesia('emqx@172.17.0.6'): ** ERROR ** Mnesia on 'emqx@172.17.0.6' could not connect to node(s) ['emqx@10.178.179.42','emqx@10.178.40.221']
2022-11-03T09:38:32.552148+00:00 [error] Mnesia('emqx@172.17.0.6'): ** ERROR ** Mnesia on 'emqx@172.17.0.6' could not connect to node(s) ['emqx@10.178.179.42','emqx@10.178.40.221']
2022-11-03T09:38:39.721480+00:00 [error] Mnesia('emqx@172.17.0.6'): ** ERROR ** Mnesia on 'emqx@172.17.0.6' could not connect to node(s) ['emqx@10.178.179.42','emqx@10.178.40.221']
@bennyQi bennyQi added the BUG label Nov 3, 2022
@lafirest lafirest changed the title emqx 配置dns的集群方式,当有集群内有节点销毁时,其他节点重启时进程启动失败 emqx clustered base on dns mode, when a node in the cluster is destroyed, other nodes can't restart successfully Nov 4, 2022
@ieQu1 ieQu1 self-assigned this Nov 10, 2022
Member

ieQu1 commented Nov 10, 2022

Hello,
I would like to ask for some additional information:

  1. You mentioned that you configured 2 nodes in the DNS A records, but in the logs I see that there are three nodes involved: emqx@172.17.0.6, emqx@10.178.40.221 and emqx@10.178.179.42. I assume emqx@10.178.179.42 is node b, is this correct? Which one is node a?
  2. Did you wait for node b to become fully up before restarting node a? The autocluster procedure works like this:
    1. A new node finds the list of potential cluster nodes it can join.
    2. It checks that these nodes are up and running.
    3. If the previous condition is met, the EMQX application is stopped on the local node and data from the oldest remote node is copied to the local node.
    4. Once the data transfer is complete, the EMQX application is restarted, and the node becomes part of the cluster.

However, if you restart the remote node between steps 3 and 4, the situation is unrecoverable:
EMQX doesn't have a full copy of the data, so it cannot proceed.
According to the logs, step 3 happened at 09:37:28, and at 09:37:35 EMQX detected that the remote nodes were unreachable. Did you kill node a during this period?
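
To make the failure window concrete, here is a minimal Python sketch of the four steps above. It is a toy model, not EMQX or Mria code: `Peer`, `JoinResult`, `autocluster_join` and the `peer_dies_during_copy` flag are hypothetical names, the peer list is assumed to be sorted oldest first, and the `NOT_STARTED` outcome is an assumption about what happens when no peer is up. The last lines only compute the gap between the two log timestamps quoted above.

```python
from dataclasses import dataclass
from datetime import datetime
from enum import Enum, auto


class JoinResult(Enum):
    JOINED = auto()       # step 4 finished, node is part of the cluster
    NOT_STARTED = auto()  # assumption: no peer was up, join never attempted
    STUCK = auto()        # source peer vanished between steps 3 and 4


@dataclass
class Peer:
    name: str
    up: bool = True


def autocluster_join(peers, peer_dies_during_copy=False):
    """Toy model of the four autocluster steps (hypothetical, not EMQX code)."""
    # Step 1: collect candidate nodes (in DNS mode these would come from A records).
    candidates = list(peers)
    # Step 2: keep only the candidates that are up and running.
    alive = [p for p in candidates if p.up]
    if not alive:
        return JoinResult.NOT_STARTED
    # Step 3: local application is stopped and data is copied from the oldest peer.
    # Assumption: the list is ordered oldest first.
    source = alive[0]
    if peer_dies_during_copy:
        source.up = False  # simulate killing the remote node mid-copy
    # Step 4: the local application can restart only if the copy completed.
    if not source.up:
        return JoinResult.STUCK
    return JoinResult.JOINED


# Gap between "Stopping mria, reason: join" and the first "failed to connect" line.
step3 = datetime.fromisoformat("2022-11-03T09:37:28.542400+00:00")
unreachable = datetime.fromisoformat("2022-11-03T09:37:35.730801+00:00")
window_seconds = (unreachable - step3).total_seconds()

if __name__ == "__main__":
    print(autocluster_join([Peer("a")]))
    print(autocluster_join([Peer("a")], peer_dies_during_copy=True))
    print(f"window: {window_seconds:.1f} s")
```

In this model the `STUCK` outcome corresponds to the repeated Mnesia "could not connect to node(s)" errors in the log: the local node has already stopped its application but has no complete copy of the data to restart with. The timestamps put that window at roughly seven seconds.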

@id id added #triage/wait and removed BUG labels Oct 9, 2023
@github-actions

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

@github-actions github-actions bot closed this as not planned Oct 23, 2023