[BUG] Adding an ARM node to a k3s cluster fails: v3.11.5 works, but v3.11.6 has network issues #21088

Closed
elvizlai opened this issue Aug 23, 2024 · 0 comments
Labels
bug Something isn't working


elvizlai commented Aug 23, 2024

What happened:
After setting up the k3s cluster, adding two x86 hosts worked fine, but adding an ARM machine failed with the error below (all machines are in the same datacenter):

[info 2024-08-23 15:31:27 hostinfo.NewNIC(hostinfohelper.go:241)] IP 172.20.1.115/br0/enp1s0f0np0
[info 2024-08-23 15:31:27 hostbridge.(*SBaseBridgeDriver).ConfirmToConfig(hostbridge.go:180)] bridge br0 already has ip 172.20.1.115
[info 2024-08-23 15:31:27 hostinfo.NewNIC(hostinfohelper.go:291)] Confirm to configuration!!
[info 2024-08-23 15:31:27 hostinfo.NewNIC(hostinfohelper.go:241)] IP /br1/bond0
[info 2024-08-23 15:31:27 netutils2.(*SNetInterface).IsSecretAddress(netutils.go:352)] MASK ---
[info 2024-08-23 15:31:27 hostinfo.NewNIC(hostinfohelper.go:291)] Confirm to configuration!!
[info 2024-08-23 15:31:27 hostinfo.(*SNIC).SetupDhcpRelay(hostinfohelper.go:203)] Not enable dhcp relay on nic: &hostinfo.SNIC{Inter:"enp1s0f0np0", Bridge:"br0", Ip:"172.20.1.115", Wire:"", WireId:"", Mask:24, Bandwidth:1000, BridgeDev:(*hostbridge.SOVSBridgeDriver)(0x4002578630), dhcpServer:(*hostdhcp.SGuestDHCPServer)(0x4002579020)}
[info 2024-08-23 15:31:27 hostinfo.(*SNIC).SetupDhcpRelay(hostinfohelper.go:203)] Not enable dhcp relay on nic: &hostinfo.SNIC{Inter:"bond0", Bridge:"br1", Ip:"", Wire:"bcast1", WireId:"", Mask:0, Bandwidth:1000, BridgeDev:(*hostbridge.SOVSBridgeDriver)(0x4001e365a0), dhcpServer:(*hostdhcp.SGuestDHCPServer)(0x4001e36ff0)}
[info 2024-08-23 15:31:27 hostinfo.(*SHostInfo).setupOvnChassis(hostinfo.go:223)] Start setting up ovn chassis
goroutine 1 [running]:
runtime/debug.Stack()
        /usr/lib/go/src/runtime/debug/stack.go:24 +0x68
runtime/debug.PrintStack()
        /usr/lib/go/src/runtime/debug/stack.go:16 +0x20
yunion.io/x/onecloud/pkg/util/ovnutils.InitOvn.func1()
        /root/go/src/yunion.io/x/onecloud/pkg/util/ovnutils/ovnutils.go:125 +0x40
panic({0x2644460, 0x40008d4780})
        /usr/lib/go/src/runtime/panic.go:838 +0x20c
yunion.io/x/onecloud/pkg/util/ovnutils.mustPrepOvsdbConfig({{0x40016e8120, 0x1b}, {0x40016c1930, 0x5}, {0x0, 0x0}, {0x0, 0x0}, 0x5dc, {0x40016c1958, ...}, ...})
        /root/go/src/yunion.io/x/onecloud/pkg/util/ovnutils/ovnutils.go:93 +0x5a0
yunion.io/x/onecloud/pkg/util/ovnutils.InitOvn({{0x40016e8120, 0x1b}, {0x40016c1930, 0x5}, {0x0, 0x0}, {0x0, 0x0}, 0x5dc, {0x40016c1958, ...}, ...})
        /root/go/src/yunion.io/x/onecloud/pkg/util/ovnutils/ovnutils.go:130 +0xc0
yunion.io/x/onecloud/pkg/hostman/hostinfo.(*OvnHelper).Init(...)
        /root/go/src/yunion.io/x/onecloud/pkg/hostman/hostinfo/hostovn.go:41
yunion.io/x/onecloud/pkg/hostman/hostinfo.(*SHostInfo).setupOvnChassis(0x4000b81ce0?)
        /root/go/src/yunion.io/x/onecloud/pkg/hostman/hostinfo/hostinfo.go:225 +0x15c
yunion.io/x/onecloud/pkg/hostman/hostinfo.(*SHostInfo).Init(0x51354d0?)
        /root/go/src/yunion.io/x/onecloud/pkg/hostman/hostinfo/hostinfo.go:210 +0xe8
yunion.io/x/onecloud/pkg/hostman.(*SHostService).RunService(0x4000236398?)
        /root/go/src/yunion.io/x/onecloud/pkg/hostman/host_services.go:80 +0x5c
yunion.io/x/onecloud/pkg/cloudcommon/service.(*SServiceBase).StartService(0x400000e108)
        /root/go/src/yunion.io/x/onecloud/pkg/cloudcommon/service/services.go:58 +0xe0
yunion.io/x/onecloud/pkg/hostman.StartService(...)
        /root/go/src/yunion.io/x/onecloud/pkg/hostman/host_services.go:167
main.main()
        /root/go/src/yunion.io/x/onecloud/cmd/host/main.go:30 +0x124
goroutine 1 [running]:
runtime/debug.Stack()
        /usr/lib/go/src/runtime/debug/stack.go:24 +0x68
runtime/debug.PrintStack()
        /usr/lib/go/src/runtime/debug/stack.go:16 +0x20
yunion.io/x/log.Fatalf({0x2b3a53c, 0x1c}, {0x400052be88, 0x1, 0x1})
        /root/go/src/yunion.io/x/onecloud/vendor/yunion.io/x/log/log.go:138 +0x34
yunion.io/x/onecloud/pkg/hostman.(*SHostService).RunService(0x4000236398?)
        /root/go/src/yunion.io/x/onecloud/pkg/hostman/host_services.go:81 +0x90
yunion.io/x/onecloud/pkg/cloudcommon/service.(*SServiceBase).StartService(0x400000e108)
        /root/go/src/yunion.io/x/onecloud/pkg/cloudcommon/service/services.go:58 +0xe0
yunion.io/x/onecloud/pkg/hostman.StartService(...)
        /root/go/src/yunion.io/x/onecloud/pkg/hostman/host_services.go:167
main.main()
        /root/go/src/yunion.io/x/onecloud/cmd/host/main.go:30 +0x124
[fatal 2024-08-23 15:32:17 hostman.(*SHostService).RunService(host_services.go:81)] Host instance init error: Setup OVN Chassis: normalize db host: dns lookup (default-ovn-north) failed: lookup default-ovn-north on 10.96.0.10:53: read udp 10.96.0.10:45069->10.96.0.10:53: i/o timeout

The likely cause is:

[fatal 2024-08-23 15:32:17 hostman.(*SHostService).RunService(host_services.go:81)] Host instance init error: Setup OVN Chassis: normalize db host: dns lookup (default-ovn-north) failed: lookup default-ovn-north on 10.96.0.10:53: read udp 10.96.0.10:45069->10.96.0.10:53: i/o timeout
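
To confirm the DNS path itself is broken (and not just this one lookup), the cluster DNS service VIP can be queried directly from the node. A minimal sketch, assuming dig is installed on the node and that the service lives in the onecloud namespace:

# Query the cluster DNS service IP (10.96.0.10 from the error) directly from the ARM node.
# A timeout here confirms CoreDNS is unreachable through the service VIP from this host.
dig +time=2 +tries=1 @10.96.0.10 default-ovn-north.onecloud.svc.cluster.local A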

------ Additional information
Following the pod-network troubleshooting doc at https://www.cloudpods.org/docs/operations/k8s/dnserror/#calico%E9%9A%A7%E9%81%93%E5%8D%8F%E8%AE%AE%E7%9A%84%E5%88%87%E6%8D%A2, the checks produced the following results:

[root@anode-01 ~]# ipvsadm -Ln | grep -A 3 10.96.0.10
TCP  10.96.0.10:53 rr
  -> 10.40.75.28:53               Masq    1      0          0
TCP  10.96.0.10:9153 rr
  -> 10.40.75.28:9153             Masq    1      0          0
TCP  10.96.124.119:30357 rr
  -> 10.40.75.14:30357            Masq    1      0          0
--
UDP  10.96.0.10:53 rr
  -> 10.40.75.28:53               Masq    1      0          64

[root@anode-01 ~]# ip route | grep 10.40.75
10.40.75.0/26 via 172.20.1.200 dev tunl0 proto bird onlink

[root@anode-01 ~]# ping 172.20.1.200
PING 172.20.1.200 (172.20.1.200) 56(84) bytes of data.
64 bytes from 172.20.1.200: icmp_seq=1 ttl=64 time=0.210 ms
64 bytes from 172.20.1.200: icmp_seq=2 ttl=64 time=0.143 ms
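
Since small ICMP pings to the tunnel endpoint succeed, it may also be worth checking whether larger packets survive the encapsulated path. A sketch that is not part of the checks above, using an arbitrary 1400-byte payload close to the tunnel MTU:

# Large, don't-fragment ping to the CoreDNS pod IP over the Calico tunnel.
# If this fails while the small ping above works, it points at an MTU or offload
# problem on the IPIP path rather than a plain reachability problem.
ping -M do -s 1400 -c 3 10.40.75.28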

Attempted fix 1:
Tried turning off GSO; in practice this had no effect.

[root@anode-01 ~]# ethtool -K enp1s0f0np0 gso off
[root@anode-01 ~]# ethtool -k enp1s0f0np0  | grep generic-segmentation-offload
generic-segmentation-offload: off
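
In case offloads other than GSO are involved, the segmentation/checksum settings of the physical NIC and the Calico tunnel device can be compared. A sketch; tunl0 is the tunnel interface from the route output above:

# List segmentation and checksum offloads on the physical NIC and the IPIP tunnel device.
ethtool -k enp1s0f0np0 | grep -E 'segmentation|checksum'
ethtool -k tunl0 | grep -E 'segmentation|checksum'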

Attempted fix 2:
Packet capture on the control node:

listening on br0, link-type EN10MB (Ethernet), snapshot length 262144 bytes
09:24:38.662983 IP 172.20.1.115 > 172.20.1.200: IP 10.40.106.128.34303 > 10.40.75.28.53: 56371+ AAAA? default-keystone.onecloud.svc.cluster.local. (61)
09:24:38.663066 IP 172.20.1.115 > 172.20.1.200: IP 10.40.106.128.49959 > 10.40.75.28.53: 16876+ A? default-keystone.onecloud.svc.cluster.local. (61)
09:24:38.806921 IP 172.20.1.115 > 172.20.1.200: IP 10.40.106.128.39662 > 10.40.75.28.53: 39831+ AAAA? default-keystone. (34)
09:24:38.806924 IP 172.20.1.115 > 172.20.1.200: IP 10.40.106.128.60606 > 10.40.75.28.53: 44761+ A? default-keystone. (34)
09:24:39.328664 IP 172.20.1.115 > 172.20.1.200: IP 10.40.106.128.16256 > 10.40.75.28.53: 48135+ A? default-apimap. (32)
09:24:39.328674 IP 172.20.1.115 > 172.20.1.200: IP 10.40.106.128.31164 > 10.40.75.28.53: 51897+ AAAA? default-apimap. (32)
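
The capture above was presumably produced with something like the following (a sketch; 'ip proto 4' matches the Calico IPIP encapsulation that tcpdump decodes as the nested IP headers shown above):

# Capture IPIP-encapsulated traffic between the two hosts on the bridge.
tcpdump -ni br0 'host 172.20.1.115 and host 172.20.1.200 and ip proto 4'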

Step 3:
Rebuilt a management node with v3.11.5 as a test; it works normally (screenshot attached).

Comparing the changelogs of the two versions: on compute nodes installed with v3.11.5, GSO is disabled. Could that be the cause? However, manually turning off GSO on the v3.11.6 compute node did not stick: after rebooting the machine, GSO was enabled again. Is extra configuration needed, or does some service have to be restarted? (screenshot attached)


Solved: adding the following setting to host.conf and rebooting restored normal operation:

ethtool_enable_gso: false
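
For reference, the change looks roughly like this on the compute node. A sketch assuming the default /etc/yunion/host.conf path used by the Cloudpods host agent; adjust the path to your deployment:

# Append the option to the host agent config on the compute node, then reboot
# (restarting the host agent afterwards should also re-apply the setting).
echo 'ethtool_enable_gso: false' >> /etc/yunion/host.conf
reboot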

Environment:

  • OS (e.g. cat /etc/os-release):
NAME="openEuler"
VERSION="22.03 (LTS-SP3)"
ID="openEuler"
VERSION_ID="22.03"
PRETTY_NAME="openEuler 22.03 (LTS-SP3)"
ANSI_COLOR="0;31"
  • Kernel (e.g. uname -a):
Linux anode-01.icc.local 5.10.0-224.0.0.127.oe2203sp3.aarch64 #1 SMP Wed Aug 21 15:03:40 CST 2024 aarch64 aarch64 aarch64 GNU/Linux
  • Host: (e.g. dmidecode | egrep -i 'manufacturer|product' |sort -u)
	Manufacturer: HiSilicon
	Manufacturer: Huawei
	Manufacturer: HUAWEI
	Manufacturer: JINGSHIJI
	Memory Subsystem Controller Manufacturer ID: Unknown
	Memory Subsystem Controller Product ID: Unknown
	Module Manufacturer ID: Unknown
	Module Product ID: Unknown
	Product Name: BC82AMDYA
	Product Name: TaiShan 200 (Model 2280)
  • Service Version (e.g. kubectl exec -n onecloud $(kubectl get pods -n onecloud | grep climc | awk '{print $1}') -- climc version-list):
[root@cloud-mgr ocboot]# climc version-list
Get "https://172.20.1.200:30898/version": dial tcp 172.20.1.200:30898: connect: connection refused
Get "https://172.20.1.200:30443/version": dial tcp 172.20.1.200:30443: connect: connection refused
+---------------+--------------------------------------------+
|     Field     |                   Value                    |
+---------------+--------------------------------------------+
| ansible       | release/3.11.6(6184774c6e24081713)         |
| apimap        | release/3.11.6(6184774c6e24081713)         |
| cloudmon      | release/3.11.6(6184774c6e24081713)         |
| cloudproxy    | release/3.11.6(6184774c6e24081713)         |
| compute_v2    | release/3.11.6(6184774c6e24081712)         |
| devtool       | release/3.11.6(6184774c6e24081712)         |
| identity      | release/3.11.6(6184774c6e24081713)         |
| image         | release/3.11.6(6184774c6e24081712)         |
| k8s           | heads/v3.11.6-20240815.1(e6c3e48724081712) |
| log           | release/3.11.6(6184774c6e24081712)         |
| monitor       | release/3.11.6(6184774c6e24081713)         |
| notify        | release/3.11.6(6184774c6e24081713)         |
| scheduledtask | release/3.11.6(6184774c6e24081713)         |
| scheduler     | release/3.11.6(6184774c6e24081712)         |
| vpcagent      | release/3.11.6(6184774c6e24081713)         |
| webconsole    | release/3.11.6(6184774c6e24081712)         |
| yunionconf    | release/3.11.6(6184774c6e24081712)         |
+---------------+--------------------------------------------+
elvizlai added the bug (Something isn't working) label on Aug 23, 2024
elvizlai changed the title from "[BUG] Adding an ARM node to a k3s cluster fails" to "[BUG] Adding an ARM node to a k3s cluster fails: v3.11.5 works, but v3.11.6 has network issues" on Aug 24, 2024