[BUG] Creating a VM while the host is handling a probe-isolated-devices request fails with an error and times out #21610

Open
yulongz opened this issue Nov 15, 2024 · 1 comment
Labels
bug Something isn't working state/awaiting processing

yulongz commented Nov 15, 2024

What happened:
Two VM creations failed. The host service on the corresponding host logged the following:
[info 2024-11-14 07:19:34 isolated_device.getPassthroughGPUS(gpu.go:75)] filter address []
[info 2024-11-14 07:19:35 isolated_device.(*PCIDevice).IsBootVGA(gpu.go:321)] PCI address 03:00.0 is boot_vga: /sys/devices/pci0000:00/0000:00:1c.2/0000:02:00.0/0000:03:00.0/boot_vga
[info 2024-11-14 07:19:35 isolated_device.getPassthroughGPUS(gpu.go:98)] skip boot vga device 03:00.0
[info 2024-11-14 07:19:36 isolated_device.(*PCIDevice).forceBindVFIOPCIDriver(gpu.go:339)] 4f:00.0 already use vfio-pci driver
[info 2024-11-14 07:19:36 isolated_device.(*PCIDevice).forceBindVFIOPCIDriver(gpu.go:339)] 4f:00.0 already use vfio-pci driver
[info 2024-11-14 07:19:36 isolated_device.(*PCIDevice).forceBindVFIOPCIDriver(gpu.go:339)] 52:00.0 already use vfio-pci driver
[info 2024-11-14 07:19:36 isolated_device.(*PCIDevice).forceBindVFIOPCIDriver(gpu.go:339)] 52:00.0 already use vfio-pci driver
[info 2024-11-14 07:19:37 workmanager.(*workerTask).Run(manager.go:95)] DelayTask complete: {"telegraf_deployed":false}
[info 2024-11-14 07:19:37 modules.TaskComplete(task.go:34)] Sync task a8fb0d5c-f5a6-415f-84da-585d48be5f7f complete succ
[info 2024-11-14 07:19:37 appsrv.(*Application).ServeHTTP(appsrv.go:289)] GBAjy6fWKymrO7_-U9zE9ymrQmA= 200 882365-bbf8c2 POST /servers/cb6eb842-c430-409f-8e7a-6ccd0914b192/start (10.x.x.x:52693:compute_v2) 6.17ms
[error 2024-11-14 07:19:37 appsrv.execCallback.func1(workers.go:242)] WorkerManager exec callback error: runtime error: invalid memory address or nil pointer dereference
goroutine 57750 [running]:
runtime/debug.Stack()
/usr/lib/go/src/runtime/debug/stack.go:24 +0x65
runtime/debug.PrintStack()
/usr/lib/go/src/runtime/debug/stack.go:16 +0x19
yunion.io/x/onecloud/pkg/appsrv.execCallback.func1()
/root/go/src/yunion.io/x/onecloud/pkg/appsrv/workers.go:246 +0xdd
panic({0x29a2920, 0x54c7ac0})
/usr/lib/go/src/runtime/panic.go:838 +0x207
yunion.io/x/onecloud/pkg/hostman/guestman.(*SKVMGuestInstance).hasGPU(0xc000f54380)
/root/go/src/yunion.io/x/onecloud/pkg/hostman/guestman/qemu-kvmhelper.go:870 +0x9f
yunion.io/x/onecloud/pkg/hostman/guestman.(*SKVMGuestInstance).HideKVM(0xc000f54380)
/root/go/src/yunion.io/x/onecloud/pkg/hostman/guestman/qemu-kvmhelper.go:153 +0xfd
yunion.io/x/onecloud/pkg/hostman/guestman/arch.(*X86).GenerateCpuDesc(0xc000f54380?, 0x10, 0xf0, {0x3565cf8, 0xc000f54380})
/root/go/src/yunion.io/x/onecloud/pkg/hostman/guestman/arch/x86.go:131 +0x52
yunion.io/x/onecloud/pkg/hostman/guestman.(*SKVMGuestInstance).initCpuDesc(0xc000f54380, 0x0)
/root/go/src/yunion.io/x/onecloud/pkg/hostman/guestman/qemu-kvmhelper.go:886 +0x7a
yunion.io/x/onecloud/pkg/hostman/guestman.(*SKVMGuestInstance).initGuestDesc(0xc000f54380)
/root/go/src/yunion.io/x/onecloud/pkg/hostman/guestman/pci.go:53 +0x25
yunion.io/x/onecloud/pkg/hostman/guestman.(*SKVMGuestInstance).updateGuestDesc(0xc000f54380)
/root/go/src/yunion.io/x/onecloud/pkg/hostman/guestman/qemu-kvm.go:159 +0x1f3
yunion.io/x/onecloud/pkg/hostman/guestman.(*SKVMGuestInstance).asyncScriptStart(0xc000f54380, {0x355a300, 0xc00184e660}, {0x2e6a6a0?, 0xc00228aa60})
/root/go/src/yunion.io/x/onecloud/pkg/hostman/guestman/qemu-kvm.go:816 +0x1a5
yunion.io/x/onecloud/pkg/hostman/guestman.(*guestStartTask).Run(0xc00228afa0)
/root/go/src/yunion.io/x/onecloud/pkg/hostman/guestman/qemu-kvm.go:2032 +0x3b
yunion.io/x/onecloud/pkg/appsrv.execCallback(0xc001286f50?)
/root/go/src/yunion.io/x/onecloud/pkg/appsrv/workers.go:249 +0x58
yunion.io/x/onecloud/pkg/appsrv.(*SWorker).run(0xc0028089f0)
/root/go/src/yunion.io/x/onecloud/pkg/appsrv/workers.go:92 +0x70
created by yunion.io/x/onecloud/pkg/appsrv.(*SWorkerManager).scheduleWithLock
/root/go/src/yunion.io/x/onecloud/pkg/appsrv/workers.go:268 +0x165
[info 2024-11-14 07:19:37 isolated_device.(*PCIDevice).forceBindVFIOPCIDriver(gpu.go:339)] 56:00.0 already use vfio-pci driver
[info 2024-11-14 07:19:37 isolated_device.(*PCIDevice).forceBindVFIOPCIDriver(gpu.go:339)] 56:00.0 already use vfio-pci driver
[info 2024-11-14 07:19:37 isolated_device.(*PCIDevice).forceBindVFIOPCIDriver(gpu.go:339)] 57:00.0 already use vfio-pci driver
[info 2024-11-14 07:19:37 isolated_device.(*PCIDevice).forceBindVFIOPCIDriver(gpu.go:339)] 57:00.0 already use vfio-pci driver
[info 2024-11-14 07:19:38 isolated_device.(*PCIDevice).forceBindVFIOPCIDriver(gpu.go:339)] ce:00.0 already use vfio-pci driver
[info 2024-11-14 07:19:38 isolated_device.(*PCIDevice).forceBindVFIOPCIDriver(gpu.go:339)] ce:00.0 already use vfio-pci driver
[info 2024-11-14 07:19:38 workmanager.(*workerTask).Run(manager.go:95)] DelayTask complete: {"telegraf_deployed":false}
[info 2024-11-14 07:19:38 modules.TaskComplete(task.go:34)] Sync task 6dc8a284-d3e4-4882-894a-54d72d4c8be3 complete succ
[info 2024-11-14 07:19:38 isolated_device.(*PCIDevice).forceBindVFIOPCIDriver(gpu.go:339)] d1:00.0 already use vfio-pci driver
[info 2024-11-14 07:19:38 isolated_device.(*PCIDevice).forceBindVFIOPCIDriver(gpu.go:339)] d1:00.0 already use vfio-pci driver
[info 2024-11-14 07:19:39 appsrv.(*Application).ServeHTTP(appsrv.go:289)] GBAjy6fWKymrO7_-U9zE9ymrQmA= 200 5b5cde-d3e11d POST /servers/c3d8d60a-c311-47e8-8c00-4e84707893aa/start (10.x.x.x:26790:compute_v2) 4.28ms
[error 2024-11-14 07:19:39 appsrv.execCallback.func1(workers.go:242)] WorkerManager exec callback error: runtime error: invalid memory address or nil pointer dereference
goroutine 57857 [running]:
runtime/debug.Stack()
/usr/lib/go/src/runtime/debug/stack.go:24 +0x65
runtime/debug.PrintStack()
/usr/lib/go/src/runtime/debug/stack.go:16 +0x19
yunion.io/x/onecloud/pkg/appsrv.execCallback.func1()
/root/go/src/yunion.io/x/onecloud/pkg/appsrv/workers.go:246 +0xdd
panic({0x29a2920, 0x54c7ac0})
/usr/lib/go/src/runtime/panic.go:838 +0x207
yunion.io/x/onecloud/pkg/hostman/guestman.(*SKVMGuestInstance).hasGPU(0xc000eba460)
/root/go/src/yunion.io/x/onecloud/pkg/hostman/guestman/qemu-kvmhelper.go:870 +0x9f
yunion.io/x/onecloud/pkg/hostman/guestman.(*SKVMGuestInstance).HideKVM(0xc000eba460)
/root/go/src/yunion.io/x/onecloud/pkg/hostman/guestman/qemu-kvmhelper.go:153 +0xfd
yunion.io/x/onecloud/pkg/hostman/guestman/arch.(*X86).GenerateCpuDesc(0xc000eba460?, 0x10, 0xf0, {0x3565cf8, 0xc000eba460})
/root/go/src/yunion.io/x/onecloud/pkg/hostman/guestman/arch/x86.go:131 +0x52
yunion.io/x/onecloud/pkg/hostman/guestman.(*SKVMGuestInstance).initCpuDesc(0xc000eba460, 0x0)
/root/go/src/yunion.io/x/onecloud/pkg/hostman/guestman/qemu-kvmhelper.go:886 +0x7a
yunion.io/x/onecloud/pkg/hostman/guestman.(*SKVMGuestInstance).initGuestDesc(0xc000eba460)
/root/go/src/yunion.io/x/onecloud/pkg/hostman/guestman/pci.go:53 +0x25
yunion.io/x/onecloud/pkg/hostman/guestman.(*SKVMGuestInstance).updateGuestDesc(0xc000eba460)
/root/go/src/yunion.io/x/onecloud/pkg/hostman/guestman/qemu-kvm.go:159 +0x1f3
yunion.io/x/onecloud/pkg/hostman/guestman.(*SKVMGuestInstance).asyncScriptStart(0xc000eba460, {0x355a300, 0xc00234ad20}, {0x2e6a6a0?, 0xc0003f7e80})
/root/go/src/yunion.io/x/onecloud/pkg/hostman/guestman/qemu-kvm.go:816 +0x1a5
yunion.io/x/onecloud/pkg/hostman/guestman.(*guestStartTask).Run(0xc0017603e0)
/root/go/src/yunion.io/x/onecloud/pkg/hostman/guestman/qemu-kvm.go:2032 +0x3b
yunion.io/x/onecloud/pkg/appsrv.execCallback(0xc002118780?)
/root/go/src/yunion.io/x/onecloud/pkg/appsrv/workers.go:249 +0x58
yunion.io/x/onecloud/pkg/appsrv.(*SWorker).run(0xc000ba79b0)
/root/go/src/yunion.io/x/onecloud/pkg/appsrv/workers.go:92 +0x70
created by yunion.io/x/onecloud/pkg/appsrv.(*SWorkerManager).scheduleWithLock
/root/go/src/yunion.io/x/onecloud/pkg/appsrv/workers.go:268 +0x165
[info 2024-11-14 07:19:39 isolated_device.(*PCIDevice).forceBindVFIOPCIDriver(gpu.go:339)] d5:00.0 already use vfio-pci driver
[info 2024-11-14 07:19:39 isolated_device.(*PCIDevice).forceBindVFIOPCIDriver(gpu.go:339)] d5:00.0 already use vfio-pci driver
[info 2024-11-14 07:19:40 isolated_device.(*PCIDevice).forceBindVFIOPCIDriver(gpu.go:339)] d6:00.0 already use vfio-pci driver
[info 2024-11-14 07:19:40 isolated_device.(*PCIDevice).forceBindVFIOPCIDriver(gpu.go:339)] d6:00.0 already use vfio-pci driver

Environment:

  • OS (e.g. cat /etc/os-release): ubuntu2204
  • Kernel (e.g. uname -a): Linux cloud-node-0133 5.15.0-124-generic #134-Ubuntu SMP Fri Sep 27 20:20:17 UTC 2024 x86_64 x86_64 x86_64 GNU/Linux
  • Host: (e.g. dmidecode | egrep -i 'manufacturer|product' |sort -u)
    idProduct: 0x03ee
    Manufacturer: Intel(R) Corporation
    Manufacturer: NO DIMM
    Manufacturer: Samsung
    Manufacturer: Supermicro
    Manufacturer: SUPERMICRO
    Memory Subsystem Controller Manufacturer ID: Unknown
    Memory Subsystem Controller Product ID: Unknown
    Module Manufacturer ID: Bank 1, Hex 0xCE
    Module Manufacturer ID: Unknown
    Module Product ID: Unknown
    Product Name: SYS-420GP-TNR
    Product Name: X12DPG-OA6
  • Service Version (e.g. kubectl exec -n onecloud $(kubectl get pods -n onecloud | grep climc | awk '{print $1}') -- climc version-list):
    3.10.15
@yulongz yulongz added the bug Something isn't working label Nov 15, 2024

yulongz commented Nov 15, 2024

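Additional information: while the errors were occurring, someone most likely opened the host's passthrough devices page (Host > Passthrough Devices), which caused region to send a probe-isolated-devices request to the host. The request lasted roughly 78 seconds, and two VMs happened to need creating within that window, so the VM creations timed out.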

[info 2024-11-14 07:19:42 appsrv.(*Application).ServeHTTP(appsrv.go:289)] GBAjy6fWKymrO7_-U9zE9ymrQmA= 200 6fb8db-54d8d5-3f07c8 POST /hosts/0a70d90d-f1d5-4dc5-8aaa-0306d88936f9/probe-isolated-devices (10.x.x.x:62394:compute_v2) 7446.48ms
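
One plausible reading of the trace, which the logs above do not confirm, is that the guest start path looks up a passthrough device while the probe is rebuilding the device list, gets nil back, and dereferences it. The Go sketch below shows a general way to guard against that pattern. All names in it (Device, DeviceRegistry, Replace, Lookup, HasGPU) are hypothetical and are not taken from the onecloud code base: the probe publishes a fully built device map in one step under a lock, and the lookup returns an error instead of a nil pointer when an address is missing.

```go
package main

import (
	"fmt"
	"sync"
)

// Device is a hypothetical stand-in for a passthrough device record.
type Device struct {
	Addr string
	Type string
}

// DeviceRegistry is a hypothetical registry that a probe may rebuild
// while guest start tasks are reading from it.
type DeviceRegistry struct {
	mu      sync.RWMutex
	devices map[string]*Device
}

// NewDeviceRegistry returns an empty registry.
func NewDeviceRegistry() *DeviceRegistry {
	return &DeviceRegistry{devices: map[string]*Device{}}
}

// Replace builds the new map outside the lock and swaps it in at once,
// so readers never observe a half-populated device list.
func (r *DeviceRegistry) Replace(devs []*Device) {
	next := make(map[string]*Device, len(devs))
	for _, d := range devs {
		next[d.Addr] = d
	}
	r.mu.Lock()
	r.devices = next
	r.mu.Unlock()
}

// Lookup returns the device at addr and whether a non-nil entry exists.
func (r *DeviceRegistry) Lookup(addr string) (*Device, bool) {
	r.mu.RLock()
	defer r.mu.RUnlock()
	d, ok := r.devices[addr]
	return d, ok && d != nil
}

// HasGPU reports whether any address resolves to a GPU. A missing device
// yields an error rather than a nil pointer dereference.
func HasGPU(r *DeviceRegistry, addrs []string) (bool, error) {
	for _, addr := range addrs {
		d, ok := r.Lookup(addr)
		if !ok {
			return false, fmt.Errorf("device %s not found, a probe may be in progress", addr)
		}
		if d.Type == "GPU" {
			return true, nil
		}
	}
	return false, nil
}

func main() {
	reg := NewDeviceRegistry()

	// Before any probe result is published, the lookup fails cleanly.
	if _, err := HasGPU(reg, []string{"4f:00.0"}); err != nil {
		fmt.Println("error:", err)
	}

	// After the probe publishes its result, the same call succeeds.
	reg.Replace([]*Device{{Addr: "4f:00.0", Type: "GPU"}})
	ok, err := HasGPU(reg, []string{"4f:00.0"})
	fmt.Println(ok, err)
}
```

With a guard of this kind, a start request that arrives during a probe would fail with a clear error that the caller can retry, rather than panicking in the worker and leaving the task to time out.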

@yulongz yulongz changed the title from "[BUG] hasGPU panic runtime error: invalid memory address or nil pointer dereference" to "[BUG] Creating a VM while the host is handling a probe-isolated-devices request fails with an error and times out" Nov 15, 2024