This repository has been archived by the owner on May 12, 2021. It is now read-only.

uevent listening goroutine is not (always) working #372

Closed
bergwolf opened this issue Sep 17, 2018 · 20 comments · Fixed by #380 or #734

Comments

@bergwolf
Member

While adding network hotplug support for the VM factory, I found that the uevent listening goroutine is not working for me. After hotplugging a new NIC, no matter how long I wait in UpdateInterface, the uevent listening goroutine never picks up the new device's sysfs uevent.

Then I tested the network monitor process (kata-containers/runtime#534), and it failed to get the new uevent as well:

Sep 13 09:37:20 macbeth kata-proxy[54869]: time="2018-09-13T09:37:20.532400733+08:00" level=info msg="time="2018-09-13T01:37:20.511006103Z" level=debug msg="new request" name=kata-agent pid=128 req="interface:<device:"eth1" name:"eth1" IPAddresses:<family:2 address:"172.18.0.2" mask:"16" > mtu:1500 hwAddr:"02:42:ac:12:00:02" pciAddr:"02/01" > " request=/grpc.AgentService/UpdateInterface sandbox=e2145826daa97ea538d49bffc8c25ad13e0849a1edf5546478458fa17aa63002 source=agent\n" name=kata-proxy pid=54869 sandbox=e2145826daa97ea538d49bffc8c25ad13e0849a1edf5546478458fa17aa63002 source=agent
Sep 13 09:37:20 macbeth kata-proxy[54869]: time="2018-09-13T09:37:20.533607734+08:00" level=info msg="time="2018-09-13T01:37:20.512336531Z" level=info msg="Fetched PCI address for device" completePCIAddr="0000:00:02.0/0000:01:01.0" name=kata-agent pid=128 sandbox=e2145826daa97ea538d49bffc8c25ad13e0849a1edf5546478458fa17aa63002 source=agent\n" name=kata-proxy pid=54869 sandbox=e2145826daa97ea538d49bffc8c25ad13e0849a1edf5546478458fa17aa63002 source=agent
Sep 13 09:37:20 macbeth kata-proxy[54869]: time="2018-09-13T09:37:20.535074125+08:00" level=info msg="time="2018-09-13T01:37:20.513595652Z" level=info msg="Waiting on channel for device notification" name=kata-agent pciID=02/01 pid=128 sandbox=e2145826daa97ea538d49bffc8c25ad13e0849a1edf5546478458fa17aa63002 source=agent\n" name=kata-proxy pid=54869 sandbox=e2145826daa97ea538d49bffc8c25ad13e0849a1edf5546478458fa17aa63002 source=agent
Sep 13 09:37:23 macbeth kata-proxy[54869]: time="2018-09-13T09:37:23.537970764+08:00" level=info msg="time="2018-09-13T01:37:23.51786459Z" level=debug msg="request end" duration=3.005565104s error="rpc error: code = DeadlineExceeded desc = Timeout reached after 3s waiting for device 0000:00:02.0/0000:01:01.0" name=kata-agent pid=128 request=/grpc.AgentService/UpdateInterface resp="" sandbox=e2145826daa97ea538d49bffc8c25ad13e0849a1edf5546478458fa17aa63002 source=agent\n" name=kata-proxy pid=54869 sandbox=e2145826daa97ea538d49bffc8c25ad13e0849a1edf5546478458fa17aa63002 source=agent
Sep 13 09:37:23 macbeth kata-runtime[54984]: time="2018-09-13T09:37:23.540837194+08:00" level=error msg="update interface request failed" arch=amd64 command=kata-network container=e2145826daa97ea538d49bffc8c25ad13e0849a1edf5546478458fa17aa63002 error="rpc error: code = DeadlineExceeded desc = Timeout reached after 3s waiting for device 0000:00:02.0/0000:01:01.0" interface-requested="device:"eth1" name:"eth1" IPAddresses:<family:2 address:"172.18.0.2" mask:"16" > mtu:1500 hwAddr:"02:42:ac:12:00:02" pciAddr:"02/01" " name=kata-runtime pid=54984 resulting-interface="" sandbox=e2145826daa97ea538d49bffc8c25ad13e0849a1edf5546478458fa17aa63002 source=virtcontainers subsystem=kata_agent
Sep 13 09:37:23 macbeth kata-runtime[54984]: time="2018-09-13T09:37:23.540955376+08:00" level=error msg="add interface failed" arch=amd64 command=kata-network container=e2145826daa97ea538d49bffc8c25ad13e0849a1edf5546478458fa17aa63002 error="rpc error: code = DeadlineExceeded desc = Timeout reached after 3s waiting for device 0000:00:02.0/0000:01:01.0" name=kata-runtime pid=54984 resulting-interface="" sandbox=e2145826daa97ea538d49bffc8c25ad13e0849a1edf5546478458fa17aa63002 source=runtime
Sep 13 09:37:23 macbeth kata-runtime[54984]: time="2018-09-13T09:37:23.541177053+08:00" level=error msg="rpc error: code = DeadlineExceeded desc = Timeout reached after 3s waiting for device 0000:00:02.0/0000:01:01.0" arch=amd64 command=kata-network container=e2145826daa97ea538d49bffc8c25ad13e0849a1edf5546478458fa17aa63002 name=kata-runtime pid=54984 sandbox=e2145826daa97ea538d49bffc8c25ad13e0849a1edf5546478458fa17aa63002 source=runtime

It's not that the uevent is not sent to the agent. If I remove the waiting and ignore the UpdateInterface error, the device shows up very quickly and the uevent listening goroutine is able to pick up uevents for it. So it looks like the goroutine is not woken up while the agent is waiting in the UpdateInterface call, but I don't know why that is happening.
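
For reference, the wait in UpdateInterface follows a "register a channel, then wait for a notification" pattern: the RPC handler puts a channel into a device watchers map keyed by the PCI address, and the uevent listening goroutine is supposed to signal that channel when the matching sysfs uevent arrives. Below is a minimal, illustrative sketch of that pattern in Go; the names (devWatchers, waitForDevice, notifyDevice) are made up for the example and are not the agent's real identifiers, and only the 3s timeout matches the logs above.

// Illustrative sketch of the register-then-wait pattern; not the agent's code.
package main

import (
	"fmt"
	"sync"
	"time"
)

var (
	watchersLock sync.Mutex
	devWatchers  = make(map[string]chan string)
)

// waitForDevice registers a watcher for pciAddr and blocks until the uevent
// listener signals it, or until the timeout expires.
func waitForDevice(pciAddr string, timeout time.Duration) (string, error) {
	ch := make(chan string, 1)

	watchersLock.Lock()
	devWatchers[pciAddr] = ch
	watchersLock.Unlock()

	defer func() {
		watchersLock.Lock()
		delete(devWatchers, pciAddr)
		watchersLock.Unlock()
	}()

	select {
	case name := <-ch:
		return name, nil
	case <-time.After(timeout):
		return "", fmt.Errorf("timeout reached after %s waiting for device %s", timeout, pciAddr)
	}
}

// notifyDevice is what the uevent listening goroutine would call when an
// "add" uevent matching a registered PCI address shows up.
func notifyDevice(pciAddr, devName string) {
	watchersLock.Lock()
	defer watchersLock.Unlock()

	if ch, ok := devWatchers[pciAddr]; ok {
		ch <- devName
	}
}

func main() {
	// Simulate the uevent arriving one second after the RPC starts waiting.
	go func() {
		time.Sleep(time.Second)
		notifyDevice("0000:00:02.0/0000:01:01.0", "eth1")
	}()

	fmt.Println(waitForDevice("0000:00:02.0/0000:01:01.0", 3*time.Second))
}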

Also, since netmon works in @sboeuf's local environment, I put up a testing PR (kata-containers/tests#743) to get a baseline for environment comparison. It turns out that two times out of three it failed to get the IP of the newly hotplugged NIC, so the results are not consistent even in CI.

It's not clear to me what determines the failure or success. @sboeuf @devimc @amshinde @caoruidong could you help take a look?

@amshinde
Member

@bergwolf I'll take a look at this.

@amshinde
Member

@bergwolf I have not been able to reproduce this issue at all on my local setup; the device seems to get hotplugged well in advance there. I haven't had any luck reproducing it on an Azure instance either, while running the test for 10 minutes. I'll try again to reproduce the issue on Azure tomorrow.
I don't think it should matter, but can you let me know what rootfs (and other environment details) you were using for your tests?

Meanwhile, I had a look at the uevent changes and I think this change could be causing the uevent listening loop to be locked permanently:

agent/device.go

Lines 125 to 132 in 7caf1c8

// Check if the PCI path is present before we setup the pci device map.
_, err = os.Stat(filepath.Join(rootBusPath, pciAddr))
if err == nil {
	// Found pci path. It's there before we wait for the device map.
	// We cannot find the name for it.
	fieldLogger.Info("Device found on pci device path")
	return "", nil
}

With that change the sandbox lock is never released.
I have pushed a revert for it here: #375
Can you test with that and let me know? Meanwhile I will continue testing tomorrow to see if there is any other issue related to network hotplug uevents.
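
To illustrate the failure mode being described: if the function takes the sandbox lock and the newly added branch returns before releasing it, every subsequent caller blocks forever. Here is a hedged, self-contained sketch of that bug class and the usual defer-based fix; the sandbox type and lookup functions are made up for illustration and are not the actual agent code:

// Illustrative only: why an early return before Unlock leaves a mutex held
// forever, and how a deferred Unlock avoids it. Not the agent's real code.
package main

import (
	"fmt"
	"sync"
)

type sandbox struct {
	sync.Mutex
	devices map[string]string
}

// buggyLookup mirrors the problem: the early-return path skips Unlock, so the
// next goroutine calling Lock on this sandbox blocks forever.
func (s *sandbox) buggyLookup(pciAddr string) string {
	s.Lock()
	if name, ok := s.devices[pciAddr]; ok {
		return name // bug: returns while still holding the lock
	}
	s.Unlock()
	return ""
}

// fixedLookup releases the lock on every return path by deferring Unlock
// right after acquiring it.
func (s *sandbox) fixedLookup(pciAddr string) string {
	s.Lock()
	defer s.Unlock()

	if name, ok := s.devices[pciAddr]; ok {
		return name
	}
	return ""
}

func main() {
	s := &sandbox{devices: map[string]string{"0000:01:01.0": "eth1"}}
	fmt.Println(s.fixedLookup("0000:01:01.0"))
}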

@bergwolf
Member Author

@amshinde Thanks for looking into it. Nice catch on the lock never being released! I guess I never hit that code path, though. I can see "Waiting on channel for device notification" printed in the agent console. E.g. (with that commit reverted, running kata-containers/runtime#534):

Sep 19 16:09:55 vmfarm kata-proxy[7203]: time="2018-09-19T16:09:55.999956371+08:00" level=info msg="time="2018-09-19T08:09:55.990219937Z" level=info msg="Fetched PCI address for device" completePCIAddr="0000:00:02.0/0000:01:01.0" name=kata-agent pid=1 sandbox=5a84955650e0852ff6b513bed1e9305c5a24d16ddd0d01a695d7cc645d522b3d source=agent\n" name=kata-proxy pid=7203 sandbox=5a84955650e0852ff6b513bed1e9305c5a24d16ddd0d01a695d7cc645d522b3d source=agent
Sep 19 16:09:56 vmfarm kata-proxy[7203]: time="2018-09-19T16:09:56.000040707+08:00" level=info msg="time="2018-09-19T08:09:55.990355952Z" level=info msg="Waiting on channel for device notification" name=kata-agent pciID=02/01 pid=1 sandbox=5a84955650e0852ff6b513bed1e9305c5a24d16ddd0d01a695d7cc645d522b3d source=agent\n" name=kata-proxy pid=7203 sandbox=5a84955650e0852ff6b513bed1e9305c5a24d16ddd0d01a695d7cc645d522b3d source=agent
Sep 19 16:09:56 vmfarm kata-proxy[7203]: time="2018-09-19T16:09:56.573418625+08:00" level=info msg="[ 1.184191] IPv6: ADDRCONF(NETDEV_CHANGE): eth0: link becomes ready\n" name=kata-proxy pid=7203 sandbox=5a84955650e0852ff6b513bed1e9305c5a24d16ddd0d01a695d7cc645d522b3d source=agent
Sep 19 16:09:56 vmfarm kata-proxy[7203]: time="2018-09-19T16:09:56.575381281+08:00" level=info msg="[ 1.185224] clocksource: tsc: mask: 0xffffffffffffffff max_cycles: 0x39c5ded8016, max_idle_ns: 440795406786 ns\n" name=kata-proxy pid=7203 sandbox=5a84955650e0852ff6b513bed1e9305c5a24d16ddd0d01a695d7cc645d522b3d source=agent
Sep 19 16:09:59 vmfarm kata-proxy[7203]: time="2018-09-19T16:09:59.000995824+08:00" level=info msg="time="2018-09-19T08:09:58.990735646Z" level=debug msg="request end" duration=3.000515709s error="rpc error: code = DeadlineExceeded desc = Timeout reached after 3s waiting for device 0000:00:02.0/0000:01:01.0" name=kata-agent pid=1 request=/grpc.AgentService/UpdateInterface resp="" sandbox=5a84955650e0852ff6b513bed1e9305c5a24d16ddd0d01a695d7cc645d522b3d source=agent\n" name=kata-proxy pid=7203 sandbox=5a84955650e0852ff6b513bed1e9305c5a24d16ddd0d01a695d7cc645d522b3d source=agent

And I tried both initrd (make DISTRO=alpine USE_DOCKER=yes AGENT_INIT=yes initrd) and rootfs image (make DISTRO=centos USE_DOCKER=yes AGENT_INIT=no image). Both failed the same way.

Here is my complete kata-collect-data.sh result: https://pastebin.com/TSfD3A6e

@amshinde
Member

@bergwolf I see the important difference in your environment is that you are using the "q35" machine type, while I am using "pc". Just running the netmon test, I don't think the network hotplug for q35 is even successful; I don't see any uevent related to the network device for q35 at all.

@sboeuf Were you able to reproduce this issue at all with "pc" machine type?

@amshinde
Member

@bergwolf Can you give this a try with the "pc" machine type and paste the logs here if you see any issues with "pc"?

@amshinde
Member

@bergwolf The q35 hotplug was not working because of the missing kernel config CONFIG_HOTPLUG_PCI_SHPC.
Without it, we would need to explicitly rescan the PCI bus, which is not the best solution.
@devimc will be submitting a PR to enable this option in our kernel configuration.

@amshinde
Member

@bergwolf This PR should solve the hotplug issue for q35:
kata-containers/packaging#186

For pc, I am wondering if we are running into similar SHPC-related issues in our CI (although I can't reproduce it on my setup). We have shpc turned on for pci-bridges in virtcontainers, so I am not sure if it has to do with the timing of scanning the PCI bus.
With your current guest kernel and the "pc" machine type, can you turn off shpc for pci-bridges and run the netmon test?
You would need to turn off shpc here:
https://github.com/kata-containers/runtime/blob/d6e4a98387fe81b448da9eb8fca9bed500e7c241/virtcontainers/qemu_arch_base.go#L358

and

https://github.com/kata-containers/runtime/blob/7d8ce4ec922551845865325ca8e24ba8f8e2074a/virtcontainers/qemu.go#L1201

@bergwolf
Member Author

Thanks @amshinde! You did find the root cause. I am able to pass the test as long as I change the machine type back to pc.

And if I disable SHPC, I cannot pass it with pc + current guest kernel.

I'll build a new kernel w/ CONFIG_HOTPLUG_PCI_SHPC and test again. Thanks a lot for your help!

@bergwolf
Member Author

@amshinde In my environment, with CONFIG_HOTPLUG_PCI_SHPC enabled in the guest kernel + SHPC enabled in QEMU, the NIC device still only shows up after the UpdateInterface request times out. Are you able to pass the test with CONFIG_HOTPLUG_PCI_SHPC + q35?

bergwolf added a commit to bergwolf/kata-agent that referenced this issue Sep 20, 2018
Otherwise the channel we are waiting on is not the one we
put in the device watchers map.

Fixes: kata-containers#372

Signed-off-by: Peng Tao <bergwolf@gmail.com>
@bergwolf
Member Author

@amshinde I added PCI bus rescanning and fixed an issue with the listening channel. See #380. With that I can pass the test both with and without CONFIG_HOTPLUG_PCI_SHPC using the q35 machine type.

And the test fails if I take out the PCI bus rescan commit: the device shows up a few seconds after the waiting fails. So I think we always want to rescan the PCI bus so that new devices are discovered in a timely manner.
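
For reference, forcing a PCI rescan from inside the guest just means writing "1" to /sys/bus/pci/rescan. A minimal sketch of such a helper is below; it is illustrative and not necessarily the exact code in #380:

// Illustrative sketch: trigger a PCI bus rescan from inside the guest by
// writing "1" to /sys/bus/pci/rescan. Not necessarily the code in #380.
package main

import (
	"fmt"
	"io/ioutil"
	"os"
)

const pciRescanFile = "/sys/bus/pci/rescan"

func rescanPciBus() error {
	return ioutil.WriteFile(pciRescanFile, []byte("1"), 0200)
}

func main() {
	if err := rescanPciBus(); err != nil {
		fmt.Fprintf(os.Stderr, "pci rescan failed: %v\n", err)
	}
}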

@sboeuf

sboeuf commented Sep 20, 2018

@bergwolf I'd like to understand what's going on with pc and shpc.
It's clear that for q35 we need SHPC support as the hotplug mechanism, hence SHPC needs to be enabled in the guest kernel config.
But I don't get how this interacts in the case of pc. pc should be using ACPI in order to hotplug PCI devices, and I don't think that should take any extra time compared to the other mechanism. Do you have any more input on that?

And yes, adding PCI rescan works since we don't need any hotplug mechanism, as we simply force PCI enumeration every time we add a new device from Qemu.
But I don't think this should be needed, and I'd like to make sure we understand what's going on.

@sboeuf

sboeuf commented Sep 20, 2018

And just to add a comment here, I am not against forcing the PCI rescan, but I want to make sure we're not hiding a potential lack of support regarding PCI hotplug.

@sboeuf

sboeuf commented Sep 20, 2018

/cc @mcastelino
I can see that not using shpc with pc is a problem when using a pci-bridge (-device pci-bridge,id=pci-bridge-0,chassis_nr=1,shpc=off). The PCI hotplug on the main bus (PCI host bridge) works just fine through ACPI.

But here is the error I get when I try to hotplug through the bridge:

[   11.227923] pci 0000:01:04.0: BAR 6: no space for [mem size 0x00040000 pref]
[   11.228251] pci 0000:01:04.0: BAR 6: failed to assign [mem size 0x00040000 pref]
[   11.228431] pci 0000:01:04.0: BAR 4: no space for [mem size 0x00004000 64bit pref]
[   11.228583] pci 0000:01:04.0: BAR 4: failed to assign [mem size 0x00004000 64bit pref]
[   11.228741] pci 0000:01:04.0: BAR 1: no space for [mem size 0x00001000]
[   11.228871] pci 0000:01:04.0: BAR 1: failed to assign [mem size 0x00001000]
[   11.229018] pci 0000:01:04.0: BAR 0: no space for [io  size 0x0020]
[   11.229167] pci 0000:01:04.0: BAR 0: failed to assign [io  size 0x0020]
[   11.251395] PCI Interrupt Link [LNKA] enabled at IRQ 11
[   11.251769] virtio-pci 0000:01:04.0: virtio_pci: leaving for legacy driver
[   11.267333] virtio-pci: probe of 0000:01:04.0 failed with error -12

Any idea why there's not enough space for the PCI holes defined by the BARs?

@bergwolf
Member Author

@sboeuf As I tested it, CONFIG_HOTPLUG_PCI_SHPC does bring up the network device for q35, but only a few seconds after we time out in the UpdateInterface API, so the device is still not properly configured.

But I don't get how this interacts in the case of pc. pc should be using ACPI in order to hotplug PCI devices, and I don't think that should take any extra time compared to the other mechanism.

We enable SHPC on pci bridges for both pc and q35 right now.

@miaoyq

miaoyq commented Oct 23, 2018

I was testing network VFIO device hotplug locally based on Kata v1.3.0, but it always fails with the following errors:

kata-proxy[17695]: time="2018-10-23T19:38:49.342930445+08:00" level=info msg="[  237.429816] pci 0000:01:01.0: [8086:1581] type 00 class 0x020000\n" name=kata-proxy pid=17695 sandbox=c50c01b8205f4813d6c1410a5834fd16cda283210f7e0c666886bdbb9900d10d source=agent
kata-proxy[17695]: time="2018-10-23T19:38:49.343713099+08:00" level=info msg="[  237.430427] pci 0000:01:01.0: reg 0x10: [mem 0x00000000-0x007fffff 64bit pref]\n" name=kata-proxy pid=17695 sandbox=c50c01b8205f4813d6c1410a5834fd16cda283210f7e0c666886bdbb9900d10d source=agent
kata-proxy[17695]: time="2018-10-23T19:38:49.344064018+08:00" level=info msg="[  237.430989] pci 0000:01:01.0: reg 0x1c: [mem 0x00000000-0x00007fff 64bit pref]\n" name=kata-proxy pid=17695 sandbox=c50c01b8205f4813d6c1410a5834fd16cda283210f7e0c666886bdbb9900d10d source=agent
kata-proxy[17695]: time="2018-10-23T19:38:49.344546049+08:00" level=info msg="[  237.431597] pci 0000:01:01.0: reg 0x30: [mem 0x00000000-0x0007ffff pref]\n" name=kata-proxy pid=17695 sandbox=c50c01b8205f4813d6c1410a5834fd16cda283210f7e0c666886bdbb9900d10d source=agent
kata-proxy[17695]: time="2018-10-23T19:38:49.344831647+08:00" level=info msg="[  237.431913] pci 0000:01:01.0: enabling Extended Tags\n" name=kata-proxy pid=17695 sandbox=c50c01b8205f4813d6c1410a5834fd16cda283210f7e0c666886bdbb9900d10d source=agent
kata-proxy[17695]: time="2018-10-23T19:38:49.346025215+08:00" level=info msg="[  237.433022] pci 0000:01:01.0: PME# supported from D0 D3hot\n" name=kata-proxy pid=17695 sandbox=c50c01b8205f4813d6c1410a5834fd16cda283210f7e0c666886bdbb9900d10d source=agent
kata-proxy[17695]: time="2018-10-23T19:38:49.346824703+08:00" level=info msg="[  237.433766] pci 0000:01:01.0: BAR 0: no space for [mem size 0x00800000 64bit pref]\n" name=kata-proxy pid=17695 sandbox=c50c01b8205f4813d6c1410a5834fd16cda283210f7e0c666886bdbb9900d10d source=agent
kata-proxy[17695]: time="2018-10-23T19:38:49.347071407+08:00" level=info msg="[  237.434093] pci 0000:01:01.0: BAR 0: failed to assign [mem size 0x00800000 64bit pref]\n" name=kata-proxy pid=17695 sandbox=c50c01b8205f4813d6c1410a5834fd16cda283210f7e0c666886bdbb9900d10d source=agent
kata-proxy[17695]: time="2018-10-23T19:38:49.347301228+08:00" level=info msg="[  237.434314] pci 0000:01:01.0: BAR 6: assigned [mem 0xfe400000-0xfe47ffff pref]\n" name=kata-proxy pid=17695 sandbox=c50c01b8205f4813d6c1410a5834fd16cda283210f7e0c666886bdbb9900d10d source=agent
kata-proxy[17695]: time="2018-10-23T19:38:49.347513392+08:00" level=info msg="[  237.434548] pci 0000:01:01.0: BAR 3: assigned [mem 0xfe800000-0xfe807fff 64bit pref]\n" name=kata-proxy pid=17695 sandbox=c50c01b8205f4813d6c1410a5834fd16cda283210f7e0c666886bdbb9900d10d source=agent
kata-proxy[17695]: time="2018-10-23T19:38:49.375500627+08:00" level=info msg="[  237.462523] i40e 0000:01:01.0: enabling device (0100 -> 0102)\n" name=kata-proxy pid=17695 sandbox=c50c01b8205f4813d6c1410a5834fd16cda283210f7e0c666886bdbb9900d10d source=agent
kata-proxy[17695]: time="2018-10-23T19:38:49.377076131+08:00" level=info msg="[  237.464083] i40e 0000:01:01.0: ioremap(0x0000, 0x0000) failed: 0xfffffffb\n" name=kata-proxy pid=17695 sandbox=c50c01b8205f4813d6c1410a5834fd16cda283210f7e0c666886bdbb9900d10d source=agent
kata-proxy[17695]: time="2018-10-23T19:38:49.377774156+08:00" level=info msg="[  237.464810] i40e: probe of 0000:01:01.0 failed with error -5\n" name=kata-proxy pid=17695 sandbox=c50c01b8205f4813d6c1410a5834fd16cda283210f7e0c666886bdbb9900d10d source=agent

I can see the device in the container, but no driver is bound to it:

/sys/devices/pci0000:00/0000:00:02.0/0000:01:01.0 # cat uevent
PCI_CLASS=20000
PCI_ID=8086:1581
PCI_SUBSYS_ID=8086:0000
PCI_SLOT_NAME=0000:01:01.0
MODALIAS=pci:v00008086d00001581sv00008086sd00000000bc02sc00i00

The machine type is pc with shpc=on.
Is there something wrong with my environment, or is it something else? @bergwolf @sboeuf

@sboeuf

sboeuf commented Oct 23, 2018

@miaoyq
Have you been able to make this work with a simple veth hotplug?

The network hotplug of a VFIO device is not supported yet, but should land soon. I just want to make sure the standard veth use case works for you.

@miaoyq

miaoyq commented Oct 24, 2018

Have you been able to make this work with a simple veth hotplug?

@sboeuf Yeah, works very well :-)

We were implementing network hotplug of a VFIO device locally, based on v1.3.0. It works well in one of my environments, but always fails in another one with the errors above.

@sboeuf

sboeuf commented Oct 24, 2018

@miaoyq Oh okay! So I think the issue might come from the fact that your device needs a lot of memory, and because it is hotplugged behind a PCI bridge, the reserved space is not large enough.
Try to use:

hotplug_vfio_on_root_bus = true

from the configuration.toml. The main PCI bus has more space and should be able to support your device.

@miaoyq

miaoyq commented Oct 26, 2018

Thanks @sboeuf
I set hotplug_vfio_on_root_bus = true and got the following error:

"{\"error\": {\"class\": \"GenericError\", \"desc\": \"Bus 'pcie.0' does not support hotplugging\"}}" 

It seems the root bus does not support hotplugging.

@sboeuf

sboeuf commented Oct 26, 2018

@miaoyq are you using the q35 machine type? You need to use pc, which allows devices to be plugged onto the PCI host bridge directly.
With q35, you would need some extra code to create a PCIe hierarchy so that you can hotplug behind a PCIe root port, for instance.
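
For illustration, that extra wiring would mean cold-plugging a PCIe root port at boot and hotplugging the device behind it instead of onto pcie.0. The ids, chassis/slot numbers, and host PCI address below are made up for the example:

-device pcie-root-port,id=rp0,bus=pcie.0,chassis=1,slot=0

and then, e.g. via QMP:

{ "execute": "device_add", "arguments": { "driver": "vfio-pci", "host": "0000:3b:00.0", "bus": "rp0", "id": "vfio0" } }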

jshachm pushed a commit to jshachm/agent that referenced this issue Nov 22, 2018
Otherwise the channel we are waiting on is not the one we
put in the device watchers map.

Fixes: kata-containers#372

Signed-off-by: Peng Tao <bergwolf@gmail.com>
jcvenegas added a commit to jcvenegas/agent that referenced this issue Feb 13, 2020
When the client tries to connect sometimes a race condition
could happen when the between bind and listen calls in the agent
vsock side.

This will block the hypervisor wanting for a response and
as consequence the agent client where it checks for an OK
response.

This case needs to be handled by the guest kernel, see
https://lore.kernel.org/netdev/668b0eda8823564cd604b1663dc53fbaece0cd4e.camel@intel.com/

As an extra protection make the agent client timeout if no OK response
is given. The response should be quick so is OK to wait a few seconds
and then timeout.

This also allow to return an error from the dialler function so retry
does not fallback on grpc retry making retries faster.

Fixes: kata-containers#372

Signed-off-by: Jose Carlos Venegas Munoz <jose.carlos.venegas.munoz@intel.com>