-
Notifications
You must be signed in to change notification settings - Fork 111
Wait explicitly for VFIO devices to complete hotplug, instead of relying on PCI rescan #850
Conversation
I'm making the claim that no backport is needed on the basis that while this is a bug, it's apparently more or less worked up to this point. |
/test |
Codecov Report
@@ Coverage Diff @@
## master #850 +/- ##
==========================================
+ Coverage 57.72% 57.74% +0.02%
==========================================
Files 19 17 -2
Lines 2375 2381 +6
==========================================
+ Hits 1371 1375 +4
Misses 841 841
- Partials 163 165 +2 |
It looks like my depends-on tag broke the CI for reasona I don't understand yet. Removing it. |
/retest-ubuntu |
/test |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
thanks @dgibson - lgtm - I have one question
@@ -181,7 +181,7 @@ var logsVSockPort = uint32(0) | |||
var debugConsoleVSockPort = uint32(0) | |||
|
|||
// Timeout waiting for a device to be hotplugged | |||
var hotplugTimeout = 3 * time.Second | |||
var hotplugTimeout = 10 * time.Second |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
some container engines like docker and podman kill the container if its status has not changed to started
after 10s - so the kata container will be killed before timing out - did you consider this?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Well, in that case the higher level engine could time out before us, but that's kind of unavoidable. In practice the total delay with SHPC is usualy ~6s, so we'd probably complete in time. The 10s here is just on the basis that the timeout should generally be substantially longer than the actual expected time, to avoid false positives.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
/lgtm
Thanks @dgibson
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @dgibson.
Could you try adding a few unit tests though please?
Although the virtioBlkCCWDeviceHandler
device handler sneaked in without any tests, I'd rather each handler has something in device_test.go
. Related to this, not that getDevicePCIAddress
is a variable, so its implementation can be changed to provoke interesting test scenarios. Also, we could consider doing the same for getDeviceName()
to exercise more of vfioDeviceHandler()
error paths potentially.
@jodh-intel new push has some unit tests, as requested. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @dgibson. We do also like negative testing (assert.Error(err)
for as many scenarios as possible), so feel free to consider adding additional tests for that.
lgtm
/test |
I'm a bit disinclined to do heaps here at this time, since it will all have to be ported to Rust pretty soon anyway - and I have notions for longer term cleanups that may obsolete a bunch of it. I also think quite a lot of the Kata unit tests are so tightly coupled to the code as not to actually be all that useful (they end up testing how it's done, rather than what it's doing). |
thanks @dgibson but VFIO CI is still failing :(
|
Yeah, I basically wasted a day chasing down the wrong failure :( (the one in jenkins-ci-ubunut-18-04-initrd). Now to figure out how to reproduce and debug the one in jenkins-ubuntu-18-04-vfio. |
device.go
Outdated
} | ||
|
||
hostBdf := tokens[0] | ||
guestPCIPath := tokens[1] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Given the vfio CI failure @devimc spotted in #850 (comment), please could you check both these are != ""
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Um.. no.. or at least not right away. I need to figure out why/how we're getting an empty string here, and probably fix the other side of this.
/retest-vfio |
/test-vfio |
/test-vfio |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM. Catching up on the changes from runtime.
Rebased on #855, which we'll need to fix the CI failure on clh. |
/test-vfio |
/test-ubuntu |
/test-vfio |
/test-vfio |
We send information about several kinds of devices to the agent so that it can apply specific handling. We don't currently do this with VFIO devices. However we need to do that so that the agent can properly wait for VFIO devices to be ready (previously it did that using a PCI rescan which may not be reliable and has some very bad side effects). This patch collates and sends the relevant information. Depends-on: github.com/kata-containers/agent#850 fixes kata-containers#2664 Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Currently TestGetDeviceName checks just one pair of sysfs path with /dev node path, with a fair bit of setup to do that. To allow future tests with different pairs of sys/dev paths, factor out the guts of the test into a parameterized helper `oneGetDeviceNameTest`. Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Currently getDeviceName() and listenToUdevEvents() which supplies it with information ignore events which don't supply a /dev path. However, we have upcoming use cases where we need to wait for a device (in the broad sense) to be ready, even though it doesn't have an actual /dev node. Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
hotplugTimeout controls how long we'll wait for a uevent indicating that a device is ready in getDeviceName(). It's currently 3s, which is plenty of time to allow for udev processing that we're usually dealing with. However, we have upcoming cases where we may need to wait for a complete SHPC PCI hotplug to complete. SHPC has a 5s delay built into the protocol (and therefore the guest implementations). To accomodate this case, increase the timeout to 10s. Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
If extra devices are given in the OCI spec, the user will expect them to be ready and available once the (inner) container starts executing. So, if that means hotplugging something into the Kata VM, we need to wait for that hotplug operation to complete before executing the container. For most devices that's handled by getDeviceName() which waits for a uevent indicating the device is ready. VFIO devices, however, don't have any handler so we don't explicitly wait for them to be ready. We usually get away with it, because we force a PCI rescan in in finishCreateContainer() which blocks until the rescan is complete. That really only works by accident though, completing a rescan doesn't necessarily mean all the device probing logic is complete. Worse, in some cases the PCI rescan can collide with the hotplug processing and cause the device to go into a broken state. So, instead of relying on the rescan, accept information from the runtime about what VFIO devices we expect, and explicitly wait for them to be ready. VFIO devices in Kata are (for now) weird - they will bind to whatever the Kata VM's driver for them is, so could appear as any kind of device (char, block, network interface, etc.). That means waiting for just the PCI device to be ready isn't foolproof, since there could be some extra processing time for the driver to complete probing and create the appropriate secondary devices. But, really, that was true of the forced rescan as well, so this is a clear improvement. fixes #781 Depends-on: github.com/kata-containers/runtime#2981 Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
PCI bus rescan code was added long time ago in Clear Containers due to lack of ACPI support in QEMU 2.9 + q35 [1]. Now this code is messing up PCIe hotplug in Kata Containers. A workaround to this issue is the "lazy attach" mechanism [2] that hotplugs LBS (Large BAR space) devices after re-scanning the PCI bus, unfourtunately some non-LBS devices are being affected too, for instance SR-IOV devices. It would not make sense to lazy-attach non-LBS devices because kata will end up lazy-attaching all the devices, having said that, the PCI bus rescan code and the "lazy attach" mechanism should be removed I'm not sure why, but this seems to expose a problem in TestStorageHandlers where in some cases it relies on sb.deviceWatchers being initialized, but it isn't. So, fix that up as well. fixes #781 fixes kata-containers/runtime#2664 [1] clearcontainers/agent#139 [2] kata-containers/runtime#2461 Signed-off-by: Julio Montes <julio.montes@intel.com> Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Since the CI issues are proving so hard to debug, I've decided not to pursue this in Kata1, and leave it to Kata2. |
Currently the only thing that "ensures" that VFIO devices are actually probed in the VM before the container executes is a forced PCI rescan. That has a bunch of problems as described in issue #781 .
This PR, in conjuction with a runtime PR to come shortly avoids the rescan by instead having the agent explicitly wait for the expected VFIO devices to become present.