Wait explicitly for VFIO devices to complete hotplug, instead of relying on PCI rescan #850

dgibson · 2020-09-23T11:21:29Z

Currently the only thing that "ensures" that VFIO devices are actually probed in the VM before the container executes is a forced PCI rescan. That has a bunch of problems, as described in issue #781.

This PR, in conjunction with a runtime PR to come shortly, avoids the rescan by instead having the agent explicitly wait for the expected VFIO devices to become present.

dgibson · 2020-09-23T11:23:50Z

I'm making the claim that no backport is needed on the basis that while this is a bug, it's apparently more or less worked up to this point.

dgibson · 2020-09-23T11:24:52Z

/test

codecov · 2020-09-23T11:28:55Z

Codecov Report

Merging #850 (bdc00e8) into master (25d7471) will increase coverage by 0.02%.
The diff coverage is 71.42%.

@@            Coverage Diff             @@
##           master     #850      +/-   ##
==========================================
+ Coverage   57.72%   57.74%   +0.02%     
==========================================
  Files          19       17       -2     
  Lines        2375     2381       +6     
==========================================
+ Hits         1371     1375       +4     
  Misses        841      841              
- Partials      163      165       +2

dgibson · 2020-09-24T01:18:42Z

It looks like my depends-on tag broke the CI for reasons I don't understand yet. Removing it.

bpradipt · 2020-09-24T10:09:41Z

@dgibson shall we wait for #849 to merge and then you can update this PR ?

dgibson · 2020-09-25T04:00:53Z

> @dgibson shall we wait for #849 to merge and then you can update this PR ?

Done :)

dgibson · 2020-09-25T04:23:48Z

/retest-ubuntu

devimc · 2020-09-25T15:31:57Z

/test

devimc

thanks @dgibson - lgtm - I have one question

devimc · 2020-09-25T16:02:06Z

agent.go

@@ -181,7 +181,7 @@ var logsVSockPort = uint32(0)
 var debugConsoleVSockPort = uint32(0)

 // Timeout waiting for a device to be hotplugged
-var hotplugTimeout = 3 * time.Second
+var hotplugTimeout = 10 * time.Second


some container engines like docker and podman kill the container if its status has not changed to started after 10s - so the kata container will be killed before timing out - did you consider this?

dgibson · 2020-09-26

Well, in that case the higher-level engine could time out before us, but that's kind of unavoidable. In practice the total delay with SHPC is usually ~6s, so we'd probably complete in time. The 10s here is just on the basis that the timeout should generally be substantially longer than the actual expected time, to avoid false positives.

bpradipt

/lgtm
Thanks @dgibson

jodh-intel

Thanks @dgibson.

Could you try adding a few unit tests though please?

Although the virtioBlkCCWDeviceHandler device handler sneaked in without any tests, I'd rather each handler has something in device_test.go. Related to this, note that getDevicePCIAddress is a variable, so its implementation can be changed to provoke interesting test scenarios. Also, we could consider doing the same for getDeviceName(), to potentially exercise more of the vfioDeviceHandler() error paths.

dgibson · 2020-09-30T07:42:56Z

@jodh-intel new push has some unit tests, as requested.

jodh-intel

Thanks @dgibson. We do also like negative testing (assert.Error(err) for as many scenarios as possible), so feel free to consider adding additional tests for that.

lgtm

jodh-intel · 2020-09-30T07:51:22Z

/test

dgibson · 2020-09-30T08:37:14Z

> Thanks @dgibson. We do also like negative testing (assert.Error(err) for as many scenarios as possible), so feel free to consider adding additional tests for that.

I'm a bit disinclined to do heaps here at this time, since it will all have to be ported to Rust pretty soon anyway - and I have notions for longer term cleanups that may obsolete a bunch of it. I also think quite a lot of the Kata unit tests are so tightly coupled to the code as not to actually be all that useful (they end up testing how it's done, rather than what it's doing).

devimc · 2020-10-01T13:37:57Z

thanks @dgibson but VFIO CI is still failing :(

01:05:25 + sudo kata-runtime --kata-config /tmp/tmp.dwr2HumN0S/configuration.toml run --detach -b /tmp/tmp.dwr2HumN0S/bundle --pid-file=/tmp/tmp.dwr2HumN0S/pid vfiotest
01:05:34 rpc error: code = Unknown desc = PCI Identifier for device should be of format [bridgeAddr/deviceAddr], got 
01:05:34 ++ handle_error 86

dgibson · 2020-10-02T03:21:16Z

> thanks @dgibson but VFIO CI is still failing :(

Yeah, I basically wasted a day chasing down the wrong failure :( (the one in jenkins-ci-ubuntu-18-04-initrd). Now to figure out how to reproduce and debug the one in jenkins-ubuntu-18-04-vfio.

jodh-intel · 2020-10-02T10:25:24Z

device.go

+		}
+
+		hostBdf := tokens[0]
+		guestPCIPath := tokens[1]


Given the vfio CI failure @devimc spotted in #850 (comment), please could you check both these are != "".

dgibson · 2020-10-02

Um.. no.. or at least not right away. I need to figure out why/how we're getting an empty string here, and probably fix the other side of this.
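For reference, the check requested in this review comment might be sketched as below. The "=" separator, the option format, and the name splitVFIOOption are assumptions for illustration; the diff excerpt above only shows that the two values come from tokens[0] and tokens[1].

```go
package main

import (
	"fmt"
	"strings"
)

// splitVFIOOption is a hypothetical helper. It assumes an option of the form
// "<hostBdf>=<guestPCIPath>" and rejects missing or empty fields.
func splitVFIOOption(opt string) (hostBdf, guestPCIPath string, err error) {
	tokens := strings.Split(opt, "=")
	if len(tokens) != 2 {
		return "", "", fmt.Errorf("malformed VFIO option %q", opt)
	}
	hostBdf, guestPCIPath = tokens[0], tokens[1]
	if hostBdf == "" || guestPCIPath == "" {
		return "", "", fmt.Errorf("VFIO option %q has an empty field", opt)
	}
	return hostBdf, guestPCIPath, nil
}

func main() {
	// One well-formed option and one with an empty guest PCI path.
	for _, opt := range []string{"0000:01:00.0=02/03", "0000:01:00.0="} {
		host, guest, err := splitVFIOOption(opt)
		fmt.Printf("host=%q guest=%q err=%v\n", host, guest, err)
	}
}
```

A check like this only surfaces the empty value earlier; it does not explain why the sender produced it, which is the point made in the reply.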

dgibson · 2020-10-07T02:55:17Z

/retest-vfio

dgibson · 2020-10-07T03:03:38Z

/test-vfio

dgibson · 2020-10-07T03:09:28Z

/test-vfio

likebreath

LGTM. Catching up on the changes from runtime.

dgibson · 2020-10-08T03:35:11Z

Rebased on #855, which we'll need to fix the CI failure on clh.

dgibson · 2020-10-08T05:26:58Z

/test-vfio

jodh-intel · 2020-10-08T07:35:12Z

/test-ubuntu
/test-vfio

dgibson · 2020-10-29T04:00:48Z

/test-vfio

dgibson · 2020-10-30T01:59:26Z

/test-vfio

We send information about several kinds of devices to the agent so that it can apply specific handling. We don't currently do this with VFIO devices. However, we need to do that so that the agent can properly wait for VFIO devices to be ready (previously it did that using a PCI rescan, which may not be reliable and has some very bad side effects). This patch collates and sends the relevant information.

Depends-on: github.com/kata-containers/agent#850

fixes kata-containers#2664

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>

Currently TestGetDeviceName checks just one pair of sysfs path with /dev node path, with a fair bit of setup to do that. To allow future tests with different pairs of sys/dev paths, factor out the guts of the test into a parameterized helper `oneGetDeviceNameTest`.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>

Currently getDeviceName() and listenToUdevEvents(), which supplies it with information, ignore events which don't supply a /dev path. However, we have upcoming use cases where we need to wait for a device (in the broad sense) to be ready, even though it doesn't have an actual /dev node.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
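The change described in this commit message can be pictured with a simplified matcher. The uevent struct and matchEvent function below are hypothetical stand-ins, not the agent's real types; the sketch only shows an event being accepted on the basis of its sysfs path even when it carries no /dev name.

```go
package main

import (
	"fmt"
	"strings"
)

// uevent is a pared-down, hypothetical view of a kernel uevent.
type uevent struct {
	Action  string
	DevPath string // sysfs path of the device
	DevName string // /dev node name, empty if the device has none
}

// matchEvent reports whether ev signals that the device at devAddress has
// been added. It deliberately does not require DevName to be set.
func matchEvent(ev uevent, devAddress string) bool {
	if ev.Action != "add" {
		return false
	}
	return strings.Contains(ev.DevPath, devAddress)
}

func main() {
	// A PCI device event with no /dev node still counts as "ready".
	ev := uevent{Action: "add", DevPath: "/devices/pci0000:00/0000:00:02.0"}
	fmt.Println(matchEvent(ev, "0000:00:02.0"))
}
```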

hotplugTimeout controls how long we'll wait for a uevent indicating that a device is ready in getDeviceName(). It's currently 3s, which is plenty of time to allow for the udev processing that we're usually dealing with. However, we have upcoming cases where we may need to wait for an SHPC PCI hotplug to complete. SHPC has a 5s delay built into the protocol (and therefore the guest implementations). To accommodate this case, increase the timeout to 10s.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>

If extra devices are given in the OCI spec, the user will expect them to be ready and available once the (inner) container starts executing. So, if that means hotplugging something into the Kata VM, we need to wait for that hotplug operation to complete before executing the container. For most devices that's handled by getDeviceName(), which waits for a uevent indicating the device is ready.

VFIO devices, however, don't have any handler, so we don't explicitly wait for them to be ready. We usually get away with it, because we force a PCI rescan in finishCreateContainer() which blocks until the rescan is complete. That really only works by accident though: completing a rescan doesn't necessarily mean all the device probing logic is complete. Worse, in some cases the PCI rescan can collide with the hotplug processing and cause the device to go into a broken state.

So, instead of relying on the rescan, accept information from the runtime about what VFIO devices we expect, and explicitly wait for them to be ready.

VFIO devices in Kata are (for now) weird - they will bind to whatever the Kata VM's driver for them is, so could appear as any kind of device (char, block, network interface, etc.). That means waiting for just the PCI device to be ready isn't foolproof, since there could be some extra processing time for the driver to complete probing and create the appropriate secondary devices. But, really, that was true of the forced rescan as well, so this is a clear improvement.

fixes #781

Depends-on: github.com/kata-containers/runtime#2981

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>

PCI bus rescan code was added a long time ago in Clear Containers due to lack of ACPI support in QEMU 2.9 + q35 [1]. Now this code is messing up PCIe hotplug in Kata Containers. A workaround for this issue is the "lazy attach" mechanism [2], which hotplugs LBS (Large BAR space) devices after rescanning the PCI bus. Unfortunately, some non-LBS devices are affected too, for instance SR-IOV devices. It would not make sense to lazy-attach non-LBS devices, because Kata would end up lazy-attaching all devices. Having said that, the PCI bus rescan code and the "lazy attach" mechanism should be removed.

I'm not sure why, but this seems to expose a problem in TestStorageHandlers where in some cases it relies on sb.deviceWatchers being initialized, but it isn't. So, fix that up as well.

fixes #781
fixes kata-containers/runtime#2664

[1] clearcontainers/agent#139
[2] kata-containers/runtime#2461

Signed-off-by: Julio Montes <julio.montes@intel.com>
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>

dgibson · 2020-12-18T01:47:17Z

Since the CI issues are proving so hard to debug, I've decided not to pursue this in Kata1, and leave it to Kata2.

dgibson self-assigned this Sep 23, 2020

dgibson requested review from jodh-intel, devimc, c3d and bpradipt and removed request for jodh-intel September 23, 2020 11:21

dgibson added the needs-forward-port (Changes need to be applied to a newer branch / repository) and no-backport-needed (Changes do not need to be applied to an older branch / repository) labels Sep 23, 2020

dgibson linked an issue Sep 23, 2020 that may be closed by this pull request

[qemu] q35: PCI bus rescan code is messing up PCIe hotplug #781

Closed

dgibson mentioned this pull request Sep 23, 2020

Provide information to agent to let it safely wait for VFIO devices to complete hotplug kata-containers/runtime#2981

Closed

devimc mentioned this pull request Sep 23, 2020

Consistent device address matching between getDeviceName() and listenToUdevEvents() #849

Merged

dgibson mentioned this pull request Sep 24, 2020

Allow VFIO devices to be passed through as VFIO devices #845

Closed

dgibson mentioned this pull request Sep 25, 2020

Allow VFIO devices to be passed into container *as* VFIO devices. kata-containers/runtime#2963

Closed

devimc approved these changes Sep 25, 2020

bpradipt approved these changes Sep 26, 2020

jodh-intel reviewed Sep 28, 2020

dgibson mentioned this pull request Sep 30, 2020

agent: Remove PCI rescan kata-containers/kata-containers#684

Closed

jodh-intel approved these changes Sep 30, 2020

jodh-intel mentioned this pull request Oct 2, 2020

device: Do not rescan PCI bus #782

Closed

jodh-intel reviewed Oct 2, 2020

dgibson added the do-not-merge (PR has problems or depends on another) label Oct 7, 2020

dgibson mentioned this pull request Oct 7, 2020

device: Accept the PCIAddress in the BDF format for block devices #823

Closed

devimc mentioned this pull request Oct 7, 2020

Issue with NVIDIA GPU V100 and kata-containers kata-containers/kata-containers#835

Closed

likebreath approved these changes Oct 7, 2020

dgibson added the wip (Work in Progress: PR incomplete - needs more work or rework) label and removed the do-not-merge (PR has problems or depends on another) label Oct 8, 2020

devimc mentioned this pull request Oct 27, 2020

[RFE] Switch to Q35 as the default machine type for QEMU for 2.1.0 kata-containers/kata-containers#1038

Closed

dgibson marked this pull request as draft October 29, 2020 04:01

dgibson and others added 5 commits December 2, 2020 14:58

dgibson closed this Dec 18, 2020

dgibson deleted the bug781 branch December 18, 2020 01:47

dgibson mentioned this pull request Apr 13, 2021

Remove forced PCI rescans from agent kata-containers/kata-containers#1682

Merged
