Skip to content
This repository has been archived by the owner on May 12, 2021. It is now read-only.

Wait explicitly for VFIO devices to complete hotplug, instead of relying on PCI rescan #850

Closed
wants to merge 5 commits into from

Conversation

dgibson
Copy link
Contributor

@dgibson dgibson commented Sep 23, 2020

Currently the only thing that "ensures" that VFIO devices are actually probed in the VM before the container executes is a forced PCI rescan. That has a bunch of problems as described in issue #781 .

This PR, in conjuction with a runtime PR to come shortly avoids the rescan by instead having the agent explicitly wait for the expected VFIO devices to become present.

@dgibson dgibson self-assigned this Sep 23, 2020
@dgibson dgibson requested review from jodh-intel, devimc, c3d and bpradipt and removed request for jodh-intel September 23, 2020 11:21
@dgibson dgibson added needs-forward-port Changes need to be applied to a newer branch / repository no-backport-needed Changed do not need to be applied to an older branch / repository labels Sep 23, 2020
@dgibson
Copy link
Contributor Author

dgibson commented Sep 23, 2020

I'm making the claim that no backport is needed on the basis that while this is a bug, it's apparently more or less worked up to this point.

@dgibson
Copy link
Contributor Author

dgibson commented Sep 23, 2020

/test

@dgibson dgibson linked an issue Sep 23, 2020 that may be closed by this pull request
@codecov
Copy link

codecov bot commented Sep 23, 2020

Codecov Report

Merging #850 (bdc00e8) into master (25d7471) will increase coverage by 0.02%.
The diff coverage is 71.42%.

@@            Coverage Diff             @@
##           master     #850      +/-   ##
==========================================
+ Coverage   57.72%   57.74%   +0.02%     
==========================================
  Files          19       17       -2     
  Lines        2375     2381       +6     
==========================================
+ Hits         1371     1375       +4     
  Misses        841      841              
- Partials      163      165       +2     

@dgibson
Copy link
Contributor Author

dgibson commented Sep 24, 2020

It looks like my depends-on tag broke the CI for reasona I don't understand yet. Removing it.

@bpradipt
Copy link
Contributor

@dgibson shall we wait for #849 to merge and then you can update this PR ?

@dgibson
Copy link
Contributor Author

dgibson commented Sep 25, 2020

@dgibson shall we wait for #849 to merge and then you can update this PR ?

Done :)

@dgibson
Copy link
Contributor Author

dgibson commented Sep 25, 2020

/retest-ubuntu

@devimc
Copy link

devimc commented Sep 25, 2020

/test

Copy link

@devimc devimc left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks @dgibson - lgtm - I have one question

@@ -181,7 +181,7 @@ var logsVSockPort = uint32(0)
var debugConsoleVSockPort = uint32(0)

// Timeout waiting for a device to be hotplugged
var hotplugTimeout = 3 * time.Second
var hotplugTimeout = 10 * time.Second
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

some container engines like docker and podman kill the container if its status has not changed to started after 10s - so the kata container will be killed before timing out - did you consider this?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Well, in that case the higher level engine could time out before us, but that's kind of unavoidable. In practice the total delay with SHPC is usualy ~6s, so we'd probably complete in time. The 10s here is just on the basis that the timeout should generally be substantially longer than the actual expected time, to avoid false positives.

Copy link
Contributor

@bpradipt bpradipt left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/lgtm
Thanks @dgibson

Copy link
Contributor

@jodh-intel jodh-intel left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @dgibson.

Could you try adding a few unit tests though please?

Although the virtioBlkCCWDeviceHandler device handler sneaked in without any tests, I'd rather each handler has something in device_test.go. Related to this, not that getDevicePCIAddress is a variable, so its implementation can be changed to provoke interesting test scenarios. Also, we could consider doing the same for getDeviceName() to exercise more of vfioDeviceHandler() error paths potentially.

@dgibson
Copy link
Contributor Author

dgibson commented Sep 30, 2020

@jodh-intel new push has some unit tests, as requested.

Copy link
Contributor

@jodh-intel jodh-intel left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @dgibson. We do also like negative testing (assert.Error(err) for as many scenarios as possible), so feel free to consider adding additional tests for that.

lgtm

@jodh-intel
Copy link
Contributor

/test

@dgibson
Copy link
Contributor Author

dgibson commented Sep 30, 2020

Thanks @dgibson. We do also like negative testing (assert.Error(err) for as many scenarios as possible), so feel free to consider adding additional tests for that.

I'm a bit disinclined to do heaps here at this time, since it will all have to be ported to Rust pretty soon anyway - and I have notions for longer term cleanups that may obsolete a bunch of it. I also think quite a lot of the Kata unit tests are so tightly coupled to the code as not to actually be all that useful (they end up testing how it's done, rather than what it's doing).

@devimc
Copy link

devimc commented Oct 1, 2020

thanks @dgibson but VFIO CI is still failing :(

01:05:25 + sudo kata-runtime --kata-config /tmp/tmp.dwr2HumN0S/configuration.toml run --detach -b /tmp/tmp.dwr2HumN0S/bundle --pid-file=/tmp/tmp.dwr2HumN0S/pid vfiotest
01:05:34 rpc error: code = Unknown desc = PCI Identifier for device should be of format [bridgeAddr/deviceAddr], got 
01:05:34 ++ handle_error 86

@dgibson
Copy link
Contributor Author

dgibson commented Oct 2, 2020

thanks @dgibson but VFIO CI is still failing :(

Yeah, I basically wasted a day chasing down the wrong failure :( (the one in jenkins-ci-ubunut-18-04-initrd). Now to figure out how to reproduce and debug the one in jenkins-ubuntu-18-04-vfio.

device.go Outdated
}

hostBdf := tokens[0]
guestPCIPath := tokens[1]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Given the vfio CI failure @devimc spotted in #850 (comment), please could you check both these are != "".

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Um.. no.. or at least not right away. I need to figure out why/how we're getting an empty string here, and probably fix the other side of this.

@dgibson
Copy link
Contributor Author

dgibson commented Oct 7, 2020

/retest-vfio

@dgibson
Copy link
Contributor Author

dgibson commented Oct 7, 2020

/test-vfio

@dgibson dgibson added the do-not-merge PR has problems or depends on another label Oct 7, 2020
@dgibson
Copy link
Contributor Author

dgibson commented Oct 7, 2020

/test-vfio

Copy link
Contributor

@likebreath likebreath left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. Catching up on the changes from runtime.

@dgibson dgibson added wip Work in Progress (PR incomplete - needs more work or rework) and removed do-not-merge PR has problems or depends on another labels Oct 8, 2020
@dgibson
Copy link
Contributor Author

dgibson commented Oct 8, 2020

Rebased on #855, which we'll need to fix the CI failure on clh.

@dgibson
Copy link
Contributor Author

dgibson commented Oct 8, 2020

/test-vfio

@jodh-intel
Copy link
Contributor

/test-ubuntu
/test-vfio

@dgibson
Copy link
Contributor Author

dgibson commented Oct 29, 2020

/test-vfio

@dgibson dgibson marked this pull request as draft October 29, 2020 04:01
@dgibson
Copy link
Contributor Author

dgibson commented Oct 30, 2020

/test-vfio

bpradipt pushed a commit to bpradipt/runtime that referenced this pull request Nov 27, 2020
We send information about several kinds of devices to the agent so that it
can apply specific handling.  We don't currently do this with VFIO devices.
However we need to do that so that the agent can properly wait for VFIO
devices to be ready (previously it did that using a PCI rescan which may
not be reliable and has some very bad side effects).

This patch collates and sends the relevant information.

Depends-on: github.com/kata-containers/agent#850
fixes kata-containers#2664

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
dgibson and others added 5 commits December 2, 2020 14:58
Currently TestGetDeviceName checks just one pair of sysfs path with
/dev node path, with a fair bit of setup to do that.  To allow future
tests with different pairs of sys/dev paths, factor out the guts of
the test into a parameterized helper `oneGetDeviceNameTest`.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Currently getDeviceName() and listenToUdevEvents() which supplies it with
information ignore events which don't supply a /dev path.  However, we have
upcoming use cases where we need to wait for a device (in the broad sense)
to be ready, even though it doesn't have an actual /dev node.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
hotplugTimeout controls how long we'll wait for a uevent indicating that a
device is ready in getDeviceName().  It's currently 3s, which is plenty of
time to allow for udev processing that we're usually dealing with.

However, we have upcoming cases where we may need to wait for a complete
SHPC PCI hotplug to complete.  SHPC has a 5s delay built into the protocol
(and therefore the guest implementations).

To accomodate this case, increase the timeout to 10s.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
If extra devices are given in the OCI spec, the user will expect them
to be ready and available once the (inner) container starts executing.
So, if that means hotplugging something into the Kata VM, we need to
wait for that hotplug operation to complete before executing the
container.  For most devices that's handled by getDeviceName() which
waits for a uevent indicating the device is ready.

VFIO devices, however, don't have any handler so we don't explicitly
wait for them to be ready.  We usually get away with it, because we
force a PCI rescan in in finishCreateContainer() which blocks until
the rescan is complete.  That really only works by accident though,
completing a rescan doesn't necessarily mean all the device probing
logic is complete.  Worse, in some cases the PCI rescan can collide
with the hotplug processing and cause the device to go into a broken
state.

So, instead of relying on the rescan, accept information from the
runtime about what VFIO devices we expect, and explicitly wait for
them to be ready.

VFIO devices in Kata are (for now) weird - they will bind to whatever
the Kata VM's driver for them is, so could appear as any kind of
device (char, block, network interface, etc.).  That means waiting for
just the PCI device to be ready isn't foolproof, since there could be
some extra processing time for the driver to complete probing and
create the appropriate secondary devices.

But, really, that was true of the forced rescan as well, so this is a
clear improvement.

fixes #781
Depends-on: github.com/kata-containers/runtime#2981

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
PCI bus rescan code was added long time ago in Clear Containers due to lack of
ACPI support in QEMU 2.9 + q35 [1]. Now this code is messing up PCIe hotplug
in Kata Containers. A workaround to this issue is the "lazy attach"
mechanism [2] that hotplugs LBS (Large BAR space) devices after re-scanning the
PCI bus, unfourtunately some non-LBS devices are being affected too, for
instance SR-IOV devices. It would not make sense to lazy-attach non-LBS
devices because kata will end up lazy-attaching all the devices, having said
that, the PCI bus rescan code and the "lazy attach" mechanism should be removed

I'm not sure why, but this seems to expose a problem in TestStorageHandlers
where in some cases it relies on sb.deviceWatchers being initialized, but
it isn't.  So, fix that up as well.

fixes #781
fixes kata-containers/runtime#2664

[1] clearcontainers/agent#139
[2] kata-containers/runtime#2461

Signed-off-by: Julio Montes <julio.montes@intel.com>
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
@dgibson
Copy link
Contributor Author

dgibson commented Dec 18, 2020

Since the CI issues are proving so hard to debug, I've decided not to pursue this in Kata1, and leave it to Kata2.

# for free to subscribe to this conversation on GitHub. Already have an account? #.
Labels
needs-forward-port Changes need to be applied to a newer branch / repository no-backport-needed Changed do not need to be applied to an older branch / repository wip Work in Progress (PR incomplete - needs more work or rework)
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[qemu] q35: PCI bus rescan code is messing up PCIe hotplug
5 participants