Cirrus: Use updated VM images #1558
Depends on containers/skopeo#1558 Mainly this is to confirm some changes needed for the podman-py CI setup don't disrupt operations here. Ref: containers/automation_images#111 Signed-off-by: Chris Evich <cevich@redhat.com>
@mtrmac have you seen this failure before? I'm getting this (and another similar error) after updating the VM images:
@cevich There have been a few DNS flakes recently, but everything else in the tests does succeed. At least on my local system, lookups of those host names do fail immediately with NXDOMAIN. So my working hypothesis is that something, somewhere, is explicitly treating Can you tell whether the change is related to the different VM image, or to something in the CI environment? We can probably invent some other host name ( |
Gosh this would be really hard. Kernel updates, glibc updates, systemd updates, etc. There's likely a lot of incidental updates
Actually, we do have control over If you need domains that DO exist with predictable results, XIP and NIP can be useful for testing forward and reverse lookups. |
I was thinking something like “if it starts to fail with the existing images, and not just with this PR, it’s something in the environment”. I guess we’ll see in a few days.
This is testing the mirroring; i.e. we need 3 host names that clearly don’t contain registries, and then we test that 1) with a mirror configuration we use the mirror, and 2) without a mirror configuration we fail. Failing in a predictable, easy to detect, way, like NXDOMAIN, is best. It wouldn’t be too bad to just relax the regex of expected failure mode, instead. Right now, To experiment with this, it should be enough to replace the various Actually, per https://www.rfc-editor.org/rfc/rfc6761.html#section-6.4 , |
We run builds on the branches every day and continuously (after every merge), I believe the report e-mail is sent at 8pm EST. Regardless, the post-merge results can be observed here: https://cirrus-ci.com/github/containers/skopeo/main (hint: they're all green). So yeah, it's most def. "something" in the environment. Perhaps the tests were unintentionally depending on a bug that got fixed?
Oh interesting, I wouldn't have guessed that was actually defined. So maybe there was a bug-fix in the glibc resolver (assuming that affects golang)?
Let me ask some stupid questions to make sure I understand the situation: Per the commit history, we’ve had a successful test run 11 hours ago. With this PR, and the updated VM image (which does affect this PR, I assume), the test is failing. So, the cause is either something that has changed in the environment during the last 11 hours, or a change in the contents of the VM image, isn’t it? If that is correct, well, let’s see which of the two it is: #1559.
#1559 did pass just now. So is it the VM images?
Or a flake...but assuming not, yes it's probably due to a change in the VM images. It may even be a new bug rather than a bug fix. It's just almost always really hard to tell by looking at the package change history. Is there some simple command I can run inside a |
Force-pushed from 11f2c7e to 7ec98de.
Interesting...so with the systemd resolver in place, a dig of |
Damn, can't find it or I'm remembering something unrelated.
Force-pushed from be452ec to 92de540.
@mtrmac these images are more or less passing in every other repo's CI, so I'm thinking this is a problem with the tests. I'm going to get a pair of VMs up, one old, one new. Is there anything specific you'd like me to check/compare that would help find the problem?
Weird. New VM:
Old VM:
Laptop:
Running the command from the test, new VM:
Old VM:
In case it matters, there is a slight difference in golang as well, New VM has |
I don’t really know:
Also, figure out which resolver Skopeo is contacting (systemd-resolved? The new Podman DNS server?). Is there a difference in implementations/versions/configurations of that resolver in the old/new case? And just to be sure, is the environment representative? AFAICS reading At some point, just relaxing the regex might be the most practical way to proceed. The tests check that a name is redirected when there is a match / that it isn’t redirected when there isn’t a match, and testing “the name was not redirected” is conceptually equally good when accepting two error strings as when accepting just one, as long as in the “the name was redirected” case, the same error doesn’t happen.
Ahh ha! I think this might be it. I just discovered both netavark and aardvark-dns are (unexpectedly) installed. I think this might be causing another problem in CI elsewhere. I'm going to re-build with them explicitly excluded.
So if I understand correctly, both roads (errors) end up at the same (desired) destination, but the tests are throwing a false-negative due to precise message-sensitivity?
Yes, IIRC. We do check for a precise error, but we don’t quite care about the exact way it fails. A host pointing to an existing registry which refused to authenticate the client could also be used to test the mirror redirection, for example; in that case we would be looking for the authentication failure. (A more precise check is a bit safer, primarily it’s good if the failure is very unlikely to happen on a successful redirect.)
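Relaxing the check as discussed above could look roughly like the following Go sketch. It is not the actual skopeo test code. The two accepted messages, `no such host` and `server misbehaving`, are error strings Go's resolver is known to produce, but whether they are the exact two strings seen in this CI run is an assumption; the sample output lines and the name `nameWasNotRedirected` are made up.

```go
package main

import (
	"fmt"
	"regexp"
)

// expectedLookupFailure accepts either of two resolver failure messages.
// Both alternatives are assumptions chosen for illustration.
var expectedLookupFailure = regexp.MustCompile(
	`lookup [^ ]+.*: (no such host|server misbehaving)`)

// nameWasNotRedirected reports whether the command output shows a lookup
// failure for the original (non-mirror) host name. Hypothetical helper.
func nameWasNotRedirected(output string) bool {
	return expectedLookupFailure.MatchString(output)
}

func main() {
	// Illustrative outputs only; the real test output is not reproduced here.
	samples := []string{
		"dial tcp: lookup registry.invalid: no such host",
		"dial tcp: lookup registry.invalid on 127.0.0.53:53: server misbehaving",
		"unauthorized: authentication required",
	}
	for _, s := range samples {
		fmt.Printf("%t  %s\n", nameWasNotRedirected(s), s)
	}
}
```

The third sample stands in for the failure mode of a successful redirect; the relaxed pattern must keep rejecting it, otherwise the test can no longer tell the two cases apart.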
Force-pushed from 92de540 to ddcb457.
Great, so I would suggest maybe adding a loosening of the check for that, since eventually we will be using netavark/aardvark-dns everywhere. In any case, I've just force-pushed with some images that should have those packages excluded, let's see if it makes a difference...
...damn (log). That wasn't it 😢
I s'pose I'll give this a try next.
Per Fedora policy, just installing a package shouldn’t even start a network server, let alone change the default resolver… |
True, though it was still an error those packages were there to begin with (they wreaked havoc in podman CI). When I was in the VM, I checked and both old/new images do have
That’s possible (https://www.freedesktop.org/software/systemd/man/nss-resolve.html), and https://fedoraproject.org/wiki/Changes/systemd-resolved suggests the plan was, at least at one time, to use it that way.
Bingo...
So it's probably not even looking at |
(new VM)
Force-pushed from ddcb457 to 1c3f0c7.
@mtrmac okay, that fixed it! I've opened an issue re: desensitizing the tests, otherwise this PR is good-to-go. Thanks for your help 😃
Mainly this is to confirm some changes needed for the podman-py CI setup don't disrupt operations here. Ref: containers/automation_images#111 Note: Glibc resolver configuration has changed from previous images. An additional setup command was added to remove systemd-resolved from the chain. Signed-off-by: Chris Evich <cevich@redhat.com>
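The setup command itself is not shown in this thread. As a rough sketch of what "removing systemd-resolved from the chain" can mean, the Go function below drops the `resolve` source (and the bracketed action that follows it) from the `hosts:` line of an `nsswitch.conf`. The function name `dropResolve` and the sample configuration are assumptions, loosely modelled on a Fedora-style default, and trailing comments on the `hosts:` line are not handled.

```go
package main

import (
	"fmt"
	"strings"
)

// dropResolve removes the "resolve" source, plus any bracketed action that
// immediately follows it, from the hosts line of nsswitch.conf content.
// Other lines are returned unchanged. Hypothetical helper for illustration.
func dropResolve(conf string) string {
	lines := strings.Split(conf, "\n")
	for i, line := range lines {
		fields := strings.Fields(line)
		if len(fields) == 0 || fields[0] != "hosts:" {
			continue
		}
		kept := []string{"hosts:"}
		afterResolve := false // previous token was "resolve"
		inAction := false     // inside a multi-token "[...]" action being dropped
		for _, f := range fields[1:] {
			switch {
			case inAction:
				if strings.HasSuffix(f, "]") {
					inAction = false
				}
				continue
			case f == "resolve":
				afterResolve = true
				continue
			case afterResolve && strings.HasPrefix(f, "["):
				afterResolve = false
				inAction = !strings.HasSuffix(f, "]")
				continue
			}
			afterResolve = false
			kept = append(kept, f)
		}
		lines[i] = strings.Join(kept, " ")
	}
	return strings.Join(lines, "\n")
}

func main() {
	// Sample content is an assumption, not copied from the CI images.
	sample := "passwd: files sss\nhosts: files myhostname resolve [!UNAVAIL=return] dns\n"
	fmt.Print(dropResolve(sample))
}
```

With `resolve` gone, glibc-based lookups fall through to the remaining sources on the line instead of being handed to systemd-resolved first.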
Force-pushed from 1c3f0c7 to 0f36349.
LGTM
@cevich Thanks. If this is reproducible on “standard” installations, could you file a systemd bug as well?
... per containers#1558 . Signed-off-by: Miloslav Trmač <mitr@redhat.com>
I'm happy to do the work, however I have not tried to reproduce it under the stock setup, nor do I have a good enough grasp on what's breaking down precisely. I would think the systemd maintainers would need a non-skopeo reproducer and/or low-level details regarding what/if systemd-resolved is doing wrong. I really don't feel knowledgeable enough to intelligently discuss the fine points with them in a bug 😕