Cirrus: Use updated VM images #1558

cevich · 2022-02-01T17:21:02Z

Mainly this is to confirm some changes needed for the podman-py CI
setup don't disrupt operations here. Ref: containers/automation_images#111

Signed-off-by: Chris Evich <cevich@redhat.com>

Depends on containers/skopeo#1558 Mainly this is to confirm some changes needed for the podman-py CI setup don't disrupt operations here. Ref: containers/automation_images#111 Signed-off-by: Chris Evich <cevich@redhat.com>

cevich · 2022-02-01T17:50:39Z

@mtrmac have you seen this failure before? I'm getting this (and another similar error) after updating the VM images:

FAIL: copy_test.go:1320: SkopeoSuite.TestFailureCopySrcWithMirrorAndPrefixUnavailable
Running skopeo --registries-conf=./fixtures/registries.conf copy docker://gcr.invalid/wrong/prefix/busybox dir:/tmp/copy-mirror579140904
copy_test.go:1324:
    assertSkopeoFails(c, ".*no such host.*", "--registries-conf="+regConfFixture, "copy",
        "docker://gcr.invalid/wrong/prefix/busybox", "dir:"+dir)
utils.go:80:
    c.Assert(string(out), check.Matches, "(?s)"+regexp) // (?s) : '.' will also match newlines
... value string = "time=\"2022-02-01T11:27:54-06:00\" level=fatal msg=\"initializing source docker://gcr.invalid/wrong/prefix/busybox:latest: pinging container registry gcr.invalid: Get \\\"https://gcr.invalid/v2/\\\": dial tcp: lookup gcr.invalid: Temporary failure in name resolution\"\n"
... regex string = "(?s).*no such host.*"

mtrmac · 2022-02-01T18:23:09Z

@cevich There have been a few DNS flakes recently, but everything else in the tests does succeed.

At least on my local system, lookups of those host names do fail immediately with NXDOMAIN.

So my working hypothesis is that something, somewhere, is explicitly treating .invalid as special (or we are implicitly sharing state with some other code, maybe via a DNS cache)

Can you tell whether the change is related to the different VM image, or to something in the CI environment?

We can probably invent some other host name (this.does.not.exist.example.com or, to be in actual control of the name, ….redhat.com) — for that, it would be nice to be able to tell whether we chose one that passes.

cevich · 2022-02-01T19:20:02Z

Can you tell whether the change is related to the different VM image, or to something in the CI environment?

Gosh this would be really hard. Kernel updates, glibc updates, systemd updates, etc. There are likely a lot of incidental updates to sift through. The main intentional change in these images is just the shifting around of some python-related packages which (I would think) shouldn't affect anything here.

We can probably invent some other host name

Actually, we do have control over the podman.io DNS domain. If we need to add some special testing-related entries, give me the specifics, and I can talk to Tom about it. The example.com domain has a special "documentation" designation so I'm not sure how useful that is for negative-testing.

If you need domains that DO exist with predictable results, XIP and NIP can be useful for testing forward and reverse lookups.

mtrmac · 2022-02-01T20:05:42Z

Can you tell whether the change is related to the different VM image, or to something in the CI environment?

Gosh this would be really hard.

I was thinking something like “if it starts to fail with the existing images, and not just with this PR, it’s something in the environment”. I guess we’ll see in a few days.

We can probably invent some other host name

Actually, we do have control over the podman.io DNS domain. If we need to add some special testing-related entries, give me the specifics, and I can talk to Tom about it. The example.com domain has a special "documentation" designation so I'm not sure how useful that is for negative-testing.

This is testing the mirroring; i.e. we need 3 host names that clearly don’t contain registries, and then we test that 1) with a mirror configuration we use the mirror, and 2) without a mirror configuration we fail.

Failing in a predictable, easy to detect, way, like NXDOMAIN, is best. It wouldn’t be too bad to just relax the regex of expected failure mode, instead.

Right now, this-does-not-exist.podman.io redirects to containers.github.io. Carving out domains that don’t exist, and don’t trigger that redirect (even ${uuidN}.podman.io) would work. OTOH it would be nice not to have that kind of very hard-to-see dependency between this GitHub repo and an, essentially, marketing domain.

To experiment with this, it should be enough to replace the various .invalid host names, consistently in all occurrences, in integration/copy_test.go and integration/fixtures/registries.conf.

Actually, per https://www.rfc-editor.org/rfc/rfc6761.html#section-6.4 , .invalid is explicitly supposed to always result in NXDOMAIN. So something is fairly badly broken.

cevich · 2022-02-01T20:22:53Z

I guess we’ll see in a few days.

We run builds on the branches every day and continuously (after every merge), I believe the report e-mail is sent at 8pm EST. Regardless, the post-merge results can be observed here: https://cirrus-ci.com/github/containers/skopeo/main (hint: they're all green).

So yeah, it's most def. "something" in the environment. Perhaps the tests were unintentionally depending on a bug that got fixed?

Actually, per https://www.rfc-editor.org/rfc/rfc6761.html#section-6.4 , .invalid is explicitly supposed to always result in NXDOMAIN. So something is fairly badly broken.

Oh interesting, I wouldn't have guessed that was actually defined. So maybe there was a bug-fix in the glibc resolver (assuming that affects golang)?

mtrmac · 2022-02-01T20:31:13Z

Let me ask some stupid questions to make sure I understand the situation:

Per the commit history, we’ve had a successful test run 11 hours ago. With this PR, and the updated VM image (which does affect this PR, I assume), the test is failing.

So, the cause is either something that has changed in the environment during the last 11 hours, or a change in the contents of the VM image, isn’t it?

If that is correct, well, let’s see which of the two it is: #1559 .

mtrmac · 2022-02-01T20:46:05Z

#1559 did pass just now. So is it the VM images?

cevich · 2022-02-01T20:54:01Z

So is it the VM images?

Or a flake...but assuming not, yes it's probably due to a change in the VM images. It may even be a new bug rather than a bug fix. It's just almost always really hard to tell by looking at the package change history. Is there some simple command I can run inside a hack/get_ci_vm.sh that would help narrow things down?

cevich · 2022-02-01T21:29:02Z

Interesting...so with the systemd resolver in place, a dig of gcr.invalid gives me an empty response (as opposed to a failure). IIRC, I disabled the systemd resolver (somewhere) because it was causing problems with tests. Let me see if I can find that.

cevich · 2022-02-02T16:20:57Z

Damn, can't find it or I'm remembering something unrelated.

cevich · 2022-02-08T16:21:50Z

@mtrmac these images are more/less passing in every other repo.'s CI, so I'm thinking this is a problem with the tests. I'm going to get a pair of VMs up, one old, one new. Is there anything specific you'd like me to check/compare that would help find the problem?

cevich · 2022-02-08T16:37:53Z

Weird. New VM:

[root@cevich-fedora-c5533341981081600 skopeo]# dig gcr.invalid

; <<>> DiG 9.16.24-RH <<>> gcr.invalid
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NXDOMAIN, id: 11896
;; flags: qr rd ra; QUERY: 1, ANSWER: 0, AUTHORITY: 1, ADDITIONAL: 1

;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 512
;; QUESTION SECTION:
;gcr.invalid.                   IN      A

;; AUTHORITY SECTION:
.                       86400   IN      SOA     a.root-servers.net. nstld.verisign-grs.com. 2022020800 1800 900 604800 86400

;; Query time: 27 msec
;; SERVER: 169.254.169.254#53(169.254.169.254)
;; WHEN: Tue Feb 08 10:26:38 CST 2022
;; MSG SIZE  rcvd: 115

[root@cevich-fedora-c5533341981081600 skopeo]# cat /etc/resolv.conf
# Generated by NetworkManager
search c.skopeo.internal google.internal
nameserver 169.254.169.254

Old VM:

[root@cevich-fedora-c6226133906620416 skopeo]# dig gcr.invalid

; <<>> DiG 9.16.24-RH <<>> gcr.invalid
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NXDOMAIN, id: 9410
;; flags: qr rd ra; QUERY: 1, ANSWER: 0, AUTHORITY: 1, ADDITIONAL: 1

;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 512
;; QUESTION SECTION:
;gcr.invalid.                   IN      A

;; AUTHORITY SECTION:
.                       86400   IN      SOA     a.root-servers.net. nstld.verisign-grs.com. 2022020800 1800 900 604800 86400

;; Query time: 294 msec
;; SERVER: 169.254.169.254#53(169.254.169.254)
;; WHEN: Tue Feb 08 10:34:40 CST 2022
;; MSG SIZE  rcvd: 115

[root@cevich-fedora-c6226133906620416 skopeo]# cat /etc/resolv.conf
# Generated by NetworkManager
search c.skopeo.internal google.internal
nameserver 169.254.169.254

Laptop:

[cevich@localhost ~]$ dig gcr.invalid

; <<>> DiG 9.16.24-RH <<>> gcr.invalid
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: SERVFAIL, id: 32132
;; flags: qr aa rd ra; QUERY: 1, ANSWER: 0, AUTHORITY: 0, ADDITIONAL: 1

;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 65494
;; QUESTION SECTION:
;gcr.invalid.			IN	A

;; Query time: 0 msec
;; SERVER: 127.0.0.53#53(127.0.0.53)
;; WHEN: Tue Feb 08 11:36:25 EST 2022
;; MSG SIZE  rcvd: 40

cevich · 2022-02-08T16:52:16Z

Running the command from the test, new VM:

[root@cevich-fedora-c5533341981081600 integration]# ../bin/skopeo --registries-conf=./fixtures/registries.conf copy docker://gcr.invalid/wrong/prefix/busybox dir:/tmp/copy-mirror401442990
FATA[0000] initializing source docker://gcr.invalid/wrong/prefix/busybox:latest: pinging container registry gcr.invalid: Get "https://gcr.invalid/v2/": dial tcp: lookup gcr.invalid: Temporary failure in name resolution

Old VM:

[root@cevich-fedora-c6226133906620416 integration]# ../bin/skopeo --registries-conf=./fixtures/registries.conf copy docker://gcr.invalid/wrong/prefix/busybox dir:/tmp/copy-mirror401442990
FATA[0000] initializing source docker://gcr.invalid/wrong/prefix/busybox:latest: pinging container registry gcr.invalid: Get "https://gcr.invalid/v2/": dial tcp: lookup gcr.invalid: no such host

cevich · 2022-02-08T16:55:05Z

In case it matters, there is a slight difference in golang as well, New VM has golang-1.16.13-1.fc35.x86_64, old one uses golang-1.16.8-2.fc35.x86_64

mtrmac · 2022-02-08T17:30:18Z

@mtrmac these images are more/less passing in every other repo.'s CI, so I'm thinking this is a problem with the tests. I'm going to get a pair of VMs up, one old, one new. Is there anything specific you'd like me to check/compare that would help find the problem?

I don’t really know:

dig, per the above, shows raw DNS responses.
getent hosts $host shows the glibc resolver’s view (but that doesn’t report errors on a detailed enough level)
A Go program, compiled the same way Skopeo is, would exercise the relevant code path — something like https://go.dev/play/p/ufwm6LMiZqw .
Ultimately I guess there’s strace or Wireshark.
Compare /etc/resolv.conf and /etc/nsswitch.conf (is sssd, nss-myhostname involved in one of them only?).

Also, figure out what resolver is Skopeo contacting (systemd-resolved? The new Podman DNS server?). Is there a difference in implementations/versions/configurations of that resolver in the old/new case?

And just to be sure, is the environment representative? AFAICS reading .cirrus.yml we are running these tests directly on the VM, so outputs directly from the VM should be relevant.

At some point, just relaxing the regex might be the most practical way to proceed. The tests check that a name is redirected when there is a match / that it isn’t redirected when there isn’t a match, and testing “the name was not redirected” is conceptually equally good when accepting two error strings as when accepting just one, as long as in “the name was redirected” case, the same error doesn’t happen.
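To make the "relax the regex" option concrete, here is a minimal sketch of a pattern that accepts both error strings seen in this thread. The names relaxedPattern and matchesRelaxed are hypothetical, and the ^...$ anchoring is an assumption about how the check library applies the pattern; only the "(?s)" prefix is taken from the utils.go line quoted earlier.

```go
package main

import (
	"fmt"
	"regexp"
)

// relaxedPattern is a hypothetical replacement for ".*no such host.*"
// that also accepts the glibc/systemd-resolved style failure.
const relaxedPattern = `.*(no such host|Temporary failure in name resolution).*`

// matchesRelaxed prepends "(?s)" like the test helper quoted in this
// thread, so that '.' also matches newlines. The anchoring is assumed.
func matchesRelaxed(output string) bool {
	re := regexp.MustCompile("^(?s)" + relaxedPattern + "$")
	return re.MatchString(output)
}

func main() {
	oldVM := `Get "https://gcr.invalid/v2/": dial tcp: lookup gcr.invalid: no such host`
	newVM := `Get "https://gcr.invalid/v2/": dial tcp: lookup gcr.invalid: Temporary failure in name resolution`
	unrelated := "some unrelated output"

	fmt.Println(matchesRelaxed(oldVM))     // true
	fmt.Println(matchesRelaxed(newVM))     // true
	fmt.Println(matchesRelaxed(unrelated)) // false
}
```

The trade-off is the one described above: the check stays meaningful only as long as neither string can appear when the name is actually redirected to a mirror.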

cevich · 2022-02-08T18:35:11Z

figure out what resolver is Skopeo contacting (systemd-resolved? The new Podman DNS server?

Ahh ha! I think this might be it. I just discovered both netavark and aardvark-dns are (unexpectedly) installed. I think this might be causing another problem in CI elsewhere. I'm going to re-build with them explicitly excluded.

At some point, just relaxing the regex might be the most practical way to proceed.

So if I understand correctly, both roads (errors) end up at the same (desired) destination, but the tests are throwing a false-negative due to precise message-sensitivity?

mtrmac · 2022-02-08T18:39:39Z

So if I understand correctly, both roads (errors) end up at the same (desired) destination, but the tests are throwing a false-negative due to precise message-sensitivity?

Yes, IIRC. We do check for a precise error, but we don’t quite care about the exact way it fails. A host pointing to an existing registry which refused to authenticate the client could also be used to test the mirror redirection, for example — we would be looking for the authentication failure. (A more precise check is a bit safer, primarily it’s good if the failure is very unlikely to happen on a successful redirect.)

cevich · 2022-02-08T19:24:10Z

Great, so I would suggest maybe adding a loosening of the check for that, since eventually we will be using netavark/aardvark-dns everywhere. In any case, I've just force-pushed with some images that should have those packages excluded, let's see if it makes a difference...

cevich · 2022-02-08T19:40:10Z

...damn (log). That wasn't it 😢

cevich · 2022-02-08T19:41:32Z

A Go program, compiled the same way Skopeo is, would exercise the relevant code path — something like https://go.dev/play/p/ufwm6LMiZqw .

I s'pose I'll give this a try next.

mtrmac · 2022-02-08T19:44:02Z

Per Fedora policy, just installing a package shouldn’t even start a network server, let alone change the default resolver…

cevich · 2022-02-08T19:47:54Z

True, though it was still an error those packages were there to begin with (they wreaked havoc in podman CI). When I was in the VM, I checked and both old/new images do have systemd-resolved running, however /etc/resolv.conf doesn't appear to be referring to it (127.0.0.53). I didn't look at nsswitch.conf though, so that could be a pathway to systemd-resolved maybe?

mtrmac · 2022-02-08T19:55:07Z

That’s possible (https://www.freedesktop.org/software/systemd/man/nss-resolve.html), and https://fedoraproject.org/wiki/Changes/systemd-resolved suggests the plan was, at least at one time, to use it that way.

cevich · 2022-02-10T16:09:43Z

Bingo...nsswitch.conf contains:

hosts: files myhostname resolve [!UNAVAIL=return] dns

So it's probably not even looking at resolv.conf, ya?
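To spell out the lookup order in that hosts line, here is a minimal sketch that splits it into services and their bracketed actions. The type nssSource and the function parseHostsLine are hypothetical names, and the parser is simplified: it assumes each action is a single bracketed token without spaces. As I read the nsswitch.conf documentation, [!UNAVAIL=return] tells glibc to stop at the preceding service for any status other than UNAVAIL, so while systemd-resolved is reachable, the dns service (and with it resolv.conf) would not be consulted.

```go
package main

import (
	"fmt"
	"strings"
)

// nssSource is one service in an nsswitch.conf database line.
type nssSource struct {
	Name   string // service name, e.g. "files", "resolve", "dns"
	Action string // bracketed action after the service, without brackets
}

// parseHostsLine splits a "hosts:" line into services in lookup order.
// Simplification: an action must be one token such as "[!UNAVAIL=return]".
func parseHostsLine(line string) []nssSource {
	line = strings.TrimPrefix(strings.TrimSpace(line), "hosts:")
	var sources []nssSource
	for _, field := range strings.Fields(line) {
		if strings.HasPrefix(field, "[") && strings.HasSuffix(field, "]") {
			// Attach the action to the service it follows.
			if n := len(sources); n > 0 {
				sources[n-1].Action = strings.Trim(field, "[]")
			}
			continue
		}
		sources = append(sources, nssSource{Name: field})
	}
	return sources
}

func main() {
	for i, s := range parseHostsLine("hosts: files myhostname resolve [!UNAVAIL=return] dns") {
		fmt.Printf("%d. %s action=%q\n", i+1, s.Name, s.Action)
	}
}
```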

cevich · 2022-02-10T16:21:15Z

(new VM)

[root@cevich-fedora-c4764556961513472 skopeo]# bin/skopeo --registries-conf=integration/fixtures/registries.conf copy docker://gcr.invalid/wrong/prefix/busybox dir:/tmp/copy-mirror401442990
FATA[0000] initializing source docker://gcr.invalid/wrong/prefix/busybox:latest: pinging container registry gcr.invalid: Get "https://gcr.invalid/v2/": dial tcp: lookup gcr.invalid: Temporary failure in name resolution
[root@cevich-fedora-c4764556961513472 skopeo]# vi /etc/nsswitch.conf  # removed 'resolve [!UNAVAIL=return]'
[root@cevich-fedora-c4764556961513472 skopeo]# bin/skopeo --registries-conf=integration/fixtures/registries.conf copy docker://gcr.invalid/wrong/prefix/busybox dir:/tmp/copy-mirror401442990
FATA[0000] initializing source docker://gcr.invalid/wrong/prefix/busybox:latest: pinging container registry gcr.invalid: Get "https://gcr.invalid/v2/": dial tcp: lookup gcr.invalid on 169.254.169.254:53: no such host
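The manual vi edit in the session above can be expressed as a small transformation. This is a minimal sketch only: the function name dropResolve is hypothetical, and it is not the setup command that ended up in the CI scripts, which is not shown in this thread.

```go
package main

import (
	"fmt"
	"strings"
)

// dropResolve removes the "resolve" service, and a bracketed action that
// directly follows it, from an nsswitch.conf "hosts:" line.
func dropResolve(line string) string {
	fields := strings.Fields(line)
	kept := make([]string, 0, len(fields))
	skipAction := false
	for _, f := range fields {
		if f == "resolve" {
			skipAction = true // also drop the action attached to it
			continue
		}
		if skipAction && strings.HasPrefix(f, "[") {
			skipAction = false
			continue
		}
		skipAction = false
		kept = append(kept, f)
	}
	return strings.Join(kept, " ")
}

func main() {
	fmt.Println(dropResolve("hosts: files myhostname resolve [!UNAVAIL=return] dns"))
	// prints "hosts: files myhostname dns"
}
```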

cevich · 2022-02-10T18:37:54Z

@mtrmac okay, that fixed it! I've opened an issue re: desensitizing the tests, otherwise this PR is good-to-go. Thanks for your help 😃

Mainly this is to confirm some changes needed for the podman-py CI setup don't disrupt operations here. Ref: containers/automation_images#111 Note: Glibc resolver configuration has changed from previous images. An additional setup command was added to remove systemd-resolved from the chain. Signed-off-by: Chris Evich <cevich@redhat.com>

rhatdan · 2022-02-10T19:23:13Z

LGTM

mtrmac · 2022-02-10T22:45:49Z

@cevich Thanks. If this is reproducible on “standard” installations, could you file a systemd bug as well?

... per containers#1558 . Signed-off-by: Miloslav Trmač <mitr@redhat.com>

cevich · 2022-02-11T20:10:27Z

If this is reproducible on “standard” installations, could you file a systemd bug as well?

I'm happy to do the work, however I have not tried to reproduce it under the stock setup, nor do I have a good enough grasp on what's breaking down precisely. I would think the systemd maintainers would need a non-skopeo reproducer and/or low-level details regarding what/if systemd-resolved is doing wrong. I really don't feel knowledgeable enough to intelligently discuss the fine points with them in a bug 😕

cevich mentioned this pull request Feb 1, 2022

[WIP] Cirrus: Use updated VM images containers/image#1456

Closed

cevich marked this pull request as draft February 1, 2022 17:56

cevich changed the title ~~Cirrus: Use updated VM images~~ [WIP] Cirrus: Use updated VM images Feb 1, 2022

mtrmac mentioned this pull request Feb 1, 2022

DO NOT MERGE: Trigger a test run #1559

Closed

cevich force-pushed the new_python_images branch from 11f2c7e to 7ec98de Compare February 1, 2022 21:06

cevich force-pushed the new_python_images branch 2 times, most recently from be452ec to 92de540 Compare February 7, 2022 17:26

cevich force-pushed the new_python_images branch from 92de540 to ddcb457 Compare February 8, 2022 19:22

cevich force-pushed the new_python_images branch from ddcb457 to 1c3f0c7 Compare February 10, 2022 16:43

cevich mentioned this pull request Feb 10, 2022

Integration tests overly sensitive to resolver result #1567

Closed

cevich changed the title ~~[WIP] Cirrus: Use updated VM images~~ Cirrus: Use updated VM images Feb 10, 2022

cevich force-pushed the new_python_images branch from 1c3f0c7 to 0f36349 Compare February 10, 2022 18:38

cevich marked this pull request as ready for review February 10, 2022 18:38

rhatdan merged commit 59a4522 into containers:main Feb 10, 2022

mtrmac added a commit to mtrmac/skopeo that referenced this pull request Feb 10, 2022

Work around systemd-resolved's handling of .invalid domains

438453a

... per containers#1558 . Signed-off-by: Miloslav Trmač <mitr@redhat.com>

mtrmac mentioned this pull request Feb 10, 2022

Resolved workaround #1568

Merged

mtrmac added a commit to mtrmac/skopeo that referenced this pull request Feb 15, 2022

Work around systemd-resolved's handling of .invalid domains

2318307

... per containers#1558 . Signed-off-by: Miloslav Trmač <mitr@redhat.com>

github-actions bot added the locked - please file new issue/PR label Sep 20, 2023

github-actions bot locked as resolved and limited conversation to collaborators Sep 20, 2023
