Cirrus: Use updated VM images #1558

Merged
merged 1 commit into from
Feb 10, 2022

Conversation

cevich
Member

@cevich cevich commented Feb 1, 2022

Mainly this is to confirm some changes needed for the podman-py CI
setup don't disrupt operations here. Ref:

containers/automation_images#111

Signed-off-by: Chris Evich <cevich@redhat.com>

cevich added a commit to cevich/image that referenced this pull request Feb 1, 2022
Depends on containers/skopeo#1558

Mainly this is to confirm some changes needed for the podman-py CI
setup don't disrupt operations here. Ref:

containers/automation_images#111

Signed-off-by: Chris Evich <cevich@redhat.com>
@cevich
Member Author

cevich commented Feb 1, 2022

@mtrmac have you seen this failure before? I'm getting this (and another similar error) after updating the VM images:

FAIL: copy_test.go:1320: SkopeoSuite.TestFailureCopySrcWithMirrorAndPrefixUnavailable
Running skopeo --registries-conf=./fixtures/registries.conf copy docker://gcr.invalid/wrong/prefix/busybox dir:/tmp/copy-mirror579140904
copy_test.go:1324:
    assertSkopeoFails(c, ".*no such host.*", "--registries-conf="+regConfFixture, "copy",
        "docker://gcr.invalid/wrong/prefix/busybox", "dir:"+dir)
utils.go:80:
    c.Assert(string(out), check.Matches, "(?s)"+regexp) // (?s) : '.' will also match newlines
... value string = "time=\"2022-02-01T11:27:54-06:00\" level=fatal msg=\"initializing source docker://gcr.invalid/wrong/prefix/busybox:latest: pinging container registry gcr.invalid: Get \\\"https://gcr.invalid/v2/\\\": dial tcp: lookup gcr.invalid: Temporary failure in name resolution\"\n"
... regex string = "(?s).*no such host.*"

@cevich cevich marked this pull request as draft February 1, 2022 17:56
@cevich cevich changed the title Cirrus: Use updated VM images [WIP] Cirrus: Use updated VM images Feb 1, 2022
@mtrmac
Contributor

mtrmac commented Feb 1, 2022

@cevich There have been a few DNS flakes recently, but everything else in the tests does succeed.

At least on my local system, lookups of those host names do fail immediately with NXDOMAIN.

So my working hypothesis is that something, somewhere, is explicitly treating .invalid as special (or we are implicitly sharing state with some other code, maybe via a DNS cache)

Can you tell whether the change is related to the different VM image, or to something in the CI environment?

We can probably invent some other host name (this.does.not.exist.example.com or, to be in actual control of the name, ….redhat.com) — for that, it would be nice to be able to tell whether we chose one that passes.

@cevich
Member Author

cevich commented Feb 1, 2022

Can you tell whether the change is related to the different VM image, or to something in the CI environment?

Gosh this would be really hard. Kernel updates, glibc updates, systemd updates, etc. There's likely a lot of incidental updates to sift through. The main intentional change in these images is just shifting around some Python-related packages, which (I would think) shouldn't affect anything here.

We can probably invent some other host name

Actually, we do have control over the podman.io DNS domain. If we need to add some special testing-related entries, give me the specifics, and I can talk to Tom about it. The example.com domain has special "documentation" designation so I'm not sure how useful that is for negative-testing.

If you need domains that DO exist with predictable results, XIP and NIP can be useful for testing forward and reverse lookups.

@mtrmac
Contributor

mtrmac commented Feb 1, 2022

Can you tell whether the change is related to the different VM image, or to something in the CI environment?

Gosh this would be really hard.

I was thinking something like “if it starts to fail with the existing images, and not just with this PR, it’s something in the environment”. I guess we’ll see in a few days.

We can probably invent some other host name

Actually, we do have control over the podman.io DNS domain. If we need to add some special testing-related entries, give me the specifics, and I can talk to Tom about it. The example.com domain has special "documentation" designation so I'm not sure how useful that is for negative-testing.

This is testing the mirroring; i.e. we need 3 host names that clearly don’t contain registries, and then we test that 1) with a mirror configuration we use the mirror, and 2) without a mirror configuration we fail.

Failing in a predictable, easy to detect, way, like NXDOMAIN, is best. It wouldn’t be too bad to just relax the regex of expected failure mode, instead.

Right now, this-does-not-exist.podman.io redirects to containers.github.io. Carving out domains that don’t exist, and don’t trigger that redirect (even ${uuidN}.podman.io) would work. OTOH it would be nice not to have that kind of very hard-to-see dependency between this GitHub repo and an, essentially, marketing domain.


To experiment with this, it should be enough to replace the various .invalid host names, consistently in all occurrences, in integration/copy_test.go and integration/fixtures/registries.conf.


Actually, per https://www.rfc-editor.org/rfc/rfc6761.html#section-6.4 , .invalid is explicitly supposed to always result in NXDOMAIN. So something is fairly badly broken.

@cevich
Member Author

cevich commented Feb 1, 2022

I guess we’ll see in a few days.

We run builds on the branches every day and continuously (after every merge), I believe the report e-mail is sent at 8pm EST. Regardless, the post-merge results can be observed here: https://cirrus-ci.com/github/containers/skopeo/main (hint: they're all green).

So yeah, it's most def. "something" in the environment. Perhaps the tests were unintentionally depending on a bug that got fixed?

Actually, per https://www.rfc-editor.org/rfc/rfc6761.html#section-6.4 , .invalid is explicitly supposed to always result in NXDOMAIN. So something is fairly badly broken.

Oh interesting, I wouldn't have guessed that was actually defined. So maybe there was a bug-fix in the glibc resolver (assuming that affects golang)?

@mtrmac
Contributor

mtrmac commented Feb 1, 2022

Let me ask some stupid questions to make sure I understand the situation:

Per the commit history, we’ve had a successful test run 11 hours ago. With this PR, and the updated VM image (which does affect this PR, I assume), the test is failing.

So the cause is either something that has changed in the environment during the last 11 hours, or a change in the contents of the VM image, isn’t it?


If that is correct, well, let’s see which of the two it is: #1559 .

@mtrmac
Contributor

mtrmac commented Feb 1, 2022

#1559 did pass just now. So is it the VM images?

@cevich
Member Author

cevich commented Feb 1, 2022

So is it the VM images?

Or a flake...but assuming not, yes it's probably due to a change in the VM images. It may even be a new bug rather than a bug fix. It's just almost always really hard to tell by looking at the package change history. Is there some simple command I can run inside a hack/get_ci_vm.sh that would help narrow things down?

@cevich cevich force-pushed the new_python_images branch from 11f2c7e to 7ec98de Compare February 1, 2022 21:06
@cevich
Member Author

cevich commented Feb 1, 2022

Interesting...so with the systemd resolver in place, a dig of gcr.invalid gives me an empty response (as opposed to a failure). IIRC, I disabled the systemd resolver (somewhere) because it was causing problems with tests. Let me see if I can find that.

@cevich
Member Author

cevich commented Feb 2, 2022

Damn, can't find it or I'm remembering something unrelated.

@cevich cevich force-pushed the new_python_images branch 2 times, most recently from be452ec to 92de540 Compare February 7, 2022 17:26
@cevich
Member Author

cevich commented Feb 8, 2022

@mtrmac these images are more/less passing in every other repo.'s CI, so I'm thinking this is a problem with the tests. I'm going to get a pair of VMs up, one old, one new. Is there anything specific you'd like me to check/compare that would help find the problem?

@cevich
Member Author

cevich commented Feb 8, 2022

Weird. New VM:

[root@cevich-fedora-c5533341981081600 skopeo]# dig gcr.invalid

; <<>> DiG 9.16.24-RH <<>> gcr.invalid
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NXDOMAIN, id: 11896
;; flags: qr rd ra; QUERY: 1, ANSWER: 0, AUTHORITY: 1, ADDITIONAL: 1

;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 512
;; QUESTION SECTION:
;gcr.invalid.                   IN      A

;; AUTHORITY SECTION:
.                       86400   IN      SOA     a.root-servers.net. nstld.verisign-grs.com. 2022020800 1800 900 604800 86400

;; Query time: 27 msec
;; SERVER: 169.254.169.254#53(169.254.169.254)
;; WHEN: Tue Feb 08 10:26:38 CST 2022
;; MSG SIZE  rcvd: 115

[root@cevich-fedora-c5533341981081600 skopeo]# cat /etc/resolv.conf
# Generated by NetworkManager
search c.skopeo.internal google.internal
nameserver 169.254.169.254

Old VM:

[root@cevich-fedora-c6226133906620416 skopeo]# dig gcr.invalid

; <<>> DiG 9.16.24-RH <<>> gcr.invalid
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NXDOMAIN, id: 9410
;; flags: qr rd ra; QUERY: 1, ANSWER: 0, AUTHORITY: 1, ADDITIONAL: 1

;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 512
;; QUESTION SECTION:
;gcr.invalid.                   IN      A

;; AUTHORITY SECTION:
.                       86400   IN      SOA     a.root-servers.net. nstld.verisign-grs.com. 2022020800 1800 900 604800 86400

;; Query time: 294 msec
;; SERVER: 169.254.169.254#53(169.254.169.254)
;; WHEN: Tue Feb 08 10:34:40 CST 2022
;; MSG SIZE  rcvd: 115

[root@cevich-fedora-c6226133906620416 skopeo]# cat /etc/resolv.conf
# Generated by NetworkManager
search c.skopeo.internal google.internal
nameserver 169.254.169.254

Laptop:

[cevich@localhost ~]$ dig gcr.invalid

; <<>> DiG 9.16.24-RH <<>> gcr.invalid
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: SERVFAIL, id: 32132
;; flags: qr aa rd ra; QUERY: 1, ANSWER: 0, AUTHORITY: 0, ADDITIONAL: 1

;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 65494
;; QUESTION SECTION:
;gcr.invalid.			IN	A

;; Query time: 0 msec
;; SERVER: 127.0.0.53#53(127.0.0.53)
;; WHEN: Tue Feb 08 11:36:25 EST 2022
;; MSG SIZE  rcvd: 40
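
To compare the three captures at a glance, the response status and the answering server can be pulled out of each dig output. This is only a minimal sketch: `digSummary` and its two regular expressions are hypothetical helpers written for this thread, and they assume the header and SERVER line formats shown above.

```go
package main

import (
	"fmt"
	"regexp"
)

// Hypothetical patterns for the two dig lines of interest.
var (
	statusRe = regexp.MustCompile(`status: ([A-Z]+)`)
	serverRe = regexp.MustCompile(`SERVER: ([^#\s]+)#`)
)

// digSummary returns the response status (e.g. NXDOMAIN) and the address of
// the server that answered; a missing field is returned as "".
func digSummary(output string) (status, server string) {
	if m := statusRe.FindStringSubmatch(output); m != nil {
		status = m[1]
	}
	if m := serverRe.FindStringSubmatch(output); m != nil {
		server = m[1]
	}
	return status, server
}

func main() {
	// Lines copied from the laptop capture above.
	laptop := ";; ->>HEADER<<- opcode: QUERY, status: SERVFAIL, id: 32132\n" +
		";; SERVER: 127.0.0.53#53(127.0.0.53)\n"
	status, server := digSummary(laptop)
	fmt.Printf("status=%s server=%s\n", status, server)
}
```

On the captures above, both VMs report NXDOMAIN from 169.254.169.254, while the laptop reports SERVFAIL from the 127.0.0.53 stub address.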

@cevich
Member Author

cevich commented Feb 8, 2022

Running the command from the test, new VM:

[root@cevich-fedora-c5533341981081600 integration]# ../bin/skopeo --registries-conf=./fixtures/registries.conf copy docker://gcr.invalid/wrong/prefix/busybox dir:/tmp/copy-mirror401442990
FATA[0000] initializing source docker://gcr.invalid/wrong/prefix/busybox:latest: pinging container registry gcr.invalid: Get "https://gcr.invalid/v2/": dial tcp: lookup gcr.invalid: Temporary failure in name resolution

Old VM:

[root@cevich-fedora-c6226133906620416 integration]# ../bin/skopeo --registries-conf=./fixtures/registries.conf copy docker://gcr.invalid/wrong/prefix/busybox dir:/tmp/copy-mirror401442990
FATA[0000] initializing source docker://gcr.invalid/wrong/prefix/busybox:latest: pinging container registry gcr.invalid: Get "https://gcr.invalid/v2/": dial tcp: lookup gcr.invalid: no such host

@cevich
Member Author

cevich commented Feb 8, 2022

In case it matters, there is a slight difference in golang as well, New VM has golang-1.16.13-1.fc35.x86_64, old one uses golang-1.16.8-2.fc35.x86_64

@mtrmac
Contributor

mtrmac commented Feb 8, 2022

@mtrmac these images are more/less passing in every other repo.'s CI, so I'm thinking this is a problem with the tests. I'm going to get a pair of VMs up, one old, one new. Is there anything specific you'd like me to check/compare that would help find the problem?

I don’t really know:

  • dig, per the above, shows raw DNS responses.
  • getent hosts $host shows the glibc resolver’s view (but that doesn’t report errors on a detailed enough level)
  • A Go program, compiled the same way Skopeo is, would exercise the relevant code path — something like https://go.dev/play/p/ufwm6LMiZqw .
  • Ultimately I guess there’s strace or Wireshark.
  • Compare /etc/resolv.conf and /etc/nsswitch.conf (is sssd, nss-myhostname involved in one of them only?).

Also, figure out what resolver Skopeo is contacting (systemd-resolved? The new Podman DNS server?). Is there a difference in implementations/versions/configurations of that resolver in the old/new case?

And just to be sure, is the environment representative? AFAICS reading .cirrus.yml we are running these tests directly on the VM, so outputs directly from the VM should be relevant.


At some point, just relaxing the regex might be the most practical way to proceed. The tests check that a name is redirected when there is a match / that it isn’t redirected when there isn’t a match, and testing “the name was not redirected” is conceptually equally good when accepting two error strings as when accepting just one, as long as in “the name was redirected” case, the same error doesn’t happen.
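
A minimal sketch of what such a relaxed pattern might look like, assuming the two error strings seen in this thread are the only ones to accept. The name `lookupFailure` is made up for illustration, and the third sample string is an invented non-lookup error included only to show that it is rejected.

```go
package main

import (
	"fmt"
	"regexp"
)

// lookupFailure is a hypothetical relaxed pattern. It accepts either
// "no such host" or "Temporary failure in name resolution".
// (?s) lets '.' match newlines, as in the existing assertion helper.
var lookupFailure = regexp.MustCompile(`(?s).*(no such host|Temporary failure in name resolution).*`)

func main() {
	outputs := []string{
		"dial tcp: lookup gcr.invalid: no such host\n",
		"dial tcp: lookup gcr.invalid: Temporary failure in name resolution\n",
		"unauthorized: authentication required\n", // invented, must not match
	}
	for _, out := range outputs {
		fmt.Printf("%v %q\n", lookupFailure.MatchString(out), out)
	}
}
```

Keeping the alternation explicit, rather than matching any failure, preserves the property that the accepted errors are unlikely to appear when the redirect succeeds.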

@cevich
Member Author

cevich commented Feb 8, 2022

figure out what resolver Skopeo is contacting (systemd-resolved? The new Podman DNS server?)

Ahh ha! I think this might be it. I just discovered both netavark and aardvark-dns are (unexpectedly) installed. I think this might be causing another problem in CI elsewhere. I'm going to re-build with them explicitly excluded.

At some point, just relaxing the regex might be the most practical way to proceed.

So if I understand correctly, both roads (errors) end up at the same (desired) destination, but the tests are throwing a false-negative due to precise message-sensitivity?

@mtrmac
Contributor

mtrmac commented Feb 8, 2022

So if I understand correctly, both roads (errors) end up at the same (desired) destination, but the tests are throwing a false-negative due to precise message-sensitivity?

Yes, IIRC. We do check for a precise error, but we don’t quite care about the exact way it fails. A host pointing to an existing registry which refused to authenticate the client could also be used to test the mirror redirection, for example — we would be looking for the authentication failure. (A more precise check is a bit safer, primarily it’s good if the failure is very unlikely to happen on a successful redirect.)

@cevich cevich force-pushed the new_python_images branch from 92de540 to ddcb457 Compare February 8, 2022 19:22
@cevich
Member Author

cevich commented Feb 8, 2022

Great, so I would suggest loosening the check for that, since eventually we will be using netavark/aardvark-dns everywhere. In any case, I've just force-pushed with some images that should have those packages excluded, let's see if it makes a difference...

@cevich
Member Author

cevich commented Feb 8, 2022

...damn (log). That wasn't it 😢

@cevich
Member Author

cevich commented Feb 8, 2022

A Go program, compiled the same way Skopeo is, would exercise the relevant code path — something like https://go.dev/play/p/ufwm6LMiZqw .

I s'pose I'll give this a try next.

@mtrmac
Contributor

mtrmac commented Feb 8, 2022

Per Fedora policy, just installing a package shouldn’t even start a network server, let alone change the default resolver…

@cevich
Member Author

cevich commented Feb 8, 2022

True, though it was still an error that those packages were there to begin with (they wreaked havoc in podman CI). When I was in the VM, I checked and both old/new images do have systemd-resolved running, however /etc/resolv.conf doesn't appear to be referring to it (127.0.0.53). I didn't look at nsswitch.conf though, so that could be a pathway to systemd-resolved maybe?

@mtrmac
Contributor

mtrmac commented Feb 8, 2022

That’s possible (https://www.freedesktop.org/software/systemd/man/nss-resolve.html ), and https://fedoraproject.org/wiki/Changes/systemd-resolved suggests the plan was, at least at one time, to use it that way.

@cevich
Member Author

cevich commented Feb 10, 2022

Bingo...nsswitch.conf contains:

hosts: files myhostname resolve [!UNAVAIL=return] dns

So it's probably not even looking at resolv.conf, ya?

@cevich
Member Author

cevich commented Feb 10, 2022

(new VM)

[root@cevich-fedora-c4764556961513472 skopeo]# bin/skopeo --registries-conf=integration/fixtures/registries.conf copy docker://gcr.invalid/wrong/prefix/busybox dir:/tmp/copy-mirror401442990
FATA[0000] initializing source docker://gcr.invalid/wrong/prefix/busybox:latest: pinging container registry gcr.invalid: Get "https://gcr.invalid/v2/": dial tcp: lookup gcr.invalid: Temporary failure in name resolution
[root@cevich-fedora-c4764556961513472 skopeo]# vi /etc/nsswitch.conf  # removed 'resolve [!UNAVAIL=return]'
[root@cevich-fedora-c4764556961513472 skopeo]# bin/skopeo --registries-conf=integration/fixtures/registries.conf copy docker://gcr.invalid/wrong/prefix/busybox dir:/tmp/copy-mirror401442990
FATA[0000] initializing source docker://gcr.invalid/wrong/prefix/busybox:latest: pinging container registry gcr.invalid: Get "https://gcr.invalid/v2/": dial tcp: lookup gcr.invalid on 169.254.169.254:53: no such host
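
The manual vi edit above amounts to dropping the resolve source, plus its bracketed action, from the hosts line. A minimal sketch of that transformation follows. `dropResolve` is a hypothetical helper, not the setup command that was eventually added to the CI configuration (which is not shown in this thread).

```go
package main

import (
	"fmt"
	"strings"
)

// dropResolve removes the "resolve" source, and any bracketed action list that
// immediately follows it, from an nsswitch.conf "hosts:" line.
func dropResolve(line string) string {
	fields := strings.Fields(line)
	kept := make([]string, 0, len(fields))
	for i := 0; i < len(fields); i++ {
		if fields[i] != "resolve" {
			kept = append(kept, fields[i])
			continue
		}
		// Skip an action list such as "[!UNAVAIL=return]", which may span
		// several whitespace-separated fields before the closing bracket.
		if i+1 < len(fields) && strings.HasPrefix(fields[i+1], "[") {
			i++
			for i < len(fields)-1 && !strings.HasSuffix(fields[i], "]") {
				i++
			}
		}
	}
	return strings.Join(kept, " ")
}

func main() {
	// The hosts line reported from the new VM image.
	fmt.Println(dropResolve("hosts: files myhostname resolve [!UNAVAIL=return] dns"))
	// prints "hosts: files myhostname dns"
}
```

With resolve gone, lookups fall through to the dns source, which matches the "lookup gcr.invalid on 169.254.169.254:53: no such host" message in the second run above.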

@cevich cevich changed the title [WIP] Cirrus: Use updated VM images Cirrus: Use updated VM images Feb 10, 2022
@cevich
Member Author

cevich commented Feb 10, 2022

@mtrmac okay, that fixed it! I've opened an issue re: desensitizing the tests, otherwise this PR is good-to-go. Thanks for your help 😃

Mainly this is to confirm some changes needed for the podman-py CI
setup don't disrupt operations here. Ref:

containers/automation_images#111

Note: Glibc resolver configuration has changed from previous images.  An
additional setup command was added to remove systemd-resolved from the
chain.

Signed-off-by: Chris Evich <cevich@redhat.com>
@cevich cevich marked this pull request as ready for review February 10, 2022 18:38
@rhatdan
Member

rhatdan commented Feb 10, 2022

LGTM

@rhatdan rhatdan merged commit 59a4522 into containers:main Feb 10, 2022
@mtrmac
Contributor

mtrmac commented Feb 10, 2022

@cevich Thanks. If this is reproducible on “standard” installations, could you file a systemd bug as well?

mtrmac added a commit to mtrmac/skopeo that referenced this pull request Feb 10, 2022
... per containers#1558 .

Signed-off-by: Miloslav Trmač <mitr@redhat.com>
@mtrmac mtrmac mentioned this pull request Feb 10, 2022
@cevich
Member Author

cevich commented Feb 11, 2022

If this is reproducible on “standard” installations, could you file a systemd bug as well?

I'm happy to do the work, however I have not tried to reproduce it under the stock setup, nor do I have a good enough grasp on what's breaking down precisely. I would think the systemd maintainers would need a non-skopeo reproducer and/or low-level details regarding what/if systemd-resolved is doing wrong. I really don't feel knowledgeable enough to intelligently discuss the fine points with them in a bug 😕

mtrmac added a commit to mtrmac/skopeo that referenced this pull request Feb 15, 2022
... per containers#1558 .

Signed-off-by: Miloslav Trmač <mitr@redhat.com>
@github-actions github-actions bot locked as resolved and limited conversation to collaborators Sep 20, 2023