Cirrus: Run checks directly on the host #1334
Conversation
Ahh right, most of the tests are skipped when not running in a container; there was something I needed to tweak to get around that...
Force-pushed from 8dfc215 to 910bd20
Wait...what?!?!?! It passed?????? @mtrmac PTAL (it's still a prototype).
Oh! Wait, I made a typo, and we probably want developers to be able to build a container for testing in...
Force-pushed from 0b2848e to e2d1464
The integration tests fail on OpenShift startup… Assuming that’s not transient (I have now triggered a re-run), debugging that doesn’t look like much fun.
Another sort-of concern is that while running a custom container inside a custom VM is rather wasteful, and running CI directly on a VM is certainly nice, for local development it is convenient that `make test-integration` (and `make test-system`?) can be run in a container, so it is useful for CI to enforce that those two continue to work inside a container.
So, would running vendor+build+validate+unit directly on the VM, and integration+system in a container, be (even possible and) worth the effort?
Yes, we need to do both; the challenge is keeping the environments and experience as similar as possible.
There's no reason we can't/shouldn't do both in parallel (on separate VMs).
Force-pushed from 1953bfa to a792502
The OpenShift thingie seems to be failing to start due to SSL/TLS crap:
But I also see a fair number of what look like errors binding to port 8443 of the local IP... I wonder if maybe something in the GCP networking setup is (somehow) blocking this? I would think the system/kernel is smart enough to recognize that as the local address and not actually send anything out.
Yeah. I don’t know what that’s about :/ It’s quite possible that a networking difference changes the detected primary IP address inconsistently in some parts of the code vs. others, or something like that.
Binding or connecting? Connection retries are expected; that’s just line 117 in 71e7a58
Binding to …
Yes — OTOH OpenShift might be trying to be “useful” on the local network; we just run an …
Oh, good thought. I can check this manually by comparing the logs from a job run from master.
Interesting, I didn't notice that.
Hmmm, so the VMs definitely have TWO IP addresses. One is internal-network only, the other is external. IIRC, the logs mentioned the 10.x.x.x (internal) network. That should work, but maybe not. We're using a very vanilla/default GCP networking setup; it's possible the default is to block inter-VM communications. In any case, I think this is going to require some hands-on work.
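For that hands-on check, a rough Go sketch (purely hypothetical, not part of the test suite) that lists the VM's non-loopback IPv4 addresses and probes port 8443 on each could show whether the internal 10.x.x.x address or only the external one refuses connections:

```go
// probe8443.go — hypothetical diagnostic, not part of the test suite.
// Lists the host's non-loopback IPv4 addresses and tries a TCP connection
// to port 8443 on each, to see which address actually accepts connections.
package main

import (
	"fmt"
	"net"
	"time"
)

func main() {
	addrs, err := net.InterfaceAddrs()
	if err != nil {
		fmt.Println("listing interface addresses:", err)
		return
	}
	for _, a := range addrs {
		ipNet, ok := a.(*net.IPNet)
		if !ok || ipNet.IP.IsLoopback() || ipNet.IP.To4() == nil {
			continue
		}
		target := net.JoinHostPort(ipNet.IP.String(), "8443")
		conn, err := net.DialTimeout("tcp", target, 2*time.Second)
		if err != nil {
			fmt.Printf("%s: connect failed: %v\n", target, err)
			continue
		}
		conn.Close()
		fmt.Printf("%s: connection accepted\n", target)
	}
}
```

Running something like that on the VM while OpenShift is (supposedly) listening would at least tell us which address is the one refusing connections.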
@mtrmac QQ: Is the purpose of this to verify skopeo can talk to the OpenShift registry? Is there a (testing) reason why skopeo needs to remain compatible with an unsupported OpenShift? The reason I ask is... might it be easier to get a newer/supported registry to test against rather than fight/fix the old one?
Yes,
Somewhat. We need the registry for testing two things:
In an ideal world, which I don’t expect to happen any time soon, we would teach github.com/distribution/distribution the X-R-S-S API, and test against that — a simple registry server is much smaller and easier to run. But nobody has worked on such a server, to my knowledge; and even in that case we would probably want to ensure interoperability with the OpenShift implementation of X-R-S-S. (… and in a truly ideal world, we would have testing coverage of most of the major registry products, but that’s another problem space entirely.)
Okay, thanks for the insights. Yeah, mocking it would be a major undertaking and additional maintenance; much better to use the real deal. Too bad it's such a behemoth. There are also now registries on both GitHub and GitLab, in addition to Quay and Google Cloud. I assume testing those wouldn't add much value, but please correct me if I'm wrong.
Okay, I set up a POC container build over in the c/automation_images repo. Next I will work to incorporate its use into both CI and the Makefile.
Force-pushed from 75729a2 to 363e1fa
@mtrmac I've taught … Meanwhile, let me see about fixing some of these docker-rate-limit problems; they're starting to drive me crazy 😠
We're running as root, so I doubt that simple case is the problem, but could it be an SELinux issue maybe? I'm hitting all kinds of problems in podman land with these same VM images, due to policy updates: containers/podman#10522 Specifically, any volume mounts with …

I'm having a really (REALLY) hard time parsing all the openshift/registry output mixed in with the test output. Is there any way we could send the openshift stuff into a file instead of printing it all? It's really easy for me to tell Cirrus to attach a file after the task runs (pass or fail).
confirmed: …
Close enough of a guess, it seems.
That’s not quite what I see. The home directory, per …

See #1403, which includes various debugging printouts, and where the integration tests pass when run with …
I’m probably thinking about this incorrectly — I think those log streams have identifiable prefixes, so it seems possible to filter them out after-the-fact, using tools if necessary.

So far it seemed convenient enough to me to have everything in the same log stream, usually in the natural order (“starting registry” … registry start-up logs … “waiting for registry start-up finished”); separating some of those logs out into external files would almost every time require looking up one or more of those files (presumably the main log would contain paths?), which seems to be more work to me.

Is it the visual noise on each individual line? Too many useless log lines? The fact that everything is mixed in a single stream, regardless of what the individual lines look like? Something entirely different? Some combination of factors?

Admittedly: …
That's fine too; it's just hard on my eyes. But if it makes sense to you and you can make sense of it, I'm fine just being the lever-puller.
From the perspective of someone who doesn't see it much / ever, it's hard to visually distinguish between test harness output and subject (skopeo) output and tooling (openshift, registry, etc.) output.
The horrendous number of redundant messages from openshift/etcd ("blah blah tls cert invalid blah blah") is also really confusing. But yes, this can be a problem for another day and PR. I too am concerned, as I don't remember seeing them before; maybe they were simply not shown before (somehow)?
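On the filter-by-prefix idea above: if those streams really do have identifiable prefixes, a trivial after-the-fact filter would already cut the noise. A rough sketch (the `openshift:` / `etcd:` prefixes here are made up; the real ones in the CI log may differ):

```go
// logfilter.go — rough sketch of after-the-fact filtering by line prefix.
package main

import (
	"bufio"
	"fmt"
	"os"
	"strings"
)

func main() {
	// Hypothetical noisy prefixes; the real prefixes in the CI log may differ.
	drop := []string{"openshift:", "etcd:"}

	scanner := bufio.NewScanner(os.Stdin)
	scanner.Buffer(make([]byte, 0, 64*1024), 1024*1024) // tolerate long log lines
	for scanner.Scan() {
		line := scanner.Text()
		noisy := false
		for _, p := range drop {
			if strings.HasPrefix(line, p) {
				noisy = true
				break
			}
		}
		if !noisy {
			fmt.Println(line)
		}
	}
	if err := scanner.Err(); err != nil {
		fmt.Fprintln(os.Stderr, "reading input:", err)
	}
}
```

Usage would be something like `go run logfilter.go < task.log`, piping in the captured task output.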
Oh wait, this just struck me. When the tests run, it's not a login shell, but in a …
I think there are basically two cases of the asynchronous output: … Some other synchronous commands log output. Skopeo itself typically doesn’t log output, unless the test fails and the output is included in an assertion message. Probably none of that is 100% consistent, as the code grew over time.

I can’t really remember any more, and of course past problems are not that much of an indication of future problems, but it seems reasonable that those noisy …

I don’t feel strongly about the output format… I guess PRs are welcome?
My first guess is that this has something to do with recent Go only looking for host names in subjectAltName instead of CN. But that’s really just a guess with no supporting evidence at all.
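For context on that guess: since Go 1.15, hostname verification ignores the certificate's CommonName and only consults subjectAltName entries, so a test certificate generated without SANs will be rejected. A minimal sketch of generating a self-signed certificate that recent Go would accept (the hostname and IP are placeholders, not what the CI setup actually uses):

```go
// gencert.go — minimal sketch of generating a self-signed certificate that
// carries subjectAltName entries, which is what recent Go checks during
// hostname verification (CommonName alone is no longer enough).
package main

import (
	"crypto/ecdsa"
	"crypto/elliptic"
	"crypto/rand"
	"crypto/x509"
	"crypto/x509/pkix"
	"encoding/pem"
	"math/big"
	"net"
	"os"
	"time"
)

func main() {
	key, err := ecdsa.GenerateKey(elliptic.P256(), rand.Reader)
	if err != nil {
		panic(err)
	}
	template := x509.Certificate{
		SerialNumber: big.NewInt(1),
		// CommonName is ignored for hostname verification in recent Go.
		Subject:     pkix.Name{CommonName: "registry.example.test"},
		NotBefore:   time.Now(),
		NotAfter:    time.Now().Add(24 * time.Hour),
		KeyUsage:    x509.KeyUsageDigitalSignature | x509.KeyUsageCertSign,
		ExtKeyUsage: []x509.ExtKeyUsage{x509.ExtKeyUsageServerAuth},
		BasicConstraintsValid: true,
		IsCA:                  true,
		// These subjectAltName entries are what actually gets checked;
		// the hostname and IP are placeholders, not the CI setup's values.
		DNSNames:    []string{"registry.example.test"},
		IPAddresses: []net.IP{net.ParseIP("127.0.0.1")},
	}
	der, err := x509.CreateCertificate(rand.Reader, &template, &template, &key.PublicKey, key)
	if err != nil {
		panic(err)
	}
	pem.Encode(os.Stdout, &pem.Block{Type: "CERTIFICATE", Bytes: der})
}
```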
Huh... well, it seems (somehow) the tests passed. I didn't notice anything funny in the output, but I'm a bit skeptical. Let me push this small …
Force-pushed from f68c889 to dec382f
Force-pushed from 69552c5 to 53a3d99
Well, okay then, it passed again. Removing the draft status; let's get on with the final review before something breaks 😁
Thanks for all the work!
I’m really only worried about the `/etc/os-release` check question; the rest are non-blocking nits.
@mtrmac thanks for all your help on this too; couldn't have done it w/o you. What a doozy of a PR! 😅
LGTM. Thanks again!
FWIW https://cirrus-ci.com/build/6216764259303424 lists some “Issues found while parsing configuration” — most look like stylistic preferences rather than something that needs fixing. I have no idea about the severity (or it might have been a well-considered long-standing decision); I just happened to randomly notice that field (well, the whole page, actually) for the first time today.
Ugh, I know it does that everywhere now. For some reason they really don't like how we use …
In order to meet achievable deadlines converting from Travis to Cirrus CI, one significant artifact was carried forward (instead of fixing): depending on a `--privileged` container to execute all/most automated checks/tests.

Prior attempts to remove this aspect resulted in several test failures. Fixing the problems was viewed as more time-consuming than simply preserving this runtime environment.

Time has passed, and the code has since moved on. This commit removes the legacy need to execute CI operations in a `--privileged` container, instead running them directly on the host. At the same time, the necessary test binaries are obtained from the same container used for development/local testing purposes. This ensures the two experiences are virtually always identical.

Signed-off-by: Chris Evich <cevich@redhat.com>