
release-blocking jobs must run in dedicated cluster: ci-kubernetes-build #19483


Closed
spiffxp opened this issue Oct 7, 2020 · 24 comments
Assignees
Labels
area/jobs kind/cleanup Categorizes issue or PR as related to cleaning up code, process, or technical debt. priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release. sig/release Categorizes an issue or PR as relevant to SIG Release. sig/testing Categorizes an issue or PR as relevant to SIG Testing.
Milestone

Comments

@spiffxp
Member

spiffxp commented Oct 7, 2020

What should be cleaned up or changed:

This is part of #18549

To properly monitor the outcome of this, you should be a member of k8s-infra-prow-viewers@kubernetes.io. PR yourself into https://github.com/kubernetes/k8s.io/blob/master/groups/groups.yaml#L603-L628 if you're not a member.

NOTE: I am not tagging this as "help wanted" because it is blocked on kubernetes/k8s.io#846. I would also recommend doing ci-kubernetes-build-fast first. Here is my guess at how we could do this:

  • create a duplicate job that pushes to the new bucket writable by k8s-infra-prow-build
  • ensure it's building and pushing appropriately
  • update a release-blocking job to pull from the new bucket
  • if no problems, roll out changes progressively
    • a few more jobs in release-blocking
    • all jobs in release-blocking that use this job's results
    • a job that still runs in the "default" cluster
    • all jobs that use this job's results
  • rename jobs / get rid of the job that runs on the "default" cluster
  • do the same for release-branch variants, can probably do a faster rollout
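The first bullet (a duplicate job targeting the community-owned cluster) is mostly a Prow config change. A minimal sketch of what the duplicate might look like; the field values and args below are assumptions, not the real job definition:

```yaml
# Hypothetical duplicate of ci-kubernetes-build; names and args are assumptions.
periodics:
- name: ci-kubernetes-build-canary
  cluster: k8s-infra-prow-build   # community-owned cluster, not the google.com default
  interval: 1h
  decorate: true
  spec:
    containers:
    - image: gcr.io/k8s-testimages/bootstrap:latest
      args:
      - --scenario=kubernetes_build
      # ...plus push-destination flags pointed at the new bucket/registry
```

The key line is `cluster:`, which tells Prow to schedule the job on the community-owned build cluster.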

It will be helpful to note the date/time that those PRs merge. This will allow you to compare before/after behavior.

Things to watch for the job

Things to watch for the build cluster

Keep this open for at least 24h of weekday PR traffic. If everything continues to look good, then this can be closed.

/wg k8s-infra
/sig testing
/area jobs

@spiffxp spiffxp added the kind/cleanup Categorizes issue or PR as related to cleaning up code, process, or technical debt. label Oct 7, 2020
@k8s-ci-robot k8s-ci-robot added wg/k8s-infra sig/testing Categorizes an issue or PR as relevant to SIG Testing. area/jobs labels Oct 7, 2020
@spiffxp
Member Author

spiffxp commented Oct 7, 2020

/sig release
FYI @kubernetes/release-engineering

@justaugustus
Member

Arnaud has PRs in flight for this.
/assign @ameukam

and Carlos has been working on the build-fast stuff.
/assign @cpanato

@spiffxp
Member Author

spiffxp commented Nov 5, 2020

Current obstacle is deciding how to allow the ci-kubernetes-build-canary job to push to gcr.io/k8s-staging-ci-images.

Currently:

  • ci-kubernetes-build runs on default build cluster aka google.com-owned k8s-prow-builds
  • all jobs in this cluster use the default pr-kubekins@kubernetes-jenkins gcp service account
  • pr-kubekins has role roles/editor on the kubernetes-ci-images project
  • gs://artifacts.kubernetes-ci-images.appspot.com/ (the bucket that corresponds to gcr.io/kubernetes-ci-images) gives roles/storage.legacyBucketOwner to project editors

What this means:

  • any job that runs in default can push to kubernetes-ci-images, including presubmits for unrelated repos

What could we do for pushing to gcr.io/k8s-staging-ci-images?

  • follow the same approach, allow the default prow-build@k8s-infra-prow-build service account on k8s-infra-prow-build to push to gcr.io/k8s-staging-ci-images
    • do this via adding the service account e-mail address to the k8s-infra-staging-ci-images@kubernetes.io group?
    • do this via direct iam bindings on the projects or buckets in question?
  • create a service account that is granted push access to gcr.io/k8s-staging-ci-images, restrict its use to well known jobs via static tests
    • could do this for each staging project, or on demand
    • could hardcode job names in tests, or try enforcing some kind of path-based convention
  • suggest that any project that needs to push images to staging repos should run on the "trusted" cluster
    • this would mean no presubmits could push images to staging repos
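The static-test option above could be a small check over the job configs. A sketch under assumptions: the service-account name and job list here are hypothetical, and a real test would parse the Prow YAML rather than in-memory dicts.

```python
# Hypothetical static check: only allowlisted jobs may use the image-push
# service account. A real version would load and parse the Prow job configs.
PUSH_SA = "ci-images-pusher@k8s-infra-prow-build.iam.gserviceaccount.com"  # hypothetical
ALLOWED_PUSH_JOBS = {"ci-kubernetes-build-canary"}

def push_violations(jobs):
    """Return jobs that use the push service account without being allowlisted."""
    return sorted(
        job["name"]
        for job in jobs
        if job.get("service_account") == PUSH_SA
        and job["name"] not in ALLOWED_PUSH_JOBS
    )

jobs = [
    {"name": "ci-kubernetes-build-canary", "service_account": PUSH_SA},
    {"name": "pull-unrelated-presubmit", "service_account": PUSH_SA},
]
print(push_violations(jobs))  # → ['pull-unrelated-presubmit']
```

This would fail presubmit if, say, an unrelated presubmit started referencing the push account.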

@ameukam
Member

ameukam commented Nov 5, 2020

create a service account that is granted push access to gcr.io/k8s-staging-ci-images, restrict its use to well known jobs via static tests

+1 for doing this for each staging project.
Does this involve a k8s service account for each staging project?

@ameukam
Member

ameukam commented Nov 6, 2020

do this via adding the service account e-mail address to the k8s-infra-staging-ci-images@kubernetes.io group?

ci-kubernetes-build-canary still fails even after the service account was added to k8s-infra-staging-ci-images@kubernetes.io (see kubernetes/k8s.io#1393): https://testgrid.k8s.io/sig-testing-canaries#build-master-canary

The prow-build service account inherits the permissions of the role roles/cloudbuild.builds.editor as a member of k8s-infra-staging-ci-images@kubernetes.io:

https://github.com/kubernetes/k8s.io/blob/74bfdc5741bdde3b8f489bdd8327474101b3b5e4/infra/gcp/lib.sh#L209-L231

which is not enough to make the job successful.

@spiffxp
Member Author

spiffxp commented Nov 6, 2020

My read of the GCS permissions is that members of the group have write and admin privileges on the bucket:

$ gsutil iam get gs://artifacts.k8s-staging-ci-images.appspot.com/
{
  "bindings": [
    {
      "members": [
        "group:k8s-infra-artifact-admins@kubernetes.io",
        "projectEditor:k8s-staging-ci-images",
        "projectOwner:k8s-staging-ci-images"
      ],
      "role": "roles/storage.legacyBucketOwner"
    },
    {
      "members": [
        "projectViewer:k8s-staging-ci-images"
      ],
      "role": "roles/storage.legacyBucketReader"
    },
    {
      "members": [
        "group:k8s-infra-staging-ci-images@kubernetes.io" # here
      ],
      "role": "roles/storage.legacyBucketWriter"
    },
    {
      "members": [
        "group:k8s-infra-artifact-admins@kubernetes.io",
        "group:k8s-infra-staging-ci-images@kubernetes.io" # here
      ],
      "role": "roles/storage.objectAdmin"
    },
    {
      "members": [
        "allUsers"
      ],
      "role": "roles/storage.objectViewer"
    }
  ],
  "etag": "CAg="
}
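The bindings above can be checked mechanically rather than by eye. A small helper, assuming you feed it the JSON that `gsutil iam get` returns, that lists the roles bound to a given member:

```python
import json

# Abbreviated copy of the policy above; in practice pipe in `gsutil iam get` output.
policy = json.loads("""
{
  "bindings": [
    {"members": ["group:k8s-infra-artifact-admins@kubernetes.io",
                 "projectEditor:k8s-staging-ci-images",
                 "projectOwner:k8s-staging-ci-images"],
     "role": "roles/storage.legacyBucketOwner"},
    {"members": ["group:k8s-infra-staging-ci-images@kubernetes.io"],
     "role": "roles/storage.legacyBucketWriter"},
    {"members": ["group:k8s-infra-artifact-admins@kubernetes.io",
                 "group:k8s-infra-staging-ci-images@kubernetes.io"],
     "role": "roles/storage.objectAdmin"},
    {"members": ["allUsers"], "role": "roles/storage.objectViewer"}
  ]
}
""")

def roles_for(policy, member):
    """Roles directly bound to `member` in a GCS bucket IAM policy."""
    return sorted(b["role"] for b in policy["bindings"] if member in b["members"])

print(roles_for(policy, "group:k8s-infra-staging-ci-images@kubernetes.io"))
# → ['roles/storage.legacyBucketWriter', 'roles/storage.objectAdmin']
```

Note this only sees direct bindings; it can't tell you whether a service account is actually being resolved as a member of one of these groups.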

@spiffxp
Member Author

spiffxp commented Nov 6, 2020

Per https://cloud.google.com/iam/docs/understanding-roles

roles/storage.legacyBucketOwner

storage.buckets.get
storage.buckets.getIamPolicy
storage.buckets.setIamPolicy
storage.buckets.update
storage.objects.create
storage.objects.delete
storage.objects.list

roles/storage.legacyBucketWriter

storage.buckets.get
storage.objects.create
storage.objects.delete
storage.objects.list

roles/storage.objectAdmin

resourcemanager.projects.get
resourcemanager.projects.list
storage.objects.*

roles/storage.admin

firebase.projects.get
resourcemanager.projects.get
resourcemanager.projects.list
storage.buckets.*
storage.objects.*
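Checking those permission lists against the failing permission makes the comparison concrete; wildcards like `storage.objects.*` have to be expanded when testing coverage. A small sketch:

```python
from fnmatch import fnmatch

# Permission lists copied from the roles reference above.
LEGACY_BUCKET_WRITER = {
    "storage.buckets.get",
    "storage.objects.create",
    "storage.objects.delete",
    "storage.objects.list",
}
OBJECT_ADMIN = {
    "resourcemanager.projects.get",
    "resourcemanager.projects.list",
    "storage.objects.*",
}

def grants(role_perms, permission):
    """True if a role's permission list covers `permission` (wildcard-aware)."""
    return any(fnmatch(permission, pattern) for pattern in role_perms)

# objectAdmin covers all object operations but not the bucket read a
# registry push needs; legacyBucketWriter is what supplies it.
print(grants(OBJECT_ADMIN, "storage.objects.create"))       # True
print(grants(OBJECT_ADMIN, "storage.buckets.get"))          # False
print(grants(LEGACY_BUCKET_WRITER, "storage.buckets.get"))  # True
```

So a caller with only objectAdmin effective would fail exactly as the push below does, on `storage.buckets.get`.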

@spiffxp
Member Author

spiffxp commented Nov 6, 2020

Perhaps the service account isn't getting picked up as a group member

spiffxp@cloudshell:~ (k8s-infra-prow-build)$ gcloud auth activate-service-account prow-build@k8s-infra-prow-build.iam.gserviceaccount.com --key-file=prow-build.service-account.json
spiffxp@cloudshell:~ (k8s-infra-prow-build)$ docker push gcr.io/k8s-staging-ci-images/conformance-amd64:v1.20.0-beta.1.209_bf5382a53e1546
The push refers to repository [gcr.io/k8s-staging-ci-images/conformance-amd64]
3836a827d4dc: Preparing
75b964e845dc: Preparing
50aa71a59909: Preparing
a4ffc6e6326f: Preparing
fb8aa2688f2b: Preparing
a2bc020e977c: Waiting
b0300c32a489: Waiting
denied: Token exchange failed for project 'k8s-staging-ci-images'. Caller does not have permission 'storage.buckets.get'. To configure permissions, follow instructions at: https://cloud.google.com/container-registry/docs/access-control

@spiffxp
Member Author

spiffxp commented Nov 6, 2020

Opened kubernetes/k8s.io#1401 to do what I did manually

$ gsutil iam ch serviceAccount:prow-build@k8s-infra-prow-build.iam.gserviceaccount.com:roles/storage.objectAdmin gs://artifacts.k8s-staging-ci-images.appspot.com
$ gsutil iam ch serviceAccount:prow-build@k8s-infra-prow-build.iam.gserviceaccount.com:roles/storage.legacyBucketWriter gs://artifacts.k8s-staging-ci-images.appspot.com

Which I verified as working via a push which had failed before:

$ docker push gcr.io/k8s-staging-ci-images/conformance-amd64:v1.20.0-beta.1.209_bf5382a53e1546
The push refers to repository [gcr.io/k8s-staging-ci-images/conformance-amd64]
3836a827d4dc: Pushed
75b964e845dc: Pushed
50aa71a59909: Pushed
a4ffc6e6326f: Pushed
fb8aa2688f2b: Pushed
a2bc020e977c: Pushed
b0300c32a489: Pushed
v1.20.0-beta.1.209_bf5382a53e1546: digest: sha256:a97c22c59446a50398f42ebcd3af4f9999ec82adb008ce7dbed2f016536d3ef2 size: 1793

ameukam added a commit to ameukam/test-infra that referenced this issue Nov 10, 2020
Ref : kubernetes#19483
Part of: kubernetes#18549

Signed-off-by: Arnaud Meukam <ameukam@gmail.com>
k8s-ci-robot added a commit that referenced this issue Nov 10, 2020
Add canary jobs for ci-kubernetes-build job variants.
@spiffxp
Member Author

spiffxp commented Nov 20, 2020

It looks like https://testgrid.k8s.io/sig-testing-canaries#build-master-canary has been green for some time (same for release variants).

So now that we're writing CI artifacts to the new locations, we need to migrate jobs to read from them.

Copy-pasting from kubernetes/k8s.io#846 (comment):

What remains to be done:

  • identify jobs that depend on gcr.io/kubernetes-ci-images
    • determine whether extra config or flags are needed to specify gcr.io/k8s-staging-ci-images instead of gcr.io/kubernetes-ci-images
    • determine whether these jobs prevent migration of jobs that depend on non-image outputs of ci-kubernetes-build
  • migrate all jobs that use (non-image) output of ci-kubernetes-build to pull from gs://k8s-release-dev instead of gs://kubernetes-release-dev
    • sample job
    • all jobs
    • test to enforce
  • migrate all jobs that use image output of ci-kubernetes-build to pull from gcr.io/k8s-staging-ci-images instead of gcr.io/kubernetes-ci-images
    • sample job
    • all jobs? (depends on how bespoke this is)
    • test to enforce?
  • promote canary jobs
    • ci-kubernetes-build-canary should become ci-kubernetes-build, ci-kubernetes-build should become ci-kubernetes-build-old
    • remove -old jobs when confirmation that everything still works, and no jobs depend on the old artifacts
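The two "migrate all jobs..." items boil down to rewriting artifact references from the google.com-owned locations to the community-owned ones. A sketch of that mapping (the helper name is hypothetical; a real pass would edit the job configs):

```python
# Old -> new artifact locations, per the migration plan above.
OLD_TO_NEW = {
    "gs://kubernetes-release-dev": "gs://k8s-release-dev",
    "gcr.io/kubernetes-ci-images": "gcr.io/k8s-staging-ci-images",
}

def migrate_refs(text):
    """Rewrite google.com-owned artifact references to community-owned ones."""
    for old, new in OLD_TO_NEW.items():
        text = text.replace(old, new)
    return text

print(migrate_refs("gs://kubernetes-release-dev/ci/latest.txt"))
# → gs://k8s-release-dev/ci/latest.txt
print(migrate_refs("gcr.io/kubernetes-ci-images/conformance-amd64:v1.20.0"))
# → gcr.io/k8s-staging-ci-images/conformance-amd64:v1.20.0
```

The same mapping could back the "test to enforce" items: fail any job config that still matches one of the old prefixes.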

@spiffxp
Member Author

spiffxp commented Nov 20, 2020

For "migrate all jobs that use (non-image) output of ci-kubernetes-build to pull from gs://k8s-release-dev instead of gs://kubernetes-release-dev", instead of copying around an --extract-ci-bucket=k8s-release-dev setting everywhere, I suggest we:

  • migrate enough jobs using that flag that we're confident the new ci artifacts work
  • give a heads up
  • switch the default bucket used by kubetest
  • bump a few jobs to the new kubekins image built as a result to confirm that this works
  • wait for the usual daily prow bump pr to update the rest
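The "switch the default bucket used by kubetest" step is a one-line default change, after which jobs passing no flag pick up the new bucket as their images roll. kubetest itself is Go; a stand-in sketch in Python, with the flag name taken from the comment above:

```python
import argparse

# Hypothetical stand-in for kubetest's flag parsing; after the switch the
# default points at the community-owned bucket.
parser = argparse.ArgumentParser()
parser.add_argument("--extract-ci-bucket", default="k8s-release-dev")

print(parser.parse_args([]).extract_ci_bucket)
# → k8s-release-dev (jobs with no explicit flag get the new bucket)
print(parser.parse_args(["--extract-ci-bucket=kubernetes-release-dev"]).extract_ci_bucket)
# → kubernetes-release-dev (unmigrated jobs can still pin the old one)
```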

@spiffxp
Member Author

spiffxp commented Nov 24, 2020

identify jobs that depend on gcr.io/kubernetes-ci-images

Based on https://cs.k8s.io/?q=kubernetes-ci-images&i=nope&files=&repos=

  • jobs that use kubeadm that don't explicitly set ClusterConfiguration.ImageRepository will default to using gcr.io/kubernetes-ci-images if ClusterConfiguration.KubernetesVersion starts with ci/ or ci-cross/
    • could impact kubeadm-kinder jobs
    • could impact cluster-api jobs
    • other jobs that use kubeadm?

It is unclear to me whether kubeadm needs to be patched. Options:

  • change the hardcoded constant from gcr.io/kubernetes-ci-images to gcr.io/k8s-staging-ci-images
  • add a flag / config option for the CI image repo in case we ever rename the repo again
  • make no changes and explicitly set --image-repository
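For reference, the kubeadm defaulting at issue can be sketched like this (kubeadm is Go; the names here are illustrative paraphrases of the behavior described above, not its actual API):

```python
CI_IMAGE_REPO = "gcr.io/kubernetes-ci-images"   # the hardcoded constant at issue
DEFAULT_IMAGE_REPO = "k8s.gcr.io"               # assumed regular default

def effective_image_repository(kubernetes_version, image_repository=None):
    """If no repository is set explicitly, CI version strings fall back to
    the CI image repo; everything else uses the regular default."""
    if image_repository:
        return image_repository
    if kubernetes_version.startswith(("ci/", "ci-cross/")):
        return CI_IMAGE_REPO
    return DEFAULT_IMAGE_REPO

print(effective_image_repository("ci/latest"))
# → gcr.io/kubernetes-ci-images
print(effective_image_repository("ci/latest", "gcr.io/k8s-staging-ci-images"))
# → gcr.io/k8s-staging-ci-images (the "explicitly set --image-repository" option)
```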

@spiffxp
Member Author

spiffxp commented Nov 24, 2020

https://cs.k8s.io/?q=kubernetes-release-dev&i=nope&files=&repos=

There are a lot of references to kubernetes-release-dev out there

  • kops jobs reference the version there
  • some sig-testing coverage jobs stage something there (do they need to?)
  • kubernetes_bazel and kubernetes_build scenarios reference it
  • https://dl.k8s.io redirects to it
  • parts of kubernetes/release reference it
  • kubernetes-sigs/cluster-api-provider-azure appears to reference it (and the bazel-built artifacts??)
  • kubernetes/kubernetes cluster scripts reference it
  • kubernetes/cloud-provider-gcp cluster scripts reference it
  • kubernetes/kubeadm references it in docs, kinder, and tests
  • kubernetes-sigs/cluster-api references it (again, bazel built artifacts?)
  • kubernetes-sigs/cluster-api-provider-gcp references it in a conformance e2e script
  • kubernetes-sigs/image-builder references it (for capi image building?)
  • kubernetes-sigs/poseidon references it
  • kubernetes/community docs reference it

So we can't just immediately delete the old ci-kubernetes-build jobs.
Possible paths forward:

  • keep the old job running until everything is migrated (I don't like this because I think the old and new bucket may not have the exact same versions)
  • set up replication from the new GCS bucket to the old GCS bucket until everything is migrated

I see a KEP in our future

@neolit123
Member

neolit123 commented Nov 30, 2020

identify jobs that depend on gcr.io/kubernetes-ci-images

Based on https://cs.k8s.io/?q=kubernetes-ci-images&i=nope&files=&repos=

  • jobs that use kubeadm that don't explicitly set ClusterConfiguration.ImageRepository will default to using gcr.io/kubernetes-ci-images if ClusterConfiguration.KubernetesVersion starts with ci/ or ci-cross/

    • could impact kubeadm-kinder jobs
    • could impact cluster-api jobs
    • other jobs that use kubeadm?

kinder downloads image tarballs from e.g.:

https://storage.googleapis.com/kubernetes-release-dev/ci/v1.20.0-beta.2.88+e3de62298a7304/bin/linux/amd64/kube-apiserver.tar

and then mutates them to be k8s.gcr.io/....

so i don't think kinder (or kubeadm CI) will be affected by the gcr.io/kubernetes-ci-images -> gcr.io/k8s-staging-ci-images change.
however, reading kubernetes/k8s.io#846, my understanding is that kinder needs to migrate away from downloading from gs://kubernetes-release-dev to gs://k8s-release-dev.

should we do this now - i.e. is gs://k8s-release-dev ready for usage?

It is unclear to me whether kubeadm needs to be patched. Options:

  • change the hardcoded constant from gcr.io/kubernetes-ci-images to gcr.io/k8s-staging-ci-images
  • add a flag / config option for the CI image repo in case we ever rename the repo again
  • make no changes and explicitly set --image-repository

for the gcr.io/kubernetes-ci-images -> gcr.io/k8s-staging-ci-images change, it can be considered a breaking change for the kubeadm API:

  • the correct way would be to make this change as part of a new API version; older versions of the API would have to pass imageRepository (or the flag) explicitly, which is a breaking change for old API users.
  • the easier option is to just change the default in both the older and newer APIs, which is easier for everyone but may annoy API purists a little. in any case, this feels like the better option.

EDIT, logged:
kubernetes/kubeadm#2355
kubernetes/kubeadm#2356

@lasomethingsomething
Contributor

@cpanato @ameukam Hello both: Wondering what your suggested next step might be from a Release Engineering group perspective?

@cpanato
Member

cpanato commented Jan 19, 2021

I need to check the status of this item, and then maybe we can plan the next steps (@LappleApple)

@ameukam do you have any updates that you can provide to check where we are with this issue?

@ameukam
Member

ameukam commented Jan 20, 2021

@LappleApple @cpanato The only update I'm aware of is the switch to gcr.io/k8s-staging-ci-images: kubernetes/kubernetes#97087.
As suggested (#19483 (comment)) by spiffxp, the next step would be a KEP describing how we can migrate away from kubernetes-release-dev bucket.

@spiffxp
Member Author

spiffxp commented Jan 25, 2021

IMO next steps are:

@spiffxp
Member Author

spiffxp commented Feb 2, 2021

/milestone v1.21

@k8s-ci-robot k8s-ci-robot added this to the v1.21 milestone Feb 2, 2021
@spiffxp
Member Author

spiffxp commented Feb 8, 2021

/priority important-longterm

@k8s-ci-robot k8s-ci-robot added the priority/important-longterm Important over the long term, but may not be staffed and/or may need multiple releases to complete. label Feb 8, 2021
@spiffxp
Member Author

spiffxp commented Feb 9, 2021

/remove-priority important-longterm
/priority important-soon

@k8s-ci-robot k8s-ci-robot added priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release. and removed priority/important-longterm Important over the long term, but may not be staffed and/or may need multiple releases to complete. labels Feb 9, 2021
ameukam added a commit to ameukam/test-infra that referenced this issue Feb 22, 2021
Ensure periodic-kubernetes-bazel-build can run in the k8s-infra-prow-build
cluster and push artifacts to the k8s-release-dev bucket.

Related to : kubernetes#19483

Signed-off-by: Arnaud Meukam <ameukam@gmail.com>
@spiffxp
Member Author

spiffxp commented Feb 23, 2021

@spiffxp
Member Author

spiffxp commented Feb 24, 2021

/close
Work related to fully deprecating the google-hosted artifacts from these jobs (#19483 (comment)) will be tracked under kubernetes/k8s.io#1571

@k8s-ci-robot
Contributor

@spiffxp: Closing this issue.

In response to this:

/close
Work related to fully deprecating the google-hosted artifacts from these jobs (#19483 (comment)) will be tracked under kubernetes/k8s.io#1571

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
