
[v0.10] Backport of [SURE-9061] Jobs are not cleaned up from local cluster #2931

Closed
1 task done
0xavi0 opened this issue Oct 7, 2024 · 3 comments

Comments

@0xavi0
Contributor

0xavi0 commented Oct 7, 2024

Backport of #2870

Is there an existing issue for this?

  • I have searched the existing issues

Current Behavior

In Rancher local cluster, for each commit/change in each GitRepo, there is a Job started by Fleet. There is nothing to clean up these Jobs, so you will quickly end up with hundreds of lingering Job objects and their completed Pods.

I didn't notice this behavior in Fleet 0.9.x, so I assume something in 0.10.x introduced these Jobs. I was assuming this is related to the automatic chart dependency update, but setting disableDependencyUpdate to true doesn't seem to have any effect.

Expected Behavior

Unnecessary Job objects are cleaned up, e.g. by setting some sane default for .spec.ttlSecondsAfterFinished: https://kubernetes.io/docs/concepts/workloads/controllers/ttlafterfinished/
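As a sketch, the suggested TTL-after-finished approach would look like this on a Job manifest (illustrative names and values only; Fleet generates its own Job specs, and the actual fix chosen was active deletion rather than a TTL):

```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: example-gitjob   # hypothetical name
spec:
  # Delete this Job (and its completed Pods) one hour after it finishes.
  ttlSecondsAfterFinished: 3600
  template:
    spec:
      restartPolicy: Never
      containers:
        - name: main
          image: busybox
          command: ["true"]
```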

Steps To Reproduce

  1. Install Rancher & Fleet
  2. Add any GitRepo and make sure it deploys
  3. Check the Rancher local cluster. You now have lingering Job objects

Environment

- Architecture: x86
- Fleet Version: v0.10.2
- Cluster:
  - Provider: GKE
  - Options: Rancher 2.9.1
  - Kubernetes Version: v1.30.4-gke.1213000

Logs

No response

Anything else?

No response

@0xavi0 0xavi0 added this to the v2.9.3 milestone Oct 7, 2024
@rancherbot rancherbot added this to Fleet Oct 7, 2024
@github-project-automation github-project-automation bot moved this to 🆕 New in Fleet Oct 7, 2024
@0xavi0 0xavi0 self-assigned this Oct 7, 2024
@0xavi0 0xavi0 moved this from 🆕 New to Needs QA review in Fleet Oct 7, 2024
@0xavi0
Contributor Author

0xavi0 commented Oct 7, 2024

Additional QA

Problem

Fleet is not deleting the jobs related to GitRepos.
We create a new job for every new commit we get in the git repository, which is a problem in systems with many GitRepos and many commits because we could reach the etcd limits.

Solution

  • Fleet will create a new job when it is needed and will delete it after it succeeds
  • In case of error, the job won't be deleted (so we can describe the job, check the logs, etc.)
  • If a job is not finished and the user changes the spec, forces an update, or a new commit is received, the running job will be deleted and a new one will be created.
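The cleanup rules above can be sketched as a small decision function. This is a hypothetical simplification for illustration, not Fleet's actual controller code:

```python
def reconcile_job(job_exists: bool, job_succeeded: bool, job_failed: bool,
                  spec_changed: bool, new_commit: bool, forced: bool) -> list:
    """Return the actions a controller would take for a GitRepo's job.

    Hypothetical simplification of the rules described above:
    - a succeeded job is deleted,
    - a failed job is kept so it can be inspected,
    - a still-running job is replaced when the spec changes, an update
      is forced, or a new commit arrives.
    """
    if not job_exists:
        return ["create"]
    if job_succeeded:
        return ["delete"]
    if job_failed:
        # Keep the job so we can describe it, check the logs, etc.
        return ["keep"]
    if spec_changed or new_commit or forced:
        # Replace the running job with a fresh one.
        return ["delete", "create"]
    return ["wait"]
```

For example, a running job plus a new commit yields `["delete", "create"]`, matching the third bullet above.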

Testing

Test a few scenarios to cover all the possible cases:

  • Apply a GitRepo that is successful, check that the job is created and deleted when the job succeeds
  • Apply a GitRepo that is successful, check that the job is created and deleted when the job succeeds. Then update the Commit, check that another job is created and deleted after it succeeds.
  • Apply a GitRepo that is successful, check that the job is created and deleted when the job succeeds. Then Force Update, check that another job is created and deleted after it succeeds.
  • Apply a GitRepo that is successful, check that the job is created and deleted when the job succeeds. Then change the Spec of the GitRepo (for example change the path), check that another job is created and deleted after it succeeds.
  • Apply a GitRepo that is not successful (for example a bad path or git url or anything that makes the job fail). Check that the job is not deleted and we can see the error in the logs.
  • Apply a GitRepo that creates a job that is slow, so we have time to Force Update before it is finished. Check that the job is deleted and re-created
  • Apply a GitRepo that creates a job that is slow, so we have enough time to change the Spec (for example the path). Check that the job is deleted and re-created.

In any test, the job should only stay if it is not successful; otherwise it should be deleted.

@mmartin24
Collaborator

I initially checked it in v2.9-15b5719857d2bac0398a57b85f5ec1173e4dd375-head with 104.1.0+up0.10.4-rc.2 and it seemed to work. However, the case where I change the existing path of a deployed GitRepo (which works) to another path that does not exist seems not to trigger a job:

Screencast.from.08-10-24.10.33.24.webm

Not sure what may be happening here.
After having a look with @0xavi0, the job does get triggered when the incorrect path is edited directly via the CLI, but not when it is changed in the UI.

@mmartin24
Collaborator

Tested in Rancher v2.9-af2463ff43418024ff86cfd9304836b7074a924a-head with fleet 104.1.0+up0.10.4-rc.3 and working OK.


Tested the scenarios described above and all were OK. Namely:

  1. Apply a GitRepo that is successful, check that the job is created and deleted when the job succeeds.
    a. Checked that after doing this, the UI correctly shows a job deletion status on recent events of the GitRepo (screenshot attached).

  2. Apply a GitRepo that is successful, check that the job is created and deleted when the job succeeds. Then update the Commit, check that another job is created and deleted after it succeeds.

  3. Apply a GitRepo that is successful, check that the job is created and deleted when the job succeeds. Then Force Update, check that another job is created and deleted after it succeeds.

  4. Apply a GitRepo that is successful, check that the job is created and deleted when the job succeeds. Then change the Spec of the GitRepo (for example change the path), check that another job is created and deleted after it succeeds.

  5. Apply a GitRepo that is not successful (for example a bad path or git url or anything that makes the job fail). Check that the job is not deleted and we can see the error in the logs.

  6. Apply a GitRepo that creates a job that is slow, so we have time to Force Update before it is finished. Check that the job is deleted and re-created. (For this example I borrowed the scale-grepos setup from @manno, thanks for it.)

  7. Apply a GitRepo that creates a job that is slow, so we have enough time to change the Spec (for example the path). Check that the job is deleted and re-created.

Added some other scenarios, and all were OK as well:

  1. Checked that triggering force update several times on a resource-heavy repo correctly triggers and deletes the job.

  2. Checked that uploading 10 resource-heavy repos on downstream clusters correctly triggers and deletes the jobs. Tested with a heavier amount (30) in bulk as well, and OK.

  3. Checked that multiple resource-heavy repo uploads, with a force update triggered while waiting, correctly deploy and delete the jobs.
    Adding a proof video for scenario 10:

    Screencast.from.10-10-24.13.46.07.webm
  4. Checked that multiple repo uploads on different downstream clusters correctly trigger and delete the jobs

  5. Checked that after force-updating a faulty and a correct GitRepo, the faulty GitRepo's job stays while the correct one's is deleted.

  6. Checked that deploying a GitRepo over paused clusters correctly triggers and deletes the job, and the GitRepo is later deployed after starting the cluster again.

  7. Checked that deploying a GitRepo over disconnected clusters correctly triggers and deletes the job, and the GitRepo is later deployed after reconnecting the cluster.


  1. SPECIAL CASE: there is only one scenario which, while not an error, does not clean up the job after a successful deployment.
    Create a successful GitRepo. Confirm the job deletion. Change the URL or commit to a bad one and observe how the error appears. Change it back to the correct one and observe how the GitRepo is correctly deployed, yet the job remains in a bad status (screenshot attached).

14 scenarios were OK. The minor thing in the 15th is not an issue, just something to note. Congratulations @0xavi0 on the fixes here. It seems to work quite well, at least in all scenarios tested.


Aside from these manual checks, we will add several of the test cases above to our UI automation in rancher/fleet-e2e#213.
