
[v0.10] Backport of [SURE-9061] Jobs are not cleaned up from local cluster #2931

Closed
1 task done
0xavi0 opened this issue Oct 7, 2024 · 3 comments

Comments

@0xavi0
Contributor

0xavi0 commented Oct 7, 2024

Backport of #2870

Is there an existing issue for this?

  • I have searched the existing issues

Current Behavior

In Rancher local cluster, for each commit/change in each GitRepo, there is a Job started by Fleet. There is nothing to clean up these Jobs, so you will quickly end up with hundreds of lingering Job objects and their completed Pods.

I didn't notice this behavior in Fleet 0.9.x, so I assume something in 0.10.x introduced these Jobs. I was assuming this is related to the automatic chart dependency update, but setting disableDependencyUpdate to true doesn't seem to have any effect.

Expected Behavior

Unnecessary Job objects are cleaned up, e.g. by setting some sane default for .spec.ttlSecondsAfterFinished: https://kubernetes.io/docs/concepts/workloads/controllers/ttlafterfinished/
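As a sketch, the suggested TTL-after-finished approach would look like this on a Job manifest (illustrative names and values only; Fleet generates its own Job specs, and the actual fix chosen was active deletion rather than a TTL):

```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: example-gitjob   # hypothetical name
spec:
  # Delete this Job (and its completed Pods) one hour after it finishes.
  ttlSecondsAfterFinished: 3600
  template:
    spec:
      restartPolicy: Never
      containers:
        - name: main
          image: busybox
          command: ["true"]
```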

Steps To Reproduce

  1. Install Rancher & Fleet
  2. Add any GitRepo and make sure it deploys
  3. Check the Rancher local cluster. You now have lingering Job objects

Environment

- Architecture: x86
- Fleet Version: v0.10.2
- Cluster:
  - Provider: GKE
  - Options: Rancher 2.9.1
  - Kubernetes Version: v1.30.4-gke.1213000

Logs

No response

Anything else?

No response

@0xavi0 0xavi0 added this to the v2.9.3 milestone Oct 7, 2024
@rancherbot rancherbot added this to Fleet Oct 7, 2024
@github-project-automation github-project-automation bot moved this to 🆕 New in Fleet Oct 7, 2024
@0xavi0 0xavi0 self-assigned this Oct 7, 2024
@0xavi0 0xavi0 moved this from 🆕 New to Needs QA review in Fleet Oct 7, 2024
@0xavi0
Contributor Author

0xavi0 commented Oct 7, 2024

Additional QA

Problem

Fleet is not deleting the jobs related to GitRepos.
We create a new job for every new commit we get in the git repository, which is a problem in systems with many GitRepos and many commits because we could reach the etcd limits.

Solution

  • Fleet will create a new job when it is needed and will delete it after it succeeds
  • In case of error, the job won't be deleted (so we can describe the job, check the logs, etc.)
  • If a job is not finished and the user changes the spec, forces an update, or a new commit is received, the running job will be deleted and a new one will be created.
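The cleanup rules above can be sketched as a small decision function. This is a hypothetical simplification for illustration, not Fleet's actual controller code:

```python
def reconcile_job(job_exists: bool, job_succeeded: bool, job_failed: bool,
                  spec_changed: bool, new_commit: bool, forced: bool) -> list:
    """Return the actions a controller would take for a GitRepo's job.

    Hypothetical simplification of the rules described above:
    - a succeeded job is deleted,
    - a failed job is kept so it can be inspected,
    - a still-running job is replaced when the spec changes, an update
      is forced, or a new commit arrives.
    """
    if not job_exists:
        return ["create"]
    if job_succeeded:
        return ["delete"]
    if job_failed:
        # Keep the job so we can describe it, check the logs, etc.
        return ["keep"]
    if spec_changed or new_commit or forced:
        # Replace the running job with a fresh one.
        return ["delete", "create"]
    return ["wait"]
```

For example, a running job plus a new commit yields `["delete", "create"]`, matching the third bullet above.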

Testing

Test a few scenarios to cover all the possible cases:

  • Apply a GitRepo that is successful, check that the job is created and deleted when the job succeeds
  • Apply a GitRepo that is successful, check that the job is created and deleted when the job succeeds. Then update the Commit, check that another job is created and deleted after it succeeds.
  • Apply a GitRepo that is successful, check that the job is created and deleted when the job succeeds. Then Force Update, check that another job is created and deleted after it succeeds.
  • Apply a GitRepo that is successful, check that the job is created and deleted when the job succeeds. Then change the Spec of the GitRepo (for example change the path), check that another job is created and deleted after it succeeds.
  • Apply a GitRepo that is not successful (for example a bad path or git url or anything that makes the job fail). Check that the job is not deleted and we can see the error in the logs.
  • Apply a GitRepo that creates a job that is slow, so we have time to Force Update before it is finished. Check that the job is deleted and re-created
  • Apply a GitRepo that creates a job that is slow, so we have enough time to change the Spec (for example the path). Check that the job is deleted and re-created.

In any test, the job should only stay if it is not successful; otherwise it should be deleted.

@mmartin24
Collaborator

I initially checked it in v2.9-15b5719857d2bac0398a57b85f5ec1173e4dd375-head with 104.1.0+up0.10.4-rc.2 and it seemed to work. However, the case where I change the existing path of a deployed GitRepo (which works) to another path that does not exist seems not to trigger a job:

Screencast.from.08-10-24.10.33.24.webm

Not sure what may be happening here.
After having a look with @0xavi0, the job does get triggered when the incorrect path is edited directly via the CLI, but not when it is changed in the UI.

@mmartin24
Collaborator

Tested in Rancher v2.9-af2463ff43418024ff86cfd9304836b7074a924a-head with fleet 104.1.0+up0.10.4-rc.3 and working OK.


Tested the scenarios described above and all were OK. Namely:

  1. Apply a GitRepo that is successful, check that the job is created and deleted when the job succeeds.
    a. Checked that after doing this, the UI correctly shows a job deletion status on recent events of the GitRepo (screenshot attached).

  2. Apply a GitRepo that is successful, check that the job is created and deleted when the job succeeds. Then update the Commit, check that another job is created and deleted after it succeeds.

  3. Apply a GitRepo that is successful, check that the job is created and deleted when the job succeeds. Then Force Update, check that another job is created and deleted after it succeeds.

  4. Apply a GitRepo that is successful, check that the job is created and deleted when the job succeeds. Then change the Spec of the GitRepo (for example change the path), check that another job is created and deleted after it succeeds.

  5. Apply a GitRepo that is not successful (for example a bad path or git url or anything that makes the job fail). Check that the job is not deleted and we can see the error in the logs.

  6. Apply a GitRepo that creates a job that is slow, so we have time to Force Update before it is finished. Check that the job is deleted and re-created. (For this example I borrowed the scale-grepos setup from @manno, thanks for it.)

  7. Apply a GitRepo that creates a job that is slow, so we have enough time to change the Spec (for example the path). Check that the job is deleted and re-created.

Added some other scenarios, and all were OK as well:

  1. Checked that triggering force update several times on a resource-heavy repo correctly triggers and deletes the job.

  2. Checked that uploading 10 resource-heavy repos on downstream clusters correctly triggers and deletes the jobs. Tested with a heavier amount (30) in bulk as well, and OK.

  3. Checked that multiple resource-heavy repo uploads, with a force update triggered while waiting, correctly deploy and delete the jobs.
    Adding a proof video for scenario 10:

    Screencast.from.10-10-24.13.46.07.webm
  4. Checked that multiple repo uploads on different downstream clusters correctly trigger and delete the jobs

  5. Checked that after force-updating a faulty and a correct GitRepo, the faulty GitRepo's job stays while the correct one's is deleted.

  6. Checked that deploying a GitRepo over paused clusters correctly triggers and deletes the job, and the GitRepo is later deployed after starting the cluster again.

  7. Checked that deploying a GitRepo over disconnected clusters correctly triggers and deletes the job, and the GitRepo is later deployed after reconnecting the cluster.


  1. SPECIAL CASE: there is only one scenario which, while not an error, does not clean up the job after a successful deployment.
    Create a successful GitRepo. Confirm the job deletion. Change the URL or commit to a bad one and observe how the error appears. Change it back to the correct one and observe how the GitRepo is correctly deployed, yet the job remains in a bad status (screenshot attached).

14 scenarios were OK. The minor thing in the 15th is not an issue, just something to note. Congratulations @0xavi0 on the fixes here. It seems to work quite well, at least in all scenarios tested.


Aside from these manual checks, we will add several of the test cases above to our UI automation in rancher/fleet-e2e#213.
