Skip to content

Commit

Permalink
feat: add status for removing / removalfailed (#1334)
Browse files Browse the repository at this point in the history
## Description

This PR utilizes the "new" ability in a Pepr finalizer to not remove the
`finalizer`. This enables us to update the status while finalizing, and
catch errors if cleanup does not work as expected. Changes:
- Skip finalizer if it's already running (based on status) 
- Skip finalizer if Package isn't ready/failed yet (for
#963)
- Patch `Removing` status on the CR
- Catch errors on finalization and patch `RemovalFailed` status and
create a failure event
- Retry each cleanup/purge function using `retryWithDelay`

I also updated the diagram to support these changes, as well as adding
test cases for the finalizer function. Diagram update can be previewed
on the docs by using [this
link](https://raw.githubusercontent.com/defenseunicorns/uds-core/c41964d426b8bb9780c26d41c631dbe6f50e854a/docs/.images/diagrams/uds-core-operator-uds-package.svg)
on `docs/reference/configuration/UDS operator/package.md`, specific
changes:
- Moved finalizer section to the right of reconciler
- Simplified flow of validator (to make more space in the diagram)
- Added new pieces of finalizer flow (failure, status patching, etc)

## Related Issue

Fixes #963

Fixes #1159

## Type of change

- [x] Bug fix (non-breaking change which fixes an issue)
- [x] New feature (non-breaking change which adds functionality)
- [ ] Other (security config, docs update, etc)

## Steps to Validate

<details><summary>Testing Steps</summary>

Test setup:
```console
# Install slim-dev (unicorn flavor to avoid pull rate limiting)
uds run slim-dev --set flavor=unicorn
# Create the test packages
zarf p create src/test --skip-sbom
# Deploy the test packages
zarf p deploy build/zarf-package-uds-core-test-apps-*.tar.zst --confirm
# Validate all package CRs go to Ready status
kubectl get pkg -A # should all show ready
```

Test that normal deletion works and makes events:
```console
# Delete a package CR
kubectl delete pkg -n test-tenant-app test-tenant-app
# Validate success and events
kubectl get pkg -n test-tenant-app # should show no resources
kubectl get events -n test-tenant-app | grep package # should show 3 removal events
```

Test that finalizer doesn't run until CR is ready:
```console
# This forces a re-reconcile of the package and then deletes immediately
# If you watch while this happens (k9s, etc) you should see it go to Pending before Removing
kubectl patch pkg httpbin-other -n authservice-test-app --subresource=status --type=json  -p='[{"op": "remove", "path": "/status"}]' && kubectl delete pkg httpbin-other -n authservice-test-app
# Validate that the watcher waited to finalize
kubectl logs -n pepr-system -l app=pepr-uds-core-watcher --tail=-1 | grep "Waiting"
kubectl get events -n authservice-test-app | grep package # should show 3 removal events
```

Test that finalizer places CR in RemovalFailed state on failed cleanup:
```console
# Deploy the test apps again (we need the sso client)
zarf p deploy build/zarf-package-uds-core-test-apps-*.tar.zst --confirm
# Edit the peprstore
kubectl edit peprstore -n pepr-system pepr-uds-core-store
# Delete the line with `uds-core-operator-v2-sso-client-uds-core-httpbin`, this is the client token and will make Pepr unable to cleanup the client
# Save the peprstore
# Delete the package CR
kubectl delete pkg httpbin-other -n authservice-test-app
# Make sure that status is marked as RemovalFailed (after ~15 seconds)
kubectl get pkg httpbin-other -n authservice-test-app
# Make sure events show up that client failed to be removed
kubectl describe pkg httpbin-other -n authservice-test-app
# Make sure that the SSO client removal was retried 4 times before final failure
kubectl logs -n pepr-system -l app=pepr-uds-core-watcher --tail=-1 | grep "cleanupSSOClients"
```

</details>

Also note the automated jest unit tests and validate those.

## Checklist before merging

- [x] Test, docs, adr added or updated as needed
- [x] [Contributor
Guide](https://github.com/defenseunicorns/uds-template-capability/blob/main/CONTRIBUTING.md)
followed
  • Loading branch information
mjnagel authored Mar 7, 2025
1 parent d51db55 commit a99b408
Show file tree
Hide file tree
Showing 9 changed files with 577 additions and 336 deletions.
3 changes: 2 additions & 1 deletion docs/.images/diagrams/uds-core-operator-uds-package.svg
Loading
Sorry, something went wrong. Reload?
Sorry, we cannot display this file.
Sorry, this file is invalid so it cannot be displayed.
621 changes: 308 additions & 313 deletions docs/.images/diagrams/uds-core-pepr-operator-flow.drawio

Large diffs are not rendered by default.

3 changes: 2 additions & 1 deletion schemas/package-v1alpha1.schema.json
Original file line number Diff line number Diff line change
Expand Up @@ -1140,7 +1140,8 @@
"Ready",
"Failed",
"Retrying",
"Removing"
"Removing",
"RemovalFailed"
],
"title": "Phase"
}
Expand Down
9 changes: 8 additions & 1 deletion src/pepr/operator/controllers/keycloak/client-sync.ts
Original file line number Diff line number Diff line change
Expand Up @@ -58,7 +58,13 @@ export async function keycloak(pkg: UDSPackage) {
clients.set(client.clientId, client);
}

await purgeSSOClients(pkg, [...clients.keys()]);
// Purge orphaned clients
try {
await purgeSSOClients(pkg, [...clients.keys()]);
} catch (e) {
log.error(e, `Failed to purge orphaned clients in for ${pkg.metadata!.name!}: ${e}`);
}

// Purge orphaned SSO secrets
try {
await purgeOrphans(generation, pkg.metadata!.namespace!, pkg.metadata!.name!, kind.Secret, log);
Expand Down Expand Up @@ -87,6 +93,7 @@ export async function purgeSSOClients(pkg: UDSPackage, newClients: string[] = []
await Store.removeItemAndWait(storeKey);
} else {
log.warn(pkg.metadata, `Failed to remove client ${ref}, token not found`);
throw new Error(`Failed to remove client ${ref}, token not found`);
}
}
}
Expand Down
1 change: 1 addition & 0 deletions src/pepr/operator/crd/generated/package-v1alpha1.ts
Original file line number Diff line number Diff line change
Expand Up @@ -776,6 +776,7 @@ export enum Phase {
Failed = "Failed",
Pending = "Pending",
Ready = "Ready",
RemovalFailed = "RemovalFailed",
Removing = "Removing",
Retrying = "Retrying",
}
Expand Down
2 changes: 1 addition & 1 deletion src/pepr/operator/crd/sources/package/v1alpha1.ts
Original file line number Diff line number Diff line change
Expand Up @@ -540,7 +540,7 @@ export const v1alpha1: V1CustomResourceDefinitionVersion = {
},
},
phase: {
enum: ["Pending", "Ready", "Failed", "Retrying", "Removing"],
enum: ["Pending", "Ready", "Failed", "Retrying", "Removing", "RemovalFailed"],
type: "string",
},
ssoClients: {
Expand Down
7 changes: 5 additions & 2 deletions src/pepr/operator/reconcilers/index.ts
Original file line number Diff line number Diff line change
Expand Up @@ -7,7 +7,7 @@ import { K8s, kind } from "pepr";

import { Component, setupLogger } from "../../logger";
import { Phase, PkgStatus, UDSPackage } from "../crd";
import { StatusEnum, StatusObject as Status } from "../crd/generated/package-v1alpha1";
import { StatusObject as Status, StatusEnum } from "../crd/generated/package-v1alpha1";

export const uidSeen = new Set<string>();

Expand All @@ -24,7 +24,10 @@ export function shouldSkip(cr: UDSPackage) {
const isRetrying = cr.status?.phase === Phase.Retrying;
const isPending = cr.status?.phase === Phase.Pending;
// Check for status removing OR a deletion timestamp present
const isRemoving = cr.status?.phase === Phase.Removing || cr.metadata?.deletionTimestamp;
const isRemoving =
cr.metadata?.deletionTimestamp ||
cr.status?.phase === Phase.Removing ||
cr.status?.phase === Phase.RemovalFailed;
const isCurrentGeneration = cr.metadata?.generation === cr.status?.observedGeneration;

// First check if the CR has been seen before and return false if it has not
Expand Down
182 changes: 180 additions & 2 deletions src/pepr/operator/reconcilers/package-reconciler.spec.ts
Original file line number Diff line number Diff line change
Expand Up @@ -6,12 +6,42 @@
import { beforeEach, describe, expect, jest, test } from "@jest/globals";

import { K8s, Log } from "pepr";
import { writeEvent } from ".";
import { cleanupNamespace } from "../controllers/istio/injection";
import { purgeAuthserviceClients } from "../controllers/keycloak/authservice/authservice";
import { purgeSSOClients } from "../controllers/keycloak/client-sync";
import { retryWithDelay } from "../controllers/utils";
import { Phase, UDSPackage } from "../crd";
import { packageReconciler } from "./package-reconciler";
import { packageFinalizer, packageReconciler } from "./package-reconciler";

const mockCleanupNamespace: jest.MockedFunction<() => Promise<void>> = jest.fn();
const mockPurgeSSO: jest.MockedFunction<() => Promise<void>> = jest.fn();
const mockPurgeAuthservice: jest.MockedFunction<() => Promise<void>> = jest.fn();
const mockPatchStatus: jest.MockedFunction<() => Promise<void>> = jest.fn();
const mockWriteEvent = jest.fn();

jest.mock("kubernetes-fluent-client");
jest.mock("../../config");
jest.mock("../controllers/istio/injection");
jest.mock("../controllers/istio/injection", () => ({
cleanupNamespace: jest.fn(),
}));
jest.mock("../controllers/keycloak/client-sync", () => ({
purgeSSOClients: jest.fn(),
}));
jest.mock("../controllers/keycloak/authservice/authservice", () => ({
purgeAuthserviceClients: jest.fn(),
}));
jest.mock("../controllers/utils", () => ({
retryWithDelay: jest.fn(async <T>(fn: () => Promise<T>) => fn()),
}));
jest.mock(".", () => {
const originalModule = jest.requireActual(".") as object;
return {
...originalModule,
writeEvent: jest.fn(),
};
});

jest.mock("../controllers/istio/virtual-service");
jest.mock("../controllers/network/policies");

Expand Down Expand Up @@ -59,3 +89,151 @@ describe("reconciler", () => {
expect(Log.error).toHaveBeenCalled();
});
});

describe("finalizer", () => {
let mockPackage: UDSPackage;

beforeEach(() => {
jest.clearAllMocks();

mockPackage = {
metadata: { name: "test-package", namespace: "test-namespace", generation: 1 },
};

(K8s as jest.Mock).mockImplementation(() => ({
Create: jest.fn(),
PatchStatus: mockPatchStatus,
}));
(cleanupNamespace as jest.Mock).mockImplementation(mockCleanupNamespace);
(purgeSSOClients as jest.Mock).mockImplementation(mockPurgeSSO);
(purgeAuthserviceClients as jest.Mock).mockImplementation(mockPurgeAuthservice);
(writeEvent as jest.Mock).mockImplementation(mockWriteEvent);
});

test("should not remove the finalizer for pending packages", async () => {
mockPackage.status = { phase: Phase.Pending };
const finalizerRemoved = await packageFinalizer(mockPackage);
// Assert that we didn't try to cleanup anything
expect(retryWithDelay).not.toHaveBeenCalled();
expect(mockCleanupNamespace).not.toHaveBeenCalled();
expect(mockPurgeSSO).not.toHaveBeenCalled();
expect(mockPurgeAuthservice).not.toHaveBeenCalled();
// Assert that the finalizer was not removed
expect(finalizerRemoved).toEqual(false);
});

test("should not remove the finalizer for removing packages", async () => {
mockPackage.status = { phase: Phase.Removing };
const finalizerRemoved = await packageFinalizer(mockPackage);
// Assert that we didn't try to cleanup anything
expect(retryWithDelay).not.toHaveBeenCalled();
expect(mockCleanupNamespace).not.toHaveBeenCalled();
expect(mockPurgeSSO).not.toHaveBeenCalled();
expect(mockPurgeAuthservice).not.toHaveBeenCalled();
// Assert that the finalizer was not removed
expect(finalizerRemoved).toEqual(false);
});

test("should not remove the finalizer for removalfailed packages", async () => {
mockPackage.status = { phase: Phase.RemovalFailed };
const finalizerRemoved = await packageFinalizer(mockPackage);
// Assert that we didn't try to cleanup anything
expect(retryWithDelay).not.toHaveBeenCalled();
expect(mockCleanupNamespace).not.toHaveBeenCalled();
expect(mockPurgeSSO).not.toHaveBeenCalled();
expect(mockPurgeAuthservice).not.toHaveBeenCalled();
// Assert that the finalizer was not removed
expect(finalizerRemoved).toEqual(false);
});

test("should finalize a ready package", async () => {
mockPackage.status = { phase: Phase.Ready };
const finalizerRemoved = await packageFinalizer(mockPackage);
expect(finalizerRemoved).toEqual(true);
expect(retryWithDelay).toHaveBeenCalled();
expect(mockCleanupNamespace).toHaveBeenCalled();
expect(mockPurgeSSO).toHaveBeenCalled();
expect(mockPurgeAuthservice).toHaveBeenCalled();
});

test("should handle failure in cleanupNamespace and set phase to RemovalFailed", async () => {
mockPackage.status = { phase: Phase.Ready };
mockCleanupNamespace.mockRejectedValue(new Error("Istio cleanup failed"));
mockPurgeAuthservice.mockReset();
mockPurgeSSO.mockReset();

const finalizerRemoved = await packageFinalizer(mockPackage);

expect(finalizerRemoved).toEqual(false);
expect(retryWithDelay).toHaveBeenCalled();
expect(mockCleanupNamespace).toHaveBeenCalled();
expect(mockPatchStatus).toHaveBeenCalledWith(
expect.objectContaining({
metadata: { name: "test-package", namespace: "test-namespace" },
status: { phase: Phase.RemovalFailed },
}),
);
expect(mockWriteEvent).toHaveBeenCalledWith(
expect.anything(),
expect.objectContaining({
reason: "RemovalFailed",
message: expect.stringContaining("Istio"),
}),
);
});

test("should handle failure in purgeAuthserviceClients and set phase to RemovalFailed", async () => {
mockPackage.status = { phase: Phase.Ready };
mockCleanupNamespace.mockReset();
mockPurgeAuthservice.mockRejectedValue(new Error("AuthService cleanup failed"));
mockPurgeSSO.mockReset();

const finalizerRemoved = await packageFinalizer(mockPackage);

expect(finalizerRemoved).toEqual(false);
expect(retryWithDelay).toHaveBeenCalled();
expect(mockCleanupNamespace).toHaveBeenCalled();
expect(mockPurgeAuthservice).toHaveBeenCalled();
expect(mockPatchStatus).toHaveBeenCalledWith(
expect.objectContaining({
metadata: { name: "test-package", namespace: "test-namespace" },
status: { phase: Phase.RemovalFailed },
}),
);
expect(mockWriteEvent).toHaveBeenCalledWith(
expect.anything(),
expect.objectContaining({
reason: "RemovalFailed",
message: expect.stringContaining("AuthService"),
}),
);
});

test("should handle failure in purgeSSOClients and set phase to RemovalFailed", async () => {
mockPackage.status = { phase: Phase.Ready };
mockCleanupNamespace.mockReset();
mockPurgeAuthservice.mockReset();
mockPurgeSSO.mockRejectedValue(new Error("SSO cleanup failed"));

const finalizerRemoved = await packageFinalizer(mockPackage);

expect(finalizerRemoved).toEqual(false);
expect(retryWithDelay).toHaveBeenCalled();
expect(mockCleanupNamespace).toHaveBeenCalled();
expect(mockPurgeAuthservice).toHaveBeenCalled();
expect(mockPurgeSSO).toHaveBeenCalled();
expect(mockPatchStatus).toHaveBeenCalledWith(
expect.objectContaining({
metadata: { name: "test-package", namespace: "test-namespace" },
status: { phase: Phase.RemovalFailed },
}),
);
expect(mockWriteEvent).toHaveBeenCalledWith(
expect.anything(),
expect.objectContaining({
reason: "RemovalFailed",
message: expect.stringContaining("SSO"),
}),
);
});
});
85 changes: 70 additions & 15 deletions src/pepr/operator/reconcilers/package-reconciler.ts
Original file line number Diff line number Diff line change
Expand Up @@ -17,6 +17,7 @@ import { Client } from "../controllers/keycloak/types";
import { podMonitor } from "../controllers/monitoring/pod-monitor";
import { serviceMonitor } from "../controllers/monitoring/service-monitor";
import { networkPolicies } from "../controllers/network/policies";
import { retryWithDelay } from "../controllers/utils";
import { Phase, UDSPackage } from "../crd";
import { migrate } from "../crd/migrate";

Expand Down Expand Up @@ -114,54 +115,108 @@ export async function packageReconciler(pkg: UDSPackage) {

/**
* The finalizer is called when an update with a deletion timestamp happens.
* On completion the finalizer is removed from the Package CR.
* This function removes any SSO/Authservice clients and ensures that Istio Injection is restored to the original state.
* Return values indicate whether the finalizer should be removed from the CR based on failure or success of cleanup.
*
* @param pkg the package to finalize
*/
export async function packageFinalizer(pkg: UDSPackage) {
// Skip running the finalizer if it is already running
if (pkg.status?.phase === Phase.Removing || pkg.status?.phase === Phase.RemovalFailed) {
// Trace log since this can be confusing when the finalizer hits quickly for the status patch
log.trace(
`Skipping finalizer for ${pkg.metadata?.namespace}/${pkg.metadata?.name}, removal already in progress or failed.`,
);
return false;
}

// Skip running the finalizer if the CR has not completed initial reconciliation - running this during reconciliation can lead to orphaned resources and failed cleanup
if (pkg.status?.phase !== Phase.Ready && pkg.status?.phase !== Phase.Failed) {
log.debug(
`Waiting to finalize package ${pkg.metadata?.namespace}/${pkg.metadata?.name}, package has not completed initial reconciliation.`,
);
return false;
}

log.debug(`Processing removal of package ${pkg.metadata?.namespace}/${pkg.metadata?.name}`);

// In order to avoid triggering a second call of this finalizer, we just write events for each removal piece
// This could be switched to updateStatus once https://github.com/defenseunicorns/pepr/issues/1316 is resolved
// await updateStatus(pkg, { phase: Phase.Removing });
// Update Package to indicate removal in progress
await updateStatus(pkg, { phase: Phase.Removing });

// Cleanup Istio status on namespace
try {
await writeEvent(pkg, {
message: `Restoring original Istio injection status on namespace`,
reason: "RemovalInProgress",
type: "Normal",
});
// Cleanup the namespace - retry on failure
await retryWithDelay(async function cleanupIstioConfig() {
return cleanupNamespace(pkg);
}, log);
} catch (e) {
log.debug(
`Restoration of Istio injection status during finalizer failed for ${pkg.metadata?.namespace}/${pkg.metadata?.name}: ${e.message}`,
);
await writeEvent(pkg, {
message: `Restoration of Istio injection status failed: ${e.message}. Istio status must be manually restored, by updating or deleting the istio-injection label and cycling pods.`,
reason: "RemovalFailed",
type: "Warning",
});
await updateStatus(pkg, { phase: Phase.RemovalFailed });
return false;
}

// Cleanup AuthService Clients
try {
await writeEvent(pkg, {
message: `Restoring original istio injection status on namespace`,
message: `Removing AuthService configuration for package`,
reason: "RemovalInProgress",
type: "Normal",
});
// Cleanup the namespace
await cleanupNamespace(pkg);
// Remove any Authservice configuration - retry on failure
await retryWithDelay(async function cleanupAuthserviceConfig() {
return purgeAuthserviceClients(pkg, []);
}, log);
} catch (e) {
log.debug(
`Restoration of istio injection status during finalizer failed for ${pkg.metadata?.namespace}/${pkg.metadata?.name}: ${e.message}`,
`Removal of AuthService configuration during finalizer failed for ${pkg.metadata?.namespace}/${pkg.metadata?.name}: ${e.message}`,
);
await writeEvent(pkg, {
message: `Restoration of istio injection status failed: ${e.message}`,
message: `Removal of AuthService configuration failed: ${e.message}. AuthService configuration secret should be reviewed and cleaned up as needed.`,
reason: "RemovalFailed",
type: "Warning",
});
await updateStatus(pkg, { phase: Phase.RemovalFailed });
return false;
}

// Cleanup SSO Clients
try {
await writeEvent(pkg, {
message: `Removing SSO / AuthService clients for package`,
message: `Removing SSO clients for package`,
reason: "RemovalInProgress",
type: "Normal",
});
// Remove any SSO clients
await purgeSSOClients(pkg, []);
await purgeAuthserviceClients(pkg, []);
// Remove any SSO clients - retry on failure
await retryWithDelay(async function cleanupSSOClients() {
return purgeSSOClients(pkg, []);
}, log);
} catch (e) {
log.debug(
`Removal of SSO / AuthService clients during finalizer failed for ${pkg.metadata?.namespace}/${pkg.metadata?.name}: ${e.message}`,
`Removal of SSO clients during finalizer failed for ${pkg.metadata?.namespace}/${pkg.metadata?.name}: ${e.message}`,
);
await writeEvent(pkg, {
message: `Removal of SSO / AuthService clients failed: ${e.message}`,
message: `Removal of SSO clients failed: ${e.message}. Clients must be manually removed from Keycloak.`,
reason: "RemovalFailed",
type: "Warning",
});
await updateStatus(pkg, { phase: Phase.RemovalFailed });
return false;
}

// Indicate success - all other resources (network policies, virtual services, etc) are cleaned up through owner references
// See https://kubernetes.io/docs/concepts/overview/working-with-objects/owners-dependents/#ownership-and-finalizers
log.debug(`Package ${pkg.metadata?.namespace}/${pkg.metadata?.name} removed successfully`);
return true;
}

0 comments on commit a99b408

Please # to comment.