Skip to content

Latest commit

 

History

History
54 lines (32 loc) · 2.73 KB

cleanup.md

File metadata and controls

54 lines (32 loc) · 2.73 KB

Tear-down (clean up) all the resources - explained

We enforce the following specific order of destruction:

  • DeploymentPersistentVolumeClaim → EKS Cluster

This is the order required for a clean removal when using terraform destroy. This allows the cluster controllers to handle necessary cleanup operations.

  1. Remove the Deployment first to allow cluster controllers to properly delete the pods.
    • Pods must be deleted before the PersistentVolumeClaim; otherwise, the deletion process will hang.
  2. Remove the PersistentVolumeClaim while the cluster is still active to ensure controllers properly detach and delete the EBS volume.
  3. Then delete everything else.

What is going on here?

When a single terraform destroy command is used to destroy everything, destruction of the Deployment and ReplicaSet does not delete the pods. Some other resource needed by the cluster controllers (possibly a component of the VPC), is getting destroyed before the Deployment is.

This in turn prevents the PersistentVolumeClaim from deleting, because it is being used by the pods.

It is possible that a critical VPC component impacts communication between the Kubernetes control plane and the AWS control plane, or something similar.

This problem with terraform destroy can be solved by adding the following to the module "eks" block:

  depends_on = [ module.vpc ]

This forces destruction of the VPC to wait until after destruction of the EKS cluster.

But... this introduced new problems

  • It takes much longer to deploy resources with apply

  • Deploying resources with apply becomes unreliable.

    The change in dependency and timing introduces a new issue where Terraform attempts to create Kubernetes resources before the proper RBAC configurations are applied (e.g., ClusterRoles, RoleBindings). These resources then fail with errors like

    Error: serviceaccounts is forbidden: User "arn:aws:sts::12345678912:assumed-role/MyAdmin" cannot create resource "serviceaccounts" in API group "" in the namespace "default"
    

    After the apply failure, running the same exact apply command then succeeds, because by that time the RBAC have propagated.

  • Time to tear-down resources with destroy also seems longer

The goal of this repository is to show how to create these resources

For this repo, the focus is on education and simplicity in creating these resources; therefore, it will not use the depends_on fix.

Also this repo aims to show best practices, and in general it is a best practice to let Terraform determine dependency relationships.

How else might we handle this?

Using separate distinct Terraform configurations is the best way to address this issue.