
Azure CLI Login Method Does Not Work Properly with AAD Enabled AKS Clusters and Multiple Deployment Service Principals #104

Closed
kkizer opened this issue Jul 17, 2022 · 19 comments

Comments

@kkizer

kkizer commented Jul 17, 2022

Azure CLI Login Method Does Not Work Properly with AAD Enabled AKS Clusters and Multiple Deployment Service Principals

Overview

We are deploying applications to Azure Active Directory (AAD) enabled clusters and are required to use kubelogin. Since these are automated deployment pipelines we cannot have interactive logins. We deploy using Azure service principals so we need to use the Azure CLI login method in kubelogin.

The Azure CLI login method does not work properly with AAD enabled AKS clusters when more than one service principal is used by Azure DevOps deployment pipelines on the same build agent. Each pipeline produces the same token cache file name even though the pipelines use different service principals. This causes helm and kubectl commands to fail with authorization errors.

This problem became an issue when some deployment pipelines were using kubelogin v0.0.14 and others were using v0.0.10.

Scenario

  1. There are two different Azure Active Directory (AAD) enabled AKS clusters.
  2. Each AAD enabled cluster has its own service principal to deploy applications to the cluster. This is required since the clusters are for different applications and permissions must be assigned appropriately.
  3. For Azure DevOps deployment pipelines, the cluster configuration file is retrieved for the cluster using the az aks get-credentials Azure CLI command. The configuration needs to be converted to use Azure CLI logins (non-interactive) as documented here.
  4. An Azure DevOps deployment pipeline is executed for cluster 1 on a given build agent. This runs using its deployment service principal (SP) SP 1 and produces a token cache file in ~/.kube/cache/kubelogin with a file name like AzurePublicCloud-6dae42f8-4368-4678-94ff-3960e28e3630--.json. NOTE that the last two tokens in the name are BLANK.
  5. An Azure DevOps deployment pipeline is executed for cluster 2 on the same build agent soon after the first pipeline completes. This runs using its deployment service principal (SP) SP 2 and produces a token cache file in ~/.kube/cache/kubelogin with the same file name as the one generated for the first pipeline.

This causes pipeline 2 to fail with authorization errors since the access token in the cache file is for the wrong service principal.

Source and cluster config info

Source code to generate token file

Cache token file name is generated here:

cacheFile := getCacheFileName(o.Environment, o.ServerID, o.ClientID, o.TenantID, o.IsLegacy)

And you see the name is formed as follows:

func getCacheFileName(environment, serverID, clientID, tenantID string, legacy bool) string {
	// format: ${environment}-${server-id}-${client-id}-${tenant-id}[_legacy].json
	cacheFileNameFormat := "%s-%s-%s-%s.json"
	if legacy {
		cacheFileNameFormat = "%s-%s-%s-%s_legacy.json"
	}
	return fmt.Sprintf(cacheFileNameFormat, environment, serverID, clientID, tenantID)
}
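
For illustration, here is a minimal standalone sketch (not kubelogin code, just the same format string) showing what the name resolves to when clientID and tenantID are empty:

package main

import "fmt"

func main() {
	// format: ${environment}-${server-id}-${client-id}-${tenant-id}.json
	environment := "AzurePublicCloud"
	serverID := "6dae42f8-4368-4678-94ff-3960e28e3630"
	clientID := "" // not present in the azurecli-converted kubeconfig, so it stays empty
	tenantID := "" // same here
	fmt.Printf("%s-%s-%s-%s.json\n", environment, serverID, clientID, tenantID)
	// Output: AzurePublicCloud-6dae42f8-4368-4678-94ff-3960e28e3630--.json
}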

The last two tokens in the generated file name are blank, so the ClientID and TenantID must not be set. The user information in the original cluster configuration is:

users:
- name: clusterUser_RG_KUB11614NP01_AZAMERES01KUB11614NP01
  user:
    auth-provider:
      config:
        apiserver-id: 6dae42f8-4368-4678-94ff-3960e28e3630
        client-id: 80faf920-1908-4b52-b5ef-a8e7bedfc67a
        config-mode: '1'
        environment: AzurePublicCloud
        tenant-id: ca56a4a5-e300-406a-98ff-7e36a0baac5b
      name: azure

And after kubelogin convert-kubeconfig -l azurecli it is (Azure DevOps is masking the server-id value using ***):

- name: clusterUser_RG_KUB11614NP01_AZAMERES01KUB11614NP01
  user:
    exec:
      apiVersion: client.authentication.k8s.io/v1beta1
      args:
      - get-token
      - --server-id
      - ***
      - --login
      - azurecli
      command: kubelogin
      env: null
      provideClusterInfo: false

This would explain why the clientID and tenantID are blank in the generated token cache file name: neither is passed to kubelogin as an argument.

Changes Between kubelogin v0.0.10 and v0.0.14

It doesn't look like the code that generates the token cache file name has changed between v0.0.10 and v0.0.14, so I'm not sure how this ever worked for us. Perhaps we were lucky and builds ran on different build agents, so the name collision wasn't an issue. More likely the token expired and a new one was created (and the cache file updated), so the issue was hidden.

However, the contents of the file DID change.

In v0.0.10, the resource property in the token file is blank:

"resource":""

While in v0.0.14, it has a value:

"resource":"6dae42f8-4368-4678-94ff-3960e28e3630"

It looks like v0.0.14 puts the AAD server ID (the apiserver-id) in the resource property. Either way, the access token cached in the file belongs to whichever service principal acquired it, so two different pipelines using different service principals cannot share the same kubelogin token cache file!

We need to update to the latest version of kubelogin, especially since Kubernetes 1.24 requires the use of kubelogin with AAD enabled clusters. The current version of kubelogin (v0.0.14) is also required since it handles the new cluster configuration file generated starting with Kubernetes 1.24. We cannot stay on older versions of kubelogin (nor should we, so that we keep up with bug and security fixes).

Possible Workaround?

The kubelogin docs do not mention this (though kubelogin convert-kubeconfig --help does), but the kubelogin executable allows you to specify --client-id <client ID>. In my test YAML pipeline, the convert-kubeconfig -l azurecli command runs in an Azure CLI task. I made the following changes:

  1. Added addSpnToEnvironment: true to the task properties.
  2. Changed the command to: kubelogin convert-kubeconfig -l azurecli --client-id $env:servicePrincipalId
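
For reference, here is a sketch of how that Azure CLI task might look (the service connection name and the $(resourceGroup)/$(clusterName) pipeline variables are placeholders, not the exact values from our pipeline):

- task: AzureCLI@2
  inputs:
    azureSubscription: my-service-connection   # placeholder service connection name
    scriptType: pscore
    scriptLocation: inlineScript
    addSpnToEnvironment: true                  # exposes $env:servicePrincipalId to the script
    inlineScript: |
      az aks get-credentials --resource-group $(resourceGroup) --name $(clusterName)
      kubelogin convert-kubeconfig -l azurecli --client-id $env:servicePrincipalId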

This results in the following in the converted cluster config file (Azure DevOps masks the service principal id in log output. The *** is the service principal id):

users:
- name: clusterUser_RG_KUB11614NP01_AZAMERES01KUB11614NP01
  user:
    exec:
      apiVersion: client.authentication.k8s.io/v1beta1
      args:
      - get-token
      - --server-id
      - 6dae42f8-4368-4678-94ff-3960e28e3630
      - --client-id
      - ***
      - --login
      - azurecli
      command: kubelogin
      env: null
      provideClusterInfo: false

And I get file names like AzurePublicCloud-6dae42f8-4368-4678-94ff-3960e28e3630-***-.json (*** is the service principal ID; Azure DevOps masks it in log output). So the clientID value is now set in the file name.

Kubelogin also supports -t <tenant ID> but that did not get written to the converted cluster config file.

Is this an appropriate solution to this issue?

kubelogin v0.0.14 Does Not Honor the --client-id Flag with Kubernetes 1.24 Cluster Config Files

With Kubernetes 1.24, kubelogin v0.0.14 is also broken. The cluster config file you get is:

users:
- name: clusterUser_RG_KUB11614NP01_AZAMERES01KUB11614NP01
  user:
    exec:
      apiVersion: client.authentication.k8s.io/v1beta1
      args:
      - get-token
      - --environment
      - AzurePublicCloud
      - --server-id
      - 6dae42f8-4368-4678-94ff-3960e28e3630
      - --client-id
      - 80faf920-1908-4b52-b5ef-a8e7bedfc67a
      - --tenant-id
      - ca56a4a5-e300-406a-98ff-7e36a0baac5b
      - --login
      - devicecode
      command: kubelogin
      env: null

and if you convert it to use the Azure CLI login with our workaround from before:
kubelogin convert-kubeconfig -l azurecli --client-id $env:servicePrincipalId

The converted file no longer contains the client-id argument:

- name: clusterUser_RG_KUB11614NP01_AZAMERES01KUB11614NP01
  user:
    exec:
      apiVersion: client.authentication.k8s.io/v1beta1
      args:
      - get-token
      - --server-id
      - 6dae42f8-4368-4678-94ff-3960e28e3630
      - --login
      - azurecli
      command: kubelogin
      env: null
      provideClusterInfo: false

AND YOU GET DUPLICATE FILE NAMES AGAIN!

Conclusion

The kubelogin code needs to generate unique token cache file names when using the Azure CLI login method. It cannot assume that the same service principal is always being used.

Kevin Kizer
Lead Engineer
MetLife

@kkizer
Author

kkizer commented Jul 18, 2022

It looks like the --token-cache-dir option should do what we need, but kubelogin doesn't accept it as a valid option. In our Azure DevOps pipelines, we could set it to $(Agent.TempDirectory).

@kkizer
Author

kkizer commented Jul 18, 2022

It looks like the --token-cache-dir option IS on the kubelogin get-token command: kubelogin get-token --token-cache-dir=/mnt/src/test/

So we would have to edit the config file that kubelogin convert-kubeconfig generates to include the token-cache-dir value.
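
For example (a sketch of that manual edit, using the cache path from the command above), the exec section in the converted kubeconfig would need the extra flag added by hand:

      args:
      - get-token
      - --server-id
      - 6dae42f8-4368-4678-94ff-3960e28e3630
      - --login
      - azurecli
      - --token-cache-dir
      - /mnt/src/test/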

@weinong
Contributor

weinong commented Jul 18, 2022

@kkizer looks like I missed this arg when I refactored the conversion code from exec to exec... will fix it today

Oh interesting, we never had --token-cache-dir in the convert-kubeconfig sub-command. I can still add it, though.

@kkizer
Author

kkizer commented Jul 18, 2022

Yes please! Then we can specify that in our pipelines and not have to manually edit the converted cluster configuration file. When would a new kubelogin be available for me to try out? Asking because I may need to do a temporary hack and update the converted cluster configuration file. We have a lot of deployment pipelines stepping on each other right now.

@weinong
Contributor

weinong commented Jul 18, 2022

@kkizer working on it. should have a PR out soon

weinong added a commit that referenced this issue Jul 18, 2022
* added --token-cache-dir in convert-kubeconfig sub-command. Addresses #104
@weinong
Contributor

weinong commented Jul 18, 2022

@kkizer do you want to give it a try before I cut the release?

@kkizer
Author

kkizer commented Jul 19, 2022

Thank you very much for your quick support on this issue. I am working on getting it tested in our environment today. I will post a comment with test results.

@kkizer
Author

kkizer commented Jul 19, 2022

The fix does exactly what we need.

In my Azure DevOps pipeline I get the cluster configuration using az aks get-credentials and convert it using:
kubelogin convert-kubeconfig -l azurecli --token-cache-dir $(Agent.TempDirectory)

and the cluster configuration file contains:

- name: clusterUser_RG_KUB11614NP01_AZAMERES01KUB11614NP01
  user:
    exec:
      apiVersion: client.authentication.k8s.io/v1beta1
      args:
      - get-token
      - --login
      - azurecli
      - --server-id
      - 6dae42f8-4368-4678-94ff-3960e28e3630
      - --token-cache-dir
      - /AzDOAgents/Agent1/_work/_temp
      command: kubelogin
      env: null
      provideClusterInfo: false

It has the token-cache-dir argument!

We still need to test the change with multiple builds on the same build machine where each build uses a different service principal. I don't expect any issues since each build has its own agent temp directory.

@weinong weinong closed this as completed Jul 20, 2022
@weinong weinong reopened this Jul 20, 2022
@weinong
Contributor

weinong commented Jul 20, 2022

actually v0.0.16 introduced a bug such that --token-cache-dir passed to get-token would not be effective as reported in #108

@weinong
Contributor

weinong commented Jul 23, 2022

@kkizer you should upgrade to v0.0.17 or higher. the --token-cache-dir wasn't really being used in actual token acquisition. I think it worked out for you as your build workspace is different

@weinong weinong closed this as completed Jul 23, 2022
@kkizer
Author

kkizer commented Oct 11, 2022 via email

@kkizer
Author

kkizer commented Oct 11, 2022 via email

@weinong
Contributor

weinong commented Oct 11, 2022

@kkizer the latest v0.0.20 has this change

@mbaykara

mbaykara commented Mar 5, 2023

Hey @kkizer what is the --server-id? Where can I get this?

@weinong
Contributor

weinong commented Mar 6, 2023

@mbaykara https://github.com/Azure/kubelogin#exec-plugin-format

Note: The AAD server app ID of AKS Managed AAD is always 6dae42f8-4368-4678-94ff-3960e28e3630 in any environment.

@mbaykara

mbaykara commented Mar 7, 2023

The AAD server app ID of AKS Managed AAD is always 6dae42f8-4368-4678-94ff-3960e28e3630 in any environment.

I am not using an Azure managed cluster, I am using another provider actually. Will it not work then? @weinong

@weinong
Contributor

weinong commented Mar 7, 2023

@mbaykara what do you mean by another provider? If you know what's being set up, what exactly are you asking for?

Hey @kkizer what is the --server-id? Where can I get this?

@mbaykara

mbaykara commented Mar 7, 2023

Another public cloud provider, i.e. GKE @weinong

@weinong
Contributor

weinong commented Mar 22, 2023

@mbaykara the --server-id should be the AAD application ID you configure your Kubernetes cluster with. You can refer to https://github.com/Azure/kubelogin#setup-for-kubernetes-oidc-provider-using-azure-ad
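
For a non-AKS cluster, that generally maps to the API server's OIDC flags. A rough sketch, assuming an AAD app registration and the Azure AD v1 issuer URL (your issuer and claims may differ):

# the AAD server application ID below is the same value you would pass to kubelogin as --server-id
kube-apiserver \
  --oidc-issuer-url=https://sts.windows.net/<tenant-id>/ \
  --oidc-client-id=<AAD server application ID> \
  --oidc-username-claim=upn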
