
Support User Namespaces in pods #127

Open
31 of 38 tasks
derekwaynecarr opened this issue Oct 10, 2016 · 265 comments
Assignees
Labels
kind/api-change Categorizes issue or PR as related to adding, removing, or otherwise changing an API.
kind/feature Categorizes issue or PR as related to a new feature.
lead-opted-in Denotes that an issue has been opted in to a release.
sig/node Categorizes an issue or PR as relevant to SIG Node.
stage/beta Denotes an issue tracking an enhancement targeted for Beta status.
tracked/yes Denotes an enhancement issue is actively being tracked by the Release Team.
Milestone

Comments

@derekwaynecarr
Member

derekwaynecarr commented Oct 10, 2016

Enhancement Description

Please keep this description up to date. This will help the Enhancement Team to track the evolution of the enhancement efficiently.

@derekwaynecarr
Member Author

This work is being done by @pweil- and reviewed by @derekwaynecarr; it is sponsored by @kubernetes/sig-node.

@idvoretskyi idvoretskyi modified the milestone: v1.5 Oct 11, 2016
@idvoretskyi idvoretskyi added the sig/node Categorizes an issue or PR as relevant to SIG Node. label Oct 12, 2016
@mdshuai
Contributor

mdshuai commented Oct 26, 2016

@derekwaynecarr Could you help create a user story card for this feature?

@idvoretskyi
Member

@derekwaynecarr can you confirm that this feature targets alpha for 1.5?

@pweil-

pweil- commented Nov 16, 2016

@derekwaynecarr can you confirm that this feature targets alpha for 1.5?

Yes, this feature is experimental only so it would be considered alpha.

@idvoretskyi
Member

@derekwaynecarr @pweil- can you confirm that this item targets beta in 1.6?

@adelton

adelton commented Nov 14, 2017

@derekwaynecarr, the proposal kubernetes/kubernetes#34569 was closed by the bot due to inactivity.

@pweil-, in kubernetes/kubernetes#34569 (comment) you've proposed the approach pweil-/kubernetes@16f29eb, which changes the group of /var/lib/kubelet/pods to the remapped root group. Do I understand correctly that this is currently not tracked in any pull request?

@adelton

adelton commented Nov 14, 2017

@pweil-, I also wonder whether, similar to Docker's /var/lib/docker/<uid>.<gid> approach when --userns-remap is used, it might make sense to use /var/lib/kubelet/pods-<uid>.<gid> and simply chown/chgrp everything in those subdirectories to the remapped <uid>.<gid>. Why did you opt for just the chgrp and not the full chown?
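
For context, with --userns-remap Docker keys its data directory by the remapped IDs, so the layout looks roughly like this (illustrative IDs; the actual range comes from /etc/subuid and /etc/subgid):

$ ls -d /var/lib/docker/100000.100000
/var/lib/docker/100000.100000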

@pweil-

pweil- commented Nov 14, 2017

@adelton in the end, I think having this be transparent to Kubernetes is the right approach, whether that is something like shiftfs or an implementation in the CRI (moby/moby#28593). You are correct that my existing proposal is no longer tracked in an open PR.

The reasoning behind using chgrp was to follow our fsGroup strategy, where we just ensure group access instead of UID access.
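
For illustration, a minimal sketch of what "ensure group access instead of UID access" means in commands; the path and group ID are placeholders, not what the kubelet literally runs:

# chgrp -R 100000 /var/lib/kubelet/pods/<pod-uid>/volumes
# chmod -R g+rwX /var/lib/kubelet/pods/<pod-uid>/volumes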

@adelton

adelton commented Nov 14, 2017

Thanks @pweil-.

When you say transparent, do you mean that nothing should need to be added to the code or configuration on Kubernetes' side to allow running under Docker with userns-remap?

As for the fsgroup strategy, do you mean https://kubernetes.io/docs/concepts/policy/pod-security-policy/#fsgroup or some generic methodology within Kubernetes?

I have now filed kubernetes/kubernetes#55707 as an alternative approach where I make the remapped uid/gid an explicit option, and use those values to chown/chgrp the necessary directories.

@pweil-

pweil- commented Nov 14, 2017

When you say transparent, do you mean that nothing should need to be added to the code or configuration on Kubernetes' side to allow running under Docker with userns-remap?

that would be ideal. Whether that is feasible (or more likely, feasible in an acceptable time frame) is another question 😄

As for the fsgroup strategy, do you mean https://kubernetes.io/docs/concepts/policy/pod-security-policy/#fsgroup or some generic methodology within Kubernetes?

Yes

I have now filed kubernetes/kubernetes#55707 as an alternative approach where I make the remapped uid/gid an explicit option, and use those values to chown/chgrp the necessary directories.

👍 subscribed

@adelton

adelton commented Nov 14, 2017

When you say transparent, do you mean that nothing should need to be added to the code or configuration on Kubernetes' side to allow running under Docker with userns-remap?

that would be ideal. Whether that is feasible (or more likely, feasible in an acceptable time frame) is another question

Ideally, the pod would specify how many distinct uids/gids it requires, or the list of uids it wants to see inside the containers, and Docker or a different container runtime would set up the user namespace accordingly. But unless Docker also changes the ownership of the volumes mounted into the containers, Kubernetes will have to do that as part of the setup.

@adelton

adelton commented Dec 7, 2017

@pweil-, what is the best way to get some review and comments on kubernetes/kubernetes#55707, to get it closer to a mergeable state?

@0xmichalis

@pweil- ^

@pweil-

pweil- commented Dec 7, 2017

@adelton I would try to engage the sig-node folks either at their Tuesday meeting or on slack: https://github.com/kubernetes/community/tree/master/sig-node

@adelton

adelton commented Dec 18, 2017

@derekwaynecarr, could you please bring kubernetes/kubernetes#55707 to sig-node's radar?

@idvoretskyi
Member

@pweil- @derekwaynecarr is any progress expected on this feature?

@k8s-ci-robot
Contributor

@kannon92: The provided milestone is not valid for this repository. Milestones in this repository: [v1.25, v1.27, v1.28, v1.29, v1.30, v1.31, v1.32, v1.33, v1.34]

Use /milestone clear to clear the milestone.

In response to this:

/milestone v.133

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@kannon92
Contributor

kannon92 commented Feb 5, 2025

/milestone v.1.33

@k8s-ci-robot
Contributor

@kannon92: The provided milestone is not valid for this repository. Milestones in this repository: [v1.25, v1.27, v1.28, v1.29, v1.30, v1.31, v1.32, v1.33, v1.34]

Use /milestone clear to clear the milestone.

In response to this:

/milestone v.1.33

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@lzung

lzung commented Feb 5, 2025

/milestone v1.33

Hello @rata @giuseppe 👋, v1.33 Enhancements team here.

Just checking in as we approach enhancements freeze on 02:00 UTC Friday 14th February 2025 / 19:00 PDT Thursday 13th February 2025.

This enhancement is targeting stage beta for v1.33 (correct me if otherwise).
/stage beta

Here's where this enhancement currently stands:

  • KEP readme using the latest template has been merged into the k/enhancements repo.
  • KEP status is marked as implementable for latest-milestone: v1.33.
  • KEP readme has up-to-date graduation criteria
  • KEP has a production readiness review that has been completed and merged into k/enhancements. (For more information on the PRR process, check here). If your production readiness review is not completed yet, please make sure to fill the production readiness questionnaire in your KEP by the PRR Freeze deadline on Thursday 6th February 2025 so that the PRR team has enough time to review your KEP.

For this KEP, we would just need to update the following:

The status of this enhancement is marked as At risk for enhancements freeze. Please keep the issue description up-to-date with appropriate stages as well.

If you anticipate missing enhancements freeze, you can file an exception request in advance. Thank you!

@lzung lzung moved this to At risk for enhancements freeze in 1.33 Enhancements Tracking Feb 5, 2025
@rata
Member

rata commented Feb 6, 2025

@lzung the KEP is already at beta. Currently it is beta disabled by default; we are aiming for beta enabled by default. Do we need to do all of that within this tight deadline, or just when we migrate to GA (with a lot of other things)?

@kannon92
Contributor

kannon92 commented Feb 6, 2025

@lzung the KEP is already at beta. Currently it is beta disabled by default; we are aiming for beta enabled by default. Do we need to do all of that within this tight deadline, or just when we migrate to GA (with a lot of other things)?

Yes, please follow the template.

All you need to do is update the KEP to the latest template.

@rata
Member

rata commented Feb 6, 2025

@kannon92 oh, I missed that it was way simpler than I expected. Thanks for pointing it out! :)

PR in place to update to the latest KEP template: #5141

@haircommander haircommander moved this from Triage to Proposed for consideration in SIG Node 1.33 KEPs planning Feb 6, 2025
@gnufied
Member

gnufied commented Feb 7, 2025

sig-storage has some valid concerns around the persistent-volume restrictions that were dropped here - https://github.com/kubernetes/enhancements/pull/4084/files

See previous discussion around - kubernetes/kubernetes#111090 (comment)

It is not clear whether the idmap limitation of 1:1 mapping of GIDs has been addressed in the Linux kernel, or whether idmap mounts can squash multiple GIDs when creating a mapping.

cc @dobsonj @msau42 @jsafrane @liggitt

@pacoxu pacoxu moved this from Proposed for consideration to Tracked in SIG Node 1.33 KEPs planning Feb 8, 2025
@rata
Member

rata commented Feb 10, 2025

@gnufied Thanks for taking a look!

The comment you link to is outdated and no longer valid. The phase I design was completely changed, and those comments don't apply anymore. Let me elaborate.

The phase I PR, when I opened it (I changed it before merging, more on that later), only supported secrets/configmaps and emptyDir volumes, and did so by changing the permissions of those. After a meeting with you, we changed it to use fsGroup and then we merged it for 1.25. The comment you made was relevant to that fsGroup design.

In 1.27 we completely changed the design; we are not using fsGroup anymore and just rely on idmap mounts. I don't know which issue you are referring to now, so I can't answer your concern, but I'll add some info that might be relevant. Please, though, explain what issue you see now with the new design.

You are the storage expert, so correct me if some CSI drivers do something differently. IIUC a CSI driver does:

mount -o gid=100 XXX /some/path/mount

After the CSI driver does the mount, when the kubelet creates a container it just sends a request over CRI to create a bind-mount of /some/path/mount to a location inside the container rootfs. When user namespaces are in use, this creates an idmapped bind-mount for that path.

So the GID squashing is not an issue: it is done by the "device-mount" call (the first mount, done by the CSI driver), and the idmap bind-mount (the second mount, done by runc) just creates a mapping on top of that. There is no need to squash anything and it all works fine. For example:

# mount -o gid=100 /dev/sda3 mount/
# ls -ln mount/
total 64
drwxr-xr-x 2 0 100 32768 Jan 29 15:53  print
drwxr-xr-x 2 0 100 32768 Nov  8 18:08 'System Volume Information'
# mount -o X-mount.idmap=b:100:200:1 --bind mount/ idmap-bind-mount/
# ls -ln idmap-bind-mount/
total 64
drwxr-xr-x 2 65534 200 32768 Jan 29 15:53  print
drwxr-xr-x 2 65534 200 32768 Nov  8 18:08 'System Volume Information'

So, as you see, if the CSI driver does the -o gid when mounting the device, then the idmap mount later will work fine. The squashing is done at device-mount time, not at bind-mount time.

But is there some CSI driver not doing that? Do you have any example I can repro with local-up-cluster.sh?

@giuseppe
Member

sig-storage has some valid concerns around the persistent-volume restrictions that were dropped here - https://github.com/kubernetes/enhancements/pull/4084/files

See previous discussion around - kubernetes/kubernetes#111090 (comment)

It is not clear whether the idmap limitation of 1:1 mapping of GIDs has been addressed in the Linux kernel, or whether idmap mounts can squash multiple GIDs when creating a mapping.

the user namespace mapping is a bijective function; each ID in the user namespace can be mapped only to an ID outside the user namespace. There is nothing like squashing of IDs in the kernel.

cc @dobsonj @msau42 @jsafrane @liggitt

@sftim
Contributor

sftim commented Feb 10, 2025

the user namespace mapping is a bijective function; each ID in the user namespace can be mapped only to an ID outside the user namespace. There is nothing like squashing of IDs in the kernel.

It's not bijective. You can map a subset of the outer UID range into the user namespace. It's true that each ID in the user namespace can be mapped only to an ID outside the user namespace, but it's actually an injective relationship in terms of set theory.
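
As a concrete illustration (the numbers are made up), a single-line /proc/<pid>/uid_map maps only a subset of host IDs into the namespace:

# cat /proc/<container-pid>/uid_map
         0     100000      65536

Host IDs outside 100000-165535 have no representation inside the namespace and show up there as the overflow ID 65534 ("nobody").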

@dipesh-rawat
Member

Hi @rata @giuseppe 👋, 1.33 Enhancements team here,

Just a quick friendly reminder as we approach the enhancements freeze later this week, at 02:00 UTC Friday 14th February 2025 / 19:00 PDT Thursday 13th February 2025.

The current status of this enhancement is marked as At risk for enhancement freeze. There are a few requirements mentioned in the comment #127 (comment) that still need to be completed.

If you anticipate missing enhancements freeze, you can file an exception request in advance. Thank you!

@gnufied
Member

gnufied commented Feb 10, 2025

But is there some CSI driver not doing that? Do you have any example I can repro with local-up-cluster.sh?

Not all CSI drivers use -o gid on the first mount; in fact most don't. Only CSI drivers with the VOLUME_MOUNT_GROUP capability do that. For other volume types, the entire volume gets recursively chowned/chmodded (using the given fsGroup) before the bind mount can be prepared by the CRI.
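
For reference, that recursive fsGroup pass is roughly equivalent to the following sketch (illustrative path and an fsGroup of 9999; the real logic lives in the kubelet's volume ownership code):

# chown -R :9999 /var/lib/kubelet/pods/<pod-uid>/volumes/<plugin>/<volume>/mount
# chmod -R g+rwX /var/lib/kubelet/pods/<pod-uid>/volumes/<plugin>/<volume>/mount
# find /var/lib/kubelet/pods/<pod-uid>/volumes/<plugin>/<volume>/mount -type d -exec chmod g+s {} +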

But certain brownfield volumes, if they are using the OnRootMismatch policy, may not have uniform GID-based ownership, either via a recursive chown/chmod call or via mount -o gid. Will this be a problem?

I have been testing the user namespace feature on a single-node Fedora 41 VM with everything at the latest version, and for persistent volumes what I have found is that no fsGroup-based idmap is specified when bind mounts are created. @haircommander helped me understand the feature a bit more. If I print /run/containers/storage/overlay-containers/78407c092388b897c4d73b3e349b150bf31c024233c00b79addfa20cdb2c57db/userdata/config.json, I see the following mapping:

                {
                        "destination": "/mnt/test",
                        "type": "bind",
                        "source": "/var/lib/kubelet/pods/0954e329-63f9-4448-ba2a-c47eb8d111b6/volumes/kubernetes.io~csi/pvc-0cdb10fb-7907-499f-9bfc-eeb90a51143c/mount",
                        "options": [
                                "rbind",
                                "rprivate",
                                "rw",
                                "bind"
                        ],
                        "uidMappings": [
                                {
                                        "containerID": 0,
                                        "hostID": 3007119360,
                                        "size": 65536
                                }
                        ],
                        "gidMappings": [
                                {
                                        "containerID": 0,
                                        "hostID": 3007119360,
                                        "size": 65536
                                }
                        ]

Here the pod is using runAsUser: 9999 and fsGroup: 9999. The volume on the host has GID 9999, and it also shows up as 9999 inside the container. It is as if the container was not using user namespaces at all. :-) I thought gidMappings would actually specify a mapping containing something like 9999->9999, but it doesn't. Is this expected?

@rata
Member

rata commented Feb 11, 2025

@gnufied

But certain brownfield volumes

What do you mean by brownfield volumes? What is that? An empty persistent volume?

But certain brownfield volumes, if they are using the OnRootMismatch policy, may not have uniform GID-based ownership, either via a recursive chown/chmod call or via mount -o gid. Will this be a problem?

No. As I showed in the previous comment, if the first mount is done with -o gid it should just work. If there is a way to test this locally with a CSI driver that does this, please let me know how, or go ahead and test it. But as I showed in the code example in my previous comment, this works just fine. The recursive chown is not an issue either; it is designed to work fine.

Here the pod is using runAsUser: 9999 and fsGroup: 9999. The volume on the host has GID 9999, and it also shows up as 9999 inside the container. It is as if the container was not using user namespaces at all. :-) I thought gidMappings would actually specify a mapping containing something like 9999->9999, but it doesn't. Is this expected?

Yes, exactly! We designed it so you can use persistent storage as if userns were not there. This is what idmap mounts give us (and some other stuff). You are running as user/group 3007119360 + 9999 on the host and can access the volume files just fine. You can even disable user namespaces, change the volume, and enable them again. It will all just work :-).
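
A quick sanity check of that arithmetic against the config.json fragment above (hostID base 3007119360, in-container ID 9999):

$ echo $((3007119360 + 9999))
3007129359

So on the host the pod's processes run as 3007129359, while inside the container both the process and the files on the idmapped volume appear as 9999.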

@sftim
Contributor

sftim commented Feb 11, 2025

Brownfield storage is when the underlying block storage is reused but scrubbed. It's analogous to brownfield land.

By contrast, you could delete the block level storage and provision a completely empty block volume on which you build a filesystem that you then mount. That's analogous to a greenfield build.

You can also apply this to NAS; for example, you take an existing file-based volume and delete it back to one root inode. Then you reuse that with a different purpose for a new set of Pods.

@gnufied
Member

gnufied commented Feb 11, 2025

Thanks @rata and @giuseppe for the explanations. I have tested the userns feature with different volume scenarios, such as using subPath (with a different fsGroup) and using brownfield volumes with different GID ownership, and it appears to work.

@rata
Member

rata commented Feb 11, 2025

@lzung @kannon92 @dipesh-rawat All those things were just addressed and the PR is merged! #5141 (comment)

@dipesh-rawat
Member

Hello @rata @giuseppe 👋, 1.33 Enhancements team here,

Now that PR #5141 has been merged, all the KEP requirements are in place and merged into k/enhancements, so this enhancement is all good for the upcoming enhancements freeze. 🚀

The status of this enhancement is marked as tracked for enhancement freeze. Please keep the issue description up-to-date with appropriate stages as well. Thank you!

(cc: @lzung)

/label tracked/yes

@k8s-ci-robot k8s-ci-robot added the tracked/yes Denotes an enhancement issue is actively being tracked by the Release Team label Feb 11, 2025
@dipesh-rawat dipesh-rawat moved this from At risk for enhancements freeze to Tracked for enhancements freeze in 1.33 Enhancements Tracking Feb 11, 2025
@rayandas
Member

Hello @rata @giuseppe 👋, v1.33 Docs Lead here.

Does the enhancement work planned for v1.33 require any new docs or modifications to existing docs?

If so, please follow the steps here to open a PR against dev-1.33 branch in the k/website repo. This PR can be just a placeholder at this time and must be created before Thursday 27th February 2025 18:00 PDT.

Also, take a look at Documenting for a release to familiarize yourself with the docs requirements for the release.

Thank you!

@rata
Member

rata commented Feb 13, 2025

@AkihiroSuda wanna handle the doc PR too? You can cc me so I review, if you want :)

@AkihiroSuda
Member

@AkihiroSuda wanna handle the doc PR too? You can cc me so I review, if you want :)

Yes, thanks

@aojea
Member

aojea commented Feb 14, 2025

@rata I've updated the description with those links so they don't get lost on the comments thread

@rata
Member

rata commented Feb 17, 2025

@aojea thanks! I can't edit the original issue description :)
