Support User Namespaces in pods #127
Comments
This work is being done by @pweil- and is reviewed by @derekwaynecarr; it is sponsored by @kubernetes/sig-node.
@derekwaynecarr Could you help create a user story card for this feature?
@derekwaynecarr can you confirm that this feature targets alpha for 1.5?
Yes, this feature is experimental only so it would be considered alpha.
@derekwaynecarr @pweil- can you confirm that this item targets beta in 1.6?
@derekwaynecarr, the proposal kubernetes/kubernetes#34569 was closed by the bot due to inactivity. @pweil-, in kubernetes/kubernetes#34569 (comment) you've proposed the approach pweil-/kubernetes@16f29eb which changes the group of …
@pweil-, I also wonder if, similar to docker's …
@adelton in the end, I think having this be transparent to Kubernetes is the right approach. Whether that be something like shiftfs or implementation in the CRI (moby/moby#28593). You are correct that my existing proposal is not currently tracked in an open PR anymore. The reasoning behind using the chgrp was to follow our …
Thanks @pweil-. When you say transparent, you mean that nothing should need to be added to the code or configuration on Kubernetes' side to allow running under docker with …. As for the …, I have now filed kubernetes/kubernetes#55707 as an alternative approach where I make the remapped uid/gid an explicit option, and use those values to chown/chgrp the necessary directories.
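The explicit-option approach described in kubernetes/kubernetes#55707 can be sketched roughly as follows. This is a hypothetical illustration, not the actual kubelet code; the remap base, function names, and paths are made-up values.

```python
# Hypothetical sketch of the explicit-remap approach described above
# (kubernetes/kubernetes#55707): treat the remapped uid/gid as an
# explicit option and chown volume directories to the host-side IDs.
# All names, numbers, and paths are illustrative, not actual kubelet code.
import os

REMAP_BASE = 100000  # assumed first host uid/gid of docker's remapped range


def host_owner(container_uid: int, container_gid: int,
               base: int = REMAP_BASE) -> tuple[int, int]:
    """Host-side owner a remapped container process needs on its files."""
    return base + container_uid, base + container_gid


uid, gid = host_owner(1000, 1000)
print(uid, gid)  # 101000 101000

# With sufficient privileges, the volume directory would then be re-owned:
# os.chown("/var/lib/kubelet/pods/<pod-uid>/volumes/<vol>", uid, gid)
```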
that would be ideal. Whether that is feasible (or more likely, feasible in an acceptable time frame) is another question 😄
Yes
👍 subscribed
Ideally, the pod would specify how many distinct uids/gids it requires / a list of uids it wants to see inside of the containers, and docker or a different container runtime would set up the user namespace accordingly. But unless docker also changes ownership of the volumes mounted to the containers, Kubernetes will have to do that as part of the setup.
@pweil-, what is the best way to get some review and comments on kubernetes/kubernetes#55707, to get it closer to a mergeable state?
@pweil- ^
@adelton I would try to engage the sig-node folks either at their Tuesday meeting or on slack: https://github.com/kubernetes/community/tree/master/sig-node
@derekwaynecarr, could you please bring kubernetes/kubernetes#55707 to sig-node's radar?
@pweil- @derekwaynecarr is any progress expected on this feature?
/milestone v.1.33
@kannon92: The provided milestone is not valid for this repository. Milestones in this repository: […] Use … In response to this: /milestone v.1.33
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.
/milestone v1.33
Hello @rata @giuseppe 👋, v1.33 Enhancements team here. Just checking in as we approach the enhancements freeze on 02:00 UTC Friday 14th February 2025 / 19:00 PDT Thursday 13th February 2025. This enhancement is targeting stage …. Here's where this enhancement currently stands:
For this KEP, we would just need to update the following:
The status of this enhancement is marked as …. If you anticipate missing enhancements freeze, you can file an exception request in advance. Thank you!
@lzung the KEP is already at beta. Currently it is beta disabled by default; we are aiming for beta enabled by default. Do we need to do all of that within this tight deadline, or just when we migrate to GA (with a lot of other things)?
Yes, please follow the template. All you need to do is update the KEP to the latest template.
sig-storage has some valid concerns around the removal of persistent-volume restrictions that were dropped here: https://github.com/kubernetes/enhancements/pull/4084/files See previous discussion around kubernetes/kubernetes#111090 (comment). It is not clear if idmap's limitations about 1:1 mapping of gids were addressed in the Linux kernel. It is not clear if idmap mounts can squash multiple gids when creating a mapping.
@gnufied Thanks for taking a look! The comment you link to is outdated and no longer valid. The phase I design was completely changed, and those comments don't apply anymore. Let me elaborate. The phase I PR, when I opened it (I changed it before merging, more on that later), only supported secrets/configmaps and emptydir volumes, and did so by changing the permissions of those. After a meeting with you, we changed it to use fsGroup and then we merged it for 1.25. The comment you made was relevant to that fsGroup design. In 1.27 we completely changed the design; we are not using fsGroup anymore and just rely on idmap mounts. I don't know which issue you are referring to now, so I can't answer your concern. But I'll add some info that might be relevant. Please, though, explain what issue you see now with the new design. You are the storage expert, so correct me if some CSI drivers do something differently. IIUC a CSI driver does:
After the CSI driver does the mount, when the kubelet creates a container it just sends a request over CRI to create a bind-mount of …. So the gid squashing is not an issue: that is done by the "device-mount call" (the first mount, done by the CSI driver); the idmap bind-mount (the second mount, done by runc) then just creates a mapping for that, so there is no need to squash anything and all works fine. For example:
So, as you see, if the CSI driver does the …. But is there some CSI driver not doing that? Do you have any example I can repro with local-up-cluster.sh?
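To make the two-mount flow above concrete, here is a hypothetical Python model (illustrative names and numbers only, not actual kubelet/runc code) composing the gid-squashing "device mount" with the idmapped bind-mount translation:

```python
# Hypothetical model of the two mounts discussed above. Illustrative only:
# the numbers and function names are made up, not kubelet/runc internals.
#
# Step 1: the CSI driver's "device mount" squashes every file owner to a
#         single host gid (think of a mount done with a gid= option).
# Step 2: runc's idmapped bind-mount translates that host gid into the
#         gid the process sees inside the pod's user namespace.

HOST_GID = 3007119360  # gid the first mount squashes everything to


def device_mount_view(stored_gid: int) -> int:
    # Whatever gid is recorded on the backing store, the gid-squashing
    # mount presents every file as owned by HOST_GID.
    return HOST_GID


def idmapped_view(host_gid: int, outside_start: int = 3007119360,
                  inside_start: int = 0, count: int = 65536) -> int:
    # The idmapped bind-mount maps [outside_start, outside_start + count)
    # on the host onto [inside_start, inside_start + count) in the pod.
    if outside_start <= host_gid < outside_start + count:
        return inside_start + (host_gid - outside_start)
    return 65534  # unmapped IDs show up as the overflow gid (nogroup)


# Whatever gids were on disk, the pod sees gid 0; nothing needs squashing
# at the idmap layer because the first mount already did it.
for stored in (0, 1000, 424242):
    assert idmapped_view(device_mount_view(stored)) == 0
```

The design point this illustrates is that squashing and mapping happen at different layers: only the already-squashed host gid ever reaches the idmapped mount.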
The user namespace mapping is a bijective function: each ID in the user namespace can be mapped only to an ID outside the user namespace. There is nothing like squashing of ids in the kernel.
It's not bijective. You can map a subset of the outer UID range into the user namespace. It's true that each ID in the user namespace can be mapped only to an ID outside the user namespace but it's actually an injective relationship in terms of set theory. |
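To illustrate the injective-but-not-bijective point, here is a hypothetical sketch modeling how a single /proc/&lt;pid&gt;/uid_map entry ("inside outside count") translates IDs; the numbers are made up, chosen to match the 3007119360-based example elsewhere in the thread:

```python
# Hypothetical sketch modeling a /proc/<pid>/uid_map entry
# ("inside outside count") to show why the mapping is injective but
# not bijective. The values are illustrative only.

def make_uid_map(entries):
    """entries: list of (inside_start, outside_start, count) tuples."""
    def to_host(inside_id):
        for inside, outside, count in entries:
            if inside <= inside_id < inside + count:
                return outside + (inside_id - inside)
        return None  # no mapping: the kernel shows the overflow ID instead
    return to_host


# A typical single-entry map: pod IDs 0..65535 -> host 3007119360..3007184895
to_host = make_uid_map([(0, 3007119360, 65536)])

assert to_host(0) == 3007119360
assert to_host(9999) == 3007129359
assert to_host(65536) is None  # outside the mapped range

# Injective: distinct inside IDs map to distinct host IDs, but most host
# IDs have no preimage, so the map is not a bijection onto all host IDs.
mapped = [to_host(i) for i in range(100)]
assert len(set(mapped)) == len(mapped)
```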
Hi @rata @giuseppe 👋, 1.33 Enhancements team here. Just a quick friendly reminder as we approach the enhancements freeze later this week, at 02:00 UTC Friday 14th February 2025 / 19:00 PDT Thursday 13th February 2025. The current status of this enhancement is marked as …. If you anticipate missing enhancements freeze, you can file an exception request in advance. Thank you!
Not all CSI drivers use …. But certain brownfield volumes, if they are using …, …. I have been testing the user namespaces feature on a single-node Fedora 41 VM, with everything latest, and for persistent volumes what I have found is that no fsGroup-based idmap is specified when bind mounts are created. @haircommander helped me a bit to understand the feature a bit more. If I print …
where the pod is using …
What do you mean by brownfield volumes? What is that? An empty persistent volume?
No. As I showed in the previous comment, if the first mount is done with …
Yes, exactly! We designed it so you can use persistent storage as if userns was not there. This is what idmap mounts give us (and some other stuff). You are running as user/group 3007119360 + 9999 on the host and can access the volume files just fine. You can even disable user namespaces, change the volume, and enable it again. It will all just work :-).
Brownfield storage is when the underlying block storage is reused but scrubbed. It's analogous to brownfield land. By contrast, you could delete the block level storage and provision a completely empty block volume on which you build a filesystem that you then mount. That's analogous to a greenfield build. You can also apply this to NAS; for example, you take an existing file-based volume and delete it back to one root inode. Then you reuse that with a different purpose for a new set of Pods. |
@lzung @kannon92 @dipesh-rawat All those things were just addressed and the PR is merged! #5141 (comment)
Hello @rata @giuseppe 👋, 1.33 Enhancements team here. Now that PR #5141 has been merged and all the KEP requirements are in place and merged into k/enhancements, this enhancement is all good for the upcoming enhancements freeze. 🚀 The status of this enhancement is marked as …. (cc: @lzung) /label tracked/yes
Hello @rata @giuseppe 👋, v1.33 Docs Lead here. Does the enhancement work planned for v1.33 require any new docs or modifications to existing docs? If so, please follow the steps here to open a PR against …. Also, take a look at Documenting for a release to familiarize yourself with the docs requirements for the release. Thank you!
@AkihiroSuda wanna handle the doc PR too? You can cc me so I review, if you want :)
Yes, thanks
@rata I've updated the description with those links so they don't get lost on the comments thread
@aojea thanks! I can't edit the original issue description :) |
Enhancement Description
- (k/enhancements) update PR(s):
- (k/k) update PR(s):
- (k/website) update PR(s):
- (k/enhancements) update PR(s):
- (k/k) update PR(s):
- (k/website) update PR(s):
- (k/enhancements) update PR(s):

Please keep this description up to date. This will help the Enhancement Team to track the evolution of the enhancement efficiently.