
Support User Namespaces in pods #127

Open
31 of 38 tasks
derekwaynecarr opened this issue Oct 10, 2016 · 265 comments
Assignees
Labels
kind/api-change Categorizes issue or PR as related to adding, removing, or otherwise changing an API.
kind/feature Categorizes issue or PR as related to a new feature.
lead-opted-in Denotes that an issue has been opted in to a release.
sig/node Categorizes an issue or PR as relevant to SIG Node.
stage/beta Denotes an issue tracking an enhancement targeted for Beta status.
tracked/yes Denotes an enhancement issue is actively being tracked by the Release Team.
Milestone

Comments

@derekwaynecarr
Member

derekwaynecarr commented Oct 10, 2016

Enhancement Description

Please keep this description up to date. This will help the Enhancement Team to track the evolution of the enhancement efficiently.

@derekwaynecarr
Member Author

This work is being done by @pweil- and reviewed by @derekwaynecarr; it is sponsored by @kubernetes/sig-node.

@idvoretskyi idvoretskyi modified the milestone: v1.5 Oct 11, 2016
@idvoretskyi idvoretskyi added the sig/node Categorizes an issue or PR as relevant to SIG Node. label Oct 12, 2016
@mdshuai
Contributor

mdshuai commented Oct 26, 2016

@derekwaynecarr Could you help create a user story card for this feature?

@idvoretskyi
Member

@derekwaynecarr can you confirm that this feature targets alpha for 1.5?

@pweil-

pweil- commented Nov 16, 2016

@derekwaynecarr can you confirm that this feature targets alpha for 1.5?

Yes, this feature is experimental only so it would be considered alpha.

@idvoretskyi
Member

@derekwaynecarr @pweil- can you confirm that this item targets beta in 1.6?

@adelton

adelton commented Nov 14, 2017

@derekwaynecarr, the proposal kubernetes/kubernetes#34569 was closed by the bot due to inactivity.

@pweil-, in kubernetes/kubernetes#34569 (comment) you've proposed the approach pweil-/kubernetes@16f29eb, which changes the group of /var/lib/kubelet/pods to the remapped root group. Do I understand correctly that this is currently not tracked in any pull request?

@adelton

adelton commented Nov 14, 2017

@pweil-, I also wonder whether, similar to Docker's /var/lib/docker/<uid>.<gid> approach when --userns-remap is used, it might make sense to use /var/lib/kubelet/pods-<uid>.<gid> and simply chown/chgrp everything in those subdirectories to the remapped <uid>.<gid>. Why did you opt for just the chgrp and not the full chown?
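
For context, with --userns-remap Docker keys its data directory by the remapped IDs, so the layout looks roughly like this (illustrative IDs; the actual range comes from /etc/subuid and /etc/subgid):

$ ls -d /var/lib/docker/100000.100000
/var/lib/docker/100000.100000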

@pweil-

pweil- commented Nov 14, 2017

@adelton in the end, I think having this be transparent to Kubernetes is the right approach, whether that is something like shiftfs or an implementation in the CRI (moby/moby#28593). You are correct that my existing proposal is no longer tracked in an open PR.

The reasoning behind using chgrp was to follow our fsGroup strategy, where we just ensure group access instead of UID access.
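
For illustration, a minimal sketch of what "ensure group access instead of UID access" means in commands; the path and group ID are placeholders, not what the kubelet literally runs:

# chgrp -R 100000 /var/lib/kubelet/pods/<pod-uid>/volumes
# chmod -R g+rwX /var/lib/kubelet/pods/<pod-uid>/volumes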

@adelton

adelton commented Nov 14, 2017

Thanks @pweil-.

When you say transparent, do you mean that nothing should need to be added to the code or configuration on Kubernetes' side to allow running under Docker with userns-remap?

As for the fsgroup strategy, do you mean https://kubernetes.io/docs/concepts/policy/pod-security-policy/#fsgroup or some generic methodology within Kubernetes?

I have now filed kubernetes/kubernetes#55707 as an alternative approach where I make the remapped uid/gid an explicit option, and use those values to chown/chgrp the necessary directories.

@pweil-

pweil- commented Nov 14, 2017

When you say transparent, do you mean that nothing should need to be added to the code or configuration on Kubernetes' side to allow running under Docker with userns-remap?

that would be ideal. Whether that is feasible (or more likely, feasible in an acceptable time frame) is another question 😄

As for the fsgroup strategy, do you mean https://kubernetes.io/docs/concepts/policy/pod-security-policy/#fsgroup or some generic methodology within Kubernetes?

Yes

I have now filed kubernetes/kubernetes#55707 as an alternative approach where I make the remapped uid/gid an explicit option, and use those values to chown/chgrp the necessary directories.

👍 subscribed

@adelton

adelton commented Nov 14, 2017

When you say transparent, do you mean that nothing should need to be added to the code or configuration on Kubernetes' side to allow running under Docker with userns-remap?

that would be ideal. Whether that is feasible (or more likely, feasible in an acceptable time frame) is another question

Ideally, the pod would specify how many distinct uids/gids it requires, or the list of uids it wants to see inside the containers, and Docker or a different container runtime would set up the user namespace accordingly. But unless Docker also changes the ownership of the volumes mounted into the containers, Kubernetes will have to do that as part of the setup.

@adelton

adelton commented Dec 7, 2017

@pweil-, what is the best way to get some review and comments on kubernetes/kubernetes#55707, to get it closer to a mergeable state?

@0xmichalis

@pweil- ^

@pweil-

pweil- commented Dec 7, 2017

@adelton I would try to engage the sig-node folks either at their Tuesday meeting or on slack: https://github.com/kubernetes/community/tree/master/sig-node

@adelton

adelton commented Dec 18, 2017

@derekwaynecarr, could you please bring kubernetes/kubernetes#55707 to sig-node's radar?

@idvoretskyi
Member

@pweil- @derekwaynecarr is any progress expected on this feature?

@k8s-ci-robot
Contributor

@kannon92: The provided milestone is not valid for this repository. Milestones in this repository: [v1.25, v1.27, v1.28, v1.29, v1.30, v1.31, v1.32, v1.33, v1.34]

Use /milestone clear to clear the milestone.

In response to this:

/milestone v.133

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@kannon92
Contributor

kannon92 commented Feb 5, 2025

/milestone v.1.33

@k8s-ci-robot
Contributor

@kannon92: The provided milestone is not valid for this repository. Milestones in this repository: [v1.25, v1.27, v1.28, v1.29, v1.30, v1.31, v1.32, v1.33, v1.34]

Use /milestone clear to clear the milestone.

In response to this:

/milestone v.1.33

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@lzung

lzung commented Feb 5, 2025

/milestone v1.33

Hello @rata @giuseppe 👋, v1.33 Enhancements team here.

Just checking in as we approach enhancements freeze on 02:00 UTC Friday 14th February 2025 / 19:00 PDT Thursday 13th February 2025.

This enhancement is targeting stage beta for v1.33 (correct me if otherwise).
/stage beta

Here's where this enhancement currently stands:

  • KEP readme using the latest template has been merged into the k/enhancements repo.
  • KEP status is marked as implementable for latest-milestone: v1.33.
  • KEP readme has up-to-date graduation criteria
  • KEP has a production readiness review that has been completed and merged into k/enhancements. (For more information on the PRR process, check here). If your production readiness review is not completed yet, please make sure to fill the production readiness questionnaire in your KEP by the PRR Freeze deadline on Thursday 6th February 2025 so that the PRR team has enough time to review your KEP.

For this KEP, we would just need to update the following:

The status of this enhancement is marked as At risk for enhancements freeze. Please keep the issue description up-to-date with appropriate stages as well.

If you anticipate missing enhancements freeze, you can file an exception request in advance. Thank you!

@lzung lzung moved this to At risk for enhancements freeze in 1.33 Enhancements Tracking Feb 5, 2025
@rata
Member

rata commented Feb 6, 2025

@lzung the KEP is already at beta. Currently it is beta disabled by default; we are aiming for beta enabled by default. Do we need to do all of that within this tight deadline, or just when we migrate to GA (with a lot of other things)?

@kannon92
Contributor

kannon92 commented Feb 6, 2025

@lzung the KEP is already at beta. Currently it is beta disabled by default; we are aiming for beta enabled by default. Do we need to do all of that within this tight deadline, or just when we migrate to GA (with a lot of other things)?

Yes, please follow the template.

All you need to do is update the KEP to the latest template.

@rata
Member

rata commented Feb 6, 2025

@kannon92 oh, I missed that it was way simpler than I expected. Thanks for pointing it out! :)

PR in place to update to the latest KEP template: #5141

@haircommander haircommander moved this from Triage to Proposed for consideration in SIG Node 1.33 KEPs planning Feb 6, 2025
@gnufied
Member

gnufied commented Feb 7, 2025

sig-storage has some valid concerns around the persistent-volume restrictions that were dropped here - https://github.com/kubernetes/enhancements/pull/4084/files

See previous discussion around - kubernetes/kubernetes#111090 (comment)

It is not clear whether the idmap limitation of 1:1 mapping of GIDs has been addressed in the Linux kernel, or whether idmap mounts can squash multiple GIDs when creating a mapping.

cc @dobsonj @msau42 @jsafrane @liggitt

@pacoxu pacoxu moved this from Proposed for consideration to Tracked in SIG Node 1.33 KEPs planning Feb 8, 2025
@rata
Member

rata commented Feb 10, 2025

@gnufied Thanks for taking a look!

The comment you link to is outdated and no longer valid. The phase I design was completely changed, and those comments don't apply anymore. Let me elaborate.

The phase I PR, when I opened it (I changed it before merging, more on that later), only supported secrets/configmaps and emptyDir volumes, and did so by changing the permissions of those. After a meeting with you, we changed it to use fsGroup and then we merged it for 1.25. The comment you made was relevant to that fsGroup design.

In 1.27 we completely changed the design; we are not using fsGroup anymore and just rely on idmap mounts. I don't know which issue you are referring to now, so I can't answer your concern, but I'll add some info that might be relevant. Please, though, explain what issue you see now with the new design.

You are the storage expert, so correct me if some CSI drivers do something differently. IIUC a CSI driver does:

mount -o gid=100 XXX /some/path/mount

After the CSI driver does the mount, when the kubelet creates a container it just sends a request over CRI to create a bind-mount of /some/path/mount to a location inside the container rootfs. When user namespaces are in use, this creates an idmapped bind-mount for that path.

So the GID squashing is not an issue: it is done by the "device-mount" call (the first mount, done by the CSI driver), and the idmap bind-mount (the second mount, done by runc) just creates a mapping on top of that. There is no need to squash anything and it all works fine. For example:

# mount -o gid=100 /dev/sda3 mount/
# ls -ln mount/
total 64
drwxr-xr-x 2 0 100 32768 Jan 29 15:53  print
drwxr-xr-x 2 0 100 32768 Nov  8 18:08 'System Volume Information'
# mount -o X-mount.idmap=b:100:200:1 --bind mount/ idmap-bind-mount/
# ls -ln idmap-bind-mount/
total 64
drwxr-xr-x 2 65534 200 32768 Jan 29 15:53  print
drwxr-xr-x 2 65534 200 32768 Nov  8 18:08 'System Volume Information'

So, as you see, if the CSI driver does the -o gid when mounting the device, then the idmap mount later will work fine. The squashing is done at device-mount time, not at bind-mount time.

But is there some CSI driver not doing that? Do you have any example I can repro with local-up-cluster.sh?

@giuseppe
Member

sig-storage has some valid concerns around the persistent-volume restrictions that were dropped here - https://github.com/kubernetes/enhancements/pull/4084/files

See previous discussion around - kubernetes/kubernetes#111090 (comment)

It is not clear whether the idmap limitation of 1:1 mapping of GIDs has been addressed in the Linux kernel, or whether idmap mounts can squash multiple GIDs when creating a mapping.

the user namespace mapping is a bijective function; each ID in the user namespace can be mapped only to an ID outside the user namespace. There is nothing like squashing of IDs in the kernel.

cc @dobsonj @msau42 @jsafrane @liggitt

@sftim
Contributor

sftim commented Feb 10, 2025

the user namespace mapping is a bijective function; each ID in the user namespace can be mapped only to an ID outside the user namespace. There is nothing like squashing of IDs in the kernel.

It's not bijective. You can map a subset of the outer UID range into the user namespace. It's true that each ID in the user namespace can be mapped only to an ID outside the user namespace, but it's actually an injective relationship in terms of set theory.
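
As a concrete illustration (the numbers are made up), a single-line /proc/<pid>/uid_map maps only a subset of host IDs into the namespace:

# cat /proc/<container-pid>/uid_map
         0     100000      65536

Host IDs outside 100000-165535 have no representation inside the namespace and show up there as the overflow ID 65534 ("nobody").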

@dipesh-rawat
Member

Hi @rata @giuseppe 👋, 1.33 Enhancements team here,

Just a quick friendly reminder as we approach the enhancements freeze later this week, at 02:00 UTC Friday 14th February 2025 / 19:00 PDT Thursday 13th February 2025.

The current status of this enhancement is marked as At risk for enhancement freeze. There are a few requirements mentioned in the comment #127 (comment) that still need to be completed.

If you anticipate missing enhancements freeze, you can file an exception request in advance. Thank you!

@gnufied
Member

gnufied commented Feb 10, 2025

But is there some CSI driver not doing that? Do you have any example I can repro with local-up-cluster.sh?

Not all CSI drivers use -o gid on the first mount; in fact most don't. Only CSI drivers with the VOLUME_MOUNT_GROUP capability do that. For other volume types, the entire volume gets recursively chowned/chmodded (using the given fsGroup) before the bind mount can be prepared by the CRI.
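
For reference, that recursive fsGroup pass is roughly equivalent to the following sketch (illustrative path and an fsGroup of 9999; the real logic lives in the kubelet's volume ownership code):

# chown -R :9999 /var/lib/kubelet/pods/<pod-uid>/volumes/<plugin>/<volume>/mount
# chmod -R g+rwX /var/lib/kubelet/pods/<pod-uid>/volumes/<plugin>/<volume>/mount
# find /var/lib/kubelet/pods/<pod-uid>/volumes/<plugin>/<volume>/mount -type d -exec chmod g+s {} +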

But certain brownfield volumes, if they are using the OnRootMismatch policy, may not have uniform GID-based ownership, either via a recursive chown/chmod call or via mount -o gid. Will this be a problem?

I have been testing the user namespace feature on a single-node Fedora 41 VM with everything at the latest version, and for persistent volumes what I have found is that no fsGroup-based idmap is specified when bind mounts are created. @haircommander helped me understand the feature a bit more. If I print /run/containers/storage/overlay-containers/78407c092388b897c4d73b3e349b150bf31c024233c00b79addfa20cdb2c57db/userdata/config.json, I see the following mapping:

                {
                        "destination": "/mnt/test",
                        "type": "bind",
                        "source": "/var/lib/kubelet/pods/0954e329-63f9-4448-ba2a-c47eb8d111b6/volumes/kubernetes.io~csi/pvc-0cdb10fb-7907-499f-9bfc-eeb90a51143c/mount",
                        "options": [
                                "rbind",
                                "rprivate",
                                "rw",
                                "bind"
                        ],
                        "uidMappings": [
                                {
                                        "containerID": 0,
                                        "hostID": 3007119360,
                                        "size": 65536
                                }
                        ],
                        "gidMappings": [
                                {
                                        "containerID": 0,
                                        "hostID": 3007119360,
                                        "size": 65536
                                }
                        ]

Here the pod is using runAsUser: 9999 and fsGroup: 9999. The volume on the host has GID 9999, and it also shows up as 9999 inside the container. It is as if the container was not using user namespaces at all. :-) I thought gidMappings would actually specify a mapping containing something like 9999->9999, but it doesn't. Is this expected?

@rata
Member

rata commented Feb 11, 2025

@gnufied

But certain brownfield volumes

What do you mean by brownfield volumes? What is that? An empty persistent volume?

But certain brownfield volumes, if they are using the OnRootMismatch policy, may not have uniform GID-based ownership, either via a recursive chown/chmod call or via mount -o gid. Will this be a problem?

No. As I showed in the previous comment, if the first mount is done with -o gid it should just work. If there is a way to test this locally with a CSI driver that does this, please let me know how, or go ahead and test it. But as I showed in the code example in my previous comment, this works just fine. The recursive chown is not an issue either; it is designed to work fine.

Here the pod is using runAsUser: 9999 and fsGroup: 9999. The volume on the host has GID 9999, and it also shows up as 9999 inside the container. It is as if the container was not using user namespaces at all. :-) I thought gidMappings would actually specify a mapping containing something like 9999->9999, but it doesn't. Is this expected?

Yes, exactly! We designed it so you can use persistent storage as if userns were not there. This is what idmap mounts give us (and some other stuff). You are running as user/group 3007119360 + 9999 on the host and can access the volume files just fine. You can even disable user namespaces, change the volume, and enable them again. It will all just work :-).
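
A quick sanity check of that arithmetic against the config.json fragment above (hostID base 3007119360, in-container ID 9999):

$ echo $((3007119360 + 9999))
3007129359

So on the host the pod's processes run as 3007129359, while inside the container both the process and the files on the idmapped volume appear as 9999.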

@sftim
Contributor

sftim commented Feb 11, 2025

Brownfield storage is when the underlying block storage is reused but scrubbed. It's analogous to brownfield land.

By contrast, you could delete the block level storage and provision a completely empty block volume on which you build a filesystem that you then mount. That's analogous to a greenfield build.

You can also apply this to NAS; for example, you take an existing file-based volume and delete it back to one root inode. Then you reuse that with a different purpose for a new set of Pods.

@gnufied
Member

gnufied commented Feb 11, 2025

Thanks @rata and @giuseppe for the explanations. I have tested the userns feature with different volume scenarios, such as using subPath (with a different fsGroup) and using brownfield volumes with different GID ownership, and it appears to work.

@rata
Member

rata commented Feb 11, 2025

@lzung @kannon92 @dipesh-rawat All those things were just addressed and the PR is merged! #5141 (comment)

@dipesh-rawat
Member

Hello @rata @giuseppe 👋, 1.33 Enhancements team here,

Now that PR #5141 has been merged, all the KEP requirements are in place and merged into k/enhancements, so this enhancement is all good for the upcoming enhancements freeze. 🚀

The status of this enhancement is marked as tracked for enhancement freeze. Please keep the issue description up-to-date with appropriate stages as well. Thank you!

(cc: @lzung)

/label tracked/yes

@k8s-ci-robot k8s-ci-robot added the tracked/yes Denotes an enhancement issue is actively being tracked by the Release Team label Feb 11, 2025
@dipesh-rawat dipesh-rawat moved this from At risk for enhancements freeze to Tracked for enhancements freeze in 1.33 Enhancements Tracking Feb 11, 2025
@rayandas
Member

Hello @rata @giuseppe 👋, v1.33 Docs Lead here.

Does the enhancement work planned for v1.33 require any new docs or modifications to existing docs?

If so, please follow the steps here to open a PR against dev-1.33 branch in the k/website repo. This PR can be just a placeholder at this time and must be created before Thursday 27th February 2025 18:00 PDT.

Also, take a look at Documenting for a release to familiarize yourself with the docs requirements for the release.

Thank you!

@rata
Member

rata commented Feb 13, 2025

@AkihiroSuda wanna handle the doc PR too? You can cc me so I review, if you want :)

@AkihiroSuda
Member

@AkihiroSuda wanna handle the doc PR too? You can cc me so I review, if you want :)

Yes, thanks

@aojea
Member

aojea commented Feb 14, 2025

@rata I've updated the description with those links so they don't get lost on the comments thread

@rata
Member

rata commented Feb 17, 2025

@aojea thanks! I can't edit the original issue description :)
