ID-mapped mounts

By Jake Edge
May 30, 2022

The ID-mapped mounts feature was added to Linux in 5.12, but the general idea behind it goes back a fair bit further. There are a number of different situations where the user and group IDs for files on disk do not match the current human (or process) user of those files, so ID-mapped mounts provide a way to resolve that problem—without changing the files on disk. The developer of the feature, Christian Brauner, led a discussion at the 2022 Linux Storage, Filesystem, Memory-management and BPF Summit (LSFMM) on ID-mapped mounts.

He began with an introduction. There are multiple use cases, but he likes to talk about portable home directories first because they are not related to containers, which many think are the sole reason for ID-mapped mounts. A portable home directory lives on some kind of removable media that can be attached to various systems, some of which have a different user and group ID for the user, but, of course, the media has fixed values for those IDs. ID-mapped mounts allow the device to be mounted on the system with the IDs remapped to those of the user on the local system.

Beyond that, of course, are various container use cases, such as sharing a root filesystem with multiple containers, each of which is using its own user namespace with a different mapping for UID 0. Each of the containers needs to be able to access the files as "root", but UID 0 inside the namespace is mapped to some nonzero UID on the host system; an ID-mapped mount would enable that nonzero ID to be mapped to UID 0 for filesystem access. Similarly, sharing data between a host filesystem and one in a user namespace may require remapping the IDs. Some of these cases were handled with expensive recursive chown calls before ID-mapped mounts came along.
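For comparison, the recursive-chown workaround can be sketched roughly as follows. This is only an illustration of why that approach is costly: the function names and the fixed ID shift are hypothetical, and the planning step changes nothing on disk. Every inode under the tree has to be visited and rewritten, once per container mapping.

```python
import os


def plan_shifted_chown(root, shift):
    """Return the (path, new_uid, new_gid) changes a recursive chown would make.

    `shift` is a hypothetical fixed offset added to every UID and GID.
    Nothing is modified here; the point is that every entry is visited.
    """
    st = os.lstat(root)
    changes = [(root, st.st_uid + shift, st.st_gid + shift)]
    for dirpath, dirnames, filenames in os.walk(root):
        for name in dirnames + filenames:
            path = os.path.join(dirpath, name)
            st = os.lstat(path)  # lstat: do not follow symbolic links
            changes.append((path, st.st_uid + shift, st.st_gid + shift))
    return changes


def apply_chown(changes):
    """Rewrite ownership on disk; needs privilege and touches every inode."""
    for path, uid, gid in changes:
        os.lchown(path, uid, gid)
```

An ID-mapped mount avoids both the walk and the on-disk rewrite, since the translation happens at access time.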

There are some filesystems that can be used in user-namespace-based containers, most notably overlayfs, but there are still lots of limitations and the main filesystem types, Btrfs, XFS, and ext4, cannot really be used in that manner. Once all of the use cases were gathered, he said, the most flexible solution turned out to be a per-mount mapping of UIDs and GIDs, which is what ID-mapped mounts provide.

The API for the feature uses the mount_setattr() system call, which allows changing the ID mappings as well as other attributes of mounts. Brauner clarified that the feature applies to all virtual filesystem (VFS) mounts, so bind mounts are included. Unlike mount(), mount_setattr() allows changing mount attributes recursively.

Using the feature requires passing a flag and a file descriptor to mount_setattr(); the file descriptor is that of a user namespace that does the ID mapping that should be applied to the mount. The implementation was done in the VFS layer, so individual filesystems "do not need to be really aware of it"; there are APIs available to make it easy on the filesystems, he said. Ted Ts'o asked about a command-line tool for doing an ID-mapped mount; Brauner said that one should be merged soon into util-linux.

Amir Goldstein noted that fstests already has a binary tool for testing these mounts. Brauner added that there are 15K lines of code in tests, already upstream in fstests, for ID-mapped mounts that aim to test the feature in all possible combinations. That includes things like access-control lists (ACLs), Linux capabilities, setuid and setgid execution, and so on. Every time a bug or regression is found, a new test is added to the suite.

He spent a bit of time demonstrating the tool and the feature, noting that the mapping works in both directions: IDs of files in the mount follow the mapping and files created within the mount have the reverse-mapped IDs outside of it. The feature is already being used by various tools, such as systemd-nspawn and systemd-homed; it has also been added to the runC container specification, so "there is lots of activity going on around this".
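The two directions of the mapping can be modeled in a few lines. This is a simplified sketch, not kernel code: the IdRange type and function names are made up for illustration, and returning None for an unmapped ID is an assumption of the model. It uses the portable-home-directory case, where files stored with UID 1000 should appear as UID 1001 on the local system.

```python
from typing import NamedTuple, Optional, Sequence


class IdRange(NamedTuple):
    disk_start: int   # first ID as stored on disk
    mount_start: int  # first ID as seen through the mount
    count: int        # number of consecutive IDs covered


def disk_to_mount(ranges: Sequence[IdRange], disk_id: int) -> Optional[int]:
    """Map an on-disk ID to the ID seen through the mount."""
    for r in ranges:
        if r.disk_start <= disk_id < r.disk_start + r.count:
            return r.mount_start + (disk_id - r.disk_start)
    return None  # unmapped (how this is handled is an assumption of the sketch)


def mount_to_disk(ranges: Sequence[IdRange], mount_id: int) -> Optional[int]:
    """Reverse direction: the ID written to disk for a file created in the mount."""
    for r in ranges:
        if r.mount_start <= mount_id < r.mount_start + r.count:
            return r.disk_start + (mount_id - r.mount_start)
    return None


home = [IdRange(disk_start=1000, mount_start=1001, count=1)]
seen = disk_to_mount(home, 1000)    # existing files appear owned by 1001
stored = mount_to_disk(home, 1001)  # new files are stored as 1000
```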

Currently, ext4, XFS, Btrfs, and several other filesystems support the feature; there is a patch set for overlayfs that is on track to be merged soon. David Howells asked what filesystems need to do to support ID-mapped mounts. Brauner said that "in principle it is easy" to do so. Network filesystems may have some additional wrinkles, however; he has a patch set for Ceph but it still needs more work. The changes for ext4 and XFS were small, he said, and others are likely to be similar because most filesystems do not really use the IDs directly. The XFS quota-handling code does use the IDs, so it needed a bit more work. There is a long document available and he is willing to help add support to other filesystems.

Network filesystems need to determine which ID they want to send to the server, he said. Normally, the mapped ID is the right choice, but that may not be true for all cases.

Chuck Lever asked how the ID mapping could be changed for an existing mount and wondered if it could just be remounted to make that change. Brauner said that no changes are allowed once the namespace has been attached to the mount or the mount has been attached to the filesystem. Due to "lifetime issues" with regard to the use of the mapping, it is too complicated to allow changes once the filesystem has been fully mounted. Using the new mount API, a user will create a detached mount, then set the ID mapping on it, then, finally, attach it to the filesystem.
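The three steps (detached mount, set the mapping, attach) might look roughly like this when driven from Python through ctypes. It is a sketch under several assumptions: the constants are copied from the Linux UAPI headers, the syscall numbers are the ones used on x86_64 and in the generic syscall table and should be checked for other architectures, the helper names are invented, and the caller is assumed to hold CAP_SYS_ADMIN and to have already opened a user-namespace file descriptor (for example from /proc/<pid>/ns/user).

```python
import ctypes
import os

# Syscall numbers (x86_64 and the generic table; verify for your architecture).
SYS_open_tree = 428
SYS_move_mount = 429
SYS_mount_setattr = 442

# Flag values from the Linux UAPI headers.
AT_FDCWD = -100
AT_EMPTY_PATH = 0x1000
OPEN_TREE_CLONE = 1
OPEN_TREE_CLOEXEC = os.O_CLOEXEC
MOVE_MOUNT_F_EMPTY_PATH = 0x4
MOUNT_ATTR_IDMAP = 0x00100000


class MountAttr(ctypes.Structure):
    """Mirror of the kernel's struct mount_attr (four 64-bit fields)."""
    _fields_ = [
        ("attr_set", ctypes.c_uint64),
        ("attr_clr", ctypes.c_uint64),
        ("propagation", ctypes.c_uint64),
        ("userns_fd", ctypes.c_uint64),
    ]


def _syscall(number, *args):
    """Invoke a raw syscall via libc and raise OSError on failure."""
    libc = ctypes.CDLL(None, use_errno=True)
    libc.syscall.restype = ctypes.c_long
    converted = [ctypes.c_long(a) if isinstance(a, int) else a for a in args]
    ret = libc.syscall(ctypes.c_long(number), *converted)
    if ret < 0:
        err = ctypes.get_errno()
        raise OSError(err, os.strerror(err))
    return ret


def idmapped_bind_mount(source, target, userns_fd):
    """Create an ID-mapped bind mount of `source` at `target`."""
    # 1. Create a detached copy of the mount at `source`.
    tree_fd = _syscall(SYS_open_tree, AT_FDCWD, source.encode(),
                       OPEN_TREE_CLONE | OPEN_TREE_CLOEXEC)
    try:
        # 2. Attach the user namespace's ID mapping to the detached mount.
        attr = MountAttr(attr_set=MOUNT_ATTR_IDMAP, userns_fd=userns_fd)
        _syscall(SYS_mount_setattr, tree_fd, b"", AT_EMPTY_PATH,
                 ctypes.byref(attr), ctypes.sizeof(attr))
        # 3. Attach the mount to the filesystem; the mapping is now fixed.
        _syscall(SYS_move_mount, tree_fd, b"", AT_FDCWD, target.encode(),
                 MOVE_MOUNT_F_EMPTY_PATH)
    finally:
        os.close(tree_fd)
```

The ordering matters: the mapping is set while the mount is still detached, which matches Brauner's point that it cannot be changed afterward.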

Lever also asked about the limits for the number of entries in the mapping; for example, in a system with thousands of users, where each user should be mapped to their own ID in a single mount. Brauner said that user namespaces were originally limited to five mappings, but he raised that limit to 340 in 2015 or 2016. It will be difficult to increase it beyond that, he said, because mapping is done in a hot path; he optimized the data structure for the mappings and increasing it further will have a performance impact.
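To give a feel for why lookup cost matters as the number of entries grows, here is one way to keep per-ID lookups cheap: sort the ranges and use binary search. This is an illustrative sketch only, not a description of the kernel's data structure; the RangeMap class is hypothetical, and the limit of 340 is simply the figure quoted in the talk.

```python
import bisect

MAX_RANGES = 340  # limit on mapping entries quoted in the talk


class RangeMap:
    """Map IDs through non-overlapping (first, mapped_first, count) ranges."""

    def __init__(self, ranges):
        ranges = sorted(ranges)  # sort by the first ID of each range
        if len(ranges) > MAX_RANGES:
            raise ValueError("too many mapping ranges")
        self._ranges = ranges
        self._starts = [r[0] for r in ranges]

    def lookup(self, id_):
        """Return the mapped ID, or None if no range covers id_."""
        # Find the last range whose first ID is <= id_.
        i = bisect.bisect_right(self._starts, id_) - 1
        if i < 0:
            return None
        first, mapped_first, count = self._ranges[i]
        offset = id_ - first
        return mapped_first + offset if offset < count else None


# One range entry can cover many IDs: 65536 here.
container = RangeMap([(131072, 0, 65536)])
root_inside = container.lookup(131072)
```

Each lookup costs O(log n) comparisons in the number of ranges, but it still runs on every ownership check, which is why raising the limit is not free.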

Ts'o wondered if there was any thinking about supporting "project IDs", which are used by some container systems; those IDs are used for project-wide quotas in filesystems. Brauner said that project ID needs to be revisited, since "we have dodged this issue for years". The intended semantics are not clear, so he has been confused when looking into it.

While both XFS and ext4 support those IDs, Ts'o said he is confused by the semantics as well, at least with respect to user namespaces. He and Darrick Wong discussed it at one point and it was not clear whether both filesystems worked the same way, though there is an intention to unify their behavior. Brauner said that quota handling is not the same between different filesystems in Linux; each seems to have its own quirks. In the Zoom chat, Jan Kara pointed out that ID-mapping changes had not been made to the VFS quota code, at least yet; that was relayed as time expired on the session, however.

Index entries for this article
Kernel	Filesystems/Mounting
Kernel	Namespaces/User namespaces
Conference	Storage, Filesystem, Memory-Management and BPF Summit/2022

ID-mapped mounts

Posted May 30, 2022 16:55 UTC (Mon) by jhoblitt (subscriber, #77733) [Link] (6 responses)

I'm very excited to see this coming to fruition after years of dealing with uid coordination for network filesystems and containers.

What capability will calling mount_setattr() require?

ID-mapped mounts

Posted May 30, 2022 17:07 UTC (Mon) by brauner (subscriber, #109349) [Link]

In order to create idmapped mounts you will need to have CAP_SYS_ADMIN in the user namespace the filesystem was mounted in and the filesystem needs to support them by raising FS_ALLOW_IDMAP. Since no filesystems that support being mounted unprivileged support them - and probably don't need to - this means you need to be CAP_SYS_ADMIN in the initial user namespace. There are no immediate plans to lower the privilege requirements.

ID-mapped mounts

Posted Jun 1, 2022 17:18 UTC (Wed) by developer122 (guest, #152928) [Link] (4 responses)

It's been frustrating to no end that you can't do user mapping in NFS without dragging kerberos into the picture. Despite what some documentation would suggest, it's not possible.

Sometimes you just want a simple file of the form:
user1@host1.host1.host1.host1 = localuser1
user2@host2.host2.host2.host2 = localuser1
user3@host3.host3.host3.host3 = localuser2

ID-mapped mounts

Posted Jun 1, 2022 18:41 UTC (Wed) by ballombe (subscriber, #9523) [Link] (1 responses)

what happened to rpc.ugidd ?

ID-mapped mounts

Posted Jun 1, 2022 18:45 UTC (Wed) by jhoblitt (subscriber, #77733) [Link]

NFS uid/gid mapping is essentially useless if you expect both reads and writes to be remapped.

ID-mapped mounts

Posted Jun 1, 2022 21:13 UTC (Wed) by jhoblitt (subscriber, #77733) [Link] (1 responses)

I've never gotten idmapping to work with sec=krb5. Not saying it isn't possible... just I couldn't get it to work.

krb5 isn't always practical. I have an application that mounts nfs into k8s pods. krb5 wasn't designed in the era of 100-1000s of "hosts" dynamically coming and going. While it supposedly isn't impossible to get krb5 working in a pod (https://cloud.redhat.com/blog/kerberos-sidecar-container), it doesn't look like much fun either.

ID-mapped mounts

Posted Jun 14, 2022 12:30 UTC (Tue) by cortana (subscriber, #24596) [Link]

If only RHEL would include the excellent kstart (yes I know it's in EPEL) then that would have been so much simpler.

That said, gssproxy is the better option these days, since it only requires an environment variable to be set. But I never figured out how to use it in a pod...

ID-mapped mounts

Posted May 30, 2022 23:19 UTC (Mon) by dgc (subscriber, #6611) [Link]

The fundamental issue with user namespaces/mappings and project IDs is that they are not UIDs or GIDs. They are completely independent of UID/GIDs and are directly user controllable. That makes them very different in behaviour - users cannot change UIDs/GIDs as they are effectively owned by the system and fixed, whilst project ID is a property of the user owned file and can be changed at any time. This "user owns project ID" architecture is why mapping project IDs with user namespaces or ID mapping is problematic.

Historically speaking, project IDs came from Irix. Irix didn't have group quotas at all. Instead, it had project quotas that allowed users to co-operatively build "project" based data stores without needing the admins to define special groups for those projects and assign users to them. It also allowed users to assign files in their home directories to projects, such that a project could account for not just all the /data/project/.... files but also all the working files that users might have in /home/fred/project/...

IOWs, project ID based quotas exist entirely outside the scope of UIDs and GIDs and strictly defined and owned directory hierarchies. This allows project quotas to do many things that UID/GID based quotas can't do, such as provide directory tree based quotas.

For example, in a namespace based container setup, the host may be using project IDs to track space usage and enforce ENOSPC for a container's directory hierarchy. In this case, users inside the container cannot be allowed to modify the project ID as that would allow them to escape the space usage accounting and enforcement mechanism the host is using. This is why we don't allow project IDs to be manipulated inside user namespaces.

There are also no limits on what project IDs a user can assign to a file. It's a 32 bit space, and the only two reserved system IDs are 0 (default, no accounting/enforcement) and 0xffffffff (-1) which is used to signal an invalid project ID. Any user can set any project ID they want between those two numbers. That makes it difficult to map ranges usefully to mounts because of the lack of constraints on what users can set.

So before anything is implemented, a coherent framework for mapping and sharing project IDs across host and client namespaces and mapped mounts needs to be developed and agreed upon. If project IDs are going to be mapped at the user level, how does that translate to what project quotas store? Do they account based on the user project ID (mapped) or the host project ID (unmapped)? What if two different containers map back to the same host side project ID?

There's a heap of unanswered questions here, and I'm not sure there is a single answer that works for every situation. The very flexibility and user control of project based quotas is what works against it here, and constraints will need to be carefully designed so that we don't compromise that flexibility and capability.

-Dave.

ID-mapped mounts

Posted May 31, 2022 7:13 UTC (Tue) by taladar (subscriber, #68407) [Link] (2 responses)

I wonder if this could also eventually be used to make bindfs unnecessary. We use that rather extensively for the mounting of directories owned by e.g. a vhost user into the chroot of various sftp users.

ID-mapped mounts

Posted Jun 1, 2022 2:00 UTC (Wed) by ringerc (subscriber, #3071) [Link] (1 responses)

bindfs looks amazing. Why on earth doesn't Docker / Moby / containerd / runc support that for -v bind mounts, to solve some of the horrifying pain of id mapping in containers?

ID-mapped mounts

Posted Jun 2, 2022 0:14 UTC (Thu) by rcampos (subscriber, #59737) [Link]

It is a FUSE fs, you can't get good performance out of it :-(

ID-mapped mounts

Posted Jun 1, 2022 1:57 UTC (Wed) by ringerc (subscriber, #3071) [Link]

This will bring some much needed sanity to Docker bind mounts, which are currently an absolute nightmare.

ID-mapped mounts

Posted Jun 28, 2022 17:37 UTC (Tue) by jengelh (subscriber, #33263) [Link] (1 responses)

>for example, in a system with thousands of users, where each user should be mapped to their own ID in a single mount. Brauner said that user namespaces were originally limited to five mappings, but he raised that limit to 340

Is that in units of users, or in units of ranges? For a container, a single ranged mapping (i.e. something like [131072:196608]->[0:65536]) might be sufficient while expressing much more than 340 individual users.

ID-mapped mounts

Posted Jul 7, 2022 3:52 UTC (Thu) by brauner (subscriber, #109349) [Link]

units of ranges :)