configfd() and shifting bind mounts [LWN.net]

By Jonathan Corbet
January 10, 2020

The 5.2 kernel saw the addition of an extensive new API for the mounting (and remounting) of filesystems; an earlier article covered an early version of that API. Since then, work in this area has mostly focused on enabling filesystems to support this API fully. James Bottomley has taken a look at this API as part of the job of redesigning his shiftfs filesystem and found it to be incomplete. What has followed is a significant set of changes that promise to simplify the mount API, though it turns out that "simple" is often in the eye of the beholder.

The mount API work replaces the existing, complex mount() system call with a half-dozen or so new system calls. An application would call fsopen() to open a filesystem stored somewhere or fspick() to open an already mounted filesystem. Calls to fsconfig() set various parameters related to the mount; fsmount() is then called to mount a filesystem within the kernel and move_mount() to attach the result to the filesystem hierarchy somewhere. There are a couple more calls to fill in other parts of the interface as well. The intent is for this set of system calls to be able to replace mount() entirely with something that is more flexible, capable, and maintainable.
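The sequence described above can be sketched in C roughly as follows. This is a minimal illustration rather than code from the patches under discussion: it invokes the system calls through syscall(), the fallback syscall numbers and flag values are copied from the kernel's UAPI headers on the assumption that the installed headers may predate them, and the helper name mount_tmpfs() and the "size" option value are invented for the example.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <fcntl.h>        /* AT_FDCWD */
#include <sys/syscall.h>
#include <unistd.h>

/* Fallbacks for older headers (generic numbers, as on x86-64 and arm64). */
#ifndef SYS_move_mount
#define SYS_move_mount 429
#endif
#ifndef SYS_fsopen
#define SYS_fsopen 430
#endif
#ifndef SYS_fsconfig
#define SYS_fsconfig 431
#endif
#ifndef SYS_fsmount
#define SYS_fsmount 432
#endif

/* Local copies of values from <linux/mount.h>, prefixed to avoid clashes. */
#define EX_FSOPEN_CLOEXEC          0x1
#define EX_FSMOUNT_CLOEXEC         0x1
#define EX_FSCONFIG_SET_STRING     1
#define EX_FSCONFIG_CMD_CREATE     6
#define EX_MOUNT_ATTR_NODEV        0x4
#define EX_MOUNT_ATTR_NOEXEC       0x8
#define EX_MOVE_MOUNT_F_EMPTY_PATH 0x4

/* Mount a new tmpfs at target. Returns 0 or a negative errno value. */
int mount_tmpfs(const char *target)
{
    int err = 0;
    int mfd = -1;

    /* fsopen(): get a configuration context for a new tmpfs superblock. */
    int fsfd = (int)syscall(SYS_fsopen, "tmpfs", EX_FSOPEN_CLOEXEC);
    if (fsfd < 0)
        return -errno;

    /* fsconfig(): set a parameter, then ask the kernel to create the
     * superblock from the accumulated configuration. */
    if (syscall(SYS_fsconfig, fsfd, EX_FSCONFIG_SET_STRING,
                "size", "16m", 0) < 0 ||
        syscall(SYS_fsconfig, fsfd, EX_FSCONFIG_CMD_CREATE,
                NULL, NULL, 0) < 0) {
        err = -errno;
        goto out;
    }

    /* fsmount(): turn the context into a detached mount. */
    mfd = (int)syscall(SYS_fsmount, fsfd, EX_FSMOUNT_CLOEXEC,
                       EX_MOUNT_ATTR_NODEV | EX_MOUNT_ATTR_NOEXEC);
    if (mfd < 0) {
        err = -errno;
        goto out;
    }

    /* move_mount(): attach the detached mount to the hierarchy. */
    if (syscall(SYS_move_mount, mfd, "", AT_FDCWD, target,
                EX_MOVE_MOUNT_F_EMPTY_PATH) < 0)
        err = -errno;

out:
    if (mfd >= 0)
        close(mfd);
    close(fsfd);
    return err;
}
```

A caller with sufficient privileges would invoke mount_tmpfs("/mountpoint"); on kernels older than 5.2 the first call would be expected to fail with ENOSYS.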

Back in November, Bottomley discovered one significant gap with the new API: it is not possible to use it to set up a read-only bind mount. The problem is that bind mounts are special; they do not represent a filesystem directly. Instead, they can be thought of as a view of a filesystem that is mounted elsewhere. There is no superblock associated with a bind mount, which turns out to be a problem where the new API is concerned, since fsconfig() is designed to operate on superblocks. An attempt to call fsconfig() on a bind mount will end up modifying the original mount, which is almost certainly not what the caller had in mind. So there is no way to set the read-only flag for a bind mount.
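For contrast, the classic mount() system call conventionally handles this case in two steps: create the bind mount, then remount it with MS_BIND set so that the read-only flag applies to that mount alone rather than to the underlying superblock. The sketch below illustrates the older interface and is not taken from the discussion; the helper name bind_mount_readonly() is invented for the example.

```c
#include <errno.h>
#include <stddef.h>
#include <sys/mount.h>

/* Bind-mount src on dst and make the new mount read-only.
 * Returns 0 or a negative errno value. */
int bind_mount_readonly(const char *src, const char *dst)
{
    /* Step 1: create the bind mount itself. */
    if (mount(src, dst, NULL, MS_BIND, NULL) < 0)
        return -errno;

    /* Step 2: remount with MS_BIND so that MS_RDONLY changes only this
     * mount, leaving the original filesystem writable elsewhere. */
    if (mount(NULL, dst, NULL, MS_REMOUNT | MS_BIND | MS_RDONLY, NULL) < 0) {
        int err = -errno;
        umount2(dst, MNT_DETACH);   /* undo step 1 on failure */
        return err;
    }
    return 0;
}
```

It is this second, per-mount step that had no equivalent in the new API, since fsconfig() would apply the change to the superblock instead.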

David Howells, the creator of the new mount API, responded that what is needed is yet another system call, mount_setattr(), which would change attributes of mounts. That would work for the read-only case, Bottomley said, but it falls down when it comes to more complex situations, such as his proposed UID-shifting bind mount. Instead, he said, the file-descriptor-based configuration mechanism provided by fsconfig() is well suited to this job, but it needs to be made more widely applicable. He suggested that this interface be made more generic so that it could be used in both situations (and beyond).

He posted an initial version of this proposed interface in November, and has recently come back with an updated version. It adds two new system calls:

    int configfd_open(const char *name, unsigned int flags, unsigned int op);
    int configfd_action(int fd, unsigned int cmd, const char *key, void *value,
                        int aux);

A call to configfd_open() would open a new file descriptor intended for the configuration of the subsystem identified by name; the usual open() flags would appear in flags, and op defines whether a new configuration instance is to be created or an existing one modified. configfd_action() would then be used to make changes to the returned file descriptor. The fsconfig() system call (along with related parts like fsopen() and fspick()) is reimplemented using the new calls. Bottomley provides an example for mounting a tmpfs filesystem:

    fd = configfd_open("tmpfs", O_CLOEXEC, CONFIGFD_CMD_CREATE);
    configfd_action(fd, CONFIGFD_SET_INT, "mount_attrs", NULL,
                    MOUNT_ATTR_NODEV|MOUNT_ATTR_NOEXEC);
    configfd_action(fd, CONFIGFD_CMD_CREATE, NULL, NULL, 0);
    configfd_action(fd, CONFIGFD_GET_FD, "mountfd", &mfd, O_CLOEXEC);
    move_mount(mfd, "", AT_FDCWD, "/mountpoint", MOVE_MOUNT_F_EMPTY_PATH);

The configfd_open() call creates a new tmpfs instance; the first configfd_action() call is then used to set the nodev and noexec mount flags on that instance. The filesystem mount is actually created with another configfd_action() call, and the third such call is used to obtain a file descriptor for the mount that can be used with move_mount() to make the filesystem visible.

With that infrastructure in place, Bottomley is able to reimplement his shiftfs filesystem as a type of bind mount. A shifting bind mount will apply a constant offset to user and group IDs before forwarding operations to the underlying mount; this is useful to safely allow true-root access to an on-disk filesystem from within a user namespace.
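The arithmetic behind a constant-offset shift can be illustrated with a small helper. This is only a sketch of the idea: the function name shift_id(), the sign convention of the offset, and the range checks are assumptions made for the example, and the actual patches may map IDs differently.

```c
#include <stdbool.h>
#include <stdint.h>

/* Apply a constant offset to a user or group ID.
 * Returns false if the result would fall outside the 32-bit ID range;
 * otherwise stores the shifted ID in *out and returns true. */
bool shift_id(uint32_t id, int64_t offset, uint32_t *out)
{
    /* Reject offsets that could never yield a valid 32-bit ID. */
    if (offset > (int64_t)UINT32_MAX || offset < -(int64_t)UINT32_MAX)
        return false;

    int64_t shifted = (int64_t)id + offset;
    if (shifted < 0 || shifted > (int64_t)UINT32_MAX)
        return false;

    *out = (uint32_t)shifted;
    return true;
}
```

With an offset of -100000, for example, an ID of 100000 (a typical choice for root inside a user namespace) would be presented to the underlying filesystem as 0, and 101000 as 1000; an ID of 5 has no valid image under that offset and is rejected.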

Only one developer, Christian Brauner, has responded to this patch series so far; he doesn't like it. It is an excessive collection of abstraction layers, he said, and it creates another set of multiplexing system calls, a design approach that is out of favor these days:

If they are ever going to be used outside of filesystem use-cases (which is doubtful) they will quickly rival prctl(), seccomp(), and ptrace(). That's not a great thing. Especially, since we recently (a few months ago with Linus chiming in too) had long discussions with the conclusion that multiplexing syscalls are discouraged, from a security and api design perspective.

Unsurprisingly, Bottomley disagreed. He argued that there is a common pattern that arises in kernel development: a subsystem that is complicated to configure, but then relatively simple to use. Filesystem mounts are an example of this pattern; the setup is hard, but then they can all be accessed through the same virtual filesystem interfaces. Cryptographic keys and storage devices were also mentioned. It would be better, he said, to figure out a common way of interfacing with these subsystems rather than inventing slightly different interfaces every time. The configuration file descriptor approach may be a good solution for that common way, he said:

I don't disagree that configuration multiplexors are a user space annoyance, but we put up with them because we get a simple and very generic API for the configured object. Given that they're a necessary evil and a widespread pattern, I think examining the question of whether we could cover them all with a single API and what properties it should have is a useful one.

The conversation appears to have stalled out at this point. It is hard to guess how this disagreement will be resolved, but one thing is fairly straightforward to point out: if the configfd approach is deemed unacceptable for the kernel, then somebody needs to come up with a better idea for how the problems addressed by configfd will be solved. Thus far, that better idea has not yet shown up on the mailing lists.


configfd() and shifting bind mounts

Posted Jan 10, 2020 21:30 UTC (Fri) by quotemstr (subscriber, #45331) [Link] (20 responses)

I don't like the special configfd_open system call. Why not just use a regular open with a special file?

Instead of

configfd_open("tmpfs", O_CLOEXEC, CONFIGFD_CMD_CREATE)

write

open("/dev/config/fs/tmpfs/create", O_CLOEXEC | O_RDWR)

configfd() and shifting bind mounts

Posted Jan 10, 2020 21:53 UTC (Fri) by josh (subscriber, #17465) [Link] (19 responses)

Because /dev or /dev/config/fs might not exist in your namespace.

If you're binding a filesystem, though, I wonder why there isn't a way to change the fd you get from fsopen of the existing filesystem into a separate filesystem with separate options for "bind"?

configfd() and shifting bind mounts

Posted Jan 10, 2020 22:24 UTC (Fri) by quotemstr (subscriber, #45331) [Link] (15 responses)

> Because /dev or /dev/config/fs might not exist in your namespace.

Not a problem for /proc

And if it *is* a problem, the right approach isn't some random new twist on open(2), but a system call that retrieves a directory file descriptor for /dev/config or whatever, one that you could then use with openat ---

openat(get_configfs_fd(), "fs/tmpfs/create", O_CLOEXEC | O_RDWR)
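(Editorial illustration: the directory-descriptor pattern suggested here already works with today's openat(). The sketch below obtains the directory descriptor with an ordinary open(), since the get_configfs_fd() call in the comment is hypothetical and does not exist; the helper name open_relative() is likewise invented for the example.)

```c
#define _GNU_SOURCE
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>

/* Open rel relative to the directory dir, without building a joined path.
 * Returns a file descriptor (>= 0) or a negative errno value. */
int open_relative(const char *dir, const char *rel)
{
    /* Obtain a descriptor for the directory; the proposal in the comment
     * would get this from a dedicated system call instead. */
    int dfd = open(dir, O_RDONLY | O_DIRECTORY | O_CLOEXEC);
    if (dfd < 0)
        return -errno;

    /* Resolve rel starting from that directory descriptor. */
    int fd = openat(dfd, rel, O_RDONLY | O_CLOEXEC);
    int result = fd < 0 ? -errno : fd;

    close(dfd);
    return result;
}
```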

configfd() and shifting bind mounts

Posted Jan 10, 2020 22:41 UTC (Fri) by cyphar (subscriber, #110703) [Link] (1 responses)

Adding more things to /proc isn't a great idea (it's already full of lots of other crap that arguably shouldn't be there), and there are lots of problems with safely resolving paths in /proc. Any new kernel interfaces (*especially* ones that will be implemented through magic-links) should have a non-procfs counterpart.

configfd() and shifting bind mounts

Posted Jan 10, 2020 23:35 UTC (Fri) by quotemstr (subscriber, #45331) [Link]

I'm not saying that we should add stuff to proc. I'm saying that putting a virtual filesystem in the rooted filesystem namespace works fine, whether that virtual filesystem is proc or something else.

configfd() and shifting bind mounts

Posted Jan 10, 2020 23:26 UTC (Fri) by roc (subscriber, #30627) [Link] (11 responses)

I like that idea, but rather than a directory file descriptor, which might cause issues if the filesystem is not in fact mounted anywhere, it might be better to have new open() flags or AT_ values that select specific magic filesystems --- e.g. procfs.

As a solution to the "what if procfs isn't mounted?" problem, that seems far more elegant than the alternative of creating new syscall APIs for every single feature in procfs that someone might need to use without procfs mounted. (Same goes for other magic filesystems.)

configfd() and shifting bind mounts

Posted Jan 10, 2020 23:36 UTC (Fri) by quotemstr (subscriber, #45331) [Link] (1 responses)

> which might cause issues if the filesystem is not in fact mounted anywhere,

Why would it cause problems? The actual FD would refer to a magical internal non-rooted mount, e.g., like the one the kernel sets up for pipefs on boot.

configfd() and shifting bind mounts

Posted Jan 11, 2020 1:12 UTC (Sat) by roc (subscriber, #30627) [Link]

OK, maybe it wouldn't. I have no experience with fds for paths in filesystems that aren't actually mounted anywhere.

configfd() and shifting bind mounts

Posted Jan 10, 2020 23:37 UTC (Fri) by quotemstr (subscriber, #45331) [Link] (1 responses)

> As a solution to the "what if procfs isn't mounted?" problem, that seems far more elegant than the alternative of creating new syscall APIs for every single feature in procfs that someone might need to use without procfs mounted

Agreed. We don't need duplicate APIs. We just need some way to get a directory FD for /proc, /sys, whatever without going through the mount table.

configfd() and shifting bind mounts

Posted Jan 11, 2020 17:58 UTC (Sat) by smurf (subscriber, #17840) [Link]

Some more special negative pseudo-file-descriptors for openat() and friends?

configfd() and shifting bind mounts

Posted Jan 12, 2020 14:57 UTC (Sun) by mirabilos (subscriber, #84359) [Link] (6 responses)

it’s called sysctl…

configfd() and shifting bind mounts

Posted Jan 13, 2020 3:34 UTC (Mon) by roc (subscriber, #30627) [Link] (5 responses)

... which has been removed

configfd() and shifting bind mounts

Posted Jan 13, 2020 21:40 UTC (Mon) by mirabilos (subscriber, #84359) [Link] (4 responses)

Yeah, bad decision, that.

configfd() and shifting bind mounts

Posted Jan 13, 2020 21:45 UTC (Mon) by Cyberax (✭ supporter ✭, #52523) [Link] (3 responses)

sysctl() was broken, it had only a casual connection with /proc.

Perhaps it would be better to add a new syscall like 'open_special(fs_type)' to open '/proc', '/sys', '/sys/fs/...' directories without them being mounted.

configfd() and shifting bind mounts

Posted Jan 14, 2020 6:44 UTC (Tue) by smurf (subscriber, #17840) [Link]

New special FD arguments to "openat()" and friends should be sufficient, no need for a new syscall.
Alternately, the new "mount" syscalls can give you a handle to /proc or /sys without actually mounting them.
Alternately, just acknowledge that not mounting /dev, /proc and /sys is not supported and going to cause problems, and leave it at that.

configfd() and shifting bind mounts

Posted Jan 14, 2020 21:06 UTC (Tue) by cyphar (subscriber, #110703) [Link] (1 responses)

The new mount API (in particular fsopen(2)) could work for this.

The problem is that there is a security issue with giving a program access to a /proc without any over-mounts if the /proc they already have access to has locked mounts on top of it (container runtimes use this technique to mask certain dangerous procfs files from containers). If we want to have a simple API that gives us a /proc handle, we'll need to make some kind of procfs2 (which has been suggested several times in the past) which removes all of the patently unsafe files so that untrusted programs can get access to all of it.

configfd() and shifting bind mounts

Posted Jan 14, 2020 21:09 UTC (Tue) by Cyberax (✭ supporter ✭, #52523) [Link]

Yeah, security is a problem.

Probably at this point creating something like procfs2 and then mandating it would be the best approach. But then there's a question of what exactly is an "unsafe file"...

configfd() and shifting bind mounts

Posted Jan 12, 2020 14:56 UTC (Sun) by mirabilos (subscriber, #84359) [Link]

*so* a problem for /proc, which may not exist in your chroot

configfd() and shifting bind mounts

Posted Jan 10, 2020 22:38 UTC (Fri) by pbonzini (subscriber, #60935) [Link] (2 responses)

> why there isn't a way to change the fd you get from fsopen of the existing filesystem into a separate filesystem with separate options for "bind"?

Bind mounts can point to any file, even one that is not a mount point--or even one that isn't a directory.

However, it does seem to me that passing an O_PATH file descriptor to fspick, plus a new flag for fspick that says "create a bind mount", would be a good API. The article hints that "fsconfig() is designed to work with superblocks" but it's not clear why.

configfd() and shifting bind mounts

Posted Jan 11, 2020 16:51 UTC (Sat) by jejb (subscriber, #6654) [Link] (1 responses)

> However, it does seem to me that passing an O_PATH file descriptor to fspick, plus a new flag for fspick that says "create a bind mount", would be a good API. The article hints that "fsconfig() is designed to work with superblocks" but it's not clear why.

I did explain that problem in the original email: all the hooks for fsconfig actions are in sb->fs_type->init_fs_context() which the fs_context allocation uses. Now it is possible to special case this for bind mounts, but you also have to special case fsmount and fsconfig/reconfigure. By the time you've done all that, you've effectively got two separate paths through the same code, which isn't really such a good idea, which is why I asked the question "what would the generalisation of fsconfig look like".

configfd() and shifting bind mounts

Posted Jan 11, 2020 20:09 UTC (Sat) by pbonzini (subscriber, #60935) [Link]

Thanks James, that makes sense.