A virtual filesystem locking surprise [LWN.net]

By Jonathan Corbet
July 31, 2023

It is well understood that concurrency makes programming problems harder; the high level of concurrency inherent in kernel development is one of the reasons why kernel work can be challenging. Things can get even worse, though, if concurrent access happens in places where the code is not expecting it. The long story accompanying this short patch from Christian Brauner is illustrative of the kind of problem that can arise when assumptions about concurrency prove to be incorrect.

Within the kernel, struct file is used to represent an open file. It contains the information needed to work with that file, including an extensive operations vector, a reference count, a pointer to the associated inode, the current read/write position, and more. Since there can be multiple references to an open file, there must be a way to serialize access to this structure. The f_lock spinlock is used in most cases, but there is also a mutex called f_pos_lock that is used for access to the file position.

Acquiring and releasing locks has a cost of its own. Many I/O operations affect the file position, so an I/O-intensive workload can end up repeatedly taking and releasing f_pos_lock, increasing the overhead imposed by the kernel. As it happens, though, having multiple references to an open file is a relatively rare occurrence. If there is only a single reference to a given file, concurrent access to the file position cannot happen and that lock overhead is wasted. To avoid this waste, the function that acquires f_pos_lock (__fdget_pos()) contains an optimization:

    if (file_count(file) > 1)
        mutex_lock(&file->f_pos_lock);

(The code has been simplified slightly to highlight the relevant part.) The idea here is simple enough: if there is only a single reference to the file, concurrent access cannot happen and there is no point in taking the lock, so the mutex_lock() call is skipped.

The io_uring subsystem has been under intensive development since its introduction in 2019; it is rapidly becoming an independent interface to much kernel functionality. There are currently efforts underway to add io_uring operations corresponding to waitid(), futexes, and getdents(). That last patch, making the getdents() system call available in io_uring, is relevant here because getdents() relies heavily on the file position (and, possibly, state kept by the filesystem implementation) to allow a process to read through a long directory in multiple calls.
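To see why getdents() leans on the file position, consider how user space reads a large directory. The following is a minimal sketch, assuming Linux and the raw getdents64() system call; the function name count_dir_entries() and the buffer sizes are choices made for this illustration, while the record layout follows the getdents64(2) man page. Each call resumes where the previous one stopped, and the only thing remembering that place is the position stored in the open file.

```c
#define _GNU_SOURCE
#include <fcntl.h>
#include <stdint.h>
#include <stdlib.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Record layout filled in by getdents64(2); declared here by hand. */
struct linux_dirent64 {
    uint64_t       d_ino;
    int64_t        d_off;
    unsigned short d_reclen;   /* length of this record in bytes */
    unsigned char  d_type;
    char           d_name[];
};

/*
 * Count the entries in a directory, reading at most buf_size bytes per
 * system call. Returns the number of entries (including "." and ".."
 * where the filesystem reports them), or -1 on error.
 */
long count_dir_entries(const char *path, size_t buf_size)
{
    int fd = open(path, O_RDONLY | O_DIRECTORY);
    if (fd < 0)
        return -1;

    char *buf = malloc(buf_size);
    if (!buf) {
        close(fd);
        return -1;
    }

    long total = 0;
    for (;;) {
        /* The kernel continues from the position left by the previous call. */
        long n = syscall(SYS_getdents64, fd, buf, buf_size);
        if (n < 0) {
            total = -1;
            break;
        }
        if (n == 0)   /* end of directory */
            break;

        /* Walk the variable-length records packed into the buffer. */
        for (long off = 0; off < n; ) {
            struct linux_dirent64 *d = (struct linux_dirent64 *)(buf + off);
            total++;
            off += d->d_reclen;
        }
    }

    free(buf);
    close(fd);
    return total;
}
```

Calling count_dir_entries("/", 512) will typically need several system calls to walk the directory; two such loops running at once against the same open file would each advance the shared position underneath the other.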

The "fixed files" feature of io_uring is also relevant here; it lets a file be used numerous times in io_uring operations without the per-call overhead required with regular system calls. That overhead, which includes acquiring a reference to the file and validating the process's access to it, can be significant in I/O-heavy applications; fixing a file makes it possible to pay that cost only once, improving performance. When a file is fixed into io_uring, a new reference is created, so the reference count will increase. The process can, however, close its own file descriptor after fixing it in io_uring, leaving the fixed-file reference as the only one. The reference count will, as a result, drop back to one. It will also stay there while I/O operations on the file are underway in io_uring; the whole point of fixing the file is to avoid the cost of repeatedly gaining and releasing references.

Brauner pointed out a problem in the getdents() patch: if a file has been fixed in io_uring, and its reference count is one, it will be possible to run multiple getdents() operations concurrently within io_uring, each of which will access f_pos without taking the lock. The results of this concurrency are highly unlikely to be what the developer was hoping for. One might argue that this is a "then don't do that" sort of situation but, as Brauner described in his patch addressing the problem, io_uring is not the only way to run into trouble.

In 2020, the kernel acquired an interesting system call named pidfd_getfd(), which allows a suitably privileged process to extract an open file descriptor from a running process. This operation can be useful for, among other things, enabling a privileged supervisor process to perform operations that another process cannot perform on its own; opening a file outside of a container might be one example. For this to work, the file descriptor created by pidfd_getfd() must refer to the same open file structure as the descriptor in the target process. It creates a second reference to that structure, and the reference count is duly incremented to reflect that.
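What "the same open file structure" means in practice can be shown from user space. The sketch below uses dup() as a stand-in for pidfd_getfd(), on the assumption that both yield a descriptor referring to the same open file; it contrasts that with opening the path a second time, which creates an independent structure. The function name is made up for this illustration.

```c
#include <fcntl.h>
#include <stdbool.h>
#include <unistd.h>

/*
 * Open `path`, obtain a second descriptor either with dup() (same open
 * file) or with a second open() (new open file), then move the position
 * through the first descriptor and look at it through the second.
 * Returns 1 if the second descriptor sees the move, 0 if not, -1 on error.
 */
int second_descriptor_shares_position(const char *path, bool reopen)
{
    int a = open(path, O_RDONLY);
    if (a < 0)
        return -1;

    int b = reopen ? open(path, O_RDONLY) : dup(a);
    if (b < 0) {
        close(a);
        return -1;
    }

    int shared = -1;
    if (lseek(a, 7, SEEK_SET) == 7) {
        /* Query the current position through the other descriptor. */
        off_t seen = lseek(b, 0, SEEK_CUR);
        if (seen >= 0)
            shared = (seen == 7);
    }

    close(a);
    close(b);
    return shared;
}
```

On a regular file such as /bin/sh, the dup() case is expected to return 1 and the reopen case 0. The shared position is exactly the state that f_pos_lock exists to protect.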

A problem arises, though, if the target process has a getdents() call underway when its file descriptor is grabbed by pidfd_getfd(). Since, when getdents() was called, the file's reference count was one, the target process will not have acquired f_pos_lock. If the process that obtained the file descriptor with pidfd_getfd() also passes it to getdents(), things can go wrong. The second call will see the elevated reference count and acquire f_pos_lock but, since the first call did not acquire that lock, that acquisition will succeed immediately and the two getdents() calls will run concurrently, once again with something other than the intended results.

The fix is easy enough: simply remove the check on f_count and acquire f_pos_lock unconditionally. That will impose a performance cost, but nobody seems to have been worried enough about it to actually measure it. Linus Torvalds applied the patch for the 6.5-rc4 release after editing the changelog (which he described as "*way* too much", but which your editor found most useful). He also complained about how pidfd_getfd() shares the file structure, saying it would have been better to simply reopen the file (creating a new file structure); that would defeat the purpose of pidfd_getfd(), though, since the new file descriptor would no longer be usable to perform actions on the other process's behalf.
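The sequence of events behind the pidfd_getfd() race, and the effect of locking unconditionally, can be replayed in a small deterministic model. This is a sketch only: struct model_file, the policy names, and both functions are invented for the illustration, a boolean stands in for f_pos_lock, and nothing here is kernel code.

```c
#include <stdbool.h>

/* Toy stand-in for struct file; every field is an assumption of the model. */
struct model_file {
    int  count;        /* what file_count() would report */
    bool pos_locked;   /* whether f_pos_lock is currently held */
    int  in_getdents;  /* callers currently using the file position */
};

enum lock_policy {
    LOCK_IF_SHARED,    /* the old optimization: lock only if count > 1 */
    LOCK_ALWAYS        /* the fix: take the lock unconditionally */
};

/*
 * Begin a position-dependent operation. Returns false if the caller
 * would have had to sleep waiting for the lock.
 */
static bool model_enter(struct model_file *f, enum lock_policy policy)
{
    bool want_lock = (policy == LOCK_ALWAYS) || (f->count > 1);

    if (want_lock) {
        if (f->pos_locked)
            return false;      /* a real mutex_lock() would block here */
        f->pos_locked = true;
    }
    f->in_getdents++;
    return true;
}

/*
 * Replay the timeline from the article and report how many callers end
 * up working with the file position at the same time.
 */
int concurrent_callers(enum lock_policy policy)
{
    struct model_file f = { .count = 1 };

    model_enter(&f, policy);   /* target process starts getdents() */
    f.count++;                 /* supervisor gains a second reference */
    model_enter(&f, policy);   /* supervisor calls getdents() as well */

    return f.in_getdents;
}
```

Under LOCK_IF_SHARED the first caller never takes the lock, so the second caller's acquisition succeeds at once and concurrent_callers() returns 2. Under LOCK_ALWAYS the second caller finds the lock held and the function returns 1.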

Torvalds remains grumpy about the shared access to struct file created by pidfd_getfd(), but it seems like it is here to stay. In any case, this problem has been fixed, clearing the way for the (eventual) use of getdents() on fixed files in io_uring. But it provides an example of how subtle assumptions regarding concurrency can go wrong in surprising ways.


A virtual filesystem locking surprise

Posted Jul 31, 2023 15:14 UTC (Mon) by brauner (subscriber, #109349)

> That will impose a performance cost, but nobody seems to have been worried enough about it to actually measure it.

Actually, I did request performance measurements from Intel's lkp-tests even before I sent that patch but only received them after it was applied.

To quote from that (private) mail:

"we've already been merging it into our
so-called hourly kernels which are distributed to our machine pool for
various performance tests which we supported.

so far, we didn't capture any performance change caused by this branch.

in order to avoid missing, we also decided to run some performance tests
directly upon this branch [...]
to see if it could cause any performance change comparing to v6.5-rc1.

firstly we want to check stress-ng jobs with HDD such like:
stress-ng-class-filesystem
stress-ng-class-io
stress-ng-class-os
stress-ng-class-vm-stack
stress-ng-os-1-thread
upon our Ice Lake and Cascade Lake test machines."

A virtual filesystem locking surprise

Posted Jul 31, 2023 17:10 UTC (Mon) by wsy (subscriber, #121706) (2 responses)

> if (file_count(file) > 1)
> mutex_lock(&file->f_pos_lock);

What happens if the reference count changes between the check and the lock?

A virtual filesystem locking surprise

Posted Jul 31, 2023 19:03 UTC (Mon) by stevie-oh (guest, #130795)

My understanding is that this is (well, *was*) ostensibly impossible, because the requirement is twofold: there is only one reference to this file descriptor, _and_ the process that has that reference is single-threaded.

The logic goes like this:
1. Only one reference exists to this file descriptor
2. The reference belongs to a process with only one thread
3. Therefore, right now, there is only one thread that can access or manipulate this file descriptor
4. Right now, that thread is busy executing this function, which means it can't conflict with anything.

The problem, then, is that io_uring and pidfd_getfd violate the validity of the leap from 2->3. pidfd_getfd would do what you mentioned: it allows the reference count to be incremented by a thread from another process. io_uring, on the other hand, seems to do work on threads that don't get "counted" for #2.

A virtual filesystem locking surprise

Posted Jul 31, 2023 21:13 UTC (Mon) by pbonzini (subscriber, #60935)

Functions such as fdget_pos() return a bunch of flags for later use in fdput() and fdput_pos(). One such flag is FDPUT_POS_UNLOCK, which directs fdput_pos() to release the mutex.
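The get/put pairing described here can be sketched in a few lines. This is a toy model of the conditional-locking version of the code, not the kernel implementation: struct toy_file, TOY_POS_UNLOCK (standing in for FDPUT_POS_UNLOCK), and the toy_ functions are all invented for the illustration, and a counter stands in for the mutex.

```c
#define TOY_POS_UNLOCK 0x1u   /* hypothetical stand-in for FDPUT_POS_UNLOCK */

/* Toy stand-in for struct file. */
struct toy_file {
    int count;        /* what file_count() would report */
    int lock_depth;   /* +1 per modeled mutex_lock(), -1 per unlock */
};

/* Decide once, at "get" time, whether to lock; remember that in the flags. */
unsigned int toy_fdget_pos(struct toy_file *f)
{
    unsigned int flags = 0;

    if (f->count > 1) {
        f->lock_depth++;           /* models mutex_lock(&file->f_pos_lock) */
        flags |= TOY_POS_UNLOCK;
    }
    return flags;
}

/* At "put" time, consult the saved flags rather than the current count. */
void toy_fdput_pos(struct toy_file *f, unsigned int flags)
{
    if (flags & TOY_POS_UNLOCK)
        f->lock_depth--;           /* models mutex_unlock(&file->f_pos_lock) */
}

/*
 * Let the reference count change between get and put and report the
 * resulting lock depth; zero means lock and unlock stayed balanced.
 */
int lock_depth_after(int count_at_get, int count_at_put)
{
    struct toy_file f = { .count = count_at_get };
    unsigned int flags = toy_fdget_pos(&f);

    f.count = count_at_put;
    toy_fdput_pos(&f, flags);
    return f.lock_depth;
}
```

Because the unlock decision travels in the flags, a change in the reference count between get and put cannot unbalance the mutex. It does nothing, however, for a caller that skipped the lock at get time and then gained company, which is the case the article describes.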

General solution

Posted Jul 31, 2023 18:27 UTC (Mon) by calumapplepie (guest, #143655) (4 responses)

IMO, not acquiring a lock just because the reference count is *currently* one is a pretty nasty anti-pattern, but is also a really useful strategy in a lot of cases. It's nasty because it always creates a race window where a process can acquire a second reference to $PROTECTED_STRUCTURE and then see the instance success. It's useful because there are just so many cases where most objects will be doing something singlethreaded but the occasional object might be used concurrently.

Proposal: someone rigs up something clever, as a general solution that can be applied to all the various objects in the kernel, via a little struct that can be put inside a larger object, call it 'rcutex'. It tracks whether or not a user can assume that there is only a single reference to an object; when you're going lock-acquiring, you simply check if the bit is set, and if so you can go ahead with the assumption that you have an exclusive reference. When you go to make a new reference to an object containing an rcutex, you clear the singlethreaded bit, wait an RCU grace period if it was set, and then you can proceed knowing that there are no ongoing lockless accesses.

Now, I'm not Paul McKenney, so take this with a grain of salt, but I think it should be possible to use the grace periods of RCU, or a similar construct, to accomplish this. The details are beyond me, however; do we make the rcutex some sort of RCU-protected pointer, or will we need a non-RCU solution? This will obviously slow down the adding-references-to-existing-objects case, due to the need to wait a grace period; there are a half dozen ways to amortize that cost springing to mind, both through heuristics and through trying to avoid actually waiting the period in a time-critical path. Some caution is needed for the case of two references being added at the same time; but two people clearing a bit still results in a cleared bit*
*probably.

General solution

Posted Jul 31, 2023 22:44 UTC (Mon) by josh (subscriber, #17465)

That seems like a potentially useful structure, as long as people acquiring the second reference to something are willing to wait the (potentially considerable) amount of time for an RCU grace period to expire.

That might be a reasonable amount of overhead for pidfd_getfd, for instance.

General solution

Posted Aug 3, 2023 8:14 UTC (Thu) by NYKevin (subscriber, #129325) (2 responses)

Link to documentation for people (like me) who are unfamiliar with the term "RCU": https://www.kernel.org/doc/Documentation/RCU/rcu.txt

General solution

Posted Aug 3, 2023 10:31 UTC (Thu) by paulj (subscriber, #341) (1 responses)

There's a wealth of articles on RCU here on LWN too, from the RCU author, e.g.: https://lwn.net/Articles/263130/

Just google for "LWN RCU" in your favourite search engine, e.g. DuckDuckGo.

RCU

Posted Aug 3, 2023 13:13 UTC (Thu) by corbet (editor, #1)

...or look in the RCU section of the LWN kernel index.

A virtual filesystem locking surprise

Posted Aug 12, 2023 4:34 UTC (Sat) by dxin (guest, #136611)

Isn't mutex_lock() already heavily optimized for the uncontended case (e.g., __mutex_trylock_fast())? Is it still necessary to find a case to skip the locking like this?
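For readers wondering what an "uncontended fast path" looks like, here is a user-space sketch using C11 atomics. It assumes, as this reader understands the kernel's mutex code, that the fast path is a single compare-and-exchange on an owner word; struct toy_mutex and both functions are invented for the illustration and leave out the slow path entirely.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/* Toy mutex: the owner word is 0 when unlocked, else the holder's identity. */
struct toy_mutex {
    atomic_uintptr_t owner;
};

/*
 * Fast-path acquire: one compare-and-exchange from 0 to `self`.
 * Returns false if the lock is already held; a real mutex would then
 * fall back to a slow path that spins or sleeps.
 */
bool toy_trylock_fast(struct toy_mutex *m, uintptr_t self)
{
    uintptr_t expected = 0;

    return atomic_compare_exchange_strong_explicit(
        &m->owner, &expected, self,
        memory_order_acquire, memory_order_relaxed);
}

/*
 * Fast-path release: one compare-and-exchange from `self` back to 0.
 * Returns false if `self` does not hold the lock.
 */
bool toy_unlock_fast(struct toy_mutex *m, uintptr_t self)
{
    uintptr_t expected = self;

    return atomic_compare_exchange_strong_explicit(
        &m->owner, &expected, (uintptr_t)0,
        memory_order_release, memory_order_relaxed);
}
```

Even in this best case, each lock/unlock pair costs two atomic read-modify-write operations on memory shared by every user of the file, which is presumably the overhead the original optimization was trying to avoid.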