
Ideas for rationalizing GFP flags

By Jonathan Corbet
April 20, 2016
LSFMM 2016
The kernel's memory-allocation functions normally take as an argument a set of flags describing how the allocation is to be performed. These "GFP flags" (for "get free page") control both the placement of the allocated memory and the techniques the kernel can use to make memory available if need be. For some time, developers have been saying that these flags need to be rethought; in two separate sessions at the 2016 Linux Storage, Filesystem, and Memory-Management Summit, Michal Hocko explored ways of doing that.

GFP_REPEAT

The first session, in the memory-management track, started with a discussion of the GFP_REPEAT flag which, as its name would suggest, is meant to tell the allocator to retry an attempt should it fail the first time. This flag, Michal said, has never been useful. It is generally used for order-0 (single-page) allocations, but those allocations are not allowed to fail and, thus, will retry indefinitely anyway. For larger requests, he said, it "pretends to try harder," but does not actually do anything beneficial. Michal would like to clean this flag up and create a better-defined set of semantics for it.

The kernel does have the opposite flag in the form of GFP_NORETRY, but that one, he said, is not useful for anything outside of order-0 allocations. What he would like to see instead is something he called GFP_BESTEFFORT; it would try hard to satisfy the request, but would not try indefinitely. So it could retry a failed request, and even invoke the out-of-memory killer but, should that prove fruitless, it would give up. This flag would be meant to work for all sizes of requests.

He is trying to move things in that direction, starting with the removal of GFP_REPEAT from order-0 allocation requests around the kernel. The next step would be to start placing the new flag in the places where it makes sense. As an example, he mentioned transparent huge pages and the hugetlbfs filesystem. Both need to allocate huge pages but, while an allocation failure for a transparent huge page is just a missed optimization opportunity, a failure in hugetlbfs is a hard failure that will be passed back to user space. It clearly makes sense to try harder for hugetlbfs allocations.

Johannes Weiner asked whether it would be a good idea to provide best-effort semantics by default while explicitly annotating the exceptions where it is not wanted. The existing GFP_NORETRY flag could be used for that purpose. Michal said that doing so would cause performance regressions, leading Andrew Morton to question whether "taking longer but succeeding" constitutes a regression. The point is that some callers do have reasonable fallback paths for failed allocations and would rather see the failures happen quickly if they are going to. Andrew asked how often that sort of failure happens, but nobody appeared to have any sort of answer to that question. It will be, in any case, highly workload-dependent.

Johannes persisted, saying that it can be difficult to know where the memory allocator should be told to try harder, but it is usually easy to see the places where failure can be handled easily. There was also a suggestion to make the flags more fine-grained; rather than use a vague "best effort" flag, have flags to specify that retries should not be done, or that the out-of-memory killer should not be invoked. Mel Gorman noted that he has already done some work in that direction, adding flags to control how reclaim should be performed.

That led to a wandering discussion on whether the flags should be positive ("perform direct reclaim") or negative ("no direct reclaim"). Positive flags are more descriptive, but they are a bit more awkward to use since call sites will have to mask them out of combined mask sets like GFP_KERNEL. There are also concerns that there aren't many flag bits available for fine-grained control.

The session ended with Michal asking if the group could at least come to a consensus that his work cleaning up GFP_REPEAT made sense. There seemed to be no objection there, so that work can be expected to continue.

GFP_NOFS

Later that day, the entire LSFMM group was present while Michal talked about a different GFP flag: GFP_NOFS. This flag instructs the memory allocator to avoid actions that involve calling into filesystem code — writing out dirty pages to files, for example. It exists for use by filesystem code for a number of reasons, the most straightforward of which is the avoidance of deadlocks. If a filesystem acquires locks then discovers that it must allocate memory, it doesn't want the allocator coming back and trying to obtain the same locks. But there is more to it than that; GFP_NOFS reflects a number of "indirect dependencies" within the filesystems. Also, XFS uses it for all page-cache allocations, regardless of deadlock concerns, to avoid calling so deeply into filesystem code that the kernel stack overflows.

There are, Michal said, too many uses of GFP_NOFS in the kernel tree; they needlessly constrain the memory allocator's behavior, making memory harder to obtain than it should be. So he would like to clean them up, but, he acknowledged, that will not be easy. The reason for any given use of GFP_NOFS is often far from clear — if there is one at all.

His suggestion is to get rid of direct use of that flag entirely; instead, setting a new task flag would indicate that the current task could not call back into filesystem code. XFS has a similar mechanism internally now; it could be pulled up and used in the memory-management layer. A call to a function like nofs_store() would set the flag; all subsequent memory allocations would implicitly have GFP_NOFS set until the flag was cleared.

There are a number of reasons for preferring this mechanism. Each call to nofs_store() would be expected to include documentation describing why it's needed. It allows the "no filesystem calls" state to follow the task's execution into places — security modules, for example — that have no knowledge of that state. Chris Mason noted that it would save filesystem developers from sysfs, which brings surprises of its own. Ted Ts'o added that there are a number of places where code called from ext4 should be using GFP_NOFS for its allocations, but that doesn't happen because it would simply be too much work to push the GFP flags through the intervening layers. Thus far, he has been crossing his fingers and hoping that nothing goes wrong; this mechanism would be more robust.

Michal asked the filesystem developers in the room how much work it would be to get rid of the GFP_NOFS call sites. Chris said that the default in Btrfs has been to use it everywhere; a bunch of those sites have since been fixed, but quite a few remain. He would be happy to switch to the new API, he said. Ted agreed, as long as the transition would be gradual and GFP_NOFS would not disappear in a flag day, as it were. The end result, he said, would be nice.

There was some talk of refining the mechanism to specify the specific filesystem that should be avoided, allowing the memory allocator to call into other filesystems. The consensus seemed to be that this idea would be tricky to implement; the possibility of stack overruns was also raised. Michal will go ahead and put together an API proposal for review. He hopes it will succeed: the fewer GFP_NOFS sites there are, the better the memory allocator's behavior will be.


Ideas for rationalizing GFP flags : GFP_NOFS

Posted Apr 21, 2016 3:38 UTC (Thu) by neilbrown (subscriber, #359) [Link] (6 responses)

I think the best approach for GFP_NOFS is to get rid of it.

The only places where the allocator recurses into the filesystem (last I checked) were in shrinkers and ->releasepage. They need to be careful not to block on a lock that is already held.
Rather than testing __GFP_FS, they can be changed to use a "trylock" interface and simply not bother if the lock cannot be claimed.

The other places where GFP_NOFS is important are in the mm core where calls with GFP_NOFS get throttled less than GFP_KERNEL.
They could change to use GFP_NOIO instead - most of them already test both.

Then instead of introducing nofs_store(), filesystems can just use memalloc_noio_save() if they really need to.

fs/ocfs2/cluster/tcp.c already does this. The comment says "So we are not reentering filesystem while doing memory reclaim.", but it calls memalloc_noio_save(), which prevents waiting for IO rather than preventing entering the filesystem.

Ideas for rationalizing GFP flags : GFP_NOFS

Posted Apr 21, 2016 4:03 UTC (Thu) by viro (subscriber, #7872) [Link] (5 responses)

Oh, _lovely_. And how deep will those trylock of yours live? And how long would you expect for that to produce arseloads of broken code, both from "oh, shit, we need to undo a bunch of stuff on that trylock failure" and from having the same helpers called both from the page eviction pathways and from normal write?

IMO it's of the same order of realism as grand promises of aio-via-state-machine, non-blocking even for block allocation, onna stick, inna bun and that's cuttin' me own throat; it's just a matter of technics, guv, it can be done, honest... Heard it once in a while since at least 2002. Not materialized, and not going to happen, obviously.

Ideas for rationalizing GFP flags : GFP_NOFS

Posted Apr 21, 2016 5:28 UTC (Thu) by neilbrown (subscriber, #359) [Link] (4 responses)

> And how deep will those trylock of yours live?

Do you have an example of a releasepage or shrinker that takes sleeping locks (spinlocks obviously don't count) with a depth greater than one?

I might have missed something, but here is what I found:

There is some code in btrfs that isn't completely transparent, but I think that if we just removed the GFP flag from ->releasepage() the only interesting change is that nfs and filesystems that use fscache would need to reduce their 1 second timeout to zero, so maybe some extra throttling would be needed further up the stack - maybe.

gfs uses shrink_control.gfp_mask, as does nfs, though it isn't clear why, as they just take a spinlock and manipulate some structures under that.

super_cache_scan aborts if __GFP_FS, but then it does do the trylock_super(). Why so?

xfs_qm_shrink_scan is the only filesystem shrinker I could find that might actually block indefinitely on filesystem IO. It would be easy to change that to non-blocking, but not so easy to understand all the consequences.

So I think we are just a tiny step away from removing the gfp flags from releasepage and shrinkers.
Where else is __GFP_FS used that __GFP_IO cannot trivially replace it?

Ideas for rationalizing GFP flags : GFP_NOFS

Posted Apr 26, 2016 18:18 UTC (Tue) by mstsxfx (subscriber, #41804) [Link] (3 responses)

I am not an expert in FS but I was told that things are much more complex than a simple deadlock:
lock() -> alloc() -> reclaim() -> FS -> lock()

Let me quote David Chinner from one of the emails in which we discussed this on the
LSF mailing list before the summit (I am sorry, but my skills with the LWN
commenting format are quite poor, so I've ended up describing parallel things sequentially - hopefully the idea will be conveyed sufficiently):

The quote is not full but the following should give the picture:

> I am not sure I understand you here. If kswapd is safe to call inode
> shrinker because "it won't try to reclaim referenced inodes" then the
> direct reclaim should be safe to do the same because they are doing
> the same thing. Or am I missing something and i/dcache shrinkers do
> something different depending on kswapd/direct reclaim context?

Like everyone else, you're assuming that ABBA deadlocks on locks are
the only thing that GFP_NOFS is needed for. It's not - the subsystem
defines the recursion context, and it may have nothing to do with
locks. So, let's look at why nesting transactions in direct
reclaim deadlocks XFS, but doesn't deadlock kswapd. Let's start with
a simple example of a GFP_KERNEL allocation inside a transaction:

process1
---------
trans alloc
reserve space in journal
lock inode X
join inode to transaction
kmalloc(GFP_KERNEL)
.....
shrink_slab
....
evict
xfs_inactive
trans alloc
reserve space
no space available
tail push journal
<block waiting for space>

xfsaild
----------
starts pushing from tail
inode X at tail of journal
trylock inode X
fails, skip inode X

<tail of log cannot be moved forward>
<filesystem will deadlock completely with no log space>

So, now, let's make it GFP_NOFS, add kswapd into the picture, and
another background reclaim worker thread that XFS runs(*):

process 1
---------------
trans alloc
reserve space in journal
lock inode X
join inode to transaction
kmalloc(GFP_NOFS)
<keeps retrying allocation>

kswapd
------------
shrink_slab
evict
xfs_inactive
trans alloc
reserve space
no space available
tail push journal
<block waiting for space>

xfsaild
----------
starts pushing from tail
inode X at tail of journal
trylock inode X
fails, skip inode X
pushes everything else that is dirty
<dirty inodes cleaned>

xfs_reclaim_work
----------------------------
walk reclaimable inodes
lock clean inode
free inode
<slab frees pages>

Process1
---------------
<gets a freed page>
<transaction commits>
unlock inode X

kswapd
------------
trylock inode X
locked, pushes inode X
<inode X cleaned>
<log tail moved>
<unblocks>
space available
inode truncated
inode marked free on disk
transaction commit
destroy inode

It's a fucking complex dance that revolves around several levels of
concurrency and workqueues. Quite frankly, I don't expect anyone
other than an experienced XFS developer to understand how this all
works. I certainly don't expect mm developers to understand all
this subsystem-specific wizardry.

(*) xfs_reclaim work could be any thing that results in memory being
freed, but this background worker does the majority of XFS inode
freeing, and we generally only find ourselves doing this dance when
we have a large inode cache and inode cache pressure.

Ideas for rationalizing GFP flags : GFP_NOFS

Posted Apr 26, 2016 23:45 UTC (Tue) by neilbrown (subscriber, #359) [Link] (2 responses)

Thanks for all that - very helpful!

Firstly, it shows me what I was missing. I was only looking at the shrinkers that filesystems explicitly register, not the common inode/dcache shrinker that the core code registers.
The 'dcache' side of that doesn't check for __GFP_FS, doesn't seem to block, and uses 'try_lock' occasionally to avoid deadlocks, just the way I think it should.

The 'inode' side is different. When an inode is removed from the icache, evict() is called (from dispose_list()) and this calls into the filesystem via ->evict_inode and that can block.

So my contention is that ->evict_inode should not be permitted to block. I wonder how practical that is.

evict_inode is responsible for truncating the inode if it has been unlinked, performing the final flush of all dirty pages, and freeing all the data structures. So it really does need to block.
In that case I don't think that it should be called from a shrinker any more than general writeback should be called during direct reclaim - it is too complex. Maybe it could be off-loaded to kswapd just like the writeback is.
i.e. prune_icache_sb(), instead of calling dispose_list() would splice the list of inodes onto some global (or per-NUMA-node) list and wake up kswapd. kswapd would call dispose_list() on the list.

Does that seem reasonable?

> Quite frankly, I don't expect anyone other than an experienced XFS developer to understand how this all works.

That, I think, is a serious problem. If it is too hard to understand, it should be simplified.
Or at least, disentangled so that the mm side doesn't *need* to understand the complexities of the filesystem.

Ideas for rationalizing GFP flags : GFP_NOFS

Posted Apr 27, 2016 0:21 UTC (Wed) by viro (subscriber, #7872) [Link] (1 responses)

It's not reasonable at all. If nothing else, you've just royally fucked up the inode lifetime rules, since the same ->evict_inode() is called not only from memory pressure pathways. Moreover, it's not just unlink-related - it must write all the dirty pages of that inode first, for obvious reasons. These inode_wait_for_writeback() and truncate_inode_pages() really need to be called. On any inode eviction.

Final iput() blocks. No way around that. And offloading it to something async can screw filesystem internals in so many ways... IIRC, you used to argue for giving the filesystem drivers full control over the locking and lifetimes of everything, on the theory that They Surely Know Better(tm). I'm glad that you've finally seen the light, but IMO you went _way_ too far in the opposite direction...

Ideas for rationalizing GFP flags : GFP_NOFS

Posted Apr 27, 2016 5:37 UTC (Wed) by neilbrown (subscriber, #359) [Link]

> If nothing else, you've just royally fucked up the inode lifetime rules,

Have I? I wonder how.
All I did was move "dispose_list()" on a private list of inodes from the shrinker to kswapd. I_FREEING is still set at the same time while no other code has access to the inode. So any code wanting to access the inode will have to call __wait_on_freeing_inode(), and it hardly matters whether it waits for a task calling the shrinker or for kswapd. Unless kswapd could ever call __wait_on_freeing_inode()??? That would be a problem for me, but it seems unlikely.
unmount might need to wait for kswapd in a way that it didn't before.

> since the same ->evict_inode() is called not only from memory pressure pathways.

True, but I recanted on changing ->evict_inode(). I only want to change where it is called from. It should be called from the same sorts of places that can write out dirty pages. This does include kswapd but doesn't include direct reclaim.

> Final iput() blocks. No way around that.

Yes, I agree. That does raise an awkwardness in that prune_icache_sb() calls inode_lru_isolate() which sometimes calls iput(). That feels like a rough edge that we should be able to smooth off. If we change ->releasepage() to never block (which I think we can, with a bit of work), then we might be able to hold the i_lock across invalidate_mapping_pages() and so not need to do the __iget()/iput() dance. That would need careful study to get right.

> IIRC, you used to argue for giving the filesystem drivers full control over the locking and lifetimes of everything,

I think "everything" is an overstatement. I'm certainly in favor of the filesystem having control of which filesystem operations it can perform in parallel and which require serialization.

> I'm glad that you've finally seen the light,

Far from it. I think that filesystems (or any modules) should (ideally) have full control over things that are their responsibility, and not need to be concerned at all about things over which they are not responsible.
Creating files in a directory is very much the responsibility of the filesystem, so it should control, for example, whether they are serialized or not.
Allocating memory is not at all the filesystem's responsibility so it shouldn't need to know anything about possible deadlocks or recursion or whatever.
The distinction between "failure is OK, but don't sleep", and "failure is not OK, do whatever you must" is reasonable (because "failure" and "sleep" are things the filesystem needs to know about), but the amount of external knowledge needed to make the correct "Should I use GFP_NOFS" decision is excessive.
There will always be some give and take between filesystems and mm, but the more we can simplify it, the better.

Ideas for rationalizing GFP flags

Posted Apr 21, 2016 15:41 UTC (Thu) by josh (subscriber, #17465) [Link] (5 responses)

Ideally, no modern system should ever be using swap to begin with, and systems that don't use swap shouldn't pay a penalty for tracking this information. So, for instance, a kernel with swap compiled out should ideally not need to track that per-task flag. Other than swap, under what circumstances would an allocation ever need to touch filesystem code?

Ideas for rationalizing GFP flags

Posted Apr 21, 2016 16:41 UTC (Thu) by nybble41 (subscriber, #55106) [Link]

> Other than swap, under what circumstances would an allocation ever need to touch filesystem code?

Non-anonymous mmap()'d files with dirty pages pose essentially the same issues for allocation as a swap file, and are more difficult to write off.

Ideas for rationalizing GFP flags

Posted Apr 22, 2016 19:54 UTC (Fri) by dlang (guest, #313) [Link] (1 responses)

> Other than swap, under what circumstances would an allocation ever need to touch filesystem code?

any time the memory management system decides that it can force a write of pending data to disk to free the RAM holding that data.

Ideas for rationalizing GFP flags

Posted Apr 22, 2016 20:35 UTC (Fri) by neilbrown (subscriber, #359) [Link]

> any time the memory management system decides that it can force a write of pending data to disk to free the RAM holding that data.

Nope. Memory allocation doesn't directly write out data. The data writeback happens from a separate thread - kswapd. Memory allocation may wake up that thread, and may wait a little while to see if it made progress.

There are two places where an allocation can call into filesystem code. One is "releasepage", which is called on a clean page and asks the filesystem to discard any fs-specific data it has attached to the page. If that succeeds, the page can be freed. The other is "shrinkers", which are called to ask the filesystem to prune excess entries from some internal cache.
See my other comment where I give more details.

Ideas for rationalizing GFP flags

Posted Apr 29, 2016 19:01 UTC (Fri) by Wol (subscriber, #4433) [Link]

> Ideally, no modern system should ever be using swap to begin with,

Huh? First of all, you're saying "let's solve performance problems by throwing hardware at it" which is a *real* bugbear to me, and secondly, it's not always practical ...

I doubt I'm that unusual - I have two desktop systems maxed out with RAM (one 2GB, one 16GB). I run gentoo, and /var/tmp/portage is on tmpfs. So when I do an "emerge", the system *often* spills into swap.

Atypical? Yes. Unusual? Probably not.

Cheers,
Wol

Ideas for rationalizing GFP flags

Posted Apr 30, 2016 17:28 UTC (Sat) by flussence (guest, #85566) [Link]

Swap is used for more than avoiding the OOM killer: it also lets the kernel page out programs that sit idle 99.99% of the time so they don't sit there soaking up RAM that could be used for page cache.


Copyright © 2016, Eklektix, Inc.
This article may be redistributed under the terms of the Creative Commons CC BY-SA 4.0 license
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds