Guard pages for file-backed memory [LWN.net]

By Jonathan Corbet
March 3, 2025

One of the many new features packed into the 6.13 kernel release was guard pages, a hardening mechanism that makes it possible to inject zero-access pages into a process's address space in an efficient way. That feature only supports anonymous (user-space data) pages, though. To make guard pages more widely useful, Lorenzo Stoakes has put together a patch set enabling the feature for file-backed pages as well; in the process, he examined and resolved a long list of potential problems that extending the feature could encounter. One potential problem was not on his list, though.

The purpose of a guard page is to prevent buggy (or malicious) code from overrunning a memory region. An inaccessible page placed at the end of a region will cause a segmentation fault should the running process try to read or write to it; well-placed guard pages can trap a number of common buffer overruns and similar problems. Prior to 6.13, though, the only way to put a guard page into a process's address space was to set the protections on one or more pages with mprotect(); that works, but at the cost of creating a new virtual memory area (VMA) to contain the affected page(s). Placing a lot of guard pages will create a lot of VMAs, which can slow down many memory-management functions.

The new guard-page feature addresses this problem by working at the page-table level rather than creating a new VMA. A process can create guard pages with a call to madvise(), requesting the MADV_GUARD_INSTALL operation. The indicated range of memory will be rendered inaccessible; any data that might have been stored there prior to the operation will be deleted. There is an operation (MADV_GUARD_REMOVE) to remove guard pages as well.

Placing guard pages in VMAs containing anonymous pages is the simplest case, which is why anonymous pages were supported first. These pages have no connection to any file on disk, so there are relatively few hazards involved with changing their behavior. File-backed pages bring more complexity, though, and a number of places where guard pages could cause problems. Stoakes goes through the list in detail in the patch posting.

For example, readahead is an important part of maintaining performance when a process is working sequentially through a file. As that process reads some data from a file, the kernel can guess that the process will go on to request the following data in the file in the near future. By initiating a read operation before user space gets around to asking for the data, the kernel can ensure that this data is present (or at least on its way) when the request arrives. The presence of a guard page will stop readahead cold at that point, since the page has been marked inaccessible. As Stoakes notes, this should not be a problem, since it would be unusual for a process to map a file, place a guard page, then try to read through that page.

Similar complications arise in other situations. The kernel will often try to "fault around" a page that has been faulted in, under the assumption that nearby data will be of interest; guard pages will prevent that as well. If a file is truncated, the removed portion may include guard pages, but the guard pages themselves will remain in place. And so on; in each case, Stoakes has ensured that the kernel's operation will be correct and make sense.

There are still a couple of exceptions, though, one of which was known about before the patches were posted, while the other was a surprise. The known issue is that guard pages cannot be placed in memory areas that have been locked into RAM with mlock(). The problem, as Vlastimil Babka pointed out, is that mlock() guarantees that the affected pages will not be kicked out of RAM. Installing a guard page, though, frees any data stored there, which runs counter to the mlock() promise. Stoakes is considering a new operation that would make this data destruction explicit in that case but, as David Hildenbrand said, "mlock is weird" and there are a number of other details that would have to be managed there.

The unexpected issue was raised by Kalesh Singh, who wondered how the presence of guard pages would be represented in /proc/PID/maps and /proc/PID/smaps. These files, which are documented in Documentation/filesystems/proc.rst, describe a process's VMAs in detail. Singh said:

In the field, I've found that many applications read the ranges from /proc/self/[s]maps to determine what they can access (usually related to obfuscation techniques). If they don't know of the guard regions it would cause them to crash; I think that we'll need similar entries to PROT_NONE (---p) for these, and generally to maintain consistency between the behavior and what is being said from /proc/*/[s]maps.

It seems that banking apps running on Android are known for this sort of behavior and could run into trouble if guard pages are installed — which is something that the Android runtime might well want to do as a general hardening measure. Since those apps already read the indicated /proc files, Singh thought that would be a logical place to indicate the presence of the guard pages.

This request took Stoakes by surprise, since he thought the topic had been discussed previously and the situation understood. That situation is that, since those files describe VMAs, they are not a suitable place to put information about guard pages which, by design, do not have their own VMAs. Hildenbrand quickly suggested that a bit in /proc/PID/pagemap, which provides page-level data now, would be the best way to export that information to user space. The conversation nonetheless became a little tense, seemingly mostly as a result of misunderstandings rather than true disagreement.

In the end, though, it was agreed that pagemap was the right place for this information. Suren Baghdasaryan eventually joined the conversation, saying that some work would be needed to make this information available to apps in the Android system, but that he would start on that project. Apologies and thanks were shared around, and Stoakes said that he would go ahead and implement the kernel side of the pagemap solution.
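For readers unfamiliar with pagemap, the following sketch shows how a process might query it for one of its own addresses. The article does not say which bit was chosen for guard regions; the value 58 used below is an assumption based on the kernel's pagemap documentation as of 6.15 and should be checked against the running kernel. The helper names are hypothetical. Each pagemap entry is a 64-bit word indexed by virtual page number, with bit 63 indicating that the page is present in RAM.

```c
#include <fcntl.h>
#include <stdint.h>
#include <unistd.h>

#define PM_PRESENT_BIT      63  /* page is present in RAM */
#define PM_GUARD_REGION_BIT 58  /* ASSUMPTION: guard-region bit, 6.15+ */

/* Hypothetical helper: fetch the 64-bit pagemap entry covering addr.
 * Returns 0 on success, -1 if pagemap cannot be opened or read. */
int read_pagemap_entry(const void *addr, uint64_t *entry)
{
    uintptr_t page = (uintptr_t)sysconf(_SC_PAGESIZE);
    off_t off = (off_t)((uintptr_t)addr / page) * (off_t)sizeof(uint64_t);
    ssize_t n;
    int fd = open("/proc/self/pagemap", O_RDONLY);

    if (fd < 0)
        return -1;
    n = pread(fd, entry, sizeof *entry, off);
    close(fd);
    return n == (ssize_t)sizeof *entry ? 0 : -1;
}

/* Hypothetical helper: 1 if addr lies in a guard region, 0 if not,
 * -1 if the answer could not be determined. On kernels without the
 * pagemap change, the bit simply reads as zero. */
int is_guard_page(const void *addr)
{
    uint64_t entry;

    if (read_pagemap_entry(addr, &entry) != 0)
        return -1;
    return (int)((entry >> PM_GUARD_REGION_BIT) & 1);
}
```

A scanner that today walks the ranges listed in /proc/self/maps could, under these assumptions, call is_guard_page() on each page before touching it and skip those for which it returns 1.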

With that issue seemingly resolved, there do not appear to be any serious obstacles to this feature heading toward the mainline in the near future. The patch series (minus the pagemap changes) is sitting in linux-next now and could conceivably go upstream as soon as the 6.15 merge window. That should result in easier and cheaper user-space hardening, which seems worth the trouble.

Index entries for this article
Kernel	System calls/madvise()

Thanks for pointing MADV_GUARD_*

Posted Mar 3, 2025 22:00 UTC (Mon) by wtarreau (subscriber, #51152) [Link]

Thanks for pointing out MADV_GUARD_REMOVE/MADV_GUARD_INSTALL, I was making my own guard pages (like anyone, I guess) and indeed it's a mess which triples the number of VMAs. I'll have a look, that can be very helpful, even if it's only supported since 6.13.

more like this please :)

Posted Mar 3, 2025 22:05 UTC (Mon) by jokeyrhyme (guest, #136576) [Link]

yay, i'd like more stories like this please: of kernel arguments that end in mutual understanding and forward progress :)

What happened to “we don't break the userspace” idea?

Posted Mar 3, 2025 22:06 UTC (Mon) by khim (subscriber, #9252) [Link] (13 responses)

We don't know what exactly banking apps need, but we know what they want: they just want to find out where in the address space lies the code of system libraries… to scan that code. Some would want to scan data segment, too, but most only care about code.

I don't think they even think in terms of VMAs or pagetables… they just need list of addresses they can safely peek into without triggering SIGSEGV… and currently it can be readily found in /proc/PID/maps.

This patch definitely breaks that API.

P.S. Of course on ARM64 there is an additional twist to all that madness: because libraries are normally mapped execute-only on Android, not only do these apps need to find these regions via /proc/PID/maps… they also need to make them read+execute (for investigation purposes) and then make them execute-only again (if they are courteous… many leave mappings in the read+execute state). I wonder what will happen when all these hardening techniques meet in one place, though.

What happened to “we don't break the userspace” idea?

Posted Mar 4, 2025 1:01 UTC (Tue) by WolfWings (subscriber, #56790) [Link] (4 responses)

I mean... Linux "breaks" userspace when it's acting like a virus or malware, and the way many banking apps behave is very much crossing that line up to and including coming into direct conflict with the sandboxing Android itself attempts to enforce in newer versions at times so suddenly "Please use your browser if you have an Android XYZ device or newer." pops up or the app just silently throws a browser view up instead for you.

There's quite a few that won't even load if you have Developer mode enabled on Android, doesn't matter if you're using it just to keep the screen on, or to override bluetooth versions to work with an older car radio, they just straight refuse to load.

What happened to “we don't break the userspace” idea?

Posted Mar 4, 2025 3:50 UTC (Tue) by wtarreau (subscriber, #51152) [Link] (3 responses)

Well, it's used as well for debugging, or by applications trying to perform introspection in back traces etc.

We're still missing a portable way to attempt a safe kernel-assisted memory copy from one area to another that would simply return EFAULT when either area is not accessible. I'm using some hacks using syscalls when I need to do that but that's ugly. I suspect one could also use vmsplice() to move the data into a pipe then from it, though I have not tried.

What happened to “we don't break the userspace” idea?

Posted Mar 4, 2025 8:41 UTC (Tue) by fw (subscriber, #26023) [Link]

Ideally, there would be routines in the vDSO we could use, and the kernel would just hide the faults (like it does in kernel space). Using process_vm_readv as a memcpy replacement seems a bit over the top.

What happened to “we don't break the userspace” idea?

Posted Mar 4, 2025 9:38 UTC (Tue) by khim (subscriber, #9252) [Link] (1 responses)

> I'm using some hacks using syscalls when I need to do that but that's ugly.

Well… using write for read looks a bit… quaint, but works well in practice.

> We're still missing a portable way to attempt a safe kernel-assisted memory copy from one area to another that would simply return EFAULT when either area is not accessible.

pipe, then fork with one side reading with write syscall (specify memory argument that you want to look into as buffer, pipe as target, kernel will return EFAULT if memory can not be read) and the other getting information from pipe… this trick is decades old and portable (even if I'm not sure how portable, but it certainly works fine with very old versions of Linux), but don't see why would it stop working… because of fork ?

What happened to “we don't break the userspace” idea?

Posted Mar 5, 2025 4:36 UTC (Wed) by wtarreau (subscriber, #51152) [Link]

I totally agree, that's just what I mean by "ugly". Having to use a pipe + its 2 FDs + the double-copy. We could imagine having either mprotect(PROT_CHECK|PROT_READ) that would only return whether or not PROT_READ is present on all the area, or madvise(MADV_ZEROFILL) which would indicate that if not mapped, the area behaves as /dev/zero, or for threaded environments where this would not be safe, just a single syscall to perform a memcpy().

What happened to “we don't break the userspace” idea?

Posted Mar 4, 2025 9:28 UTC (Tue) by vbabka (subscriber, #91706) [Link]

Note the kernel does not break existing userspace just by introducing the new API.

But yes, the new API gives you better efficiency (fewer VMAs) with the tradeoff for /proc/pid/(s)maps visibility, and having to deal with faults when you would want to scan the memory ranges and don't know where the guard pages are.

So yes in practice that makes it harder to use the new APIs when some part of the userspace (i.e. libc or the android Zygote mechanism) would switch from PROT_NONE guard areas to the new functionality and thus break other parts of the userspace. But the kernel doesn't change anything to the existing userspace unless it opts-in to the new functionality, so "we don't break the userspace" is not affected here.

What happened to “we don't break the userspace” idea?

Posted Mar 4, 2025 9:56 UTC (Tue) by ljsloz (subscriber, #158382) [Link] (6 responses)

> they just need list of addresses they can safely peek into without triggering SIGSEGV

There's a series of assumptions being made about executable code being present and immutable, none of which are 'API'.

It may be taken to probably be the case that it's fine, but it's in essence assuming internal implementation details.

Note that executable segments are very unlikely to be reclaimed, but they might be, at which point your underlying file system may do strange things on fault (e.g. network fs etc.). It may be unlikely, certainly in an android case, but if one is going to call this an implicit 'API' you really need some solid basis to do so.

What makes this more likely are for instance, obfuscation techniques and JIT which may result in 'strange' executable code relocation.

Speaking from a philosophical standpoint, PROT_NONE is not a contract that guarantees 'hey this is the only means by which a guard region can be implemented', it is only, simply, a VMA that guarantees, at least right now, that accesses will cause a signal to arise.

So I think right now these apps are making assumptions about internal implementation details, imagining implied contracts, which happen to be the case (at least _most_ of the time) right now, but, should implementers _choose_ to use this API, will no longer be in the future.

So speaking _generally_ about /proc/$pid/maps, ranges shown there:

- May SIGBUS.
- May fault causing file systems to possibly do strange things (they have custom hooks), that could in theory result in SIGSEGV or do other broken things....
- May trigger a uffd fault, where a broken userland app may cause an eternal sleep.
- /proc/$pid/maps is racy, and you may see things out of order as you read left to right if there are aggregate (in userland) operations being performed.
- May not exist any more, people may unmap/remap at any time.

There is absolutely emphatically no guarantee you will not receive a signal (or brokenness) by accessing regions seen in /proc/$pid/maps. No such guarantee is documented anywhere, and cannot reasonably be relied upon by userland.

Linux isn't windows of yore, the commitment to not breaking userspace has very well been established as 'not breaking _reasonable_ userland if and only if there are users which actually specifically rely upon said reasonable interfaces'.

The discussion on-list was that all of this was made abundantly clear throughout the implementation of this feature, and it was agreed that there was some miscommunication that led to this issue not being raised.

But equally, it was agreed that it would be correct to instead of having the banking apps etc. simply rely upon _assumptions_ via /proc/$pid/maps, they could, instead, make use of an interface that explicitly provided this information. The provision of guard region information in /proc/$pid/pagemap provides the means for this. This should solve things for this very, very specific and unusual use case.

Additionally I went to great lengths to try to find whatever means by which we could find to resolve this - since this is all moot, given this is a shipped feature which again - I emphasise does NOT break the 'do not break userland' concept, nor does it break any existing API.

As @vbabka points out also, what is very clear is - this feature is opt-in. Nobody _has_ to use it, and continuing to use linux as-is will have no impact whatsoever with this feature present. So again, no API break, no never-break-userspace-break.

The benefits of this feature are huge, as the kind of kernel memory that is used upon VMA proliferation is pure memory pressure - it cannot be reclaimed or migrated. At scale this can really, really add up.

Also another point raised in the discussion is at no point could this feature exist AND something appear in /proc/$pid/maps. The feature emphatically requires the separation of page table-induced faulting behaviour vs. VMA metadata state. /proc/$pid/maps does not traverse page tables, so cannot obtain this information. It is expensive to do so. Adjustment of VMA metadata to show it here would cause VMA merge failure and thus render the feature useless.

In the end we all reached a satisfactory agreement upon how to move forward sensibly :)

People are very very keen to jump to 'oh the kernel broke userspace!' very quickly, but often things are more subtle and nuanced than they first appear. In this case it's understandable, but I would respectfully suggest you are mistaken in that assertion.

What happened to “we don't break the userspace” idea?

Posted Mar 4, 2025 11:04 UTC (Tue) by PeeWee (subscriber, #175777) [Link] (5 responses)

> As @vbabka points out also, what is very clear is - this feature is opt-in. Nobody _has_ to use it, and continuing to use linux as-is will have no impact whatsoever with this feature present. So again, no API break, no never-break-userspace-break.

But what happens to such banking apps if they try to scan memory areas of applications that did opt-in to this? Wouldn't they then do <funky stuff> because their - quite questionable - assumptions don't hold anymore? Say, libc opts in and such an unmodified banking app wants to scan its memory, wouldn't it trip over this? So they don't need to opt-in to be affected by this, or am I missing/misunderstanding something here?

On a related note, I believe such banking apps should not exist to begin with IMHO, because they essentially break 2FA and try to work around that by making sure no "hacker tools" are present by using this (and other?) technique. It used to be that one was strongly discouraged - by those very banks mind you - from using the same mobile device for entering transaction data AND receiving things like (transaction bound) TANs to validate said transaction, because a compromised device is able to manipulate both the transaction and the TAN that is only valid for this one transaction. I am a bit fuzzy on the exact details but that's the gist of it; TL;DR: if you can simply enter a transaction and don't have to do an actual additional validation (TAN) there is no 2FA and thus no guarantee that you are doing the transaction you actually want instead of the one the baddies want you to commit.

What I am trying to say is that I think this should NOT be considered "reasonable userspace" and (maybe) SHOULD be broken, on purpose even. The kernel should definitely NOT accommodate such onerous app behaviour IMHO.

What happened to “we don't break the userspace” idea?

Posted Mar 4, 2025 12:01 UTC (Tue) by tux3 (subscriber, #101245) [Link]

>Say, libc opts in and such an unmodified banking app wants to scan its memory, wouldn't it trip over this?

I think there's a reasonable interpretation where this is giving libc the tools, and libc can do something sensible with the feature without necessarily breaking everything.
The kernel is giving userspace new APIs, but not breaking any pre-existing code; taking an old system and installing this new kernel will not by itself break the dodgy memory scanning code.

Concretely, I think for Android libc could reasonably gate this on Android API level ("if your app declares targetSdkVersion >= X, libc will make use of guard pages"). Every so often, the Android team increases the minimum SDK version required on their play store. So they will eventually be able to turn on guard pages unconditionally, but without taking the authors of innocent m̶a̶l̶w̶a̶r̶e anti-debug obfuscation features by surprise.

What happened to “we don't break the userspace” idea?

Posted Mar 6, 2025 0:08 UTC (Thu) by NYKevin (subscriber, #129325) [Link] (3 responses)

In my book, these are debugging APIs. They're properly used to answer questions like "why am I segfaulting here?" You're not going to break on some existing app unless it has been compiled against (e.g.) a new libc that calls into the new API, at which point the answer is simple - update your gdb (or other debugger) when you update your libc.

But even if it does break, it's not much of a breakage anyway. You just get a less-useful "ptrace said no" error message than you would otherwise. It doesn't prevent you from doing any of the things that you would otherwise be able to do.

What happened to “we don't break the userspace” idea?

Posted Mar 6, 2025 11:58 UTC (Thu) by PeeWee (subscriber, #175777) [Link] (2 responses)

But besides debugging I was thinking about those banking apps. What they are doing does not fit the definition of debugging IMHO. I'd call it snooping, given the context that they try to see if some "evildoer" is present. I don't know enough about the matter to ascertain if a shared lib could transparently opt-in to this, i.e. no ABI breakage. But if that's the case then those banking apps and suchlike would be affected without doing anything, i.e. without opting in. I am also not quite sure if this is the correct example, since those apps seem to snoop in all memory regions they can read, even the ones totally unrelated to their own code. But I also don't know if that is even possible/allowed, as in: can any program read any memory region, sans kernel memory? I only have a vague memory of once reading about keystores such as Seahorse needing to go the extra mile to not expose their secrets to just anybody reading arbitrary memory regions, which makes me think that, by default, any memory is fair game for reading unless precautions are taken. And that would suggest that unrelated programs can be affected. But it is also more of an academic thought, since I believe that such programs (not debuggers) are employing malware practices.

What happened to “we don't break the userspace” idea?

Posted Mar 9, 2025 0:51 UTC (Sun) by NYKevin (subscriber, #129325) [Link] (1 responses)

The default rule, to my understanding, is roughly as follows:

* root can ptrace anybody.
* You can ptrace your own processes. Or, to be more pedantically correct, processes running with the same UID can ptrace each other. (I have no idea if that is real UID, EUID, or some other UID-like thing entirely.)
* Nobody else can ptrace anything, unless they have a special capability.
* I imagine there might be a sysctl knob or something that applies further restrictions (such as turning off non-root ptracing altogether), but I don't know if such an interface really exists.

My position is that none of this actually matters. What matters is that ptracing is a tool for developers to figure out why their app is broken. It is not a security mechanism. It is not, in fact, intended for random apps to ptrace each other just because they feel like it, or because somebody with the word "compliance" in their job title has decided that it's a good idea.

When you go around poking your nose into somebody else's memory, at runtime, in production, on real hardware that is owned by a real user, any breakage is entirely your own problem. Nobody ever promised that you could do that, and there are numerous hardening measures that can trivially break it in one way or another (for example, the user could put you or the other app behind a container, or even just a separate UID), plus you have to consider more mundane problems like userspace ASLR, static linking and LTO, no debug symbols, and so on. The whole idea is monstrously fragile and it's a miracle if it works at all.

What happened to “we don't break the userspace” idea?

Posted Mar 9, 2025 8:31 UTC (Sun) by Cyberax (✭ supporter ✭, #52523) [Link]

> It is not a security mechanism. It is not, in fact, intended for random apps to ptrace each other just because they feel like it

Bank apps ptrace() _themselves_ to make sure there's nothing unusual injected into their address space.

Naming

Posted Jul 30, 2025 0:56 UTC (Wed) by jepsis (subscriber, #130218) [Link]

These really do sound like memory mines. Glad to see they’ll be mapped properly, as mandated by Conventions to safeguard civilspace.

pagemap change now in -next :)
