Guard pages for file-backed memory
The purpose of a guard page is to prevent buggy (or malicious) code from overrunning a memory region. An inaccessible page placed at the end of a region will cause a segmentation fault should the running process try to read or write to it; well-placed guard pages can trap a number of common buffer overruns and similar problems. Prior to 6.13, though, the only way to put a guard page into a process's address space was to set the protections on one or more pages with mprotect(); that works, but at the cost of creating a new virtual memory area (VMA) to contain the affected page(s). Placing a lot of guard pages will create a lot of VMAs, which can slow down many memory-management functions.
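The VMA cost of the mprotect() approach can be observed from user space. What follows is a minimal sketch, not code from the patch set: the helper names count_vmas() and mprotect_guard() are invented for illustration, and the sketch assumes that each line of /proc/self/maps corresponds to one VMA.

```c
#define _DEFAULT_SOURCE
#include <assert.h>
#include <fcntl.h>
#include <stddef.h>
#include <sys/mman.h>
#include <unistd.h>

/* Count the lines (one per VMA) in /proc/self/maps; returns -1 on error.
 * Raw read() calls are used so that stdio does not allocate memory and
 * change the address space while it is being counted. */
static int count_vmas(void)
{
    char buf[4096];
    ssize_t n;
    int lines = 0;
    int fd = open("/proc/self/maps", O_RDONLY);

    if (fd < 0)
        return -1;
    while ((n = read(fd, buf, sizeof(buf))) > 0)
        for (ssize_t i = 0; i < n; i++)
            if (buf[i] == '\n')
                lines++;
    close(fd);
    return n < 0 ? -1 : lines;
}

/* The pre-6.13 technique: make [addr, addr + len) inaccessible with
 * mprotect(). addr must be page-aligned; the kernel has to split the
 * containing VMA so that the new protections get a VMA of their own. */
static int mprotect_guard(void *addr, size_t len)
{
    assert(len > 0);
    return mprotect(addr, len, PROT_NONE);
}
```

Calling count_vmas() before and after mprotect_guard() on a page in the middle of a larger anonymous mapping should show the count growing, since the original VMA is split around the newly protected page.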
The new guard-page feature addresses this problem by working at the page-table level rather than creating a new VMA. A process can create guard pages with a call to madvise(), requesting the MADV_GUARD_INSTALL operation. The indicated range of memory will be rendered inaccessible; any data that might have been stored there prior to the operation will be deleted. There is an operation (MADV_GUARD_REMOVE) to remove guard pages as well.
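As a rough illustration of the interface described above, the following sketch maps an anonymous region and places a single guard page at its end. The helper names are hypothetical. The fallback definitions of MADV_GUARD_INSTALL and MADV_GUARD_REMOVE (102 and 103) are an assumption taken from the kernel's generic UAPI header, included for systems whose C library headers do not yet define them; they should be checked against the target kernel.

```c
#define _DEFAULT_SOURCE
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <sys/mman.h>
#include <unistd.h>

/* Assumed values, for libc headers that predate the feature. */
#ifndef MADV_GUARD_INSTALL
#define MADV_GUARD_INSTALL 102
#endif
#ifndef MADV_GUARD_REMOVE
#define MADV_GUARD_REMOVE 103
#endif

/* Map `pages` usable anonymous pages followed by one guard page.
 * Returns the mapping, or NULL with errno set; a kernel without
 * guard-page support is expected to fail the madvise() call. */
static void *map_with_trailing_guard(size_t pages)
{
    size_t ps = (size_t)sysconf(_SC_PAGESIZE);
    size_t total = (pages + 1) * ps;
    char *p;

    assert(pages > 0);
    p = mmap(NULL, total, PROT_READ | PROT_WRITE,
             MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED)
        return NULL;
    if (madvise(p + pages * ps, ps, MADV_GUARD_INSTALL) != 0) {
        int saved = errno;

        munmap(p, total);
        errno = saved;
        return NULL;
    }
    return p;
}

/* Turn guard pages in [addr, addr + len) back into ordinary memory. */
static int guard_remove(void *addr, size_t len)
{
    assert(len > 0);
    return madvise(addr, len, MADV_GUARD_REMOVE);
}
```

On a kernel with the feature, touching the final page of the returned region should deliver a segmentation fault, while the number of VMAs stays the same as for an unguarded mapping.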
Placing guard pages in VMAs containing anonymous pages is the simplest case, which is why anonymous pages were supported first. These pages have no connection to any file on disk, so there are relatively few hazards involved with changing their behavior. File-backed pages bring more complexity, though, and a number of places where guard pages could cause problems. Stoakes goes through the list in detail in the patch posting.
For example, readahead is an important part of maintaining performance when a process is working sequentially through a file. As that process reads some data from a file, the kernel can guess that the process will go on to request the following data in the file in the near future. By initiating a read operation before user space gets around to asking for the data, the kernel can ensure that this data is present (or at least on its way) when the request arrives. The presence of a guard page will stop readahead cold at that point, since the page has been marked inaccessible. As Stoakes notes, this should not be a problem, since it would be unusual for a process to map a file, place a guard page, then try to read through that page.
Similar complications arise in other situations. The kernel will often try to "fault around" a page that has been faulted in, under the assumption that nearby data will be of interest; guard pages will prevent that as well. If a file is truncated, the removed portion may include guard pages, but the guard pages themselves will remain in place. And so on; in each case, Stoakes has ensured that the kernel's operation will be correct and make sense.
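From user space, the file-backed case looks the same as the anonymous one; only the mapping differs. The sketch below assumes a kernel that includes the patches under discussion (earlier kernels are expected to reject the request). The function name is hypothetical, and the MADV_GUARD_INSTALL fallback value of 102 is an assumption to verify against the kernel headers.

```c
#define _DEFAULT_SOURCE
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <stddef.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

/* Assumed value, for libc headers that predate the feature. */
#ifndef MADV_GUARD_INSTALL
#define MADV_GUARD_INSTALL 102
#endif

/* Map a whole file read-only and install one guard page at page index
 * guard_idx. Returns the mapping (length stored in *len_out), or NULL
 * with errno set if any step fails. */
static void *map_file_with_guard(const char *path, size_t guard_idx,
                                 size_t *len_out)
{
    size_t ps = (size_t)sysconf(_SC_PAGESIZE);
    struct stat st;
    void *p;
    int saved;
    int fd = open(path, O_RDONLY);

    assert(len_out != NULL);
    if (fd < 0)
        return NULL;
    if (fstat(fd, &st) != 0) {
        saved = errno;
        close(fd);
        errno = saved;
        return NULL;
    }
    if ((size_t)st.st_size < (guard_idx + 1) * ps) {
        close(fd);
        errno = EINVAL;   /* file too short to hold the guard page */
        return NULL;
    }
    p = mmap(NULL, (size_t)st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
    saved = errno;
    close(fd);            /* the mapping keeps its own file reference */
    if (p == MAP_FAILED) {
        errno = saved;
        return NULL;
    }
    if (madvise((char *)p + guard_idx * ps, ps, MADV_GUARD_INSTALL) != 0) {
        saved = errno;
        munmap(p, (size_t)st.st_size);
        errno = saved;
        return NULL;
    }
    *len_out = (size_t)st.st_size;
    return p;
}
```

A process that then reads sequentially through the mapping will fault on reaching the guard page; that is the unusual access pattern that Stoakes argues need not be optimized for.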
There are still a couple of exceptions, though, one of which was known about before the patches were posted, while the other was a surprise. The known issue is that guard pages cannot be placed in memory areas that have been locked into RAM with mlock(). The problem, as Vlastimil Babka pointed out, is that mlock() guarantees that the affected pages will not be kicked out of RAM. Installing a guard page, though, frees any data stored there, which runs counter to the mlock() promise. Stoakes is considering a new operation that would make this data destruction explicit in that case but, as David Hildenbrand said, "mlock is weird" and there are a number of other details that would have to be managed there.
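The mlock() restriction can be probed with a few lines of code. This is only a sketch with a hypothetical helper name; it assumes the MADV_GUARD_INSTALL value of 102 from the kernel's UAPI header, and it reports whatever the running kernel does rather than asserting any particular error code.

```c
#define _DEFAULT_SOURCE
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <sys/mman.h>
#include <unistd.h>

/* Assumed value, for libc headers that predate the feature. */
#ifndef MADV_GUARD_INSTALL
#define MADV_GUARD_INSTALL 102
#endif

/* Try to install a guard page inside an mlock()ed anonymous page.
 * Returns the madvise() result (0 or -1, with errno preserved), or -2
 * if the mapping or the mlock() call could not be set up at all. */
static int guard_in_locked_page(void)
{
    long ps = sysconf(_SC_PAGESIZE);
    void *p;
    int ret, saved;

    assert(ps > 0);
    p = mmap(NULL, (size_t)ps, PROT_READ | PROT_WRITE,
             MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED)
        return -2;
    if (mlock(p, (size_t)ps) != 0) {
        munmap(p, (size_t)ps);
        return -2;
    }
    ret = madvise(p, (size_t)ps, MADV_GUARD_INSTALL);
    saved = errno;
    munmap(p, (size_t)ps);   /* unmapping also drops the lock */
    errno = saved;
    return ret;
}
```

On kernels that behave as described here, the call would be expected to return -1; a return of 0 would indicate that the restriction has since been lifted or handled differently.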
The unexpected issue was raised by Kalesh Singh, who wondered how the presence of guard pages would be represented in /proc/PID/maps and /proc/PID/smaps. These files, which are documented in Documentation/filesystems/proc.rst, describe a process's VMAs in detail. Singh said:
In the field, I've found that many applications read the ranges from /proc/self/[s]maps to determine what they can access (usually related to obfuscation techniques). If they don't know of the guard regions it would cause them to crash; I think that we'll need similar entries to PROT_NONE (---p) for these, and generally to maintain consistency between the behavior and what is being said from /proc/*/[s]maps.
It seems that banking apps running on Android are known for this sort of behavior and could run into trouble if guard pages are installed — which is something that the Android runtime might well want to do as a general hardening measure. Since those apps already read the indicated /proc files, Singh thought that would be a logical place to indicate the presence of the guard pages.
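The pattern Singh describes might look something like the following sketch, in which an application consults /proc/self/maps to decide whether an address can be read. The function name is invented for illustration, and the parsing assumes the usual "start-end perms ..." line format. Since a guard page has no VMA of its own, a check like this one would still report the page as readable.

```c
#define _DEFAULT_SOURCE
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

/* Returns 1 if /proc/self/maps lists addr inside a VMA with read
 * permission, 0 if it does not, and -1 if the file cannot be opened.
 * Guard pages are invisible here: they live in the page tables, not
 * in the VMAs that this file describes. */
static int readable_per_maps(const void *addr)
{
    FILE *f = fopen("/proc/self/maps", "r");
    char line[512];
    int result = 0;

    /* The address is compared as an unsigned long below. */
    assert(sizeof(unsigned long) >= sizeof(void *));
    if (!f)
        return -1;
    while (fgets(line, sizeof(line), f)) {
        unsigned long start, end;
        char perms[5];

        /* Lines that do not begin with "start-end perms" are skipped. */
        if (sscanf(line, "%lx-%lx %4s", &start, &end, perms) != 3)
            continue;
        if ((uintptr_t)addr >= start && (uintptr_t)addr < end) {
            result = (perms[0] == 'r');
            break;
        }
    }
    fclose(f);
    return result;
}
```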
This request took Stoakes by surprise, since he thought the topic had been discussed previously and the situation understood. That situation is that, since those files describe VMAs, they are not a suitable place to put information about guard pages which, by design, do not have their own VMAs. Hildenbrand quickly suggested that a bit in /proc/PID/pagemap, which provides page-level data now, would be the best way to export that information to user space. The conversation nonetheless became a little tense, seemingly mostly as a result of misunderstandings rather than true disagreement.
In the end, though, it was agreed that pagemap was the right place for this information. Suren Baghdasaryan eventually joined the conversation, saying that some work would be needed to make this information available to apps in the Android system, but that he would start on that project. Apologies and thanks were shared around, and Stoakes said that he would go ahead and implement the kernel side of the pagemap solution.
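Once the pagemap work is in place, an application could check for guard pages along these lines. The helper names are hypothetical, and the bit position used for guard regions (bit 58) is an assumption that is not stated in the discussion above; it should be checked against the pagemap documentation for the kernel in use. On kernels without the change, that bit would simply read as zero.

```c
#define _DEFAULT_SOURCE
#define _FILE_OFFSET_BITS 64
#include <assert.h>
#include <fcntl.h>
#include <stdint.h>
#include <sys/types.h>
#include <unistd.h>

#define PM_PRESENT      (1ULL << 63)   /* page is present in RAM */
/* Assumed bit position for guard regions; verify for your kernel. */
#define PM_GUARD_REGION (1ULL << 58)

/* Read the 64-bit pagemap entry for the page containing addr.
 * Returns 0 on success, -1 if the file cannot be opened or read. */
static int pagemap_entry(const void *addr, uint64_t *entry)
{
    long ps = sysconf(_SC_PAGESIZE);
    off_t off;
    ssize_t n;
    int fd;

    assert(ps > 0);
    /* One 8-byte entry per virtual page, indexed by page number. */
    off = (off_t)((uintptr_t)addr / (uintptr_t)ps) * (off_t)sizeof(*entry);
    fd = open("/proc/self/pagemap", O_RDONLY);
    if (fd < 0)
        return -1;
    n = pread(fd, entry, sizeof(*entry), off);
    close(fd);
    return n == (ssize_t)sizeof(*entry) ? 0 : -1;
}

/* Returns 1 if the page holding addr is flagged as a guard region,
 * 0 if it is not, and -1 if pagemap could not be consulted. */
static int is_guard_page(const void *addr)
{
    uint64_t e;

    if (pagemap_entry(addr, &e) != 0)
        return -1;
    return (e & PM_GUARD_REGION) != 0;
}
```

An application that currently walks /proc/self/maps could follow up with a check like is_guard_page() before touching a page it did not map itself.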
With that issue seemingly resolved, there do not appear to be any serious obstacles to this feature heading toward the mainline in the near future. The patch series (minus the pagemap changes) is sitting in linux-next now and could conceivably go upstream as soon as the 6.15 merge window. That should result in easier and cheaper user-space hardening, which seems worth the trouble.
