
The persistent memory "I know what I'm doing" flag

By Jonathan Corbet
March 2, 2016
As was described in Neil Brown's article last week, developers working on persistent memory appear to be converging on a solution for the fsync() system call. A working fsync() will enable applications to ensure that the data they have written is safely stored to persistent memory; importantly, applications that have been written correctly for POSIX filesystems in general will work correctly on persistent memory without the need to be aware of the difference. But some developers want to write code that is specific to persistent memory as a way of maximizing performance. A patch catering to the needs of those developers inspired a lengthy conversation on how to best ensure that data written to persistent memory is not lost, and how development in this area should proceed in general.

The problem with the emerging fsync() solution, according to Boaz Harrosh, is that it requires the kernel to maintain a radix tree of all pages that might have dirty lines of data in the CPU caches. If an application has been written with persistent memory in mind, though, it can avoid leaving data in the caches. That data can be explicitly flushed by the application or, as an alternative, non-temporal writes can be used to bypass the CPU caches entirely. If the application is using these techniques, Boaz said, there is no need for the kernel to flush cache lines for the relevant persistent memory, so it can avoid the wasted overhead of maintaining the radix tree.
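The two cache-management techniques described here can be sketched in user space with the x86 intrinsics involved. This is a minimal illustration only; `flush_range()` and `memcpy_nt()` are hypothetical helper names, not kernel or libc interfaces:

```c
#include <emmintrin.h>   /* _mm_clflush, _mm_sfence, _mm_stream_si32 */
#include <stddef.h>
#include <stdint.h>

#define CACHE_LINE 64

/* Explicitly flush every cache line covering [addr, addr + len). */
static void flush_range(const void *addr, size_t len)
{
	uintptr_t p = (uintptr_t)addr & ~(uintptr_t)(CACHE_LINE - 1);
	uintptr_t end = (uintptr_t)addr + len;

	for (; p < end; p += CACHE_LINE)
		_mm_clflush((const void *)p);
	_mm_sfence();	/* order the flushes before any later stores */
}

/* Alternative: copy with non-temporal stores, bypassing the CPU caches
 * entirely; dst must be 4-byte aligned and len a multiple of 4. */
static void memcpy_nt(void *dst, const void *src, size_t len)
{
	int *d = dst;
	const int *s = src;

	for (size_t i = 0; i < len / 4; i++)
		_mm_stream_si32(&d[i], s[i]);
	_mm_sfence();	/* make the streamed stores globally visible */
}
```

An application using either technique on its persistent-memory mappings is what the proposed flag would let it declare to the kernel.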

The kernel currently has no way of knowing that an application is taking care of its own cache-management needs, though. Fixing that is the goal of this patch set posted by Boaz in February. It adds a new flag for the mmap() system call named MAP_PMEM_AWARE. If an application maps a file stored in persistent memory with this flag, and the filesystem supports the DAX direct-access mechanism, the kernel can assume that the application will deal with cache management and, as a result, the kernel need not track pages with potentially dirty cache lines. Boaz claims considerably improved performance when running with this patch.
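For illustration, a mapping made with the proposed flag might look like the sketch below. Note that MAP_PMEM_AWARE comes from Boaz's unmerged patch set, so it is defined here as a no-op placeholder, and `map_pmem_file()` is a made-up helper:

```c
#include <fcntl.h>
#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>

#ifndef MAP_PMEM_AWARE
/* Placeholder: the flag from the patch set never landed upstream,
 * so it is defined as a no-op here purely for illustration. */
#define MAP_PMEM_AWARE 0
#endif

/* Map 'path' (assumed to live on a DAX-capable filesystem), promising
 * the kernel that this process will manage its own CPU cache flushing. */
static void *map_pmem_file(const char *path, size_t len)
{
	int fd = open(path, O_RDWR);
	void *p;

	if (fd < 0)
		return MAP_FAILED;

	p = mmap(NULL, len, PROT_READ | PROT_WRITE,
		 MAP_SHARED | MAP_PMEM_AWARE, fd, 0);
	close(fd);	/* the mapping keeps its own reference to the file */
	return p;
}
```

With the flag set (and DAX in use), the kernel would skip the radix-tree tracking of potentially dirty pages for this mapping.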

Some concerns

It is fair to say that this patch was not universally acclaimed. There were a number of objections to providing this kind of functionality, the first being that an application that does its own cache management will still have to make calls to fsync() (or msync()) to ensure that its data is truly persistent. That is because this data does not stand alone; it is stored within a filesystem, and the application has no knowledge of whether there is any filesystem metadata that must also be flushed out to be sure that the data can be accessed. The only way to be sure that the metadata is consistent on disk is to call fsync(), just like applications dealing with data on more traditional storage media.

In theory, an application can allocate and write an entire file, then call fsync() to get it all to persistent storage with the goal that, afterward, it can rewrite the data within the file without causing any further metadata changes (other than timestamps, which are not important for retrieving that data). But filesystems can be performing actions like data deduplication, delayed allocation, or, as Christoph Hellwig pointed out, copy-on-write operations. So it is true that the only way to be sure that data is truly, safely persistent is to call fsync(); the MAP_PMEM_AWARE flag would not eliminate that requirement.
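That allocate-everything-then-sync pattern can be sketched as follows; `prealloc_and_sync()` is a hypothetical helper, and the comments mark where the copy-on-write and deduplication caveat bites:

```c
#include <fcntl.h>
#include <unistd.h>

/* Sketch of the pattern described above: allocate the whole file up
 * front and fsync() once, hoping later in-place rewrites need no
 * further metadata changes.  prealloc_and_sync() is a made-up helper. */
static int prealloc_and_sync(const char *path, off_t len)
{
	int fd = open(path, O_CREAT | O_RDWR, 0644);

	if (fd < 0)
		return -1;

	/* Allocate every block now, then push data and metadata out... */
	if (posix_fallocate(fd, 0, len) != 0 || fsync(fd) != 0) {
		close(fd);
		return -1;
	}
	/* ...but on a COW or deduplicating filesystem, an overwrite can
	 * still allocate fresh blocks, so each later rewrite still needs
	 * its own fsync() before the data is truly safe. */
	return fd;
}
```

The point of the objection is precisely that the closing comment above always applies: no flag can make that final fsync() optional.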

Boaz protested that eliminating the need to call fsync() was never the purpose of the patch set. Instead, it aims to make those calls much faster; other overhead, especially associated with page faults in areas backed by persistent memory, would also be significantly reduced. Unfortunately, the worries about MAP_PMEM_AWARE didn't end there.

For example, consider the interaction between applications using this flag and others that are not aware of persistent memory. Such applications (which might be something as simple as mv or a backup utility) may also create metadata changes needing flushing, and they may create dirty cache lines in the persistent-memory area that the "aware" application knows nothing about. Experience with direct I/O has shown that such interactions can be subtle, difficult to notice, and impossible to fix.

Perhaps the biggest worry, though, is that application developers will rush out and proclaim that their code is "aware" without actually understanding everything they need to do to guarantee the integrity of their data. As Dave Chinner put it: "Almost any app developer that says they understand how filesystems provide data integrity is almost always completely wrong." If the kernel provides these developers with an "I know what I'm doing" flag, the reasoning goes, they will soon write code that demonstrates the lack of that knowledge — to their users' detriment.

One might just say that any such applications are buggy; they will either be fixed or replaced with something better. But Dave made it clear that he didn't see things happening that way:

History tells us otherwise. Users always blame the filesystem first, and then app developers will refuse to fix their applications because it would either make their app slow or they think it's a filesystem problem to solve because they tested on some other filesystem and it didn't display that behaviour. The result is we end up working around such problems in the filesystem so that users don't end up losing data due to shit applications.

The same will happen here - filesystems will end up ignoring this special "I know what I'm doing" flag because the vast majority of app developers don't know enough to even realise that they don't know what they are doing.

That last point is key: filesystem developers, in their own defense, will end up ignoring this new flag because the alternative is to face the wrath of users who blame them for their lost data. The ext4 data-loss wars in 2009 have left some lasting scars; filesystem developers do not wish to find themselves in that position again.

Data integrity first

Developers had one more reason to oppose this patch — one that had little to do with the specifics of the patch itself. DAX and its associated persistent-memory functionality are still new, and problems are still being found with them. Dave made the claim that the core problem of safely storing data via DAX has not yet been solved, so it is not appropriate to be looking at optimizations. For now, the focus has to be on making things reliable; after that, there will be time to look at where the performance issues lie and do some optimization work.

Failure to solve the correctness issues first, he said, will just lead to more problems as more features are added. He drew a parallel with Btrfs which, he said, didn't solve the "known hard problems" early and, as a result, is stuck with "entrenched deficiencies" that are nearly impossible to fix. If those known hard problems are not solved first with DAX, it may well end up in the same situation.

He would also like to see optimization work focused on the general case, instead of on providing opt-out mechanisms for a few programs. Fixing performance issues rather than bypassing them will provide benefits for everybody, a better outcome than just enabling a few applications to implement their own optimized solutions. If, instead, those applications opt out, they will not benefit from core-code improvements and, consequently, those improvements will be less likely to happen.

Pushing back on and delaying work that kernel developers would like to see merged is never a pleasant experience. That work was done for a reason; rejecting it often means that at least some of that work was done in vain, and hard feelings can often result. But experience has shown that resisting work that seems premature or not consistent with long-term goals leads to a better, more maintainable kernel in the long run. The DAX infrastructure is going to have to serve as an important kernel-supported approach to persistent memory for a long time; the community cannot afford to get this one wrong. So there may well be a solid case to be made for conservatism in this area for now.

Index entries for this article
Kernel: DAX
Kernel: Memory management/Nonvolatile memory



The persistent memory "I know what I'm doing" flag

Posted Mar 3, 2016 7:21 UTC (Thu) by Cyberax (✭ supporter ✭, #52523) [Link]

Can we have a MAP_DIRECT_AND_I_MEAN_IT_DAMN flag? It'll be honored by not allowing the page to be mapped anywhere else (perhaps allowing it to appear in other DIRECT mappings) until the mapping is released.

The persistent memory "I know what I'm doing" flag

Posted Mar 4, 2016 9:55 UTC (Fri) by hkario (guest, #94864) [Link] (2 responses)

Speaking of "Almost any app developer that says they understand how filesystems provide data integrity is almost always completely wrong."

Is there a document which says which file system operations are guaranteed to be atomic and which ones will cause a file system flush/barrier before a given sys call returns?

The persistent memory "I know what I'm doing" flag

Posted Mar 4, 2016 11:45 UTC (Fri) by gioele (subscriber, #61675) [Link]

> Speaking of "Almost any app developer that says they understand how filesystems provide data integrity is almost always completely wrong."
>
> Is there a document which says which file system operations are guaranteed to be atomic and which ones will cause a file system flush/barrier before a given sys call returns?

This may be a good starting point: http://danluu.com/file-consistency/

The persistent memory "I know what I'm doing" flag

Posted Mar 11, 2016 10:50 UTC (Fri) by ksandstr (guest, #60862) [Link]

> Is there a document which says which file system operations are guaranteed to be atomic and which ones will cause a file system flush/barrier before a given sys call returns?

None that're normative across the entirety of POSIX. It's MongoDB all the way down.

The persistent memory "I know what I'm doing" flag

Posted Mar 10, 2016 20:21 UTC (Thu) by luto (subscriber, #39314) [Link]

How about MAP_PMEM_WT to map the persistent memory as write-through? PMEM-aware applications can use it correctly. Dumb applications can set it and do it wrong, and they'll run very, very slowly, but they will still work correctly.

Looks like pmem "v1" is half duff

Posted Mar 11, 2016 12:27 UTC (Fri) by ksandstr (guest, #60862) [Link]

IIUC, with regular block devices either there's a cache-aware DMA controller which enforces full consistency for memory-mapped data during write, or data is copied explicitly into a suitable DMA buffer which yields snapshot consistency, and either of these is good enough for fsync(). The interface of persistent-memory devices weakens this arrangement by having PCOMMIT only apply to data that's left the CPU caches, i.e. it requires the CPU to either do NT writes or synchronously flush every applicable dirty cache line to get fsync() equivalence, which means 64 executions[0] of a CLFLUSH-like instruction per ptab-dirty page of persistent memory per affected file.
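That per-page arithmetic can be made concrete with a small sketch; `flush_page()` is a made-up helper, and 4096 / 64 = 64 CLFLUSH executions per dirty page:

```c
#include <emmintrin.h>   /* _mm_clflush, _mm_sfence */

#define PAGE_SIZE  4096
#define CACHE_LINE 64

/* fsync() equivalence for one dirty 4KiB page of persistent memory:
 * PAGE_SIZE / CACHE_LINE = 64 back-to-back CLFLUSH executions.
 * Returns the number of flushes issued. */
static int flush_page(void *page)
{
	char *p = page;
	int lines = PAGE_SIZE / CACHE_LINE;

	for (int i = 0; i < lines; i++)
		_mm_clflush(p + i * CACHE_LINE);
	_mm_sfence();	/* order the flushes against later stores */
	return lines;
}
```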

It appears that first, this "v1" interface is too weak to support POSIX-like writable mmap() with both minimal overhead and the benefits of writeback, which is the main reason to have those things in the first place.

Second, the proposed optimization -- basically, allowing the kernel to spare nearly all of the overhead above at request, violating fsync()'s spec -- has no affordance for catching ill-behaved tasks. Like Dave said, failures stemming from applications not living up to that interface's requirements will cause lossage which flows downhill. So unless there's an approach to beat the blame game off, such as a categorical refusal to deal with durability failures in applications where the flag's been used[2][3], this approach seems unworkable from its interface alone.

Third, even if the optimization were workable, it'd still require userspace to either write its stuff in a buffer and then memcpy_nt() into the persistent mapping, or write the mapping as it likes and then call a flush_pmem_range(). The former is less good than copy-on-write from the second write onward, and the latter removes a chunk of the write-back advantage with every call -- and outside the microbenchmark, those calls will happen more often than necessary "just to be safe". Both have in common a requirement on the program to account for the memory it touches, which is at odds with the memory models of high-level languages.

In conclusion, v1 of persistent memory hardware is rather bad for writable mmaps of persistent memory, and the hacks proposed to make it less so are worse still. Perhaps v2 will fix this in, uh, a semiconductor generation's time.

[0] and that's horrible.[1] Say, if each CLFLUSH sleeps for the entire bus write for dirty cache lines, that's basically 64-byte PIO back-to-back.
[1] though maybe less so compared to PCI overhead with SATA devices.
[2] which is also bad for letting filesystems off the hook for having a silent interface deviation, e.g. an exotic alignment requirement for direct-access mappings
[3] who's willing to validate such an application across all combinations of filesystem and persistent-memory device? raise your hand, the one that's going to press the on/off switch a hojillion times...


Copyright © 2016, Eklektix, Inc.
This article may be redistributed under the terms of the Creative Commons CC BY-SA 4.0 license
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds