Checkpoint/restart tries to head towards the mainline [LWN.net]

By Jake Edge
February 25, 2009

In kernel development, there is always tension between the needs of a new feature and the needs of the kernel as a whole. Projects generally want to get their code merged as early as possible, for a variety of reasons, while the rest of the kernel community needs to be comfortable that the feature is sensible, desirable, and, perhaps most importantly, maintainable. The current push for inclusion of a feature to checkpoint and restart processes highlights this tension.

In late January, Oren Laadan posted the latest version of his kernel-based checkpoint and restart code with the notation: "Aiming for -mm". There are many possible uses for checkpoints, but implementing them is an extremely complex problem. Laadan's current version is quite minimal, implementing only a fairly small subset of the features envisioned, but he would like to get the kind of review and testing that goes along with pushing it towards the mainline.

After two weeks without much in the way of comments, another proponent, Dave Hansen, asked what, if anything, was holding the patchset back from -mm inclusion. Andrew Morton replied that he had raised some concerns which were "inconclusively waffled at" a few months back. Morton's opinion carries a fair amount of weight, not least because he runs the targeted tree. He is looking to the future and trying to ensure that the patches make sense:

I am concerned that this implementation is a bit of a toy, and that we don't know what a sufficiently complete implementation will look like. There is a risk that if we merge the toy we either:

a) end up having to merge unacceptably-expensive-to-maintain code to make it a non-toy or

b) decide not to merge the unacceptably-expensive-to-maintain code, leaving us with a toy or

c) simply cannot work out how to implement the missing functionality.

Morton asked for answers to several questions regarding what features are available in the current implementation, as well as information on what needs to be added. He also asked for indications that Laadan and Hansen had some thoughts on the design for required, but not yet implemented, features. In short, he wants to avoid any of the scenarios he outlined. In response to further questions from Ingo Molnar, Hansen outlined some of the shortcomings of the current implementation:

Right now, it is good for very little. An app has to basically be either specifically designed to work, or be pretty puny in its capabilities. Any fds that are open can only be restored if a simple open();lseek(); would have been sufficient to get it back into a good state. The process must be single-threaded. Shared memory, hugetlbfs, VM_NONLINEAR are not supported.
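The "open();lseek();" restriction can be illustrated with a minimal user-space sketch. This is not code from the patchset; the helper names checkpoint_fd() and restore_fd() and the saved-state layout are hypothetical, and the sketch assumes a regular file that still exists at the same path when it is restored:

```python
import os
import tempfile

def checkpoint_fd(fd, path, flags):
    """Record what a reopen needs: path, open flags, and current offset.

    The path and flags are passed in by the caller because this sketch
    does not try to recover them from the descriptor itself.
    """
    return {"path": path, "flags": flags,
            "offset": os.lseek(fd, 0, os.SEEK_CUR)}

def restore_fd(saved):
    """Recreate the descriptor with a plain open() followed by lseek()."""
    fd = os.open(saved["path"], saved["flags"])
    os.lseek(fd, saved["offset"], os.SEEK_SET)
    return fd

def demo():
    """Read part of a file, save the descriptor state, close, and resume."""
    with tempfile.TemporaryDirectory() as tmpdir:
        path = os.path.join(tmpdir, "data.txt")
        with open(path, "wb") as f:
            f.write(b"hello checkpoint")

        fd = os.open(path, os.O_RDONLY)
        os.read(fd, 6)                      # consume "hello "
        saved = checkpoint_fd(fd, path, os.O_RDONLY)
        os.close(fd)                        # stands in for the process going away

        new_fd = restore_fd(saved)
        rest = os.read(new_fd, 100)         # continues from the saved offset
        os.close(new_fd)
        return saved["offset"], rest
```

Anything whose state is not captured by a path and an offset (a socket, a deleted file, a descriptor shared with another process) falls outside what this approach can bring back, which is the limitation Hansen describes.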

Hansen also had a more detailed answer to Morton's questions, which showed a lot of work still to be done. The current code only works for x86 architectures, for example, and only for basic file types, essentially just pipes and regular files. He likened the progress of checkpoint/restart to that of kernel scalability; it is a work in progress, not something that will ever be complete:

We intend to make core kernel functionality checkpointable first. We'll move outwards from there as we (and our users) deem things important, but we'll certainly never be done.

One of the main concerns is not that there is a lot still to be done, but that there may be lurking problems that either don't have solutions or can only be solved by very intrusive kernel changes. Matt Mackall looked at Hansen's list of additional features needing to be implemented and summed up the worries this way:

I think the real questions is: where are the dragons hiding? Some of these are known to be hard. And some of them are critical [for] checkpointing typical applications. If you have plans or theories for implementing all of the above, then great. But this list doesn't really give any sense of whether we should be scared of what lurks behind those doors.

There is, however, a free out-of-tree implementation of checkpoint/restart in the OpenVZ project. OpenVZ is a virtualization scheme using its own implementation of containers (different from that in more recent kernels) that supports checkpointing and migrating those containers. But it is a large patch; Morton looked at it several years ago and concluded that it would not be welcome in the mainline. Hansen sees OpenVZ as a useful example, but "with all the input from the OpenVZ folks and at least three other projects, I bet we can come up with something better".

An incremental approach to implementing checkpoints is reasonable, but Morton is concerned that by merging the current patches, the kernel developers will be committed to merging something that looks a lot like—and is as intrusive as—the OpenVZ patches. Molnar is more upbeat: he sees it as an important feature without "many long-term dragons". He does see one potential problem area in the incremental approach, though:

There is _one_ interim runtime cost: the "can we checkpoint or not" decision that the kernel has to make while the feature is not complete.

That, if this feature takes off, is just a short-term worry - as basically everything will be checkpointable in the long run.

That is one of the technical issues still to be resolved with the current patchset: how does a process programmatically determine whether it is able to be checkpointed? If the process has performed some action that creates state the running kernel cannot checkpoint, there needs to be a way to detect that. Molnar suggested overloading the LSM security checks such that performing those actions sets a one-way "not checkpointable" flag as appropriate. That flag could be checked by the process or by some other program that was interested. Overloading the LSM hooks is not completely uncontroversial, but it does hook the kernel in many of the right places; adding an additional call to those same places for checkpointing is not likely to fly.

There was also some question about whether the "not checkpointable" flag needs to be a one-way flag, as it could be cleared once the process has returned to a state that is able to be checkpointed. Molnar argued that the one-way flag is desirable: "uncheckpointable functionality should be as painful as possible, to make sure it's getting fixed". Users who run into problems checkpointing their applications will then apply pressure to get the requisite state added to checkpoints. As a starting point, Hansen has posted a patch that would add a one-way flag based on the kinds of files a process had opened.
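The one-way behavior can be sketched as a toy model in user space. This is an illustration of the idea only, not Hansen's patch or any kernel interface; the Task class, its method names, and the SUPPORTED_KINDS tuple are all hypothetical, with the supported kinds taken from the article's description of the current code (pipes and regular files):

```python
# File kinds the toy model treats as checkpointable (an assumption
# based on the article: "essentially just pipes and regular files").
SUPPORTED_KINDS = ("regular", "pipe")

class Task:
    """Toy stand-in for a process carrying a one-way flag."""

    def __init__(self):
        self.may_checkpoint = True
        self.reason = None          # first action that cleared the flag
        self.open_files = {}

    def note_open(self, name, kind):
        """Called where a hook would fire when a file is opened."""
        self.open_files[name] = kind
        if kind not in SUPPORTED_KINDS and self.may_checkpoint:
            self.may_checkpoint = False
            self.reason = "opened unsupported %s %r" % (kind, name)

    def note_close(self, name):
        """Closing never sets the flag back: that is what makes it one-way."""
        self.open_files.pop(name, None)

def demo():
    task = Task()
    task.note_open("data.txt", "regular")
    before = task.may_checkpoint
    task.note_open("conn", "socket")
    during = task.may_checkpoint
    task.note_close("conn")
    after = task.may_checkpoint     # still False even though the socket is gone
    return before, during, after, task.reason
```

A resettable variant would recompute the flag in note_close() from the files that remain open. Molnar's argument is for the stricter version shown here, so that any use of uncheckpointable functionality stays visible to the user.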

Checkpoints are a useful feature that could be used for migrating processes to different machines, protecting long-running processes against kernel crashes or upgrades, system hibernation, and more. It is a difficult problem that may never really be completely finished and it touches a lot of core kernel code. For these reasons, caution is certainly justified, but one gets the sense that some kind of checkpoint/restart feature will eventually make its way into the mainline. Whether it is Laadan's version, something derived from OpenVZ, or some other mechanism entirely remains to be seen.

Index entries for this article
Kernel	Checkpointing
Kernel	Containers

Can't checkpointing be done in user space?

Posted Feb 26, 2009 11:19 UTC (Thu) by epa (subscriber, #39769) [Link] (6 responses)

Why does checkpointing need kernel support at all? A process is able to dump its core to a file, along with details of file descriptors it has open. In general, any action a process took to get into a particular state, it did by calling normal kernel APIs - so those same APIs should be usable to restore the saved state later. There might be some missing kernel interface to query the current state ('what file descriptors do I have?') but adding those as needed seems fairly straightforward and not intrusive.

Why exactly is kernel support needed?

Can't checkpointing be done in user space?

Posted Feb 26, 2009 16:16 UTC (Thu) by lwithers (guest, #23379) [Link] (2 responses)

Perhaps of further interest is the description of Crash-only software by Valerie Henson (now Aurora). Software written with this paradigm in mind, combined with something like daemonitor or OpenRC tricks, can be used to build a system with a certain amount of resilience.

Can't checkpointing be done in user space?

Posted Feb 26, 2009 20:13 UTC (Thu) by nix (subscriber, #2304) [Link] (1 responses)

Well, yes, but one application I'd like to see (when they get
suspension/resumption of network connections working) is the ability to
suspend/resume a system which is displaying X apps some of which are
running on another machine, without using some sort of proxying layer like
xpra.

It's likely to be tricky...

Can't checkpointing be done in user space?

Posted Feb 27, 2009 3:32 UTC (Fri) by spotter (guest, #12199) [Link]

see the original zap paper

http://www.ncl.cs.columbia.edu/publications/osdi2002_zap.pdf

section 4.5

Can't checkpointing be done in user space?

Posted Feb 26, 2009 16:46 UTC (Thu) by spotter (guest, #12199) [Link]

see Oren's paper

http://www.ncl.cs.columbia.edu/publications/usenix2007_fo...

Can't checkpointing be done in user space?

Posted Feb 27, 2009 18:26 UTC (Fri) by giraffedata (guest, #1954) [Link] (1 responses)

A user space program can checkpoint itself. Many do. This project is about checkpointing an application that wasn't designed for checkpointing, which I suppose saves the enormous engineering effort of building application-specific checkpointing into all the applications.

Can't checkpointing be done in user space?

Posted Mar 6, 2009 8:14 UTC (Fri) by TRS-80 (guest, #1804) [Link]

CryoPID is a user-space application that can checkpoint other processes without any special support. It doesn't work for all cases, although it's good enough for the "D'oh! I forgot to start this application inside screen(1)" use-case.
