The initial kGraft submission [LWN.net]

By Jonathan Corbet
April 30, 2014

Rebooting the system to install a kernel patch is a pain. In many cases, it's worse than that: the system(s) involved may have uptime constraints that rule out reboots at arbitrary times. Users faced with such issues would often welcome a way to get a high-priority patch into the kernel without taking the system down. In response to this need, the Ksplice patch set was first posted in 2008. Ksplice never made it into the mainline, though, and, after its acquisition by Oracle, it has mostly disappeared from the Linux environment. Recently, a few alternative patches offering similar functionality have been announced; now one of them, kGraft, has been posted for consideration.

KGraft is the work of Jiří Kosina and Jiří Slaby, both working at SUSE. The approach they have taken is simpler than Ksplice, and lacks some of the capabilities that Ksplice offers (adding shadow members to structures, for example). On the other hand, the basic kGraft code is only a 600-line patch, and the process of applying a patch is quite a bit more lightweight, with less impact on the system.

KGraft works by replacing entire functions in the kernel. Using the tools supplied with the patch set, a developer can turn a patch into a list of changed functions; the new versions of those functions are then compiled into a separate kernel module. When that module is loaded into the kernel, kGraft takes care of the task of replacing the existing, buggy functions with the new, fixed versions.

That replacement is a bit of a tricky task; performing live surgery on the kernel while it is running is fraught with potential perils. The good news for the kGraft developers is that this problem has already been mostly solved for them. The function tracer (ftrace) subsystem needs to do just this type of live patching, so the ftrace developers have written the code to perform this function, debugged all the weird errors, and taken the hits when things didn't go quite as expected. All the kGraft developers have to do is to use the ftrace machinery — which can already intercept function calls in a running kernel — and use it to replace calls to the buggy functions with calls to the new code.
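The redirection idea can be modelled in a few lines of Python. This is a toy sketch only, not kernel code: ftrace works by patching instructions at function entry, whereas here a plain dictionary of hooks stands in for that machinery. The names `Tracer`, `define`, `register_hook`, and `call` are invented for the illustration.

```python
class Tracer:
    """Toy stand-in for ftrace: every call goes through call(), which
    consults a hook table before falling back to the original function."""

    def __init__(self):
        self.functions = {}   # name -> original implementation
        self.hooks = {}       # name -> replacement installed at run time

    def define(self, name, func):
        self.functions[name] = func

    def register_hook(self, name, replacement):
        # Loosely analogous to attaching a handler to a traced function.
        if name not in self.functions:
            raise KeyError(f"no such function: {name}")
        self.hooks[name] = replacement

    def call(self, name, *args):
        target = self.hooks.get(name, self.functions[name])
        return target(*args)


def buggy_clamp(value, limit):
    return value                 # bug: ignores the limit


def fixed_clamp(value, limit):
    return min(value, limit)


tracer = Tracer()
tracer.define("clamp", buggy_clamp)
before = tracer.call("clamp", 150, 100)       # runs the buggy version
tracer.register_hook("clamp", fixed_clamp)    # "load the patch module"
after = tracer.call("clamp", 150, 100)        # same call, fixed version
```

Callers never change; only the table consulted on the way into the function does, which is the property that lets the replacement happen on a running system.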

Of course, there are still some hazards out there. Chief among them is this: what happens if a process is running in kernel space when the patch is applied? This process might call the old version of a function once, then call the new version shortly thereafter; depending on what has changed in the meantime, that could lead to confusion and just the sort of downtime that this whole exercise was meant to prevent. Looking at this problem, the kGraft developers concluded that problems could be avoided if no process sees two versions of the same function during a single trip into and out of kernel space.

So the patch adds a marker to the thread_info structure in each process to track whether that process has left or returned to user space since the patch application process started. When a call to the old function is intercepted, the "slow stub" checks that flag for the running process; if that process has not entered or left kernel space since patching began, it is deemed to be running in the "old universe," and it gets the old version of the function. Otherwise, control goes to the new version. Once every process in the system has made this transition to the new universe, the slow stub can be removed and the new function can be called unconditionally.
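A small Python model may make the universe rule easier to follow. It is a sketch under simplifying assumptions, not the kGraft implementation: the `Task` class and the helpers `slow_stub`, `cross_user_boundary`, and `patch_finalizable` are hypothetical names, and a boolean attribute stands in for the thread_info marker.

```python
from dataclasses import dataclass


@dataclass
class Task:
    name: str
    # True until the task crosses the kernel/user boundary after
    # patching has begun.
    in_old_universe: bool = True


def old_func():
    return "old"


def new_func():
    return "new"


def slow_stub(task):
    """Pick the function version according to the task's universe."""
    return old_func() if task.in_old_universe else new_func()


def cross_user_boundary(task):
    """Called on entry to or exit from kernel space; migrates the task."""
    task.in_old_universe = False


def patch_finalizable(tasks):
    """The stub can be dropped once no task remains in the old universe."""
    return not any(t.in_old_universe for t in tasks)


tasks = [Task("shell"), Task("daemon")]
first = slow_stub(tasks[0])           # still mid-syscall: sees old code
cross_user_boundary(tasks[0])
second = slow_stub(tasks[0])          # next trip into the kernel: new code
pending = patch_finalizable(tasks)    # "daemon" has not crossed yet
cross_user_boundary(tasks[1])
done = patch_finalizable(tasks)
```

Within one trip through the kernel a task only ever sees one version of the function, which is the invariant the kGraft developers settled on.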

What about processes that don't make this transition in a reasonable period of time? For example, a process stuck waiting for I/O on a network socket could remain in the kernel for a long time, preventing the cleanup of the graft operation. When Vojtěch Pavlik talked about this work at the Linux Foundation Collaboration Summit in March, he mentioned a mechanism that would send a signal to slow processes to force this transition. That mechanism is not present in the posted patch set, though. What there is, instead, is a flag under /proc allowing the system administrator to identify processes that are gumming up the works.

Next question: what about kernel threads, which have no user space to return to? Most kernel threads will, when they reach a suitable completion point, make a call to kthread_should_stop() to see whether they should exit. The kGraft code modifies kthread_should_stop() to reset the "old universe" flag. For threads that do not make such calls, the kGraft developers have inserted calls to kgr_task_safe(), which marks the new-universe transition, in suitable locations.
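The kernel-thread case can be sketched the same way. In this toy Python model (the `KThread` class and `run` loop are invented for illustration and are not kernel API), `should_stop()` plays the role of the modified kthread_should_stop(): besides answering the usual question, it clears the thread's old-universe flag.

```python
class KThread:
    """Toy kernel thread; all names here are illustrative."""

    def __init__(self, work):
        self.work = list(work)
        self.in_old_universe = True    # set when patching starts
        self.stop_requested = False

    def should_stop(self):
        # Model of the modified kthread_should_stop(): it reports whether
        # the thread should exit and also marks the thread as migrated.
        self.in_old_universe = False
        return self.stop_requested


def run(thread, handle_old, handle_new):
    results = []
    while thread.work:
        item = thread.work.pop(0)
        handler = handle_old if thread.in_old_universe else handle_new
        results.append(handler(item))
        if thread.should_stop():       # the "suitable completion point"
            break
    return results


worker = KThread([1, 2, 3])
results = run(worker, lambda x: f"old:{x}", lambda x: f"new:{x}")
```

The item that was in flight when patching began is finished with the old code; everything after the first completion point uses the new code.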

Finally, what about interrupts? The kGraft code can block interrupts on the local CPU while it is making its changes, but it cannot block them globally. To make sure that no surprises come about while an interrupt handler is running, kGraft adds a per-CPU array to track whether each processor has run in process (non-interrupt) context. That flag is initially set to false, and schedule_on_each_cpu() is called to run a kGraft function, in process context, on each processor. That function, which cannot run until any pending interrupts on a given CPU have been serviced, will set the per-CPU flag. The function-replacement stub, meanwhile, will force interrupt code to run in the old universe on any CPU that has not yet set its per-CPU "new universe" flag.
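The per-CPU bookkeeping can be illustrated with a short Python sketch. This is a simplified model, not the posted code: `GraftState`, `run_work_item`, and `irq_version` are hypothetical names, and running the work item on one CPU at a time stands in for what schedule_on_each_cpu() arranges in the kernel.

```python
class GraftState:
    def __init__(self, n_cpus):
        # Per-CPU flag: has this CPU run the kGraft work item in process
        # context since patching began? Starts out False everywhere.
        self.cpu_ready = [False] * n_cpus

    def run_work_item(self, cpu):
        # Stands in for the function queued by schedule_on_each_cpu();
        # it only runs once pending interrupts on that CPU are serviced.
        self.cpu_ready[cpu] = True

    def irq_version(self, cpu):
        # What the stub picks when called from interrupt context.
        return "new" if self.cpu_ready[cpu] else "old"


state = GraftState(n_cpus=2)
at_start = [state.irq_version(c) for c in range(2)]
state.run_work_item(0)
midway = [state.irq_version(c) for c in range(2)]
state.run_work_item(1)
at_end = [state.irq_version(c) for c in range(2)]
```

An interrupt handler on a CPU that has not yet set its flag keeps getting the old function, so a handler that was already running when patching began never switches versions partway through.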

This patch set had just been posted as of this writing, so there is relatively little in the way of comments to report. Certainly there have been no objections to the overall approach expressed so far. There is obvious value to this functionality, so one would expect to see something merged at some point. The interesting variable here is the competing patches that are waiting in the wings. There's no place in the kernel for two or more runtime patching mechanisms, so either these patches will need to be unified in some fashion, or somebody will have to make a decision.

The initial kGraft submission

Posted May 5, 2014 7:09 UTC (Mon) by suzukikp (subscriber, #28182) [Link]

How does kGraft handle the situation where the 'new patched' function calls a local (non-exported) function from the running kernel? And if, by any chance, there are two different local functions of the same name, how does kGraft determine which one to choose?

Red Hat's kpatch

Posted May 7, 2014 3:03 UTC (Wed) by azilian (guest, #47340) [Link]

Since the kGraft patch was posted, the folks at Red Hat have also posted their solution, kpatch: http://article.gmane.org/gmane.linux.kernel/1695004

And that started a very interesting discussion about the actual implementation of the live-patching mechanism.

The initial kGraft submission

Posted May 8, 2014 2:08 UTC (Thu) by weue (guest, #96562) [Link] (12 responses)

Is this really useful at all for servers?

If one has such HA constraints that reboots are impossible, then everything should be redundant since machines can stop working randomly.

And if everything is redundant, you can just reboot the systems one by one, letting the others handle the load in the meantime.

Also, what about patches to userspace programs?

The initial kGraft submission

Posted May 8, 2014 2:52 UTC (Thu) by raven667 (subscriber, #5198) [Link] (8 responses)

Another way to handle redundancy is to make the machine itself redundant: mirrored RAM, redundant power, redundant networking and SAN disks, and even redundant CPUs. If you invest in the single-machine model then kGraft makes sense. A database server, for example, is much easier to run as a single instance in a single memory space; there is a very steep increase in complexity in making one usefully redundant.

The initial kGraft submission

Posted May 8, 2014 3:09 UTC (Thu) by dlang (guest, #313) [Link] (7 responses)

have you priced machines like that? they are _extremely_ expensive

and they still fail (I have had the 'fun' of dealing with hardware like this)

the increase in software complexity is nothing compared to the increase in hardware complexity trying to hide redundancy from the OS. And since very few people do this any more, you don't get any commodity pricing. It makes the old Unix boxes look cheap

The initial kGraft submission

Posted May 8, 2014 4:00 UTC (Thu) by raven667 (subscriber, #5198) [Link] (6 responses)

I would guess that is a specific complaint about redundant-CPU machines, which are a very expensive niche product; the other technologies I mentioned are commonplace in pretty much any rackmount server. Even CPU redundancy can be had cheaply with virtualization these days if you want, with something like VMware Fault Tolerance, at the cost of much performance.

The initial kGraft submission

Posted May 8, 2014 4:44 UTC (Thu) by dlang (guest, #313) [Link] (5 responses)

virtualization doesn't give you protection against cpu failures. if one cpu dies the entire server dies

If you play games with virtualization you can startup a slightly older copy of the system elsewhere in a short time, but that really isn't the right thing if you care about your database content.

making your overall system reliable by buying more reliable servers just doesn't work well. It may make it more reliable at the very low end, but it's not something that scales well.

The initial kGraft submission

Posted May 8, 2014 5:58 UTC (Thu) by Cyberax (✭ supporter ✭, #52523) [Link] (2 responses)

There are solutions that allow Xen VMs on multiple hosts to work in complete lockstep. It's still a niche but it doesn't require any exotic hardware.

The initial kGraft submission

Posted May 8, 2014 6:17 UTC (Thu) by dlang (guest, #313) [Link] (1 responses)

I've heard that statement before, and I'll make the same response.

If they truly guarantee duplication of every memory write, then your performance will slow to 386 levels or lower on the fastest machine you can buy today.

for that claim to be true, every memory write and every register write would have to get sent over the network to the other machine before anything else could be done (including anything that gets used for random number generation).

since fast commodity networks are 1Gb/sec (or 125MB/sec) while current memory can do > 17GB/sec (100x the performance), and your system actually does most of its activity at cache and register speeds, which are a couple orders of magnitude faster still

not counting the latency that would be added to get the signal a few feet to the other machine rather than a few inches on the motherboard

or the bandwidth that you would lose sending the address information for every memory address changed

if you were only about 1,000,000 times slower than operating natively you would be doing a _really_ good job

even people doing tricks with really specialized hardware see a couple orders of magnitude difference between accessing local memory and remote memory (look into remote DMA)

I'll buy that they can do periodic checkpointing, and that if you can do a graceful shutdown they can stop the process long enough to copy everything over with no impact.

but if they claim to get anything close to native speeds with full protection against the machine failing, then they have either beaten the speed of light or are lying.

The initial kGraft submission

Posted May 8, 2014 13:36 UTC (Thu) by Cyberax (✭ supporter ✭, #52523) [Link]

> for that claim to be true, every memory write and every register write would have to get sent over the network to the other machine before anything else could be done (including anything that gets used for random number generation).
Incorrect. You don't need to send ALL updates over the network. Only memory deltas between externally observable events (like network packet transmission) must be replicated. Between these events both machines can work independently.

And that's exactly how Xen Remus works.

The initial kGraft submission

Posted May 8, 2014 14:27 UTC (Thu) by raven667 (subscriber, #5198) [Link] (1 responses)

> virtualization doesn't give you protection against cpu failures. if one cpu dies the entire server dies
> If you play games with virtualization you can startup a slightly older copy of the system elsewhere in a short time, but that really isn't the right thing if you care about your database content.

Actually, the VMware Fault Tolerance feature I was talking about gives you exactly that: protection against full machine failures. It runs your VM simultaneously on two different host nodes in lockstep by simulating every IO to the VM in a deterministic manner. The downsides are that you are limited to a single vCPU and can't have access to any devices or hardware features that could introduce any differences in the execution states of the two VMs, and you need a fast connection between nodes to transmit all of the duplicated IO, interrupts, etc. so that there is minimal lag.

> making your overall system reliable by buying more reliable servers just doesn't work well. It may make it more reliable at the very low end, but it's not something that scales well.

Yes, I agree, in different words: if you want high scalability, making individual servers more reliable is a losing proposition and a waste of resources, but for small systems making the individual hosts more reliable is less complex and often more than good enough.

The initial kGraft submission

Posted May 8, 2014 19:41 UTC (Thu) by dlang (guest, #313) [Link]

small systems seldom have such rigorous uptime requirements that they can't afford a restart to patch them

the other problem with continually modifying a running system rather than restarting it is that after a while you don't really know if it will restart.

And if the system gets corrupted, can you build a replacement? (and does your backup work)

if you don't exercise your backups, startups, etc. you don't really know if they work.

The initial kGraft submission

Posted May 8, 2014 15:36 UTC (Thu) by jpoimboe (subscriber, #23893) [Link]

I think the most common need for live patching is smaller, non-distributed systems where _some_ downtime is acceptable, but arbitrarily rebooting *right now* is going to cause a lot of headaches. With live patching you can apply security fixes immediately without having to wait for the next scheduled reboot window.

If my desktop package manager tells me I need a kernel update, I usually don't close all my windows and install it immediately. Instead I wait for a more convenient time, possibly sacrificing security for convenience.

I think security-minded sysadmins always have to deal with that conflict between security and convenience, and convenience often wins.

The initial kGraft submission

Posted May 9, 2014 18:30 UTC (Fri) by intgr (subscriber, #39733) [Link] (1 responses)

> Is this really useful at all for servers?
> If one has such HA constraints that reboots are impossible, then everything should be redundant since machines can stop working randomly.

You're saying that all systems must be perfectly redundant anyway? This is just disconnected from reality.

Designing a redundant system that can do switchover with a brief downtime is easy. Most redundant systems fall into this category. Designing a system that's completely bulletproof so you can take any node down at any time without anyone noticing, is really hard.

With a system of the former kind, it's still necessary to schedule reboots during low usage periods to reduce the impact, which usually happens to be at night. Understandably admins are reluctant to do such work weekly if there's only a small chance the fixed issues could affect them. But they would still benefit from the security and bug fixes.

And there is also a big "middle ground" of systems that cannot justify the hardware and engineering resources for true redundancy, but also cannot reboot at the blink of an eye. Many of them accept that they may have 5 hours of downtime every 5 years due to hardware failures. Doing frequent reboots is often too much, in admin time and in downtime.

Live patching eliminates all of these tradeoffs. You can apply patches during normal working hours. No need for advance planning, no annoyance to users or admins.

The initial kGraft submission

Posted May 9, 2014 19:55 UTC (Fri) by dlang (guest, #313) [Link]

> Live patching eliminates all of these tradeoffs. You can apply patches during normal working hours. No need for advance planning, no annoyance to users or admins.

until the first time that the patch goes bad, at which point management will rediscover the idea of trying to minimize the impact to users.

If you have redundancy to be able to survive a box breaking and still serve your users, then you should be able to 'fail' a box, upgrade it, test it, then restore it, and then repeat the process for everything else in your cluster.

Yes, this does introduce a slight risk that a box will fail for real while you are upgrading, but almost every system has multiple cycles in its load. It's not just "day heavy, night light"; it's also "more heavily used from 7-10am than from 2-4pm" or "more heavily used on Monday and Friday than Thursday". You schedule your upgrades to avoid your peak times, and even with one system being patched and another down for real you can probably handle the load at that time. If you have virtualized your systems, you may even be able to bring up a new system, patch it, and add it to the cluster before taking the old system down, so you have no reduction in your capacity.

I agree that doing upgrades during the day is valuable (why do you want your people doing the most nuanced sensitive work and making judgement calls on the results after they have been up for 20 hours?), but live patching is far from the only way to deal with that.