
Kernel development

Brief items

Kernel release status

The current 2.6 prepatch is 2.6.7-rc2, released on May 29. Most of the patches this time around are aimed at stabilization after the big changes in -rc1, but -rc2 also contains an ALSA update, a whole bunch of new __user annotations (intended to help find misuses of user-space pointers - see below), an XFS update, some IPSec fixes, and some architecture updates. See the long-format changelog for the details.

Linus's BitKeeper repository contains, as of this writing, some stack usage reduction patches, more __user annotations, some architecture updates, and a few other fixes.

The current prepatch from Andrew Morton is 2.6.7-rc2-mm1. Recent additions to -mm include NFS, MD, and DMI updates, the x86 performance counters patch, some read-copy-update scalability work, and the usual pile of fixes.

The current 2.4 prepatch is 2.4.27-pre4, which was released by Marcelo on May 30. There are some XFS and JFS updates, a number of 2.6 networking backports (including TCP Vegas support and receiver-side RTT estimation), some driver updates, and the usual set of fixes.

Comments (4 posted)

Kernel development news

x86 NX support

Marking regions of memory as not containing executable code is not a particularly new technique; some processors have recognized this mode for years. The processor that everybody actually uses, however (the x86 family) does not have a "no-execute" bit.

At least, it didn't until very recently. AMD added a no-execute (NX) permission bit to the page table entries in its 64-bit processors; Intel has recently said it will be supporting this mode as well. So the hardware will be able to avoid executing code from certain regions of memory, making various types of buffer overflow attacks harder. At least, that will be true if the operating system supports and uses the NX mode.

To that end, Ingo Molnar has posted a patch bringing NX support to the x86 architecture; his patch is based on previous work done by Intel and the x86_64 NX support by Andi Kleen. This patch allows applications to mark areas as being non-executable; such areas, typically, will include the stack and heap zones. It also applies the NX bit to the kernel itself; kernel text is marked executable, but kernel data is not. As a result, the next time a buffer overflow turns up in the kernel, it, too, will be harder to exploit.

The NX bit only works when the processor is running in the PAE mode. Most x86 Linux systems currently do not run in that mode; it is normally only turned on when large amounts of memory (more than 4GB) are installed. This mode adds a third level of page tables, and makes the page table entries themselves larger, so users and distributors normally turn it off if it is not needed. Most modern x86 processors support the PAE mode, however; security considerations may lead to it being used more heavily in the future.

Linus's main concern about the patch would appear to be how many old applications it might break. The reply from Arjan van de Ven is that pretty much everything "just works." The no-execute permission is not applied unless the code is specially marked in the image file, and gcc apparently does a good job of not setting that flag when it would break things. If this experience holds true, NX support could go in fairly quickly, and a longstanding x86 security weakness will be no more.

For people interested in testing this patch, Arjan has merged it into the latest Fedora Core test kernels. See the patch announcement for a pointer. There is also a "quickstart" document for those who would like to test out NX in their own kernels.

Comments (5 posted)

The staircase scheduler

As the 2.6.0 release approached, some developers worried that the CPU scheduler would be the downfall of this particular stable series. Complaints of poor interactive performance were common, NUMA systems were not supported well, and so on. Over time, most of these problems have been addressed; massive amounts of interactivity work and the domain scheduler have smoothed over most of the problems. Complaints about the scheduler have been relatively rare in recent times.

One thing that does still bother some people, however, is the complexity of the current 2.6 scheduler. The interactivity work, in particular, added a great deal of very obscure code. The scheduler goes to great lengths to try to identify interactive tasks and to boost their priority accordingly. This process involves numerous strange computations involving a number of magic constants; it is difficult to understand, much less improve.

Con Kolivas, who had his hand in much of the interactivity work, has just posted a new version of his "staircase scheduler" patch. This patch aims to greatly simplify the scheduler while simultaneously improving interactive response; it deletes 498 lines of code, while adding less than 200. Much of what is deleted is the "black magic" interactivity calculations; it is all replaced with a relatively simple, rank-based scheme.

The staircase scheduler implements a single, ranked array of processes for each CPU. Initially, each process goes into the array at the rank determined by its base priority; the scheduler can then locate and run the highest-priority process in the usual way. So far, not much has changed.

In the current scheduler, processes which use up their time slice get moved over to a separate "expired" array; there they languish until the rest of the processes in the mix have used up their time (or blocked) as well. The staircase scheduler does away with the expired array; instead, an expired process will be put back into the staircase, but at the next lower rank. It can, thus, continue to run, but at a lower priority. When it exhausts another time slice, it moves down again. And so on. The following little table shows how long the process spends at each priority level:

                      Priority rank
 Iteration   Base  -1  -2  -3  -4  -5  -6  -7  -8  -9 ...
     1         1    1   1   1   1   1   1   1   1   1

When a process falls off the bottom of the staircase, an interesting thing happens: it gets moved back up to one level below its previous maximum, and it gets two time slices at that level. Thereafter, it once again works its way down the steps to the bottom. The next time, it goes up to two steps below the maximum, for three time slices. The above table, with three iterations through the staircase, would look like this:

                      Priority rank
 Iteration   Base  -1  -2  -3  -4  -5  -6  -7  -8  -9 ...
     1         1    1   1   1   1   1   1   1   1   1
     2              2   1   1   1   1   1   1   1   1
     3                  3   1   1   1   1   1   1   1

Each descent down the staircase thus involves the same number of time slices, but, each time, more slices are spent at the top priority level for that iteration. This algorithm helps maintain the relative priorities. A process at priority n will, after falling off the staircase, find itself competing with all the processes at priority n-1, but it will get a longer slice of time relative to those other processes, which have a lower base priority.

If a process sleeps for a reasonable interval, it gets pushed back up the staircase. Thus interactive tasks, which normally sleep quite a bit, should stay near the top of the staircase and be responsive, while CPU hogs spend much of their time on the lower steps.

The kernel community may not be up for another big scheduler change at this point in the stable series; many people would like to see 2.6 actually stabilize and 2.7 begin. This patch appears worthy of consideration, however, for its simplification of a complex part of the kernel if nothing else.

Comments (8 posted)

Finding kernel problems automatically

In past years, this page has looked at the work done by the "Stanford checker," which analyzes code in search of various types of programming errors. The checker has found a lot of problems over the years, with the result that a lot of problems have been fixed before they had a chance to bite users of production kernels.

The only problem with the Stanford checker is that it is not free software; it is, in fact, completely unavailable to the world as a whole. Rather than release the code, the checker group went off and formed Coverity to commercialize the checker software (now called "SWAT" and touted, ominously, as being "patent pending"). Developers at Coverity still occasionally post reports of potential bugs found by SWAT, but, for the most part, their attention seems focused on potential revenue opportunities.

It is hard to complain about this outcome. Before heading on this course, the Coverity folks uncovered vast numbers of bugs, and all Linux users benefited from that work. They also demonstrated how valuable static code testing tools can be. The community, however, was left in the position of having to actually write its own checker if it wanted one. Fortunately, this is the sort of thing the community can be good at.

A while back, none other than Linus Torvalds started work on his own tool, which came to be called "sparse." There has recently been a flurry of new activity around sparse, so it seems like a good time to take a look.

sparse is normally obtained by cloning the BitKeeper repository at bk://kernel.bkbits.net/torvalds/sparse. For those who don't use BK, a checked-out version is available (as a bunch of SCCS files) on kernel.org. There is a low-bandwidth sparse mailing list as well.

Essentially, sparse is a parsing and analysis library for the C language. One could put a number of different backends onto it; for example, a code-generation backend would turn it into a simple compiler. For the purposes of the kernel, however, the backend of interest is the analysis code which looks for various types of errors. The analyzer checks for quite a few different types of errors. Many of these (many sorts of type mismatches, for example) are also found by the compiler, but other tests are unique to sparse.

The core test done by sparse is still the check for improper use of user-space pointers. A quick look through the kernel will turn up liberal use of a type attribute called __user; for example, the read() method invoked from system calls is prototyped as:

    ssize_t (*read) (struct file *, char __user *, size_t, loff_t *);

When the kernel is being compiled, __user is defined as the empty string, so gcc doesn't see it at all. When sparse is being used, instead, it marks the pointer as (1) being in a separate address space, and (2) not being legal to dereference. sparse will use those flags to catch any mixing of user- and kernel-space pointers, and any attempt to directly dereference user-space pointers.

These checks have turned up a surprising number of errors. The kernel normally sets up the virtual address space in such a way that direct dereferencing of user-space pointers actually works - most of the time. Using user-space addresses in this way will fail, however, if the user page is not actually resident in memory at the time. More importantly, perhaps, this sort of direct dereferencing bypasses the normal access controls; every such error could, thus, become a security hole.

Catching such mistakes automatically seems like a good idea. It does require, however, that every variable holding a user-space pointer be marked with the __user attribute. Since much of the kernel (including every device driver) deals with user-space pointers, this is not a trivial job. This job is proceeding, however; several dozen patches adding __user annotations (and fixing problems found on the way) have been merged for 2.6.7.

Other checks performed include finding constants which are overly long for their target type, mistakes in embedded assembly language code, empty switch statements, assignments in conditionals, and so on. sparse's output is still rather noisy, but one assumes that will improve over time. If you have sparse installed, running it on the kernel is simply a matter of adding "C=1" to the make command. External modules can also be checked in this way.

sparse is still clearly far behind the Stanford checker in terms of the variety of errors it can find. Unlike the checker, however, sparse is free software. The core parsing infrastructure is in place, so the addition of new checks should be relatively straightforward. All that's needed is the application of a bunch of developer time.

Comments (8 posted)

Diskdump: a new crash dump system

A standard feature of most commercial operating systems is a "crash dump" facility. If something goes wrong in the operating system kernel, the system saves its entire state to a file and reboots; the contents of that file can then be examined at leisure to try to figure out what went wrong. The Linux kernel, however, lacks this capability. There are a few possible reasons for this omission: the kernel never crashes (not quite true, unfortunately), kernel developers rarely want crash dumps for their own work, and there is a certain degree of unhappiness with all of the crash dump patches currently in circulation. The fact of the matter, however, is that a number of Linux vendors would like to have a good crash dump system in place so they can better support their customers.

A recent patch posted by Takao Indoh may provide that capability. The new "diskdump" system has taken a simpler approach to crash dumps that, with some fixes, may just get enough core hacker support to be considered for merging into the (presumably 2.7) mainline.

Diskdump works by taking absolute control of the system when a panic occurs. It shuts down all interrupts to keep the processor from getting distracted; it also freezes all other processors on SMP systems. It then checksums its own code, comparing against a value computed at initialization time; if the checksums fail to match, diskdump assumes that it has been corrupted as a result of whatever went wrong and refuses to run.

The next step involves finding a place to store the crash dump. Diskdump can be set up with multiple dump partitions. For each possibility, it queries the state of the driver, then reads and verifies the entire crash dump space. The diskdump authors are (rightly) fearful of overwriting important data while the system is in an unstable state, so diskdump requires that every block of the crash dump partition be initialized with a special pattern. If any blocks fail the test, that destination will not be used.

When a suitable location has been found, diskdump writes a header with the system state and panic information, followed by a memory image. At that point the system can be rebooted; once things are stable again, the "savecore" utility turns the memory image into a proper core dump and reinitializes the crash dump partition. All is then in readiness for debugging and, if need be, the next crash.

Diskdump needs some significant block driver modifications to be able to do its job. The driver must export a new set of operations:

    struct disk_dump_device_ops {
        int (*sanity_check)(struct disk_dump_device *);
        int (*quiesce)(struct disk_dump_device *);
        int (*shutdown)(struct disk_dump_device *);
        int (*rw_block)(struct disk_dump_partition *, int rw, unsigned long
            block_nr, void *buf);
    };

The sanity_check() call checks to ensure that the device in question is ready to accept a crash dump. If that function finds that, for example, the device is offline or somebody, somewhere is holding a spinlock for the device, the sanity check will fail and the dump will have to go somewhere else. A call to quiesce() follows, in case any preparation is needed. The current implementation (which only works with some SCSI devices) performs a full SCSI bus reset at this point. The actual I/O is done via rw_block(), which is expected to transfer one page per call. This I/O should be done without interrupts (which are, remember, disabled when the panic happens), so the typical implementation will work by polling the device. At the end, shutdown() is called to ensure that all blocks have been flushed to the media.

Perhaps the ugliest part of the patch - and the part which some developers have complained about - is the rerouting of timer and tasklet calls. Since all interrupts are disabled, the normal timer and software interrupt mechanisms will not function. Diskdump does not need those capabilities itself, but a number of disk drivers do. As a result, diskdump must, somehow, run tasklets and timers expected by the driver, but without running arbitrary code unrelated to the dump process. To this end, diskdump sets up its own private timer and tasklet lists which come into action once the system is locked down and the dump process begins.

Currently, all this works by modifying the drivers to call diskdump's functions rather than the core kernel variants. So, for example, instead of setting up a timer with add_timer(), a driver implementing dumps would call this little wrapper:

    static inline void diskdump_add_timer(struct timer_list *timer)
    {
        if (crashdump_mode())
            _diskdump_add_timer(timer);
        else
            add_timer(timer);
    }

But that function is only available if crash dumps are configured into the system, so some preprocessor macros are used to redefine add_timer() if need be. This solution is not going to make it into the mainline kernel, however. The preferred approach would appear to be integrating this functionality directly into the core timer and tasklet routines; that change will make the driver changes smaller, but at the cost of intruding into some of the core kernel code.

Comments (3 posted)

Patches and updates

Kernel trees

Linus Torvalds Linux 2.6.7-rc2
Andrew Morton 2.6.7-rc2-mm1
Andrew Morton 2.6.7-rc1-mm1
Marcelo Tosatti Linux 2.4.27-pre4

Architecture-specific

Core kernel code

Development tools

Device drivers

Jaroslav Kysela update to ALSA 1.0.5

Documentation

thornber@redhat.com dm: change maintainer

Filesystems and block I/O

Memory management

Security-related

Page editor: Jonathan Corbet


Copyright © 2004, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds