The staircase scheduler [LWN.net]

As the 2.6.0 release approached, some developers worried that the CPU scheduler would be the downfall of this particular stable series. Complaints of poor interactive performance were common, NUMA systems were not supported well, and so on. Over time, most of these problems have been addressed; massive amounts of interactivity work and the domain scheduler have smoothed over most of the problems. Complaints about the scheduler have been relatively rare in recent times.

One thing that does still bother some people, however, is the complexity of the current 2.6 scheduler. The interactivity work, in particular, added a great deal of very obscure code. The scheduler goes to great lengths to try to identify interactive tasks and to boost their priority accordingly. This process involves numerous strange computations involving a number of magic constants; it is difficult to understand, much less improve.

Con Kolivas, who had his hand in much of the interactivity work, has just posted a new version of his "staircase scheduler" patch. This patch aims to greatly simplify the scheduler while simultaneously improving interactive response; it deletes 498 lines of code, while adding less than 200. Much of what is deleted is the "black magic" interactivity calculations; it is all replaced with a relatively simple, rank-based scheme.

The staircase scheduler implements a single, ranked array of processes for each CPU. Initially, each process goes into the array at the rank determined by its base priority; the scheduler can then locate and run the highest-priority process in the usual way. So far, not much has changed.

In the current scheduler, processes which use up their time slice get moved over to a separate "expired" array; there they languish until the rest of the processes in the mix have used up their time (or blocked) as well. The staircase scheduler does away with the expired array; instead, an expired process will be put back into the staircase, but at the next lower rank. It can, thus, continue to run, but at a lower priority. When it exhausts another time slice, it moves down again. And so on. The following little table shows how long the process spends at each priority level:

	Priority rank
Iteration	Base	-1	-2	-3	-4	-5	-6	-7	-8	-9	...
1	1	1	1	1	1	1	1	1	1	1

When a process falls off the bottom of the staircase, an interesting thing happens: it gets moved back up to one level below its previous maximum, and it gets two time slices at that level. Thereafter, it once again works its way down the steps to the bottom. The next time, it goes up to two steps below the maximum, for three time slices. The above table, with three iterations through the staircase, would look like this:

	Priority rank
Iteration	Base	-1	-2	-3	-4	-5	-6	-7	-8	-9	...
1	1	1	1	1	1	1	1	1	1	1
2		2	1	1	1	1	1	1	1	1
3			3	1	1	1	1	1	1	1

Each descent down the staircase thus involves the same number of time slices, but, each time, more slices are spent at the top priority level for that iteration. This algorithm helps maintain the relative priorities. A process at priority n will, after falling off the staircase, find itself competing with all the processes at priority n-1, but it will get a longer slice of time relative to those other processes, which have a lower base priority.

If a process sleeps for a reasonable interval, it gets pushed back up the staircase. Thus interactive tasks, which normally sleep quite a bit, should stay near the top of the staircase and be responsive, while CPU hogs spend much of their time on the lower steps.

The kernel community may not be up for another big scheduler change at this point in the stable series; many people would like to see 2.6 actually stabilize and 2.7 begin. This patch appears worthy of consideration, however, for its simplification of a complex part of the kernel if nothing else.

Index entries for this article
Kernel	Interactivity
Kernel	Scheduler
Kernel	Staircase scheduler

to post comments

The staircase scheduler

Posted Jun 3, 2004 1:44 UTC (Thu) by gallir (guest, #5735) [Link] (3 responses)

Again... isn't it the same as the old-well-known "multilevel feedback
queues" that appears in every OS book since MULTICS?

The staircase scheduler

Posted Jun 3, 2004 5:08 UTC (Thu) by error27 (subscriber, #8346) [Link] (2 responses)

To be honest, the definition of "multilevel feedback queues" is too vague to be worthwhile. The old scheduler fits the definition, and the new scheduler doesn't change that at all. What the new scheduler changes is how the processes move from one queue to the other. The stair case scheduler is pretty inovative actually.

If you want to learn more about the difficult issues for schedulers read Con's write up here.

It's pretty weird to not have an expired array. It seems like a reasonable scheduler design because it obviously gives priority to interactive processes and it doesn't starve any processes. I can see how this would be simpler to code. It's a nice design. I guess the real proof is in testing. :)

The staircase scheduler

Posted Jun 3, 2004 15:30 UTC (Thu) by gallir (guest, #5735) [Link] (1 responses)

If you read any OS book, it's comprehensively explained. I suppose you
don't expect to give a full description in a "title" :-).

For example, taking an old Peterson/Silberschatz book:

"... Multilevel feedback queues, however, allow a process to move between
queues. The idea is to separate out processes with a different cpu-burts
characteristics. If a process uses too much cpu time, it will be moved to
a lower priority queue. This scheme leaves I/O-bound and interactive
processes in the higher priority queues. Similarly, a process which waits
too long in a lower-priority queue may be moved to a higher priority
queue. This is a form of aging that would pevent starvation. ..."

Or from a Stalling's book:

"... This approach is known as multilevel feedback, meaning that the
operating system allocates the processor to a process, and when the
process blocks or is preempted, feeds it back into one of several queues.
There are a numbre of variations on this scheme. A simple version is to
perform preemption inthe same fashions for round-robin: at periodic
intervals..."

What I mean, a lot of "new methods", for example those applied to CPU and
I/O schedulers are already described, discussed, and compared in the
literature since the '70 or '80s.

The staircase scheduler

Posted Jun 4, 2004 2:40 UTC (Fri) by error27 (subscriber, #8346) [Link]

> Multilevel feedback queues, however, allow a process to move between
queues.

Right. With both the regular scheduler and the stair case scheduler you have 140 queues that represent 140 different priorities. You execute the top priorities first in a round robin format. The trick with the stair case scheduler is the logic behind how the processes move from one queue to the other.

> There are a number of variations on this scheme.

Exactly. In some ways, you could say that this was "just" a new variation. But in other ways, it's a pretty inovative variation. ;)

The staircase scheduler

Posted Jun 3, 2004 13:58 UTC (Thu) by marduk (guest, #3831) [Link]

This sounds too elegant. And even I can understand it. It can't possibly work! ;-)

The staircase scheduler

Posted Jun 3, 2004 15:08 UTC (Thu) by mmarsh (subscriber, #17029) [Link] (1 responses)

It's not obvious in the description, and I haven't looked through the code or any detailed technical docs, but what happens in the following case? Say you have a process that has, through grave misfortune, fallen off of the lowest step and been placed back *on* the lowest step. If there are n steps it'll wait there for n time slices, and if it still hasn't been run it'll fall off again. Does it get placed back on the lowest step as a default? Does it get elevated in priority as compensation for its patience? Can this just not occur (though I'd bet a contrived example could be constructed)? The nightmare scenario is that the scheduler tries to place it below the lowest step, resulting in an array bounds error and a kernel panic.

The staircase scheduler

Posted Jun 4, 2004 0:08 UTC (Fri) by farnz (subscriber, #17727) [Link]

The only time a process falls down is if it consumes all its timeslices. Thus, a process that has been placed on the lowest step has consumed n timeslices at each of the priorities Base through Base-n. It can only fall off again if it runs for n timeslices, thus your scenario cannot occur.

The staircase scheduler

Posted Jun 6, 2004 6:32 UTC (Sun) by sitaram (guest, #5959) [Link]

> Thus interactive tasks, which normally sleep quite a bit, should stay near
> the top of the staircase and be responsive, while CPU hogs spend much of
> their time on the lower steps.

sounds very much like delay pools in squid, which provide a neat way of penalising bulk downloads while keeping normal "interactive" sufing response good!

[I *have* seen the other posts with extracts from OS books; just want to give a different analogy is all...]