Rethinking optimization for size
Compiler optimizations are often aimed at making frequently-executed code (such as that found in inner loops) run more quickly. As an artificially simple example, consider a loop like the following:
for (i = 0; i < 4; i++)
    do_something_with(i);
Much of the computational cost of a loop like this may well be found in the loop structure itself — incrementing the counter, comparing against the maximum, and jumping back to the beginning. A compiler that performs loop unrolling might try to reduce that cost by transforming the code into something like:
do_something_with(0);
do_something_with(1);
do_something_with(2);
do_something_with(3);
The loop overhead is now absent, so one would expect this code to execute more quickly. But there is a cost: the generated code may well be larger than it was before the optimization was applied. In many situations, the performance improvement may outweigh the cost, but that may not always be the case.
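The equivalence of the two forms can be checked with a small, self-contained sketch. The body given to do_something_with() here is purely hypothetical (the article never defines it), as are the helper names run_rolled() and run_unrolled(); the only point is that both versions do the same work, while the unrolled one repeats the call site four times in the generated code.

```c
/* Hypothetical stand-in for do_something_with(): adds to a running total. */
static int total;

static void do_something_with(int i)
{
    total += (i + 1) * 10;
}

/* The original loop: compact, but pays for the counter, compare, and jump. */
int run_rolled(void)
{
    int i;

    total = 0;
    for (i = 0; i < 4; i++)
        do_something_with(i);
    return total;
}

/* The hand-unrolled equivalent: no loop overhead, but four call sites. */
int run_unrolled(void)
{
    total = 0;
    do_something_with(0);
    do_something_with(1);
    do_something_with(2);
    do_something_with(3);
    return total;
}
```

Both functions return the same value. Comparing the object code that a compiler emits for each at different optimization levels is one way to see the size cost described above, though the exact output depends on the compiler version and target.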
GCC provides an optimization option (-Os) with a different objective: it instructs the compiler to produce more compact code, even if there is some resulting performance cost. Such an option has obvious value if one is compiling for a space-constrained environment like a small device. But it turns out that, in some situations, optimizing for space can also produce faster code. In a sense, we are all running space-constrained systems, in that the performance of our CPUs depends heavily on how well those CPUs are using their cache space. Space-optimized code can make better use of scarce instruction cache space, and, as a result, perform better overall. With this in mind, compilation with -Os was made generally available for the 2.6.15 kernel in 2005 and made non-experimental for 2.6.26 in 2008.
Unfortunately, -Os has not always lived up to its promise in the real world. The problem is not necessarily with the idea of creating compact code; it has more to do with how GCC interprets the -Os option. In the space-optimization mode, the compiler tends to choose some painfully slow instructions, especially on older processors. It also discards the branch prediction information provided by kernel developers in the form of the likely() and unlikely() macros. That, in turn, can cause rarely executed code to share cache space with hot code, effectively wasting a portion of the cache and wiping out the benefits that optimizing for space was meant to provide.
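For readers who have not run into these macros, the following is a minimal user-space sketch of how such hints are typically written. The macro definitions follow the well-known pattern built on GCC's __builtin_expect(); the sum_buffer() function is a hypothetical example and does not come from the kernel.

```c
#include <stddef.h>

/* Branch hints in the style of the kernel's macros, built on a GCC builtin. */
#define likely(x)   __builtin_expect(!!(x), 1)
#define unlikely(x) __builtin_expect(!!(x), 0)

/*
 * Hypothetical example: sum an array, treating a NULL pointer as a rare
 * error. The hint tells the compiler that the error branch is cold, so
 * it may move that code away from the hot path.
 */
long sum_buffer(const int *buf, size_t len)
{
    long sum = 0;
    size_t i;

    if (unlikely(buf == NULL))
        return -1;

    for (i = 0; i < len; i++)
        sum += buf[i];
    return sum;
}
```

The hints never change what the function computes; they only influence how the compiler lays out the generated code. When that information is thrown away, the rarely taken error path may end up interleaved with the hot loop.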
Because -Os did not produce the desired results, Linus disabled it by default in 2011, effectively ending experimentation with this option. Recently, though, Ling Ma posted some results suggesting that the situation might have changed. Recent Intel processors, it seems, have a new cache for decoded instructions, increasing the benefit obtained by having code fit into the cache. The performance of the repeated "move" instructions used by GCC for memory copies in -Os mode has also been improved in newer processors. The posted results claim a 4.8% performance improvement for the netperf benchmark and 2.7% for the volano benchmark when -Os is used on a newer CPU. Thus, it was suggested, maybe it is time to reconsider -Os, at least for some target processors.
Naturally, the situation is not quite that simple. Valdis Kletnieks complained that the benchmark results may not be showing an actual increase in real-world performance. Distributors hate shipping multiple kernels, so an optimization mode that only works for some portion of a processor family is unlikely to be enabled in distributor kernels. And there is still the problem of the loss of branch prediction information which, as Linus verified, still happens when -Os is used.
What is really needed, it seems, is a kernel-specific optimization mode that is more focused on instruction-cache performance than code size in its own right. This mode would take some behaviors from -Os while retaining others from the default -O2 mode. Peter Anvin noted that the GCC developers are receptive to the idea of implementing such a mode, but there is nobody who has the time and inclination to work on that project at the moment. It would be nice to have a developer who is familiar with both the kernel and the compiler and who could work to make GCC produce better code for the kernel environment. Until somebody steps up to do that work, though, we will likely have to stick with -O2, even knowing that the resulting code is not as good as it could be.
