Rethinking optimization for size [LWN.net]

By Jonathan Corbet
January 30, 2013

Contemporary compilers are capable of performing a wide variety of optimizations on the code they produce. Quite a bit of effort goes into these optimization passes, with different compiler projects competing to produce the best results for common code patterns. But the nature of current hardware is such that some optimizations can have surprising results; that is doubly true when kernel code is involved, since kernel code is often highly performance-sensitive and provides an upper bound on the performance of the system as a whole. A recent discussion on the best optimization approach for the kernel shows how complicated the situation can be.

Compiler optimizations are often aimed at making frequently-executed code (such as that found in inner loops) run more quickly. As an artificially simple example, consider a loop like the following:

    for (i = 0; i < 4; i++)
        do_something_with(i);

Much of the computational cost of a loop like this may well be found in the loop structure itself — incrementing the counter, comparing against the maximum, and jumping back to the beginning. A compiler that performs loop unrolling might try to reduce that cost by transforming the code into something like:

    do_something_with(0);
    do_something_with(1);
    do_something_with(2);
    do_something_with(3);

The loop overhead is now absent, so one would expect this code to execute more quickly. But there is a cost: the generated code may well be larger than it was before the optimization was applied. In many situations, the performance improvement may outweigh the cost, but that may not always be the case.
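
For readers who want to experiment with this tradeoff, here is a minimal sketch of the two forms side by side. It assumes GCC 8 or later, which accepts a `#pragma GCC unroll` hint; the function names are hypothetical and chosen only for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: sums four elements with an ordinary loop.
 * The pragma asks GCC (8 or later) to unroll the loop four times; a
 * compiler that does not know the pragma ignores it, and the result is
 * the same either way. */
int sum_first_four(const int *values)
{
    int total = 0;

    assert(values != NULL);
#pragma GCC unroll 4
    for (int i = 0; i < 4; i++)
        total += values[i];
    return total;
}

/* The same computation unrolled by hand, in the style of the example above. */
int sum_first_four_unrolled(const int *values)
{
    assert(values != NULL);
    return values[0] + values[1] + values[2] + values[3];
}
```

Comparing the object code size of the two builds (with and without the pragma) is one way to see the size cost that unrolling carries.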

GCC provides an optimization option (-Os) with a different objective: it instructs the compiler to produce more compact code, even if there is some resulting performance cost. Such an option has obvious value if one is compiling for a space-constrained environment like a small device. But it turns out that, in some situations, optimizing for space can also produce faster code. In a sense, we are all running space-constrained systems, in that the performance of our CPUs depends heavily on how well those CPUs are using their cache space. Space-optimized code can make better use of scarce instruction cache space, and, as a result, perform better overall. With this in mind, compilation with -Os was made generally available for the 2.6.15 kernel in 2005 and made non-experimental for 2.6.26 in 2008.

Unfortunately, -Os has not always lived up to its promise in the real world. The problem is not necessarily with the idea of creating compact code; it has more to do with how GCC interprets the -Os option. In the space-optimization mode, the compiler tends to choose some painfully slow instructions, especially on older processors. It also discards the branch prediction information provided by kernel developers in the form of the likely() and unlikely() macros. That, in turn, can cause rarely executed code to share cache space with hot code, effectively wasting a portion of the cache and wiping out the benefits that optimizing for space was meant to provide.
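
As a reminder of what those annotations look like, the following sketch defines likely() and unlikely() on top of GCC's __builtin_expect(), in the same style as the kernel's macros. The function checked_div() is a hypothetical example and not kernel code.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>

/* Branch hints built on GCC's __builtin_expect(). They only influence
 * code layout and static prediction; program behavior is unchanged. */
#define likely(x)   __builtin_expect(!!(x), 1)
#define unlikely(x) __builtin_expect(!!(x), 0)

/* Hypothetical example: integer division with a rarely taken error path.
 * Returns 0 on success and stores the quotient in *out, or -1 on error. */
int checked_div(int num, int den, int *out)
{
    assert(out != NULL);

    /* Cold path: the hint tells the compiler it may move this out of line. */
    if (unlikely(den == 0 || (num == INT_MIN && den == -1)))
        return -1;

    *out = num / den;   /* hot path */
    return 0;
}
```

When the hint is honored, the error return can be placed away from the straight-line path so that it does not occupy cache lines alongside hot code; that placement is what is lost when the compiler discards the information.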

Because -Os did not produce the desired results, Linus disabled it by default in 2011, effectively ending experimentation with this option. Recently, though, Ling Ma posted some results suggesting that the situation might have changed. Recent Intel processors, it seems, have a new cache for decoded instructions, increasing the benefit obtained by having code fit into the cache. The performance of the repeated "move" instructions used by GCC for memory copies in -Os mode has also been improved in newer processors. The posted results claim a 4.8% performance improvement for the netperf benchmark and 2.7% for the volano benchmark when -Os is used on a newer CPU. Thus, it was suggested, maybe it is time to reconsider -Os, at least for some target processors.

Naturally, the situation is not quite that simple. Valdis Kletnieks complained that the benchmark results may not be showing an actual increase in real-world performance. Distributors hate shipping multiple kernels, so an optimization mode that only works for some portion of a processor family is unlikely to be enabled in distributor kernels. And there is still the problem of the loss of branch prediction information which, as Linus verified, still happens when -Os is used.

What is really needed, it seems, is a kernel-specific optimization mode that is more focused on instruction-cache performance than code size in its own right. This mode would take some behaviors from -Os while retaining others from the default -O2 mode. Peter Anvin noted that the GCC developers are receptive to the idea of implementing such a mode, but there is nobody who has the time and inclination to work on that project at the moment. It would be nice to have a developer who is familiar with both the kernel and the compiler and who could work to make GCC produce better code for the kernel environment. Until somebody steps up to do that work, though, we will likely have to stick with -O2, even knowing that the resulting code is not as good as it could be.


Rethinking optimization for size

Posted Jan 31, 2013 8:13 UTC (Thu) by kugel (subscriber, #70540) [Link] (25 responses)

Why should that mode be kernel-specific? It seems completely fine for user space programs as well (and likely()/unlikely() can be used by them too).

Rethinking optimization for size

Posted Jan 31, 2013 10:57 UTC (Thu) by mjthayer (guest, #39183) [Link] (23 responses)

kugel wrote:
> Why should that mode be kernel-specific? It seems completely fine for user space programs as well (and likely()/unlikely() can be used by them too).
Roughly what I was thinking too - and additionally that if -Os is performing better than -O2 then perhaps that means that the way the compiler handles -O2 (that is, what optimisations -O2 enables) needs some attention rather than that users should prefer -Os over -O2 for performance.

Rethinking optimization for size

Posted Jan 31, 2013 11:55 UTC (Thu) by epa (subscriber, #39769) [Link] (22 responses)

I never understood why the compiler doesn't just have an optimization option mode which optimizes for speed on the selected target CPU. Instead we have -O, -O2, and -O3 where each enables a fixed list of optimization strategies (for example -O3 is documented as turning on function inlining, tree vectorization, etc). This means that if it turns out that a particular optimization tends to help on some CPUs but not others, there is no way to just choose whatever is best, even though you may have specified the exact processor to generate code for.

After all, if you really were interested in a fixed set of optimization rules you could specify them by hand with a long list of -fthis and -fthat. Most people who aren't interested in that level of detail would rather have a way to let the compiler decide what's best - even if its decisions are not quite as good as a skilled developer hand-tuning the flags.

To be clear, I am not saying that the compiler can magically determine which optimization flags will help on a particular piece of code. But it can know in general which ones tend to work on which CPUs. An older CPU may benefit from loop unrolling while a newer one, whose performance is more memory-bound, will usually not benefit. It makes more sense for these heuristics to be codified in the compiler, with a table mapping CPU models to optimizations. This table would explicitly be open to change in newer compiler versions.

Rethinking optimization for size

Posted Jan 31, 2013 21:48 UTC (Thu) by mstefani (guest, #31644) [Link] (2 responses)

The compiler has no chance to know on which CPU the code will run in the end.
Unless you're Gentoo it is highly likely that your binaries were compiled on a totally different CPU.

Rethinking optimization for size

Posted Feb 1, 2013 10:51 UTC (Fri) by epa (subscriber, #39769) [Link] (1 responses)

> The compiler has no chance to know on which CPU the code will run in the end.

The programmer can tell it, if not the exact CPU, then at least the family or families to optimize for. The -march and -mcpu options already affect code generation in the choice of instructions; it's strange that there is little connection from that to the higher-level optimizations that are applied.

The developer building Fedora, etc, also doesn't know on which CPU it will be run, but you can make a reasonable guess and try to use optimizations that perform well on typical target hardware. That won't be the same set of optimizations that worked on a typical i486, even if you exclude things that require new instruction set support. Yet the set of optimizations chosen by -O and -O2 is essentially fixed and gcc doesn't use its knowledge of the target CPU to influence that set.

Rethinking optimization for size

Posted Feb 7, 2013 14:50 UTC (Thu) by bluss (guest, #47454) [Link]

GCC does have options for this. -mtune to optimize for a cpu without losing generality, -mcpu to optimize and use exclusive instruction sets etc.

Rethinking optimization for size

Posted Jan 31, 2013 23:21 UTC (Thu) by khim (subscriber, #9252) [Link] (17 responses)

> I never understood why the compiler doesn't just have an optimization option mode which optimizes for speed on the selected target CPU.

It does. Typically distributions compile code for 386 or 486 but optimize for something newer. The problem is that -Os does not use this information: it optimizes purely for size, not for speed.

> Instead we have -O, -O2, and -O3 where each enables a fixed list of optimization strategies (for example -O3 is documented as turning on function inlining, tree vectorization, etc). This means that if it turns out that a particular optimization tends to help on some CPUs but not others, there is no way to just choose whatever is best, even though you may have specified the exact processor to generate code for.

The vast majority of options can lead to speedup or slowdown depending on the program, not on the CPU. And -O, -O2 and -O3 are just typical sets of strategies. Finding the right set of optimizations for a given program is a tough problem.

Rethinking optimization for size

Posted Feb 1, 2013 11:01 UTC (Fri) by epa (subscriber, #39769) [Link] (16 responses)

As I understand it, the -mtune flag affects the low-level code generation step, but it has no influence on the optimizations that are applied at a higher level. It won't, for example, turn on or off loop unrolling depending on how memory-bound a particular CPU tends to be.

> The vast majority of options can lead to speedup or slowdown depending on the program, not on the CPU.

This is certainly true. But most programmers don't want to spend time testing each optimization separately; they tend to just pick -O2 and let the compiler decide. After all the compiler writers know more about optimization, even if I know more about my particular program. If the compiler could be just a bit smarter, that would be a big win.

Rethinking optimization for size

Posted Feb 1, 2013 21:02 UTC (Fri) by khim (subscriber, #9252) [Link] (15 responses)

> It won't, for example, turn on or off loop unrolling depending on how memory-bound a particular CPU tends to be.

It's impossible to predict unless you know what your program is doing - and compiler deals with functions. Even puny Atoms have 32KiB of L1 cache where functions (even with loops unrolled) are usually smaller.

> If the compiler could be just a bit smarter, that would be a big win.

This way lies madness. The difference between contemporary CPUs is much smaller than you think. The aforementioned cache, which presumably should be handled differently in different cases, is 32KiB on most Intel CPUs (from Atoms to Xeons) and 64KiB for AMD. L2 differs more substantially, but the difference is not large enough to affect issues at small (function-sized) scale, and LTO is not yet in wide use and not all that mature besides.

> But most programmers don't want to spend time testing each optimization separately; they tend to just pick -O2 and let the compiler decide.

It's even worse: they often abuse the "premature optimization is the root of all evil" mantra to introduce a 3x-5x-10x slowdown. What the compiler does after that is more-or-less irrelevant. If people care about efficiency they should start thinking about efficiency first and not hope that the compiler will magically make 10 levels of indirection disappear. No, they don't disappear, and the compiler very rarely can do anything about them, but it looks like developers (except for kernel developers, of course) don't understand that. Instead most books explain how you can use them to nicely "encapsulate" and "separate" stuff - usually without ever mentioning their price.

Rethinking optimization for size

Posted Feb 8, 2013 17:50 UTC (Fri) by daglwn (guest, #65432) [Link] (14 responses)

> Difference between contemporary CPUs is much smaller than you think.

But not as small as you think. I work on a production compiler and we are always tuning for new targets. Even something as simple as moving from Sandy Bridge to Ivy Bridge can result in a change in strategy.

> What's even worse: they often abuse "premature optimization is the root
> of all evil" mantra to introduce 3x-5x-10x slowdown.

But worse than that is programmers trying to out-guess the compiler and do hand loop unrolling, converting of array accesses to pointer arithmetic and the like. This *kills* the compiler's ability to analyze the program and thus make transformations to improve the code.

It is essential for the developer to take a high-level view of performance. Algorithm and data structure choice is the #1 performance decision to make. After that, the ROI decreases rapidly for the programmer. Yes, the programmer should be aware of the cost of abstractions when appropriate but we should not throw away those abstractions on a whim. They save expensive programmer time. Do hand performance tuning *only* after a proven need via profiling.

Rethinking optimization for size

Posted Feb 8, 2013 21:41 UTC (Fri) by khim (subscriber, #9252) [Link] (13 responses)

> Do hand performance tuning *only* after a proven need via profiling.

By that time it's often much too late to do anything. The one thing a programmer should keep in mind is a few numbers. If you introduce a nice level of indirection (to facilitate future expansion or something like this) and this indirection triggers access to the main memory, then you are losing about 500-600 ticks right there. And a contemporary CPU can move hundreds of sequential bytes and do a thousand operations in that time! If your program is built around a bazillion tiny objects it's too late to do anything at this point: to remove these useless levels of indirection you basically need to rewrite the program from scratch.

> But not as small as you think. I work on a production compiler and we are always tuning for new targets. Even something as simple as moving from Sandy Bridge to Ivy Bridge can result in a change in strategy.

Sure, but how much can you hope to win? I speak from experience: just recently we rewrote a piece of code. It went from a nice "modern" structure with five or six independent layers and a couple of dozen structures to one function (an autogenerated one) with 20'000 lines of code and a dozen simple local variables. The speedup was about 10x (8x in one mode and 12x in another). Do you really believe you can do something like this with compiler options or small tweaks after a profiler run?

P.S. Actually we can squeeze additional ~30% with PGO and some other compiler tricks but in the end we decided that it complicates our build system too much and accepted "mere" 8x/12x speedup.

Rethinking optimization for size

Posted Feb 9, 2013 0:00 UTC (Sat) by dlang (guest, #313) [Link] (12 responses)

it depends on what you define as 'hand optimization'

unrolling loops is something that is almost always better left to the compiler.

But changing the algorithm from using a pointer-heavy set of linked lists to an implementation using small offsets in a buffer is not something the compiler will ever do, but can result in huge speedups.
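
To make that concrete, here is a minimal sketch of a singly linked list whose nodes refer to each other by small offsets into one contiguous buffer instead of by pointers to separately allocated objects. All names (struct list, list_push(), POOL_CAPACITY and so on) are hypothetical, and the fixed pool size is an assumption made to keep the example short.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define POOL_CAPACITY 1024
#define NIL UINT16_MAX          /* sentinel offset meaning "no next node" */

/* Nodes link to each other by 16-bit offsets into one contiguous pool
 * rather than by full-width pointers to separate allocations. */
struct node {
    int32_t  value;
    uint16_t next;
};

struct list {
    struct node pool[POOL_CAPACITY];
    uint16_t head;              /* offset of the first node, or NIL */
    uint16_t used;              /* number of pool slots handed out */
};

void list_init(struct list *l)
{
    assert(l != NULL);
    l->head = NIL;
    l->used = 0;
}

/* Push a value at the front; returns 0 on success, -1 if the pool is full. */
int list_push(struct list *l, int32_t value)
{
    assert(l != NULL);
    if (l->used >= POOL_CAPACITY)
        return -1;

    uint16_t slot = l->used++;
    l->pool[slot].value = value;
    l->pool[slot].next = l->head;
    l->head = slot;
    return 0;
}

/* Walk the list; every step stays inside the same buffer. */
int64_t list_sum(const struct list *l)
{
    int64_t total = 0;

    assert(l != NULL);
    for (uint16_t i = l->head; i != NIL; i = l->pool[i].next)
        total += l->pool[i].value;
    return total;
}
```

The nodes are smaller and adjacent in memory, so a traversal touches fewer cache lines than one that chases pointers across the heap. How much that helps depends on the workload and has to be measured.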

"trust the compiler, write whatever you want" doesn't work well in real life, and by the time you get things under a real load to find the bottlenecks, it's frequently too late to change them short of a re-write.

You need to keep end efficiency in mind as you go along. This requires that you keep up to date with what sorts of things are cheap to do and what are expensive to do. If you get it right, you are an unsung hero (you seldom get thanks for things that don't crumple under load), if you get it wrong you get ridiculed.

Rethinking optimization for size

Posted Feb 9, 2013 10:32 UTC (Sat) by khim (subscriber, #9252) [Link] (11 responses)

Are you sure you wanted to answer my post? Because it looks like you are saying almost exactly the same thing I did: the compiler can usually optimize your code well enough, but it can't do anything to your data structures - and this is where major inefficiencies lurk. And by major I mean major: 10x slowdown, 100x slowdown, sometimes even 1000x slowdown!

Rethinking optimization for size

Posted Feb 9, 2013 23:37 UTC (Sat) by daglwn (guest, #65432) [Link]

I believe I said exactly the same.

Rethinking optimization for size

Posted Feb 10, 2013 0:16 UTC (Sun) by dlang (guest, #313) [Link] (9 responses)

> converting of array accesses to pointer arithmetic and the like.

You said this is bad behavior on the programmer's side

however, this may be exactly the algorithm/data structure change that produces the multiple orders of magnitude performance improvements, and the type of thing the developer should keep in mind

Rethinking optimization for size

Posted Feb 10, 2013 3:59 UTC (Sun) by daglwn (guest, #65432) [Link] (8 responses)

> > converting of array accesses to pointer arithmetic and the like.

> You said this is bad behavior on the programmer's side

> however, this may be exactly the algorithm/data structure change that
> produces the multiple orders of magnitude performance improvements, and
> the type of thing the developer should keep in mind

I find that extremely hard to believe. Do you have an example? I can't think of any possible reason that converting array index operations into pointer arithmetic is ever a win.

Now if you're talking about changing out an array data structure for some other kind of data organization, that's something else entirely.

Rethinking optimization for size

Posted Feb 10, 2013 11:34 UTC (Sun) by khim (subscriber, #9252) [Link] (7 responses)

> I can't think of any possible reason that converting array index operations into pointer arithmetic is ever a win.

It may be a win because to access data using an index you need two variables (array address and index) but to access data using pointer you need one (just pointer is enough). When callbacks are involved it may replace code which uses just registers to pass information around with code which uses structure in memory - in this case wins can be substantial (speaking from experience). But of course such cases are rare.

But you are saying that conversion to pointer arithmetic is somehow bad? In my experience it's usually net neutral. What kind of optimization can the compiler apply if I'm using indexes? How often can it apply them? I'm not saying it never happens, I just don't see it as a common occurrence.

Rethinking optimization for size

Posted Feb 11, 2013 17:21 UTC (Mon) by daglwn (guest, #65432) [Link] (6 responses)

> It may be a win because to access data using an index you need two
> variables (array address and index) but to access data using pointer you
> need one (just pointer is enough).

At code-generation time the compiler knows well enough how to convert from array notation to pointer arithmetic so no extra registers are used. Indeed it must do so for most architectures. By doing that too early, the compiler loses valuable information about array stride accesses and the like that are critical for transformations like vectorization.

In addition, pointer arithmetic has all sorts of nasty aliasing properties that hamper the compiler's ability to analyze the code.

"But arrays and pointers are the same in C!" some may cry. Well, that's not entirely true and even if it were there's nothing that prohibits the compiler from representing the two very differently in its internal data structures.

> When callbacks are involved it may replace code which uses just
> registers to pass information around with code which uses structure in
> memory.

This sounds like an ABI issue. Most sane ABIs pass small structs via registers.

> What kind of optimization compiler can apply if I'm using indexes?

Anything that involves analyzing loop inductions and stride patterns.

- Vectorization
- Loop interchange
- Cache blocking
- Strip mining
- Loop collapse

and about a dozen other transformations. These are the "big improvement" optimizations. Petty things like CSE and copy propagation are important but are done more to enable these big-gain transformations than for their code improvement in and of themselves.

In rare cases the compiler can convert pointer arithmetic back to array indices but this often involves extra bit-twiddling or other arithmetic to recover the original indices and more likely than not aliasing rules get in the way and make it impossible to recover the critical information.

Pointer arithmetic is just generally bad for the compiler. Avoid it if possible.
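
For illustration, here is the same loop written both ways; the function names are hypothetical. Whether a given compiler treats the two differently is something to check in the generated code rather than assume.

```c
#include <assert.h>
#include <stddef.h>

/* Index form: the induction variable i, its bounds and the unit stride
 * are all explicit in the source. */
void scale_indexed(float *out, const float *in, float factor, ptrdiff_t n)
{
    assert(n >= 0);
    for (ptrdiff_t i = 0; i < n; i++)
        out[i] = in[i] * factor;
}

/* Hand-linearized pointer form: same result, but the trip count and
 * stride have to be recovered from pointer increments and a pointer
 * comparison. */
void scale_pointer(float *out, const float *in, float factor, ptrdiff_t n)
{
    assert(n >= 0);
    const float *end = in + n;

    while (in != end)
        *out++ = *in++ * factor;
}
```

Neither version uses `restrict`, so the comparison is about notation only; in both, the compiler still has to consider that `out` and `in` might overlap.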

Incidentally, this is also why it's a very *bad* idea to use unsigned as a loop counter. One is not "giving the compiler more information." On the contrary, by using unsigned one is moving the arithmetic away from normal algebraic rules to the realm of modulo arithmetic. The compiler must then make all kinds of pessimistic assumptions about loop termination and access patterns. This also kills many of the loop transformations listed above.
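
A small sketch of the difference, using hypothetical function names: both loops step by two, but only the unsigned one has defined wraparound behavior that the compiler must preserve.

```c
#include <assert.h>
#include <stddef.h>

/* Signed counter: overflow of i is undefined behavior in C, so the
 * compiler may assume i never wraps and that the loop terminates after
 * about n/2 iterations. */
long sum_every_other_signed(const int *a, int n)
{
    long total = 0;

    assert(a != NULL || n <= 0);
    for (int i = 0; i < n; i += 2)
        total += a[i];
    return total;
}

/* Unsigned counter: arithmetic is modulo 2^N. If n were UINT_MAX, i would
 * wrap from UINT_MAX - 1 back to 0 and the loop would never terminate, so
 * the compiler has to allow for that case when analyzing the loop. */
long sum_every_other_unsigned(const int *a, unsigned n)
{
    long total = 0;

    assert(a != NULL || n == 0);
    for (unsigned i = 0; i < n; i += 2)
        total += a[i];
    return total;
}
```

For ordinary inputs the two functions return the same value; the difference lies only in what the compiler is allowed to assume about the loop.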

Of course I am speaking in generalities. One can always find a counter-example. However, the vast majority of codes behave as I describe.

Rethinking optimization for size

Posted Feb 11, 2013 18:45 UTC (Mon) by Aliasundercover (guest, #69009) [Link] (3 responses)

In times past I often reviewed the code output from compilers I was working with. It was helpful to calibrate what idioms carried what costs and track how they changed with compilers.

I still get the itch to look now and then but mostly recoil in horror at how hard it is to relate the code GCC gives me to what I wrote. Typically I have some function I want to review but give up after 10 minutes of searching the listing without finding it. The listing is of course fabulously ugly but that is no change from any other compiler I ever used. I just don't find my code. Well, I do, but it is all in a huge block of raw C code without the corresponding output.

No doubt some cool optimization hoisted my code out of its place in the C code I wrote and dropped it some place which makes sense to the compiler. No doubt all this transformation is good for run time performance or size or something. Perhaps listing quality just isn't an interesting bit of work for compiler writers.

I used to know what those old Windows and SunOS compilers would do with my code but now I have no similar feeling for GCC. I can read what people write about indexing vs. pointers and signed vs. unsigned but I used to really know from first hand observation.

I wish the listing got more respect. It can be a powerful optimization tool.

Maybe I just do it wrong. My makefile has the following incantation for generating listings.

%.lst: %.c
	$(CC) $(CFLAGS) $(CPPFLAGS) -g -Wa,-a,-ad -c $< > $@

Rethinking optimization for size

Posted Feb 11, 2013 19:25 UTC (Mon) by daglwn (guest, #65432) [Link]

> The listing is of course fabulously ugly but that is no change from any
> other compiler I ever used.

This is actually a quality of implementation issue. There are compilers out there that give fantastic listings, even down to a mostly-readable highish-level decompilation even after very aggressive code transformations. This is not very easy to obtain and really has to be baked-in at the beginning of the compiler design.

But you're right in that a good compiler will make your code unrecognizable. :)

> I wish the listing got more respect. It can be a powerful optimization
> tool.

Indeed. I don't think there's much one can do with gcc to get a decent listing. It just doesn't carry enough information. Neither does LLVM, unfortunately, at least AFAIK.

Rethinking optimization for size

Posted Feb 11, 2013 21:09 UTC (Mon) by PaXTeam (guest, #24616) [Link] (1 responses)

maybe you want gcc -S -fverbose-asm for commented assembly output? also -fdump-tree-all and -fdump-rtl-all if you want to see what happens to your C and also relate the internal GIMPLE variable names to the comments in the verbose asm output.

Rethinking optimization for size

Posted Feb 12, 2013 16:28 UTC (Tue) by Aliasundercover (guest, #69009) [Link]

Well, you got me there. I tried those dump options and got 151 spam files in my source directory.

-fverbose-asm seems useful. It doesn't help with finding where my code went but the comments are nice.

I wish compiler listings got more respect.

Rethinking optimization for size

Posted Feb 11, 2013 19:54 UTC (Mon) by khim (subscriber, #9252) [Link] (1 responses)

> Anything that involves analyzing loop inductions and stride patterns.
> - Vectorization
> - Loop interchange
> - Cache blocking
> - Strip mining
> - Loop collapse
> and about a dozen other transformations.

Which are not applicable at all if you have arrays of various complex objects. Sure, I can understand that in some tight loops where "vectorization", "loop interchange" and other cool things can be applied indexes may be beneficial. But for typical high-level code (that is: for 90% of code if not 99% of code in a given project) changes from indexes to pointers and back make absolutely no difference: any function call tends to break all these nice techniques - and there are a lot of them.

Sure, for inner loops it may be interesting, but those are usually faster anyway when implemented with the appropriate CPU intrinsics (the changes in data structures needed to use them efficiently are impossible for the compiler), thus in practice I rarely see any observable difference. Of course nowadays it's often better to use CUDA or OpenCL to push all that to the GPU - where different rules are used altogether and where traditional pointers make no sense at all, but these are specialized applications.

> This sounds like an ABI issue. Most sane ABIs pass small structs via registers.

Yup. Up to six registers for the x86-64 case. And if you have a couple of arrays plus some kind of "options" argument plus some callback_data... you've already used all of them. Add one additional argument - and a spill is inevitable.

You may say that code which calls a callback in a tight loop is hopeless in the first place, but that's the problem: quite often I can not afford to do anything else. It's just too expensive to have one function for buttons, another for lists and so on: code is pushed out of L1 (or sometimes even L2) cache and all these benchmark-friendly optimizations actually slow the code down.

Rethinking optimization for size

Posted Feb 11, 2013 21:41 UTC (Mon) by daglwn (guest, #65432) [Link]

> Which are not applicable at all if you have arrays of various complex
> objects.

How did you come to that conclusion? Sometimes it is not worth it but for smallish structs it can be a win. Complex numbers are a good example.

> any function call tends to break all these nice techniques

Only if the function call is still there, and even then the compiler can sometimes accomplish the task, depending on how aggressive the developers were.

> Of course nowadays it's often better to use CUDA or OpenCL to push all
> that to the GPU

No. There is never a case where CUDA or OpenCL are a good idea. Only poor compilers have led us down that path. Look at what's being done with OpenACC. A good compiler can match CUDA performance pretty easily and can often dramatically outperform it.

> Add one additional argument - and spill is inevitable.

In the grand scheme of things, one write to cache/read from cache isn't usually critical. Sure, there are cases where this kind of optimization is very important -- if the callback is called in a tight loop for example. But in that case a better solution is often to refactor the code and/or rework the data structures. It should not be necessary to hand-linearize array accesses to get performance.

> You may say that code which calls a callback in a tight loop is hopeless
> in the first place

Actually, I would say that's often good design for the reasons you give. But then this kind of code usually isn't dealing with numerical arrays or arrays of very small structs which might be well vectorized. I suppose that is what you were thinking of with your first statement.

If that's the kind of code you're concerned about then yes, hand-linearizing array accesses is probably not going to hurt much. But it won't really help either. I've no problem with a loop that iterates over such arrays using a single pointer. I am more concerned about people who take multidimensional arrays and translate accesses to complex pointer arithmetic.

Still, if the objects are small enough, vectorization targeting efficient load/store of the data can often be a win, even if the actual data manipulation isn't vectorized.

Rethinking optimization for size

Posted Feb 8, 2013 17:44 UTC (Fri) by daglwn (guest, #65432) [Link]

gcc has -mtune for that. Many compilers do take target architecture into account when doing analyses and transformations.

Rethinking optimization for size

Posted Feb 8, 2013 17:43 UTC (Fri) by daglwn (guest, #65432) [Link]

Spot on.

Speaking as a compiler developer, the stated problems look like possible performance bugs to me. There are always tradeoffs in compiler transformations. You hope to speed one thing up without hurting other stuff too much. Most of it is heuristic based and changing the heuristics can have dramatic effects. I have seen register allocators swing performance +-20% simply by changing the heuristic of how to pick which object to allocate next.

Incidentally, I have run into exactly the same unrolling/icache issue before. It's one of those 2nd- or 3rd-order effects you hope doesn't matter but when it does it can be a fun time tracking it down. The main problem is that it is highly context sensitive. Unrolling a loop by 4 in one place may be exactly right but may result in disaster on another loop. This is one reason compilers try to do a global analysis when possible. Of course, that's another tradeoff, one between generated code quality and compile time.

The last thing we need is another compiler mode. We have too many already. The real solution is to re-tune some of gcc's passes for modern architectures. That is real work, though, and takes a ton of time and patience.

Rethinking optimization for size

Posted Jan 31, 2013 21:25 UTC (Thu) by robert_s (subscriber, #42402) [Link] (7 responses)

Offtopic, but this article reminded me of some work a GHC developer did a few years ago using genetic algorithms to search for an optimal set and order of LLVM optimization flags for a particular program.

(Quite a commendation to LLVM that it allows this amount of control over optimization passes I suppose)

http://donsbot.wordpress.com/2010/03/01/evolving-faster-h...

Wonder if anyone's done any further work on the subject.

Rethinking optimization for size

Posted Feb 1, 2013 16:20 UTC (Fri) by NAR (subscriber, #1313) [Link]

This occurred to me also while reading through the article - with a good benchmark finding the right optimization can be automated (to a degree).

Rethinking optimization for size

Posted Feb 4, 2013 18:38 UTC (Mon) by ssam (guest, #46587) [Link] (2 responses)

If you have the time, then profile-guided optimization (PGO) is worth a look. You compile, then run the code with a profiler so that you know which bits of code are used the most. Then the compiler can be selective about how different bits get optimized. At a simple level, use -Os for rarely called functions and -O3 for repeatedly called ones.
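As a rough sketch of that compile, run, recompile cycle with GCC: the file name pgo_demo.c, the PGO_DEMO_STANDALONE macro, and the toy workload are all made up for illustration, while -fprofile-generate and -fprofile-use are the GCC options for writing and consuming profile data.

```c
/*
 * Hypothetical file pgo_demo.c. One possible workflow:
 *
 *   gcc -O2 -DPGO_DEMO_STANDALONE -fprofile-generate pgo_demo.c -o pgo_demo
 *   ./pgo_demo        (training run; writes profile data on exit)
 *   gcc -O2 -DPGO_DEMO_STANDALONE -fprofile-use pgo_demo.c -o pgo_demo
 */
#include <stdio.h>

/* Taken for almost every iteration, so the profile marks it hot. */
static long long handle_common(long long x)
{
	return x * 2;
}

/* Taken once per thousand iterations, so the profile marks it cold. */
static long long handle_rare(long long x)
{
	return x / 3 - 1;
}

long long process(long long n)
{
	long long acc = 0;

	for (long long i = 0; i < n; i++) {
		if (i % 1000 != 0)
			acc += handle_common(i);
		else
			acc += handle_rare(i);
	}
	return acc;
}

#ifdef PGO_DEMO_STANDALONE
int main(void)
{
	printf("%lld\n", process(10000000));
	return 0;
}
#endif
```

The second build sees how often each branch was actually taken during the training run, so the result is only as good as that run is representative.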

Rethinking optimization for size

Posted Feb 4, 2013 23:29 UTC (Mon) by rgmoore (✭ supporter ✭, #75) [Link] (1 responses)

It seems like PGO is something that would only have to be run once in a while. The hot code paths are likely to stay hot unless you substantially rewrite your program. So you'd only need to run the profiling occasionally, then use that to optimize your choice of compiler flags, which could be written into the makefile. Everyone downstream could benefit from the improvements without having to do the profiling themselves.

Rethinking optimization for size

Posted Feb 5, 2013 19:10 UTC (Tue) by khim (subscriber, #9252) [Link]

PGO does not work this way. It does not change the set of flags; it's a totally orthogonal optimization.

Basically a lot of optimizations (perhaps most) are tradeoffs: "if we unroll this loop and it's hot then we win because it'll be faster, but if it's cold then we lose because we increase memory pressure... and we can unroll it twice or four times or even hundreds of times... what to do, what to do". Without PGO there are some heuristics ("the if branch will probably be taken more often than the else branch", etc.), but with PGO you know whether the given piece of code is hot or cold. And this makes all the same optimizations perform better.

That's why you cannot reuse the results of PGO runs: you need the exact same code compiled twice. Most changes will invalidate the results (tiny changes in code may mean significant changes in the parsed tree, especially in C++). Yes, you know that the hot codepath is still somewhere in this function, but where exactly? That's the question.

Rethinking optimization for size

Posted Feb 8, 2013 14:12 UTC (Fri) by jdbrandmeyer (guest, #85840) [Link] (2 responses)

Other way around, actually. Acovea came well in advance of your LLVM example, and the target compiler was GCC.

http://stderr.org/doc/acovea/html/acoveaga.html

Rethinking optimization for size

Posted Feb 8, 2013 17:56 UTC (Fri) by daglwn (guest, #65432) [Link] (1 responses)

Oh, this was done long before that. See for example:

http://dl.acm.org/citation.cfm?id=1134650.1134663&col...

http://dl.acm.org/citation.cfm?id=603339.603341&coll=...

http://dl.acm.org/citation.cfm?id=314403.314414&coll=...

And that's just a small sample. In addition, there are many papers using genetic algorithms to drive the heuristic decisions within various transformation passes.

Rethinking optimization for size

Posted Feb 8, 2013 22:20 UTC (Fri) by Shewmaker (guest, #1126) [Link]

The first project that I remember seeing that seriously pursued automatically optimizing for whatever architecture you compiled it on was the ATLAS linear algebra library. It was moderately successful, but it could be beaten by Goto's BLAS.

I remember ACOVEA too, but it looks like it is no longer being maintained.

There is a current effort, Collective Tuning, that goes beyond ACOVEA's intentions. They compare it to their Continuous Collective Compilation Framework.

I don't know how it compares to the LLVM work.

Rethinking optimization for size

Posted Feb 7, 2013 8:58 UTC (Thu) by massimiliano (subscriber, #3048) [Link] (1 responses)

I might be wrong, but what we really need is not just a new compiler flag that says "optimize for speed knowing that code size heavily affects speed".

What would be nice to have would be profile-driven optimizations.

You compile your kernel, profile it during a "normal workload" (the definition of which is really tricky), and then, maybe, optimize functions differently.
For instance, for very hot ones you could afford loop unrolling, and for colder ones you'd keep size as small as possible.
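Short of a full profiling run, GCC lets you mark individual functions by hand with the hot and cold function attributes, which gives a feel for what "optimize functions differently" could look like. This is only a sketch: the function names are invented, and which functions deserve which label is an assumption that a real profile would have to confirm.

```c
#include <stddef.h>

/* Assumed hot: called for every buffer, so speed matters more than size. */
__attribute__((hot))
unsigned checksum(const unsigned char *buf, size_t len)
{
	unsigned sum = 0;

	for (size_t i = 0; i < len; i++)
		sum += buf[i];
	return sum;
}

/* Assumed cold: error reporting only, so the compiler may favor size. */
__attribute__((cold))
const char *describe_error(int code)
{
	switch (code) {
	case 0:
		return "ok";
	case 1:
		return "out of memory";
	default:
		return "unknown error";
	}
}
```

Hand-placed attributes like these are a guess about the workload, whereas a profile measures it.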

My 2c,
_ Massi

Rethinking optimization for size

Posted Feb 11, 2013 23:59 UTC (Mon) by nix (subscriber, #2304) [Link]

Um, yes. That's what profile-guided optimizations *are*. :)

Rethinking optimization for size

Posted Feb 7, 2013 22:41 UTC (Thu) by stevenb (guest, #11536) [Link]

Perhaps the likely()/unlikely() hints were ignored because GCC did
not reorder basic blocks in traces at -Os. If that is the problem,
then it should be fixed in GCC 4.8. See gcc.gnu.org/PR54364

Perhaps someone can try GCC 4.8 -Os for the kernel and see if the
hints are honored now? Otherwise it'd be good to have a PR filed
for it in GCC Bugzilla.
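For anyone who has not looked at them, the hints in question are thin wrappers around __builtin_expect. The macros below have the same shape as the kernel's; checked_length() is a made-up function, and whether the compiler actually moves the unlikely block out of line depends on the compiler version and optimization level, which is the open question here.

```c
#include <stddef.h>

/* Same shape as the kernel's macros; __builtin_expect is a GCC builtin. */
#define likely(x)	__builtin_expect(!!(x), 1)
#define unlikely(x)	__builtin_expect(!!(x), 0)

/* Hypothetical helper: returns -1 on the rare error path, else the length. */
long checked_length(const char *s)
{
	long n = 0;

	if (unlikely(s == NULL))
		return -1;	/* annotated as the rarely taken path */

	while (likely(s[n] != '\0'))
		n++;
	return n;
}
```

Comparing the generated assembly for a function like this at -O2 and -Os is one way to check whether the hint changes block placement.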