A suspend blockers post-mortem

By Jonathan Corbet
June 2, 2010

The failure of the lengthy attempt to get Android's suspend blockers patch set into the kernel offers a number of lessons at various levels. The technical side of this episode has been covered in a Kernel-page article; here we'll look, instead, at the process issues. Your editor argues that this sequence of events shows both the best and the worst of how Linux kernel development is done. With luck, we'll learn from both and try to show more of the best in the future.

Suspend blockers first surfaced as wakelocks in February, 2009. They were immediately and roundly criticized by the development community; in response, Android developer Arve Hjønnevåg made a long series of changes before eventually bowing to product schedules and letting the patches drop for some months. After the Linux Foundation's Collaboration Summit this year, Arve came back with a new version of the patch set after being encouraged to do so by a number of developers. Several rounds of revisions later, each seemingly driven by a new set of developers who came in with new complaints, these patches failed to get into the mainline and, at this point, probably never will.

In a number of ways, the situation looks pretty grim - an expensive failure of the kernel development process. Ted Ts'o described it this way:

Keep in mind how this looks from an outsider's perspective; an embedded manufacturer spends a fairly large amount of time answering one set of review comments, and a few weeks later, more people pop up, and make a much deeper set of complaints, and request that the current implementation be completely thrown out and that something new be designed from scratch --- and the new design isn't even going to meet all of the requirements that the embedded manufacturer thought were necessary. Is it any wonder a number of embedded developers have decided it's Just Too Hard to Work With the LKML?

Ted's comments point to what is arguably the most discouraging part of the suspend blocker story: the Android developers were given conflicting advice over the course of more than one year. They were told several times: fix X to get this code merged. But once they had fixed X, another group of developers came along and insisted that they fix Y instead. There never seemed to be a point where the job was done - the finish line kept moving every time they seemed to get close to it. The developers who had the most say in the matter did not, for the most part, weigh in until the last week or so, when they decisively killed any chance of this code being merged.

Meanwhile, in public, the Android developers were being criticized for not getting their code upstream, and their code was being removed from the staging tree. It can only have been demoralizing - and expensive too:

At this point we've spent more engineering time on revising this one patchset (10 revisions to address various rounds of feedback) and discussion of it than we have on rebasing our working kernel trees to roughly every other linux release from 2.6.16 to 2.6.32 (which became much easier after switching to git).

No doubt plenty of others would have long since given up and walked away.

There are plenty of criticisms which can be directed against Android, starting with the way they developed a short-term solution behind closed doors and shipped it in thousands of handsets before even trying to upstream the code. That is not the way the "upstream first" policy says things should be done; that policy is there to prevent just this sort of episode. Once the code has been shipped and applications depend on it, making any sort of change becomes much harder.

On the other hand, it clearly would not have been reasonable to expect the Android project to delay the shipping of handsets for well over a year while the kernel community argued about suspend blockers.

In any case, this should be noted: once the Android developers decided to engage with the kernel community, they did so in a persistent, professional, and solution-oriented manner. They deserve some real credit for trying to do the right thing, even when "the right thing" looks like a different solution than the one they implemented.

The development community can also certainly be criticized for allowing this situation to go on for so long before coming together and working out a mutually acceptable solution. It is hard to say, though, how we could have done better. While kernel developers often see defending the quality of the kernel as a whole as part of their jobs, it's hard to tell them that helping others find the right solutions to problems is also a part of their jobs. Kernel developers tend to be busy people. So, while it is unfortunate that so many of them did not jump in until motivated by the imminent merging of the suspend blocker code, it's also an entirely understandable expression of basic human nature.

Anybody who wants to criticize the process needs to look at one other thing: in the end it appears to have come out with a better solution. Suspend blockers work well for current Android phones, but they are a special-case solution which will not work well for other use cases, and might not even work well on future Android-based hardware. The proposed alternative, based on a quality-of-service mechanism, seems likely to be more flexible, more maintainable, and better applicable to other situations (including realtime and virtualization). Had suspend blockers been accepted, it would have been that much harder to implement the better solution later on.

And that points to how one of the best aspects of the kernel development process was on display here as well. We don't accept solutions which look like they may not stand the test of time, and we don't accept code just because it is already in wide use. That has a lot to do with how we've managed to keep the code base vital and maintainable through nearly twenty years of active development. Without that kind of discipline, the kernel would have long since collapsed under its own weight. So, while we can certainly try to find ways to make the contribution process less painful in situations like this, we cannot compromise on code quality and maintainability. After all, we fully expect to still be running (and developing) Linux-based systems after another twenty years.


A suspend blockers post-mortem

Posted Jun 3, 2010 2:04 UTC (Thu) by fuhchee (guest, #40059) [Link] (4 responses)

"in the end it appears to have come out with a better solution"

Would it be fair to say that it is premature to beat the drums of success
on this issue, before this better solution is implemented *and* merged?

A suspend blockers post-mortem

Posted Jun 3, 2010 3:50 UTC (Thu) by brendan_wright (guest, #7376) [Link]

> Would it be fair to say that it is premature to beat the drums of success on this issue, before this better solution is implemented *and* merged?

Exactly - Android has a solution that is working well right now in the real world. I hope this optimism about the new alternative proves to be justified!

A suspend blockers post-mortem

Posted Jun 3, 2010 18:57 UTC (Thu) by malor (guest, #2973) [Link] (2 responses)

I think corbet was maybe looking for something nice to say for both sides, but it was a bit of a stretch. In exchange for treating Google like shit and putting their developer(s) through hell, some early design work on an alternate approach has been done. Maybe, someday, it might be better, if someone actually wants to put the work in, but without a commercial drive to do so, I'm not seeing much motivation for it to happen.

If the kernel does actually get a better, implemented approach, then the kind words will have been right, but if it goes nowhere, then nothing particularly good would seem to have come from this particular mess.

I don't think pushing this out onto the embedded devs is right. This is purely a dev team organizational problem.

If people in the dev community have the power to demand a rewrite, they also need the power to authorize a merge. Either merge authority needs to move further down the dev tree, or external submitters need a method of avoiding the people who can only say no.

I hate to say it, but the kernel team is turning bureaucratic, an organization with layers of people who can only refuse new ideas, not approve them, but who don't reflect the actual opinions of the people with merge authority. This is classic bureaucracy, and it's killed an awful lot of great organizations over the years.

A suspend blockers post-mortem

Posted Jun 3, 2010 19:06 UTC (Thu) by corbet (editor, #1) [Link]

No, I wrote the article to say the things I thought needed to be said.

With regard to the solution: yes it's early to be celebrating. But I do know that there is a strong desire in the community to solve this problem; that's why a lot of non-Android people have put a lot of time into it. I also see that the shape of the proposed solution is such that it may solve a number of problems for other people as well. And it doesn't look hugely difficult to try out. So I think something will happen.

But, then, I've always been an optimistic person.

As for "merge authority," only one person really has that. But there has always been a strong consensus culture in the kernel community; it has traditionally been easy for a developer with any amount of standing in the community to hold things up. Nothing new there.

A suspend blockers post-mortem

Posted Jun 3, 2010 20:38 UTC (Thu) by farnz (subscriber, #17727) [Link]

It's interesting that you describe it as "treating Google like shit and putting their developer(s) through hell"; I see it very, very differently (and I've been following the threads on LKML as well as reading the articles here).

I see Google's developers coming up with a solution to a very specific problem, that's not going to help people outside of their devices, and that involves intrusive changes all over the kernel. By the time they bring it to the kernel folk, they're heavily invested in it - changing it is going to cost them a lot of effort.

Needless to say, they get a lot of pushback, as the solution they're proposing doesn't work for anything bar the areas Android is targeted at, yet requires all kernel developers to make allowances for them. At first, most of the pushback is met by the argument that they've shipped huge numbers of devices, and can't possibly change a design that they've made work.

Eventually, we get a statement of the problem that Google are trying to solve; as seems common with controversial kernel features, this triggers a whole pile of ideas, some of which get shot down as unworkable, others get refined into better solutions. We now seem to be at the stage where we're down to a single core idea for solving the problem, which is being refined into a decent userspace interface that can stay put forever, and a kernel interface that will work for now, but that can always be replaced.

In short, I see Google coming up with a short-sighted design, tied closely to details of their platform, and then (not maliciously, mind - it's hard to accept that you've made a mistake) trying to use their size to bully the kernel developers into accepting a bad solution to the problem, that's both intrusive and not helpful to non-Android users.

Had Google brought wakelocks/suspend blockers to the mainstream early in Android development (spinning it as something they do to save server power, for example, as Android couldn't come into the open at that point), they'd have had much less pain - they'd almost certainly have ended up implementing something closely related to the QoS constraints that seem to be winning the day now. Similarly, had they been able to get what they wanted in a tightly confined "android.ko", the kernel guys would probably have accepted it without this huge argument that's ensuing now. It's the combination of "this only solves our problem - the rest of you can go swivel, because we're shipping this", and "all of you need to allow for our solution to this problem, because it affects code all over the shop" that's caused the strife.

Upstream first policy

Posted Jun 3, 2010 4:24 UTC (Thu) by epa (subscriber, #39769) [Link] (27 responses)

Is 'upstream first' really the best policy? Do not ship any code until it has run the gauntlet of placating the LKML? Of course the kernel developers would naturally suggest such a policy, since we all believe our own taste and judgement is superior to that of third parties. However, if left unchecked, this bias just leads to not-invented-here syndrome.

Just as an idea, might it not be a better discipline to have a 'downstream first' policy? No new feature should be added to the kernel until it has been widely tested in real-world use, preferably in a shipping product. And given two rival solutions, the one that solves an existing real-world problem and is already doing so for many users should be preferred to the one that is more general or cleaner, but does not exist yet or does not solve a problem immediately at hand.

Upstream first policy

Posted Jun 3, 2010 5:47 UTC (Thu) by alonz (subscriber, #815) [Link] (1 responses)

Yet another point—many kernel developers will not accept "upstream" code without a clear, demonstrable use-case (usually on actual hardware).

So embedded developers are stuck in a chicken-and-egg situation: they cannot ship working systems without a working kernel, and the (upstream) kernel will not accept required changes without seeing these systems.

Upstream first policy

Posted Jun 3, 2010 6:18 UTC (Thu) by neilbrown (subscriber, #359) [Link]

I would be very surprised if actual hardware were required.

A driver that is just a driver will normally be accepted on its own merits with the assumption that there is hardware that it works on.

There could be a problem if you need to make changes to core code to be able to support some aspect of the driver. You will probably be asked to show the driver that needs the functionality, but you might not want to finish off the driver depending on that functionality until you know it will be accepted.

In that case you need to open a dialogue, follow the "release-early, release often" principle (though maybe not too often) and risk the need to revise your driver if the core changes don't happen the way you hope.

Maybe the trickiest bit is knowing who to open the dialogue with in the first place...

Upstream first policy

Posted Jun 3, 2010 6:10 UTC (Thu) by neilbrown (subscriber, #359) [Link] (24 responses)

Nope, that way lies madness.

Maintainability is much more important than functionality. We (upstream) don't want new features if we cannot fix them when they break, or cannot improve surrounding code as the feature might break.

Developers focused on "make it work so we can ship" are going to be less focused on maintainability (or at least, that is the way it appears).

We don't want the rival solution that works now, we want the rival solution that will still work in 5 years.

Distributors are of course free to use a 'downstream first' policy- the GPL guarantees that freedom. But experience shows that 'upstream first' costs less in the long term.

Care to share your stats?

Posted Jun 3, 2010 10:02 UTC (Thu) by khim (subscriber, #9252) [Link] (6 responses)

From what I'm seeing, the companies which employ the 'upstream first' tactic routinely fail in the marketplace. And this is understandable: they cannot ship stuff when there is market demand for it - they are stuck with pleasing the upstream.

Sure, if you try to support your changes indefinitely it'll become a huge drain over time and you'll lose too - so you need to upstream your changes at some time. Sometimes a different solution is accepted, not what you proposed initially - but that's not a problem, the goal is to solve the problem end-users are having, not to have your code in the kernel. This is how RedHat worked for a long time (till they got enough clout in the kernel community to muscle through their changes without much effort), this is how successful embedded developers work (including Google), etc. Novell tried to play the 'upstream first' game and the end result does not look good for the company (even if it may be good for the kernel).

If you have stats which show that 'upstream first' is indeed the best policy for the developers, please share them - I've certainly heard this claim often enough, but rarely, if ever, with numbers.

The only exceptions are "leaf" drivers which don't change any infrastructure at all and are usually accepted without even looking - here upstreaming is so cheap that it really makes sense to do this.

Care to share your stats?

Posted Jun 3, 2010 10:25 UTC (Thu) by neilbrown (subscriber, #359) [Link]

Nope, no statistics. Just "a stitch in time saves nine" style anecdotal observations.

And it is only a long-term benefit. I can easily imagine a situation where the short term cost of going upstream-first would cause the business to fail so there is no possibility of a long term reward. But as soon as the horizon stretches out a bit, the more you get upstream the less you have to carry yourself.

Care to share your stats?

Posted Jun 3, 2010 13:00 UTC (Thu) by corbet (editor, #1) [Link] (3 responses)

So companies like Intel, which are very strongly in the upstream first camp these days (most of the time) are failing in the marketplace?

"Upstream first" is not a hard and fast rule. It's also not exactly "get the code into the mainline kernel first"; it's more along the lines of "be sure that the code can get into the mainline kernel first." There is a difference there.

I'm not sure I see "upstream first" holding back Novell. Citation needed. Instead, I see that the times they didn't do things that way (AppArmor) didn't work out all that well for them.

Care to share your stats?

Posted Jun 3, 2010 17:43 UTC (Thu) by jwarnica (subscriber, #27492) [Link]

Well, it seems that the "right thing" in the view of some company depends on what kind of market the company is in.

Component hardware companies typically don't sell software. Getting their new code into the kernel means *poof* they now have a bazillion systems that can use their hardware. It isn't to Intel's advantage to keep their own git repository somewhere. If I, as an end user of some Intel chipset, can't get it to work on my software far, far removed from Intel's repo, maybe next time I won't get a mobo with Intel Inside.

Appliance/embedded hardware companies, or OS companies, are a different story. Doing the globally "right thing": "upstream first" means they are slower to deliver their actual product, and (it should be noted) their actual product has less distinction than do its competitors. Sure, the patch may very well be GPL'd, but a competitor's patch which they just threw over the wall is harder for someone to use than something upstream. In a sense, it may as well be a secret.

More simply: If the end user is likely to interact directly with a single vendor, then that vendor can put their patches wherever they want, and not trying the gauntlet of the LKML is cheaper. If the end user is far removed from the provider, the provider should try to get that patch wide and far, which means in the upstream kernel.

So companies that do the globally "right thing" are rewarded by being slower, and less distinct, than those that do not.

Moving on:

I think part of the lesson here is that "be sure that the code can get into the mainline kernel first" is impossible to test. Until you actually submit code to the LKML, you have no idea the kinds of helpful, productive, petty, or absurd comments you will get in response. No one can predict with any level of accuracy if something will be accepted until it actually shows up in a release.

Care to share your stats?

Posted Jun 4, 2010 12:46 UTC (Fri) by kpvangend (guest, #22351) [Link] (1 responses)

I don't think bringing in Intel as an example is either fair or correct.
Intel can ship their processors without specific Linux support if they want to and the Linux code is not inside the box they ship.

Doing feature development like Intel or IBM can afford has interesting dynamics. For starters: not much secrecy. Secondly, no time-to-market pressure. Thirdly, the freedom to pick versions and platforms you want.

In contrast, most embedded vendors (and for now, I'm putting Google in that box, too) ship a Linux inside their box, running on some platform the software guys didn't choose.

If they take the time to merge their code upstream, they cannot ship.
And yes, many companies have failed by spending too much time in the community. Just compare the number of announcements on LinuxDevices.com with the amount of code merged and the number of products shipped.

When doing embedded development, your boss will only allow you a small window in which you can merge stuff upstream and benefit from it at the same time:
* after the prototype starts working
* before the code freeze happens
That period (in most cases I've seen, only a month or so) will be quickly over if you get push-back.
And then the madness of everyday work (bug hunts, etc) will draw you back inside your company.

Care to share your stats?

Posted Jun 11, 2010 21:00 UTC (Fri) by aliguori (guest, #30636) [Link]

I can promise you, there certainly is time-to-market pressure. And no publicly traded company can discuss products before they've been officially announced, so that does mean working with the community on a feature for a product that you can't talk about.

Care to share your stats?

Posted Jun 3, 2010 16:11 UTC (Thu) by anton (subscriber, #25547) [Link]

So I guess you are saying that "upstream first" costs more in opportunity costs (worse time-to-market) than releasing before it has been upstreamed costs in additional development time for maintenance and increased later upstreaming effort.

Upstream first policy

Posted Jun 3, 2010 11:35 UTC (Thu) by epa (subscriber, #39769) [Link] (10 responses)

Maintainability is much more important than functionality.

To whom? Not to the users. Who is the development process intended to benefit?

Upstream first policy

Posted Jun 3, 2010 12:56 UTC (Thu) by corbet (editor, #1) [Link] (6 responses)

Yes it's important to the users...unless you assume that all of those users want to be running something other than Linux in five years. Without a focus on maintainability you will shortly have a kernel which nobody wants to run.

Upstream first policy

Posted Jun 3, 2010 13:20 UTC (Thu) by michel (guest, #10186) [Link] (3 responses)

Not sure who you consider the user in this case. If it's Google (as the user/consumer of the kernel), I can agree with the comment. If it's a consumer using an Android-based phone, I think the vast majority of them could care less if it's Linux under the hood.

Upstream first policy

Posted Jun 3, 2010 13:42 UTC (Thu) by rvfh (guest, #31018) [Link] (1 responses)

Indeed, the user is Google in this case, just like they use Linux in many other places. If Google starts seeing a decline in Linux quality, they will either fork or switch. And to some extent, developing wakelocks behind closed doors could be considered a kind of fork (though on a small scale).

Upstream first policy

Posted Jun 3, 2010 14:41 UTC (Thu) by dgm (subscriber, #49227) [Link]

I bet it would be rather the opposite. If Google starts to see that Linux does not have the _functionality_ they want, they will fork or switch.

Why do you think they are _not_ using some of the BSDs?

Upstream first policy

Posted Jun 3, 2010 15:45 UTC (Thu) by iabervon (subscriber, #722) [Link]

Users actually care a whole lot about maintainability of the code if it affects the quality of the maintenance that gets done. They'll be unhappy if apps they've gotten for their phones start misbehaving when they either upgrade the OS or get a new phone. This comes down to the question of whether the APIs that the apps use can be maintained across changes to the underlying system, and has implications for whether your favorite third-party IM program drains your battery when you're idle online or alternatively stops exchanging audio if you don't touch anything during a voice chat.

If Google's using a design that hasn't passed muster, and they eventually switch to a better design, and the original API bitrots, that ends up impacting users, especially ones who have the idea that they can buy an Android phone with the expectation that any program that they come to like will keep working forever.

Upstream first policy

Posted Jun 3, 2010 14:56 UTC (Thu) by epa (subscriber, #39769) [Link] (1 responses)

Agreed, a focus on maintainability is important. But which is more maintainable? Merging existing, working, widely deployed code - or forcing developers like Google to stay out-of-tree for five years?

My point is that the fact that some code is already being used on millions of devices and works *now* should carry some weight, even in assessing future maintainability. (It's much more likely that little-used features will suffer code rot, no matter what their conceptual purity.) At the moment it appears to get no weight at all.

Upstream first policy

Posted Jun 3, 2010 16:54 UTC (Thu) by cry_regarder (subscriber, #50545) [Link]

Of course it got weight...tons of weight. If it hadn't we wouldn't be talking about this now.

Also, the "millions of devices" is a red herring. It is just a handful of different devices, all of the same class. The kernel developers need a solution that works for a vast range of devices over the long haul.

Upstream first policy

Posted Jun 3, 2010 13:36 UTC (Thu) by neilbrown (subscriber, #359) [Link] (2 responses)

> Who is the development process intended to benefit?

Make no mistake: the development process is intended to benefit the developers.

In the case of Linux, many of the developers are users first, and developers second (I certainly started that way), so as a consequence it ends up being focused on benefiting users too, which is nice.

Upstream first policy

Posted Jun 3, 2010 14:08 UTC (Thu) by faramir (subscriber, #2327) [Link] (1 responses)

> In the case of Linux, many of the developers are users first, and developers second (I certainly started that way), so as a consequence it ends up being focused on benefiting users too, which is nice.

Depending on how you define it, that should read "benefiting A FEW users".

Between Tivos, WRT54g routers, Android phones, some TVs, and a host of similar products; I suspect that the vast majority of users are not developers of any sort. In most cases, the manufacturers of those products discourage development as well (Android is obviously different).

As has already been stated elsewhere, these users usually neither know nor care that Linux is involved. That doesn't mean that kernel policy (to the extent it exists) should change. But let's be honest here, this is about certain kinds of developers, not users.

If one is a developer of an appliance type product, there would appear to be little reason to even subscribe to LKML let alone be involved in the development process. Your product life cycle is short and chances are that any significant kernel changes that you propose will either take too long or never get accepted.

Upstream first policy

Posted Jun 3, 2010 14:32 UTC (Thu) by corbet (editor, #1) [Link]

If you are the developer of one appliance-type product, then maybe you can ignore the process. However, the life cycle of such products tends not to be very long; soon you'll be developing another one. There comes a point where you can't drag that 2.4.x kernel forward any further; it just won't work on the hardware you're using. So you're stuck with trying to make your stuff work on something newer. And that will be painful.

I've consulted for companies like this. Had they worked with upstream and made sure the stuff they needed got there, they would have found it waiting for them when the time came to move to a newer kernel. Instead, they set themselves up for a bunch of high-intensity, short-deadline pain. That can be lucrative for kernel consultants, but it's not really a good way to run a company.

To me, treating the kernel as a throwaway resource doesn't make sense even for the most myopic of embedded systems developers. Unless they plan to go out of business soon, they will want a maintainable kernel five or ten years down the road, and they will want it to meet their particular needs. And that doesn't just happen by chance.

Upstream first policy

Posted Jun 3, 2010 16:03 UTC (Thu) by fuhchee (guest, #40059) [Link] (3 responses)

"We don't want the rival solution that works now, we want the rival solution that will still work in 5 years."

Considering how much of the kernel is regularly rewritten or deprecated, this policy appears to be selectively applied.

Upstream first policy

Posted Jun 3, 2010 17:14 UTC (Thu) by martinfick (subscriber, #4455) [Link]

Yes, selectively, but the point you likely missed is that it is applied with a strong focus on maintaining the kernel/userspace API.

Upstream first policy

Posted Jun 3, 2010 18:24 UTC (Thu) by foom (subscriber, #14868) [Link]

Well, 5 years isn't actually that long. Most parts of the kernel userspace API that get deprecated and removed lasted far longer than 5 years in their previous incarnation. :)

Upstream first policy

Posted Jun 14, 2010 23:10 UTC (Mon) by aigarius (subscriber, #7329) [Link]

"Considering how much of the kernel is regularly rewritten ..."

That's actually the whole point - if what you have in the kernel is a custom-made ABI-locked solution that is distributed to millions of devices and can never-ever change, then there can be no rewrite full or partial and the kernel stagnates.

There are from time to time changes in the kernel that require kernel developers to change things around. And they need a freedom to do this. Now and in 5 years time. That is why they insist on keeping out things that they will not be able to change later on, including strict ABIs and narrow use cases in the generic parts of the code.

Google already got the benefit from this code being open so they could add this feature, but here the question is how to balance the maintenance burden of the feature on one hand with the usefulness of this feature to other people. The suggestions in the LKML dealt with both sides - they reduced the maintenance burden by focusing the changes in fewer places and increased the usefulness of the feature by making it more generic.

If before the discussion the usefulness of the code (to people outside Google) was less than the added maintenance burden it put on the kernel developers, then after the new proposal is implemented its usefulness just might be higher than the burden.

Upstream first policy

Posted Jun 3, 2010 16:27 UTC (Thu) by bfields (subscriber, #19510) [Link] (1 responses)

Developers focused on "make it work so we can ship" are going to be less focused on maintainability (or at least, that is the way it appears).

There can also be maintainability risks from designs that look elegant/highly general/whatever but that haven't been tested in the field.

I'm not really arguing one side or the other. In practice I think the really hard stuff is hard to get right without working on both tracks (thinking through the design carefully, and testing it in real situations) in parallel.

Upstream first policy

Posted Jun 4, 2010 3:41 UTC (Fri) by neilbrown (subscriber, #359) [Link]

Yes, nothing is really black-or-white is it?

I actually think there is a place for saying that a given interface is *not* permanent. That seems to be the main sticking point here.

If it were just code, we could import it, tidy it up, and be happy. Maybe it would change completely over a few release cycles. But as there is an interface involved that not everyone agrees with, we are stuck waiting for "perfect".

If we could say "This interface is only guaranteed to work with this library" or in some cases "... with this program", then I feel there would be a lot more room for flexibility. I have a vague feeling that ALSA works like this, but I'm not certain.

We have well-understood infrastructures for versioning libraries, breaking old APIs, having multiple versions available and allowing old versions to be discarded selectively by distros. It would be great if the kernel interface could benefit the same way, and I think it should be possible to head that way.

Specifically, the nfsservctl syscall is probably totally unused these days, but it keeps a quantity of legacy code in the kernel which has to be maintained (though it is entirely possible that it is broken and nobody noticed).

Similarly the ioctls used for md/raid should go (though mdadm would need an update first - I haven't bothered because I "know" the ioctls have to stay) ... actually I now see that the sysfs interface I created to replace the ioctl interface is pretty horrible and really needs to be redone. If I could be sure that all users used mdadm ... or some library that I could create ... it would be a lot easier to deprecate old stuff.

Would that have helped with wakelocks? It is hard to be sure, but I think that it may well have done.

A suspend blockers post-mortem

Posted Jun 3, 2010 11:57 UTC (Thu) by jmspeex (guest, #51639) [Link] (1 responses)

I don't know what the best solution is, but the issue here is that the call for embedded developers to merge their stuff is really in contradiction with how suspend blockers were handled. This means one of the two should change, though I don't know which one. Either the developers accept this sort of functionality with the understanding that the API/ABI *will* be unstable (and may be removed when something else comes), or else you just tell developers that "submit your patches upstream" only really applies to drivers.

A suspend blockers post-mortem

Posted Jun 3, 2010 13:26 UTC (Thu) by tglx (subscriber, #31301) [Link]

> Either the developers accept this sort of functionality with the understanding that API/ABI *will* be unstable (and may be removed when something else comes) ...

That's the main problem. Once we have a user-space-visible ABI/API we cannot break it. We are stuck with it.

In-kernel APIs are known to be subject to change, but that's a totally different playground.

Often easier to be out of tree

Posted Jun 3, 2010 21:27 UTC (Thu) by tbird20d (subscriber, #1901) [Link]

Unfortunately, for much code in the embedded space it is simply less costly to maintain your code out of tree than to mainline it. That is, if you have good procedures for forward-porting your code, it's really not that hard to move it to new kernels. You run the risk of it being obsoleted by kernel churn, and you lose the benefits of peer review, etc. But sometimes the informed decision to just avoid LKML is (unfortunately) the right one.

Many industry developers underestimate the benefit of mainlining. However, my own experience is that many community developers underestimate the engineering cost of mainlining a piece of core code. What is easy for a seasoned community contributor is, in fact, quite daunting to the majority of Linux kernel developers.

A suspend blockers post-mortem

Posted Jun 6, 2010 22:41 UTC (Sun) by mikov (guest, #33179) [Link] (3 responses)

One thing that wasn't mentioned in the discussion so far is the financial burden to a smaller development house. Sure, Google can afford to have a few highly paid kernel developers spend months, if not years, just trying to push upstream something which already is working. That would be suicide for a small company.

If I might venture a guess (well, it is more than a guess), small embedded vendors don't seriously consider upstreaming their development. It is simply outside of the realm of financial possibility.

It doesn't help that having a driver upstream can be an additional hassle rather than a benefit. It can actually make it harder to deliver updates to customers. (If your driver is already upstream and you need to deliver an urgent fix, what do you do, especially if your customers are not kernel developers?)

I am not sure there is a better way to handle these problems, but at least they ought to be acknowledged.

A suspend blockers post-mortem

Posted Jun 7, 2010 17:53 UTC (Mon) by nlucas (subscriber, #33793) [Link]

Was thinking the same thing until I saw your post.

Kernel developers seem to live in a world of multinationals, forgetting that most of the economy runs through small and medium companies (especially in small countries, where a medium company would be a micro-company in the States).

git FTW

Posted Jun 7, 2010 23:06 UTC (Mon) by dmarti (subscriber, #11625) [Link] (1 responses)

The Coraid web site no longer has Ed Cashin's presentation on this subject -- "Unstable API Sense". He covers how to maintain a driver in git, and release both vendor versions and merge requests to upstream. Of course this is for a "leaf" driver, not a core feature, but the releases to customers problem is manageable.

git FTW

Posted Jun 8, 2010 18:59 UTC (Tue) by mikov (guest, #33179) [Link]

I couldn't find this presentation, but from your brief description it seems like it is missing the point. Git has nothing to do with this - as a convenient tool it simplifies just one trivial aspect of it - namely rebasing the sources and storing revisions.

The vendor still has to maintain and test multiple versions, some of which are not under its control (the mainline versions). The procedure for replacing a mainline driver with an updated vendor version at a customer site is a huge PITA. Worse, some customers have the mainline one, some the vendor one, and it is non-trivial to find out which (most people are not kernel experts). You need different procedures for different cases and so on. This makes both support and development more expensive.

In short, for anything that is not a truly mass market product, it turns out it is actually to the vendor's and customers' detriment to have the driver in the mainline.

On a similar note, I have always found the notion that a driver in the mainline is somehow "better maintained" very very strange. Nobody actually tests all the drivers in the kernel before each release. All that is done is making sure they can compile. How anybody can be satisfied with that is beyond me.