A tale of two data-corruption bugs [LWN.net]

By Jonathan Corbet
May 24, 2015

There have been two bugs causing filesystem corruption in the news recently. One of them, a bug in ext4, has gotten the bulk of the attention, despite the fact that it is an old bug that is hard to trigger. The other, however, is recent and able to cause data loss on filesystems installed on a RAID 0 array. Both are interesting examples of how things can go wrong, and, thus, merit a closer look.

Extent confusion in ext4

Like many reasonably modern filesystems, ext4 includes a number of performance-enhancing features. One of those is delayed allocation, wherein the filesystem will not immediately allocate specific blocks on the disk for data that an application has just written. By delaying that allocation, the filesystem gives itself some time to see if more data will be written in the near future. If so, space for all of the written data can be allocated contiguously, improving future I/O performance. Delayed allocation works, but it does leave some written data (the "delayed extent") in a sort of limbo state for a brief period where it has no fixed home on the disk. Obviously, when the time comes to flush that data out to persistent storage, the task of allocating the destination blocks can be delayed no further.

Another performance feature is unwritten extents. An application can increase the size of a file with system calls like truncate() or fallocate(). These calls do not actually write data to any new extents added to the file. Allowing anybody to read blocks that have not been written to is clearly a bad idea; at best, the result will be garbage, while, at worst, another user's sensitive information could be disclosed. The filesystem could avoid this problem by writing zeroes to all new blocks once they are allocated, but that's an inefficient use of CPU time and I/O bandwidth, given that the blocks are ordinarily going to be overwritten with real data in the near future. The alternative is to mark the new blocks as being explicitly unwritten until that real data comes along. Attempts to read unwritten blocks will just result in a zero-filled buffer.

Ext4 keeps track of both delayed and unwritten extents in a data structure called the "extent status tree". Something interesting happens, though, if an unwritten extent is added when there are already delayed allocation blocks in the same block range. The entire unwritten extent ends up being marked as delayed as well, because the extent status tree can't track the fact that only part of the extent is delayed.

For example, consider a file that is currently 100 blocks long — blocks 0-99 are written and present on disk. The application writes blocks 100 and 101; the filesystem responds by putting them into the extent status tree as delayed-allocation blocks. The application then uses fallocate() to tell the filesystem to allocate blocks 100-109. Those unwritten blocks also go into the tree, but, since there are already two delayed blocks in that range, the entire range 100-109 is marked delayed as well.

A delayed extent is removed from the tree when the delayed buffers are actually written to disk. But, in this case, there are no delayed buffers for blocks 102-109; as a result, the extent remains as a delayed extent in the tree, even after the actual delayed portion (blocks 100 and 101) has been allocated and written out. There it will stay until another write to one of the affected blocks comes along. At that point, the entire block will be reallocated (because it is still marked as delayed), losing the previously delayed data that had already been written. That is about the point where alcohol consumption by both administrators and users increases unhealthily.

This bug has been present in the ext4 filesystem for some time; nobody seems to be quite sure when it was introduced. It has remained undetected because it is quite hard to hit; the process, as described by Ted Ts'o, is:

It requires the combination of (a) writing to a portion of a file that was not previously allocated using buffered I/O, (b) an fallocate of a region of the file which is a superset of region written in (a) before it has chance to be written to disk, (c) waiting for the file data in (a) to be written out to disk (either via fsync or via the writeback daemons), and then (d) before the extent status cache gets pushed out of memory, another random write to a portion of the file covered by (a) -- in which case that specific portion of (a) could be replaced by all zeros.

Nonetheless, corruption bugs are bad news. This one has been fixed by this patch from Lukas Czerner which was merged for 4.1-rc2. The fix also found its way into the 4.0.3, 3.18.14, 3.14.42, and 3.10.78 stable updates.

Discard discrepancies

About the time the above bug was being fixed, some users started reporting problems with RAID 0 arrays based on ext4; many assumed that they had been a victim of that bug. The truth of the matter was somewhat worse than that, though; they had found a nastier, easier-to-trigger bug that was the result of an overly hasty fix.

Back in April, Joe Landman reported a problem with RAID 0 volumes on the XFS filesystem. After some back-and-forth, Neil Brown tracked it down to a change merged for the 3.14 release. The code in question calculates the number of sectors that fit in the next RAID 0 chunk — in other words, the portion of the I/O request that maps to a single underlying drive. In simplified form, this calculation looks like:

    unsigned sectors = chunk_sects - sector_div(sector, chunk_sects);

The call to sector_div() returns the remainder of the division. What it also does, though, may be surprising: it replaces the value of sector with sector/chunk_sects. In other words, sector_div() is a macro that modifies one of its arguments. The code did not take that modification into account, with the result that it used the wrong value of sector from then on. Neil's fix was to simply reinitialize sector from the bio structure describing the operation in question.

There was only one little problem: by then the bio pointer had been advanced, so the new sector value was wrong. When the RAID 0 code then proceeded to map the sector in the RAID device to a sector in one of the underlying physical devices, it would end up in the wrong place — likely as not, on the wrong device entirely. In practice, the code path that goes wrong in this way would be executed relatively rarely; it requires an I/O operation that crosses multiple RAID 0 chunks. So it is perhaps not entirely surprising that it seems to manifest itself most often with "discard" requests, which can be applied to an entire file at once.

Discard is a mechanism for telling a storage device that a particular range of blocks no longer contains needed data. Its use can improve performance on solid-state drives, which can benefit from the knowledge that a certain range of blocks does not need to be preserved during wear-leveling operations. If told to do so, the ext4 filesystem will generate DISCARD requests when a file is deleted, and the RAID 0 code will pass those requests down to the underlying drives. When this particular bug hits, though, the discard request will go to the wrong place, resulting in the discarding of some random, unrelated data.

This unfortunate "fix" was merged into the mainline during the 4.1 merge window. If it had stayed with the 4.1 kernel, its impact would have been limited; 4.1 has still not seen an official release. But this patch also went into the 4.0.2, 3.19.8, 3.18.14, and 3.14.41 stable updates. The fix to the fix (written by Eric Work) has been pulled into the mainline kernel, but has not, as of this writing, found its way into any stable updates. One assumes that will happen soon, but it is worth noting that 3.19.8 is the end of the 3.19 stable series, so there may be no updated kernel for 3.19 users.

The good news is that the problem was caught reasonably quickly; there should not be huge numbers of users who have updated to one of the affected kernels. The bad news is that, for those users who are affected, there could be silent data corruption that will not be discovered for some time. Anybody who is running one of the affected kernels will — after moving to a safe kernel, of course — want to check the contents of their RAID 0 arrays against a backup.

Keeping data safely is one of the fundamental obligations of an operating system kernel, so data-corruption bugs can shake one's confidence in the whole structure. But, as has been seen here, bugs happen. Sometimes they lurk for years without causing trouble, and sometimes they make their presence known quickly. Such bugs are, fortunately, rare with Linux; with luck, these are the last we will see for a while.

Index entries for this article
Kernel	Filesystems/ext4
Kernel	RAID

to post comments

A tale of two data-corruption bugs

Posted May 24, 2015 14:29 UTC (Sun) by josh (subscriber, #17465) [Link] (15 responses)

> unsigned sectors = chunk_sects - sector_div(sector, chunk_sects);

> The call to sector_div() returns the remainder of the division. What it also does, though, may be surprising: it replaces the value of sector with sector/chunk_sects. In other words, sector_div() is a macro that modifies one of its arguments.

I'm surprised nobody has posted a patch yet to turn sector_div into a static inline and get rid of this in-place modification. It's easy enough to write something like sector = sector_div(sector, chunk_sects); if that's really what you want. And it's completely reasonable that someone wouldn't take in-place modification of a non-pointer argument into account, by an apparent function that isn't even named in all-caps to say "HEY, I'M A MACRO!"

A tale of two data-corruption bugs

Posted May 24, 2015 20:03 UTC (Sun) by xtifr (guest, #143) [Link] (4 responses)

Changing the behavior of an existing name is fraught with danger. You have to inspect every use of the name to see if it relies on the behavior being changed. Which might not sound that bad until you realize that this has to be done separately, not only for each branch, but for third-party forks as well. In other words, you've really gone beyond patch territory at that point, and are into "update procedure". Not a happy place to be.

A more reasonable approach might be to create an new static inline with a new name (e.g. sector_divide), and then either keep the current name for a macro that calls the inline and then does the extra assignment, or add a new macro with an all-caps name, and make the current name into a macro that uses #error to inform people that their code needs to be updated.

(The second approach causes more short-term pain, but less long-term. It does involve an update procedure, but one that can't be accidentally overlooked, which is the biggest danger with the naive approach.)

A tale of two data-corruption bugs

Posted May 25, 2015 8:34 UTC (Mon) by epa (subscriber, #39769) [Link]

Right, so you change the name and the behaviour at the same time. Then it will be obvious if there is any leftover code using the old name and expecting the old semantics.

A tale of two data-corruption bugs

Posted May 25, 2015 17:09 UTC (Mon) by josh (subscriber, #17465) [Link] (2 responses)

> Which might not sound that bad until you realize that this has to be done separately, not only for each branch, but for third-party forks as well.

Nope. This is exactly why the Linux kernel has a stated policy of not caring about out-of-tree drivers. Just change the macro to a function, convert all callers in the kernel, and delete the old macro.

> A more reasonable approach might be to create an new static inline with a new name (e.g. sector_divide)

Agreed, changing the name makes it easier to detect callers and avoid reintroducing new callers with that expectation (such as from patches in flight).

> and then either keep the current name for a macro that calls the inline and then does the extra assignment, or add a new macro with an all-caps name, and make the current name into a macro that uses #error to inform people that their code needs to be updated.

No macro needs to exist, nor does the kernel keep around compatibility interfaces that nothing in the kernel uses. Also, macros can't call #error, but there's a simpler solution to produce a compiler error for code that needs updating: just delete the macro, and attempted callers will get a compile time error saying sector_div does not exist.

A tale of two data-corruption bugs

Posted May 25, 2015 20:50 UTC (Mon) by jzbiciak (guest, #5246) [Link] (1 responses)

I guess for a function like this, which effectively returns two values (in this case, sector_div is a specialized divmod function), you'd replace it with a static inline that requires you to take the address of the left argument? ie. something like this?

/* Compute bar / baz and bar % baz.  Put bar / baz into bar, and return bar % baz. */
foo = divmod( &bar, baz );

With inlining, the reference/dereference will get optimized away, most certainly, and the required & would make the in-out characteristic of bar more obvious.

I have to admit, the in-out nature of sector_div would not have been apparent to me. I sat up in my chair and shoot my head when I read the bug description in the article.

A tale of two data-corruption bugs

Posted May 25, 2015 21:49 UTC (Mon) by jzbiciak (guest, #5246) [Link]

I have to admit, the in-out nature of sector_div would not have been apparent to me. I sat up in my chair and shoot my head when I read the bug description in the article.

err... I meant to say: I have to admit, the in-out nature of the first parameter to sector_div would not have been apparent to me. I sat up in my chair and shook my head when I read the bug description in the article.

A tale of two data-corruption bugs

Posted May 25, 2015 1:07 UTC (Mon) by cesarb (subscriber, #6266) [Link] (9 responses)

I believe it's a "div/mod" kind of macro, that is, it returns *both* the division result and its remainder. Since in C you have only one return value, the other return value has to go into one of the arguments. IIRC, it's also a *very old* macro, from way before the 2.0 days, perhaps even from before the 1.0 days.

IMHO the best way would be to keep it as a macro (with a different name of course), but with *no* return value (that is, make it a "do {...} while(0)" kind of statement), and with *no* "in/out" arguments: each argument should be either an input or an output (which means 4 arguments). That should make it move up in Rusty's classic API quality scale (http://ozlabs.org/~rusty/index.cgi/tech/2008-03-30.html and http://ozlabs.org/~rusty/index.cgi/tech/2008-04-01.html).

Fixing sector_div

Posted May 25, 2015 12:57 UTC (Mon) by jreiser (subscriber, #11027) [Link] (8 responses)

The best fix from the viewpoint of safety and reliability is two new static functions sector_quo and sector_rem, along with changing the definition of macro sector_div so that its expansion causes a compile-time error which refers the programmer to sector_quo and sector_rem.

Even then, we are not done yet. There must be an audit to detect and evaluate all similar instances: any macro with arguments that is not equivalent to the obvious C-language function of those same arguments. There are valid uses of such macros, but they must be prominently documented because of the high risk of confusion with functions. Among other checks: correct use of parentheses, brackets and braces "()[]{}"; no use of any argument as an lvalue; no use of any identifier which is not in scope at time of definition. Automate the detection phase of the audit using help from your favorite shell, editor and compiler.

Fixing sector_div

Posted May 25, 2015 14:06 UTC (Mon) by cesarb (subscriber, #6266) [Link] (7 responses)

> The best fix from the viewpoint of safety and reliability is two new static functions sector_quo and sector_rem, along with changing the definition of macro sector_div so that its expansion causes a compile-time error which refers the programmer to sector_quo and sector_rem.

The whole point of that macro is to avoid doing the same division twice. Replacing it with a single macro with a different calling convention would be acceptable, replacing it with two separate operations would be a harder sell.

> There must be an audit to detect and evaluate all similar instances: any macro with arguments that is not equivalent to the obvious C-language function of those same arguments.

They're quite common in the kernel; the spin_lock_irqsave/irqrestore() pair is a popular example, and nobody ever gets confused by it. The problem is not "pass by reference" arguments; the problem is using the same argument for both an input and an output. In the spin_lock example, the argument to the spin_lock_irqsave() macro is always an output, and the argument to the spin_lock_irqrestore() macro is always an input.

Fixing sector_div

Posted May 25, 2015 14:50 UTC (Mon) by epa (subscriber, #39769) [Link] (5 responses)

I find it hard to believe that integer division and modulo operations could ever be a bottleneck (unless perhaps in a hashing function or cryptography). The performance loss from doing them separately must be imperceptible. I'd love to find out if I am wrong, however.

Fixing sector_div

Posted May 25, 2015 22:37 UTC (Mon) by cesarb (subscriber, #6266) [Link] (4 responses)

> I find it hard to believe that integer division and modulo operations could ever be a bottleneck (unless perhaps in a hashing function or cryptography).

Linux was developed on a 386. A quick search shows that the DIV instruction took 38 cycles. At a generous 16 MHz, that's 2.375 μs. Since it's part of the disk processing, it might have been used in an interrupt, where latency is important.

When connected to a fast modem, the CPU might have only around 3800 cycles to process each character. 38 cycles is 1% of that. Every cycle counts.

> The performance loss from doing them separately must be imperceptible.

Perhaps today in desktop CPUs with fast dividers. Definitely not the case back when the do_div macro was probably designed as a direct wrapper to the DIV instruction.

Fixing sector_div

Posted May 26, 2015 7:00 UTC (Tue) by epa (subscriber, #39769) [Link] (3 responses)

Back in the 386 days hard disks were far slower too, so I still doubt that it could have made a practical difference to anything. Though as you say if used in an interrupt things become more interesting. But I was talking about current systems.

Fixing sector_div

Posted May 26, 2015 17:03 UTC (Tue) by ncm (guest, #165) [Link] (2 responses)

Even in those dark days, compilers were smart enough to recognize a pair of divide and modulus operations on the same arguments and combine them. There never was an excuse for this macro as written. Putting it in lower case to conceal its nature was doubly wrong.

Probably instead of leaving delayed-write blocks in limbo, they should be referred to scratch blocks and then re-targeted when the time comes to sync them out. In a pinch, the actual scratch blocks can take the data, reliably.

Fixing sector_div

Posted May 26, 2015 19:32 UTC (Tue) by bcopeland (subscriber, #51750) [Link] (1 responses)

> There never was an excuse for this macro as written.

My guess, without having looked at history, is that sector_div() is that way because of do_div(), but even it has been recognized as a trap for the unwary going back quite a while (see e.g.: http://lkml.iu.edu/hypermail/linux/kernel/0308.0/0617.html)

Fixing sector_div

Posted May 26, 2015 23:42 UTC (Tue) by cesarb (subscriber, #6266) [Link]

> Not very many people should use "do_div()" directly (and a quick grep shows that not very many people do). It's generally a mistake to do so, I suspect. The thing was originally written explicitly for "printk()" and nothing else.

That might explain why it's that way. The printk() family of functions is used in a lot of places.

And yeah, sector_div() is do_div() if CONFIG_LBDAF (large block devices) is set, otherwise it's a normal rem/div pair.

Fixing sector_div

Posted May 25, 2015 17:01 UTC (Mon) by josh (subscriber, #17465) [Link]

> The whole point of that macro is to avoid doing the same division twice. Replacing it with a single macro with a different calling convention would be acceptable, replacing it with two separate operations would be a harder sell.

Compilers are smart enough to notice and coalesce div and mod on the same operands and implement them with a single instruction on architectures with such an instruction.

A tale of two data-corruption bugs

Posted May 24, 2015 20:55 UTC (Sun) by ivuk (guest, #102574) [Link] (2 responses)

Regarding the 3.19 kernel, hasn't Canonical picked up its maintenance? I'd expect them to include the patch in their future releases.

A tale of two data-corruption bugs

Posted May 24, 2015 22:34 UTC (Sun) by tonyblackwell (guest, #43641) [Link]

Mageia pushed the patches a couple of days ago, for 3.19.8 (in M5) and 3.14 (M4)

A tale of two data-corruption bugs

Posted May 25, 2015 7:28 UTC (Mon) by tajyrink (subscriber, #2750) [Link]

Yes, it seems to be in the queue for the next ckt update: http://kernel.ubuntu.com/git/ubuntu/linux.git/commit/?h=l...

A tale of two data-corruption bugs

Posted May 25, 2015 0:54 UTC (Mon) by iabervon (subscriber, #722) [Link]

That is about the point where alcohol consumption by both administrators and users increases unhealthily.

To be precise, the alcohol isn't actually consumed until someone tries to read the blocks and finds that they don't have the right data. The alcohol consumption does become necessary at the time when the data is lost, however, but it is delayed...

A tale of two data-corruption bugs

Posted May 25, 2015 6:41 UTC (Mon) by marcH (subscriber, #57642) [Link] (1 responses)

> It has remained undetected because it is quite hard to hit;

I'm curious: was it possible to add a corresponding test case anyway?

More generally I'd really like to hear about how filesystems get tested. They're possibly the most critical kernel feature. For instance: does test code live mostly in user space or kernel? Are there many unit tests? etc.

A tale of two data-corruption bugs

Posted May 25, 2015 11:57 UTC (Mon) by jan.kara (subscriber, #59161) [Link]

So filesystems are usually tested using xfstests these days. That is a (quite big) collection of various tests (in userspace). Test for this particular bug has been added (as test generic/086) shortly after the kernel bug was fixed to make sure we don't reintroduce it :). See https://git.kernel.org/cgit/fs/xfs/xfstests-dev.git/commi...

A tale of two data-corruption bugs

Posted May 26, 2015 10:39 UTC (Tue) by alankila (guest, #47141) [Link] (7 responses)

Would probably be good idea to go through the kernel source tree now and replace every lowercase macro name with uppercase version.

A tale of two data-corruption bugs

Posted May 26, 2015 12:57 UTC (Tue) by JdGordy (subscriber, #70103) [Link] (6 responses)

Switching to uppercase wouldn't solve anything in this case....

A tale of two data-corruption bugs

Posted May 26, 2015 15:41 UTC (Tue) by iabervon (subscriber, #722) [Link]

If the macro name had been uppercase from the start, the original code probably wouldn't have had the bug of changing sector. And the way to avoid that bug that the author chose would probably not have been to reload it, since the author was more likely at that time to know that bio had changed than Neil Brown was recently.

It's also easier to catch the second problem in review with it being new code than as a change to old code.

A tale of two data-corruption bugs

Posted May 27, 2015 20:05 UTC (Wed) by alankila (guest, #47141) [Link] (4 responses)

Macros can do strange things, such as appear to manipulate values which are not passed as a pointer. Syntactically, it is better if they are different from functions.

A tale of two data-corruption bugs

Posted May 27, 2015 23:00 UTC (Wed) by cesarb (subscriber, #6266) [Link] (1 responses)

> Syntactically, it is better if they are different from functions.

Interestingly, the Rust language does precisely that: macros always have a ! suffix (like format!(...)).

But I agree with the parent comment: making it obvious that it's a macro (all-uppercase, special suffix, etc) would have made no difference. If a programmer sees macro called SECTOR_DIV, receiving a sector number and a sector size, what would the programmer think? "It's a macro which does whatever magic is necessary to divide a sector number by a sector size." The programmer would not expect it to overwrite the sector size passed in! The kernel is full of function-like macros (or functions which are macros in some kernel configurations), no matter how they're named people will expect them to behave like normal functions.

Now suppose the macro had one extra parameter, like "... = SECTOR_DIV(sector, sector, sector_size)". The programmer would question, "why is sector passed twice? Clearly one of them must be an output, or some other arcane trickery; I must look into its definition," and the bug wouldn't happen.

A tale of two data-corruption bugs

Posted May 28, 2015 0:24 UTC (Thu) by alankila (guest, #47141) [Link]

I can't believe that I have to argue such an obvious point.

I assert that chances of mistakes like these are almost definitely increased when macros look like function calls. If sector_div had actually been a function, it couldn't have altered the input argument given the way it was called. Because it looks like a function, it dupes programmers with a false expectation. To crystallize, things that are different in behavior should also look different. Therefore, having upper-case named macros is prudent (especially when they do macroy stuff that is impossible for functions to do) and it would increase the likelihood that someone checks the implementation because a macro can really do very strange things that break syntactic expectations. You don't know about a C macro where exactly the argument is expanded, or how many times it will be evaluated.

Anyway, macros are evil and should die, especially macros for which inline functions are a perfectly acceptable substitute.

A tale of two data-corruption bugs

Posted Jun 7, 2015 14:17 UTC (Sun) by parcs (guest, #71985) [Link] (1 responses)

Thankfully C++ makes this strange behavior a feature of normal functions.

A tale of two data-corruption bugs

Posted Jun 8, 2015 13:22 UTC (Mon) by zlynx (guest, #2285) [Link]

Which means that as a C++ programmer you realize that you always need to check the function declaration. Always.

It isn't a bad idea to do it for C either. Sure you can see that you are using the & to create a pointer and pass it to a function. What do you do for the second level function that passes that pointer to a third level? No visual indicator there at all.

A tale of two data-corruption bugs

Posted May 28, 2015 23:28 UTC (Thu) by lacos (guest, #70616) [Link] (13 responses)

I've just made three stunning discoveries.

(1) In all of the above discussion about "how this divmod interface should look like", noone mentions that the standard C library provides div(), ldiv(), lldiv(), and imaxdiv(), all of which return small *structures*. Obviously the kernel doesn't use the standard C library, but for the same functionality, the interface should be the same, preferably.

(2) So we have three commits here:

* 20d0189b1012 -- original commit. Introduces the original bug. Seven people on CC, no Acked-by or Reviewed-by tags. Whoever the maintainer was just pulled / applied the patch without reviews from the community.

* 47d68979cc96 -- first fix. Replaces the original bug with another bug. No Acked-by or Reviewed-by tags.

* a81157768a00 -- fix to the fix. Supposed to solve the most recent bug. Surprise: no Acked-by or Reviewed-by tags. Notice a pattern?

The QEMU community has a policy that *no patch is applied* until it gets at least one Reviewed-by. Edk2 (EFI Development Kit II) follows the same policy.

(3) Icing on the cake: the article states that the latest fix was written by Eric Work. Indeed the "author" metadata field on a81157768a00 carries his name. But, that's not enough. The primary / first author should always be named by the first Signed-off-by tag in the commit message. Reviewers and/or the maintainer should have immediately pointed out that Eric's S-o-b was missing from his patch email.

A tale of two data-corruption bugs

Posted May 29, 2015 0:12 UTC (Fri) by neilbrown (subscriber, #359) [Link] (11 responses)

Your point about the lack of review is well made. But it is not a new point, nor one that applies only to this case.
I would love more review. I don't know how to get it.

However your final issue about a missing s-o-b from Eric is something I can say something useful about.
Signed-off-by is primarily a statement about copyright ownership. It is tightly tied to the "Developers Certificiate of Origin"
When I added my Signed-off-by, I was attesting that I was satisfied that the person I received the patch from had the right to give it to me under the terms of the GPL, and was doing so. I was satisfied because I could see the history of how he developed it and his general attitude to how it should be used was clear. Also it was a trivial patch with no creative content and hence nothing that was copyrightable. I would have happily accepted a s-o-b from Eric, but it seemed pointless to ask for one. When someone has done me a favour with useful testing and analysis and created a patch, sending it back to have some 'i's crossed and 't's dotted seems impolite.

But maybe that is why I have trouble getting people to review my code - I'm too polite :-)

A tale of two data-corruption bugs

Posted May 29, 2015 0:50 UTC (Fri) by lacos (guest, #70616) [Link] (2 responses)

Thanks for your prompt answer!

I think for such bugs it should be possible to "forcefully elicit" some reviews -- you could say you wouldn't apply the patch (hence, you'd delay the fix for the issue) until someone gave a Reviewed-by. Faking / rushing a Reviewed-by is usually a greater psychological obstacle for pepole than ignoring a patch, so I think getting a Reviewed-by would prove at least *some* level of attention to the patch.

I do realize that lack of reviews is a general plight.

Regarding the S-o-b -- I fully agree that when a patch goes through a maintainer's tree / pull req it should get a new S-o-b from that maintainer. But that S-o-b tag should come under the original author's own, first, S-o-b.

Also, I believe this issue could be handled without a full patch round-trip. I think I'd just ask the submitter publicly, in the thread of his patch, if he was okay with me supplementing his missing S-o-b, in my tree. When confirmed (publicly), just rebase the patch locally and add the tag, similarly to any Reviewed-by or Acked-by tags posted by reviewers.

Regarding copyright -- I prefer to err on the safe side. "No creative content and hence nothing that was copyrightable" is not a statement that I'd ever like to have to defend. :) Especially if the submitter provided the patch as a result of work-for-hire (ie. from a company email address, more or less). I do realize this was likely not the case here.

Thanks!

A tale of two data-corruption bugs

Posted May 29, 2015 1:24 UTC (Fri) by dlang (guest, #313) [Link] (1 responses)

demanding that the first author approve patches to the code just isn't going to work. The first author may not be around any longer.

A tale of two data-corruption bugs

Posted May 29, 2015 11:24 UTC (Fri) by lacos (guest, #70616) [Link]

I think you may have misunderstood my point about the original author's S-o-b. I'm not proposing that the original author review all further patches that modify his original code; that would just be futile. What I'm proposing is: if a contributor posts a patch without his own S-o-b, then the maintainer with jurisdiction should immediately point out that fact, and either ask for a resubmission, or (if the patch is otherwise okay) ask for permission to add the S-o-b to the commit on the maintainer side.

In this regard, lack of the S-o-b in the original patch is just a bug, like any other bug that can be present in a patch. The contributor is definitely around during the week or so after posting his patch; how would he expect to react to any review comments otherwise? A maintainer or reviewer saying "thanks for your patch, but it's missing your S-o-b, please submit a corrected version, or authorize me to supplement it in my tree" is no different from a maintainer or reviewer saying "thanks for your patch, but it introduces an integer overflow, please submit a corrected version".

A tale of two data-corruption bugs

Posted May 29, 2015 9:22 UTC (Fri) by johannbg (guest, #65743) [Link] (7 responses)

"I would love more review. I don't know how to get it."

With training an reviewer/coders with the associated cost ( of time ) of doing so.

People in opensource projects have a tendency to skip this step ( for one reason or another like they dont like to hand hold people etc ) then complain about lack of reviews, followed by company's that now are depended on that projects code either to hire *experienced* developers to review/code ( and those dont grow on trees ) and or fund that procedure collectively.

A tale of two data-corruption bugs

Posted May 30, 2015 0:08 UTC (Sat) by neilbrown (subscriber, #359) [Link] (6 responses)

> With training an reviewer/coders with the associated cost ( of time ) of doing so.

I have invested in a number of people over the years, and it has been rewarding in the short term. In the longer term they seem to disappear, either vanish from the community or get caught up in more interesting projects with related technologies. I can't say I blame them.

The people who do hang around are the "users" rather than "developers" - there are a number of people on linux-raid who routinely help people out with various configuration problems so I don't need to do much of that any more - and I am extremely grateful (should probably say so more often). But they aren't kernel developers and I don't expect them to review code.

Maybe we do need to pay someone competent to become interested, but I'd rather paying someone who was interested to become competent...

A tale of two data-corruption bugs

Posted Jun 1, 2015 13:39 UTC (Mon) by johannbg (guest, #65743) [Link] (5 responses)

Here's where upstream attitude and presentation matters.

On top of that keeping people engaged and interested with hardware depended projects is tricky so I'm not surprised if you are working in that field that you have lost quite few members from that community in the long run since you need to continuously to be supplying those committed with the relevant hardware for the times and that project, for that to work.

And I agree paying someone who was interested, engaged and part of the community to become competent is the better choice than the alternative which has negative effects on existing participants of the community.

A tale of two data-corruption bugs

Posted Jun 3, 2015 15:15 UTC (Wed) by nix (subscriber, #2304) [Link] (2 responses)

If upstream attitude helped, Neil would have dozens of helpers (it is pretty uncontroversial there are few more helpful and all-around pleasant people in the community). It's not enough :(

A tale of two data-corruption bugs

Posted Jun 3, 2015 17:00 UTC (Wed) by johannbg (guest, #65743) [Link] (1 responses)

Upstream attitude matters and I'm pretty sure if it has not been as good as it has been it would have fewer participants but it's not the only thing that matters when dealing with hardware depended communities since it's users need to poses or have access to ( and the ability to test as well ) the required hardware necessary to be able to participate.

Once participating individual stops owning or otherwise have access to the relevant hardware for his participation, his participation gradually starts to decline and he's gone in 6 months to an year.

Maintaining a healthy community full of capable participants dedicating their free time to help out is alot of work due to constant struggle with the competition, evolution or both.

It's truly said when you see upstream communities with developers that either ignore and or take those resources for granted.

A tale of two data-corruption bugs

Posted Jun 3, 2015 20:14 UTC (Wed) by nix (subscriber, #2304) [Link]

The required hardware? None. Well, a storage device. (You can test using multiple sparse loopback devices...)

This is not hardware RAID, after all. It's software! Anyone with more than one disk and *without* expensive hardware will probably be looking at md.

A tale of two data-corruption bugs

Posted Jun 4, 2015 18:12 UTC (Thu) by ksandstr (guest, #60862) [Link] (1 responses)

> Here's where upstream attitude and presentation matters.

It might. I've not seen the evidence, only heard the stories of such evidence existing -- though the storyteller rarely claims to have laid eyes on it for themselves. They're just very sure, somehow; and anyway they've heard the story quite often.

On the other hand, I've got anecdotal evidence that suggests there's currently a smaller pool of potential contributors to Linux for the reason that the GNU/Linux and Free software scene is apparently replete with and/or defenseless with regard to the worst kind of militant progressive activist. Given choice between a loose association that's so naïvely accepting of vile lunatic filth[0], and eight hours per day of one that lies somewhere between oppressively corporate, frothy-mouthed and cult-like (startup companies, Apple), and just a humdrum trade below the line of equitability (everything else), it's easy to pick one where the individual developer's mortgage doesn't hang on the appetites of various extremist pundits and their compliant political officers.

[0] all in the name of "decency", of course. what else would it be?

parse: bailing out at line ~#?

Posted Jun 4, 2015 18:52 UTC (Thu) by sdalley (subscriber, #18550) [Link]

Golly. You sound as if you could do with a lie down.

A tale of two data-corruption bugs

Posted Jun 1, 2015 12:38 UTC (Mon) by pm215 (subscriber, #98099) [Link]

"The QEMU community has a policy that *no patch is applied* until it gets at least one Reviewed-by."

That's overstating things somewhat. We would like all patches to get a reviewed-by, but that's an ideal, not a flat requirement. If you look through QEMU's git history it's easy to find patches which didn't get a reviewed-by tag.