Kernel bugs: out of control?

[Posted May 10, 2006 by corbet]

As has been widely reported, Andrew Morton recently told an audience at LinuxTag about his fears that the Linux kernel is getting buggier over time. That worry resonates with a number of users and developers, many of whom have never gotten entirely used to the 2.6 development model. The result of this discussion may be a long look at how the kernel is developed, culminating in a discussion at the annual Kernel Summit in Ottawa this July. Easy answers may be difficult to come by, however.

Even the core question - are more bugs being added to the kernel than are being fixed? - is not straightforward. Many developers have a sort of gut sense that the answer is "yes," but the issue is hard to quantify. There is no mechanism in place to track the number of kernel users, the number of known bugs, and when those bugs are fixed. Some information can be found in the kernel bug tracker run by OSDL, but acceptance of this tracker by kernel developers is far from universal, and only a subset of bugs are reported there. Distributors have their own bug trackers, but there is little flow of information between those trackers and the OSDL one; distributor trackers will also reflect problems (and fixes) in distributor patches which are not in the mainline kernel.

Dave Jones publishes statistics from the Fedora tracker, but it is hard to know what to make of them.

Part of the problem is that an increasing bug count does not, in itself, indicate that the kernel is getting worse. A kernel which is larger and more complex may have more bugs, even if the density of those bugs is going down - and the 2.6 kernel is growing quickly. Increased scrutiny will result in a higher level of reported bugs, but a lot of those bugs could be quite old. The recent Coverity scans, for example, revealed some longstanding bugs. If the user base is growing and becoming more diverse, more bugs will be reported in the same code, even if that code has not changed.
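The difference between bug count and bug density can be made concrete with a little arithmetic. The sketch below uses invented figures (the bug counts and line counts are assumptions for illustration only, not measurements of any real kernel) to show how the total can rise while the density falls.

```python
def bug_density(bug_count, lines_of_code):
    """Return bugs per thousand lines of code (KLOC)."""
    return bug_count / (lines_of_code / 1000)

# Hypothetical figures, invented for illustration; not real kernel data.
older = {"bugs": 1000, "loc": 5_000_000}
newer = {"bugs": 1200, "loc": 7_000_000}

older_density = bug_density(older["bugs"], older["loc"])
newer_density = bug_density(newer["bugs"], newer["loc"])

# The newer kernel has more bugs in total but fewer bugs per KLOC.
print(f"older: {older['bugs']} bugs, {older_density:.3f} per KLOC")
print(f"newer: {newer['bugs']} bugs, {newer_density:.3f} per KLOC")
```

With these made-up numbers, a raw bug count would suggest a decline in quality even though the code is, by density, slightly cleaner.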

Dustin Kirkland has taken a different approach. For each 2.6 kernel version, he performed a Google search for "linux 2.6.x", followed by searches for strings like "linux 2.6.x panic". The trouble reports were then normalized by the total number of results and plotted on a graph. Dustin's results show a relatively stable level of problem reports, with the number of problems dropping for the most recent kernel releases.
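The normalization step might look something like the following sketch. The version labels and hit counts are placeholders invented for illustration, and the function name is hypothetical; this is not Dustin's actual script or data, only the ratio the text describes.

```python
def normalized_reports(total_hits, problem_hits):
    """Fraction of search results for a version that mention a problem."""
    if total_hits <= 0:
        raise ValueError("total_hits must be positive")
    return problem_hits / total_hits

# Placeholder data: version -> (hits for "linux 2.6.x",
#                               hits for "linux 2.6.x panic").
# These numbers are invented, not real search results.
hits = {
    "2.6.a": (200_000, 4_000),
    "2.6.b": (50_000, 900),
}

rates = {version: normalized_reports(total, problems)
         for version, (total, problems) in hits.items()}

for version, rate in rates.items():
    print(f"{version}: {rate:.2%} of results mention a panic")
```

Dividing by the total number of results is what allows a heavily discussed release to be compared with a lightly discussed one; it does nothing to correct for how representative the search results are in the first place.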

Clearly, there are limits to the conclusions which can be drawn from these sorts of statistics. The results which show up in Google may not be representative of the real troubles afflicting Linux users, and the lower levels for recent kernels may simply reflect the fact that fewer people are using those kernels. But the fact that these results are as good as anything else available shows how little hard information is available.

Some other efforts are in the works to attempt to quantify the problem - stay tuned to LWN for information as it becomes available. In a way, however, whether the problem is getting worse is an irrelevant question. The simple fact is that there are more kernel bugs than anybody would like to see, and, importantly, many of these bugs are remaining unfixed for very long periods of time. So, regardless of whether the situation is getting worse, it seems worth asking (1) where the bugs are coming from, and (2) why they are not getting fixed.

The first question has no easy answer. It would be nice if somebody would look at bug fixes entering the kernel with an eye toward figuring out when the fixed bug was first introduced - and whether similar bugs might exist elsewhere. That would be a long and labor-intensive task, however, and nobody is doing it. In general, the kernel lacks a person whose time is dedicated to tracking (and understanding) bugs. At the 2005 Kernel Summit, Andrew Morton indicated that he would like to have a full-time bugmaster, but this person does not yet exist. If, somehow, such a position could be funded (it is hard to see as a long-term volunteer job), it could help with the tracking and understanding of bugs - and with ensuring that those bugs get fixed.

Why bugs do not get fixed might be a little easier to understand. Certainly part of the problem must be that it is more fun to develop cool new features than to track down obscure problems. The older development process - where, at times, new features would not even be merged into a development kernel for a year at a time - might have provided more motivation for bug fixing than the 2.6 process, where the merge window opens every month or two. But feature development cannot be the entire problem; most developers have enough pride and care about their work to want their code to work properly.

The kernel is a highly modular body of code with a large development community. Many (or even most) developers only understand a relatively small part of it. So it is easy for kernel developers to feel that the bulk of the outstanding bugs are "not their department" - somebody else's problem. But the person nominally responsible for a particular part of the code may be overwhelmed with other issues, unresponsive and difficult to deal with, or missing in action. Many parts of the kernel have no active maintainer at all. So problems in many kernel subsystems tend to get fixed slowly, if at all - especially in the absence of an irate and paying customer. For this reason, Andrew has encouraged kernel developers to branch out and address bugs outside of their normal areas. That is a hard sell, however.

Kernel bugs can be seriously hard to find and fix. The kernel must operate - on very intimate terms - with an unbelievable variety of hardware and software configurations. Many users stumble across problems that no developer or tester has ever encountered. Reproducing these problems can be impossible, especially if nobody with an interest in the area has the affected hardware. Tracking down many of these bugs can require long conversations where the developer asks the reporter to try different things and come back with the results. Developers often lack the patience for these exchanges, but, crucially, users often do as well. So a lot of these problems just fall by the wayside and are not fixed for a long time, if ever.

Bug prevention is an area with ongoing promise. Many of the most error-prone kernel interfaces have been fixed over the years, eliminating whole classes of problems, but more can be done. More formal regression tests could be a good thing, but (1) the kernel developers have, so far, not found a huge amount of value in the results from efforts like the Linux Test Project, and (2) no amount of regression testing can realistically be expected to find the hardware-related problems which are the root of so many kernel bugs. Static analysis offers a great deal of promise, but free tools like sparse still need quite a bit of work to realize that promise.

The end result is that, while there are ways in which the kernel process can be improved, there is a distinct lack of quick fixes in sight. Fixing kernel bugs is hard work, and the kernel maintainers lack the ability to order anybody to do that work. So, while the kernel community can be expected to come to grips with the problem - to the extent that there is a problem - the process of getting to a higher-quality kernel could take some time.

Original article, with many reader comments: http://lwn.net/Articles/183053/

linux lover

Monday, January 5, 2009