Saturday, May 2, 2015 - 14:54

Integer Overflow puts Dreamliner's power generators to sleep

Under real conditions the problem doesn't seem to be as drastic as it sounds, since it requires the plane to be powered continuously for a full 8 months. On the other hand, it raises questions about the quality of software development at Boeing, or in the industry in general.

According to an FAA report, the 787 can lose all power due to a failure in the generator control units, which switch to failsafe mode after about 248 days of continuous operation. There's more than one generator on the plane, but if they are all started at the same time, which sounds more than likely, they will all fail at the same time.

So what's happening here? The FAA report just mentions an internal counter. But 248 days is a very suspicious number for a mistake. It is the signature of a signed 32-bit integer overflow in a timer function, which is sadly a fairly common problem.

Why is that?

A common tick interval for such applications is 10 ms. If bad code increments a signed 32-bit counter every 10 ms, it runs out of positive integer space after 2^31 ticks, which is 248 days and about 13 hours. At that point the counter wraps around to a negative value. That is a pretty obvious problem and I'm fairly certain this is what is happening here. If for some unknown reason your software fucks up after 248 days and you use some sort of counter, it's most likely exactly this problem.
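The 248-day figure is easy to check. The following is a minimal sketch, assuming a signed 32-bit counter incremented once every 10 ms. The FAA report does not state the counter type or the tick rate, so both are guesses, and the names TICK_MS, ms_until_overflow, days_until_overflow and extra_hours_until_overflow are made up for illustration.

```c
#include <stdint.h>

/* Assumed tick period in milliseconds (not stated in the FAA report). */
#define TICK_MS 10LL

#define MS_PER_HOUR (60LL * 60 * 1000)
#define MS_PER_DAY  (24LL * MS_PER_HOUR)

/* Milliseconds until a signed 32-bit tick counter that starts at 0
   has counted 2^31 ticks and can no longer stay positive. */
int64_t ms_until_overflow(void)
{
    int64_t ticks = (int64_t)INT32_MAX + 1;  /* 2^31 ticks */
    return ticks * TICK_MS;
}

/* Whole days before the counter overflows. */
int64_t days_until_overflow(void)
{
    return ms_until_overflow() / MS_PER_DAY;
}

/* Whole hours left over after the full days. */
int64_t extra_hours_until_overflow(void)
{
    return (ms_until_overflow() % MS_PER_DAY) / MS_PER_HOUR;
}
```

Under these assumptions days_until_overflow() gives 248 and extra_hours_until_overflow() gives 13.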

Considering how obvious this problem is, it's astounding how often it happens, including in code that should be thoroughly tested.

In critical environments you need to keep an extra eye on potential integer overflows. Everything that grows over time is a prime candidate, and in this particular case it is a very well known and persistent one. There are two simple solutions: a 64-bit counter, or a comparison that still gives the right answer when the counter wraps around. Just assuming that your application will never run long enough to overflow is reckless in such an environment. But that's clearly not what happened here. Otherwise airlines would already be well aware that they are not supposed to keep the machine powered for that long, and the warning would be pointless.
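Both options can be sketched in a few lines. This is only an illustration, not what the generator control units actually run. The names ticks64_t and timeout_expired are hypothetical, and the 10 ms tick is the same assumption as before.

```c
#include <stdbool.h>
#include <stdint.h>

/* Option 1: a 64-bit tick counter. At an assumed 10 ms per tick it
   takes billions of years to run out of range. */
typedef uint64_t ticks64_t;

/* Option 2: keep a 32-bit counter, but make it unsigned and only ever
   compare elapsed time, never absolute deadlines. Unsigned subtraction
   is defined to wrap modulo 2^32, so (now - start) is the true elapsed
   tick count even if the counter wrapped between start and now, as
   long as the real elapsed time is below 2^32 ticks. */
bool timeout_expired(uint32_t now, uint32_t start, uint32_t timeout)
{
    return (uint32_t)(now - start) >= timeout;
}
```

For example, with start = 0xFFFFFFFB (five ticks before the wrap) and a timeout of 10 ticks, timeout_expired() is still false at now = 4 and becomes true at now = 5, exactly ten ticks after the start.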

Bugs like this are fairly easy to identify if you run a solid testing regime. In fact the potential problem with such code should already be obvious to any coder implementing it. Sadly it's not. Such functions are not exotic code. They are fairly common, and for some reason they are the kind of code everyone assumes just works. They are particularly dangerous because of their simplicity.

if (A > B + X) {...} What could possibly go wrong with that, eh?
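Here is a hedged sketch of what can go wrong, assuming A is the current tick count, B is the tick at which a timer was started and X is the timeout, all signed 32-bit. Real signed overflow is undefined behavior in C, so the sketch imitates the usual two's-complement wraparound through unsigned arithmetic. The helper names wrapping_add_i32 and naive_elapsed are made up for this example.

```c
#include <stdbool.h>
#include <stdint.h>

/* Imitates two's-complement wraparound of a signed 32-bit addition.
   Overflowing a signed int directly is undefined behavior, so the sum
   is done in unsigned arithmetic and converted back (the conversion is
   implementation-defined before C23, two's complement in practice). */
int32_t wrapping_add_i32(int32_t a, int32_t b)
{
    return (int32_t)((uint32_t)a + (uint32_t)b);
}

/* The naive check from the text: A > B + X. */
bool naive_elapsed(int32_t a, int32_t b, int32_t x)
{
    return a > wrapping_add_i32(b, x);
}
```

Two failure modes follow from this. If the timer starts just before the wrap, say B = INT32_MAX - 5 with X = 10, then B + X wraps to a large negative number and naive_elapsed(INT32_MAX - 3, INT32_MAX - 5, 10) is true after only two ticks. Once the counter itself has wrapped to INT32_MIN, naive_elapsed(INT32_MIN, 100, 10) stays false for a very long time even though the deadline passed months ago.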

Well. With bad timing (pun intended) your plane will crash and everyone will die. Otherwise...probably not much.

As I already said, I don't think this is a likely scenario in this particular case. But since the 787 was riddled with software problems earlier on, airlines used to keep them powered for longer than necessary. The general risk is rather obvious: if keeping them powered prevents nasty behavior, you might be inclined to keep them powered for as long as possible. I still doubt 8 months is feasible, but with every day you're one day closer to a catastrophic failure.