What You Can Learn from the Zune Leap Year Bug
For some background on this issue, you can read this post. Basically, a bug in a driver caused all Zune 30 GB models to hang on the last day of a leap year. The only way to recover was to let the battery die and restart the device on a later day.
Every time I see a technology problem in the news, I think:
- How could this have happened?
- Would this problem have happened if I were on the project?
- How can I use this problem to improve and benefit myself?
You should also think about these issues when you read the news.
Here are some of the lessons from the Zune problem:
- If a product or service has your name on it, you get the glory, but you also own the problems. The bug was in a driver that Microsoft did not write. But it's a Microsoft Zune, not a Freescale Zune. Freescale wrote the driver, but in consumers' minds, it's Microsoft that messed up.
- Code reviews and testing actually matter. Looking at this code raises all sorts of red flags. There are multiple magic numbers (365, 366), and each is used multiple times. And any time code has consecutive numbers, comparison tests, and a loop, you know you're looking at a recipe for disaster. Any reasonable review of this code would have suggested that it needs to be made clearer, and halfway decent testing would have found the problem, too.
- This code is non-trivial to get right. Reading Slashdot, Reddit, Zuneboards, and any number of other blogs shows many, many people suggesting ways to fix the code. And 90% of these "fixes" are incorrect! They either have the same hanging behavior, different hanging behavior, or produce an incorrect result. And these are from people who know there's a problem with the existing code. Can you imagine if they were trying to write the code from scratch?
- There is a very strong desire to optimize, even when it doesn't matter or is actually harmful. Several commenters suggested using lookup tables to compute the year and day of year, and some of these lookup tables were many dozens of kilobytes in size. This is a handheld device, where memory is at a premium, the CPU is ample, and the date is calculated rarely. Who cares how optimal this routine is in space or time? A similar tendency was shown by people who wanted to "simplify" the code and make it a one-liner. Why? The one-liners were all either wrong, confusing, or used floating point (often incorrectly). Really, this code just needs to work and be maintainable. Anything else is just introducing the potential for errors.
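To make the "clear and maintainable" point concrete, here is a minimal sketch of a day-count-to-year conversion with the magic numbers pulled into one helper. It is not the actual driver source. The function names, the 1980 origin year, and the 1-based day count are all assumptions made for illustration.

```python
import calendar
from typing import Tuple

ORIGIN_YEAR = 1980  # assumed epoch year, chosen only for this illustration


def days_in_year(year: int) -> int:
    """Single home for the 365/366 decision, so no magic numbers leak out."""
    return 366 if calendar.isleap(year) else 365


def year_and_day(days: int, origin_year: int = ORIGIN_YEAR) -> Tuple[int, int]:
    """Convert a 1-based day count (1 = Jan 1 of origin_year) into
    (year, day_of_year).

    Each pass through the loop either returns or strictly reduces `days`,
    so the loop cannot spin forever on any input.
    """
    if days < 1:
        raise ValueError("days must be >= 1")
    year = origin_year
    # One comparison per iteration, against one value: the length of
    # the year currently being considered.
    while days > days_in_year(year):
        days -= days_in_year(year)
        year += 1
    return year, days


if __name__ == "__main__":
    # Day 366 of a leap origin year is Dec 31 of that same year.
    print(year_and_day(366))
    # The next day rolls over to Jan 1 of the following year.
    print(year_and_day(367))
```

The design choice here is that the loop condition and the subtraction use the same value, so there is no branch in which the loop keeps running without making progress.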
So, what can we learn from this?
- Test everything. This includes libraries you use, drivers you rely on, and your own code. A client will never, ever accept that the problem was in code you use, not in code you wrote. And a good tester will understand the boundary conditions that need to be tested with date manipulation code. The last day of a leap year is always in the list of dates you need to check.
- Many of the (admittedly impromptu) optimizations were bad news. It's well known that you should get code working first, and optimize it last, if ever. As Donald Knuth said, "We should forget about small efficiencies, say about 97% of the time: premature optimization is the root of all evil." Only if a profiler pointed out this code as a bottleneck should anyone ever consider any criteria other than correctness and readability. The desire to optimize is very strong, and I find that I have to be ever vigilant in my own work to resist this temptation.
- For well-known algorithms, such as this one, search the web before you start writing code. No one should be writing date manipulation (or bit-twiddling, or net present value) algorithms from scratch.
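As a rough illustration of the first and last points together, the sketch below leans on an existing, well-tested date library instead of a hand-written loop, and then checks it against a small table of boundary dates. The epoch, the function names, and the 0-based day offset are assumptions for this example. The expected values in the table were worked out by hand.

```python
from datetime import date, timedelta

EPOCH = date(1980, 1, 1)  # assumed epoch, for illustration only


def year_and_day_of_year(days_since_epoch: int):
    """Let the standard library do the calendar arithmetic.

    days_since_epoch is 0-based: 0 means the epoch date itself.
    """
    d = EPOCH + timedelta(days=days_since_epoch)
    return d.year, d.timetuple().tm_yday


# Boundary cases a tester would want for date code:
# (calendar date, expected year, expected day of year).
BOUNDARY_CASES = [
    (date(2007, 12, 31), 2007, 365),  # last day of a normal year
    (date(2008, 2, 29), 2008, 60),    # leap day
    (date(2008, 3, 1), 2008, 61),     # day after leap day
    (date(2008, 12, 31), 2008, 366),  # last day of a leap year
    (date(2009, 1, 1), 2009, 1),      # first day after a leap year
    (date(2000, 12, 31), 2000, 366),  # century year that is a leap year
    (date(2100, 12, 31), 2100, 365),  # century year that is not
]


def check_boundaries() -> None:
    for when, want_year, want_day in BOUNDARY_CASES:
        offset = (when - EPOCH).days
        got = year_and_day_of_year(offset)
        assert got == (want_year, want_day), (when, got)


if __name__ == "__main__":
    check_boundaries()
    print("all boundary cases passed")
```

The same table could be pointed at any hand-written or third-party conversion routine. The last day of a leap year is only one row in it.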