WHS: A Series of Unfortunate Events
by Ryan Smith on March 13, 2008 12:00 AM EST
Posted in: Ryan's Ramblings
When it rains, it pours, and sometimes you get hit by lightning too, which will really ruin your day.
Since very late last year, Microsoft has been facing an issue with Windows Home Server where under certain conditions files on a server’s shares could become corrupt. The severity of the situation is pretty immense and the situation straightforward: nothing should be getting corrupt on a file server, otherwise it’s a pretty useless file server. Since the initial report Microsoft has been attempting to reproduce the issue in order to fix it, and finally this week they have announced that they have fully identified the problem, its causes, and what needs to be done to fix it.
It just about couldn’t get any worse.
Before we jump too far ahead, it’s probably important to quickly go over what technology makes Windows Home Server unique, as this relates directly to the culprit. As we discussed in our WHS preview, WHS uses a new Microsoft technology called Drive Extender that sits on top of the file system. Drive Extender is responsible for pooling together all the hard drives in a server to function as one contiguous drive, and at the same time provide redundancy through automatic duplication. In practice it’s JBOD + RAID 1 implemented above the file system rather than below it as in a real RAID implementation.
When a file is transferred to a WHS share (and the server has more than 1 hard drive), the Drive Extender software is responsible for relocating it and, if it’s marked for duplication, making sure there are an appropriate number of distributed copies of the file. How this is done gets rather gritty, but for now we’ll leave the explanation at this: it’s a rather ingenious implementation of Microsoft’s shadow file technology, along with the sparse file ability of NTFS, manipulated into using sparse files as a way to implement symbolic links on a file system that doesn’t (or rather didn’t at the time) support them.
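To make the "JBOD + RAID 1 above the file system" idea concrete, here is a toy sketch of pooling with per-file duplication. It is purely illustrative: the `Drive` and `Pool` classes and the "most free space first" placement rule are our own assumptions, not Microsoft's actual Drive Extender logic, which has not been published in detail.

```python
from dataclasses import dataclass, field


@dataclass
class Drive:
    """A hypothetical physical disk in the pool (sizes in arbitrary units)."""
    name: str
    capacity: int
    files: dict = field(default_factory=dict)  # filename -> size

    @property
    def free(self):
        return self.capacity - sum(self.files.values())


class Pool:
    """Toy model: several drives presented as one volume."""

    def __init__(self, drives):
        self.drives = list(drives)

    @property
    def capacity(self):
        # The pool looks like a single drive as large as all disks combined.
        return sum(d.capacity for d in self.drives)

    def store(self, filename, size, duplicate=False):
        # Assumed placement rule: prefer the drives with the most free space.
        copies = 2 if duplicate else 1
        candidates = sorted(
            (d for d in self.drives if d.free >= size),
            key=lambda d: d.free,
            reverse=True,
        )
        if len(candidates) < copies:
            raise OSError("not enough drives with free space")
        chosen = candidates[:copies]
        for drive in chosen:
            # Each copy lands on a different physical disk.
            drive.files[filename] = size
        return [d.name for d in chosen]


if __name__ == "__main__":
    pool = Pool([Drive("disk0", 100), Drive("disk1", 200)])
    print(pool.capacity)
    print(pool.store("photo.jpg", 50, duplicate=True))
    print(pool.store("movie.avi", 120))
```

Note that duplication here is decided per file rather than per disk, which is the practical difference from a conventional RAID 1 mirror sitting below the file system.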
With that out of the way, why is the situation so grim? We’ve been talking with our sources trying to find out what exactly is going on, with some success. What follows is a mixture of things we know for sure, along with some educated guessing based on what sources will tell us; no one has so far been willing to fully explain the technical details of the problem nor completely confirm our guesses, but with the information we have we believe our guesses to be correct.
We do know for sure that the problem is in the Drive Extender software, which would make sense given that the problem is unique to WHS. What this means is that the corruption issues only extend to situations where the Drive Extender software is used: file shares on a system with more than 1 hard drive. Client backups are unaffected in all cases, and systems with only 1 drive are unaffected because Drive Extender is not used. Furthermore the issue is only with writes and not reads, so reading data is safe.
Finally, the problem is only with so-called “incremental” file writes, that is, rewriting part of an existing file or appending additional data to it (aka file edits); full file writes are unaffected. This is why in Microsoft’s own notes on the matter, the only applications that trigger the corruption issue are applications that explicitly and frequently use incremental writes; Outlook, for example, uses a flat-file database to store mail, and Microsoft’s photo gallery application rewrites the metadata of a file in certain situations. This is great news because it means anything that does a full file write is unaffected, such as copying a file to a share or any number of applications (e.g. notepad) that simply replace a file when saving rather than rewriting any part of it. The situation could have been worse for Microsoft if it affected full file writes too.
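For readers who want to see the distinction in code, the following is a minimal sketch of the two write styles using only the Python standard library. The function names and the `mail.db` file are hypothetical, and this says nothing about how any particular application (Outlook included) actually performs its I/O; it only shows what "replace the whole file" versus "edit the file in place" looks like at the file API level.

```python
import os
import tempfile


def full_rewrite(path, data):
    """Full file write: build a complete new copy, then swap it into place."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "wb") as f:
        f.write(data)
    os.replace(tmp_path, path)  # the old file is replaced wholesale


def incremental_write(path, offset, data):
    """Incremental write: overwrite part of the existing file in place."""
    with open(path, "r+b") as f:
        f.seek(offset)
        f.write(data)


def append_write(path, data):
    """Incremental write: append to the end of the existing file."""
    with open(path, "ab") as f:
        f.write(data)


def demo():
    with tempfile.TemporaryDirectory() as d:
        path = os.path.join(d, "mail.db")  # hypothetical flat-file database
        full_rewrite(path, b"AAAABBBBCCCC")
        incremental_write(path, 4, b"xxxx")
        append_write(path, b"DDDD")
        with open(path, "rb") as f:
            return f.read()


if __name__ == "__main__":
    print(demo())  # b'AAAAxxxxCCCCDDDD'
```

Only the last two functions touch a file that already exists on the share, which is the pattern the corruption issue is tied to.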
One thing we have been trying to ascertain is how many users are affected by this corruption bug, with some success. A number of WHS-powered servers that have been sold are the lower-end models with only a single drive, automatically disqualifying/saving them from having issues. Far fewer servers use multiple drives, and then there are an unknown number of OEM copies of the OS sold to enthusiasts with unknown configurations. On systems that are susceptible to the issue, Microsoft seems confident that only a handful of them have or ever will experience the corruption issue, and based on the flaw that causes it this seems a safe bet. Our best guess is that among historic blunders the percentage of users affected is similar to that of the Pentium FDIV bug; far too many to be comfortable but few enough that most people will likely never notice the issue.
So if the estimated number of victims is so low and the problem confined to file edits, why do we still say the situation couldn’t get any worse? Because the flaw is not a bug in the code; it’s a fundamental flaw in the algorithms used by Drive Extender. We have with particular interest been trying to track down a precise explanation of what causes the corruption problem, and this is where the guessing starts to come in.
The lynchpin in the problem is the Drive Extender technology, the DEmigrator process in particular. Something is going wrong when an incremental write is being made to a file that DEmigrator is already writing to, resulting in garbage being written instead. At first we suspected it was an issue with how WHS handles file locks, but upon further review this seems unlikely. Instead there appears to be some kind of race condition with the file write itself, in which both writes somehow make it to the OS write cache at the same time and the resulting bad cache is subsequently written to disk. This explanation would fully account for both the low incidence of the problem (DEmigrator already needs to have the victim file open for writing) and Microsoft’s troubles identifying the problem.
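To illustrate the general shape of such a race, here is a deliberately simplified, deterministic simulation. It is not a reconstruction of the actual Drive Extender flaw, which we can only guess at; the "background copier" below is a hypothetical stand-in for a process like DEmigrator, and the interleaving is scripted by hand rather than produced by real threads so that the outcome is repeatable.

```python
def run_interleaved():
    """Two writers touch the same file with no coordination."""
    stored = bytearray(b"AAAABBBBCCCC")  # the file contents "on disk"

    # Step 1: the background copier snapshots the file it is working on.
    snapshot = bytes(stored)

    # Step 2: a client performs an incremental write at offset 4.
    stored[4:8] = b"xxxx"

    # Step 3: the copier writes back part of its now-stale snapshot.
    stored[6:12] = snapshot[6:12]

    return bytes(stored)


def run_serialized():
    """The same work, but the copier finishes before the client edit starts."""
    stored = bytearray(b"AAAABBBBCCCC")

    snapshot = bytes(stored)
    stored[6:12] = snapshot[6:12]  # copier completes first (a no-op here)

    stored[4:8] = b"xxxx"          # client edit then applies cleanly

    return bytes(stored)


if __name__ == "__main__":
    print(run_interleaved())  # b'AAAAxxBBCCCC': neither old nor new contents
    print(run_serialized())   # b'AAAAxxxxCCCC': the intended result
```

The interleaved run ends with a file that matches neither the original nor the edited version, and it only happens when the two operations overlap in exactly the wrong way, which is why bugs of this kind are both rare in the field and hard to reproduce in a lab.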
The worst part of this however is that this race condition is a fundamental flaw in the Drive Extender software; it’s not simply a bug where someone typed the wrong thing in for a line of code. Microsoft does not completely explain how their Drive Extender technology works so we do not have any inner details on what these algorithms are, but a race condition is particularly worrisome because race conditions are among the hardest issues in computer science to account for and correct. To put things in perspective, we have the following analogy:
If you’re familiar with the concept of sorting in CompSci then you should be familiar with the notion of sort stability. If data has previously been sorted by another field, a new sort using a stable algorithm will maintain that order among items that compare equal; an unstable sort will not. Ultimately if you are using an unstable sort you cannot just modify it to be stable; you have to replace it with a stable sort. This is the kind of issue Microsoft is facing.
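For anyone who hasn't run into sort stability before, this short sketch shows the difference. Python's built-in `sorted` is guaranteed to be stable; the `selection_sort` function is a textbook swap-based selection sort written here only as an example of an unstable algorithm, and the employee records are made up.

```python
def selection_sort(items, key):
    """Swap-based selection sort: correct ordering by key, but not stable."""
    out = list(items)
    for i in range(len(out)):
        smallest = i
        for j in range(i + 1, len(out)):
            if key(out[j]) < key(out[smallest]):
                smallest = j
        # The long-distance swap is what can reorder equal-keyed items.
        out[i], out[smallest] = out[smallest], out[i]
    return out


def by_dept(record):
    return record[0]


# (department, name) records, already sorted by name.
records = [("ops", "Ann"), ("dev", "Bob"), ("ops", "Cy"), ("dev", "Dee")]

stable = sorted(records, key=by_dept)
unstable = selection_sort(records, key=by_dept)

if __name__ == "__main__":
    print(stable)    # names stay alphabetical within each department
    print(unstable)  # "ops" now lists Cy before Ann
```

Both results are correctly ordered by department, but only the stable sort keeps the earlier ordering by name. There is no small tweak that fixes the selection sort's swap; the algorithm itself has to be swapped out.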
The faulty algorithm in the Drive Extender software is effectively an unstable sort. The WHS development team has to completely rewrite part of the Drive Extender software to use new, safe algorithms that will not suffer from this assumed race condition. This is what makes the situation so grim: there is little worse than having to completely rewrite a piece of shipping software to fix a bug, both for issues of morale and time.
This kind of rewrite takes time to accomplish and the QA process takes even more time. Microsoft has estimated that the fix will not be ready for another 3 months (June), and while it’s unlikely for this to take less time, it can certainly take more if the QA process finds more problems. What little good news there is here is that the dev team has already identified a possible fix and has started testing it, so it is in fact possible to correct the issue and the dev team has a good enough understanding of the issue to create a fix.
Until that fix arrives however, this puts WHS in a perilous position. As a v1 product breaking new ground WHS already has plenty of challenges. The media will eat this up (and we’re just as guilty) and this will tarnish the product’s name for the rest of its life; customers don’t need to understand an issue to understand that a product is imperfect and that they should stay away from it. Yet data corruption is a serious issue that isn’t acceptable and can’t be ignored.
Perhaps the worst bit however is that as an OEM-only product, Microsoft is not exerting any real control over what the OEMs do about the issue until the corruption problem is fixed. As of right now retailers are still selling OEM servers with 2+ drives (making them susceptible to the bug) and computer enthusiast retailers are still selling the OS itself, all with no notice about this bug. WHS is a good product where plenty of functionality can still be used even with the presence of the bug (e.g. backups), but we have serious problems with it still being offered for sale given these problems. WHS is already heavily tarnished due to this bug; there’s no (okay, some) shame in cutting one’s losses and halting all sales of the OS until the bug is fixed, even if it won’t affect most users.
Ultimately it’s a damn shame to see something like this happen; no one is going to be a winner. Windows Home Server will be fixed, but only after a lot of grief for the developers and a lot of concern for server owners. Thankfully current server owners can take steps to prevent the corruption issue entirely, but at a cost of functionality, and we don’t doubt some people will still feel insecure about their data even after taking those steps. For the time being WHS is dead in the water; it’s a promising product that is not suitable for further sale given the potential severity of the bug. The episode also undermines a great deal of confidence in Microsoft that will take some time to recover.
Finally this also brings into further question just how long of a shelf life WHS will have anyhow. It’s a poorly kept secret that Microsoft is already working on the next version based on the Longhorn kernel (Drive Extender practically begs for prioritized I/O) and we have always expected it to come fairly soon once Windows Server 2008 was completed. WHS v1 may be done for entirely, even once Microsoft fixes the corruption issue and continues/resumes selling it. If by the time the issue is fixed we’re looking at less than a year before the next WHS, it may not be worth buying WHS v1 at all. Then again, Microsoft is undoubtedly carrying much of the WHS technology and software forward for the next version, so this bug could very well push WHS v2 back if it was being carried forward too.