WHS: A Series of Unfortunate Events
Date: March 13th, 2008
Author: Ryan Smith
 
 

When it rains, it pours, and sometimes you get hit by lightning too, which will really ruin your day.

Since very late last year, Microsoft has been facing an issue with Windows Home Server where under certain conditions files on a server’s shares could become corrupt. The severity of the situation is pretty immense and the situation straightforward: nothing should be getting corrupt on a file server, otherwise it’s a pretty useless file server. Since the initial report Microsoft has been attempting to reproduce the issue in order to fix it, and finally this week they have announced that they have fully identified the problem, its causes, and what needs to be done to fix it.

It just about couldn’t get any worse.

Before we jump too far ahead, it’s probably important to quickly go over what technology makes Windows Home Server unique, as this relates directly to the culprit. As we discussed in our WHS preview, WHS uses a new Microsoft technology called Drive Extender that sits on top of the file system. Drive Extender is responsible for pooling together all the hard drives in a server to function as one contiguous drive, and at the same time provide redundancy through automatic duplication. In practice it’s JBOD + RAID 1 implemented above the file system rather than below it as in a real RAID implementation.
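To make the "JBOD + RAID 1 above the file system" idea concrete, here is a toy Python sketch of a pool that presents several drives as one and keeps two copies of anything marked for duplication. It is only an illustration: the class name, the placement rule (most free space first), and the file names are our assumptions, not how Drive Extender is actually implemented.

```python
class ToyPool:
    """Toy model of a file-level storage pool (hypothetical, not WHS code)."""

    def __init__(self, drive_sizes):
        # Each "drive" is just a capacity and a dict of file name -> bytes.
        self.drives = [{"capacity": size, "files": {}} for size in drive_sizes]

    def free_space(self, drive):
        used = sum(len(data) for data in drive["files"].values())
        return drive["capacity"] - used

    def store(self, name, data, duplicate=False):
        """Place one copy, or two copies on different drives if duplicated."""
        copies = 2 if duplicate else 1
        # Assumed placement rule: prefer the drives with the most free space.
        candidates = sorted(self.drives, key=self.free_space, reverse=True)
        candidates = [d for d in candidates if self.free_space(d) >= len(data)]
        if len(candidates) < copies:
            raise OSError("not enough drives with free space for " + name)
        for drive in candidates[:copies]:
            drive["files"][name] = data

    def locations(self, name):
        """Indexes of the drives that hold a copy of the file."""
        return [i for i, d in enumerate(self.drives) if name in d["files"]]

    def read(self, name):
        """The caller sees one namespace; any surviving copy will do."""
        for drive in self.drives:
            if name in drive["files"]:
                return drive["files"][name]
        raise FileNotFoundError(name)


pool = ToyPool([100, 100])
pool.store("photo.jpg", b"x" * 40, duplicate=True)  # lands on both drives
pool.store("movie.avi", b"y" * 50)                  # lands on one drive only
```

The point of the sketch is the layering: duplication here is a per-file decision made above the file system, whereas a conventional RAID 1 array mirrors every block underneath it.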

When a file is transferred to a WHS share (and the server has more than 1 hard drive), the Drive Extender software is responsible for relocating it and, if it’s marked for duplication, making sure there are an appropriate number of distributed copies of the file. How this is done gets rather gritty, but for now we’ll leave the explanation at this: it’s a rather ingenious implementation of Microsoft’s shadow file technology, along with the sparse file ability of NTFS, manipulated into using sparse files as a way to implement symbolic links on a file system that doesn’t (or rather didn’t at the time) support them.

With that out of the way, why is the situation so grim? We’ve been talking with our sources trying to find out what exactly is going on, with some success. What follows is a mixture of things we know for sure, along with some educated guessing based on what sources will tell us; no one has so far been willing to fully explain the technical details of the problem nor completely confirm our guesses, but with the information we have we believe our guesses to be correct.

We do know for sure that the problem is in the Drive Extender software, which would make sense given that the problem is unique to WHS. What this means is that the corruption issues only extend to situations where the Drive Extender software is used: file shares on a system with more than 1 hard drive. Client backups are unaffected in all cases, and systems with only 1 drive are unaffected because Drive Extender is not used. Furthermore the issue is only with writes and not reads, so reading data is safe.

Finally, the problem is only with so-called “incremental” file writes, that is rewriting part of an existing file or appending additional data to it (aka file edits); full file writes are also unaffected. This is why in Microsoft’s own notes on the matter, the only applications that trigger the corruption issue are applications that explicitly and frequently use incremental writes; Outlook for example uses a flat-file database to store mail and Microsoft’s photo gallery application rewrites the metadata of a file in certain situations. This is great news because it means anything that does a full file write is unaffected, such as copying a file to a share or any number of applications (e.g. Notepad) that simply replace a file when saving rather than rewriting any part of it. The situation could have been worse for Microsoft if it affected full file writes too.
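For readers who want the distinction spelled out, here is a minimal Python sketch of the two write patterns. The helper names and the file name are ours for illustration; nothing here is taken from WHS, Outlook, or any other Microsoft application.

```python
import os
import tempfile


def full_file_write(path, data):
    """Full file write: the file is (re)created and written start to finish."""
    with open(path, "wb") as f:
        f.write(data)


def incremental_write(path, offset, patch):
    """Incremental write: overwrite part of an existing file in place."""
    with open(path, "r+b") as f:
        f.seek(offset)
        f.write(patch)


def append_write(path, extra):
    """Appending also counts as incremental: existing bytes are left alone."""
    with open(path, "ab") as f:
        f.write(extra)


workdir = tempfile.mkdtemp()
demo_path = os.path.join(workdir, "mailbox.db")  # hypothetical file name

full_file_write(demo_path, b"AAAAAAAAAA")   # e.g. copying a file to a share
incremental_write(demo_path, 3, b"xyz")     # e.g. updating a database record
append_write(demo_path, b"!!")              # e.g. adding a new record

with open(demo_path, "rb") as f:
    result = f.read()
```

Only the second and third patterns touch a file that already exists on the share, which is the situation described as risky above.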

One thing we have been trying to ascertain is how many users are affected by this corruption bug, with some success. A number of WHS-powered servers that have been sold are the lower-end models with only a single drive, automatically disqualifying/saving them from having issues. Far fewer servers use multiple drives, and then there are an unknown number of OEM copies of the OS sold to enthusiasts with unknown configurations. On systems that are susceptible to the issue, Microsoft seems confident that only a handful of them have or ever will experience the corruption issue, and based on the flaw that causes it this seems a safe bet. Our best guess is that among historic blunders the percentage of users affected is similar to that of the Pentium FDIV bug; far too many to be comfortable but few enough that most people will likely never notice the issue.

So if the estimated number of victims is so low and the problem confined to file edits, why do we still say the situation couldn’t get any worse? Because the flaw is not a bug in the code, it’s a fundamental flaw in the algorithms used by Drive Extender. We have with particular interest been trying to track down a precise explanation of what causes the corruption problem and this is where the guessing starts to come in.

The lynchpin in the problem is the Drive Extender technology, the DEmigrator process in particular. Something is going wrong when an incremental write is being made to a file that DEmigrator is already writing to, resulting in garbage being written instead. At first we suspected it was an issue with how WHS handles file locks, but upon further review this seems unlikely. Instead there appears to be some kind of race condition with the file write itself, one that somehow results in both writes making it to the OS write cache simultaneously, after which the bad cache is written to disk. This explanation would fully account for the low incidence of the problem (DEmigrator already needs to have the victim file open for writing) and Microsoft’s troubles identifying the problem.
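Since the exact mechanism has not been made public, we can only illustrate the general class of problem. The following Python sketch shows a generic "lost update" race between two writers that each read a file, change their own region, and write the whole thing back. The function names and data are hypothetical, and the interleaving is forced by hand so the outcome is repeatable; it is not a reconstruction of what DEmigrator does.

```python
def read_modify_write(store, offset, patch):
    """One writer, split into two steps so the interleaving can be controlled.

    The generator pauses after reading, letting another writer run before
    this one writes its result back.
    """
    snapshot = bytearray(store["data"])            # step 1: read the file
    yield
    snapshot[offset:offset + len(patch)] = patch   # step 2: apply the edit
    store["data"] = bytes(snapshot)                # and write everything back


def run_interleaved(writers):
    """The race: every writer reads before any writer writes back."""
    for writer in writers:
        next(writer)
    for writer in writers:
        for _ in writer:
            pass


def run_serialized(writers):
    """What a lock would enforce: each writer finishes before the next starts."""
    for writer in writers:
        for _ in writer:
            pass


racy = {"data": b"0000000000"}
run_interleaved([
    read_modify_write(racy, 0, b"AAA"),
    read_modify_write(racy, 5, b"BBB"),
])

safe = {"data": b"0000000000"}
run_serialized([
    read_modify_write(safe, 0, b"AAA"),
    read_modify_write(safe, 5, b"BBB"),
])
```

In the interleaved run the second writer works from a stale copy and silently discards the first writer's edit. In real software the interleaving depends on timing, which is why bugs of this kind can be so hard to reproduce.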

The worst part of this however is that this race condition is a fundamental flaw in the Drive Extender software; it’s not simply a bug where someone typed the wrong thing in for a line of code. Microsoft does not completely explain how their Drive Extender technology works so we do not have any inner details on what these algorithms are, but a race condition is particularly worrisome because race conditions are among the hardest issues in computer science to account for and correct. To put things in perspective, we were given the following analogy:

If you’re familiar with the concept of sorting in CompSci then you should be familiar with the notion of sort stability. If data has previously been sorted by another field, a new sort with a stable sort will maintain that order among records that tie on the new field; an unstable sort will not. Ultimately if you are using an unstable sort you cannot just modify it to be stable; you have to replace it with a stable sort. This is the kind of issue Microsoft is facing.
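For anyone who has not run into sort stability before, a small Python example shows the difference. Python's built-in sorted() is documented as stable; the selection sort is a textbook unstable algorithm we wrote for contrast, and the records are made up.

```python
def selection_sort(items, key):
    """Textbook selection sort; its long-distance swaps make it unstable."""
    out = list(items)
    for i in range(len(out)):
        smallest = i
        for j in range(i + 1, len(out)):
            if key(out[j]) < key(out[smallest]):
                smallest = j
        out[i], out[smallest] = out[smallest], out[i]
    return out


def dept(record):
    return record[1]


# Records already sorted by name; now we sort them by department.
by_name = [("Alice", "B"), ("Bob", "B"), ("Carol", "A")]

stable_result = sorted(by_name, key=dept)            # keeps Alice before Bob
unstable_result = selection_sort(by_name, key=dept)  # puts Bob before Alice
```

No small tweak to the selection sort above makes it stable while keeping its swap-based design, which is the spirit of the analogy: the fix is a different algorithm, not a patched line.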

The faulty algorithm in the Drive Extender software is effectively an unstable sort. The WHS development team has to completely rewrite part of the Drive Extender software to use new, safe algorithms that will not suffer from this assumed race condition. This is what makes the situation so grim: there is little worse than having to completely rewrite a piece of shipping software to fix a bug, both for issues of morale and time.

This kind of rewrite takes time to accomplish and the QA process takes even more time. Microsoft has estimated that the fix will not be ready for another 3 months (June) and while it’s unlikely for this to take less time it can certainly take more if the QA process finds more problems. What little good news there is here is that the dev team has already identified a possible fix and has started testing it, so it is in fact possible to correct the issue and the dev team has a good enough understanding of the issue to create a fix.

Until that fix arrives however, this puts WHS in a perilous position. As a v1 product breaking new ground WHS already has plenty of challenges. The media will eat this up (and we’re just as guilty) and this will tarnish the product’s name for the rest of its life; customers don’t need to understand an issue to understand that a product is imperfect and that they should stay away from it. Yet data corruption is a serious issue that isn’t acceptable and can’t be ignored.

Perhaps the worst bit however is that as an OEM-only product, Microsoft is not exerting any real control over what the OEMs do about the issue until the corruption problem is fixed. As of right now retailers are still selling OEM servers with 2+ drives (making them susceptible to the bug) and computer enthusiast retailers are still selling the OS itself, all with no notice about this bug. WHS is a good product where plenty of functionality can still be used even with the presence of the bug (e.g. backups) but we have serious problems with it still being offered for sale given these problems. WHS is already heavily tarnished due to this bug; there’s no (okay, some) shame in cutting one’s losses and halting all sales of the OS until the bug is fixed, even if it won’t affect most users.

Ultimately it’s a damn shame to see something like this happen; no one is going to be a winner. Windows Home Server will be fixed, but only after a lot of grief for the developers and a lot of concern for server owners. Thankfully current server owners can take steps to prevent the corruption issue entirely, but at a cost of functionality, and we don’t doubt some people will still feel insecure about their data even after taking those steps. For the time being WHS is dead in the water: it’s a promising product that is not suitable for further sale given the potential severity of the bug. It also undermines a great deal of confidence in Microsoft that will take some time to recover.

Finally this also brings into further question just how long a shelf life WHS will have anyhow. It’s a poorly kept secret that Microsoft is already working on the next version based on the Longhorn kernel (Drive Extender practically begs for prioritized I/O) and we have always expected it to come fairly soon once Windows Server 2008 was completed. WHS v1 may be done for entirely, even once Microsoft fixes the corruption issue and continues/resumes selling it. If by the time the issue is fixed we’re looking at less than a year before the next WHS, it may not be worth buying WHS v1 at all. Then again, Microsoft is undoubtedly carrying much of the WHS technology and software forward for the next version; this bug could very well push WHS v2 back if the bug is being carried forward too.


17 Comments
In Depth by Desslok, 617 days ago
Thanks for the update.

Hoping by djc208, 616 days ago
I really like the WHS concept, and like you am sorry to see this problem tarnishing its future.

I have been slowly working to get my current HTPC moved to WHS thanks to the WHS version of SageTV and their new (but hard to get) HD extender. I get the additional features of WHS and can move the computer from my living room to a closet and use a small, quiet extender box in my entertainment center.

But I have to agree that I'm reluctant to drop the money on an OEM copy for all the reasons you outline. I would have a multi-HDD setup and normally do edit files on the HTPC/server, so I have no reason to jump on this bandwagon right now.

I'm hoping MS is smart enough to keep this product around, it was/is a step in the right direction but I can't support it yet until I know it's stable and safe.

The solution is simple... by petersterncan, 616 days ago
The solution is simple... never use software RAID. Only use RAID with a proper hardware-based solution.

Just for performance reasons alone it's a bad idea. Yeah, it seems to be practically as fast in a benchmark situation... but have a bunch of other stuff running and then see how well it does!



RE: The solution is simple... by beoba, 616 days ago
This isn't even software RAID, it's above the level of the filesystem, while software RAID would be below it.

Linux example:

[filesystem/partition]
[kernel + software raid]
[physical drives]

WHS:

[hard drive spanning software]
[filesystems]
[kernel]
[physical drives]

RE: The solution is simple... by mindless1, 616 days ago
That's a bit backwards. The ideal solution would be to throw a software raid card in these affected systems so it has the spanning and redundancy. It's ideal instead of a hardware raid card because for a home server the hardware raid card will be mostly wasted money without performance benefit (assuming an otherwise modern system set up for home server duty).

Software raid does not significantly affect performance on a WHS candidate home server. Using today's hardware the CPU load using software raid would be about 10% if that. If you have a bunch of other stuff running it will make no difference, assuming it is reasonably home server related (a few clients' mail, DNS, proxy, etc) not trying to run video encoding jobs or game on your server.



Typical Microsoft complexity by androticus, 616 days ago
The design of this subsystem is typical of the Microsoft approach to design: make something supremely complex, rather than something simple and elegant. This reminds me of the Crypto file capability in NT/2000 -- I remember reading an article about it and looking at the information flow diagram and just shook my head in dismay -- it was unbelievably complex for something that was just supposed to encrypt files. And ironically, despite all that complexity, the system was almost useless -- you had to encrypt manually, so could easily forget to encrypt, and it left all filenames exposed in the clear -- the structure of a secure volume and all the names can be just as important as the content of the files themselves. And compared to a whole-volume solution, its performance was pitiful.

I have no sympathy for Microsoft in this issue. First, they clearly built something WAY too complex and risky for a new product that critically needed to be reliable. And second, they obviously completely undertested it, which is shocking and egregiously irresponsible.

Microsoft (at least used to) like to boast that it hires "super smart" people -- I actually think this has become their curse, since these near-geniuses have become inmates running the asylum, and have turned the company into their own little private puzzle-solving playground, with things like Encrypting filesystem, OLE/ActiveX, Vista (DX10, etc.), this file system extender, etc. etc. They also have WAY too much money and time on their hands -- if they assigned half the developers and half the time to their projects, I am certain the end product would be twice as good!

Doctor, Doctor! by Dsjonz, 616 days ago
Patient: It hurts when I do *this*!
Doctor: Then don't do *that*.

I'll simply avoid doing incremental file saves on my WHS server until it's fixed. I will not throw out a whole basketful of good because of one bad. My experience with Microsoft goes back to Windows v1.04 (circa 1987-88), and WHS v1.0 is among the best first-generation products Microsoft has ever released.

One thing is for certain: my install of WHS v1.0 is more stable than my install of Vista Ultimate!



keep it in perspective by Ares2600, 616 days ago
I agree race conditions are pretty rough, but it feels like this is a little over-dramatic. Using the sort example is kind of a worst case scenario type thing. Generally they can be solved with some additional synchronization (and hopefully preserve performance). Lock up a resource rather than fight for it (hence the 'race' metaphor).

Since it hits so close to home with respect to the 'great new features' category of WHS I'm sure the dev responsible won't be in a great position, but playing the odds my guess is it will be handled in short order.

Maybe too much - too soon? by TheBeagle, 616 days ago
This whole unfortunate situation with WHS reminds me of the TV advertising of Paul Masson (you know, the wine guy) a few years ago. They used to run an ad that said, "We will sell no wine before its time!" Indeed they even got none other than Orson Welles to do that TV piece. I sure wish the WHS folks at Microsoft had paid a little better attention to that ad.

Of all the shortcomings that a new software server product might encounter, data corruption is by far the worst. That type of a bug just destroys public confidence in a product, a large portion of which will likely never be fully regained. It also made a fool out of a number of computer pundits that unequivocally endorsed the WHS product.

Now just in case anyone wonders if I know what I'm talking about, I run a computer business and had planned on selling the heck out of WHS. In fact I still hope to do so. But NOT with WHS Rev.1. It is a plain fact of life that we desperately need a re-born, distinctly new version (Rev 2.0) to sell (and use) in order to overcome the terrible effects of this data corruption fiasco.

Oh, and by the way Microsoft, if you have a collective brain in your head, you darn well ought to do the right thing by offering a free upgrade to Rev.2 for anyone who bought Rev.1. At least that way the significant number of early adopters won't become a mortal enemy to WHS and kill the thing before it ever gets a chance to resurrect itself. That's just some food for thought. But you sure ought to give it some serious consideration.

WHS- running fine for me... by ianken, 616 days ago
...I have yet to hit this, been running since beta.

Backup - no problems
Restore - no problems
RDP gateway/Remote - no problems
Serving files - no problems
Copying content to the server - no problems

I never edit on the server, even before this was identified.

The fact that the WHS is being so open about this and up front about how hard it is to address is refreshing.

As to drive extender being "overly complicated." Well, it does cool things. Online, on the fly storage aggregation without having to suffer long array build times is not simple. That said, I disable all file duplication and rely on an Areca controller and RAID5 for recovery. File mirroring is a waste of space when you have RAID5 and 6 as viable options.

To be honest my biggest beef: you cannot boot a client off the restore CD if the optical drive is SATA, even if a SATA drive is plugged in and you're booting from a PATA device it will crash. Lame.

However the painless backup/restore process is awesome.

RE: WHS- running fine for me... by gpaul, 615 days ago
I was at first very concerned about this issue. Then I came to realize that due to the nightly backup most of the files I had always kept on my W3K server so they were safe were now just as safe on my WS. The only reason in a home environment to have a file 'shared' on the server is for multi-user live access needs. I don't know of any in my environment. I keep everything on the WS and let the WHS handle my long term archive, mp3, wav, and backup needs. It does it well. I wish this bug didn't exist but I also know it will be fixed and in the meantime I'm not impacted by it.



RE: WHS- running fine for me... by Dsjonz, 614 days ago
My experience matches yours, except for this:

"To be honest my biggest beef: you cannot boot a client off the restore CD if the optical drive is SATA, even if a SATA drive is plugged in and you're booting from a PATA device it will crash. Lame."

I boot one of my client PCs with the WHS restore disk from a SATA DVD-RW with no problem.

System: LG GSA-H26N DVD-RW, Gigabyte P35-DQ6, QX9650, Vista Ultimate

Better information than we've had from Microsoft thus far..... by FireTech, 615 days ago
Very nice article Ryan, which I'm sure now makes the situation seem a lot less 'scary' for the majority of current WHS users.

True True by cbutters, 614 days ago
Great article, I'm glad I'm not alone seeing this corruption.
I commented about this corruption bug in an article I wrote a few weeks ago on eXoid.com. I can't believe this wasn't caught earlier. You would think this would be a lawsuit waiting to happen.

RE: True True by ajdavis, 614 days ago
You mean a lawsuit like hard drive manufacturers face? Computers are expected to, at some point, lose/corrupt/completely obliterate data. Users are expected to have backups. Period.

Just bought one yesterday by WT, 613 days ago
OfficeMax was selling WHS with a free 500gbHD, but it looks like the issue is occurring with 2 HD setups, so that free HD would not be a wise addition to the WHS until this bug is corrected (June ?? C'mon MS !!)
Either way, WHS itself is a great idea and the software itself is what makes it powerful, but the users afflicted with this bug are suffering with no end in sight.

OEM Experience, not so good unfortunately by Itsamuppett, 583 days ago
So close, and yet so far.

I bought an OEM copy as soon as it came out. I re-purposed my old gaming machine platform and built it into an HTPC case using quiet fans etc. AMD 64x2 4400+, DFI RD200 MoBo, 3x Samsung 500Gb, DVDRam, 2Gb OCZ400.

I had a big fight on setup as I wanted to use the DFI hardware RAID in order to RAID5 the 3 500Gb discs. I configured it and every time WHS refused to identify it on installation. If I loaded XP or Vista it would install over the top without a fight, but WHS would not see it. I had to unbundle and load the core OS as JBOD.

As many people say, good n00b friendly interface, easy setup, backup and recovery great feature as there are a lot of laptops around the house nowadays. Remote access and domain hosting very easy to setup and very valuable as I travel a lot (Thank you slingbox as well)

JBOD and "application level software RAID" became annoying as I copied all the local content over to the duplicated folders on the server. The copy speed sucked and the number of times I had to see if the thing was still alive (I thought it had detached and gone to sleep) and found "storage balancing" on the bottom line began to make me frustrated. This thing has enough horsepower and the statement of MS is that it should be capable of running on older retired hardware.

Then it all went furry. I backup a Dell 1710 and 1330 as our two house machines which have important material. We needed to restore the 1330 in order to get a file that was overwritten and it failed badly. I started to look at the web and found an excellent site

http://www.wegotserved.co.uk/

this explained a lot of the issue you very concisely cover in your article. So I looked at a number of my other files as a test (170Gb of personal music uploaded through WMP) and I had added comments and rankings to them. Quite a lot of errors and bad files.

I have continued to use the system, the utility that allows you to prune it back to a single drive and de-mount the other drives works well. I dropped it to a single 500Gb and added the others to the new gaming system as a ICH-9R RAID array and copied all of the files over. It is now the default house server using VistaU64. Let's just agree it is faster, easier and more effective. It does not have the remote access and backup but that costs about £20 in software to add ?

I am disgusted by MS stance, in general I think they do a pretty good job and whilst it is easy to knock them (I was an Apple field engineer when the Mac was released and worked for Sun for 3 years) they do deserve respect and I can think of companies that do a lot worse.

So you avoid telling us. You continue to ship the thing. You don't assist your OEM (unless it was on the quiet.....) You don't offer any form of guidance. When the customer does find out he has to dig deep to find out the reality and then gets the wake up call.

I have a software package that does not work. I remember the old joke about MS taking the mickey out of GM for the slow rate of development of the motor car and then GM firing back about the BSOD issue at 90. (let's not mention the work of Mr Nader on an early example) MS just provided a swing axle operating system at this point and do deserve an attitude adjustment.

I am not saying litigation is the answer. I would like to either return the software or be given an SME S2008 licence, they offered me a home server and I don't have one that is trustworthy. If you want to regain respect in a situation like this you should support the customer and offer a positive commitment to improvement. I think a promise to provide V2, an interim "offer" of an alternative if the customer does want to progress it (admittedly the OEM version would probably take this but the pre-builts would have problems) would be a decent gesture.

I did recommend it to many of my friends - when will I learn.

Come on MS pull your finger out and show some respect to people who pay your wages.

Copyright © 1997-2009 AnandTech, Inc. All rights reserved. Terms, Conditions and Privacy Information.