The SF-2281 BSOD Bug

A few weeks ago I was finally able to reproduce the SF-2281 BSOD bug in-house. While working on some new benchmarks for our CPU Bench database I built an updated testbed using OCZ's Agility 3. All of the existing benchmarks in CPU Bench use a first-generation Intel X25-M and I felt like now was a good time to update that hardware. My CPU testbeds need to be stable given their importance in my life, so if I find a particular hardware combination that works, I tend to stick to it. I've been using Intel's DH67BL motherboard for this particular testbed since I'm not doing any overclocking - just stock Sandy Bridge numbers using Intel's HD 3000 GPU. The platform worked perfectly and it has been crash-free for weeks.

A slew of tablet announcements pulled me away from CPUs for a bit, but I wanted to get more testing done while I worked on other things. With my eye off the ball I accidentally continued CPU testing using an ASUS P8Z68-V Pro instead of my Intel board. All of a sudden I couldn't complete a handful of my benchmarks. I never did see a blue screen but I'd get hard locks that required a power cycle/reset to fix. It didn't take me long to realize that I had been testing on the wrong board, but it also hit me that I may have finally reproduced the infamous SandForce BSOD issue. The recent Apple announcements once more kept me away from my CPU/SSD work, but with a way to reproduce the issue I vowed to return to the faulty testbed when my schedule allowed.

Even on the latest drive firmware, I still get hard locks on the ASUS P8Z68-V Pro. They aren't as frequent as before with the older firmware revision, but they still happen. What's particularly interesting is that the problem doesn't occur on Intel's DH67BL, only on the ASUS board. To make matters worse, I switched power supplies on the platform and my method for reproducing the bug no longer seems to work. I'm still digging to try and find a good, reproducible test scenario but I'm not quite there yet. It's also not a Sandy Bridge problem as I've seen the hard lock on ASRock's A75 Extreme6 Llano motherboard, although admittedly not as frequently.

Those who have reported issues have done so from a variety of platforms including Alienware, Clevo and Dell notebooks. Clearly the problem isn't limited to a single platform.

At the same time there are those who have no problems at all. I've got a 240GB Vertex 3 in my 2011 MacBook Pro (15-inch) and haven't seen any issues. The same goes for Brian Klug, Vivek Gowri and Jason Inofuentes. I've sent them all SF-2281 drives for use in their primary machines and none of them have come back to me with issues.

I don't believe the issue is entirely due to a lack of testing/validation. SandForce drives are operating at speeds that just a year ago no one even thought of hitting on a single SATA port. Prior to the SF-2281 I'm not sure that a lot of these motherboard manufacturers ever really tested if you could push more than 400MB/s over their SATA ports. I know that type of testing happens during chipset development, but I'd be surprised if every single motherboard manufacturer did the same.

Regardless, the problem does still exist and it's a valid reason to look elsewhere. My best advice is to look around and see whether other users with a system setup similar to yours have had issues with these drives. If you do own one of these drives and are having issues, I don't know that there's a good solution out today. Your best bet is to get your money back and try a different drive from a different vendor.

Update: I'm still working on a sort of litmus test to get this problem to appear more consistently. Unfortunately even with the platform and conditions narrowed down, it's still an issue that appears rarely, randomly and without any sort of predictability. SandForce has offered to fly down to my office to do a trace on the system as soon as I can reproduce it regularly. 

88 Comments

  • imaheadcase - Thursday, August 11, 2011

    I was wondering the same thing...this seems to happen a lot lately with roundups.
  • Anand Lal Shimpi - Thursday, August 11, 2011

    My apologies! An older version of the graphs made its way live, I've updated all of the charts :)

    Take care,
    Anand
  • Nickel020 - Thursday, August 11, 2011

    I always thought the difference in price between a 25nm SF1200 drive and a synchronous SF2200 was mainly due to the cost of the controller, but since you put the controller at $25, it's the NAND in the SF1200 that must be cheaper.

    A Corsair F115 with synchronous 25nm (G08CAMDB)* costs $170, a Force 3 with asynchronous NAND costs $185 and a Force GT with synchronous NAND costs $245. Thus the synchronous NAND in the F115 must be way cheaper than the synchronous NAND in the Force GT.

    I'm guessing the SF2200 is more expensive than the SF1200, so that basically means that following your cost breakdown, the asynchronous NAND in drives such as the Force 3 or Agility 3 must be similarly priced as the synchronous NAND in the 25nm SF1200 drives.

    Why is the synchronous NAND in the SF1200 drives so much cheaper than the one in the SF2200 drives? Could you decipher the whole part number?

    *I'm assuming the F115 uses the same NAND as the first Vertex 2s with 25nm:
    http://www.tomshardware.de/ocz-vertex-2-25nm-ssd,t...
  • Coup27 - Thursday, August 11, 2011

    If the current state of affairs is due to the reasons you have outlined in the first couple of paragraphs then this has been brought on by the manufacturers themselves.

    All the manufacturers have tried to bring costs down as much as possible for obvious reasons, but they should not have brought them down so low that they sacrifice validation and testing to get there.

    The benefits SSDs have over HDDs are enormous and I am sure I am not alone when I say that I would quite happily pay 15-25% more than the current prices for my drive knowing that it works, full stop.
  • QChronoD - Thursday, August 11, 2011

    I understand sync and async, but I'm not really sure what toggle means. Is it safe to assume that it means the NAND can switch between the two modes? Or is there something else that is special about it?
  • Nickel020 - Thursday, August 11, 2011

    It's a different NAND standard. Intel/Micron NAND follows the ONFI standard (which they developed afaik); Toggle is another standard that's developed by Samsung and others. The Toggle NAND in SF2281 SSDs is 34nm from Toshiba.

    If I understand it correctly, the difference is mainly the interface through which the MLC cells are connected to the controller. Both are MLC though; the basic principle on which they are based is the same.

    The Toggle NAND SSDs are generally faster, because 34nm means less density, more NAND dies, and thus more interleaving. The same thing causes bigger SSDs to be faster than smaller ones (read Anand's other recent articles if you want to know more).
  • Conscript - Thursday, August 11, 2011

    Is there a reason the same products aren't in every graph? The Corsair GT seems to be missing from quite a few.
  • Anand Lal Shimpi - Thursday, August 11, 2011

    Fixed :)

    Take care,
    Anand
  • Shadowmaster625 - Thursday, August 11, 2011

    Is there a way you can force the drive to run at SATA2 speeds to see if that eliminates the lockups?
  • irev210 - Thursday, August 11, 2011

    You open this SandForce article with the Intel 320 SSD firmware bug.

    I love how the BSOD is a page two reference.

    Anand, your OCZ/SandForce bias bleeds through pretty hard. I hope you can be a bit more objective with your reports moving forward.

    The speed difference between SSDs at this point is pretty trivial. As you continue to hammer about reliability, you never even reviewed the Samsung 470, rarely talk about the Crucial C300/M4, and Toshiba seems to be an afterthought.

    At least tomshardware made an attempt to look at SSD reliability.

    Bottom line, it seems like SandForce-driven SSDs have the biggest number of issues, yet you still recommend them. You say "well I never really experience the issues" but just because you don't doesn't mean that it is the most reliable drive.

    I think you should work a little harder at focusing on reliability studies instead of performance metrics. For most users, it taking 1.53 seconds or 1.54 seconds to open an application is pretty irrelevant if SSD A is 10x more likely to fail over SSD B.
