The SF-2281 BSOD Bug

A few weeks ago I was finally able to reproduce the SF-2281 BSOD bug in house. In working on some new benchmarks for our CPU Bench database I built an updated testbed using OCZ's Agility 3. All of the existing benchmarks in CPU Bench use a first generation Intel X25-M and I felt like now was a good time to update that hardware. My CPU testbeds need to be stable given their importance in my life so if I find a particular hardware combination that works, I tend to stick to it. I've been using Intel's DH67BL motherboard for this particular testbed since I'm not doing any overclocking - just stock Sandy Bridge numbers using Intel's HD 3000 GPU. The platform worked perfectly and it has been crash free for weeks.

A slew of tablet announcements pulled me away from CPUs for a bit, but I wanted to get more testing done while I worked on other things. With my eye off the ball I accidentally continued CPU testing using an ASUS P8Z68-V Pro instead of my Intel board. All of a sudden I couldn't complete a handful of my benchmarks. I never did see a blue screen, but I'd get hard locks that required a power cycle or reset to fix. It didn't take me long to realize that I had been testing on the wrong board, but it also hit me that I may have finally reproduced the infamous SandForce BSOD issue. The recent Apple announcements once more kept me away from my CPU/SSD work, but with a way to reproduce the issue I vowed to return to the faulty testbed when my schedule allowed.

Even on the latest drive firmware, I still get hard locks on the ASUS P8Z68-V Pro. They aren't as frequent as before with the older firmware revision, but they still happen. What's particularly interesting is that the problem doesn't occur on Intel's DH67BL, only on the ASUS board. To make matters worse, I switched power supplies on the platform and my method for reproducing the bug no longer seems to work. I'm still digging to try and find a good, reproducible test scenario but I'm not quite there yet. It's also not a Sandy Bridge problem as I've seen the hard lock on ASRock's A75 Extreme6 Llano motherboard, although admittedly not as frequently.

Those who have reported issues have done so from a variety of platforms including Alienware, Clevo and Dell notebooks. Clearly the problem isn't limited to a single platform.

At the same time there are those who have no problems at all. I've got a 240GB Vertex 3 in my 2011 MacBook Pro (15-inch) and haven't seen any issues. The same goes for Brian Klug, Vivek Gowri and Jason Inofuentes. I've sent them all SF-2281 drives for use in their primary machines and none of them have come back to me with issues.

I don't believe the issue is entirely due to a lack of testing/validation. SandForce drives are operating at speeds that just a year ago no one even thought of hitting on a single SATA port. Prior to the SF-2281 I'm not sure that a lot of these motherboard manufacturers ever really tested if you could push more than 400MB/s over their SATA ports. I know that type of testing happens during chipset development, but I'd be surprised if every single motherboard manufacturer did the same.

Regardless, the problem does still exist and it's a valid reason to look elsewhere. My best advice is to look around and see whether other users with a system setup similar to yours have had issues with these drives. If you do own one of these drives and are having issues, I don't know that there's a good solution out today. Your best bet is to get your money back and try a different drive from a different vendor.

Update: I'm still working on a sort of litmus test to get this problem to appear more consistently. Unfortunately even with the platform and conditions narrowed down, it's still an issue that appears rarely, randomly and without any sort of predictability. SandForce has offered to fly down to my office to do a trace on the system as soon as I can reproduce it regularly. 


88 Comments


  • Ipatinga - Thursday, August 11, 2011

    So, the Corsair Force GT is really going against OCZ Vertex 3? I thought it was against Vertex 3 Max IOPS.

    In this case, the Corsair Force 3 is going after Agility 3?
    And Corsair Performance 3 is going after Solid 3?

    Thanks :)

    Would like to hear more about NAND Flash that is Async and Sync and Toggle.
  • bob102938 - Thursday, August 11, 2011

    There are some factors that were not considered on the first page of the article. The number of dies per wafer is important, but you are forgetting the cost of producing a flash memory wafer vs a VLSI wafer. Flash memory is a ~20 layer process that has margins for error which can be worked around. VLSI is a 60+ layer process that has 0 margin for error. Producing flash memory wafers is more than an order of magnitude cheaper than producing the same-size VLSI wafer. Additionally, turnaround time on a flash wafer can be achieved in ~20 days, whereas a VLSI wafer can require 3 months.

    Also the internal cost of a 300mm flash memory wafer is more like $1000. A VLSI wafer is around $8000.
  • philosofool - Thursday, August 11, 2011

    I don't want to blame the victims, end users. Obviously, manufacturers have a responsibility to QA.

    Still, when you look at the market forces here, it seems obvious that market forces are driving the problem.

    Manufacturer makes the COOL drive that gets the best performance marks of any drive out there. One year later, the COOLER drive is released. No one wants a COOL drive anymore. Plus, the margin making COOL drives is so small, you can't drop your price on a COOL drive to make it an attractive "midrange" option. So you have to start developing a new controller to make something downright freezing.

    Because there's such an emphasis on performance, controllers and the drives they run become obsolete before a water-tight reliable version of the controller can be made. Of course, they're not really obsolete--there's nothing wrong with the X-25M controller--but they can't compete in a market with drives that show twice the random read performance of an unreliable competitor.

    Constant R&D on new controllers and the demand for performance mean that reliability takes a backseat. You can't sell COOL drives as long as someone makes a COOLER drive, even if cooler drives have reliability problems. Think about yourself: would you buy an X-25 M knowing that you could get a Vertex 3 instead?
  • Bannon - Thursday, August 11, 2011

    I built a system on an Asus P8Z68 Deluxe motherboard and used two Intel 510 250GB drives with it. One is the system drive and the other the data drive, with firmwares PWG2 and PWG4 respectively. To date I have not experienced a BSOD, BUT my system drive will drop from 6Gbps to 3Gbps for no apparent reason and stay there until I power the system off. My data drive is rock solid at 6Gbps and stays there. I've just started working with Intel so I don't know where that is going to lead. Hopefully it ends up with a new drive with the latest firmware and 6Gbps performance. Given my druthers I'd rather have this problem than the SandForce BSODs, but I wanted to point out that everything isn't perfect in Intel-land.
  • Coup27 - Thursday, August 11, 2011

    Anand,

    Can we ever expect a 470 review?
  • nish0323 - Thursday, August 11, 2011

    or am I the only one about the fact that the OWC drive is the ONLY one with a 5 year warranty on it!! That's nuts... they actually back up the claim of their SSD drive longevity by giving you such a long warranty. I love SSDs.
  • OWC Grant - Friday, August 12, 2011

    Glad you noticed that warranty term because it's somewhat related to the topic of this article. I've been in direct contact with Anand on this as the tone of the article is all-encompassing and I wanted to shed some light on that from our perspective.

    While many SF based SSDs share firmware, not all hardware is the same. Our SSDs have subtle design and/or component differences, which is what we feel reduces or eliminates our products' susceptibility to the BSOD issue.

    The honest truth is we have not been able to create a BSOD issue here with our SSDs using the same procedures that caused other brands' SSDs to experience BSOD. Nor have we received or read one direct report of such an occurrence using our drives.

    And while we cut our teeth so to speak in the Mac industry, PLENTY of PC users have our SSDs in their systems...as well as that we do extensive testing on a variety of motherboards/system configs to ensure long term reliable operation.

    More supportive perhaps is the fact that we've had other brand users who experienced BSOD, but after buying our SSD, they reported back that it eliminated any issues they were experiencing.
  • ckryan - Thursday, August 11, 2011

    should be getting more reliable, not less. As profit margins get slimmer and slimmer, shouldn't manufacturers be producing more reliable drives? Also, Intel might be making less money per drive, but surely their enterprise sales require the same levels of validation (required previously).
  • Conscript - Thursday, August 11, 2011

    am I nuts after reading multiple reviews from Anand as well as elsewhere, that I keep thinking I'm best off with a 256GB Crucial M4? I've had my 160GB X-25 for a while now, and think I'm going to hand it down to the wifey.
  • Bannon - Thursday, August 11, 2011

    I had a 256GB M4 which worked fine except it would BSOD if I let my system sleep.
