Hard drives

One of the most frequently asked questions I hear is 'what's the most reliable hard drive?'  The answer to this question is straightforward - the one that's backed up frequently.  Home file servers can be backed up in a variety of ways, from external hard drives to cloud storage.  As a general guideline, RAID can enhance performance, but it is not a backup solution.  Some RAID configurations (such as RAID 1) provide increased reliability, but others (such as RAID 0) actually decrease reliability.  A detailed discussion of the different kinds of disk arrays is not within the scope of this guide, but the Wikipedia page on RAID is a good place to start your research if you're unfamiliar with the technology.

As for hard drive reliability, every hard drive can fail.  While some models are more likely to fail than others, there are no authoritative studies that use controlled conditions and large sample sizes.  Most builders have preferences - but anecdotes do not add up to data.  Many variables affect a drive's long-term reliability: shipping conditions, PSU quality, temperature patterns, and of course, specific make and model quality.  Unfortunately, as consumers we have little control over shipping and handling conditions until we get a drive in our own hands.  We also generally don't have much insight into a specific hard drive model's quality, or even a manufacturer's general quality.  However, we can control PSU quality and temperature patterns, and we can use S.M.A.R.T. monitoring tools to watch for early warning signs.
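As a rough illustration of that last point, here is a minimal sketch in Python that polls each drive's overall SMART health by shelling out to smartctl (part of the smartmontools package).  It assumes smartmontools is installed and the script runs with root privileges; the device paths are placeholders, not a recommendation of any particular layout.

# Minimal sketch: check overall SMART health via smartctl (smartmontools).
# Assumes smartmontools is installed and the script runs as root;
# the device paths below are placeholders - substitute your own drives.
import subprocess

DRIVES = ["/dev/sda", "/dev/sdb"]  # placeholder device paths

for drive in DRIVES:
    result = subprocess.run(
        ["smartctl", "-H", drive],  # -H prints the overall health assessment
        capture_output=True, text=True
    )
    status = "UNKNOWN"
    for line in result.stdout.splitlines():
        if "overall-health self-assessment test result" in line:
            status = line.split(":")[-1].strip()  # typically PASSED or FAILED!
    print(f"{drive}: {status}")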

One of the most useful studies on hard drive reliability was presented by Pinheiro, Weber, and Barroso at the 2007 USENIX Conference on File and Storage Technologies.  Their paper, "Failure Trends in a Large Disk Drive Population," relied on data collected from Google's own vast fleet of drives.  So while the controls are not perfect, the sample size is enormous, and it's about as informative as any research on disk reliability.  The PDF is widely available on the web and is definitely worth a read if you've not already seen it and you have the time (it's short at only 12 pages with many graphs and figures).  In sum, they found that SMART errors are generally indicative of impending failure - especially scan errors, reallocation counts, offline reallocation counts, and probational counts.  The take-home message: if one of your drives reports a SMART error, you should probably retire it and send it in for a replacement if it's under warranty.  If one of your drives reports multiple SMART errors, you should almost certainly replace it as soon as possible.

From Pinheiro, Weber, and Barroso 2007.  Of all failed HDDs, more than 60% had reported a SMART error. 
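For those who want to automate the check described above, here is a hedged sketch that scans smartctl's attribute table for the counters most closely matching the ones the study flags (reallocations and pending/"probational" sectors).  The attribute names follow smartctl's usual labels and may differ by drive; the device path is again a placeholder.

# Hedged sketch: flag nonzero values for the SMART counters most closely
# matching the ones the study ties to impending failure (reallocations and
# pending/"probational" sectors).  Assumes smartmontools is installed; the
# attribute names are smartctl's usual labels, the device path a placeholder.
import subprocess

WATCHED = {
    "Reallocated_Sector_Ct",
    "Reallocated_Event_Count",
    "Current_Pending_Sector",   # sectors "on probation" awaiting reallocation
    "Offline_Uncorrectable",
}

def worrying_attributes(device):
    out = subprocess.run(["smartctl", "-A", device],
                         capture_output=True, text=True).stdout
    findings = {}
    for line in out.splitlines():
        parts = line.split()
        # Attribute rows look like: ID# ATTRIBUTE_NAME FLAG ... RAW_VALUE
        if len(parts) >= 10 and parts[1] in WATCHED:
            raw = parts[9]  # first token of the raw value
            if raw.isdigit() and int(raw) > 0:
                findings[parts[1]] = int(raw)
    return findings

print(worrying_attributes("/dev/sda"))  # placeholder device path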

Pinheiro, Weber, and Barroso also showed how temperature affects failure rates.  They found that drives operating at low temperatures (i.e., less than 75F/24C) actually have by far the highest failure rates, even higher than drives operating at 125F/52C.  This is likely an irrelevant point to many readers, but for those of us who live farther north and like to keep our homes below 70F/21C in the winter, it's an important reminder that colder is not always better for computer hardware.  Of use to everyone, the study showed that reliability peaks around 104F/40C, in a band from about 95F/35C to 113F/45C.

From Pinheiro, Weber, and Barroso 2007.  AFR: Annualized Failure Rate - higher is worse!

Given the temperature range in which hard drives appear to function most reliably, it may take some experimentation to arrive at an ideal drive layout for any particular home file server.
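One way to keep an eye on temperatures while experimenting is to read them from SMART as well; a minimal sketch, assuming the drive reports the common Temperature_Celsius (or Airflow_Temperature_Cel) attribute and again using a placeholder device path:

# Minimal sketch: warn when a drive sits outside the roughly 35-45C band the
# study associates with the lowest failure rates.  Assumes the drive exposes
# the common Temperature_Celsius (or Airflow_Temperature_Cel) attribute via
# smartctl; the device path is a placeholder.
import subprocess

LOW_C, HIGH_C = 35, 45  # approximate sweet spot from the study

def drive_temp_c(device):
    out = subprocess.run(["smartctl", "-A", device],
                         capture_output=True, text=True).stdout
    for line in out.splitlines():
        parts = line.split()
        if len(parts) >= 10 and parts[1] in ("Temperature_Celsius", "Airflow_Temperature_Cel"):
            return int(parts[9])  # first token of the raw value is the temperature
    return None

temp = drive_temp_c("/dev/sda")  # placeholder device path
if temp is None:
    print("No temperature attribute reported")
else:
    in_range = LOW_C <= temp <= HIGH_C
    print(f"Drive at {temp}C ({'inside' if in_range else 'outside'} the {LOW_C}-{HIGH_C}C range)")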

So rather than declaring which specific hard drive models are the most reliable, we recommend you do everything you can to prevent catastrophic failure by using quality PSUs, maintaining optimal temperatures, and paying attention to SMART monitoring utilities.  With the small number of drives a home file server involves, the most important factor in long-term HDD reliability is probably luck.

Pragmatically, low-rpm 'green' drives are the most cost-effective storage drives.  Note that many of the low-rpm drives are not designed to operate in a RAID configuration - be sure to research specific models.  The largest drives currently available are 3TB, which can now be found for as little as $110.  The second-largest capacity drives at 2TB generally offer the best $/GB ratio, and can regularly be found for $70 (and less when on sale or after rebate).  1TB drives are fine if you don't need much space, and can sometimes be found for as little as $40.
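The $/GB comparison is easy to check yourself; a quick sketch using the example prices above (prices move constantly, so treat these figures as placeholders):

# Quick $/GB check using the example prices quoted above; prices fluctuate
# constantly, so treat these figures as placeholders.
drives = {"3TB": (110, 3000), "2TB": (70, 2000), "1TB": (40, 1000)}  # (price USD, capacity GB)

for label, (price_usd, capacity_gb) in drives.items():
    print(f"{label}: ${price_usd / capacity_gb:.3f}/GB")
# At these prices the 2TB drive is cheapest per gigabyte ($0.035/GB),
# just ahead of the 3TB ($0.037/GB) and 1TB ($0.040/GB) models.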

Comments

  • mino - Tuesday, September 6, 2011 - link

    For plenty of money :)

    Basically, a SINGLE decent RAID card costs ~$200+, for which you could have the rest of the system.

    And you need at least 2 of them for redundancy.

    Also, with a DEDICATED file server and open-source ZFS, who needs HW RAID? ...
  • alpha754293 - Tuesday, September 6, 2011 - link

    In most cases, the speed of the drives/controller/interface is almost immaterial because you're going to be streaming it over a 1 Gbps network at most.

    And if you actually HAVE 10GbE or IB or Myrinet or any of the others, I'm pretty sure that if you can afford the $5000 switch, you'd "splurge" on the $1000 "proper" HW RAID card.

    Amusing how all these people are like "speed speed speed!!!!", forgetting that the network will likely be the bottleneck. (And WiFi is even worse - 0.45 Gbps is the best you can do with WiFi-n.)
  • DigitalFreak - Sunday, September 4, 2011 - link

    I've been using Dell PERC-5i cards for years. You can find them relatively cheap on eBay, and they usually include the battery backup. I believe they're limited to 2TB drives though.
  • JohanAnandtech - Monday, September 5, 2011 - link

    "But there's the fact that software RAID (which is what you're getting on your main board) is utterly inferior to those with dedicated RAID cards"

    Hmm, I am not sure those entry-level firmware thingies that have an R in front of them are so superior. They offload most processing tasks to the CPU anyway, and they tend to create problems if they break and you replace them with a new one running newer firmware. I would be interested to know why you feel that hardware RAID (except the high-end stuff) is superior?
  • Brutalizer - Monday, September 5, 2011 - link

    When you say that software RAID is inferior to hardware RAID, I hope you are aware that hw-raid is not safe against data corruption?

    You have heard about ECC RAM? Spontaneous bit flips can occur in RAM, which is corrected by ECC memory sticks.

    Guess what: the same spontaneous bit flips occur in disks too. And hw-raid neither detects nor corrects such bit flips. In other words, hw-raid has no ECC correction functionality. Data might be corrupted by hw-raid!

    Nor do NTFS, ext3, XFS, ReiserFS, etc. correct bit flips. Read here for more information; there are also research papers on data corruption vs hw-raid, NTFS, JFS, etc.:
    http://en.wikipedia.org/wiki/ZFS#Data_Integrity

    In my opinion, the only reason to use ZFS is that it detects and corrects such bit flips. No other solution does. Read the link for more information.
  • sor - Monday, September 5, 2011 - link

    Many RAID solutions scrub disks, comparing the data on one disk to the other disks in the array. This is not quite as robust as the filesystem being able to checksum, but as your own link points out, the chance of a hard drive flipping bits is on the order of 1 in 1.6PB, so combined with a RAID that regularly scrubs the data, I don't see home users needing to even think about this.
  • Brutalizer - Monday, September 5, 2011 - link

    You are neglecting something important here.

    Say that you repair a raid-5 array. Say that you are using 2TB disks, and you have an error rate of 1 in 10^16 (just as stated in the article). If you repair one disk, then you need to read 2,000,000,000,000 bytes, and every time you read a bit, an error can occur.

    The chance of at LEAST ONE ERROR can be calculated with this well-known formula:
    1 - (1-P)^n
    where P is the probability of an error occurring, and "n" is the number of times the error can occur.

    If you plug in those numbers, it turns out that during a repair there is something like a 25% chance of hitting at least one read error. You might hit two errors, or three errors, etc. Thus, there is a 25% chance of you getting read errors.

    If you repair a raid and then run into read errors, you have lost all your data if you are using raid-5.

    Thus, this silent corruption is a big problem. Say some bits in a video file are flipped - that is no problem; a white pixel might be red instead. But say your rar file has been affected - then you cannot open it anymore. Or a database is affected. This is a huge problem for sysadmins:
    http://jforonda.blogspot.com/2006/06/silent-data-c...
  • Brutalizer - Monday, September 5, 2011 - link

    PS. There is a 1 in 10^16 chance that the disk will not be able to recover the bit. But there are more factors involved: current spikes (which no RAID can handle):
    http://blogs.oracle.com/elowe/entry/zfs_saves_the_...

    bugs in firmware, loose cables, etc. Thus, the real chance of corruption is much higher than 1 in 10^16.

    Also, RAID does not scrub disks thoroughly; it only computes parity, which is not the same as checksumming data. See here about RAID problems:
    http://en.wikipedia.org/wiki/RAID#Problems_with_RA...
  • alpha754293 - Tuesday, September 6, 2011 - link

    @Brutalizer
    Bit flips

    I think that CERN was testing that and found that it was like 1 bit in 10^14 bits (read/write) or something like that. That works out (according to the CERN presentation) to be 1 BIT in 11.6 TiB.

    If a) you're concerned about silent data corruption on that scale, and b) you're running ZFS - make sure you have tape backups, since there ARE no offline data recovery tools available. Not even at Sun/Oracle. (I asked.)
  • sor - Monday, September 5, 2011 - link

    Inferior how? I've been doing storage evaluation for years, and I can say that software RAID generally performs better, uses negligible CPU, and is easier to recover from failure (no proprietary hardware). The only reason I'd want hardware RAID is for ease of use and the battery-backed write-back cache.
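The 1 - (1-P)^n calculation that comes up in the thread above is easy to sanity-check; a minimal sketch, where the per-bit error rate and the amount of data read during a rebuild are assumptions you should replace with your own drive's unrecoverable-read-error spec and your array's size:

# Minimal sketch of the 1 - (1 - P)^n calculation discussed in the comments.
# The per-bit error rate and the bytes read during a rebuild are assumptions -
# substitute your drive's URE spec and your array's size.
import math

def prob_at_least_one_error(p_per_bit, bits_read):
    # Numerically stable form of 1 - (1 - p)^n for tiny p and huge n.
    return -math.expm1(bits_read * math.log1p(-p_per_bit))

p = 1e-14                     # e.g. a "1 in 10^14 bits" consumer-drive URE spec
bits = 2_000_000_000_000 * 8  # reading one 2TB drive during a rebuild

print(f"{prob_at_least_one_error(p, bits):.1%}")  # roughly 15% with these assumptions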
