Case Study: SSDs in AnandTech's Server Environment

For the majority of the history of AnandTech we've hosted our own server infrastructure. A benefit of running our own infrastructure is that we're able to gain a lot of hands on experience with enterprise environments that we'd otherwise have to report on from a distance.

When I first started covering SSDs four years ago I became obsessed with the idea of migrating nearly every system over to something SSD based. The first to make the switch were our CPU testbeds. Moving away from mechanical drives ensured better benchmark consistency between runs as any variation in IO load was easily absorbed by the tremendous amount of headroom that an SSD offered. The holy grail of course was migrating all of the AnandTech servers over to SSDs. Over the years our servers seem to die in the following order: hard drives, power supplies, motherboards. We tend to stay on a hardware platform until the systems start showing the signs of their age (e.g. motherboards start dying), but that's usually long enough that we encounter an annoying number of hard drive failures. A well validated SSD should have a predictable failure rate, making it an ideal candidate for an enterprise environment where downtime is quite costly and in the case of a small business, very annoying.

Our most recent server move is a long story for a separate article but to summarize the move, we recently switched hosting providers and data centers. Our hardware was formerly on the east coast and the new datacenter is in the middle of the country. At our old host we were trying out a new cloud platform while our new home would be a mixture of a traditional back-end with a virtualized front-end. With a tight timetable for the move and no desire to deploy an easily portable solution at our old home before making the move we were faced with a difficult task: how do we physically move our servers half way across the country with minimal downtime?

Thankfully our new host had temporary hardware very similar in capabilities to our new infrastructure that they were willing to put the site on as we moved our hardware. The only exception was, as you might guess, a relative lack of SSDs. Our new hardware uses a combination of consumer and enterprise SSDs but our new host only had mechanical drives or consumer grade SSDs on tap (Intel SSD 320s). The fact that we had run the site's databases off of a mechanical drive array for years meant that even a small number of consumer drives should be more than capable of handling the load. Remembering back to some of our earliest lessons in the SSD space: a single solid state drive can offer an order of magnitude better random IO performance than even the fastest hard drives. Sequential performance is typically closer, but with modern 3Gbps SSDs you're still looking at roughly a 100MB/s advantage in sequential throughput over the fastest mechanical drives.

Not wanting to deal with any potential IO performance issues, we decided to deploy a bunch of consumer grade Intel SSDs on our temporary DB servers that our host had on hand. This isn't uncommon, in fact a huge percentage of enterprise workloads are served just fine by consumer SSDs. It's only the absolute heaviest of workloads that demand eMLC or SLC drives. Given that we were going to stay on this platform for a short period of time, there was no need for eMLC/SLC drives. As we know from years of dealing with SSDs, NAND cells have a finite lifespan. Consumer grade MLC NAND is good for about 5000 program/erase cycles per cell, while enterprise grade MLC (eMLC/MLC-HET) can get you over 6x that. While most client workloads won't ever hit 5000 p/e cycles, a heavy enough enterprise workload can definitely reach that point. It's not just the amount of writing you do, it's also how much each write is amplified. Remember that although NAND is programmed at the page level (4KB - 8KB), it can only be erased at the block level (512KB - 2048KB). This imbalance in write/erase granularity means that eventually you'll have to write more to NAND than you've sent to the host (e.g. go to write 8KB but have to read, modify, write an entire 2048KB block as there are no empty blocks to write to). The ratio of NAND to host writes is referred to as write amplification. The combination of workload and write amplification are what determine the longevity of any SSD, but in the enterprise world it's something you actually need to pay attention to.


Write amplification around 1 may be realistic for client (read heavy) workloads, but not in the enterprise

Our host had eight Intel SSD 320s (120GB) on hand that we could use for our temporary database servers. From a performance standpoint these drives should be more than enough to handle our workload, but would they be reliable?

The easiest way to combat write amplification is to increase the amount of spare area on an SSD. NAND that isn't user addressable can be used for background operations and will help ensure that there are empty blocks to be written to as often as possible. If writes happen on empty vs. full blocks, write amplification goes down considerably.

We deployed the Intel SSD 320s partitioned down to 100GB each. Curious as to how they've been holding up over the past ~9 days that they've been running as our primary environment, I installed and ran Intel's SSD Toolbox on our DB servers.

As I mentioned in our review of Intel's SSD 710 you can actually measure things like write amplification, host writes and estimated lifespan using SMART attributes on the latest Intel SSDs. This won't work on the Intel SSD 510, X25-E or first generation X25-M, but on all other Intel SSDs it will.

The attributes of importance are E1, E2, E3 and E4 (these are hex representations of the SMART attribute numbers 225 - 228). E2 - E4 are all timer dependent, while E1 gives you an indication of just how much you've written to the NAND over the life of the drive. I'll get into the timer based counters shortly, but when we originally setup the servers I didn't have the foresight to reset the counters to get an accurate estimate of write amplification on our workload. Instead I'll look at total bytes written and use some internal estimates for write amplification to gauge lifespan.

The eight drives are divided among two database servers running two different applications (one for the main site and one for the forums/ads). The latter more heavily loaded but both are pretty demanding. The four drives are configured in a RAID10 to increase capacity and offer some redundancy. A simple RAID1 of larger drives would be fine, but 120GB drives are all we had on hand at the time. The total number of writes across all of the drives for the past 9 days is documented in the table below:

AnandTech Database Server SSD Workload
SMART Attribute E1 MS SQL Server (Main Site DB) MySQL Server (Forums/Ads DB)
Drive 0 576.28GB 1.03TB
Drive 1 563.28GB 1.03TB
Drive 2 564.44GB 1.13TB
Drive 3 568.03GB 1.13TB

Each drive in the first DB server has seen around 570GB of writes in 9 days, or roughly 63GB/day. The drives in the second DB server have gone through 1.03TB of writes in the same period of time or 114GB/day. Note that both workloads are an order of magnitude greater than an average consumer workload of 5 - 10GB/day. That's not to say that we can't run these workloads on consumer SSDs, we just need to be careful.

With no write amplification we could run on these consumer drives indefinitely. With each MLC NAND cell good for 5000 program/erase cycles, we could write to the drives 5000 times over before we started to lose NAND. Based on the numbers above, we'd blow through a p/e cycle every ~2 days on the first DB server and every ~1 day on the second server.

While we like to assume write amplification is nice and low, in reality it isn't. Intel's own datasheets tell us the worst case write amplification for the 320:

If you divide the column on the right by the column on the left you'll come up with 125 program/erase cycles per cell (if you define 1TB as one trillion bytes). If we assume that each cell is good for 5000 p/e cycles (Intel's 25nm MLC NAND spec) then it means that we're actually writing 40x what we think we're writing. This 40x value gives us an upper bound for write amplification on Intel's SSD 320. That's far lower than the peak theoretical max write amplification of 256 (writing 2048KB for every 8KB write sent to the host), but it's safe to say that Intel's firmware won't let things get that bad.

Write amplification of 40x isn't very good but it's also not very realistic for the majority of workloads. Our database workloads are heavy but they are not perfectly random writes over all LBAs for the life of the drive. Those workloads do exist, but we're simply not an example of one. A more realistic, but still conservative estimate for write amplification in our case would be 10x (just based on some internal estimates for write amplification). The actual write amp is likely less than half that but again, I wanted to be conservative.

Calculating longevity based on this data is pretty simple. Multiply the total bytes written by estimated write amplification and that's how we can scale up/down:

Intel SSD 320 - 120GB - SSD Longevity
  MS SQL Server (Main Site DB) MySQL Server (Forums/Ads DB)
Total Bytes Written (9 days) 576GB 1030GB
Estimated Write Amplification 10x 10x
Drive Capacity 120GB 120GB
Cycles Used 48 85.8
Cycles Used per Day 5.33 9.53
Worst Case P/E Cycles Available 5000 5000
Estimated Lifespan (Days) 937.5 524.3
Estimated Life (Years) 2.57 years 1.44 years

With 10x write amplification we're looking at roughly 2.5 years for one set of drives and 1.4 years for the other set. From a cost standpoint, it's likely cheaper to go the consumer drive route and proactively replace drives compared to jumping to eMLC although that's not always desirable from an uptime perspective. If our write amplification estimates are off (they are) then you can expect something more along the lines of 5 years for the first set of drives and 3 for the second set.

We could combat write amplification by setting aside even more spare area for the drives. Partitioning them as 80GB or even 60GB drives would tangibly reduce write amplification and give us even more time on these consumer drives.

This isn't so much an issue for us as our stay on the 320s is temporary, but it did bring up an interesting question. As enterprise SSD endurance is heavily dependent on write amplification, how would Intel's SandForce based SSD 520 handle enterprise workloads given that its effective write amplification is often times less than 1x?

If you were able to average 0.5x write amplification with Intel's SSD 520, your 5000 p/e cycle MLC NAND would behave more like 10000 p/e cycle NAND. While that's still short of what you get from eMLC, it is perhaps enough to be a better balance of price/performance for many SMB enterprise customers.

It was time to do some investigating...

Intel's SSD 520 in the Enterprise
POST A COMMENT

55 Comments

View All Comments

  • Anand Lal Shimpi - Thursday, February 09, 2012 - link

    Given enough spare area and a good enough SSD controller, TRIM isn't as important. It's still nice to have, but it's more of a concern on a drive where you're running much closer to capacity. Take the Intel SSD 710 in our benchmarks for example. We're putting a ~60GB data set on a 200GB drive with 320GB of NAND. With enough spare area it's possible to maintain low write amplification without TRIM. That's not to say that it's not valuable, but for the discussion today it's not at the top of the list.

    The beauty of covering the enterprise SSD space is that you avoid a lot of the high write amp controllers to begin with and extra spare area isn't unheard of. Try selling a 320GB consumer SSD with only 200GB of capacity and things look quite different :-P

    Take care,
    Anand
    Reply
  • Stuka87 - Wednesday, February 08, 2012 - link

    Great article Anand, I have been waiting for one like this. It will really come in handy to refer back to myself, and refer others too when they ask about SSD's in an enterprise environment. Reply
  • Iketh - Thursday, February 09, 2012 - link

    Anand's nickname should be Magnitude or the OOM Guy. Reply
  • wrednys - Thursday, February 09, 2012 - link

    What's going on with the media wear indicator on the first screenshot? 656%?
    Or is the data meaningless before the first E4 reset?
    Reply
  • Kristian Vättö - Thursday, February 09, 2012 - link

    Great article Anand, very interesting stuff! Reply
  • ssj3gohan - Thursday, February 09, 2012 - link

    So... something I'm missing entirely in the article: what is your estimate of write amplification for the various drives? Like you said in another comment, typical workloads on Sandforce usually see WA < 1.0, while in this article it seems to be squarely above 1. Why is that, what is your estimate of the exact value and can you show us a workload that would actually benefit from Sandforce?

    This is very important, because with any reliability qualms out of the way the intel SSD 520 could be a solid recommendation for certain kinds of workloads. This article does not show any benefit to the 520.
    Reply
  • Christopher29 - Thursday, February 09, 2012 - link

    Members of this forum are testing (Anvil) SSDs with VERY extreme workloads. X25-V40GB (Intel drive) has already 685 TB WRITES ! This is WAY more than 5TB suggested by Intel. They also fill drives completely! This means that your 120GB SSDs (limited even to 100GB) could withstand almost 1 PB writes. One of their 40GB Intel 320 failed after writting 400TB!
    Corsair Force 3 120GB has already 1050TB writes! You shoul reconsider your assumptions, because it seems that those drives (and large ones especially) will last much longer.

    Stats for today:
    - Intel 320 40GB – 400TB (dead)
    - Samsung 470 64GB – 490TB (dead)
    - Crucial M4 64GB – 780TB (dead)
    - Crucial M225 60GB – 840TB (dead)
    - Corsair F40A - 210TB (dead)
    - Mushkin Chronos Deluxe 60GB – 480TB (dead)
    - Corsair Force 3 120GB – 1050TB (1 PB! and still going)
    - Kingston SSDNow 40GB (X25-V) (34nm) - 640TB

    SOURCE:
    http://www.xtremesystems.org/forums/showthread.php...
    Reply
  • Christopher29 - Thursday, February 09, 2012 - link

    PS: And also interestingly Force 3 (that lasted longest) is exactly SF-2281 drive? So what is it in reality Anand, does this mean that SF do write less and therefore SSD last longer? Reply
  • Death666Angel - Thursday, February 09, 2012 - link

    In every sentence, he commented how he was being conservative and that real numbers would likely be higher. However, given the sensitive nature of business data/storage needs, I think most of them are conservative and rightly so. The mentioned p/e cycles are also just estimates and likely vary a lot. Without anyone showing 1000 Force 3 drives doing over 1PB, that number is pretty much useless for such an article. :-) Reply
  • Kristian Vättö - Thursday, February 09, 2012 - link

    I agree. In this case, it's better to underestimate than overestimate. Reply

Log in

Don't have an account? Sign up now