One of the constants of the storage business is that capacity per drive keeps increasing. Spinning hard-disk drives are approaching 20 TB, while solid-state drives range from 4 TB to 16 TB, or even more if you're willing to entertain an exotic form factor. Today at the Data Centre World conference in London, I was quite surprised to hear that, because of managed risk, we're unlikely to see much demand for drives over 16 TB.

Speaking with a few individuals at the show about expanding capacities, it emerged that storage customers who need high density are starting to specify maximum drive sizes based on their implementation needs. One message coming through is that storage deployments are managing risk through drive size: sure, a large-capacity drive allows for high density, but when a large drive fails, a lot of data is lost in one go.

If we consider how data is used in the datacentre, it falls into several tiers based on how often it is accessed. Long-term storage, known as cold storage, is accessed very infrequently and typically sits on mechanical hard drives with long-term data retention. A large drive failure at this level might lose substantial archival data, or require long rebuild times. More regularly accessed storage, called nearline or warm storage, is accessed frequently but often serves as a localised cache of the long-term store. For this case, imagine Netflix keeping a good portion of its back catalogue close to users: losing a drive here means falling back to colder storage, and rebuild times come into play. For hot storage, which sees constant read/write access, we are often dealing with DRAM or heavy database work with many operations per second. This is where a drive failure and rebuild can result in critical issues with server uptime and availability.

Ultimately the combination of drive size and failure rate determines the risk of data loss and downtime, and aside from engineering more reliable drives, the other variable for risk management is drive size. Based on the conversations I've had today, 16 TB seems to be the inflection point: no one wants to lose 16 TB of data in one go, regardless of how often it is accessed, or how much failover capacity a storage array has in place.

I was told that, sure, drives above 16 TB do exist in the market, but aside from niche applications (where the risk is acceptable in exchange for higher density), volumes are low. This inflection point, one would imagine, is subject to change as the nature of data and data analytics changes over time. Samsung's PM983 NF1 drive tops out at 16 TB, and while Intel has shown 8 TB samples of its long-ruler E1.L form factor, it has listed future QLC drives up to 32 TB. Of course, 16 TB per drive puts no limit on the number of drives per system: we have seen 1U units with 36 of these drives in the past, and Intel has been promoting up to 1 PB in a 1U form factor. It is worth noting that the market for 8 TB SATA SSDs is relatively small; no one wants to rebuild that large a drive at 500 MB/s, which would take a minimum of 4.44 hours, dragging server uptime down to 99.95% rather than the 99.999% metric (about five minutes of downtime per year).
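The rebuild-time arithmetic above can be sketched as a quick back-of-the-envelope calculation. This is a minimal illustration using the figures quoted in the article; it assumes the rebuild runs at the full sustained interface rate, which real rebuilds rarely achieve:

```python
def rebuild_hours(capacity_tb: float, speed_mb_s: float) -> float:
    """Best-case time to read or write an entire drive at a sustained rate."""
    capacity_mb = capacity_tb * 1_000_000  # decimal units: 1 TB = 10^6 MB
    return capacity_mb / speed_mb_s / 3600

def uptime_pct(downtime_hours_per_year: float) -> float:
    """Uptime percentage if the given downtime is incurred once per year."""
    hours_per_year = 365.25 * 24
    return 100 * (1 - downtime_hours_per_year / hours_per_year)

# 8 TB SATA SSD at ~500 MB/s: ~4.44 hours, pulling uptime down to ~99.95%
print(round(rebuild_hours(8, 500), 2))                  # 4.44
print(round(uptime_pct(rebuild_hours(8, 500)), 3))      # ~99.949

# 16 TB NVMe at ~3 GB/s: under 90 minutes for a full-drive read
print(round(rebuild_hours(16, 3000), 2))                # ~1.48
```

By the same arithmetic, a five-nines target (99.999%) allows only about 5.3 minutes of downtime per year, which is why even a fast rebuild blows the budget if it takes the array offline.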


85 Comments


  • PeachNCream - Thursday, March 14, 2019 - link

    Yup, you do not take the SAN down when you need to rebuild one failed drive. I do agree that the market for 8+ TB SATA drives is small, but it hasn't much to do with rebuild time. There are other factors at play here that were overlooked.
  • abufrejoval - Thursday, March 14, 2019 - link

    Yup, SANs will rebuild from standby drives and live rebuilding is why you pay for smart RAID controllers even at home.

    I think rebuilding my home-lab RAID6 after upgrading from six to eight 4 TB drives actually took longer than copying the content over a 10 Gbit network (more than two days). But I slept soundly while that was going on, because the primary data wasn't at risk and I also had a full backup on another RAID.
  • Kevin G - Wednesday, March 13, 2019 - link

    Storage and networking are relatively flexible nowadays.

    I also challenge the idea of arrays rebuilding at 500 MByte/s: NVMe is here, and a good NVMe RAID controller should be able to rebuild at close to 3 GByte/s. Granted, that only shifts the ~5 hours of downtime to ~50 minutes, still beyond what 99.999% uptime would require. The rest can be controlled by better storage policies, say leveraging multiple smaller RAID5/6 arrays that sit behind a logical JBOD instead of a single larger unit, i.e. six RAID5 arrays of six drives apiece instead of one massive 36-drive array. Beyond that, leverage system-level redundancy so that requests that would normally be serviced are redirected to a fully functional mirror. Granted, load would not be even between the systems, but externally there would be no apparent drop in service, just a slight dip in performance for a subset of accesses. The end result would be no measurable downtime.

    Data redundancy is mostly a solved problem. The catch is that the solutions are not cheap which I see price being the bigger factor in maintaining proper redundancy.
  • zjessez - Wednesday, March 13, 2019 - link

    Enterprises are now using dual-port 12 Gbps SAS SSDs, and those will rebuild a lot faster than 500 MB/s. In addition, many vendors are adopting NVMe, which would rebuild even faster. The reality is that it is first a cost issue with regard to high-capacity SSDs, and second a bandwidth issue in terms of rebuild times. There are also considerations in how the array manages a rebuild: is it a RAID-5 or RAID-6 implementation, which will be slower, or is it distributed RAID that can rebuild in parallel, or erasure coding based on Reed-Solomon? These are all factors that come into play in terms of rebuild speed and the degradation of performance during such a rebuild.
  • jhh - Wednesday, March 13, 2019 - link

    One of the other issues with extremely high-capacity drives is that the I/O interface doesn't scale with capacity. If the data is accessed so infrequently that a higher-speed interface isn't needed, why not just use cheaper HDD storage? Latency to that infrequently accessed data is about the only downside. A similar issue arises with disaggregated storage: even 100 Gbps Ethernet becomes a bottleneck for a system with many NVMe drives.
  • PixyMisa - Wednesday, March 13, 2019 - link

    True, but with NVMe, and with PCIe 4.0 and 5.0 on the way, we do have a lot of bandwidth. A 16TB PCIe 5.0 x4 drive would take less than 20 minutes to read.
  • jordanclock - Wednesday, March 13, 2019 - link

    I love that Ian talked to people actually in the data center industry, reports that those same people all said 16TB is the effective limit based on all of the factors, and all the comments are saying they're wrong.

    I'm seeing a lot of armchair engineering in these comments and not a lot of citation of sources or even experiences.
  • rpg1966 - Thursday, March 14, 2019 - link

    People are questioning it because he hasn't explained why *any*-sized drive would be an issue given proper data management techniques.
  • Fujikoma - Thursday, March 14, 2019 - link

    He stated that he talked to a few individuals. For perspective, he should have included the size of the data centers these people ran and that the drives were intended for. Myself, I could see drives larger than 16TB for small businesses that use the data internally and could rebuild over a weekend.
  • ksec - Wednesday, March 13, 2019 - link

    I think I need to dig up some info on this because I don't quite understand it. Any players that are using 16TB per "ruler" will likely have huge redundancy built into their system. Not just consumer-grade RAID, but something likely much more sophisticated than a ZFS cluster or Backblaze's Reed-Solomon.

    And with these drives going up to even 8GB/s, I don't see how 16TB will be a factor. If anything it will be the cost per GB that will be the issue. That's assuming an 8TB ruler has the same NAND reliability as a 16TB one.

    If anything it will likely be the speed of the network that is the handicap.
