Original Link: https://www.anandtech.com/show/2105



Introduction

The first magnetic disk was introduced by IBM in the 305 RAMAC computer on September 13th, 1956. The first disk drive was the size of two large refrigerators, could hold 4.4 MB, and cost $10,000 per MB. Although the capacity of the hard disk has exploded and price per GB has decreased spectacularly, the price of a complete enterprise storage solution can still quickly amount to tens of thousands of dollars and more.

Building a complete server solution for our own server lab, we quickly found out that finding the best storage solution for our needs was pretty hard when you are on a tight budget. As usual, the companies active in this market are not helping out. Minor evolutions are called "Breakthrough Architectures", "Affordability" means "not too expensive unless you need more than two drive bays filled" and "Business Intelligence" or "Investment Protection" just means that the marketing people were running out of buzz words and inspiration. In fact, the storage companies do their best to confuse people by calling both a simple SCSI DAS and a very expensive Fibre Channel SAN "scalable", "flexible", "affordable" and "serviceable".

The seasoned storage veteran quickly weeds out all the fluffy buzzwords, but what if you are relatively new to this market? What if your own experience with storage has been limited to adding disks to your old trusty tower server or the workstations of your colleagues? Welcome to the second part of our server guide! Just like our first guide, our goal is to offer you a no-nonsense introduction into the server room, and in this particular server guide we focus on storage performance and different disk interfaces.


Disk performance?

Before we start discussing the different topologies and technologies in the storage world, it is good to get back to basics. The basic component of 99.9% of the storage technology out there is still the hard disk.

To understand the basic performance of a disk, take a look at what happens when a request is sent to the disk:
  1. The disk controller translates a logical address into a physical address (cylinder, track, and sector). Sending the request takes only a few tens of nanoseconds; command decoding and translation can take up to 1 ms.
  2. The actuator moves the head to the correct track. This is called seek time; average seek times are somewhere between 3.5 and 10 ms.
  3. The spindle motor rotates the platter until the correct sector is located under the head. This is called rotational latency, and it ranges from 5.6 ms (5400 rpm) down to 2 ms (15000 rpm). Rotational latency is thus determined by how fast the spindle spins.
  4. The data is then read or written. The time this takes depends on how many sectors the disk has to read or write. The rate at which data is accessed is called the media transfer rate (MTR).
  5. If data is read, it goes into the disk buffer and is transferred by the disk interface to the system.
Media transfer rate (MTR) depends on the rotation speed and on the density with which data is stored. The higher the density, the more data moves under the head in the same amount of time.
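As a sketch, the five steps above can be rolled into a simple service-time model. All parameter values below are illustrative assumptions, not the specifications of any particular drive:

```python
# Simple service-time model for one disk request, following the five
# steps above. All parameter values are illustrative assumptions,
# not the specifications of any particular drive.

def service_time_ms(seek_ms, rpm, transfer_kb, mtr_mb_s, overhead_ms=0.3):
    """Estimated time (ms) to complete a single disk request."""
    rotational_latency_ms = 0.5 * 60000 / rpm           # half a revolution on average
    transfer_ms = transfer_kb / 1024 / mtr_mb_s * 1000  # size / media transfer rate
    return overhead_ms + seek_ms + rotational_latency_ms + transfer_ms

# A hypothetical 15000 rpm enterprise disk reading one 4 KB block:
print(f"{service_time_ms(3.5, 15000, 4, 80):.2f} ms")  # dominated by seek + latency
```

For small random requests the transfer term is almost invisible next to the seek and latency terms, which is exactly the point made in the next section.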

Which operation will be the most important? That depends on the amount of data you read or write. If you need many small pieces of data all scattered all over the disk, seek time and latency are the most important. On the other hand if you transfer larger, contiguous pieces of data (i.e. data that is located in close proximity on the drive surface), the MTR will be the most important parameter.

To illustrate this, take a look at the table below. The table calculates how much time it would take to transfer one block of 4 MB, similar to opening an MP3 song on a desktop PC. We also calculate the time it takes to get 100 different blocks of 4 KB, similar to what would happen if 100 users sent a very simple query to a database server simultaneously. At the end of the table we calculate the total time it takes to perform the requested actions, and the sustained transfer rate (STR): the amount of data divided by the total time.



The fastest SATA and SCSI disks performing a database and a typical desktop workload

Although it transfers only one tenth the amount of data, the database access takes almost 15 times longer. In the case of our database access, seek time and latency determine 90-95% of our disk performance, while transfer time accounts for only 1%. If we increased the size of the blocks we need to 16 KB, little would change: the transfer time would quadruple, but the total time would hardly increase. However, if we increase the number of blocks (or more generally the number of "I/O operations"), the total time scales almost linearly: twice as many I/O operations will double the time.
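The arithmetic behind this comparison can be sketched as follows. The seek time, latency and media transfer rate are assumed values, roughly in line with a fast enterprise drive, so the exact numbers will differ somewhat from the table:

```python
# Recomputing the two example workloads: one sequential 4 MB read (the
# "MP3") versus 100 scattered 4 KB reads (the database). The drive
# parameters below are assumptions, not the figures used in the table.

SEEK_MS, LATENCY_MS, MTR_MB_S = 3.5, 2.0, 80.0

def total_time_ms(n_blocks, block_kb):
    """Total time for n_blocks reads of block_kb each, one seek per block."""
    per_block = SEEK_MS + LATENCY_MS + (block_kb / 1024) / MTR_MB_S * 1000
    return n_blocks * per_block

mp3_ms = total_time_ms(1, 4096)   # one contiguous 4 MB transfer
db_ms = total_time_ms(100, 4)     # 100 random 4 KB transfers

# Sustained transfer rate = data moved / total time
mp3_str = 4.0 / (mp3_ms / 1000)
db_str = (100 * 4 / 1024) / (db_ms / 1000)

print(f"MP3: {mp3_ms:.1f} ms total, STR {mp3_str:.1f} MB/s")
print(f"DB:  {db_ms:.1f} ms total, STR {db_str:.2f} MB/s")
```

Even with these optimistic assumptions, the random workload's sustained transfer rate collapses to well under 1 MB/s, because nearly all of the time is spent seeking rather than transferring.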

In our "desktop MP3" example, transfer time accounts for 85% of the total time: here, MB/s is the most important metric. File and FTP servers fall somewhere between the desktop and database server examples: on average the number of KB per I/O operation is much higher than in a transactional database, but many I/O operations are still requested simultaneously.

So basically, there are two ways to measure storage performance:
  1. In MB/s
  2. In I/O operations per second
Notice that in the worst case, database storage server performance can be less than 1 MB/s. Of course, smart techniques such as Native Command Queuing, read ahead buffers, Out of Order Data delivery, and smart caches can lower the impact of concurrent accesses. However, it is not uncommon for database applications to lower the STR (Sustained Transfer Rate) of very fast drives to a few MB per second.



Enterprise Disks: all about SCSI

There are currently only two kinds of hard disks: those which work with SCSI commands and those which work with (S)ATA commands. However, those SCSI commands can be sent over three disk interfaces:
  • SCSI-320 or 16 bit Parallel SCSI
  • SAS or Serial Attached SCSI
  • FC or Fibre Channel
Fibre Channel is much more than just an interface; it could be described as a complete network protocol like TCP/IP. However, as we are focusing on the disks right now, we will consider it for now as an interface through which SCSI commands are sent. Right now, Fibre Channel disk sales amount to about 20% of the enterprise market, mostly at the high end. SCSI-320 used to hold about 70% of this market, but it is quickly being replaced by SAS[1]. Some vendors estimate that SAS drives already account for about 40% of the enterprise market. It is not clear what percentage of enterprise drives are SATA drives, but it is around 20%.


Will SATA kill off SCSI?

One look at the price of a typical "enterprise disk" -- whether it be a SCSI, FC or SAS disk -- will tell you that you pay at least 5 and up to 10 times more per GB. Look at the specification sheets and you will see that the advantage you get for this enormous price premium seems to be only a 2 to 2.5 times lower access time (seek + latency) and a maximum transfer rate that is perhaps 20 to 50% higher.

In the past, the enormous price difference between disks that use ATA commands and disks that use SCSI commands could easily be explained by the fact that a PATA disk would simply choke when you sent it a lot of random concurrent requests. As quite a few reviews here at AnandTech have shown, thanks to Native Command Queuing the current SATA drives handle enterprise workloads quite well: NCQ easily increases the number of concurrent I/O operations per second by 50%. So while PATA disks were simply pathetically slow, current SATA disks are - in the worst case - about half as fast as their SCSI counterparts when it comes to typical I/O intensive file serving.

There is more: the few roadblocks that kept SATA out of the enterprise world have also been cleared. One of the biggest problems was the point-to-point nature of SATA: each disk needs its own cable to the controller. This results in a lot of cable clutter, which made SATA-I undesirable for enterprise servers and storage rack enclosures.

This roadblock can be removed in two ways. The first is to use a backplane with SATA port multipliers. A port multiplier can be compared to a switch: one host-to-SATA connection is multiplexed to multiple SATA connectors. At most 15 disks can make use of one SATA point-to-point connection; in practice, port multipliers connect 4 to 8 disks per port. As the best SATA disks can only sustain about 60 to 80 MB/s in the outer zones, a four-disk port multiplier makes sense even for streaming applications. For more random workloads, even 8 or 15 disks on one 300 MB/s SATA connection would not create a bottleneck.
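A quick back-of-the-envelope check of this sizing argument, using the 300 MB/s link speed and the 60 to 80 MB/s per-drive sustained rates mentioned above:

```python
# How many streaming SATA drives can share one 300 MB/s host link before
# the link itself becomes the bottleneck? The 60-80 MB/s sustained rates
# are the outer-zone figures mentioned in the text.

LINK_MB_S = 300

def drives_before_bottleneck(per_drive_mb_s):
    """Number of full-speed sequential streams one link can carry."""
    return LINK_MB_S // per_drive_mb_s

print(drives_before_bottleneck(80))  # fast drives fill the link quickly
print(drives_before_bottleneck(60))  # slower drives leave more headroom
```

For random workloads each drive delivers only a few MB/s (see the database example earlier), which is why 8 or even 15 drives on one link is no problem there.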


Port multipliers are mostly used on the back panel of a server or the backplane of a storage rack. The second way is to use your SATA disks in a SAS enclosure. We will discuss this later.



Parallel SCSI in trouble

Just as skew and crosstalk limited P-ATA to 133 MB/s, the same signal integrity problems kept parallel SCSI from using a 160 MHz clock. That 160 MHz clock was necessary to reach 640 MB/s for the next evolution of parallel SCSI, SCSI-640. The result is that SCSI-640 died a silent death.

As you can use up to 14 disk drives on one shared SCSI bus, and with the fastest SCSI hard disks reaching up to 100 MB/s, the 320 MB/s transfer rate was starting to become a bottleneck in many fileserver related applications. By design, SCSI is not very efficient when it comes to raw bandwidth: tests show that 2 Gb/s (200 MB/s) FC offers more usable bandwidth than SCSI-320. Of course, fourteen SCSI disks still do not need to transfer several hundred megabytes per second when running a transactional OLTP database workload, but maximum wire speed wasn't the only concern with parallel SCSI.

As SCSI-320 was still backwards compatible with the early SCSI versions, commands were sent at a "back to the eighties" pace: 5 MB/s. This slow rate of sending commands wastes up to 30% of the performance of the SCSI bus. Another big problem was the fact that you could only attach 14 devices on one host bus adapter. This limited the possibilities to expand your storage in directly attached storage configurations.


The SAS/SATA revolution

SAS is much more than a serialized version of SCSI-320. Instead of writing a - probably boring - essay about the new features in the SAS protocol, let us see what new functionality is available by just looking at real SAS implementations and products. First, we take a look at the LSI Logic SAS3442X-R.



LSI Logic SAS3442X-R: an 8-port 3Gb/s SAS, PCI-X HBA

The first thing you will notice is that you can attach two cables to the internal SAS connector: one for SATA and one for SAS drives. The male connector on our SAS HBA card can connect to both the "female" connectors of SATA cables and those of the slightly different SAS cables. As you can see below, it is not possible to use the SAS male connector of a disk (the lower drawing in blue and green) with SATA cables. Basically, SAS HBAs support both SAS and SATA drives, whereas SATA HBAs only support SATA drives.



The connector on top allows us to connect a SATA drive - internally - to our HBA, and the connector at the bottom allows us to connect SAS drives.

SAS is just like TCP/IP or Fibre Channel in that it's a layered protocol. There are three transport protocols that use the same SAS underlying physical and link-layer protocols:
  • The Serial SCSI Protocol (SSP), which transports the SCSI commands over the link layer (similar to the Fibre Channel Protocol)
  • The Serial ATA Tunnelling Protocol (STP), which transports the SATA frames to the SATA drives
  • The Serial Management Protocol (SMP), the protocol which makes it possible to use expanders and to get diagnostic information (does the disk work, is it plugged in?)
Below you can see the whole layered model.


The Serial Management Protocol supports expanders, which give SAS a very high degree of flexibility and scalability. The "Port layer" allows wide ports, which enable much more bandwidth than would have been possible with SCSI-320 or SCSI-640. Notice that the Serial SCSI Protocol (SSP) and the Serial ATA Tunnelling Protocol (STP) use the same link and physical layers; this is what allows SATA and SAS drives to coexist on the same expander. Let us bring the theory into actual practice...



SAS layers in the real world

The SAS layered model is much more than just theory. Below you can see how our LSI Logic card is able to combine a SATA RAID volume with two SATA disks and a SAS RAID volume with four 15000 rpm Fujitsu SAS hard disks in the same Promise Vtrak J300s JBOD storage rack. Don't let the word JBOD confuse you: it does not refer to the RAID level "Just a Bunch Of Disks". It means that the storage rack doesn't have any RAID capabilities, and that the RAID implementation needs to be on the HBA card. Also notice the word "expander" in the screenshot below, which refers to a circuit that is basically a SAS crossbar switch and router at the same time. We will discuss the expander's functionality in more detail.


The bandwidth between the HBA and the SAS storage racks is also much higher than similar SCSI-320 configurations. Let us take a look at the external port of our HBA card. We find an industry standard 4x wide-port SAS connector, which combines four 300 MB/s SAS cables in one "SFF-8470 4X to 4X" external SAS cable. This gives us 1.2 GB/s bandwidth to any external storage rack, four times more than what would be possible with SCSI-320.



A 4x wide port

When we combine this wide port with a storage rack that has a built-in expander, magic happens: we get the advantages of the parallel SCSI architecture and those of the serial SATA architecture. To understand why we say that "magic happens", let us take a concrete example. The storage rack we tried out was the Promise Vtrak J300s, which can hold up to twelve 3.5 inch drives. Take a look at the picture below.


In this configuration you get one big 1.2 GB/s pipe, called a SAS wide port, to your storage rack. If we used SATA without a port multiplier, we would only be able to attach four drives: each drive needs its own cable, its own point-to-point connection. If we used SATA with a port multiplier, we would be able to use all 12 drives, but our maximum bandwidth to the HBA would be limited to 300 MB/s. That is OK for transaction based traffic, but it might introduce a bottleneck for streaming applications.
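The link arithmetic behind this comparison is simple. This is pure wire-speed math; real throughput would also lose some bandwidth to protocol overhead:

```python
# Per-drive bandwidth for 12 drives behind (a) a single 300 MB/s SATA link
# through a port multiplier and (b) a 4x SAS wide port (4 x 300 MB/s).
# Pure wire-speed arithmetic; protocol overhead is ignored.

DRIVES = 12
LINK_MB_S = 300

sata_pm_per_drive = LINK_MB_S / DRIVES        # one multiplexed link
sas_wide_per_drive = 4 * LINK_MB_S / DRIVES   # four aggregated links

print(f"SATA port multiplier: {sata_pm_per_drive:.0f} MB/s per drive")
print(f"SAS 4x wide port:     {sas_wide_per_drive:.0f} MB/s per drive")
```

With the wide port, each of the 12 drives can count on roughly as much bandwidth as a fast disk can actually sustain, which is why streaming workloads no longer hit a link bottleneck.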

With SCSI we would be able to use up to 14 drives without a port multiplier, but we would have to add another SCSI HBA card if we ran out of space with our 14 SCSI drives. As hot swap PCI slots are very rare and expensive, this could mean we would have to take the server and storage down for some time.

Thanks to the built-in expander of the Vtrak J300s, not only can we address 12 drives with only 4 point-to-point connections, but we can also use up to four cascaded storage racks. In this case, Promise has limited SAS routing and traffic distribution to 48 (4 x 12) drives. In theory SAS can support up to 128 devices, but in practice an HBA is limited to about 122 drives.

In other words SAS combines all the advantages of SATA and SCSI, without inheriting any of the disadvantages:
  • You can use up to 122 drives instead of 14 (SCSI)
  • You do not have to use a cable for each drive (SATA-1)
  • Thanks to wide ports, the bandwidth of several channels can be combined into one big multiplexed pipe
  • Thanks to the serial signalling of SAS, the bandwidth of wide ports will double in 2008 (4x600 MB/s instead of 4x 300 MB/s)
  • You only need one SAS cable to attach external storage
  • You can use cheaper SATA and fast SAS drives in the same storage rack
It is now crystal clear why SAS will completely replace SCSI-320 and why SAS is pretty popular among the drive manufacturers. Seagate, Fujitsu-Siemens and Hitachi have entered the SAS drive market with new SAS drives. Western Digital is the exception and has no SAS disk plans at all, but that doesn't mean they don't see a future for SAS. Western Digital views SAS racks as an ecosystem, a breeding place for their Raptors (10000 RPM Enterprise SATA disks). If it was up to Western Digital, SAS would only exist as cables and storage racks, filled with their Raptor disks.

Serial Attached SCSI or SAS has been available since late 2005 and is a logical evolution of the old parallel SCSI-320. However, it is quite revolutionary as it offers in some ways functionality that was only available with high end fibre channel storage.



Enterprise SATA

So the question becomes: will SATA conquer the enterprise market with the SAS Trojan horse, killing off the SCSI disks? Is there any reason to pay 4 times more for a SCSI based disk with barely one third the capacity of a comparable SATA disk, just because the former is about twice as fast? It seems ridiculous to pay 10 times more for the same capacity.

Just like with servers, the Reliability, Availability and Serviceability (RAS) of enterprise disks must be better than desktop disks to be able to keep the TCO under control. Enterprise disks are simply much more reliable. They use stiffer covers, heads with very high rigidity, expensive and more reliable rotation engines combined with smart servo algorithms. But that is not all; the drive electronics of SCSI disks can and do perform a lot more data integrity checks.



The rate of failures increases quickly as SATA drives are subjected to server workloads. Source: Seagate

The difference in reliability between typical SATA and real enterprise disks was demonstrated in a recent test by Seagate, which exposed three groups of 300 desktop drives to high-duty-cycle sequential and random workloads. On paper, desktop drives list failure rates similar to those of enterprise disks, but the ratings are not comparable: enterprise disks are rated under heavy, highly random workloads, while desktop drives are rated under desktop workloads. Seagate's tests revealed that desktop drives failed twice as often in the sequential server tests as with normal desktop use. When running random server or transactional workloads, the SATA drives failed four times as often![2] In other words, it is not wise to use SATA drives in transactional database environments; you need real SCSI/SAS enterprise disks, which are made for demanding server loads.

Even the so-called "Nearline" (Seagate) or "RAID Edition" (RE, Western Digital) SATA drives, which are made to operate in enterprise storage racks and are more reliable than desktop disks, are not made for mission critical, random transactional applications. Their MTBF (Mean Time Between Failures) is still at least 20% lower than that of typical enterprise disks, and under highly random server workloads they will show failure rates similar to those of desktop drives.

Also, current SATA drives on average experience an unrecoverable error once every 12.5 terabytes read or written (an error rate of 1 in 10^14 bits). Thanks to their sophisticated drive electronics, SAS/SCSI disks experience these kinds of errors 100 (!) times less often. These numbers may seem so small as to be completely negligible, but consider the situation where one of your hard drives fails in a RAID-5 or RAID-6 configuration. Rebuilding a RAID-5 array of five 200 GB SATA drives means reading 0.8 terabytes and writing 0.2 terabytes, 1 terabyte in total. So you have a 1/12.5 or 8% chance of hitting an unrecoverable error on this SATA array. A similar SCSI enterprise array would have only a 0.08% chance of one unrecoverable error. Clearly, an 8% chance of data loss is a pretty bad gamble for a mission critical application.
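The rebuild risk can be computed a bit more precisely than the 1/12.5 rule of thumb. The bit error rates (1 in 10^14 for SATA, 100 times better for SAS/SCSI) and the 1 TB rebuild figure come from the example above; the assumption that errors are independent is a simplification:

```python
import math

# Probability of hitting at least one unrecoverable error (URE) while
# rebuilding a degraded RAID-5 array. The error rates (1 in 10^14 bits
# for SATA, 100x better for SAS/SCSI) and the 1 TB rebuild figure come
# from the example in the text; independent errors are assumed.

def p_ure(bytes_touched, bit_error_rate):
    """P(at least one URE) over bytes_touched, computed stably."""
    bits = bytes_touched * 8
    return -math.expm1(bits * math.log1p(-bit_error_rate))  # 1 - (1-ber)^bits

TB = 1e12
print(f"SATA rebuild (1e-14): {p_ure(1 * TB, 1e-14):.1%}")  # roughly the 8% in the text
print(f"SAS rebuild  (1e-16): {p_ure(1 * TB, 1e-16):.2%}")  # roughly 0.08%
```

The exact figure comes out slightly under 8% because the chance of escaping every bit unscathed compounds multiplicatively, but the order of magnitude matches the article's estimate.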

Another good point that Seagate made in the same study concerns vibration. When a lot of spindles and actuators are performing very random I/O operations in a big storage rack, the result is quite a bit of rotational vibration. In the best case the actuator has to take a bit more time to settle on the right track (higher seek time); in the worst case the read operation has to be retried, which can only be detected by the software driver, so performance becomes very low. Enterprise disks can tolerate about 50% more vibration than SATA desktop drives before 50% higher seek times kill random disk performance.



Conclusion

The fact that a SAS HBA and SAS storage rack can contain both SAS and SATA is revolutionary. It is not unlikely that SATA will push all SCSI based disks - SCSI-320, FC and SAS - into a niche market, more precisely the transactional database storage market. The high price premium for the 15000 RPM SCSI based disks can only be justified if they are used in a mission critical environment with OLTP or similar transactional workloads. In that case, the ultra low access times and slightly higher reliability pays off.

Western Digital tries to convince us that even in that niche market, expensive SAS disks should be replaced by 10000 rpm enterprise SATA disks. It is not likely that the WD Raptors will replace the twice as expensive SAS disks, as the latter still perform better thanks to higher RPM and lower seek times. But of course, we'll give them the benefit of the doubt until we have performed some thorough testing. If the size of your storage rack is not really a concern, twice as many SATA drives might outperform a SAS drive configuration. If space, power consumption and performance are your most important concerns, the relatively expensive but small 2.5 inch SAS disks (10000 rpm) are the best option.

For all other storage needs, SATA in a SAS storage rack is the most interesting candidate. It is however unwise to use cheap desktop SATA disks in large disk arrays for enterprise use: intensive use of those arrays will lead to very slow access times (as seek times increase significantly with vibration) and high failure rates. At a small price premium, "Nearline" or "Enterprise" SATA disks are available which are less sensitive to vibration and much more reliable. In a nutshell: SAS, FC and SCSI drives are still the only choice for OLTP database applications, but the cheaper "Nearline", "Enterprise" and "RE" disks are probably going to chase the SCSI based drives out of e-mail, archive, file, FTP and backup servers.

Choosing a disk interface is only a small part of choosing the right storage solution. What about SAN, NAS, iSCSI, DAS, Switched Fabric? What influence will the SAS/SATA revolution have on the topology of storage? Watch out for our next server guide which will continue to guide you through the storage and server jungle!


Thanks and References

Special thanks to Steven Peeters, storage solutions expert at eSys distribution, for allowing us to test the Promise Vtrak J300s. I would also like to thank Remy van Heugten, Aimée Boerrigter and Kenneth Heal of Promise EMEA for their support.

[1] "Evolution in Hard Disk Drive Technology: SAS and SATA", IDC, Dave Reinsel September 2005

[2] WinHEC 2005, "SATA in the Enterprise," and Seagate Market Research
