After the exhaustive building and testing process, we've found several areas where we could have improved the original build.

Improved CPU

When we initially decided which hardware components to use, we thought we would not need very much CPU. Although we are not doing any type of parity with our storage, we neglected to account for the checksumming that ZFS does to maintain data integrity. This checksumming consumes significantly more processor time than we had originally anticipated: many tests were using 70% or more of the CPU, and at that level of utilization we believe there is significant IO contention. Our next ZFS-based storage system will probably be built on a dual-socket platform with higher-clocked (and possibly higher core count) CPUs, giving additional headroom for the checksumming and allowing us to use more CPU-intensive features like deduplication and compression. The CPU is not a noticeable problem when testing at gigabit Ethernet speeds, but in additional benchmarking over 20Gbps InfiniBand we were able to max out the CPU in the ZFS server well before we approached the limits of the 20Gbps network.
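
For reference, here is a minimal sketch of how the CPU-hungry ZFS features can be inspected and toggled on an OpenSolaris/Nexenta box while a benchmark runs; the pool and dataset names (tank/vm) are purely illustrative:

    # Watch overall CPU utilization while the benchmark is running
    mpstat 5

    # See which CPU-intensive ZFS features are active on a dataset
    zfs get checksum,compression,dedup tank/vm

    # Compression and dedup can be disabled per dataset if the CPU becomes
    # the bottleneck (this only affects newly written data)
    zfs set compression=off tank/vm
    zfs set dedup=off tank/vm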

More Memory

Going into this project, we did not really know how much main memory we would need in the ZFS SAN, or how much the system would benefit from more of it. After running some tests on smaller datasets that fit entirely into main memory, we decided that our next build would use 48GB of RAM or more. As a general rule, ZFS will benefit from as much RAM as you can afford to give it. The ARC (main-memory cache) in both Nexenta and OpenSolaris works wonderfully when the dataset fits entirely into it, and the performance benefits of having a large amount of main memory are huge; we saw numbers in the hundreds of thousands of IOPS for random reads served straight from main memory. At some point you will run into diminishing returns, though. If your dataset fits into main memory and your workload is mostly reads, more memory for the ARC cache will significantly improve performance. On the flip side, if your workload is mostly writes, then adding 48GB of RAM or more may not give you any noticeable performance advantage.
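
As a rough sketch of what to look at when sizing the ARC (these are standard OpenSolaris tools; the 24GB cap shown is just an example value, not a recommendation):

    # Check the current ARC size, target, and hit/miss counters
    kstat -n arcstats | egrep 'size|c_max|hits|misses'

    # If the ARC needs to be capped (for example to leave memory for other
    # services), a line like this can be added to /etc/system, followed by
    # a reboot -- 0x600000000 is 24GB
    #   set zfs:zfs_arc_max=0x600000000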

SAS Drives

We thought ZFS's advanced software could overcome some of the inherent problems with slow spindle speeds, and it did, up to a point. ZFS on OpenSolaris was able to outperform the Promise M610i at essentially the same price point. However, we feel we left a lot of performance on the table. The next time we deploy a ZFS server, we plan to use 15k RPM SAS drives instead of 7200 RPM SATA drives as the primary storage; we suspect we could have easily doubled the performance of our ZFS box in certain tests by doing so. The downsides of SAS drives are increased cost and decreased capacity, but those tradeoffs are worthwhile for us if we can double the IOPS, especially on write operations, where every transaction has to be committed to disk as quickly as possible. Reads may not be affected as much, since many of the reads are already coming from SSD storage, and having SAS drives feed the SSDs would probably not increase overall performance unless your working set is large enough to exceed the total capacity of the SSDs.
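
A quick back-of-the-envelope calculation shows why we expect roughly double the IOPS from 15k RPM drives; the seek and rotational latency figures below are rough assumptions, not measured values:

    # ~8.5ms average seek + ~4.2ms rotational latency for a 7200 RPM SATA drive
    echo '1000/(8.5+4.2)' | bc -l    # ~79 IOPS per drive

    # ~3.8ms average seek + ~2.0ms rotational latency for a 15k RPM SAS drive
    echo '1000/(3.8+2.0)' | bc -l    # ~172 IOPS per drive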

SSD Drives

In the ZFS project, we used SLC SSDs for the ZIL and MLC SSDs for the L2ARC. If the price of MLC SSDs continues to fall, we will eventually omit the L2ARC and simply use MLC SSDs for all of the primary storage. When that day comes, we will also need multiple SAS controllers and a much faster CPU in each ZFS box to keep up with all of the IO those drives will be able to deliver. Our only concern is wear leveling on the MLC drives and their ability to sustain writes over an extended period of time; only time will tell whether the drives can handle sustained writes in an L2ARC role or as primary storage.
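
For anyone replicating this kind of layout, here is roughly how the SSDs attach to an existing pool; the pool and device names are hypothetical, and the ZIL devices are mirrored for safety:

    # Add a mirrored pair of SLC SSDs as the dedicated ZIL (SLOG)
    zpool add tank log mirror c3t0d0 c3t1d0

    # Add MLC SSDs as L2ARC cache devices (cache devices cannot be mirrored)
    zpool add tank cache c3t2d0 c3t3d0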

If you decide to use MLC SSDs for the actual storage instead of SATA or SAS hard drives, then you don't need cache drives. Since all of the storage drives would already be ultra-fast SSDs, there would be no performance gained from also running cache drives. You would still want to run SLC SSDs as ZIL devices, though, as that reduces wear on the MLC SSDs being used for data storage.
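
A minimal sketch of what such an all-SSD pool might look like (device names are invented for illustration): the MLC SSDs form the data vdevs, a mirrored pair of SLC SSDs takes the ZIL, and no cache devices are configured:

    zpool create tank \
        mirror c2t0d0 c2t1d0 \
        mirror c2t2d0 c2t3d0 \
        log mirror c3t0d0 c3t1d0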

If you plan to attach a lot of SSDs, remember to use multiple SAS controllers. The SAS controller on the motherboard in our ZFS Build project is based on the LSI 1068e chipset. We could not find specific numbers for our integrated SAS controller, but another LSI 1068-based standalone card, the LSI SAS3080X-R, is able to sustain 140,000 IOPS. If you use enough SSDs, you could actually saturate the SAS controller. As a general rule of thumb, you may want one additional SAS controller for every 24 MLC SSDs. Of course, we have not tested with 24 MLC SSDs, so that number could be higher or lower, but based on our initial performance numbers and the perceived performance of our SAS controller, we believe that 24 is a good starting point.
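
The easiest way to see whether a controller is becoming the choke point is to watch the pool while it is under load; the per-SSD IOPS figure below is an assumption based on our own numbers, not a tested limit:

    # Per-device and per-pool bandwidth/IOPS, refreshed every 5 seconds
    zpool iostat -v tank 5

    # Rough sanity check: ~6,000 random IOPS per MLC SSD against a
    # controller rated around 140,000 IOPS
    echo '140000/6000' | bc    # ~23 drives per controller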

Comments

  • sfw - Wednesday, October 13, 2010 - link

    I'm just wondering about SAS bandwidth. If you connect the backplane via 4 SAS lanes you have a theoretical peak throughput of around 1,200MB/s. The RE3 has an average read/write speed of around 90MB/s, so you could already saturate the backplane connection with about thirteen RE3s at average speed. Given the fact that you also connect the SSDs, this seems to be a bottleneck you may wish to consider for your "areas where we could have improved the original build" list.

    By the way: really great article! Thanks for it...
  • Mattbreitbach - Wednesday, October 13, 2010 - link

    While pure sequential reads (from all drives at the same time) would yield a bottleneck, I don't know of any instances where you would actually encounter that in our environment. Throw in one random read, or one random write, and suddenly the heads in the drives are seeking and delivering substantially lower performance than in a purely sequential read situation.

    If this were purely a staging system for disk-to-tape backups, and the reads were 100% sequential, I would consider more options for additional backplane bandwidth. Since that isn't a concern at this time, and this system will be used primarily for VM storage where our workloads show a pretty substantial random access pattern (67% write / 33% read, fully random, is pretty much the norm), the probability of saturating the SAS bus is greatly reduced.
  • sfw - Thursday, October 14, 2010 - link

    Concerning random IO you are surely right, and the impressive numbers from your box prove this. But even if you don't have a sequential workload, there is still "zpool scrub" or the possible need to resilver one or more drives, which will fill your bandwidth.

    I've checked the options at Supermicro and besides the SC846E1 they are offering E16, E2 and E26 versions with improved backplane bandwidth. The difference in price tag isn't that huge and should not have much impact if you are thinking of 15k SAS or SSD drives.
  • Mattbreitbach - Thursday, October 14, 2010 - link

    The E2 and E26 are both dual-controller designs, which are meant for dual SAS controllers so that you can have failover capabilities.

    The E16 is the same system, but with SAS 2.0 support, which doubles the available bandwidth. I can definitely see the E16 or the E26 being a very viable option for anyone needing more bandwidth.
  • solori - Thursday, October 21, 2010 - link

    Actually (perhaps you meant to say this), the E1 and E16 are single SAS expander models, with the E16 supporting SAS2/6G. The E2 and E26 are dual SAS expander models, with the E26 supporting SAS2/6G.

    The dual expander design allows for MPxIO to SAS disks via the second SAS port on those disks. The single expander version is typical of SATA-only deployments. Each expander has auto-sensing SAS ports (typically SFF8088) that can connect to HBA or additional SAS expanders (cascade.)

    With SAS disks, MPxIO is a real option: allowing for reads and writes to take different SAS paths. Not so for SATA - I know of no consumer SATA disk with a second SATA port.

    As for the 90MB/s average bandwidth of a desktop drive: you're not going to see that in a ZFS application. When ZIL writes happen without an SLOG device, they are written to the pool immediately, looking much like small-block random writes. Later, when the transaction group commits, the same ZIL data is written again with the transaction group (but never re-read from the original ZIL pool write, since it's still in ARC). For most SATA mechanisms I've tested, there is a disproportionate hit on read performance in the presence of these random writes (i.e. 10% random writes may result in a 50%+ drop in sequential read performance).

    Likewise (and this may be something to stress in a follow-up), the behavior of the ZFS transaction group promises to create a periodic burst of sequential write behavior when committing transaction groups. This has the effect of creating periods of very little activity - where only ZIL writes to the pool take place - followed by a large burst of writes (about every 20-30 seconds). This is where workload determines the amount of RAM/ARC space your ZFS device needs.

    In essence, you need 20-30 seconds of RAM. Writing target 90MB/s (sequential)? You need 2GB additional RAM to do that. Want to write 1200MB/s (assume SAS2 mirror limit)? You'll need 24GB of additional RAM to do that (not including OS footprint and other ARC space for DDT, MRU and MFU data). Also, the ARC is being used for read caching as well, so you'll want enough memory for the read demand as well.

    There are a lot of other reasons why your "mythical" desktop sequential limits will rarely be seen: variable block size, raid level (raidz/z2/z3/mirror) and metadata transactions. SLOG, L2ARC and lots of RAM can reduce the "pressure" on the disks, but there always seem to be enough pesky, random reads and writes to confound most SATA firmware from delivering its "average" rated performance. On average, I expect to see about 30-40% of "vendor specified average bandwidth" in real world applications without considerable tuning; and then, perhaps 75-80%.
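
    A minimal sketch of that sizing arithmetic, using the burst interval and throughput figures above as assumptions:

        # ~25 seconds of buffered writes at 90MB/s
        echo '90*25/1024' | bc -l      # ~2.2GB of additional RAM

        # ~20 seconds of buffered writes at 1200MB/s
        echo '1200*20/1024' | bc -l    # ~23GB of additional RAM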
  • dignus - Sunday, October 17, 2010 - link

    It's still early Sunday morning over here, but I'm missing something. You have 26 disks in your setup, yet your mainboard has only 14 SATA connectors. How are your other disks connected to the mainboard?
  • Mattbreitbach - Sunday, October 17, 2010 - link

    The 24 drives in the front of the enclosure are connected via a SAS expander. That allows you to add additional ports without having to have a separate cable for each individual drive.
  • sor - Sunday, February 20, 2011 - link

    I know this is old, but it wasn't mentioned that you can choose between gzip and lz type compression. The lz type was particularly interesting to us because we hardly noticed the CPU increase, while performance improved slightly and we got almost as good compression as the fastest gzip option.
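
    A minimal sketch of how those options are set (the dataset name tank/vm is hypothetical; in OpenSolaris-era ZFS the "lz type" algorithm is lzjb, and gzip accepts levels 1-9, with gzip-1 being the fastest):

        zfs set compression=lzjb tank/vm      # light on CPU, modest ratio
        zfs set compression=gzip-1 tank/vm    # better ratio, more CPU
        zfs get compressratio tank/vm         # check the achieved ratio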
  • jwinsor566 - Wednesday, February 23, 2011 - link

    Thanks guys for an excellent post on your ZFS SAN/NAS testing. I am in the process of building my own as well. I was wondering if there has been any further testing, or if you have invested in new hardware and run the benchmarks again?

    Also, do you think this would be a good solution for disk backup? Do you think backup software would make use of the ZIL when writing to the NAS/SAN?

    Thanks
  • shriganesh - Thursday, February 24, 2011 - link

    I have read many great articles at Anandtech, but this is the best so far! I loved the way you have presented it. It's very natural and you have mentioned most of the pitfalls. It's a splendid article - keep more like these coming!

    PS: I wanted to congratulate the author on this great work. Just to thank you, I joined Anandtech ;) Though I had wanted to share a thought or two before, this was what finally compelled me to go through the boring process of signing up :D
