After the exhaustive building and testing process, we've found several areas where we could have improved the original build.

Improved CPU

When we initially decided which hardware components to use, we thought we would not need very much CPU.  While we are not doing any type of parity with our storage, we neglected to account for the checksumming that ZFS does to maintain data integrity.  This checksumming consumes significantly more processor time than we had originally anticipated.  Many tests were using 70% or more of the CPU.  We believe that at this high of CPU utilization that there is significant IO contention.  Our next ZFS based storage system will probably be based on a dual socket platform and higher clocked (possibly more cores also) CPU's, giving additional headroom for the checksumming and allowing you to use more advanced features that consume CPU resources like Deduplication and Compression.  It is not a noticeable problem when testing with gigabit Ethernet speeds.  We have been doing some additional benchmarking using 20Gbps InfiniBand, and we have been able to max out the CPU in the ZFS server well before we approached the limits of 20Gbps networking.

More Memory

Going into this project, we did not really know how much main memory we would need in the ZFS SAN, or how well the system would perform with more main memory.  After doing some tests on smaller datasets that fit entirely into main memory, we decided that our next build would be 48GB of RAM or more.  As a general rule, ZFS will benefit from as much RAM as you can afford to give it.  The ARC (main memory) cache of Nexenta and OpenSolaris both function great when the dataset fits entirely into the main cache, and the performance benefits gained from having significant amounts of main memory are huge.  At some point you will run into diminishing returns.  If you're working with a dataset that is able to fit into main memory and is mainly reads, having more memory for the ARC cache will significantly improve performance.  We saw numbers in the 100's of thousands of IOPS when working just out of main memory for random reads.  On the flip side of the coin, if your workload is mainly writes then adding 48GB of RAM or more may not give you any noticeable performance advantage.

SAS drives

We thought ZFS's advanced software could overcome some of the inherent problems with slow spindle speeds, and it did up to a certain point.  ZFS on OpenSolaris was able to outperform the Promise M610i at basically the same price point.  However, we feel we left a lot more performance on the table.  Next time we deploy a ZFS server, we plan to use 15k RPM SAS drives instead of 7200 RPM SATA drives as the primary storage.  We suspect that we could have easily doubled the performance of our ZFS box in certain tests by using 15k RPM SAS drives.  The downside of the SAS drives will be increased cost and decreased capacity, but those tradeoffs will be worthwhile for us if we can double the IOPS, especially on write operations where all transactions have to be committed to disk as quickly as possible.  Reads may not be affected as much since many of the reads are coming from SSD storage already, and having SAS drives feed the SSD's would probably not increase overall performance unless your working set is large enough to exceed the total capacity of the SSD drives.

SSD Drives

In the ZFS project, we used SLC style SSD drives for ZIL and MLC style SSD drives for L2ARC.  If the price on MLC style SSD drives continues to fall, we will eventually omit the L2ARC and simply use MLC style SSD drives for all of the primary storage.  When that day comes, we will also need to use multiple SAS controllers and a much faster CPU in each ZFS box to keep up with all of the IO that it will be able to deliver.  Our only concern would be the wear leveling on the MLC drives and the ability of the drives to sustain writes over an extended period of time.  Only time will tell if the drives will be able to handle the sustained writes in an L2ARC role or as a primary storage role.

If you decide to use MLC SSD drives for actual storage instead of using SATA or SAS hard drives, then you don’t need to use cache drives. Since all of the storage drives would already be ultra fast SSD drives, there would be no performance gained from also running cache drives. You would still need to run SLC SSD drives for ZIL drives, though, as that would reduce wear on the MLC SSD drives that were being used for data storage.

If you plan to attach a lot of SSD drives, remember to use multiple SAS controllers. The SAS controller in the motherboard for our ZFS Build project is based on the LSI 1068e chipset.  We could not find specific numbers for our integrated SAS controller, but another LSI 1068 based standalone card the LSI SAS3080X-R is able sustain 140,000 IOPS. If you use enough SSD drives, you could actually saturate the  SAS controller. As a general rule of thumb, you may want to have one additional SAS controller for every 24 MLC style SSD drives.  Of course, we have not tested with 24 MLC style SSD's, that number could be higher or lower, but based on our initial performance numbers and the percieved performance of our SAS controller, we believe that 24 would be a good starting point.

Shortcomings of OpenSolaris Conclusion
Comments Locked

102 Comments

View All Comments

  • MGSsancho - Tuesday, October 5, 2010 - link

    I haven't tried this myself yet but how about using 8kb blocks and using jumbo frames on your network? possibly lower through padding to fill the 9mb packet in exchange for lower latency? I have no idea as this is just a theory. dudes in the #opensolaris irc chan have always recommended 128K or 64K depending on the data.
  • solori - Wednesday, October 20, 2010 - link

    One easy way to check this would be to export the pool from OpenSolaris and directly import it to NexentaStor and re-test. I think you'll find that the differences - as your benchmarks describe - are more linked to write caching at the disk level than partition alignment.

    NexentaStor is focused on data integrity, and tunes for that very conservatively. Since SATA disks are used in your system, NexentaStor will typically disable disk write cache (write hit) and OpenSolaris may typically disable device cache flush operations (write benefit). These two feature differences can provide the benchmark differences you're seeing.

    Also, some "workstation" tuning includes the disabling of ZIL (performance benefit). This is possible - but not recommended - in NexentaStor but has the side effect of risking application data integrity. Disabling the ZIL (in the absence of SLOG) will result in synchronous writes being committed only with transaction group commits - similar performance to having a very fast SLOG (lots of ARC space helpful too).
  • fmatthew5876 - Tuesday, October 5, 2010 - link

    I'd be very interested to see how FreeBSD ZFS benchmark results would compare to Nexenta and Open Solaris.
  • mbreitba - Tuesday, October 5, 2010 - link

    We have benchmarked FreeNAS's implimentation of ZFS on the same hardware, and the performance was abysmal. We've considered looking into the latest releases of FreeBSD but have not completed any of that testing yet.
  • jms703 - Tuesday, October 5, 2010 - link

    Have you benchmarked FreeBSD 8.1? There were a huge number of performance fixes in 8.1.

    Also, when was this article written? OpenSolaris was killed by Sun on August 13th, 2010.
  • mbreitba - Tuesday, October 5, 2010 - link

    There was a lot of work on this article just prior to the official announcement. The development of the Illumos foundation and subsequent OpenIndiana has been so rapidly paced that we wanted to get this article out the door before diving in to OpenIndiana and any other OpenSolaris derivatives. We will probably add more content talking about the demise of OpenSolaris and the Open Source alternatives that have started popping up at a later date.
  • MGSsancho - Tuesday, October 5, 2010 - link

    Not to mention that projects like illumos are currently not recommended for production, Currently only meant as a base for other distros (OpenIndiana.) Then there is Solaris 11 due soon. I'll try out the express version when its released.
  • cdillon - Tuesday, October 5, 2010 - link

    FreeNAS 0.7.x is still using FreeBSD 7.x, and the ZFS code is a bit dated. FreeBSD 8.x has newer ZFS code (v15). Hopefully very soon FreeBSD 9.x will have the latest ZFS code (v24).
  • piroroadkill - Tuesday, October 5, 2010 - link

    This is relevant to my interests, and I've been toying with the idea of setting up a ZFS based server for a while.

    It's nice to see the features it can use when you have the hardware for it.
  • cgaspar - Tuesday, October 5, 2010 - link

    You say that all writes go to a log in ZFS. That's just not true. Only synchronous writes below a certain size go into the log (either built into the pool, or a dedicated log device). All writes are held in memory in a transaction group, and that transaction group is written to the main pool at least every 10 seconds by default (in OpenSolaris - it used to be 30 seconds, and still is in Solaris 10 U9). That's tunable, and commits will happen more frequently if required, based on available ARC and data churn rate. Note that _all_ writes go into the transaction group - the log is only ever used if the box crashes after a synchronous write and before the txg commits.

    Now for the caution - you have chosen SSDs for your SLOG that don't have a backup power source for their on board caches. If you suffer power loss, you may lose data. Several SLC SSDs have recently been released that have a supercapacitor or other power source sufficient to write cache data to flash on power loss, but the current Intel like up doesn't have it. I believe the next generation Intel SSDs will.

Log in

Don't have an account? Sign up now