Benchmarks

After running our tests on the ZFS system (both under Nexenta and OpenSolaris) and the Promise M610i, we came up with the following results.  All graphs have IOPS on the Y-axis and disk queue length on the X-axis.

4k Sequential Reads

 

In the 4k Sequential Read test, we see that the OpenSolaris and Nexenta systems both outperform the Promise M610i by a significant margin as the disk queue increases.  This is a direct effect of the L2ARC cache.  Interestingly enough, the OpenSolaris and Nexenta systems seem to trend identically, but the Nexenta system is measurably slower than the OpenSolaris system.  We are unsure why, as they are running on the same hardware and the build of Nexenta we ran was based on the same build of OpenSolaris that we tested.  We contacted Nexenta about this performance gap, but they did not have an explanation.  One hypothesis is that the Nexenta software uses more memory for features like the web GUI, leaving less ARC available to the Nexenta solution than to a stock OpenSolaris install.
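
One way to check that hypothesis is to compare the ARC statistics the two installs report.  The short Python sketch below is our own illustration (not part of our test harness); it assumes the standard zfs:0:arcstats counters that OpenSolaris exposes through kstat, and it reports the current ARC size, the ARC ceiling, and the L2ARC hit rate.

    #!/usr/bin/env python
    # Illustrative only: read the ZFS ARC/L2ARC counters via the Solaris
    # kstat(1M) utility to see how much ARC each install actually has.
    import subprocess

    def arcstat(name):
        """Return one zfs:0:arcstats counter as an integer."""
        out = subprocess.check_output(
            ["kstat", "-p", "zfs:0:arcstats:" + name],
            universal_newlines=True)
        # `kstat -p` prints "module:instance:name:statistic<TAB>value"
        return int(out.split()[-1])

    if __name__ == "__main__":
        size, c_max = arcstat("size"), arcstat("c_max")
        l2_hits, l2_misses = arcstat("l2_hits"), arcstat("l2_misses")
        print("ARC in use: %.1f GiB of %.1f GiB max"
              % (size / 2.0 ** 30, c_max / 2.0 ** 30))
        if l2_hits + l2_misses:
            print("L2ARC hit rate: %.1f%%"
                  % (100.0 * l2_hits / (l2_hits + l2_misses)))

If the Nexenta box consistently reported a smaller ARC ceiling or a lower L2ARC hit rate on the same hardware, that would support the memory-overhead explanation.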

 

4k Random Write

 

In the 4k Random Write test, the OpenSolaris and Nexenta systems again come out ahead of the Promise M610i.  The Promise box is nearly flat, an indicator that it reaches the limits of its hardware quite quickly.  The OpenSolaris and Nexenta systems write faster as the disk queue increases, which seems to indicate better re-ordering of data to make the writes more sequential to the disks, as the conceptual sketch below illustrates.
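
To illustrate what we mean by re-ordering (this is a conceptual toy of ours, not ZFS code): with a deeper queue there are more outstanding writes to pick from, so they can be sorted by on-disk offset and adjacent blocks merged into longer sequential runs before they reach the spindles.

    # Conceptual sketch: sort queued (offset, length) writes and merge any
    # that touch adjacent blocks into longer sequential runs.
    def coalesce(queue):
        runs = []
        for offset, length in sorted(queue):
            if runs and offset == runs[-1][0] + runs[-1][1]:
                # This write continues the previous run, so extend it.
                runs[-1] = (runs[-1][0], runs[-1][1] + length)
            else:
                runs.append((offset, length))
        return runs

    # Eight random 4k writes (offsets in KB) collapse into two sequential runs.
    pending = [(32, 4), (8, 4), (12, 4), (0, 4), (4, 4), (36, 4), (16, 4), (40, 4)]
    print(coalesce(pending))  # [(0, 20), (32, 12)]

At a queue depth of 1 there is nothing to sort, but the deeper the queue, the more merging is possible, which matches the upward trend we see on the two ZFS systems.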

  

4k Random 67% Write 33% Read

 

The 4k 67% Write 33% Read test again gives the edge to the OpenSolaris and Nexenta systems, while the Promise M610i is nearly flatlined.  This is most likely a result of both write re-ordering and the very effective L2ARC caching.

  

4k Random Reads

 

4k Random Reads again come out in favor of the OpenSolaris and Nexenta systems.  While the Promise M610i does increase its performance as the disk queue increases, it's nowhere near the levels of performance that the OpenSolaris and Nexenta systems can deliver with their L2ARC caching.

  

8k Random Read

 

8k Random Reads indicate a similar trend to the 4k Random Reads with the OpenSolaris and Nexenta systems outperforming the Promise M610i.  Again, we see the OpenSolaris and Nexenta systems trending very similarly but with the OpenSolaris system significantly outperforming the Nexenta system.

  

8k Sequential Read

 

8k Sequential reads have the OpenSolaris and Nexenta systems trailing at the first data point, and then running away from the Promise M610i at higher disk queues.  It's interesting to note that the Nexenta system outperforms the OpenSolaris system at several of the data points in this test.

   

8k Random Write

 

8k Random writes play out like most of the other tests we've seen, with the OpenSolaris and Nexenta systems taking top honors and the Promise M610i trailing.  Again, OpenSolaris beats out Nexenta on the same hardware.

  

8k Random 67% Write 33% Read

 

8k Random 67% Write 33% Read again favors the OpenSolaris and Nexenta systems, with the Promise M610i trailing.  While the OpenSolaris and Nexenta systems start off nearly identically for the first five data points, at a disk queue of 24 or higher the OpenSolaris system steals the show.

  

16k Random 67% Write 33% Read

 

16k Random 67% Write 33% Read gives us a show that we're familiar with.  OpenSolaris and Nexenta both soundly beat the Promise M610i at higher disk queues.  Again we see the pattern of the OpenSolaris and Nexenta systems trending nearly identically, but with the OpenSolaris system outperforming the Nexenta system at all data points.

  

16k Random Write

 

16k Random write shows the Promise M610i starting off faster than the Nexenta system and nearly on par with the OpenSolaris system, but quickly flattening out.  The Nexenta box again trends upward, but cannot keep up with the OpenSolaris system.

  

16k Sequential Read

 

The 16k Sequential read test is the first test where the Promise M610i outperforms OpenSolaris and Nexenta at all data points.  The OpenSolaris and Nexenta systems both trend upwards at the same rate, but cannot catch the M610i.

  

16k Random Read

 

The 16k Random Read test goes back to the same pattern that we've been seeing, with the OpenSolaris and Nexenta systems running away from the Promise M610i.  Again we see the OpenSolaris system take top honors with the Nexenta system trending similarly, but never reaching the performance metrics seen on the OpenSolaris system.

  

32k Random 67% Write 33% Read

 

32k Random 67% Write 33% Read has the OpenSolaris system on top, the Promise M610i in second place, and the Nexenta system trailing both.  We're not really sure what to make of this, as we expected the Nexenta system to follow a pattern similar to what we had seen before.

  

32k Random Read

 

32k Random Read has the OpenSolaris system running away from everything else.  On this test the Nexenta system and the Promise M610i are very similar, with the Nexenta system edging out the Promise M610i at the highest queue depths.

  

32k Sequential Read

 

32k Sequential Reads proved to be a strong point for the Promise M610i.  It outperformed the OpenSolaris and Nexenta systems at all data points.  Clearly there is something in the Promise M610i that helps it excel at 32k Sequential Reads.

 

32k Random Write

 

  

32k random writes have the OpenSolaris system on top again, with the Promise M610i in second place and the Nexenta system trailing far behind.  All of the graphs trend similarly, with little dips and rises, but never moving much from the initial reading.

After all the tests were done, we had to sit down, take a hard look at the results, and try to formulate some ideas about how to interpret this data.  We will discuss this in our conclusion.

Comments

  • sfw - Wednesday, October 13, 2010 - link

    I'm just wondering about SAS bandwidth. If you connect the backplane via 4 SAS lanes you have a theoretical peak throughput of around 1,200MB/s. The RE3 has an average read/write speed of around 90MB/s, so you could already saturate the backplane connection with about thirteen RE3s at average speed. Given the fact that you also connect the SSDs, this seems to be a bottleneck you may wish to consider for your "areas where we could have improved the original build" list.

    By the way: really great article! Thanks for it...
  • Mattbreitbach - Wednesday, October 13, 2010 - link

    While pure sequential reads (from all drives at the same time) would yield a bottleneck, I don't know of any instances where you would actually encounter that in our environment. Throw in one random read, or one random write, and suddenly the heads in the drives are seeking and delivering substantially lower performance than in a purely sequential read situation.

    If this were purely a staging system for disk-to-tape backups, and the reads were 100% sequential, I would consider more options for additional backplane bandwidth. Since that isn't a concern at this time, and since this system will be used primarily for VM storage and our workloads show a pretty substantial random write access pattern (67% write / 33% read is pretty much the norm, fully random), the probability of saturating the SAS bus is greatly reduced.
  • sfw - Thursday, October 14, 2010 - link

    Concerning random IO you are surely right, and the impressive numbers of your box prove this. But even if you don't have a sequential workload, there is still "zpool scrub" or the possible need to resilver one or more drives, which will fill your bandwidth.

    I've checked the options at Supermicro and besides the SC846E1 they are offering E16, E2, and E26 versions with improved backplane bandwidth. The difference in price tag isn't that huge and should not have much impact if you are thinking of 15k SAS or SSD drives.
  • Mattbreitbach - Thursday, October 14, 2010 - link

    The E2 and E26 are both dual-controller designs, which are meant for dual SAS controllers so that you can have failover capabilities.

    The E16 is the same system, but with SAS 2.0 support, which doubles the available bandwidth. I can definitely see the E16 or the E26 being a very viable option for anyone needing more bandwidth.
  • solori - Thursday, October 21, 2010 - link

    Actually (perhaps you meant to say this), the E1 and E16 are single SAS expander models, with the E16 supporting SAS2/6G. The E2 and E26 are dual SAS expander models, with the E26 supporting SAS2/6G.

    The dual expander design allows for MPxIO to SAS disks via the second SAS port on those disks. The single expander version is typical of SATA-only deployments. Each expander has auto-sensing SAS ports (typically SFF8088) that can connect to an HBA or additional SAS expanders (cascade).

    With SAS disks, MPxIO is a real option: allowing for reads and writes to take different SAS paths. Not so for SATA - I know of no consumer SATA disk with a second SATA port.

    As for the 90MB/s average bandwidth of a desktop drive: you're not going to see that in a ZFS application. When ZIL writes happen without an SLOG device, they are written to the pool immediately, looking much like small-block random writes. Later, when the transaction group commits, the same ZIL data is written again with the transaction group (but never re-read from the original ZIL pool write since it's still in ARC). For most SATA mechanisms I've tested, there is a disproportionate hit on read performance in the presence of these random writes (i.e., 10% random writes may result in a 50%+ drop in sequential read performance).

    Likewise (and this may be something to stress in a follow-up), the behavior of the ZFS transaction group promises to create a periodic burst of sequential write behavior when committing transaction groups. This has the effect of creating periods of very little activity - where only ZIL writes to the pool take place - followed by a large burst of writes (about every 20-30 seconds). This is where workload determines the amount of RAM/ARC space your ZFS device needs.

    In essence, you need 20-30 seconds of RAM. Writing target 90MB/s (sequential)? You need 2GB additional RAM to do that. Want to write 1200MB/s (assume SAS2 mirror limit)? You'll need 24GB of additional RAM to do that (not including OS footprint and other ARC space for DDT, MRU and MFU data). Also, the ARC is being used for read caching as well, so you'll want enough memory for the read demand as well.

    There are a lot of other reasons why your "mythical" desktop sequential limits will rarely be seen: variable block size, raid level (raidz/z2/z3/mirror) and metadata transactions. SLOG, L2ARC and lots of RAM can reduce the "pressure" on the disks, but there always seem to be enough pesky, random reads and writes to confound most SATA firmware from delivering its "average" rated performance. On average, I expect to see about 30-40% of "vendor specified average bandwidth" in real world applications without considerable tuning; and then, perhaps 75-80%.
  • dignus - Sunday, October 17, 2010 - link

    It's still early Sunday morning over here, but I'm missing something. You have 26 disks in your setup, yet your mainboard has only 14 SATA connectors. How are your other disks connected to the mainboard?
  • Mattbreitbach - Sunday, October 17, 2010 - link

    The 24 drives in the front of the enclosure are connected via a SAS expander. That allows you to add additional ports without needing a separate cable for each individual drive.
  • sor - Sunday, February 20, 2011 - link

    I know this is old, but it wasn't mentioned that you can choose between gzip and lz type compression. The lz was particularly interesting to us because we hardly noticed the cpu increase, while performance improved slightly and we got almost as good compression as the fastest gzip option.
  • jwinsor566 - Wednesday, February 23, 2011 - link

    Thanks, guys, for an excellent post on your ZFS SAN/NAS testing. I am in the process of building my own as well. I was wondering if there has been any further testing, or if you have invested in new hardware and run the benchmarks again?

    Also, do you think this would be a good solution for disk backup? Do you think backup software would make use of the ZIL when writing to the NAS/SAN?

    Thanks
  • shriganesh - Thursday, February 24, 2011 - link

    I have read many great articles at AnandTech, but this is the best so far! I loved the way you have presented it. It's very natural and you have mentioned most of the pitfalls. It's a splendid article; keep more like these coming!

    PS: I wanted to congratulate the author for this great work. Just for thanking you, I joined Anandtech ;) Though I wanted to share a thought or two previously, I was just compelled enough to go through the boring process of signing up :D
