Test Blade Configuration

 

Our bladecenters are currently full of high-performance blades that run our virtualized hosting environment. Because those blades are in production, we could not use them to test the performance of our ZFS system, so we had to build another blade. We wanted it to be similar in spec to the blades we were already using, but we also wanted to take advantage of some of the technology that has come out since many of our blades went into production. Our current environment is a mix: some blades run dual Xeon 5420 processors with 32GB RAM and dual 250GB SATA hard drives, and others run dual Xeon 5520 processors with 48GB RAM and dual 32GB SAS HDDs. We use the RAID1 volumes in each blade as boot volumes. All of our content is stored on RAID10 SANs.

Following that tradition, we decided to use the SuperMicro SBI-7126T-S6 as our base blade. We populated it with dual Xeon 5620 processors (Intel Xeon Nehalem/Westmere based 32nm quad core), 48GB of Registered ECC DDR3 memory, dual Intel X25-V SSDs (for boot in a RAID1 mirror) and a SuperMicro AOC-IBH-XDD InfiniBand mezzanine card.



Front panel of the SBI-7126T-S6 Blade Module



Intel X25-V SSD boot drives installed




Dual Xeon 5620 processors, 48GB Registered ECC DDR3 memory, InfiniBand DDR mezzanine card installed

Our tests will be run using Windows Server 2008 R2 and Iometer. We will be testing iSCSI connections over gigabit Ethernet, as this is what most budget SAN builds are based on. Our blades also offer 10Gb Ethernet and 20Gb InfiniBand connectivity, but those connections are outside the scope of this article.

 

Price of OpenSolaris box

The OpenSolaris box, as tested, was quite inexpensive for the amount of hardware in it. The overall cost of the OpenSolaris system was $6,765. The breakdown is as follows:

| Part | Number | Cost | Total |
|---|---|---|---|
| Chassis | 1 | $1,199.00 | $1,199.00 |
| RAM | 2 | $166.00 | $332.00 |
| Motherboard | 1 | $379.00 | $379.00 |
| Processor | 1 | $253.00 | $253.00 |
| HDD - SLC - Log | 2 | $378.00 | $756.00 |
| HDD - MLC - Cache | 2 | $414.00 | $828.00 |
| HDD - MLC - Boot 40GB | 2 | $109.00 | $218.00 |
| HDD - WD 1TB RE3 | 20 | $140.00 | $2,800.00 |
| Total | | | $6,765.00 |
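
As a quick check on the table, the total can be recomputed from the quantity and unit cost of each line. The Python sketch below is only an illustration: the variable and function names are made up for this example, and the prices are copied from the table as whole dollars.

```python
# Line items copied from the OpenSolaris price table:
# (part, quantity, unit cost in whole US dollars).
OPENSOLARIS_PARTS = [
    ("Chassis", 1, 1199),
    ("RAM", 2, 166),
    ("Motherboard", 1, 379),
    ("Processor", 1, 253),
    ("HDD - SLC - Log", 2, 378),
    ("HDD - MLC - Cache", 2, 414),
    ("HDD - MLC - Boot 40GB", 2, 109),
    ("HDD - WD 1TB RE3", 20, 140),
]


def line_total(quantity, unit_cost):
    """Extended cost of one line item."""
    return quantity * unit_cost


def bill_total(parts):
    """Sum of the extended costs of every line item."""
    return sum(line_total(qty, unit) for _, qty, unit in parts)


for name, qty, unit in OPENSOLARIS_PARTS:
    print(f"{name:<24}{qty:>3} x ${unit:>8,.2f} = ${line_total(qty, unit):>9,.2f}")
print(f"Total: ${bill_total(OPENSOLARIS_PARTS):,.2f}")
```

The summed line items agree with the $6,765 figure quoted above.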

Price of Nexenta

While OpenSolaris is completely free, Nexenta is a bit different, as there are software costs to consider when building a Nexenta system. There are three versions to choose from if you decide to use Nexenta instead of OpenSolaris. The first is Nexenta Core Platform, which allows unlimited storage but does not have the GUI. The second is Nexenta Community Edition, which supports up to 12TB of storage and a subset of the features. The third is their high-end solution, Nexenta Enterprise, a paid product that has a broad feature set and support, accompanied by a price tag.

The hardware costs for the Nexenta system are identical to those of the OpenSolaris system. We opted for the trial Enterprise license for testing (unlimited storage, 45 days), as we have 18TB of billable storage. Nexenta charges based on the number of TB in your storage array. As configured, the Nexenta license for our system would cost $3,090, bringing the total cost of a Nexenta Enterprise licensed system to $9,855.

Price of Promise box

Costs for the Promise M610i are relatively simple to calculate: you have the cost of the chassis and the cost of the drives. The breakdown of those costs is below.

| Part | Number | Cost | Total |
|---|---|---|---|
| Promise M610i | 1 | $4,170.00 | $4,170.00 |
| HDD - WD 1TB RE3 | 16 | $140.00 | $2,240.00 |
| Total | | | $6,410.00 |
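
To put the three price tags on a common footing, one option is to divide each system cost by its raw data-drive capacity. The Python sketch below does that under stated assumptions: raw capacity is taken as the number of 1TB RE3 drives in each price table, RAID or ZFS redundancy overhead and the SSDs are ignored, and the function name is invented for this illustration. It is not a figure the article itself reports.

```python
# System costs quoted in the article, in US dollars.
OPENSOLARIS_COST = 6765
NEXENTA_LICENSE = 3090
PROMISE_COST = 6410

# Assumption: raw capacity is the count of 1TB WD RE3 drives in each table.
SYSTEMS = {
    "OpenSolaris": (OPENSOLARIS_COST, 20),
    "Nexenta Enterprise": (OPENSOLARIS_COST + NEXENTA_LICENSE, 20),
    "Promise M610i": (PROMISE_COST, 16),
}


def cost_per_raw_tb(cost, data_drives, drive_tb=1):
    """Dollars per TB of raw (pre-redundancy) data-drive capacity."""
    return cost / (data_drives * drive_tb)


for name, (cost, drives) in SYSTEMS.items():
    print(f"{name}: ${cost:,} total, ${cost_per_raw_tb(cost, drives):,.2f} per raw TB")
```

Usable capacity will be lower on every system once a redundancy level is chosen, so these numbers are only a rough comparison.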

How we tested with Iometer

All of our tests are run from Iometer using a custom configuration. The .icf configuration file can be found here. We ran the following tests starting at a queue depth of 9 and ending at a queue depth of 33, in steps of 3. This lets us test from below a queue depth of 1 per drive up to a queue depth of around 2 per drive (depending on the storage system being tested).
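
The relationship between the overall queue depth and the per-drive queue depth can be sketched in a few lines of Python. The drive counts here are an assumption: they are the numbers of 1TB RE3 drives in the price tables (20 for the ZFS box, 16 for the Promise), and the sketch assumes every one of them sits behind the iSCSI target being tested. The function names are illustrative only.

```python
def queue_depths(start=9, stop=33, step=3):
    """Overall queue depths used in the sweep, including the final value."""
    return list(range(start, stop + 1, step))


def per_drive_depth(total_depth, drive_count):
    """Average outstanding I/Os per drive if load spreads evenly."""
    return total_depth / drive_count


# Assumed data-drive counts, taken from the price tables.
DRIVE_COUNTS = {"OpenSolaris/Nexenta": 20, "Promise M610i": 16}

depths = queue_depths()
print("Queue depths:", depths)
for system, drives in DRIVE_COUNTS.items():
    low = per_drive_depth(depths[0], drives)
    high = per_drive_depth(depths[-1], drives)
    print(f"{system}: {low:.2f} to {high:.2f} outstanding I/Os per drive")
```

With either drive count, the sweep starts below one outstanding I/O per drive and ends in the neighborhood of two.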

The tests were run in this order, and each test was run for 3 minutes at each queue depth.

4k Sequential Read

4k Random Write

4k Random 67% Write 33% Read

4k Random Read

8k Random Read

8k Sequential Read

8k Random Write

8k Random 67% Write 33% Read

16k Random 67% Write 33% Read

16k Random Write

16k Sequential Read

16k Random Read

32k Random 67% Write 33% Read

32k Random Read

32k Sequential Read

32k Random Write
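
For a sense of how long one pass of this profile takes, the Python sketch below multiplies out the test matrix: four block sizes, four access patterns, the nine queue depths from 9 to 33, and 3 minutes per step. It is a rough estimate that ignores ramp-up time and the time Iometer spends creating its working set, and it enumerates the tests in a tidy order rather than the run order listed above.

```python
from itertools import product

BLOCK_SIZES_KB = (4, 8, 16, 32)
PATTERNS = (
    "Sequential Read",
    "Random Read",
    "Random Write",
    "Random 67% Write 33% Read",
)
QUEUE_DEPTHS = range(9, 34, 3)  # 9, 12, ..., 33
MINUTES_PER_STEP = 3

# Every block size paired with every access pattern.
access_specs = [f"{size}k {pattern}" for size, pattern in product(BLOCK_SIZES_KB, PATTERNS)]

steps = len(access_specs) * len(QUEUE_DEPTHS)
total_minutes = steps * MINUTES_PER_STEP

print(f"{len(access_specs)} tests x {len(QUEUE_DEPTHS)} queue depths = {steps} steps")
print(f"Run time per system: {total_minutes} minutes ({total_minutes / 60:.1f} hours)")
```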

The tests were not arranged in any particular order, so the ordering should not bias the results. We created the profile and then ran it against each system. Before testing, a 300GB iSCSI target was created on each system. Once the iSCSI target was created, it was formatted with NTFS defaults, and then Iometer was started. Iometer created a 25GB working set and then started running the tests.

Bear in mind that the longer these tests run, the better the performance should be on the OpenSolaris and Nexenta systems. This is due to L2ARC caching. The L2ARC populates slowly (approximately 7MB/sec) to reduce the amount of wear on the MLC SSDs. When you run a test over a significant amount of time, the caching should improve the number of IOPS that the OpenSolaris and Nexenta systems are able to achieve.
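
A back-of-envelope Python sketch shows why run length matters here. It assumes a constant fill rate of 7MB/sec, binary units (1GB = 1024MB), and that the whole 25GB working set is eligible for the L2ARC; real fill behavior will vary, and the function names are made up for this illustration.

```python
WORKING_SET_GB = 25          # Iometer working set from the test setup
L2ARC_FILL_MB_PER_SEC = 7    # approximate fill rate quoted above
STEP_SECONDS = 3 * 60        # one queue-depth step runs for 3 minutes


def seconds_to_cache(working_set_gb, fill_mb_per_sec):
    """Time for the L2ARC to absorb the working set at a constant fill rate."""
    return working_set_gb * 1024 / fill_mb_per_sec


def mb_cached_during(seconds, fill_mb_per_sec):
    """Data added to the L2ARC over the given interval."""
    return seconds * fill_mb_per_sec


warm_up_minutes = seconds_to_cache(WORKING_SET_GB, L2ARC_FILL_MB_PER_SEC) / 60
per_step_mb = mb_cached_during(STEP_SECONDS, L2ARC_FILL_MB_PER_SEC)

print(f"Time to cache the full working set: about {warm_up_minutes:.0f} minutes")
print(f"Cached during one 3-minute step: about {per_step_mb:.0f} MB")
```

Under these assumptions a single 3-minute step adds only a small slice of the working set to the cache, while a full pass of the test profile runs long enough for the cache to warm substantially.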


102 Comments


  • sfw - Wednesday, October 13, 2010

    I'm just wondering about SAS bandwidth. If you connect the backplane via 4 SAS lanes you have a theoretical peak throughput of around 1,200MB/s. The RE3 has an average read/write speed of around 90MB/s so you could already saturate the backplane connection with about thirteen RE3s at average speed. Given the fact you also connect the SSDs this seems to be a bottleneck you may wish to consider on your "areas where we could have improved the original build" list.

    By the way: really great article! Thanks for it...
  • Mattbreitbach - Wednesday, October 13, 2010

    While pure sequential reads (from all drives at the same time) would yield a bottleneck, I don't know of any instances where you would actually encounter that in our environment. Throw in one random read, or one random write, and suddenly the heads in the drives are seeking and delivering substantially lower performance than in a purely sequential read situation.

    If this was purely a staging system for disk to tape backups, and the reads were 100% sequential, I would consider more options for additional backplane bandwidth. Since this isn't a concern at this time and this system will be used primarily for VM storage, and our workloads show a pretty substantial random write access pattern (67 write/ 33 read is pretty much the norm, fully random) the probability of saturating the SAS bus is greatly reduced.
  • sfw - Thursday, October 14, 2010

    Concerning random IO you are surely right and the impressive numbers of your box prove this. But even if you don't have sequential workload there is still "zpool scrub" or the possible need to resilver one or more drives which will fill your bandwidth.

    I've checked the options at Supermicro and besides the SC846E1 they are offering E16, E2 and E26 versions with improved backplane bandwidth. The difference in price tag isn't that huge and should not have much impact if you are thinking of 15k SAS or SSD drives.
  • Mattbreitbach - Thursday, October 14, 2010

    The E2 and E26 are both dual-controller designs, which are meant for dual SAS controllers so that you can have failover capabilities.

    The E16 is the same system, but with SAS 2.0 support, which doubles the available bandwidth. I can definitely see the E16 or the E26 as being a very viable option for anyone needing more bandwidth.
  • solori - Thursday, October 21, 2010

    Actually (perhaps you meant to say this), the E1 and E16 are single SAS expander models, with the E16 supporting SAS2/6G. The E2 and E26 are dual SAS expander models, with the E26 supporting SAS2/6G.

    The dual expander design allows for MPxIO to SAS disks via the second SAS port on those disks. The single expander version is typical of SATA-only deployments. Each expander has auto-sensing SAS ports (typically SFF8088) that can connect to HBA or additional SAS expanders (cascade.)

    With SAS disks, MPxIO is a real option: allowing for reads and writes to take different SAS paths. Not so for SATA - I know of no consumer SATA disk with a second SATA port.

    As for the 90MB/s average bandwidth of a desktop drive: you're not going to see that in a ZFS application. When ZIL writes happen without an SLOG device, they are written to the pool immediately looking much like small block, random writes. Later, when the transaction group commits, the same ZIL data is written again with the transaction group (but never re-read from the original ZIL pool write since it's still in ARC). For most SATA mechanisms I've tested, there is a disproportionate hit on read performance in the presence of these random writes (i.e. 10% random writes may result in 50%+ drop in sequential read performance).

    Likewise (and this may be something to stress in a follow-up), the behavior of the ZFS transaction group promises to create a periodic burst of sequential write behavior when committing transaction groups. This has the effect of creating periods of very little activity - where only ZIL writes to the pool take place - followed by a large burst of writes (about every 20-30 seconds). This is where workload determines the amount of RAM/ARC space your ZFS device needs.

    In essence, you need 20-30 seconds of RAM. Writing target 90MB/s (sequential)? You need 2GB additional RAM to do that. Want to write 1200MB/s (assume SAS2 mirror limit)? You'll need 24GB of additional RAM to do that (not including OS footprint and other ARC space for DDT, MRU and MFU data). Also, the ARC is being used for read caching as well, so you'll want enough memory for the read demand as well.

    There are a lot of other reasons why your "mythical" desktop sequential limits will rarely be seen: variable block size, raid level (raidz/z2/z3/mirror) and metadata transactions. SLOG, L2ARC and lots of RAM can reduce the "pressure" on the disks, but there always seem to be enough pesky, random reads and writes to confound most SATA firmware from delivering its "average" rated performance. On average, I expect to see about 30-40% of "vendor specified average bandwidth" in real world applications without considerable tuning; and then, perhaps 75-80%.
  • dignus - Sunday, October 17, 2010

    It's still early Sunday morning over here, but I'm missing something. You have 26 disks in your setup, yet your mainboard has only 14 SATA connectors. How are your other disks connected to the mainboard?
  • Mattbreitbach - Sunday, October 17, 2010

    The 24 drives in the front of the enclosure are connected via a SAS expander. That allows you to add additional ports without having to have a separate cable for each individual drive.
  • sor - Sunday, February 20, 2011

    I know this is old, but it wasn't mentioned that you can choose between gzip and lz type compression. The lz was particularly interesting to us because we hardly noticed the cpu increase, while performance improved slightly and we got almost as good compression as the fastest gzip option.
  • jwinsor566 - Wednesday, February 23, 2011

    Thanks guys for an excellent post on your ZFS SAN/NAS testing. I am in the process of building my own as well. I was wondering if there has been any further testing or if you have invested in new hardware and run the benchmarks again?

    Also, do you think this would be a good solution for disk backup? Would backup software make use of the ZIL, you think, when writing to the NAS/SAN?

    Thanks
  • shriganesh - Thursday, February 24, 2011

    I have read many great articles at Anandtech. But this is the best so far! I loved the way you have presented it. It's very natural and you have mentioned most of the pitfalls. It's a splendid article and keep more like these coming!

    PS: I wanted to congratulate the author for this great work. Just for thanking you, I joined Anandtech ;) Though I wanted to share a thought or two previously, I was just compelled enough to go through the boring process of signing up :D
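
Two of the estimates in the thread above can be reproduced with a little arithmetic: sfw's count of drives needed to saturate a 4-lane SAS link, and solori's rule of thumb that the ARC needs roughly 20 to 30 seconds of write bandwidth. The Python sketch below uses only the figures quoted in those comments (1,200MB/s link, 90MB/s per drive, 20-30 second commit interval), uses decimal units (1GB = 1000MB), and uses hypothetical function names; it is a rough illustration, not a sizing tool.

```python
def drives_to_saturate(link_mb_per_sec, drive_mb_per_sec):
    """How many drives streaming at full speed it takes to fill the link."""
    return link_mb_per_sec / drive_mb_per_sec


def txg_buffer_gb(write_mb_per_sec, commit_interval_sec):
    """RAM needed to hold one commit interval of incoming writes (decimal GB)."""
    return write_mb_per_sec * commit_interval_sec / 1000


SAS_LINK_MB_PER_SEC = 1200   # 4 SAS lanes, per sfw's comment
RE3_MB_PER_SEC = 90          # average drive throughput, per sfw's comment

print(f"Drives to saturate the link: {drives_to_saturate(SAS_LINK_MB_PER_SEC, RE3_MB_PER_SEC):.1f}")
for rate in (RE3_MB_PER_SEC, SAS_LINK_MB_PER_SEC):
    low = txg_buffer_gb(rate, 20)
    high = txg_buffer_gb(rate, 30)
    print(f"{rate} MB/s of writes: {low:.1f} to {high:.1f} GB of additional RAM")
```

The 90MB/s case lands around the 2GB solori mentions, and the 1,200MB/s case at a 20-second interval gives the 24GB figure.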
