ZFS - Building, Testing, and Benchmarking
by Matt Breitbach on October 5, 2010 4:33 PM EST
Test Blade Configuration
Our bladecenters are full of high-performance blades that we currently use to run a virtualized hosting environment. Since the blades in those systems are in production, we couldn't very well use them to test the performance of our ZFS system, so we had to build another blade. We wanted it to be similar in spec to the blades we were already using, but we also wanted to take advantage of some of the new technology that has come out since many of our blades went into production. Our current environment is a mix of blades running dual Xeon 5420 processors with 32GB of RAM.
Following that tradition, we decided to use the SuperMicro SBI-7126T-S6 as our base blade. We populated it with dual Xeon 5620 processors (Intel Nehalem/Westmere-based 32nm quad core) and 48GB of registered memory.
[Image: Front panel of the SBI-7126T-S6 Blade Module]
[Image: Intel X25-V]
[Image: Dual Xeon 5620 processors and 48GB of registered memory]
Our tests will be run using Windows 2008 R2 and Iometer. We will be testing iSCSI connections over gigabit Ethernet, as this is what most budget SAN deployments will be using.
Price of OpenSolaris box
The OpenSolaris box as tested was quite inexpensive for the amount of hardware in it. The overall cost of the OpenSolaris system was $6,765. The breakdown is below:
| Part | Number | Cost | Total |
|------|--------|------|-------|
|      | 1      | $1,199.00 | $1,199.00 |
|      | 2      | $166.00   | $332.00   |
|      | 1      | $379.00   | $379.00   |
|      | 1      | $253.00   | $253.00   |
|      | 2      | $378.00   | $756.00   |
|      | 2      | $414.00   | $828.00   |
|      | 2      | $109.00   | $218.00   |
|      | 20     | $140.00   | $2,800.00 |
| **Total** | |           | $6,765.00 |
Price of Nexenta
While OpenSolaris is completely free, Nexenta is a bit different, as there are software costs to consider when building a Nexenta system. There are three versions of Nexenta you can choose from if you decide to use Nexenta instead of OpenSolaris. The first is Nexenta Core Platform, which allows unlimited storage, but does not have the GUI interface. The second is Nexenta Community Edition, which supports up to 12TB of storage and a subset of the features. The third is their high end solution, Nexenta Enterprise. Nexenta Enterprise is a paid-for product that has a broad feature set and support, accompanied by a price tag.
The hardware costs for the Nexenta system are identical to the OpenSolaris system. We opted for the trial Enterprise license for testing (unlimited storage, 45 days), as we have 18TB of billable storage. Nexenta charges you based on the number of TB that you have in your storage array. As configured, the Nexenta license for our system would cost $3,090, bringing the total cost of a Nexenta Enterprise licensed system to $9,855.
Price of Promise box
The Promise M610i is relatively simple to calculate costs on. You have the cost of the chassis, and the cost of the drives. The breakdown of those costs is below.
| Part | Number | Cost | Total |
|------|--------|------|-------|
| Promise M610i chassis | 1  | $4,170.00 | $4,170.00 |
| Hard drives           | 16 | $140.00   | $2,240.00 |
| **Total**             |    |           | $6,410.00 |
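The pricing in the tables above can be tallied into a single comparison. This is just a sketch using the as-tested figures quoted in the article:

```python
# Rough tally of the three systems' costs from the tables above.
hardware = {
    "OpenSolaris": 6765,     # hardware only; the software is free
    "Nexenta": 6765,         # identical hardware to the OpenSolaris box
    "Promise M610i": 6410,   # chassis plus 16 drives
}
nexenta_license = 3090       # Enterprise license priced for 18TB billable

totals = dict(hardware)
totals["Nexenta"] += nexenta_license
for system, cost in sorted(totals.items()):
    print(f"{system}: ${cost:,}")
# Nexenta: $9,855 / OpenSolaris: $6,765 / Promise M610i: $6,410
```

The license fee makes the Nexenta box roughly 46% more expensive than the identical OpenSolaris hardware.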
How we tested with Iometer
Our tests are all run from Iometer, using a custom configuration of Iometer. The .icf configuration file can be found here. We ran the following tests, starting at a queue depth of 9, ending with a queue depth of 33, stepping by a queue depth of 3. This allows us to run tests starting below a queue depth of 1 per drive, to a queue depth of around 2 per drive (depending on the storage system being tested).
The tests were run in this order, and each test was run for 3 minutes at each queue depth.
- 4k Sequential Read
- 4k Random Write
- 4k Random 67% Write 33% Read
- 4k Random Read
- 8k Random Read
- 8k Sequential Read
- 8k Random Write
- 8k Random 67% Write 33% Read
- 16k Random 67% Write 33% Read
- 16k Random Write
- 16k Sequential Read
- 16k Random Read
- 32k Random 67% Write 33% Read
- 32k Random Read
- 32k Sequential Read
- 32k Random Write
The tests were not arranged in any order intended to bias the results. We created the profile and then ran it against each system. Before testing, a 300GB iSCSI target was created on each system. Once the iSCSI target was created, it was formatted with NTFS defaults, and then Iometer was started. Iometer created a 25GB working set and then started running the tests.
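The methodology above amounts to a fixed run matrix. A minimal sketch of that matrix in plain Python (the list structure is ours, not from the Iometer .icf file):

```python
# Sketch of the Iometer run matrix: four block sizes, four access
# patterns, and queue depths swept from 9 to 33 in steps of 3.
block_sizes = ["4k", "8k", "16k", "32k"]
patterns = ["Sequential Read", "Random Read", "Random Write",
            "Random 67% Write 33% Read"]
queue_depths = list(range(9, 34, 3))   # 9, 12, ..., 33

runs = [(bs, p, qd) for bs in block_sizes
                    for p in patterns
                    for qd in queue_depths]
print(len(queue_depths), "queue depths,", len(runs), "runs total")
# 9 queue depths, 144 runs total
```

At 3 minutes per run, that is 432 minutes (7.2 hours) of pure test time per system, before setup and target creation.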
While reviewing these results, bear in mind that the longer the tests run, the better the performance should be on the OpenSolaris and Nexenta systems. This is due to L2ARC caching: the L2ARC populates slowly to reduce the amount of wear on the MLC SSDs that back it.
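A back-of-the-envelope model shows how slow that warm-up is. The 8 MB/s feed rate below is an assumed value for the `l2arc_write_max` throttle commonly cited for OpenSolaris-era ZFS, not a figure from our testing:

```python
# Back-of-the-envelope L2ARC warm-up estimate. The feed-rate cap is an
# assumption (l2arc_write_max was commonly 8 MB/s in this era of ZFS).
working_set_mb = 25 * 1024       # the 25GB Iometer working set
l2arc_feed_mb_s = 8              # assumed L2ARC fill-rate cap

warmup_minutes = working_set_mb / l2arc_feed_mb_s / 60
print(f"~{warmup_minutes:.0f} minutes to fully warm the L2ARC")
# ~53 minutes
```

Under those assumptions the cache cannot even hold the full working set until nearly an hour into a run, so short benchmarks understate steady-state read performance.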
102 Comments
MGSsancho - Tuesday, October 5, 2010 - link
I haven't tried this myself yet, but how about using 8KB blocks and jumbo frames on your network? Possibly lower throughput due to padding to fill the 9KB frame, in exchange for lower latency? I have no idea, as this is just a theory. Dudes in the #opensolaris IRC chan have always recommended 128K or 64K depending on the data.
solori - Wednesday, October 20, 2010 - link
One easy way to check this would be to export the pool from OpenSolaris and directly import it to NexentaStor and re-test. I think you'll find that the differences - as your benchmarks describe - are more linked to write caching at the disk level than partition alignment.
NexentaStor is focused on data integrity, and tunes for that very conservatively. Since SATA disks are used in your system, NexentaStor will typically disable disk write cache (write hit) and OpenSolaris may typically disable device cache flush operations (write benefit). These two feature differences can provide the benchmark differences you're seeing.
Also, some "workstation" tuning includes the disabling of ZIL (performance benefit). This is possible - but not recommended - in NexentaStor but has the side effect of risking application data integrity. Disabling the ZIL (in the absence of SLOG) will result in synchronous writes being committed only with transaction group commits - similar performance to having a very fast SLOG (lots of ARC space helpful too).
fmatthew5876 - Tuesday, October 5, 2010 - link
I'd be very interested to see how FreeBSD ZFS benchmark results would compare to Nexenta and OpenSolaris.
mbreitba - Tuesday, October 5, 2010 - link
We have benchmarked FreeNAS's implementation of ZFS on the same hardware, and the performance was abysmal. We've considered looking into the latest releases of FreeBSD but have not completed any of that testing yet.
jms703 - Tuesday, October 5, 2010 - link
Have you benchmarked FreeBSD 8.1? There were a huge number of performance fixes in 8.1.
Also, when was this article written? OpenSolaris was killed by Sun on August 13th, 2010.
mbreitba - Tuesday, October 5, 2010 - link
There was a lot of work on this article just prior to the official announcement. The development of the Illumos foundation and subsequent OpenIndiana has been so rapidly paced that we wanted to get this article out the door before diving in to OpenIndiana and any other OpenSolaris derivatives. We will probably add more content talking about the demise of OpenSolaris and the Open Source alternatives that have started popping up at a later date.
MGSsancho - Tuesday, October 5, 2010 - link
Not to mention that projects like illumos are currently not recommended for production; for now they're only meant as a base for other distros (OpenIndiana). Then there is Solaris 11, due soon. I'll try out the Express version when it's released.
FreeNAS 0.7.x is still using FreeBSD 7.x, and the ZFS code is a bit dated. FreeBSD 8.x has newer ZFS code (v15). Hopefully very soon FreeBSD 9.x will have the latest ZFS code (v24).
piroroadkill - Tuesday, October 5, 2010 - link
This is relevant to my interests, and I've been toying with the idea of setting up a ZFS-based server for a while. It's nice to see the features it can use when you have the hardware for it.
cgaspar - Tuesday, October 5, 2010 - link
You say that all writes go to a log in ZFS. That's just not true. Only synchronous writes below a certain size go into the log (either built into the pool, or a dedicated log device). All writes are held in memory in a transaction group, and that transaction group is written to the main pool at least every 10 seconds by default (in OpenSolaris - it used to be 30 seconds, and still is in Solaris 10 U9). That's tunable, and commits will happen more frequently if required, based on available ARC and data churn rate. Note that _all_ writes go into the transaction group - the log is only ever used if the box crashes after a synchronous write and before the txg commits.
Now for the caution - you have chosen SSDs for your SLOG that don't have a backup power source for their onboard caches. If you suffer power loss, you may lose data. Several SLC SSDs have recently been released that have a supercapacitor or other power source sufficient to write cached data to flash on power loss, but the current Intel line-up doesn't have it. I believe the next generation of Intel SSDs will.