ZFS Features

 

ZFS includes two exciting features that dramatically improve the performance of read operations. I’m talking about ARC and L2ARC. ARC stands for adaptive replacement cache. ARC is a very fast block level cache located in the server’s memory (RAM). The amount of ARC available in a server is usually all of the memory except for 1GB.

For example, our ZFS server with 12GB of RAM has 11GB dedicated to ARC, which means our ZFS server will be able to cache 11GB of the most accessed data. Any read requests for data in the cache can be served directly from the ARC memory cache instead of hitting the much slower hard drives. This creates a noticeable performance boost for data that is accessed frequently.

As a general rule, you want to install as much RAM into the server as you can to make the ARC as big as possible. At some point adding more memory becomes cost prohibitive, which is where the L2ARC becomes important. The L2ARC is the second-level adaptive replacement cache. In ZFS systems, the L2ARC devices are often called “cache drives”.

These cache drives are physically MLC style SSD drives. These SSD drives are slower than system memory, but still much faster than hard drives. More importantly, the SSD drives are much cheaper than system memory. Most people compare the price of SSD drives with the price of hard drives, and this makes SSD drives seem expensive. Compared to system memory, MLC SSD drives are actually very inexpensive.

When cache drives are present in the ZFS pool, the cache drives will cache frequently accessed data that did not fit in ARC. When read requests come into the system, ZFS will attempt to serve those requests from the ARC. If the data is not in the ARC, ZFS will attempt to serve the requests from the L2ARC. Hard drives are only accessed when data does not exist in either the ARC or L2ARC. This means the hard drives receive far fewer requests, which is awesome given the fact that the hard drives are easily the slowest devices in the overall storage solution.
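The lookup order described above can be sketched as a toy two-tier cache. This is only an illustration under loud assumptions: the class name `TieredReadCache` is hypothetical, plain LRU eviction stands in for the real adaptive replacement policy, and blocks are spilled to the second tier at the moment they are evicted from the first, which simplifies how ZFS actually feeds its cache drives.

```python
from collections import OrderedDict


class TieredReadCache:
    """Toy model of the ARC -> L2ARC -> hard drive read path.

    Assumption: simple LRU eviction is used in both tiers; the real ARC
    uses an adaptive replacement policy that is not modeled here.
    """

    def __init__(self, arc_blocks, l2arc_blocks, disk):
        self.arc = OrderedDict()    # first tier: RAM
        self.l2arc = OrderedDict()  # second tier: SSD cache drives
        self.arc_blocks = arc_blocks
        self.l2arc_blocks = l2arc_blocks
        self.disk = disk            # dict of block id -> data (the hard drives)
        self.hits = {"arc": 0, "l2arc": 0, "disk": 0}

    def read(self, block_id):
        # 1. Try the ARC first.
        if block_id in self.arc:
            self.arc.move_to_end(block_id)
            self.hits["arc"] += 1
            return self.arc[block_id]
        # 2. Then the L2ARC, and only then the hard drives.
        if block_id in self.l2arc:
            self.hits["l2arc"] += 1
            data = self.l2arc[block_id]
        else:
            self.hits["disk"] += 1
            data = self.disk[block_id]
        self._insert_into_arc(block_id, data)
        return data

    def _insert_into_arc(self, block_id, data):
        self.arc[block_id] = data
        self.arc.move_to_end(block_id)
        while len(self.arc) > self.arc_blocks:
            old_id, old_data = self.arc.popitem(last=False)
            # Simplification: spill the evicted block to the second tier.
            self.l2arc[old_id] = old_data
            self.l2arc.move_to_end(old_id)
            while len(self.l2arc) > self.l2arc_blocks:
                self.l2arc.popitem(last=False)


disk = {i: f"block-{i}" for i in range(10)}
cache = TieredReadCache(arc_blocks=2, l2arc_blocks=4, disk=disk)
for block_id in [0, 1, 2, 0, 0]:
    cache.read(block_id)
print(cache.hits)
```

In this run the first three reads go to the hard drives, the fourth is served from the second tier because block 0 had been pushed out of the two-block first tier, and the fifth is served from RAM.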

In our ZFS project, we added a pair of 160GB Intel X25-M MLC SSD drives for a total of 320GB of L2ARC. Between our ARC of 11GB and our L2ARC of 320GB, our ZFS solution can cache over 300GB of the most frequently accessed data! This hybrid solution offers considerably better performance for read requests because it reduces the number of accesses to the large, slow hard drives.

 

Things to Keep in Mind

There are a few things to remember. The cache drives don’t get mirrored. When you add cache drives, you cannot set them up as a mirror, but there is no need to, since the content is already mirrored on the hard drives. The cache drives are just a cheap alternative to RAM for caching frequently accessed content.

Another thing to remember is that you still need SLC SSD drives for the ZIL drives. ZIL stands for "ZFS Intent Log", and it acts as an intermediary for write caching. Not having ZIL drives severely slows down write access; adding them significantly increases write speeds. This is still not as fast as a RAM-based write cache on a RAID card, but it is much better than not having anything (see Solaris ZFS Best Practices For Log Devices). In short, the SLC SSD drives used as ZIL drives improve write performance, while the MLC SSD drives used as cache drives improve read performance.

It is also important to remember that the L2ARC will require some memory to operate.  A portion of the ARC will be used to index and manage the content located in the L2ARC.  A general rule of thumb is that 1-2GB of ARC will be used for every 100GB of L2ARC.  With a 300GB L2ARC, we will give up 3-6GB of ARC.  This will leave us with 5-8GB of ARC memory to use to cache the most frequently accessed files.
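The arithmetic behind that rule of thumb can be written out as a short calculation. This is a sketch only: the function name `arc_after_l2arc_overhead` is made up for illustration, and the 1-2GB per 100GB figure is the rule of thumb quoted above, not a measured value.

```python
def arc_after_l2arc_overhead(arc_gb, l2arc_gb, overhead_gb_per_100gb):
    """Return the ARC left for caching data after L2ARC bookkeeping.

    overhead_gb_per_100gb is the rule-of-thumb cost (1 to 2 GB of ARC
    per 100 GB of L2ARC) taken from the text, not a measurement.
    """
    overhead_gb = l2arc_gb / 100 * overhead_gb_per_100gb
    return arc_gb - overhead_gb


# 11GB of ARC with the rounded 300GB L2ARC and with the full 320GB.
for l2arc_gb in (300, 320):
    for rate in (1, 2):
        remaining = arc_after_l2arc_overhead(11, l2arc_gb, rate)
        print(f"L2ARC {l2arc_gb}GB at {rate}GB/100GB: {remaining:.1f}GB of ARC left")
```

Using the full 320GB instead of the rounded 300GB shifts the remaining ARC only slightly, to roughly 4.6-7.8GB.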

 

Effective Caching for Virtualized Environments

At this point, you are probably wondering how effectively the two levels of caching will be able to cache the most frequently used data, especially when we are talking about 9TB of formatted RAID10 capacity. Will 11GB of ARC and 320GB L2ARC make a significant difference for overall performance? It will depend on what type of data is located on the storage array and how it is being accessed. If it contained 9TB of files that were all accessed in a completely random way, the caching would likely not be effective. However, we are planning to use the storage for virtual machine file systems and this will cache very effectively for our intended purpose.

When you plan to deploy hundreds of virtual machines, the first step is to build a base template that all of the virtual machines will start from. If you were planning to host a lot of Linux virtual machines, you would build the base template by installing Linux. When you get to the step where you would normally configure the server, you would shut off the virtual machine. At that point, you would have the base template ready. Each additional virtual machine would simply be chained off the base template. The virtualization technology will keep the changes specific to each virtual machine in its own child or differencing file.
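The parent/child arrangement can be illustrated with a minimal model. This is a hedged sketch, not how any particular hypervisor implements it: the class name `DifferencingDisk` and the block contents are invented for the example. Reads fall through to the shared template unless the virtual machine has written its own copy of a block.

```python
class DifferencingDisk:
    """Toy model of a child disk chained off a read-only base template."""

    def __init__(self, parent_blocks):
        self.parent = parent_blocks  # shared base template, never modified
        self.delta = {}              # blocks changed by this virtual machine

    def read(self, block_id):
        # Changed blocks come from the child; everything else from the template.
        if block_id in self.delta:
            return self.delta[block_id]
        return self.parent[block_id]

    def write(self, block_id, data):
        # Writes never touch the template, only the child's own file.
        self.delta[block_id] = data


base_template = {0: "kernel", 1: "system libraries", 2: "default config"}
vm_a = DifferencingDisk(base_template)
vm_b = DifferencingDisk(base_template)

vm_a.write(2, "vm-a config")
print(vm_a.read(2))  # this VM's own copy
print(vm_b.read(2))  # still the shared template block
print(vm_a.read(0) is vm_b.read(0))  # both VMs read the same template block
```

Because every virtual machine reads blocks 0 and 1 from the same template, a storage cache only has to hold those blocks once, no matter how many virtual machines are running.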

When the virtualization solution is configured this way, the base template will be cached quite effectively in the ARC (main system memory). This means the main operating system files and cPanel files should deliver near RAM-disk performance levels. The L2ARC will be able to effectively cache the most frequently used content that is not shared by all of the virtual machines, such as the content of the files and folders in the most popular websites or MySQL databases. The least frequently accessed content will be pulled from the hard drives, but even that should show solid performance for two reasons: it will be RAID10 across 18 drives, and the frequently accessed read requests will not burden the RAID10 volume because they are already served from ARC or L2ARC.

 

Testing the L2ARC

We thought it would be fun to actually test the L2ARC and build a chart of the performance as a function of time. To test and graph the usefulness of the L2ARC, we set up an iSCSI share on the ZFS server and then ran Iometer from our test blade in our blade center. We ran these tests over gigabit Ethernet.

 Iometer Test Details:

25GB working set

4k blocks

100% random

100% read

load 32 (constant)

four hour test

Every ten minutes during the test, we grabbed the “Last performance” values (IOPS, MB/sec) from Iometer and wrote them down to build a performance chart.  Our goal was to be able to graph the performance as a function of time so we could illustrate the usefulness of the L2ARC.

We ran the same test using the Promise M610i (16 1TB WD RE3 drives in RAID10) box to get a comparison graph.  The Promise box is not a ZFS style solution and does not have any L2ARC style caching feature.  We expected the ZFS box to outperform the Promise box, and we expected the ZFS box to increase performance as a function of time because the L2ARC would become more populated the longer the test ran. 

The Promise box consistently delivered 2200 to 2300 IOPS every time we checked performance during the entire 4 hour test.  The ZFS box started by delivering 2532 IOPS at 10 minutes into the test and delivered 20873 IOPS by the end of the test.

Here is the chart of the ZFS box performance results:

[Chart: L2ARC Performance]

 

Initially, the two SAN boxes deliver similar performance, with the Promise box at 2200 IOPS and the ZFS box at 2500 IOPS. Once the L2ARC is completely populated, the ZFS box delivers more than eight times its starting performance, and over nine times the IOPS of the Promise box!
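To put the IOPS figures in perspective, they can be converted to throughput and compared directly. The sketch below uses only the numbers reported in the test; the helper names are made up for illustration, and it assumes that "4k blocks" means 4096 bytes and that MB means one million bytes.

```python
BLOCK_SIZE_BYTES = 4 * 1024  # assumption: "4k blocks" means 4096 bytes


def throughput_mb_per_s(iops, block_size_bytes=BLOCK_SIZE_BYTES):
    """Convert IOPS at a fixed block size to MB/s (1 MB = 1,000,000 bytes)."""
    return iops * block_size_bytes / 1_000_000


def speedup(after_iops, before_iops):
    """Ratio of two IOPS measurements."""
    return after_iops / before_iops


zfs_start_iops = 2532   # ZFS box, 10 minutes into the test
zfs_end_iops = 20873    # ZFS box, end of the four hour test
promise_iops = (2200, 2300)  # Promise box, range over the whole test

print(f"ZFS start: {throughput_mb_per_s(zfs_start_iops):.1f} MB/s")
print(f"ZFS end:   {throughput_mb_per_s(zfs_end_iops):.1f} MB/s")
print(f"ZFS end vs ZFS start: {speedup(zfs_end_iops, zfs_start_iops):.2f}x")
for p in promise_iops:
    print(f"ZFS end vs Promise at {p} IOPS: {speedup(zfs_end_iops, p):.2f}x")
```

At the end of the test the ZFS box is moving roughly 85 MB/s of 4k random reads, which is a sizable share of what a single gigabit Ethernet link can carry.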

Notice that ZFS limits how quickly the L2ARC is populated to reduce wear on the cache drives.  It takes a few hours to populate the L2ARC and achieve maximum performance.  That seems like a long time when running benchmarks, but it is actually a very short period of time in the life cycle of a typical SAN box.
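A rough way to reason about the warm-up time is to divide the working set by the rate at which the cache drives are filled. The fill rates below are purely hypothetical values chosen for illustration; they are not measurements from this test and not ZFS defaults, and the function name `hours_to_fill` is invented. The sketch also assumes 1GB = 1024MB.

```python
def hours_to_fill(working_set_gb, fill_rate_mb_per_s):
    """Hours needed to write a working set to the cache at a steady fill rate.

    Assumes 1 GB = 1024 MB and a constant fill rate, which is a simplification.
    """
    seconds = working_set_gb * 1024 / fill_rate_mb_per_s
    return seconds / 3600


# 25GB working set from the Iometer test, hypothetical fill rates in MB/s.
for rate in (2, 4, 8):
    print(f"{rate} MB/s -> {hours_to_fill(25, rate):.2f} hours")
```

With a random read workload, a block also has to be read at least once before it can be cached at all, so the real warm-up depends on the access pattern as well as on the fill rate.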


103 Comments


  • Penti - Wednesday, October 06, 2010

    And a viable alternative still isn't available. How are Nexenta and the community supposed to get driver support and support for new hardware when Oracle has closed the development kernel (SXDE is closed source)? It means that they maybe, just maybe, can use the retail Solaris 11 kernel if it's released in a functioning form that can be piped in with existing software and distro. They aren't going to develop it themselves, and the vendors have no reason to give the code/drivers to anybody but Oracle. Continuing the OpenSolaris kernel means creating a new operating system. It means you won't get the latest ZFS updates and tools any more, at least not till they are in the normal S11 release. It means you can't expect the latest driver updates and so on either. You can continue to use it on today's hardware, but tomorrow it might be useless; you might not find working configurations.

    It's not clear that Nexenta actually can develop their own operating system, rather than just a distro; it means they have to create their own OS with their own kernel eventually, with their own drivers and so on. And it's not clear how much code Oracle will let slip out. It's just clear that they will keep it under wraps till official releases. It's however clear that there won't be any distro for them to base it on, and any and all forks would be totally dependent on what Nexenta (Illumos) manage to do. It will quickly get outdated without updates flowing all the time, and they came from Sun.
  • andersenep - Wednesday, October 06, 2010

    OpenIndiana/Illumos runs the same latest and greatest pool/zfs versions as the most recent Solaris 10 update.

    Work continues on porting newer pool/ZFS versions to FreeBSD which has plenty of driver support (better than OpenSolaris ever did).

    A stated goal of the Illumos project is to maintain 100% binary compatibility with Solaris. If Oracle decides to break that compatibility, intentionally or not, it will truly become a fork. Development will still continue.

    Even if no further development is made on ZFS, it's still an absolutely phenomenal filesystem. How many years now has Apple been using HFS+? FAT is still around in everything. If all development on ZFS stopped today, it would still remain an absolutely viable filesystem for many years to come. There is nothing else currently out there that even comes close to its feature set.

    I don't see how ZFS being under Oracle's control makes it any worse than any other open source filesystem. The source is still out there, and people are free to do what they want with it within the CDDL terms.

    This idea that just because the OpenSolaris DISTRO has been discontinued, that everything that went into it is no longer viable is silly. It is like calling Linux dead because Mandriva is dead.
  • Guspaz - Wednesday, October 06, 2010

    Thanks for mentioning OpenIndiana. I've been eagerly awaiting IllumOS to be built into an actual distribution to give me an upgrade path for my home OpenSolaris file server, and I look forward to upgrading to the first stable build of OpenIndiana.

    I'm currently running a dev build of OpenSolaris since the realtek network driver was broken in the latest stable build of OpenSolaris (for my chipset, at least).
  • Mattbreitbach - Wednesday, October 06, 2010

    I believe all of the current Hypervisors support this. Hyper-V does, as does XenServer. I have not done extensive testing with ESXi, but I would imagine that it supports it also.
  • joeribl - Wednesday, October 06, 2010

    "Nexenta is to OpenSolaris what OpenFiler or FreeNAS is to Linux."

    FreeNAS has always been FreeBSD based, not Linux. It does however provide ZFS support.
  • Mattbreitbach - Wednesday, October 06, 2010

    I should have caught that - thanks for the info. I've edited the article to reflect as such.
  • vermaden - Wednesday, October 06, 2010

    ... with deduplication and other features. You can grab an ISO build or a VirtualBox appliance here: http://blog.vx.sk/archives/9-Pomozte-testovat-ZFS-...

    It would be great to see how FreeBSD performs (8.1 and 9-CURRENT) on that hardware. I can help You configure FreeBSD for these tests if You would like. For example, by default FreeBSD does not enable AHCI mode for SATA drives, which increases random performance a lot.

    Anyway, great article about ZFS performance on nice piece of hardware.
  • Mattbreitbach - Wednesday, October 06, 2010

    In Hyper-V it is called a Differencing disk - you have a parent disk that you build, and do not modify. You then create a "differencing disk". That disk uses the parent disk as its source, and writes any changes out to the differencing disk. This way you can maintain all core OS files in one image, and write any changes out to child disks. This allows the storage system to cache any core OS components once, and any access to those core components comes directly from the cache.

    I believe that Xen calls it a differencing disk also, but I do not currently have a Xen Hypervisor running anywhere that I can check quickly.
  • gea - Wednesday, October 06, 2010

    new: Version 0.323
    napp-it ZFS appliance with Web-UI and online-installer for NexentaCore and Openindiana

    Napp-it, a project to build a free "ready to run" ZFS web and NAS appliance with web UI and online installer, now supports NexentaCore and OpenIndiana (free successor of OpenSolaris) from version 0.323 onward. With its online installer, you will have your ZFS server running with all services and tools within minutes.

    Features
    NAS Fileserver with AFP (incl. Time Machine and Zero Config), SMB with ACLs, AD-Support and User/ Groups
    SAN Server with iSCSI (Comstar) and NFS for Xen or VMware ESXi
    Web-Server, FTP
    Database-Server
    Backup-Server
    newest ZFS-Features (highest security with parity and Copy On Write, Deduplication, Raid-Z3, unlimited Snapshots via Windows previous Version, working ACLs, Online Pooltest with Datarefresh, Hybridpools, expandable Datapools=simply add Controller or Disks,............)

    included Tools:
    bonnie Pool-Performancetest
    iperf Net-Performancetest
    midnight commander
    ndmpcopy Backup
    rsync
    smartmontools
    socat
    unzip

    Management:
    remote via Web-UI and Browser

    Howto with NexentaCore:
    1. insert NexentaCore CD and install
    2. login as root and enter:

    wget -O - www.napp-it.org/nappit | perl

    During first installation you have to enter a MySQL password and select Apache with the space key

    Howto with OpenIndiana (free successor of OpenSolaris):
    1. Insert OpenIndiana CD and install
    2. login as admin, open a terminal and enter su to get root permissions and enter:

    wget -O - www.napp-it.org/nappit | perl

    AFP-Server is currently installed only on Nexenta.

    That's all, no step 3!
    You can now remotely manage this Mac/PC NAS appliance via Browser

    Details
    www.napp-it.org

    running Installation
    www.napp-it.org/pop_en.html
  • Mattbreitbach - Wednesday, October 06, 2010

    Very neat - I am installing OpenIndiana on our hardware right now and will test out the Napp-it application.
