Solving the Virtualization I/O Puzzle

Step One: IOMMU, VT-d

The solution for the high CPU load, higher latency, and lower throughput comes in three steps. The first step was to bypass the hypervisor and assign a NIC directly to the network-intensive VM. This approach offers several advantages: the VM has direct access to a native device driver, and as such can use every hardware acceleration feature that is available; and because the NIC is not shared, all of its queues and buffers are dedicated to that one VM.

However, even though the hypervisor lets the VM access the native driver directly, a virtual machine cannot bypass the hypervisor’s memory management. The guest OS inside that VM does not see the real physical memory, but a virtual memory map that is managed by the hypervisor. So when the guest OS hands addresses to the driver for DMA, it hands over these virtualized "guest physical" addresses instead of the real physical addresses the device expects.

Intel solved this with VT-d, AMD with the “new” (*) IOMMU. The I/O hub translates the virtual or "guest physical" addresses into real physical addresses. This new IOMMU also isolates the different I/O devices from each other by allocating different subsets of physical memory to the different devices.
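To make the remapping idea concrete, here is a minimal Python sketch of what an IOMMU-style translation table does conceptually. The device names, page numbers, and 4 KiB page size are invented for illustration; real VT-d/IOMMU hardware is programmed through page-table structures in memory, not an API like this.

```python
# Minimal sketch of IOMMU-style DMA remapping (conceptual, not a real driver
# model). Device names, page numbers, and the 4 KiB page size are assumptions.

PAGE_SIZE = 4096


class IOMMU:
    """Translates device-visible (guest-physical) addresses into host-physical
    addresses, keeping a separate mapping table per device for isolation."""

    def __init__(self):
        self.tables = {}  # {device_id: {guest_page_number: host_page_number}}

    def map_page(self, device_id, guest_page, host_page):
        # The hypervisor programs a mapping for one device only.
        self.tables.setdefault(device_id, {})[guest_page] = host_page

    def translate(self, device_id, guest_addr):
        page, offset = divmod(guest_addr, PAGE_SIZE)
        mapping = self.tables.get(device_id, {})
        if page not in mapping:
            # Unmapped access: the IOMMU blocks the DMA instead of letting the
            # device touch memory that belongs to the host or another VM.
            raise PermissionError(f"DMA fault: device {device_id}, page {page:#x}")
        return mapping[page] * PAGE_SIZE + offset


iommu = IOMMU()
# The hypervisor maps the NIC assigned to VM1 onto VM1's host pages only.
iommu.map_page("nic0", guest_page=0x10, host_page=0x5A2)

# The guest hands the NIC a "physical" address; the IOMMU turns it into the
# real host-physical address before the DMA reaches memory.
print(hex(iommu.translate("nic0", 0x10 * PAGE_SIZE + 0x80)))

# Another device has no mapping for that page, so its DMA attempt faults
# instead of silently reading or corrupting VM1's memory.
try:
    iommu.translate("nic1", 0x10 * PAGE_SIZE)
except PermissionError as err:
    print(err)
```

The per-device tables are the isolation part: a device can only reach the pages the hypervisor has explicitly mapped for it.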

Very few virtualized servers use this feature, however, because it makes virtual machine migration impossible: instead of decoupling the virtual machine from the underlying hardware, direct assignment ties the VM firmly to it. So AMD's IOMMU and Intel's VT-d are not that useful on their own; they are just one of the three pieces of the I/O virtualization puzzle.

(*) There is also an "old IOMMU", the Graphics Address Remapping Table (GART), which performed address translation to let the graphics card access main memory.

Step Two: Multiple Queues

The next step was making the NIC a lot more powerful. Instead of letting the hypervisor sort all the received packets and send them off to the right VM, the NIC becomes a complete hardware switch that sorts all packets into multiple queues, one for each VM. This offers several advantages.

Fewer interrupts and lower CPU load. If the hypervisor handles the packet switching, CPU 0 (which in most cases is the one running the hypervisor tasks) is interrupted and has to examine each received packet to determine the destination VM; then the destination VM and its associated CPU have to be interrupted as well. With a hardware switch in the NIC, the packet is placed directly in the right queue, and only the right CPU is interrupted to come and get it.

Lower latency. A single queue shared by multiple VMs that are receiving and transmitting packets can get overwhelmed and drop packets. By giving each VM its own queue, throughput is higher and latency is lower.

Although Virtual Machine Device Queues (VMDq) solves a lot of problems, some CPU overhead remains: every time a VM's CPU is interrupted, the hypervisor still has to copy the data from hypervisor memory space into the VM's memory space.
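As an illustration of the sorting step, below is a hedged Python sketch of what a VMDq-capable NIC does conceptually: a small layer-2 table, programmed by the hypervisor, steers each incoming frame into a per-VM queue. The MAC addresses, VM names, and queue depth are invented, a real NIC does this in silicon, and the final copy at the end of the sketch is the leftover overhead that the third piece of the puzzle (SR-IOV) targets.

```python
# Conceptual sketch of VMDq-style packet sorting in a NIC (illustrative only;
# a real NIC implements this in hardware). MAC addresses, VM names, and the
# queue depth below are made-up assumptions.

from collections import deque


class MultiQueueNIC:
    """Steers incoming frames into per-VM queues by destination MAC address,
    so the hypervisor no longer has to inspect every frame on CPU 0."""

    def __init__(self, queue_depth=256):
        self.mac_to_vm = {}   # layer-2 table programmed by the hypervisor
        self.queues = {}      # one receive queue per VM
        self.queue_depth = queue_depth
        self.dropped = 0

    def add_vm(self, vm_name, mac):
        self.mac_to_vm[mac] = vm_name
        self.queues[vm_name] = deque()

    def receive(self, frame):
        vm = self.mac_to_vm.get(frame["dst_mac"])
        if vm is None:
            self.dropped += 1          # unknown destination: drop it here
            return None
        if len(self.queues[vm]) >= self.queue_depth:
            self.dropped += 1          # this VM's queue is full; other VMs are unaffected
            return None
        self.queues[vm].append(frame)
        return vm                      # in hardware: interrupt only this VM's CPU


nic = MultiQueueNIC()
nic.add_vm("vm1", "aa:bb:cc:00:00:01")
nic.add_vm("vm2", "aa:bb:cc:00:00:02")

print(nic.receive({"dst_mac": "aa:bb:cc:00:00:02", "payload": b"hello"}))  # -> vm2

# The overhead that remains with VMDq: the hypervisor still copies each queued
# frame from its own buffers into the guest's memory space.
frame = nic.queues["vm2"].popleft()
guest_buffer = bytearray(frame["payload"])  # stand-in for the hypervisor-to-guest copy
```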

Comments

  • Kahlow - Friday, November 26, 2010 - link

    Great article! The argument between fiber and 10gig E is interesting, but from what I have seen it is extremely application and workload dependent; you would have to have a 100 page review to figure out which medium is better for which workload.
    Also, in most cases your disk arrays are the real bottleneck, and maxing out your 10gig E or your FC isn't the issue.

    It is good to have a reference point though and to see what 10gig translates to under testing.

    Thanks for the review,
  • JohanAnandtech - Friday, November 26, 2010 - link

    Thanks.

    I agree that it highly depends on the workload. However, there are lots and lots of smaller setups out there that are now using unnecessarily complicated and expensive setups (several physically separated GbE and FC networks). One of the objectives was to show that there is an alternative. As many readers have confirmed, a dual 10GbE setup can be a great solution if you're not running some massive databases.
  • pablo906 - Friday, November 26, 2010 - link

    It's free and you can get it up and running in no time. It's gaining a tremendous amount of users because of the recent Virtual Desktop licensing program Citrix pushed. You could double your XenApp (MetaFrame Presentation Server) license count and upgrade them to XenDesktop for a very low price, cheaper than buying additional XenApp licenses. I know of at least 10 very large organizations that are testing XenDesktop and preparing rollouts right now.

    What gives? VMware is not the only hypervisor out there.
  • wilber67 - Sunday, November 28, 2010 - link

    Am I missing something in some of the comments?
    Many are discussing FCoE, and I do not believe any of the NICs tested were CNAs, just 10GE NICs.
    FCoE requires a CNA (Converged Network Adapter). Also, you cannot connect them to a garden-variety 10GE switch and use FCoE. And don't forget that you cannot route FCoE.
  • gdahlm - Sunday, November 28, 2010 - link

    You can use software initiators with switches that support 802.3X flow control. Many web-managed switches do support 802.3X, as do most 10GE adapters.

    I am unsure how that would affect performance in a virtualized, shared environment, as I believe it pauses at the port level.

    If your workload is not storage or network bound it would work, but I am betting that when you hit that hard knee in your performance curve, things get ugly pretty quickly.
  • DyCeLL - Sunday, December 5, 2010 - link

    Too bad HP Virtual Connect couldn't be tested (a blade option).
    It splits the 10Gb NICs into a maximum of 8 NICs for the blades, and it can do it for both Fibre Channel and Ethernet.
    Check: http://h18004.www1.hp.com/products/blades/virtualc...
  • James5mith - Friday, February 18, 2011 - link

    I still think that 40Gbps InfiniBand is the best solution. It seems to have by far the best $/Gbps ratio of any of the platforms, not to mention it can pass pretty much any traffic type you want.
  • saah - Thursday, March 24, 2011 - link

    I loved the article.

    I just reminded myself that VMware published official drivers for ESX4 recently: http://downloads.vmware.com/d/details/esx4x_intel_...
    The ixgbe version is 3.1.17.1.
    Since the post says that it "enables support for products based on the Intel 82598 and 82599 10 Gigabit Ethernet Controllers", I would like to see the test redone with an 82599-based card and recent drivers.
    Would it be feasible?
