In the second quarter of this year, we’ll have affordable servers with up to 48 cores (AMD’s Magny-Cours) and 64 threads (Intel Nehalem EX). The most obvious way to wield all that power is to consolidate massive numbers of virtual machines on those powerhouses. Typically, we’ll probably see something like 20 to 50 VMs on such machines. Port aggregation with a quad-port gigabit Ethernet card is probably not going to suffice. If we have 40 VMs on a quad-port gigabit card, that is at most 100Mbit/s per VM. We are back in the early Fast Ethernet days. Until virtualization took over, our network intensive applications would get a gigabit pipe; now we will be offering them 10 times less? This is not acceptable.
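The per-VM figure is simple division. A quick sketch, assuming the aggregated ports are shared evenly among the VMs and ignoring protocol overhead (the function name is ours, purely for illustration):

```python
def per_vm_bandwidth_mbit(ports: int, port_speed_mbit: float, vms: int) -> float:
    """Even share of the aggregated NIC bandwidth per VM, ignoring overhead."""
    if vms <= 0:
        raise ValueError("vms must be positive")
    return ports * port_speed_mbit / vms


# Quad-port gigabit card shared by 40 VMs.
quad_gigabit = per_vm_bandwidth_mbit(ports=4, port_speed_mbit=1000, vms=40)
# A single 10Gbit port shared by the same 40 VMs.
single_10g = per_vm_bandwidth_mbit(ports=1, port_speed_mbit=10_000, vms=40)

print(quad_gigabit, single_10g)  # 100.0 250.0
```

Real traffic is of course not spread evenly, so this is only the average share, not what any one VM will see at a given moment.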

Granted, few applications actually need a full 1Gbit/s pipe. Database servers need considerably less, only a few megabits per second. Even at full load, the servers in our database tests rarely go beyond 10Mbit/s. Web servers are typically satisfied with a few tens of Mbit/s, but AnandTech's own web server is frequently bottlenecked by its 100Mbit connection. Fileservers can completely saturate Gbit links. Our own fileserver in the Sizing Servers Lab is routinely transmitting 120MB/s (a saturated 1Gbit/s link). The faster the fileserver is, the shorter the waiting time to deploy images and install additional software. So if we want to consolidate these kinds of workloads on the newest “über machines”, we need something better than one or two gigabit connections for 40 applications.
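To get a feel for what such a consolidation means, the rough peak demand of a mixed set of VMs can be added up. This is a back-of-the-envelope sketch: the database and fileserver figures are the ones quoted above, 30Mbit/s for a web server is our assumption for "a few tens of Mbit/s", and the mix of 40 VMs is hypothetical rather than a measured deployment.

```python
# Peak demand per workload type in Mbit/s (web figure is an assumption).
PEAK_MBIT = {"database": 10, "web": 30, "fileserver": 1000}

# Hypothetical mix of 40 VMs on one host.
vm_mix = {"database": 20, "web": 15, "fileserver": 5}


def aggregate_peak_mbit(mix, peak):
    """Sum of the peak demands, as if every VM peaked at the same time."""
    return sum(count * peak[kind] for kind, count in mix.items())


demand = aggregate_peak_mbit(vm_mix, PEAK_MBIT)
for name, capacity in [("4 x 1Gbit", 4000), ("1 x 10Gbit", 10000)]:
    verdict = "fits" if demand <= capacity else "oversubscribed"
    print(f"{name}: demand {demand} of {capacity} Mbit/s -> {verdict}")
```

Peaks rarely coincide in practice, so this is a worst case; it mainly shows that a handful of fileservers dominates the total.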

Optical 10Gbit Ethernet – 10GBase-SR/LR – saw the light of day in 2002. Similar to optical Fibre Channel in the storage world, it was very expensive technology. Somewhat more affordable, 10G on “InfiniBand-ish” copper cable (10GBase-CX4) was born in 2004. In 2006, 10GBase-T held the promise that 10G Ethernet would become available on ordinary copper UTP cables. That promise has still not materialized in 2010; CX4 is by far the most popular copper based 10G Ethernet. The reason is that the 10GBase-T PHYs need too much power. The early 10GBase-T solutions needed up to 15W per port! Compare this to the 0.5W that a typical gigabit port needs, and you'll understand why you find so few 10GBase-T ports in servers. Broadcom reported a breakthrough just a few weeks ago, claiming that its newest 40nm PHYs use less than 4W per port. Still, it will take a while before 10GBase-T conquers the world, as this kind of state-of-the-art technology needs some time to mature.

We decided to check out some of the more mature CX4-based solutions, as they are decently priced and require less power. For example, a dual-port CX4 card goes as low as 6W… that is 6W for the controller, two ports and the rest of the card. So a complete dual-port NIC needs considerably less than one of the early 10GBase-T ports. But back to our virtualized server: can 10Gbit Ethernet offer something that the current popular quad-port gigabit NICs can’t?

Adapting the network layers for virtualization

When lots of VMs are hitting the same NIC, quite a few performance problems may arise. First, one network intensive VM may completely fill up the transmit queues and block the access to the controller for some time. This will increase the network latency that the other VMs see. The hypervisor has to emulate a network switch that sorts and routes the different packets of the various active VMs. Such an emulated switch costs quite a bit of processor performance, and this emulation and other network calculations might all be running on one core. In that case, the performance of this one core might limit your network bandwidth and raise network latency. That is not all, as moving data around without being able to use DMA means that the CPU has to handle all memory move/copy actions too. In a nutshell, a NIC with one transmit/receive queue and a software emulated switch is not an ideal combination if you want to run lots of network intensive VMs: it will reduce the effective bandwidth, raise the NIC latency and increase the CPU load significantly.


Without VMDq, the hypervisor has to emulate a software switch. (Source: Intel VMDQ Technology)
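The latency effect of a single shared transmit queue can be illustrated with a toy model. This is a minimal sketch, not a model of any real NIC or hypervisor: it assumes a strict FIFO queue, a 1Gbit/s link and 1500-byte packets with headers ignored, and all names are ours.

```python
from collections import deque

LINK_MBIT = 1000          # one gigabit port
PACKET_BITS = 1500 * 8    # full-size Ethernet payload, headers ignored


def wait_behind_burst_ms(burst_packets: int) -> float:
    """Milliseconds a quiet VM's packet waits while a burst ahead of it drains."""
    queue = deque(["noisy-vm"] * burst_packets)
    queue.append("quiet-vm")
    waited_bits = 0
    # Strict FIFO: everything queued earlier goes onto the wire first.
    while queue[0] != "quiet-vm":
        queue.popleft()
        waited_bits += PACKET_BITS
    return waited_bits * 1000 / (LINK_MBIT * 1_000_000)


print(wait_behind_burst_ms(1000))  # 12.0
```

In this model, a burst of 1000 full-size packets from one VM delays the next packet of every other VM by 12 ms, far more than the latency the wire itself adds.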
 

Several companies have solved this I/O bottleneck by making use of multiple queues. Intel calls it VMDq; Neterion calls it IOV. A single NIC controller is equipped with different queues. Each receive queue can be assigned to a virtual NIC of your VM and mapped to the guest memory of your VM. Interrupts are load balanced across several cores, avoiding the problem that one CPU is completely overwhelmed by the interrupts of tens of VMs.


With VMDq, the NIC becomes a Layer 2 switch with many different Rx/Tx queues. (Source: Intel VMDQ Technology)
 

When packets arrive at the controller, the NIC’s Layer 2 classifier/sorter sorts the packets and places them (based on the virtual MAC addresses) in the queue assigned to a certain VM. Layer 2 routing is thus done in hardware and not in software anymore. The hypervisor looks in the right queue and then routes those packets towards the right VM. Packets that have to go out of your physical server are placed in the transmit queues of each VM. In the ideal situation, each VM has its own queue. Packets are sent to the physical wire in a round-robin fashion.
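The receive-side sorting and the round-robin transmit described above can be sketched in a few lines. This is a toy software model, assuming one Rx and one Tx queue per VM; the class, method names and MAC addresses are hypothetical, and real hardware does all of this in silicon.

```python
from collections import deque


class MultiQueueNic:
    """Toy model of a NIC with one Rx and one Tx queue per VM."""

    def __init__(self, mac_to_vm):
        self.mac_to_vm = dict(mac_to_vm)
        self.rx = {vm: deque() for vm in self.mac_to_vm.values()}
        self.tx = {vm: deque() for vm in self.mac_to_vm.values()}

    def receive(self, dst_mac, payload):
        """Layer 2 sorting: pick the Rx queue from the destination MAC."""
        vm = self.mac_to_vm.get(dst_mac)
        if vm is None:
            return False  # unknown MAC; handling is hardware specific
        self.rx[vm].append(payload)
        return True

    def send(self, vm, payload):
        """A VM places an outgoing packet in its own Tx queue."""
        self.tx[vm].append(payload)

    def drain_tx(self):
        """Put queued packets on the wire, one per VM per round."""
        wire = []
        while any(self.tx.values()):
            for vm, queue in self.tx.items():
                if queue:
                    wire.append((vm, queue.popleft()))
        return wire


nic = MultiQueueNic({"00:50:56:00:00:01": "vm1", "00:50:56:00:00:02": "vm2"})
nic.receive("00:50:56:00:00:02", "hello")
for i in range(3):
    nic.send("vm1", f"bulk-{i}")
nic.send("vm2", "ping")
order = nic.drain_tx()
print(order)
# [('vm1', 'bulk-0'), ('vm2', 'ping'), ('vm1', 'bulk-1'), ('vm1', 'bulk-2')]
```

Note how the single packet from vm2 leaves after just one packet from vm1, instead of waiting behind the whole burst as it would in a single shared queue.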

The hypervisor has to support this and your NIC vendor must of course have an “SR-IOV” capable driver for the hypervisor. VMware ESX 3.5 and 4.0 have support for VMDq and similar technologies, calling it “NetQueue”. Microsoft Windows Server 2008 R2 supports this too, under the name “VMQ”.


49 Comments


  • radimf - Wednesday, March 10, 2010 - link

    HI,
    thanks for article!
    Btw I am reading your site because of your virtualization articles.

    I planned almost 3 years ago for IT project with only a 1/5 of complete budget for small virtualization scenario.
    If you want redundancy, It can´t get much simplier than that:
    - 2 ESX servers
    - one SAN + one NFS/iSCSI/potentially FC storage for D2D backup
    - 2 TCP switches, 2 FC switches

    world moved, IT changed, EU dotation took too long to process - we finished last summer what was planned years ago...

    My 2 cents from small company finishing small IT virtualization project?
    FC saved my ass.

    iSCSI was on my list (DELL gear), but went FC instead(HP) for lower total price (thanks crisis :-)

    HP hardware looked sweet on specs sheets, and actual HW is superb, BUT.... FW sucked BIG TIME.
    IT took HP half year to fix it.

    HP 2910al switches do have option for up to 4 10gbit ports - that was the reason I bought them last summer.
    Coupled with DA cables - very cheap solution how to get 10gbit to your small VMware cluster. (viable 100% now)

    But unfortunatelly FW (that time) sucked so much, that 3 out of 4 supplied DA cables did not work at all (out of the box).
    Thanks to HP - they changed our DA for 10gbit SFP+ SR optics! :-)

    After installation we had several issues with "dead ESX cluster".
    Not even ping worked!
    FC worked flawlessly through these nightmares.
    Swithces again...
    Spanning tree protocol bug ate our cluster.

    Now we are happy finally. Everything works as advertised.
    10gbit primary links are backed up by 1gbit stand-by.
    Insane backup speeds of whole VMs compared to our legacy SMB solution to nexenta storage appliance.
  • JohanAnandtech - Monday, March 08, 2010 - link

    Thank you. Very nice suggestion especially since we already started to test this out :-). Will have to wait until April though, as we got a lot of server CPU launches this month;
  • Lord 666 - Monday, March 08, 2010 - link

    Aren't the new 32nm Intel server platforms coming with standard 10gbe nics? After my SAN project, going to phase in the new 32nm cpu servers mainly for AES-NI. The 10gbe nics would be an added bonus.
  • hescominsoon - Monday, March 08, 2010 - link

    It's called xsigo(pronounced zee-go) and solves the i/o issue you are tying to solve here for vm i/o bandwidth.
  • JohanAnandtech - Monday, March 08, 2010 - link

    Basically, it seems like using infiniband to connect each server to an infinibandswitch. And that infiniband connection is then used by a software which offers both a virtual HBA and a virtual NIC. Right? Innovative, but starting at $100k, looks expensive to me.
  • vmdude - Monday, March 08, 2010 - link

    "Typically, we’ll probably see something like 20 to 50 VMs on such machines."

    That would be a low vm per core count in my environment. I typically have 40 vms or more running on a 16 core host that is populated with 96 GB of Ram.
  • ktwebb - Sunday, March 21, 2010 - link

    Agreed. With Nahalems it's about a 2 VM's per core ratio in our environment. And that's conservative. At least with vSphere and overcommit capabilities.
  • duploxxx - Monday, March 08, 2010 - link

    All depends on design and application type, we typically have 5-6 VM's on a 12 core 32GB machine and about 350 of those, running in a constant 60-70% CPU utilization range.
  • switcher - Thursday, July 29, 2010 - link

    Great article and comments.

    Sorry I'm so late to this thread, but I was curious to know what the vSwitch is doing during the benchmark? How is it configured? @emuslin notes that SR-IOV is more than just VMDq, and AFAIK the Intel 82598EB doesn't support SR-IOV so what we're seeing it the boost from NetQueue. What support for SR-IOV is there in ESX these days?

    I'd be nice to see SR-IOV data too.
