Last but not least in our discussion: the use of proper software, and its configuration. ESX offers a veritable waterfall of settings for those willing to dig in and tweak them, but they should definitely be used with care. Furthermore, it holds quite a few surprises, both good and bad (like the sequential read performance drop discussed earlier), for heavy consolidators that warrant a closer look.

First of all, though we mentioned before that the Monitor remains the same across VMware's virtualization products, not all of them are suited to the same purposes. VMware Server and Workstation, sturdy products though they may be, are in no capacity meant to rival ESX in performance and scalability, and yet perfectly viable testing setups are quite often discarded because of their inferior performance. Hosted virtualization products are forced to comply with the host OS's scheduling mechanics, making them adequate for proofs of concept and development sandboxes, but never intended as a high-performance alternative to running natively.

Secondly, there are some very important choices to make when installing and running new VMs. Though the "Guest OS" drop-down list when setting up a new VM may seem like an unnecessary extra, it actually determines the choice of monitor type and a plethora of optimizations and settings, including the choice of storage adapter and the specific type of VMware Tools that will be installed. For that reason it is important to choose the correct OS, or at least one as close as possible to the one that will actually be installed. A typical pitfall is selecting Windows 2000 as the guest operating system, but installing Windows 2003. This forces Windows 2003 to run with Binary Translation and Shadow Page Tables, even though hardware-assisted virtualization is available.
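
Under the hood, that drop-down choice ends up as the guestOS key in the VM's .vmx configuration file. As a minimal sketch (the exact value strings are assumptions here and vary between ESX versions), the difference looks like this:

    # Illustrative .vmx excerpt; value strings differ between ESX versions.
    # An entry along these lines tells the Monitor to optimize for Windows 2003:
    guestOS = "winNetStandard"
    # whereas a Windows 2000 value would steer it toward Binary Translation:
    # guestOS = "win2000Serv"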


Preventing interrupts

Thirdly, as it turns out, not all OSes are equally fit to be virtualized. An OS relies on timer interrupts to maintain control of the system, and knowing how sensitive ESX is to interrupts, we need to make sure we are not using an OS that pushes this over the top. A standard Windows installation will send about 100 interrupts per second to every vCPU assigned to it, but some Linux 2.6 distributions (for example RHEL 5) have been known to send over 1000 per second, per vCPU. This generates considerable extra load for ESX, which has to keep up with the VM's demands through a software-based timer rather than a hardware-based one. Luckily, this issue has been addressed in later releases of Red Hat (5.1 onward), where a divider can be configured to reduce the number of interrupts initiated.
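
On those releases the divider is set as a kernel boot parameter. A minimal sketch of what that looks like in the boot loader configuration (the kernel version and paths shown are assumptions specific to each installation):

    # /boot/grub/grub.conf excerpt (illustrative kernel version and paths).
    # divider=10 drops the timer rate from 1000Hz to 100Hz, cutting the
    # interrupt load the hypervisor has to emulate for this VM.
    kernel /vmlinuz-2.6.18-92.el5 ro root=/dev/VolGroup00/LogVol00 divider=10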

For the same purpose, avoid adding any hardware devices to the VM that it doesn't really need (be they USB or CD-ROM). All of these cause interrupts the VM could do without, and even though their cost has been reduced significantly, taking steps to prevent them goes a long way.
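
Virtual hardware can be stripped in the VI client, or directly in the VM's .vmx file while it is powered off. A hedged sketch (the exact key names are assumptions and depend on how the devices were added):

    # Illustrative .vmx entries; exact keys depend on the device layout.
    usb.present = "FALSE"        # no virtual USB controller
    ide1:0.present = "FALSE"     # no virtual CD-ROM drive
    floppy0.present = "FALSE"    # no virtual floppy drive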

Scheduling

A fun part of letting multiple systems make use of the same physical platform is thinking about the logistics required to make that happen. To better grasp the way ESX makes its scheduling decisions, it is worth digging a bit deeper into the underlying architecture. Though it is obviously built to be as robust as possible, there are still some ways to give it a hand.

Earlier in this article, we discussed NUMA, and how to make sure a VM is not making unnecessary node switches. The VMkernel's scheduler is built to support NUMA as well as possible, but how does that work, and why are node switches sometimes impossible to prevent?

Up to and including ESX 3.5, it has been impossible to create a VM with more than 4 vCPUs. Why is that? Because VMware locks vCPUs into so-called cells, which force them to "live" together on a single socket. These cells are in reality no more than a construct, grouping physical CPUs into a limited set and preventing scheduling outside the "cell". In ESX 3.5, the standard cell size is 4 physical CPUs, with one cell usually corresponding to one socket. This means that on dual-core systems, a cell of size 4 would span 2 sockets.

The upside of a cell size of 4 on a quad-core NUMA system is that VMs will never accidentally get scheduled on a "remote" socket. Because one cell is bound to one socket, and a VM can never leave its assigned cell, this prevents the potential overhead involved in socket migrations.

The downside of cell sizing is that it can severely limit the scheduling options available to ESX when the number of physical cores is no longer a power of 2, or when the cells get too cramped to allow several VMs to be scheduled in a single timeslot.

With standard settings, a dual-socket, 6-core Intel Dunnington or AMD Istanbul system would be divided into 3 cells: one bound to each socket, and one spanning the two sockets. This puts the VMs stationed in the latter at a disadvantage, because their inter-vCPU communication has to cross the socket boundary and slows down, making scheduling "unfair".
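
To make this concrete, here is a minimal sketch of how 12 cores carve up into cells of 4 (the linear core numbering, cores 0-5 on socket 0 and 6-11 on socket 1, is an assumption):

    # Sketch: carving a dual-socket, 6-core-per-socket box into cells of 4.
    CORES_PER_SOCKET = 6
    CELL_SIZE = 4
    cores = list(range(2 * CORES_PER_SOCKET))
    cells = [cores[i:i + CELL_SIZE] for i in range(0, len(cores), CELL_SIZE)]
    for cell in cells:
        sockets = sorted({core // CORES_PER_SOCKET for core in cell})
        print(cell, "-> socket(s)", sockets)
    # [0, 1, 2, 3] -> socket(s) [0]
    # [4, 5, 6, 7] -> socket(s) [0, 1]   <- the cell that spans both sockets
    # [8, 9, 10, 11] -> socket(s) [1]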


Luckily, it is possible to change the standard cell size to better suit these hexacores: go into Advanced Settings in the VI client, select VMkernel and set VMkernel.Boot.cpuCellSize to 6. The change takes effect as soon as the ESX host is rebooted, allowing 4-way VMs to be scheduled far more freely on a single socket, without allowing them to migrate off it.
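
The same option should be reachable from the service console as well; a hedged sketch, assuming the usual advanced-option path layout (verify the path with the get flag before setting anything):

    # Service console sketch; the /VMkernel/Boot/cpuCellSize path is an assumption.
    esxcfg-advcfg -g /VMkernel/Boot/cpuCellSize   # read the current cell size
    esxcfg-advcfg -s 6 /VMkernel/Boot/cpuCellSize # set it to 6; reboot to apply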

Changing the cell size to reflect the number of cores on Istanbul boosted performance in our vApus Mark I test by up to 25%. This improvement is easily explained by the number of scheduling possibilities it adds: when trying to fit a 4-way VM into a cell of 4 physical cores, there is only ever one scheduling choice. When fitting that same VM into a cell of 6 physical cores, there are suddenly 15 different ways to schedule it inside that cell, allowing the scheduler to pick the optimal configuration in any given situation.
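
Those counts are plain combinatorics: placing 4 vCPUs on n cores gives "n choose 4" possibilities. A quick sketch to verify (requires Python 3.8+ for math.comb):

    # Ways to place a 4-vCPU VM inside a cell of n physical cores: C(n, 4).
    from math import comb

    for cell_size in (4, 6):
        print(cell_size, "cores:", comb(cell_size, 4), "placements")
    # 4 cores: 1 placements
    # 6 cores: 15 placements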



People who have made the switch to vSphere may have noticed that it is no longer possible to change the cell size: VMware has reworked the way the scheduler operates, and it now configures itself automatically to best match the socket size.

Comments

  • zdzichu - Tuesday, June 30, 2009

    True, for quite some time Linux has been tickless and doesn't generate unneeded timer interrupts. This change went into 2.6.21, which was released TWO YEARS ago. http://kernelnewbies.org/Linux_2_6_21#head-8547911...
  • yknott - Tuesday, June 30, 2009

    Technically, Linux is NOT tickless; dynticks only means that when no interrupts are occurring and the CPU is idle, no timer interrupts are fired. When the CPU is in use, tick interrupts are still fired at 1000Hz.

    To your point, this is still a huge advantage when it comes to virtualization. Most of the time CPUs are idle, and not having the underlying hypervisor process ticks from each idle VM frees up processing power for the VMs that DO need the CPU time.

    I also agree that Red Hat definitely needs to keep up with the kernel patches. I understand that there is some lag due to regression testing etc., but two years seems a bit much.
  • yknott - Monday, June 29, 2009

    Thornburg,

    I think what Liz was talking about has to do with the tick interrupt under Linux. Since the 2.6.x kernel, this was set to a default of 1000Hz, or 1000 times a second.

    I don't believe this means you shouldn't use Linux, as you can change the tick rate either in the kernel or at boot time. For example, under RHEL 5, just set divider=10 in your boot options to get a 100Hz tick rate.

    You can read more about this in VMware's timekeeping article here: http://www.vmware.com/pdf/vmware_timekeeping.pdf

    Check out pages 11-12 for more info.

    Liz, while that paragraph makes sense, perhaps it doesn't tell the whole story about tick rate and interrupts under VMware. While I agree that running at a lower tick rate is ideal, it may be worth mentioning that the interrupt rate is adjustable on most OSes.
