VMware's Fault Tolerance feature explained
by Liz van Dijk on February 27, 2009 12:00 AM EST
Now that the actual conference is behind us, and we've found our way back to the lab, it's time to finish what we started. First off, an apology for our radio silence on day 3: our schedule turned out to be quite a bit more packed than we thought it was, so finding our way to the quiet of the press room proved to be more of a hurdle than originally expected.
Since our main objective in attending the conference was to learn as much about virtualization as possible, rather than simply cover news flashes, we spent a lot of time in the breakout sessions, and I'm hoping to pour those into an article (or series of blog posts) for you as soon as possible.
On with the show! In my last blog post, I wrote about the first part of VMware's cloud strategy: vCenter and vSphere, the successors to today's Virtual Center and Virtual Infrastructure. Back then, I wondered just how exactly Fault Tolerance would be implemented, and in case you missed reader duploxxx's comments and my own, I'll repeat what we learned here.
Essentially, most of the Fault Tolerance technology is derived from the Record/Replay feature present in VMware Workstation 6, which allows users to record a set of actions on a VM and reproduce them exactly. As Lionel Cavalliere explained to us, what it comes down to is the hypervisor logging every single CPU instruction happening in the primary VM, while a floating IP (think of failover clusters) helps vCenter's virtual switch pass traffic on to the correct machine. Between the two machines, a private network (preferably as fast as possible) should be set up so that the primary vSphere can send the recorded instructions to the one carrying the shadow VM. In the breakout session, it was explained that the primary VM never performs any IO without the shadow VM first acknowledging the instructions. The primary and shadow VM then both perform the IO, but the shadow's actions are suppressed by its hypervisor.
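To make the acknowledge-before-IO ordering a bit more concrete, here is a minimal Python sketch of the idea as we understood it. It is a toy model, not VMware's implementation: `ToyVM`, `LoggingLink` and `run_lockstep` are names made up for this illustration, the "VM" is just a running total, and the private logging network is replaced by a direct method call.

```python
from dataclasses import dataclass, field


@dataclass
class ToyVM:
    """Hypothetical deterministic 'VM' whose whole state is a running total."""
    name: str
    suppress_io: bool = False                       # True for the shadow VM
    state: int = 0
    emitted: list = field(default_factory=list)     # IO that reaches the outside world
    suppressed: list = field(default_factory=list)  # IO swallowed by the hypervisor

    def execute(self, event: int) -> None:
        # The same events applied in the same order give the same state.
        self.state += event

    def perform_io(self) -> None:
        output = f"total={self.state}"
        if self.suppress_io:
            self.suppressed.append(output)
        else:
            self.emitted.append(output)


class LoggingLink:
    """Stand-in for the private network between the two hosts (assumption)."""

    def __init__(self, shadow: ToyVM) -> None:
        self.shadow = shadow

    def send(self, event: int) -> bool:
        self.shadow.execute(event)  # the shadow replays the logged event
        return True                 # acknowledgement back to the primary


def run_lockstep(events):
    primary = ToyVM("primary")
    shadow = ToyVM("shadow", suppress_io=True)
    link = LoggingLink(shadow)
    for event in events:
        primary.execute(event)
        # The primary withholds its IO until the shadow has acknowledged.
        if not link.send(event):
            raise RuntimeError("no acknowledgement from shadow; IO withheld")
        primary.perform_io()  # visible to the outside world
        shadow.perform_io()   # performed too, but suppressed
    return primary, shadow
```

Calling `run_lockstep([3, 4, 5])` leaves both toy VMs in the same state, with the primary's outputs in `emitted` and the shadow's identical outputs in `suppressed`.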
As vCenter's task is to monitor the state of all the vSpheres in the network, it will notice when the primary VM goes down due to a hardware failure and will issue a broadcast on the network for all traffic to be rerouted to the now operational shadow VM.
Thanks to Tijl Deneut for this image of the Fault Tolerance module in vCenter!
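The failover step described above can be sketched in the same spirit. This is only an illustration under stated assumptions: `ToyMonitor` and the host names are hypothetical, and the real monitoring and network broadcast done by vCenter are reduced to a single check against a set of hosts that still respond.

```python
class ToyMonitor:
    """Hypothetical stand-in for the monitoring role described for vCenter."""

    def __init__(self, primary: str, shadow: str) -> None:
        self.active = primary    # host currently receiving traffic
        self.standby = shadow    # host carrying the shadow VM

    def check(self, alive_hosts: set) -> str:
        # If the active host stopped responding and the standby is healthy,
        # promote the standby so traffic is rerouted to it.
        if self.active not in alive_hosts and self.standby in alive_hosts:
            self.active, self.standby = self.standby, None
        return self.active


monitor = ToyMonitor("host-a", "host-b")
before = monitor.check({"host-a", "host-b"})  # both hosts healthy
after = monitor.check({"host-b"})             # host-a has failed
```

In this sketch, traffic stays on `host-a` while it is alive and moves to `host-b` once `host-a` drops out of the set of responding hosts.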
As expected from this sort of heavy-duty logging, there is quite a noticeable performance hit, and at this point, complexity issues have made it impossible to enable this feature on a virtual machine running more than a single vCPU, leaving quite a lot of room for improvement.
We've been asked before whether this feature, when fully functional, will remove the need for any other High Availability measures. The answer to that is a pretty conclusive "no". VMware told us they have no interest in making their software "intrusive" to the point where they are able to provide a failover solution for applications. Fault Tolerance is meant to keep the VM safe from an unexpected hardware failure. Software failures will simply be reproduced on the shadow VM, rendering it useless for recovery. Clustering applications will at this point still be necessary, it seems.
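Why a software failure follows the VM onto its shadow is easy to show with a toy replay. The sketch below assumes a made-up guest function, `guest_step`, with a deliberate bug (it divides by the incoming event); nothing here reflects actual VMware code.

```python
def guest_step(state: int, event: int) -> int:
    """Hypothetical guest logic with a software bug: it fails on event 0."""
    return state + 100 // event


def replay(events):
    """Run the guest over a logged event stream and report how it ended."""
    state = 0
    for index, event in enumerate(events):
        try:
            state = guest_step(state, event)
        except ZeroDivisionError:
            return ("crashed", index)
    return ("ok", state)


log = [5, 4, 0, 2]             # the third event triggers the bug
primary_result = replay(log)   # what happens on the primary
shadow_result = replay(log)    # the shadow replays the very same log
```

Because the shadow replays exactly what the primary executed, both runs end in the same crash at the same event, which is why application-level clustering still has a role.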
Check back soon for part 2 of the second day's keynote!