Analysis: "Nehalem" vs. "Shanghai"

The Xeon X5570 outperforms the best Opterons by 20%, and 17% of the gain comes from Hyper-Threading. That's decent but not earth-shattering. Let us first set expectations. What should we have expected from the Xeon X5570? We can get a first idea by looking at the "native" (non-virtualized) scores of the individual workloads. Our last server CPU roundup showed that, compared to a Xeon E5450 3GHz, the Xeon X5570 2.93GHz is:

  • 94% faster in Oracle Calling Circle
  • 107% faster in an OLAP SQL Server benchmark
  • 36% faster on the MCS eFMS web portal test

If we simply took a geometric mean of these benchmarks and forgot that we are running on top of a hypervisor, we would expect a 65% advantage for the Xeon X5570. Our virtualization benchmark shows a 31% advantage for the Xeon X5570 over the Xeon E5450. What happened?
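As a quick check on that expectation, the geometric mean can be recomputed from the three native speedups listed above. The sketch below is only illustrative: a plain mean over the three results lands at roughly 76%, while counting the web portal test twice lands at roughly the quoted 65%. That double weighting is an assumption made here for illustration; the text does not state how the benchmarks were weighted.

```python
import math

# Native speedups of the Xeon X5570 over the Xeon E5450, taken from the list
# above ("94% faster" becomes a ratio of 1.94, and so on).
speedups = {
    "oltp_oracle_calling_circle": 1.94,
    "olap_sql_server": 2.07,
    "web_portal_mcs_efms": 1.36,
}


def geometric_mean(ratios):
    """Return the geometric mean of a sequence of positive ratios."""
    ratios = list(ratios)
    return math.exp(sum(math.log(r) for r in ratios) / len(ratios))


# Plain geometric mean: every benchmark counts once.
equal_weight = geometric_mean(speedups.values())

# ASSUMPTION (not stated in the text): the web portal result counts twice,
# as it would if the virtualized test ran two web portal VMs.
web_twice = geometric_mean(
    list(speedups.values()) + [speedups["web_portal_mcs_efms"]]
)

print(f"three benchmarks, equal weight: {equal_weight - 1:.0%} faster")
print(f"web portal counted twice:       {web_twice - 1:.0%} faster")
```

Either way, the native numbers predict far more than the 31% measured under the hypervisor, which is the gap the rest of this analysis tries to explain.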

It seems like all the advantages of the new platform, such as fast CPU interconnects, NUMA, integrated memory controllers, and L3 caches for fast syncing, have evaporated. In a way, this is the case. You have probably noticed the second flaw (besides ignoring the hypervisor) in the reasoning above: the "native scores" in our server CPU roundup are obtained on eight physical (16 logical) cores. Assuming that four virtual CPUs will show the same picture is indeed inaccurate. The effect of fast CPU interconnects, NUMA, and massive bandwidth increases will be much smaller in a virtualized environment where you limit each application to four CPUs. In this situation, if the ESX scheduler is smart (and that is the case), it will not have to sync between L3 caches and CPU sockets. In our native benchmarks, the application has to scale to eight CPUs and has to keep the caches coherent over two sockets. This is the first reason for the smaller than expected performance gain: the Xeon X5570 cannot leverage some of its advantages, such as much quicker "syncing".

The fact that we are running on a hypervisor should give the Xeon X5570 a boost. The Nehalem architecture switches back and forth to the hypervisor about 40% quicker than the Xeon 54xx. It cannot leverage its best weapon though: Extended Page Tables are not yet supported in ESX 3.5 Update 4. They are supported in vSphere's ESX 4.0, which immediately explains why OEMs prefer to run VMmark on ESX 4.0. Most of our sources tell us that EPT gives a boost of about 25%. To understand this fully, you should look at our "Hardware virtualization: the nuts and bolts" article. The table below shows which mode the VMM (Virtual Machine Monitor), a part of the hypervisor, runs in. To refresh your memory:

  • SVM: Secure Virtual Machine, hardware virtualization for the AMD Opteron
  • VT-x: Same for the Intel Xeon
  • RVI: also called nested paging or hardware assisted paging (AMD)
  • EPT: Extended Page Tables or hardware assisted paging (Intel)
  • Binary Translation: well tweaked software virtualization that runs on every CPU, developed by VMware

Hypervisor VMM Mode

| ESX 3.5 Update 4    | 64-bit OLTP & OLAP VMs | 32-bit Web portal VM |
|---------------------|------------------------|----------------------|
| Quad-core Opterons  | SVM + RVI              | SVM + RVI            |
| Xeon 55xx           | VT-x                   | Binary Translation   |
| Xeon 53xx, 54xx     | VT-x                   | Binary Translation   |
| Dual-core Opterons  | Binary Translation     | Binary Translation   |
| Dual-core Xeon 50xx | VT-x                   | Binary Translation   |
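
For readers who prefer to query the table rather than read it, the sketch below transcribes it into a small lookup structure. The dictionary keys and function names are made up for this illustration and are not VMware identifiers or settings.

```python
# VMM mode per CPU family and guest type under ESX 3.5 Update 4,
# transcribed from the table above. Keys are illustrative labels only.
VMM_MODE = {
    "quad_core_opteron": {64: "SVM + RVI", 32: "SVM + RVI"},
    "xeon_55xx": {64: "VT-x", 32: "Binary Translation"},
    "xeon_53xx_54xx": {64: "VT-x", 32: "Binary Translation"},
    "dual_core_opteron": {64: "Binary Translation", 32: "Binary Translation"},
    "dual_core_xeon_50xx": {64: "VT-x", 32: "Binary Translation"},
}


def vmm_mode(cpu_family, guest_bits):
    """Return the VMM mode listed in the table for a CPU family and guest width."""
    return VMM_MODE[cpu_family][guest_bits]


def uses_hardware_paging(cpu_family, guest_bits):
    """True when the listed mode includes hardware-assisted paging (RVI or EPT)."""
    mode = vmm_mode(cpu_family, guest_bits)
    return "RVI" in mode or "EPT" in mode


# Which platforms get hardware-assisted paging for the 32-bit web portal VM?
with_hw_paging = [cpu for cpu in VMM_MODE if uses_hardware_paging(cpu, 32)]
print(with_hw_paging)
```

Under these table entries, only the quad-core Opterons run every VM with hardware-assisted paging, which is the point the following paragraph builds on.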

Thanks to being first with hardware-assisted paging, AMD gets a serious advantage in ESX 3.5: it can always leverage all of its virtualization technologies. Intel can only use VT-x with 64-bit guest OSes. The early VT-x implementations were pretty slow, and VMware abandoned VT-x for 32-bit guest OSes as binary translation was faster in a lot of cases. The prime reason why VMware didn't ditch VT-x altogether was the fact that Intel does not support segments -- a must for binary translation -- in x64 (EM64T) mode. This makes VT-x, or hardware virtualization, the only option for 64-bit guests. Still, the mediocre performance of VT-x on older Xeons punishes the Xeon X5570 with 32-bit guests: ESX falls back to binary translation there, even though the X5570 is faster with VT-x than with binary translation, as we will see further on.

So how much performance does the AMD Opteron extract from the improved VMM modes? We checked by either forcing or forbidding the use of "Hardware Page Table Virtualization", also called Hardware Virtualized MMU, EPT, NPT, RVI, or HAP.


Let's first look at the AMD Opteron 8389 2.9GHz. When you disable RVI, memory page management is handled the same way as all the other "privileged instructions" with hardware virtualization: it causes exceptions that make the hypervisor intervene. Each of those exceptions triggers a world switch to the hypervisor, so disabling RVI makes the impact of world switches more important. When you enable RVI, the VMM exposes all page tables (Virtual, Guest Physical, and "machine" physical) to the CPU. It is no longer necessary to generate (costly) exceptions and switches to the hypervisor code.

However, filling the TLB is very costly with RVI. When a certain logical page address or virtual address misses the TLB, the CPU performs a lookup in the guest OS page tables. Instead of the right physical address, you get a "Guest Physical address", which is in fact a virtual address. The CPU has to search the Nested Pages ("Guest Physical" to "Real Physical") for the real physical address, and it does this for each table lookup.
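
To get a feel for why a TLB miss is so expensive here, the walk described above can be counted step by step. The sketch below is a simplified model, not a description of Shanghai's actual page walker: it assumes four-level page tables on both the guest and the nested side (as with 4KB pages on x86-64) and ignores any page-walk caches or nested TLBs the hardware may use to shorten the walk.

```python
def native_walk_memory_refs(levels=4):
    """Memory references for one TLB miss without virtualization:
    one page-table entry read per level."""
    return levels


def nested_walk_memory_refs(guest_levels=4, nested_levels=4):
    """Worst-case memory references for one TLB miss under nested paging.

    ASSUMPTION: no page-walk cache or nested TLB shortens the walk.
    """
    refs = 0
    for _ in range(guest_levels):
        # The guest table lives at a guest-physical address, which must
        # first be translated through the nested tables.
        refs += nested_levels
        # Only then can the guest page-table entry itself be read.
        refs += 1
    # The final guest-physical address of the data page needs one more
    # nested translation to yield the real physical address.
    refs += nested_levels
    return refs


print("native walk:", native_walk_memory_refs())
print("nested walk:", nested_walk_memory_refs())
```

In this model a miss costs 24 memory references instead of 4, which is why the hit rate of the TLB matters so much more once RVI is enabled.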

To cut a long story short, it is very important to keep the percentage of TLB hits as high as possible. One way to do this is to decrease the number of memory pages by using "large pages". Large pages mean that your memory is divided into 2MB pages (x86-64, x86-32 PAE) instead of 4KB pages. This means that Shanghai's L1 data TLB can cover 96MB of data (48 entries times 2MB) instead of 192KB! Therefore, if there are a lot of memory management operations, it might be a good idea to enable large pages. Both the application and the OS must support this to give good results.
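
The coverage figures follow directly from entries times page size. The short sketch below redoes that arithmetic; it assumes, as the paragraph above does, that the same 48 data entries are available for either page size.

```python
KB = 1024
MB = 1024 * KB

# L1 data TLB entries for Shanghai, as quoted in the text.
L1_DATA_TLB_ENTRIES = 48


def tlb_reach(entries, page_size_bytes):
    """Amount of memory (in bytes) that a TLB can map without a miss."""
    return entries * page_size_bytes


reach_4k = tlb_reach(L1_DATA_TLB_ENTRIES, 4 * KB)
reach_2m = tlb_reach(L1_DATA_TLB_ENTRIES, 2 * MB)

print(f"4KB pages: {reach_4k // KB} KB covered")
print(f"2MB pages: {reach_2m // MB} MB covered")
print(f"large pages cover {reach_2m // reach_4k}x more memory")
```

The factor of 512 is simply the ratio between the two page sizes, so it applies to any TLB level that can hold large-page entries.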

Large Pages and RVI on AMD Opteron 8389 -- vApus Mark I

The effect of RVI is pretty significant: it improves our vApus Mark I score by almost 20%. The impact of large pages is rather small (3%), and this is probably a result of Shanghai's large TLB, consisting of a 96-entry (48 data, 48 instruction) L1 and a 512-entry L2 TLB. You could say there is less of a need for large pages in the case of the Shanghai Opteron.


66 Comments


  • tshen83 - Thursday, May 21, 2009 - link

    Jarred:

    Let's not fool each other. Johan's AMD bias is disgusting.

    My assertion that HardOCP killed the GPU market is simply trying to show you the effect of invalidating industry standard benchmarks. Architecturally, Nvidia's GPU bigger monolithic cores are far more advanced than ATI's cores right now. In GPGPU applications, it is not even close. The problem with gaming FPS benchmark as I have said is that developers are typically happy once the FPS reaches parity. It does not show architectural superiority.

    vApus? There are a ton of questions unanswered.
    1. Who wrote the software? (I assume European)
    2. Does the software scale linearly? And does the software scale on both AMD and Intel architecture?
    3. Why benchmark 4 Core Virtual machines when we know that VMware doesn't really scale that well themselves in SMP setup?
    4. Seriously? Nieuws.be OLAP database? How many real world people run Nieuws.be?

    I usually don't respond to Anandtech articles unless the article is disgustingly stupid. I also don't understand why you guys can't accept the fact that Nehalem is in fact 100% performance/watt improved vs the previous generation Xeon. It is backed by data from more than one industry standard benchmark.

    Is AMD worth a look today? No, absolutely not. If you are still considering anything AMD today, you are an idiot. (The world is full of idiots) AMD's only chance is if they can release the G34 socket platform within a TDP range that is acceptable before they run out of cash.

    Before you call me a troll, remind yourself this: usually the troll is smarter than the people he/she is trolling. So ask yourself this question: did Johan deserve the negative criticism?
  • JarredWalton - Thursday, May 21, 2009 - link

    You criticize every one of his articles, often because I'm not sure your reading comprehension is up to snuff. His "AMD bias" is not disgusting, though I'm quite sure your Intel bias is far worse than his AMD bias. The reason 3DMark has been largely invalidated is that it doesn't show realistic performance - though some of the latest versions scale similarly to some games, at best 3DMark measures 3DMark performance. Similarly, VMmark measures VMmark performance. Unless your workload is the same as VMmark, it doesn't really tell you much.

    1 - Who wrote the software? According to the article, "vApus or Virtual Application Unique Stresstest is a stress test developed by Dieter Vandroemme, lead developer of the Sizing Server Lab at the University College of West-Flanders." His being European has nothing to do with anything at all, unless you're a racist, bigoted fool.

    2 - 2-tile and 3-tile testing is in the works. It will take time.

    3 - Perhaps because there are companies looking for exactly that sort of solution. I guess we should only test situations where VMware performs optimally?

    4 - The source of the database is not so critical as the fact that it is a real-world database. Whether Johan uses a DB from Nieuws.be, AnandTech.com, Cnet.com, or some other source isn't particularly meaningful. It is a real setup used outside of benchmarking, and he had access to the site.

    I usually don't respond to trolls unless they are disgustingly stupid as well. I don't understand why you can't accept the fact that Nehalem isn't a panacea that fixes all the world's woes. That is backed by the world around us which continues to have all sorts of problems, and a "greener" CPU isn't going to save the environment any more than unplugging millions of cell phone chargers that each consume 0.5W of power or less.

    AMD is certainly worth a *look* today. Will you actually end up purchasing AMD? That depends largely on your intended use. I have old Athlon 64/X2 systems that do everything that they need to do. For a small investment, you can build a much better AMD HTPC than Intel - mostly because the cheap Intel platform boards are garbage. I'd take a lesser CPU with a better motherboard any day over a top-end CPU with a crappy motherboard. If you want a system for less than $300, the motherboards alone would make me tend towards AMD.

    Of course, that completely misses the point that this isn't even remotely related to that market. Servers are in another realm, and features and support are critical. If you have a choice between AMD quad socket and Intel dual socket, and the price is the same, you might want the AMD solution. If you have existing hardware that can be upgraded to Shanghai without changing anything other than the CPU, you might want AMD. If you're buying new, you'd want to look at as much data as possible.

    Xeon X5570 still surpasses AMD in the initial tests by over 30%, which is not insignificant. If that extends to 50% or more in 2-tile and 3-tile setups, it's even more in Intel's favor. However, a 30% advantage is hardly out of line with the rest of the computing world. SYSmark 2007 shows the i7 965 beating the Phenom II 955 by 26.6%. Photoshop CS4 shows a 48.7% difference. DivX is 35.3%, xVid is 15.9% pass1 and 65.4% pass2, and WME9 is 25%. 3dsmax is 55.8%, CINEBENCH is 42%, and POV-ray is 65.3%.

    Which of those tests is a best indication of true potential for Core i7? Well, ALL OF THEM ARE! What's the best virtualization performance metric out there? Or the best server benchmark out there? They're ALL important and useful. vApus is just one more item to look at, and it still shows a good lead for Intel.

    Where is the 100% perf/watt boost compared to last generation? Well, it's in an application where i7 can stretch its eight threaded muscles. Compared to AMD, the performance/watt benefit for an entire system is more like 40% on servers. For QX9770, i7 965 is 32% more perf/watt in Cinebench, or 37.6% in Xvid. I doubt you can find a 100% increase in performance/watt without cherry-picking the benchmark and CPUs in question, but that's what you're already determined to do. That, my friend, is true bias - when you can't even admit that anything from the competition might be noteworthy, you are obviously wearing blinders.
  • Zstream - Thursday, May 21, 2009 - link

    Umm based on your two rants this means you have ZERO knowledge working with virtual desktops/terminal servers/virtual applications.

    I feel I need to make two corrections.

    One: ATI's die size is roughly 75% of Nvidia's, how do you conclude that Nvidia is better? Well honestly you can not because if you scale the performance and had the same die size of Nvidia, then ATI would be killing them.

    Second: Majority of enterprise's run AMD and Intel, in fact not till Neh. did Intel really come into the virtualization market.
  • tshen83 - Thursday, May 21, 2009 - link

    "Umm based on your two rants this means you have ZERO knowledge working with virtual desktops/terminal servers/virtual applications. "

    Really? Just how did you come up with this revelation?

    "One: ATI's die size is roughly 75% of Nvidia's, how do you conclude that Nvidia is better? Well honestly you can not because if you scale the performance and had the same die size of Nvidia, then ATI would be killing them. "

    You don't know shit about GPUs.

    "Second: Majority of enterprise's run AMD and Intel, in fact not till Neh. did Intel really come into the virtualization market. "

    True. That's what I am saying too, if you listened. I said, "no one should be considering AMD today because Nehalem is here".
  • Zstream - Thursday, May 21, 2009 - link

    I came to that conclusion based on your incoherent rants.

    Why would you say I do not know shit about GPUs? I provided you a fact; your illogical thinking does not change the matter. It comes down to die size, and ATI wins on performance per die area. If you would like to argue that claim with me, then please do so.

    Who would consider Neh in todays market? Very few, unless you are a self proclaimed millionaire who crazily spends or needing the extra performance boost in some applications like exchange.
  • Viditor - Thursday, May 21, 2009 - link

    Guys, it's tshen...nobody over the age of 12 listens to his rants anyway, so don't feed the troll (or ban him if you can...).
  • leexgx - Thursday, May 21, 2009 - link

    LOL nice rant

    3DMark can't be used any more as it's not a 3D mark any more; it's more like a 3D GPU/CPU mark, since the CPU can sway the total result.

    AMD CPUs have been using a dedicated bus that talks to each other CPU socket and has direct access to the RAM. Also, AMD does have AMD-V as well on all AMD64 AM2 CPUs as well as Opterons (barring Sempron).
  • Makaveli - Thursday, May 21, 2009 - link

    Ya what is the post all about.

    HardOCP killed the GPU market? I don't know about you but I never bought a videocard because of its 3dmark score. It's one benchmark that both companies cater to but is of little importance. Hardocp review method has much more valuable data for me than one benchmark.

    Let me ask you this: when you are buying a car or anything of significant value, do you not do your homework? Is one review, positive or negative, enough to make you drop your hard-earned cash?

    If so Bestbuy is that way!

    As for the rest of your post, the personal attacks and childish language clearly show you're not even worth taking seriously. Sounds more like the ramblings of a high school child who is trying to get attention.

    Good day to you sir,

    Godspeed
  • Zstream - Thursday, May 21, 2009 - link

    You have no idea what you are talking about. The benchmark software can be downloaded. It is not our fault you are too poor to pay for a product.

    The rest I have to say "LOL".
  • DeepThought86 - Thursday, May 21, 2009 - link

    Wow, just wow.
