Will Nehalem conquer the server world by storm?by Johan De Gelas on February 12, 2009 12:00 AM EST
- Posted in
- IT Computing general
A dramatic turn of events is the best way to describe what we'll witness in a few weeks. But let us first talk about the current situation. As we pointed out in our last server CPU comparison, AMD latest quadcore Opteron was a very positive surprise. Sure, you can show a few server benchmarks where the Intel CPU wins like Black Scholes or some exotic HPC benchmark but the server applications that really make the difference like webservers, database servers run faster on the latest AMD "Shanghai" CPU. All depends on what kind of application is important for you of course. But let us look at the complete picture: performing more than 30% faster in Virtualization benchmarks is the final proof that AMD's latest is overall the best server CPU at this point in time.
But a few weeks from now, that will all change. As always we can not disclose benchmark information before a certain date, but if you look around here at this site, you have been able to discern the omens. The K10 architecture of Shanghai is a well rounded architecture, but one that misses really crucial weapons to keep up with the Nehalem:
- Simultaneous Hyperthreading offers performance boost that IPC Improvements are not capable of delivering (up to 45%!).
- Memory latency. Nehalem's memory latency is up to 40% lower
- Memory bandwidth: 3 channels is complete overkill for desktop apps, but it does wonders for many HPC and in a lesser degree server applications.
- a really aggressive integer engine
Nehalem will use somewhat more expensive DDR-3 DIMMs, which hardly offer any real performance boosts (as compared to DDR-2). So moving to DDR-3 will not help AMD much.
The details on the six-core Istanbul are still sketchy. But the dual socket Xeon "Westmere" will get six cores too and will appear in the same timeframe as AMD's hexacore. Only if AMD added SMT very secretly to Istanbul, they will be able to turn the tide. Considering that this would be a first for AMD, it is very unlikely SMT made it to Istanbul.
A dent in Nehalem's armour?
Does AMD have a chance in the server market in 2009 (and possibly 2010)? I must say it was not easy to find a weakness in Nehalem's architecture. The challenge made it very attractive to search anyway :-). So what follows is a big "IF- iF" story and you should take it with a big grain of salt ... as you should always do with forward looking articles.
There is one market where AMD has really been the leader and that is virtualization thanks to the IMC and the support for segments (four privilege levels) in the AMD64 Instruction Set Architecture. AMD's performance running VMware ESX in the "good old" ESX Binary translating mode (software virtualization) was better than running an Intel on the latest hardware virtualization hypervisor. VMware only uses hardware virtualization on an AMD server if NPT (or RVI or HAP) is present . In contrast, hardware virtualization slowed the Xeons of 2005 and 2006 a bit down but was absolutely necessary to run 64 bit guests on a hypervisor on top of a Xeon server.
Nehalem is catching up with EPT and VPID (see here), and while it was well implemented, one thing is lacking: the TLB is rather small. I have been pointing out this out about a year ago: while the TLB got AMD a lot of bad press, it will probably be the one thing that keeps AMD somewhat in Intel's slipstream. Let me make that more clear:
|| L1 TLB Data
|| L1 TLB Instr
|| L2 TLB
| AMD Shanghai/ Opteron 238x or 838x
48 (4 KB)
| 48 (4 KB)
512 (4 KB)
| Intel Penryn / Xeon 54xx
16 (4 KB)
| 128(4 KB)
256 (4 KB)
| Intel Nehalem / Xeon 55xx
128 (4 KB)
512 (4 KB)
Notice that in case you use large pages, the Nehalem TLB has few entries. So, let us now do a thought experiment. Currently, most of the virtualization benchmarks like VMmark (VMware) and VConsolidate (Intel) use relatively small VMs. VMs are for example a small Apache webserver and Mysql server which get between 512 MB and 2 GB of RAM. As a result most of them run with large pages off (Page size = 4 KB). These benchmark are very similar to the daily practice of an enterprise which uses IT mostly for "infrastructure purposes" such as authentificating it's employees and giving them access to mail, ftp, fileserver, print serving and web browsing.
It becomes totally different when you are an IT firm that offers it's services to a relatively large amount of customers on the internet. You need a large database with many probably pretty heavy webportals which offer a good interactive experience.
So you are not going to consolidate something like 84 (14 tiles x 6 VMs) tiny VMs on one physical machine, but rather 5 to 10 "fat" VMs. With fat VMs I mean VMs that get 4 GB and more of RAM, 2 to 4 vCPUs, run a 64 bit guest OS and so on.
Those applications also open tons of connections, which they have to destroy and recreate after some time. In other words, lots of memory activity going on.
EPT and NPT can offer between 10 and 35% better performance when lots of memory management activity is going on. Compared to the shadow page table technique, each change in the page tables does not cause a trap and the associated overhead (which can be 1000s of cycles). So you could say that going to the TLB of your CPU is a lot smoother. But if the TLB fails to deliver, the hardware page walk is very costly.
In search of the real page table
A hardware page walk consists of searching in several tables which allow the CPU to find the real physical address as the running software always supplies a virtual address. With a normal OS, the OS has set the CR3 register to contain a physical address where the first table is located.The first table converts the first part of the virtual address into a physical one, a pointer towards the physical address where the next table is located. With large pages, it takes about 3 steps to translate the virtual address to the physical one.
With EPT/NPT, the Guest OS gives a (CR3) address which in fact virtual and which must be converted into a real physical address. All the Guest OS tables contain pointers to a virtual addresses. So each table gives you a virtual address towards the other table. But the next table is not located at this virtual address, so we need to go out and search for the real address. So instead of 3 accesses to the memory, we need 3x3 accesses. If this happens too many times, EPT will actually reduce performance instead of improving it!
It is a good practice to use large pages with large database. Now remember we are moving towards a datacenter where almost everything is virtualized, databases included. In that case, Nehalem's TLB can only make sure that about 32 x 2 MB or only 64 MB of data and 28 MB of code is covered by the TLB. As a result, lots of relatively heavy hardware page walks will happen. Luckily, Intel caches the real physical page tables in the L3-cache, so it should not be too painful.
The latest quadcore Opteron has a much more potent TLB. As instructions take a lot less space than data, it is safe to say that the data TLB can cover up 176 (48 + 128) times 2 MB or 352 MB of data. Considering that virtualized machines have easily between 32 and 128 GB and are much better utilized (60-80% CPU load), it is clear that the AMD chip has an advantage there. How much difference can this make? We have to measure it, but based on our profiling and early benchmarking we believe that "an overflowing TLB" can decrease virtualized performance by as much 15%. To be honest: it is to early to tell, but we are pretty sure it is not peanuts in some important applications.
So what are we saying? Well, it is possible that the Opteron might be able to do some "damage control" compared to Nehalem when we try out a benchmark with large and fat VMs (Like we have done here). But there are a lot of "IF"s. Firstly, AMD must also cache the page tables in the caches. If for some reason they keep the page tables out of the caches, the advantage will probably be partly negated. Secondly, if the applications running on the physical machine demand a lot of bandwidth, the fact that the Nehalem platform has up to 70% more bandwidth might spoil the advantage too.
The last AMD Stronghold?
So Should Intel worry about this? Most likely not. For simplicity sake, let us assume that both cores - Shanghai and Nehalem- offer equal crunching power. They more or less do when it comes to pure raw FP power, but SpecInt makes it clear that Nehalem is faster in integer loads.
But let us forget that, as most server applications are unable to use all that superscalar power anyway. The AMD chip is still disadvantaged by the fact that it does not have SMT. Considering that most server apps have ample threads and that virtualization makes it easier to load each logical CPU up to 80% that remains a hard to close gap. Secondly, many of these applications do not fit entirely in the cache, so the fact that AMD's memory latency is up to 40% higher is not helping either. Thirdly, all top Xeons (2.66 GHz and higher) are capable of adding 2 extra speedbins even if all 4 cores are busy (like it was the case in SAP). It will be interesting to see how much power this costs, and if Turbo mode is possible with a 80% loaded virtualized machine.
In a nutshell: expect Nehalem with it's ample bandwidth and EPT to do very well in VMmark. However, we think that AMD might stay in the slipstream of the Intel flagship in some virtualization setups. It is possible that AMD counters with an even better optimized memory controller in Istanbul, but it is going to be tough.
Return to Linpack
The benchmarks where AMD will be able to stay close should have no use for massive amounts of memory bandwidth, SMT or Turbo mode. Feel free to educate us, but so far we have only found one benchmark that answers this profile: Linpack. Linpack achieves the highest IPC rates of probably almost all softwares. That means the Nehalem Xeon will be consuming peak power, and will not be able to use Turbo mode. Linpack (with MKL or ACML) is also so carefully optimized that it runs almost completely in the caches, and SMT or hyperthreading is only disturbing the carefully placed code lines. Considering that a 2.7 GHz Shanghai CPU with registered RAM was only a tiny bit slower than a Nehalem CPU with non registered RAM, you may expect to see both CPUs very close in this benchmark.
Outlook to 2009
The AMD quadcore is now the server CPU to get, but it is not going to stay that way very long. Until AMD comes up with SMT or another form of multi-threading and a faster memory controller, Intel's newest platform and CPU will force AMD to make the quadcore opteron very cheap. We expect that the AMD quadcore will only be competitive in Linpack and some virtualization scenario's.
And unless Istanbul has a very nice surprise for us, it is not going to change soon. Agreed, to our loyal readers, this does not come as a surprise...