Opening the Kimono: Intel Details Nehalem and Tempts with Larrabee
by Anand Lal Shimpi on March 17, 2008 5:00 PM EST
Nehalem will support 2-way SMT (two threads per core), much like the Pentium 4 did before it. With a shorter pipeline than NetBurst and a greater ability to get data to the cores, SMT has more opportunity to extract parallelism (and thus performance) on Nehalem than it did on the Pentium 4.
The cache subsystem of Nehalem is almost entirely changed from Penryn. While Nehalem has the same 32KB L1 instruction and data caches as Penryn, the L2 and L3 caches are brand new. Each core in a quad-core Nehalem now has a smaller 256KB L2 cache, which Intel is calling "low latency" (potentially lower latency than Penryn's, thanks to the smaller size). In place of Penryn's shared L2, Intel equipped Nehalem with a large 8MB L3 cache that is fully shared among all cores.
This setup seems very similar to AMD's Phenom architecture, although it is obviously built on Intel's Core 2 foundation. The major difference is that Nehalem's cache hierarchy is inclusive rather than exclusive like AMD's: the shared L3 holds a copy of all data stored in the cores' smaller L1 and L2 caches.
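To make the inclusive/exclusive distinction concrete, here is a back-of-the-envelope sketch of how much distinct data each style of hierarchy can hold. It assumes the sizes quoted above (32KB L1 instruction plus 32KB L1 data and 256KB L2 per core, 8MB shared L3, four cores) and idealizes the hierarchy as either strictly inclusive or strictly exclusive; the constant and function names are hypothetical, and real hardware is rarely that tidy.

```python
# Idealized cache-capacity arithmetic for a quad-core Nehalem-style hierarchy.
# Assumptions (illustrative only): 32KB L1I + 32KB L1D and 256KB L2 per core,
# one 8MB L3 shared by all cores, and a hierarchy that is either strictly
# inclusive or strictly exclusive. These are not measured figures.

CORES = 4
L1_PER_CORE_KB = 32 + 32   # instruction + data
L2_PER_CORE_KB = 256
L3_SHARED_KB = 8 * 1024


def unique_capacity_kb(inclusive: bool) -> int:
    """Return how many distinct KB of data the whole hierarchy can hold."""
    private_kb = CORES * (L1_PER_CORE_KB + L2_PER_CORE_KB)
    if inclusive:
        # Every line in the private L1/L2 caches is duplicated in the L3,
        # so the L3 size is the ceiling on distinct data.
        return L3_SHARED_KB
    # Exclusive: a line lives in only one level at a time, so sizes add up.
    return L3_SHARED_KB + private_kb


print("inclusive:", unique_capacity_kb(True), "KB")   # 8192 KB
print("exclusive:", unique_capacity_kb(False), "KB")  # 9472 KB
```

Under these assumptions the inclusive design gives up about 1.25MB (the combined private L1 and L2 capacity) to duplication.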
Nehalem effectively adopts the only remaining advantages AMD held over Intel with respect to memory performance and interconnect speed, and you can expect a tremendous performance increase going from Penryn to Nehalem because of this. Intel expects memory accesses on Nehalem to be roughly twice as fast as on Penryn, whose aggressive prefetchers already make its memory accesses incredibly fast. If you think Intel's performance advantage is significant today, Nehalem should completely redefine your perspective: AMD needs its Bobcat and Bulldozer cores if it wants to compete.
Intel has also added a new second-level TLB in Nehalem, similar in approach to its new second-level branch predictor. The first-level TLB does a good job of keeping the cores fed quickly, but if a virtual-to-physical address mapping isn't found there, Nehalem can now check the second-level TLB instead of walking the page tables through the cache, which keeps performance high and latency low.
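The lookup order described above can be modeled in a few lines of Python. This is a minimal sketch, not Intel's design: it assumes 4KB pages, fully associative LRU levels, and illustrative entry counts (64 and 512), and the class names and the dictionary standing in for the in-memory page tables are hypothetical. The point it shows is that a working set too large for the first level but small enough for the second avoids repeated page walks.

```python
from collections import OrderedDict

PAGE_SIZE = 4096  # assume 4KB pages


class TLBLevel:
    """A tiny fully associative TLB level with LRU replacement (illustrative)."""

    def __init__(self, entries: int):
        self.entries = entries
        self.slots = OrderedDict()  # virtual page number -> physical frame number

    def lookup(self, vpn):
        if vpn in self.slots:
            self.slots.move_to_end(vpn)  # mark as most recently used
            return self.slots[vpn]
        return None

    def insert(self, vpn, pfn):
        self.slots[vpn] = pfn
        self.slots.move_to_end(vpn)
        if len(self.slots) > self.entries:
            self.slots.popitem(last=False)  # evict the least recently used entry


class TwoLevelTLB:
    def __init__(self, page_table, l1_entries=64, l2_entries=512):
        self.page_table = page_table  # dict vpn -> pfn, stands in for page tables in memory
        self.l1 = TLBLevel(l1_entries)
        self.l2 = TLBLevel(l2_entries)
        self.stats = {"l1_hit": 0, "l2_hit": 0, "page_walk": 0}

    def translate(self, vaddr: int) -> int:
        vpn, offset = divmod(vaddr, PAGE_SIZE)
        pfn = self.l1.lookup(vpn)
        if pfn is not None:
            self.stats["l1_hit"] += 1
        else:
            pfn = self.l2.lookup(vpn)
            if pfn is not None:
                self.stats["l2_hit"] += 1
            else:
                # Missed both levels: fall back to the slow page-table walk.
                self.stats["page_walk"] += 1
                pfn = self.page_table[vpn]
                self.l2.insert(vpn, pfn)
            self.l1.insert(vpn, pfn)  # refill the first level
        return pfn * PAGE_SIZE + offset


if __name__ == "__main__":
    table = {vpn: vpn + 1000 for vpn in range(256)}  # hypothetical mapping
    tlb = TwoLevelTLB(table, l1_entries=64, l2_entries=512)
    for _ in range(2):
        for vpn in range(128):  # 128-page working set: too big for L1, fits in L2
            tlb.translate(vpn * PAGE_SIZE)
    print(tlb.stats)  # {'l1_hit': 0, 'l2_hit': 128, 'page_walk': 128}
```

In this toy run the second pass never touches the page tables: every access that misses the small first level is caught by the second.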
The TLB enhancements look particularly well suited to server workloads; we suspect that Intel may be looking to really take on Opteron with Nehalem.
Above you see examples of the first Nehalem platforms; they should look very familiar to anyone who has seen block diagrams of AMD K8 platforms over the years. The first high-end desktop Nehalem parts will have an integrated three-channel DDR3 memory controller supporting DDR3-800, 1066 and 1333.
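For a rough sense of what three channels buy, the theoretical peak bandwidth is simply channels × bytes per transfer × transfer rate. The sketch below assumes the standard 64-bit (8-byte) DDR3 channel and ignores refresh, command overhead and other real-world losses; the function name is hypothetical.

```python
# Theoretical peak bandwidth of a triple-channel DDR3 memory controller.
# Assumes a 64-bit (8-byte) data bus per channel; real sustained bandwidth
# is lower because of refresh, command overhead and access patterns.

CHANNELS = 3
BYTES_PER_TRANSFER = 8  # one 64-bit channel


def peak_bandwidth_gbs(mega_transfers_per_s: float, channels: int = CHANNELS) -> float:
    """Peak bandwidth in GB/s, using 1 GB = 10**9 bytes."""
    return channels * BYTES_PER_TRANSFER * mega_transfers_per_s * 1e6 / 1e9


# DDR3-1066 and DDR3-1333 actually run at 1066.67 and 1333.33 MT/s.
for name, mts in [("DDR3-800", 800), ("DDR3-1066", 3200 / 3), ("DDR3-1333", 4000 / 3)]:
    print(f"{name}: {peak_bandwidth_gbs(mts):.1f} GB/s")
# DDR3-800: 19.2 GB/s
# DDR3-1066: 25.6 GB/s
# DDR3-1333: 32.0 GB/s
```

By the same arithmetic, a single DDR3-1333 channel peaks at about 10.7 GB/s, so the third channel is where much of the headroom comes from.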
On the server side, Nehalem's integrated memory controller (IMC) will also support registered memory.