The Magic Inside the Uncore

We were already been spoiled by Ivy Bridge EP as it implemented a pretty complex uncore architecture. With Haswell EP, the communication between memory controllers, LLC, and cores has become even more intricate.

The Sandy Bridge EP CPU consisted of two columns of cores and LLC slices, connected by a single ring bus. The top models of the Ivy Bridge EP had three columns connected by a dual ring bus, with outer and inner rings as pictured above. The rings move data in opposite directions (clockwise/counter-clockwise) in order to reduce latency by allowing data to take the shortest path to the destination. As data is brought onto the ring infrastructure, it must be scheduled so that it does not collide with previous data.

The 14 and 18 core SKUs now have four columns of cores and LLC slices, and as a result scheduling gets very complicated. Intel has now segregated the dual ring buses and integrated two buffered switches to simplify scheduling. It's somewhat comparable with the way an Ethernet switch divides a network into segments. Each ring can act independently, and as result the effective bandwidth increases, which is especially helpful when FMA/AVX instructions are working on 256-bit chunks of data.

In total there are now three different die configurations. The first one, from four up to eight cores, is very similar to the lower count Ivy Bridge EPs. It has one dual ring, two columns of cores, and only one memory controller. The LLC cache is smaller on this die and has a lower latency.

The second configuration supports 10-12 cores and is a smaller version of the third die configuration that we described above. These dies have two memory controllers. The blue points indicate where data can jump onto the ring buses. Note that the die configurations are not symmetrical. For example an 18-core CPU has 8 cores (4-4) and 20MB LLC on one side, with 25MB LLC and 10 cores on the other. The middle configuration drops six to eight of the cores on the right ring, with an associated amount of LLC.

Data/instructions of one core are not stored in the adjacent cache slice. This could have lowered latency in some cases but it can create hotspots. Data is stored based on the physical address, ensuring all LCC cache slices are uniformly accessed. Transactions take the shortest path.

Rings are one of the entities that work on a separate voltage and frequency, just like cores. So if more I/O or coherency messaging is going on than processing, power can be dynamically allocated to speed up the rings.

Cache Coherency

The Home Agents are used for cache coherency and requests to DRAM. In dies that have two memory controllers, each home agent will use two channels. In dies that have one memory controller, each home agent will address four channels. While the smaller dies have faster LLC caches, Intel estimates that the second memory controller will extract 5% to 10% more bandwidth.

The two socket Haswell EP supports three snooping modes as you can see below. The first, Early Snoop, was available starting with Sandy Bridge EP models. With Ivy Bridge EP a second mode, Home Snoop, was introduced. Haswell EP now adds a third mode, Cluster on Die.

These snoop modes can be set in the BIOS.

Ivy Bridge used home snooping and had a directory in memory. The latest Xeon has directory caches (about 14KB) in each Home Agent. This directory cache keeps track of the contested cache lines to lower cache-to-cache transfer latencies. Another result is that directory updates in memory are less frequent and there are less broadcast snoops. Cluster On Die mode is the latest addition to the coherency protocols.

Cluster On Die can be understood as if you split the CPU and LLC into two parts that behave like two CPUs in NUMA. The OS is presented two affinity domains. As a result, the latency of LLC is lowered, but the hitrate is slightly lower. However if your application is NUMA aware, data and instructions are kept close to the part of the CPU that is processing them.

Higher QPI speeds, also notice the "COD" and "Early snoop" option.

And finally, QPI has been sped up to 9.6 GT/s, from 8 GT/s (as you can see in the BIOS shot).

More improvements

The list of (small) improvements is long and we have not been able to test all of them out. But here is an overview of what also improved

  • Lower VM Entry/exit latency. The latency of going and forth to the Hypervisor has been improved compared to Westmere. Sandy Bridge slightly increase this compared to Westmere. 
  • VMCS shadowing. De VM Control Structure can be exposed to hypervisors running on top of the main hypervisor. So you get VT-x inside your nested hypervisor
  • EPT Access and Dirty Bits. This makes it easier to move memory pages around, which is essiential for Live Migration / vMotion
  • Cache monitoring (CMT) & allocation technology (CAT). CMT allow you to "measure" if a certain Virtual machine hogs the LLC . In certain SKUs is possible have control over the placement of data in the last-level cache. 

Most of the improvements listed are specific for virtualized servers. However, cache allocation monitoring is also available for "native" OS.

Next Stop: the Uncore Power Optimizations
Comments Locked

85 Comments

View All Comments

  • cmikeh2 - Monday, September 8, 2014 - link

    In the SKU comparison table you have the E5-2690V2 listed as a 12/24 part when it is in fact a 10/20 part. Just a tiny quibble. Overall a fantastic read.
  • KAlmquist - Monday, September 8, 2014 - link

    Also, the 2637 v2 is 4/8, not 6/12.
  • isa - Monday, September 8, 2014 - link

    Looking forward to a new supercomputer record using these behemoths.
  • Bruce Allen - Monday, September 8, 2014 - link

    Awesome article. I'd love to see Cinebench and other applications tests. We do a lot of rendering (currently with older dual Xeons) and would love to compare these new Xeons versus the new 5960X chips - software license costs per computer are so high that the 5960X setups will need much higher price/performance to be worth it. We actually use Cinema 4D in production so those scores are relevant. We use V-Ray, Mental Ray and Arnold for Maya too but in general those track with the Cinebench scores so they are a decent guide. Thank you!
  • Ian Cutress - Monday, September 8, 2014 - link

    I've got some E5 v3 Xeons in for a more workstation oriented review. Look out for that soon :)
  • fastgeek - Monday, September 8, 2014 - link

    From my notes a while back... two E5-2690 v3's (all cores + turbo enabled) under 2012 Server yielded 3,129 for multithreaded and 79 for single.

    While not Haswell, I can tell you that four E5-4657L V2's returned 4,722 / 94 respectively.

    Hope that helps somewhat. :-)
  • fastgeek - Monday, September 8, 2014 - link

    I don't see a way to edit my previous comment; but those scores were from Cinebench R15
  • wireframed - Saturday, September 20, 2014 - link

    You pay for licenses for render Nodes? Switch to 3DS, and you get 9999 nodes for free (unless they changed the licensing since I last checked). :)
  • Lone Ranger - Monday, September 8, 2014 - link

    You make mention that the large core count chips are pretty good about raising their clock rate when only a few cores are active. Under Linux, what is the best way to see actual turbo frequencies? cpuinfo doesn't show live/actual clock rate.
  • JohanAnandtech - Monday, September 8, 2014 - link

    The best way to do this is using Intel's PCM. However, this does not work right now (only on Sandy and Ivy, not Haswel) . I deduced it from the fact that performance was almost identical and previous profiling of some of our benchmarks.

Log in

Don't have an account? Sign up now