The Magic Inside the Uncore

We were already been spoiled by Ivy Bridge EP as it implemented a pretty complex uncore architecture. With Haswell EP, the communication between memory controllers, LLC, and cores has become even more intricate.

The Sandy Bridge EP CPU consisted of two columns of cores and LLC slices, connected by a single ring bus. The top models of the Ivy Bridge EP had three columns connected by a dual ring bus, with outer and inner rings as pictured above. The rings move data in opposite directions (clockwise/counter-clockwise) in order to reduce latency by allowing data to take the shortest path to the destination. As data is brought onto the ring infrastructure, it must be scheduled so that it does not collide with previous data.

The 14 and 18 core SKUs now have four columns of cores and LLC slices, and as a result scheduling gets very complicated. Intel has now segregated the dual ring buses and integrated two buffered switches to simplify scheduling. It's somewhat comparable with the way an Ethernet switch divides a network into segments. Each ring can act independently, and as result the effective bandwidth increases, which is especially helpful when FMA/AVX instructions are working on 256-bit chunks of data.

In total there are now three different die configurations. The first one, from four up to eight cores, is very similar to the lower count Ivy Bridge EPs. It has one dual ring, two columns of cores, and only one memory controller. The LLC cache is smaller on this die and has a lower latency.

The second configuration supports 10-12 cores and is a smaller version of the third die configuration that we described above. These dies have two memory controllers. The blue points indicate where data can jump onto the ring buses. Note that the die configurations are not symmetrical. For example an 18-core CPU has 8 cores (4-4) and 20MB LLC on one side, with 25MB LLC and 10 cores on the other. The middle configuration drops six to eight of the cores on the right ring, with an associated amount of LLC.

Data/instructions of one core are not stored in the adjacent cache slice. This could have lowered latency in some cases but it can create hotspots. Data is stored based on the physical address, ensuring all LCC cache slices are uniformly accessed. Transactions take the shortest path.

Rings are one of the entities that work on a separate voltage and frequency, just like cores. So if more I/O or coherency messaging is going on than processing, power can be dynamically allocated to speed up the rings.

Cache Coherency

The Home Agents are used for cache coherency and requests to DRAM. In dies that have two memory controllers, each home agent will use two channels. In dies that have one memory controller, each home agent will address four channels. While the smaller dies have faster LLC caches, Intel estimates that the second memory controller will extract 5% to 10% more bandwidth.

The two socket Haswell EP supports three snooping modes as you can see below. The first, Early Snoop, was available starting with Sandy Bridge EP models. With Ivy Bridge EP a second mode, Home Snoop, was introduced. Haswell EP now adds a third mode, Cluster on Die.

These snoop modes can be set in the BIOS.

Ivy Bridge used home snooping and had a directory in memory. The latest Xeon has directory caches (about 14KB) in each Home Agent. This directory cache keeps track of the contested cache lines to lower cache-to-cache transfer latencies. Another result is that directory updates in memory are less frequent and there are less broadcast snoops. Cluster On Die mode is the latest addition to the coherency protocols.

Cluster On Die can be understood as if you split the CPU and LLC into two parts that behave like two CPUs in NUMA. The OS is presented two affinity domains. As a result, the latency of LLC is lowered, but the hitrate is slightly lower. However if your application is NUMA aware, data and instructions are kept close to the part of the CPU that is processing them.

Higher QPI speeds, also notice the "COD" and "Early snoop" option.

And finally, QPI has been sped up to 9.6 GT/s, from 8 GT/s (as you can see in the BIOS shot).

More improvements

The list of (small) improvements is long and we have not been able to test all of them out. But here is an overview of what also improved

  • Lower VM Entry/exit latency. The latency of going and forth to the Hypervisor has been improved compared to Westmere. Sandy Bridge slightly increase this compared to Westmere. 
  • VMCS shadowing. De VM Control Structure can be exposed to hypervisors running on top of the main hypervisor. So you get VT-x inside your nested hypervisor
  • EPT Access and Dirty Bits. This makes it easier to move memory pages around, which is essiential for Live Migration / vMotion
  • Cache monitoring (CMT) & allocation technology (CAT). CMT allow you to "measure" if a certain Virtual machine hogs the LLC . In certain SKUs is possible have control over the placement of data in the last-level cache. 

Most of the improvements listed are specific for virtualized servers. However, cache allocation monitoring is also available for "native" OS.

Next Stop: the Uncore Power Optimizations
Comments Locked

85 Comments

View All Comments

  • SuperVeloce - Tuesday, September 9, 2014 - link

    Oh, nevermind... I unknowingly caught an error.
  • JohanAnandtech - Tuesday, September 9, 2014 - link

    thx! Fixed. Sorry for the late reaction, jetlagged and trying to get to the hectic pace of IDF :-)
  • hescominsoon - Tuesday, September 9, 2014 - link

    As long as AMD continues it's idiotic two integer units sharing an fpu design they will be an afterthought in the cpu department.
  • nils_ - Sunday, September 14, 2014 - link

    Serious competition for Intel will not come from AMD any time soon, but possibly IBM with the POWER8, Tyan even came out with a single socket board for that CPU so it might make it's way into the same market soon.
  • ScarletEagle - Tuesday, September 16, 2014 - link

    Any feel for the relative HPC performance of the E5-2680v3 with respect to the E5-2650Lv3? I am looking at purchasing a PowerEdge 730 with two of these and the 2133MHz RAM. My guess is that the higher base clock speed should make somewhat of an improvement?

Log in

Don't have an account? Sign up now