The Magic Inside the Uncore

We had already been spoiled by Ivy Bridge EP, which implemented a fairly complex uncore architecture. With Haswell EP, the communication between memory controllers, LLC, and cores has become even more intricate.

The Sandy Bridge EP CPU consisted of two columns of cores and LLC slices, connected by a single ring bus. The top models of the Ivy Bridge EP had three columns connected by a dual ring bus, with outer and inner rings as pictured above. The rings move data in opposite directions (clockwise/counter-clockwise) in order to reduce latency by allowing data to take the shortest path to the destination. As data is brought onto the ring infrastructure, it must be scheduled so that it does not collide with previous data.
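To illustrate why counter-rotating rings reduce latency, here is a minimal sketch that counts hops between ring stops. The function names and the stop count of 8 are assumptions chosen for illustration; they do not describe the actual number of ring stops on any Xeon die.

```python
from typing import Tuple


def ring_hops(src: int, dst: int, stops: int) -> Tuple[int, str]:
    """Return (hop count, direction) for the shorter way around a bidirectional ring."""
    clockwise = (dst - src) % stops
    counter_clockwise = (src - dst) % stops
    if clockwise <= counter_clockwise:
        return clockwise, "clockwise"
    return counter_clockwise, "counter-clockwise"


def average_hops(stops: int, bidirectional: bool) -> float:
    """Average hop count from one stop to every other stop on the ring."""
    total = 0
    for distance in range(1, stops):
        if bidirectional:
            # With two rings, traffic takes whichever direction is shorter.
            total += min(distance, stops - distance)
        else:
            # With a single ring, traffic can only travel one way.
            total += distance
    return total / (stops - 1)


if __name__ == "__main__":
    STOPS = 8  # hypothetical ring size, for illustration only
    print(ring_hops(0, 7, STOPS))
    print(average_hops(STOPS, bidirectional=False))
    print(average_hops(STOPS, bidirectional=True))
```

In this toy model the average distance drops by a bit less than half once both directions are available, which is the effect the paragraph above describes.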

The 14 and 18 core SKUs now have four columns of cores and LLC slices, and as a result scheduling gets very complicated. Intel has now segregated the dual ring buses and integrated two buffered switches to simplify scheduling. It's somewhat comparable to the way an Ethernet switch divides a network into segments. Each ring can act independently, and as a result the effective bandwidth increases, which is especially helpful when FMA/AVX instructions are working on 256-bit chunks of data.

In total there are now three different die configurations. The first one, covering four to eight cores, is very similar to the lower core count Ivy Bridge EPs. It has one dual ring, two columns of cores, and only one memory controller. The LLC is smaller on this die and has lower latency.

The second configuration supports 10-12 cores and is a smaller version of the third die configuration that we described above. These dies have two memory controllers. The blue points indicate where data can jump onto the ring buses. Note that the die configurations are not symmetrical. For example, an 18-core CPU has 8 cores (4-4) and 20MB of LLC on one side, with 10 cores and 25MB of LLC on the other. The middle configuration drops six to eight of the cores on the right ring, along with the associated amount of LLC.

The data and instructions of a core are not simply stored in the adjacent cache slice. Doing so could lower latency in some cases, but it can also create hotspots. Instead, data is distributed based on the physical address, ensuring that all LLC slices are accessed uniformly. Transactions take the shortest path.
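The idea of spreading cache lines over slices by physical address can be sketched as follows. The real hash function is not described in this article, so the XOR-fold below is purely a stand-in; `slice_for_address`, `slice_histogram`, and the 64-byte line size are assumptions made for the example.

```python
from collections import Counter

CACHE_LINE_BYTES = 64  # assumed cache line size


def slice_for_address(phys_addr: int, num_slices: int) -> int:
    """Map a physical address to an LLC slice index (illustrative hash only)."""
    line = phys_addr // CACHE_LINE_BYTES
    folded = 0
    # XOR-fold the line number in 16-bit chunks so that both high and low
    # address bits influence the chosen slice.
    while line:
        folded ^= line & 0xFFFF
        line >>= 16
    return folded % num_slices


def slice_histogram(base: int, num_lines: int, num_slices: int) -> Counter:
    """Count how many of num_lines consecutive cache lines land on each slice."""
    return Counter(
        slice_for_address(base + i * CACHE_LINE_BYTES, num_slices)
        for i in range(num_lines)
    )


if __name__ == "__main__":
    # 18 slices, matching the 18-core die; one slice per core is assumed here.
    for slice_id, count in sorted(slice_histogram(0, 18_000, 18).items()):
        print(slice_id, count)
```

Because the slice is derived from the address rather than from the requesting core, a single busy core touching a large buffer ends up spreading its accesses over every slice instead of hammering its neighbor.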

Rings are one of the entities that work on a separate voltage and frequency, just like cores. So if more I/O or coherency messaging is going on than processing, power can be dynamically allocated to speed up the rings.

Cache Coherency

The Home Agents are used for cache coherency and requests to DRAM. In dies that have two memory controllers, each home agent will use two channels. In dies that have one memory controller, the home agent will address all four channels. While the smaller dies have a faster LLC, Intel estimates that the second memory controller will extract 5% to 10% more bandwidth.

The two socket Haswell EP supports three snooping modes as you can see below. The first, Early Snoop, was available starting with Sandy Bridge EP models. With Ivy Bridge EP a second mode, Home Snoop, was introduced. Haswell EP now adds a third mode, Cluster on Die.

These snoop modes can be set in the BIOS.

Ivy Bridge used home snooping and had a directory in memory. The latest Xeon has directory caches (about 14KB) in each Home Agent. This directory cache keeps track of the contested cache lines to lower cache-to-cache transfer latencies. Another result is that directory updates in memory are less frequent and there are fewer broadcast snoops. Cluster On Die mode is the latest addition to the coherency protocols.
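As a rough illustration of why a small directory cache cuts down on broadcast snoops, consider the toy model below. It is not Intel's protocol: the class name, the oldest-first eviction, and the fallback to a full broadcast on a miss are all simplifying assumptions made for this sketch.

```python
class DirectoryCache:
    """Toy directory cache: remembers which agents hold recently contested lines."""

    def __init__(self, capacity: int) -> None:
        self.capacity = capacity
        # line address -> set of caching agents; dicts keep insertion order,
        # which this sketch uses for a simple oldest-first eviction.
        self.entries = {}

    def record(self, line: int, agent: str) -> None:
        """Note that `agent` holds a copy of `line`, evicting the oldest entry if full."""
        if line not in self.entries and len(self.entries) >= self.capacity:
            oldest = next(iter(self.entries))
            del self.entries[oldest]
        self.entries.setdefault(line, set()).add(agent)

    def snoop_targets(self, line: int, requester: str, all_agents) -> set:
        """Return the agents that must be snooped for a request to `line`."""
        holders = self.entries.get(line)
        if holders is None:
            # Miss: this toy model falls back to snooping everyone else.
            return set(all_agents) - {requester}
        # Hit: only the known holders need to be snooped.
        return set(holders) - {requester}


if __name__ == "__main__":
    agents = {"A", "B", "C"}  # hypothetical caching agents
    directory = DirectoryCache(capacity=2)
    directory.record(0x40, "B")
    print(directory.snoop_targets(0x40, "A", agents))  # tracked line
    print(directory.snoop_targets(0x80, "A", agents))  # untracked line
```

For a tracked line, the request turns into a targeted snoop of the known holder; for an untracked line, the model has to ask every other agent.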

Cluster On Die can be understood as splitting the CPU and LLC into two parts that behave like two CPUs in a NUMA system. The OS is presented with two affinity domains. As a result, the latency of the LLC is lowered, but the hit rate is slightly lower. However, if your application is NUMA aware, data and instructions are kept close to the part of the CPU that is processing them.
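On Linux, the affinity domains the OS sees can be inspected through sysfs. The sketch below reads the `cpulist` file of each NUMA node under `/sys/devices/system/node`; the helper names are our own. With Cluster On Die enabled, a two-socket system would be expected to report four nodes instead of two, although we have not verified the output on every platform.

```python
from pathlib import Path


def parse_cpulist(text: str) -> list:
    """Expand a sysfs CPU list such as '0-2,8' into [0, 1, 2, 8]."""
    cpus = []
    for part in text.strip().split(","):
        if not part:
            continue
        if "-" in part:
            low, high = part.split("-")
            cpus.extend(range(int(low), int(high) + 1))
        else:
            cpus.append(int(part))
    return cpus


def numa_nodes(sysfs_root: str = "/sys/devices/system/node") -> dict:
    """Return {node name: list of CPU ids} for every NUMA node the kernel exposes."""
    nodes = {}
    for node_dir in sorted(Path(sysfs_root).glob("node[0-9]*")):
        cpulist_file = node_dir / "cpulist"
        if cpulist_file.is_file():
            nodes[node_dir.name] = parse_cpulist(cpulist_file.read_text())
    return nodes


if __name__ == "__main__":
    for name, cpus in numa_nodes().items():
        print(f"{name}: {len(cpus)} logical CPUs -> {cpus}")
```

A NUMA-aware application can then pin its threads and memory to one of these nodes so that its working set stays in the nearer half of the LLC.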

Higher QPI speeds; also notice the "COD" and "Early Snoop" options.

And finally, QPI has been sped up to 9.6 GT/s, from 8 GT/s (as you can see in the BIOS shot).
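To put the transfer rates in perspective, here is a back-of-the-envelope calculation. It assumes that a QPI link carries 16 bits (2 bytes) of data per transfer in each direction, which is our understanding of the link width and is not stated in this article.

```python
QPI_DATA_BYTES_PER_TRANSFER = 2  # assumption: 16 data bits per transfer, per direction


def qpi_bandwidth_gb_s(gt_per_s: float) -> float:
    """Peak data bandwidth of one QPI link in one direction, in GB/s."""
    return gt_per_s * QPI_DATA_BYTES_PER_TRANSFER


if __name__ == "__main__":
    for rate in (8.0, 9.6):
        one_way = qpi_bandwidth_gb_s(rate)
        print(f"{rate} GT/s: {one_way:.1f} GB/s per direction, "
              f"{2 * one_way:.1f} GB/s bidirectional per link")
```

Under that assumption, the move from 8 GT/s to 9.6 GT/s is a 20% increase in peak link bandwidth.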

More improvements

The list of (small) improvements is long and we have not been able to test all of them. But here is an overview of what else has improved:

  • Lower VM entry/exit latency. The latency of going back and forth to the hypervisor has improved compared to Westmere. Sandy Bridge had slightly increased it compared to Westmere.
  • VMCS shadowing. The VM Control Structure can be exposed to hypervisors running on top of the main hypervisor, so you get VT-x inside your nested hypervisor.
  • EPT Access and Dirty Bits. These make it easier to move memory pages around, which is essential for Live Migration / vMotion.
  • Cache Monitoring Technology (CMT) and Cache Allocation Technology (CAT). CMT allows you to "measure" whether a certain virtual machine hogs the LLC. On certain SKUs it is also possible to control the placement of data in the last-level cache.

Most of the improvements listed are specific to virtualized servers. However, cache monitoring and allocation are also available to a "native" OS.
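Cache allocation is expressed through capacity bitmasks, where each set bit grants access to one portion (way) of the LLC and the set bits must be contiguous. The sketch below only builds and checks such masks; the helper names and the 20-way total are assumptions for illustration, and actually programming the masks requires OS or hypervisor support that is not shown here.

```python
def capacity_bitmask(first_way: int, num_ways: int, total_ways: int) -> int:
    """Build a contiguous capacity bitmask covering num_ways ways from first_way."""
    if num_ways < 1 or first_way < 0 or first_way + num_ways > total_ways:
        raise ValueError("mask must cover at least one way and fit in the cache")
    return ((1 << num_ways) - 1) << first_way


def masks_overlap(mask_a: int, mask_b: int) -> bool:
    """True if two classes of service share at least one way."""
    return (mask_a & mask_b) != 0


if __name__ == "__main__":
    TOTAL_WAYS = 20  # hypothetical LLC associativity
    noisy_vm = capacity_bitmask(0, 4, TOTAL_WAYS)     # confine a cache hog to 4 ways
    everyone_else = capacity_bitmask(4, 16, TOTAL_WAYS)
    print(f"noisy VM:      {noisy_vm:#07x}")
    print(f"everyone else: {everyone_else:#07x}")
    print("overlap:", masks_overlap(noisy_vm, everyone_else))
```

Giving the cache-hogging VM a small, non-overlapping mask keeps it from evicting the data of its neighbors, which is the kind of control the last bullet refers to.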

Next Stop: the Uncore Power Optimizations

84 Comments


  • coburn_c - Monday, September 08, 2014

    MY God - It's full of transistors!
  • Samus - Monday, September 08, 2014

    I wish there were socket 1150 Xeon's in this class. If I could replace my quad core with an Octacore...
  • wireframed - Saturday, September 20, 2014

    If you can afford an 8-core CPU, I'm sure you can afford a S2011 board - it's like 15% of the price of the CPU, so the cost relative to the rest of the platform is negligible. :)
    Also, s1150 is dual-channel only. With that many cores, you'll want more bandwidth.
  • peevee - Wednesday, March 25, 2015

    For many, if not most workloads it will be faster to run 4 fast (4GHz) cores on 4 fast memory channels (DDR4-2400+) than 8 slow (2-3GHz) cores on 2 memory channels. Of course, if your workload consists of a lot of trigonometry (sine/cosine etc), or thread worksets completely fit into 2nd level cache (only 256k!), you may benefit from 8/2 config. But if you have one of those, I am eager to hear what it is.
  • tech6 - Monday, September 08, 2014

    The 18 core SKU is great news for those trying to increase data center density. It should allow VM hosts with 512Gb+ of memory to operate efficiently even under demanding workloads. Given the new DDR4 memory bandwidth gains I wonder if the 18 core dual socket SKUs will make quad socket servers a niche product?
  • Kevin G - Monday, September 08, 2014

    In fairness, quad socket was already a niche market.

    That and there will be quad socket version of these chips: E5-4600v3's.
  • wallysb01 - Monday, September 08, 2014

    My lord. My thought is that this really shows that v3 isn’t the slouch many thought it would be. An added 2 cores over v2 in the same price range and turbo boosting that appears to functioning a little better, plus the clock for clock improvements and move to DDR4 make for a nice step up when all combined.

    I’m surprised Intel went with an 18 core monster, but holy S&%T, if they can squeeze it in and make it function, why not.
  • Samus - Monday, September 08, 2014

    I feel for AMD, this just shows how far ahead Intel is :\
  • Thermogenic - Monday, September 08, 2014

    Intel isn't just ahead - they've already won.
  • olderkid - Monday, September 08, 2014

    AMD saw Intel behind them and they wondered how Intel fell so far back. But really Intel was just lapping them.
