I Keep My Cache Private

As mentioned in the original Skylake-X announcements, the new Skylake-SP cores have shaken up the cache hierarchy compared to previous generations. What used to be simple inclusive caches have been adjusted in size, policy, latency, and efficiency, which will have a direct impact on performance. It also means that Skylake-S and Skylake-SP will have different instruction throughput efficiency levels: the difference could be as stark as chalk and cheese, or as subtle as stilton and aged stilton.

Let us start with a direct comparison of Skylake-S and Skylake-SP.

Comparison: Skylake-S and Skylake-SP Caches
         Skylake-S                          Skylake-SP
L1-D     32 KB, 8-way, 4-cycle,            32 KB, 8-way, 4-cycle,
         4KB 64-entry 4-way TLB            4KB 64-entry 4-way TLB
L1-I     32 KB, 8-way,                     32 KB, 8-way,
         4KB 128-entry 8-way TLB           4KB 128-entry 8-way TLB
L2       256 KB, 4-way, 11-cycle,          1 MB, 16-way, 11-13 cycle,
         4KB 1536-entry 12-way TLB,        4KB 1536-entry 12-way TLB,
         Inclusive                         Inclusive
L3       < 2 MB/core, up to 16-way,        1.375 MB/core, 11-way,
         44-cycle, Inclusive               77-cycle, Non-inclusive

The new core keeps the same L1-D and L1-I cache structures, both implemented as writeback 32KB 8-way caches. These caches have a 4-cycle access latency, but differ in their access support: Skylake-S can do 2x32-byte loads and 1x32-byte store per cycle, whereas Skylake-SP doubles the width on both.
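As a quick sanity check on what that doubling means, the per-cycle widths translate directly into peak L1-D bandwidth. A minimal sketch; the 4.0 GHz clock is our illustrative assumption, not a quoted spec:

```python
# Peak L1-D bandwidth implied by the per-cycle access widths above.
# The 4.0 GHz clock is an illustrative assumption, not a quoted spec.

def peak_l1d_bw(loads_per_cycle, load_bytes, stores_per_cycle, store_bytes, ghz=4.0):
    """Return peak (read, write) bandwidth in GB/s at the given clock."""
    read_gbs = loads_per_cycle * load_bytes * ghz    # bytes/cycle x 1e9 cycles/s
    write_gbs = stores_per_cycle * store_bytes * ghz
    return read_gbs, write_gbs

print(peak_l1d_bw(2, 32, 1, 32))   # Skylake-S:  (256.0, 128.0)
print(peak_l1d_bw(2, 64, 1, 64))   # Skylake-SP: (512.0, 256.0)
```

At the same clock, the wider datapaths simply double the peak read and write bandwidth out of the L1-D.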

The big changes are with the L2 and the L3. Skylake-SP has a 1MB private L2 cache with 16-way associativity, compared to the 256KB private L2 cache with 4-way associativity in Skylake-S. The L3 changes to an 11-way non-inclusive 1.375MB/core, from a 20-way fully-inclusive 2.5MB/core arrangement.

That’s a lot to unpack, so let’s start with inclusivity:


Inclusive Caching

An inclusive cache contains everything held in the cache underneath it, and has to be at least the same size as that cache (and usually a lot bigger); an exclusive cache, by contrast, holds none of the data in the cache underneath it. The benefit of an inclusive cache is that if a line in the lower cache is evicted to make room for other data, there should still be a copy in the cache above it which can be called upon. The downside is that the cache above has to be huge: with Skylake-S we have a 256KB L2 and a 2.5MB/core L3, meaning that the L2 contents could be replaced 10 times over before a line is evicted from the L3.

A non-inclusive cache sits somewhere between the two, and differs from an exclusive cache: in this context, when a data line is present in the L2, it does not immediately go into the L3. Only if the value in the L2 is modified or evicted does the data move into the L3, which then holds the older copy. (The reason it is not called an exclusive cache is that the data can be re-read from the L3 into the L2 and still remain in the L3.) Whether this behaves as what we usually call a victim cache depends on whether the core can prefetch data into the L2 only, or into both the L2 and L3 as required. In this case, we believe the SKL-SP core cannot prefetch into the L3, making the L3 a victim cache similar to what we see on Zen, or on Intel's first eDRAM parts with Broadwell. Victim caches usually have limited roles, especially when they are similar in size to the cache that feeds them (if a line is evicted from a large L2, what are the chances you'll need it again so soon?), but workloads that heavily reuse recent data spilling out of the L2 will see some benefit.
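The fill-policy difference can be sketched with a toy two-level LRU model. This is a deliberately simplified illustration: the `Level` class, the tiny capacities, and the fully-associative behavior are our assumptions, not how the real hardware is organized.

```python
from collections import OrderedDict

class Level:
    """Tiny fully-associative LRU cache level; a modeling toy, not real hardware."""
    def __init__(self, capacity):
        self.capacity = capacity
        self.lines = OrderedDict()            # address -> None, oldest first

    def lookup(self, addr):
        if addr in self.lines:
            self.lines.move_to_end(addr)      # refresh LRU position
            return True
        return False

    def insert(self, addr):
        """Insert a line and return the evicted address, if any."""
        self.lines[addr] = None
        self.lines.move_to_end(addr)
        if len(self.lines) > self.capacity:
            victim, _ = self.lines.popitem(last=False)
            return victim
        return None

def access(l2, l3, addr, victim_l3):
    """One access. victim_l3=True models the Skylake-SP-style policy:
    misses fill the L2 only, and the L3 holds lines evicted from the L2."""
    if l2.lookup(addr):
        return "L2 hit"
    if l3.lookup(addr):
        evicted = l2.insert(addr)             # non-inclusive: line re-read into
        if victim_l3 and evicted is not None: # the L2, may also remain in L3
            l3.insert(evicted)
        return "L3 hit"
    evicted = l2.insert(addr)
    if victim_l3:
        if evicted is not None:
            l3.insert(evicted)                # only L2 victims enter the L3
    else:
        l3.insert(addr)                       # inclusive-style: fill both levels
    return "miss"
```

In the victim-style model, a line only appears in the L3 once it has been evicted from the L2, whereas the inclusive-style model fills the L3 on every miss; that is the distinction described above.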

So why move to a victim cache on the L3? Intel's goal here was the larger private L2. Moving from 256KB to 1MB is a double double increase. A general rule of thumb is that doubling the cache reduces the miss rate by a factor of the square root of 2 (about 41%), which can be equivalent to a 3-5% IPC uplift. By doing a double double on the size (as well as on the associativity), Intel is effectively halving the L2 miss rate with the same prefetch rules. Normally this benefits L2 size sensitive workloads; some enterprise environments, such as databases, are particularly L2 size sensitive (and we fully suspect that a larger L2 came at the request of the cloud providers).
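The rule of thumb can be put into numbers. A minimal sketch; the 5% baseline L2 miss rate is an illustrative assumption, not measured data:

```python
import math

# Square-root rule of thumb: miss rate scales with 1/sqrt(cache size).
# The 5% baseline miss rate at 256 KB is an illustrative assumption.

def scaled_miss_rate(base_miss, base_kb, new_kb):
    """Project a miss rate to a new cache size under the sqrt rule of thumb."""
    return base_miss / math.sqrt(new_kb / base_kb)

print(scaled_miss_rate(0.05, 256, 512))    # one doubling: 5% / sqrt(2) ~ 3.5%
print(scaled_miss_rate(0.05, 256, 1024))   # "double double": 5% / 2 = 2.5%
```

Quadrupling the cache from 256KB to 1MB divides the assumed miss rate by sqrt(4) = 2, which is exactly the halving of the L2 miss rate described above.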

Moving to a larger cache typically increases latency. Intel is stating that the L2 latency has increased from 11 cycles to ~13, depending on the type of access; the fastest load-to-use is expected to be 13 cycles. Adjusting the L2 latency will have a knock-on effect, given that even code that is not L2 size sensitive might still be affected.

So if the L2 is larger and has a higher latency, does that mean the smaller L3 is lower latency? Unfortunately not, given the size of the L2 and a number of other factors: with the L3 being a victim cache, it is typically used less frequently, so Intel can give the L3 less stringent requirements to remain stable. In this case the latency has increased from 44 cycles in SKL-S to 77 cycles in SKL-SP. That's a sizeable difference, but again, given the utility of the victim cache it might make little difference to most software.
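To gauge how these latency shifts might balance out, we can fold the quoted numbers into a standard average memory access time (AMAT) calculation. A minimal sketch; the hit rates and the 200-cycle DRAM latency are illustrative assumptions, not measurements:

```python
# Average memory access time (AMAT) for a three-level hierarchy, using the
# quoted latencies. Hit rates and the 200-cycle DRAM latency are illustrative
# assumptions, not measured values.

def amat(l1_lat, l2_lat, l3_lat, mem_lat, l1_hit, l2_hit, l3_hit):
    """Expected cycles per access: each miss falls through to the next level."""
    return (l1_hit * l1_lat
            + (1 - l1_hit) * (l2_hit * l2_lat
            + (1 - l2_hit) * (l3_hit * l3_lat
            + (1 - l3_hit) * mem_lat)))

# Skylake-S-style: 4 / 11 / 44 cycle caches.
# Skylake-SP-style: 4 / 13 / 77 cycle caches, but we assume the 4x larger L2
# halves the L2 miss rate (80% hit -> 90% hit), per the rule of thumb.
skl_s  = amat(4, 11, 44, 200, l1_hit=0.90, l2_hit=0.80, l3_hit=0.50)
skl_sp = amat(4, 13, 77, 200, l1_hit=0.90, l2_hit=0.90, l3_hit=0.50)
print(round(skl_s, 2), round(skl_sp, 2))
```

With these assumed hit rates, the halved L2 miss rate more than compensates for the extra L2 and L3 cycles; run the same hit rates on both sides (an L2-insensitive workload) and the new latencies come out slightly worse, which is the knock-on effect mentioned above.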

Moving the L3 to a non-inclusive cache will also have repercussions for some of Intel's enterprise features. Back at the Broadwell-EP Xeon launch, one of the features provided was L3 cache partitioning, allowing limited-size virtual machines to hog most of the L3 cache when running a mission-critical workflow. Because the L3 cache was more important in that design, this was a good feature to add. Intel won't say how this feature has evolved with the Skylake-SP core at this time; we will probably have to wait until that launch to find out.

As a side note, it is worth noting here that Broadwell-E had a 256KB 8-way private L2, compared to Skylake-S with a 256KB 4-way private L2. Intel stated that the Skylake-S base core went down in associativity for several reasons, but the main one was to make the design more modular. In this case it means that both the size and associativity of the Skylake-SP L2 are 4x those of Skylake-S by design, and it suggests there may be 512KB 8-way variants in the future.

264 Comments

  • Gothmoth - Monday, June 19, 2017 - link

    i don't care about powerdraw that much if i can COOL the CPU and keep the cooling quiet.

    but in this case the powerdraw is high and the heat is crazy.

    and all because of intel insisting to save a few dollar on a 1000 dollar CPU and use TIM?

    WTF....
  • Ej24 - Monday, June 19, 2017 - link

    I wish amd would have released Threadripper closer to ryzen. That way amd wouldn't make comparisons of ryzen to Intel x99/x299. They kind of shot themselves in the foot. AM4 is only directly comparable to lga115x as a platform. R3, 5 and 7 are only really intended to compete with i3, 5, and 7 consumer parts. Amd simply doubled the core count per dollar at the consumer line. It's merely coincidental at this point that ryzen core count lines up with Intel HEDT. The platforms are not comparable in use case or intent. All these comparisons will be null when Threadripper/x399 is released as that is AMD's answer to x299.
  • Ej24 - Monday, June 19, 2017 - link

    how is the 7740x, 112w tdp only drawing 80w at full load? I understand that tdp isn't power draw but thermal dissipation. However the two values are usually quite close. In my experience, max turbo power consumption surpasses the tdp rating in watts.
    For example, my 88w tdp 4790k consumes 130w at max 4 core turbo. My 4790S a 65w tdp consumes 80w at max 4 core turbo. My 4790t, 45w tdp, consumes 55w at max 4 core turbo. So how is it the 7740x consumed 80W at max utilization??
  • AnandTechReader2017 - Tuesday, June 20, 2017 - link

    Agreed, as on http://www.anandtech.com/show/10337/the-intel-broa... the all-core load for the i7 6950X is 135W, yet on this graph it's 110W. Something is wrong with those load numbers.
  • Ian Cutress - Tuesday, June 20, 2017 - link

    It's consumer silicon running a single notch up the voltage/frequency curve. Probably binned a bit better too. 112W is just a guide to make sure you put a stonking great big cooler on it. But given the efficiency we saw with Kaby Lake-S processors to begin with, it's not that ludicrous.
  • Flying Aardvark - Monday, June 19, 2017 - link

    This is an interesting time (finally), again in CPUs. To answer the question you posed, "Ultimately a user can decide the following". I decided to go mini-ITX this time. Chose Ryzen for this, and initially the 1800X. Had to downgrade to the 1700 due to heat/temps, but overall I don't think anything competes against AMD at all in the Node202 today.

    That's one area where Intel is MIA. Coffeelake will be 6C/12T, 7700K is 4C/8T. R7-1700 is 65W and 8C/16T. Works great. I paired mine with a 1TB 960 Pro and Geforce 1060 Founders Edition.

    If I moved to anything else, it would be all the way to 16C/32T Threadripper. I'm really unimpressed by this new Intel lineup, power consumption and heat are simply out of control. Dead on arrival.
  • Gothmoth - Monday, June 19, 2017 - link

    what mobo and ram did you use? is your ryzen build really stable?

    i need full load stability 24/7.
  • Flying Aardvark - Monday, June 19, 2017 - link

    What, you don't need just 60% stability? Yes it's stable.

    I did have one bluescreen and it was the Nvidia driver. I think it's unlikely most people would run into whatever caused it, because I use a triple monitor setup and lots of programs / input switching, and it crashed upon a DisplayPort redetection.

    I bought the Geforce 1060 because it was the most efficient and well-built blower fan cooled GPU I could find. But buying again, I'd go for the best Radeon 480/580 that I could find.

    I never had a bluescreen for a decade running Intel CPUs and AMD GPUs, so I dislike changing to AMD CPUs and Nvidia GPUs.. but I think it's safest to run a Radeon. Just less likely to have an issue IMO.
    Other than that, no problems at all. Rock solid stable. I used the Biostar board and G.Skill "Ryzen" RAM kit.
  • Gothmoth - Tuesday, June 20, 2017 - link

    it's something different if a system is stable for 2-3 hours under load or 24/7 under load.. capiche? :-)
  • Gothmoth - Tuesday, June 20, 2017 - link

    btw... thanks for your answer.

    i use a triple monitor setup and use many programs at once... what sense would a 8-10 core make otherwise. :-)
