Announcement Three: Skylake-X's New L3 Cache Architecture

(AKA I Like Big Cache and I Cannot Lie)

SKU madness aside, there's more to this launch than just the number of cores at what price. Deviating somewhat from their usual pattern, Intel has made some interesting changes to several elements of Skylake-X that are worth discussing. Next is how Intel is implementing the per-core cache.

In previous generations of HEDT processors (as well as the Xeon processors), Intel implemented a three-stage cache before hitting main memory. The L1 and L2 caches were private to each core and inclusive, while the L3 cache was a last-level cache covering all cores and was also inclusive. This, at a high level, means that any data in L2 is duplicated in L3, such that if a cache line is evicted from L2 it will still be present in the L3 if it is needed, rather than requiring a trip all the way out to DRAM. The sizes of the caches are important as well: with an inclusive L2 to L3, the L3 cache is usually several multiples of the L2 in size, in order to store all the L2 data plus some more for the L3 itself. Intel typically had 256 KB of L2 cache per core, and anywhere from 1.5 MB to 3.75 MB of L3 per core, which gave both caches plenty of room and performance. It is worth noting at this point that L2 cache is closer to the logic of the core, and space is at a premium.

With Skylake-X, this cache arrangement changes. When Skylake-S was originally launched, we noted that the L2 cache had a lower associativity as it allowed for more modularity, and this is that principle in action. Skylake-X processors will have their private L2 cache increased from 256 KB to 1 MB, a four-fold increase. This comes at the expense of the L3 cache, which is reduced from ~2.5MB/core to 1.375MB/core.
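As a rough sketch of what this trade means for capacity, the per-core figures above can be turned into a quick calculation. The function name and the two simplifications (an inclusive L3 duplicates every L2 line, a non-inclusive L3 duplicates none) are assumptions made for illustration, not anything Intel has specified.

```python
KB = 1024
MB = 1024 * KB

def unique_capacity_per_core(l2_bytes, l3_bytes, inclusive):
    """Approximate bytes of distinct data one core's L2 plus L3 share can hold.

    Assumption: an inclusive L3 duplicates every L2 line, while a
    non-inclusive L3 is treated as duplicating none (the best case).
    """
    if inclusive:
        return l3_bytes          # L2 contents are already counted inside L3
    return l2_bytes + l3_bytes   # best case: no overlap between the levels

# Figures quoted in the article: 256 KB + ~2.5 MB before, 1 MB + 1.375 MB now.
old = unique_capacity_per_core(256 * KB, 2.5 * MB, inclusive=True)
new = unique_capacity_per_core(1 * MB, 1.375 * MB, inclusive=False)

print(f"previous generation: {old / MB:.3f} MB unique per core")
print(f"Skylake-X:           {new / MB:.3f} MB unique per core")
print(f"L2 growth:           {(1 * MB) / (256 * KB):.0f}x")
```

Under these assumptions the amount of distinct data held per core barely moves (2.5 MB versus 2.375 MB); what changes is how much of it sits in the private L2 rather than the shared L3.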

With such a large L2 cache, the L2 to L3 connection is no longer inclusive and is now ‘non-inclusive’. Intel is using this terminology rather than ‘exclusive’ or ‘fully-exclusive’, as the L3 will still retain some features that aren’t present in a victim cache, such as prefetching. What this will mean, however, is more work for snooping and for keeping track of where cache lines are. Cores will snoop other cores’ L2 to find updated data, with the DRAM as a backup (which may be out of date). In previous generations the L3 cache was always a backup, but now this changes.
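To make the snooping point concrete, here is a toy lookup model. Everything in it is hypothetical: the function name, the dictionaries standing in for caches, and in particular the search order (own L2, other cores' L2, L3, then DRAM) are assumptions for illustration. Intel has not disclosed the actual protocol, and real hardware would track line locations rather than search every cache.

```python
def find_line(addr, core_id, l2_caches, l3, dram):
    """Return (source, data) for addr in a toy non-inclusive hierarchy.

    Assumed order: own L2, then other cores' L2 (which may hold the
    newest copy), then the shared L3, then DRAM as the last resort.
    """
    own = l2_caches[core_id]
    if addr in own:
        return "own L2", own[addr]
    # Snoop the other cores' private L2 caches.
    for other_id, l2 in enumerate(l2_caches):
        if other_id != core_id and addr in l2:
            return f"core {other_id} L2", l2[addr]
    # A non-inclusive L3 is not guaranteed to hold the line.
    if addr in l3:
        return "L3", l3[addr]
    return "DRAM", dram[addr]

# Two cores; core 1 holds a modified copy of 0x80 that DRAM has not seen yet.
l2_caches = [{0x40: "A"}, {0x80: "B (modified)"}]
l3 = {0xC0: "C"}
dram = {0x40: "A", 0x80: "B (stale)", 0xC0: "C", 0x100: "D"}

for addr in (0x40, 0x80, 0xC0, 0x100):
    source, data = find_line(addr, 0, l2_caches, l3, dram)
    print(f"{addr:#05x}: {data!r} from {source}")
```

In this toy, the request for 0x80 from core 0 is served out of core 1's L2, which is the case where relying on DRAM alone would have returned out-of-date data.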

The good element of this design is that a larger L2 will increase the hit rate and decrease the miss rate. Depending on the level of associativity (which has not been disclosed yet, at least not in the basic slide decks), a general rule I have heard is that a doubling of cache size decreases the miss rate by a factor of sqrt(2), and is liable for a 3-5% IPC uplift in a regular workflow. Going from 256 KB to 1 MB is two doublings, so the rule gives sqrt(2) x sqrt(2) = 2. Thus here’s a conundrum for you: if the L2 has a miss rate that is a factor of 2 lower, leading to an 8-13% IPC increase, it’s not the same performance as Skylake-S. It may be the same microarchitecture outside the caches, but we get a situation where performance will differ.
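The rule of thumb quoted above is easy to check numerically. This is only a sketch of that heuristic, with a hypothetical function name; it says nothing about how Skylake-X actually behaves, and it deliberately leaves the IPC estimate alone.

```python
import math

def relative_miss_rate(size_ratio):
    """Miss rate relative to the baseline cache, under the rule of thumb
    that each doubling of cache size divides the miss rate by sqrt(2).
    """
    doublings = math.log2(size_ratio)
    return 1.0 / (math.sqrt(2) ** doublings)

# 256 KB -> 1 MB is a 4x increase, i.e. two doublings.
ratio = 1024 / 256
rel = relative_miss_rate(ratio)

print(f"size ratio: {ratio:.0f}x")
print(f"doublings:  {math.log2(ratio):.0f}")
print(f"miss rate relative to 256 KB: {rel:.2f}")
```

Because the rule compounds per doubling, it is equivalent to saying the miss rate scales as one over the square root of the size ratio, which is where the factor of 2 for a four-fold L2 comes from.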

Fundamental Realisation: Skylake-S IPC and Skylake-X IPC will be different.

This is something that fundamentally requires in-depth testing. Combine this with the change in the L3 cache, and it is hard to predict the outcome without being a silicon design expert. I am not one of those, but it's something I want to look into as we approach the actual Skylake-X launch.

More things to note on the cache structure. There are many ‘ways’ to do it, one of which I imagined initially is a partitioned cache strategy. The cache layout could be the same as previous generations, but with partitions of the L3 designated as L2. This makes life difficult, because then you have a partition of the L2 at the same latency as the L3, and that brings a lot of headaches if the L2 latency has a wide variation. This method would be easy for silicon layout, but hard to implement. Looking at the HCC silicon representation in our slide-deck, it’s clear that there is no fundamental L3 covering all the cores; each core has its partition. That being the case, we now have an L2 at approximately the same size as the L3, at least per core. Given these two points, I fully suspect that Intel is running a physical L2 at 1 MB, which will give the design the high hit rate and consistent low latency it needs. This will be one feather in the cap for Intel.


203 Comments


  • mdw9604 - Tuesday, May 30, 2017 - link

    FYCK Intel. If AMD had not come out with Ryzen, they would still be sticking with 4 Core desktop processors and 8 cores on the HEDT machines and charging $1K plus for them. They are trying to make sure AMD can't compete. I'm buying AMD, I am not continuing to support Intel's monopolistic x86 stranglehold.
  • Bullwinkle J Moose - Wednesday, May 31, 2017 - link

    Preferred Core / Turbo 3 needs another update for the upcoming Cannon Lake

    Even if a single core could run @ 4.8Ghz single thread continuous while the second best core might reach 4.7 and another 4.6, why not let the core cool off while temporarily boosting the clocks "above" their "continuous" max speed on single threaded apps?

    Cycle the cores to max "temporary" clock speed 5.2 / 5.1 / 5.0Ghz while the previous main core is cooling down

    Turbo 4?
  • Ej24 - Wednesday, May 31, 2017 - link

    It's worth noting that ryzen 7 is akin to lga115x. It's mainstream. Motherboards will cost half of what x299 will cost. There should be no comparisons made between am4 and lga2066. They're two different market segments. People keep making the comparison b/c core counts but it doesn't make sense. The Intel HEDT should be compared to Threadripper. Amd literally doubled our core count per dollar at the mainstream. Intel still hasn't.
  • SanX - Wednesday, May 31, 2017 - link

    Billion is very scary word for unwashed. For 100+ billion market cap company it is a change.
  • Notmyusualid - Thursday, June 1, 2017 - link

    @ SanX

    I feel like 'the unwashed' this morning, I better move my @ss...

    :)
  • SanX - Wednesday, May 31, 2017 - link

    I wrote this in response to the two trolls who think that the cost of the fab is not included into the price of the chips.

    /* Anandtech, fix your obsolete discussion forum which does not have Edit function and slips posts to the end from the threads if use Android Google browser with JS off.
  • close - Thursday, June 1, 2017 - link

    Dude, you're the one who calculated that:
    "$2000 for 18 cores is $100 per core.
    This is approximately 20x the production cost."

    And concluded that:
    "It is always good for monopoly to be a monopoly."

    Don't be surprised that people take a p*ss at you for what you write. You are the one who suggested the relationship between the production price per core and the retail price is somehow relevant. Why not "per transistor"? Or "per mm^2"?

    You chose an irrelevant metric (price/core when the CPU has additional components that you ignored), you ignored that there are many objective factors that make such a CPU more expensive (like yields which are worse the bigger the chip), you assumed everything is linear and can be quickly presented as a simple napkin calculation, and you tried to sell it. This isn't how any of this works so now it's easy to question your understanding on these topics. Maybe you're too washed...
  • tamalero - Wednesday, May 31, 2017 - link

    140W TDP? jesus...
  • Hrel - Wednesday, May 31, 2017 - link

    I can't believe they released to consumers at all, what consumer pay $2000 for a CPU? Who is this for?

    Hell, I had a hard time getting a fortune 10 company to agree to pay more than $1000/CPU for the servers that ran their own network and directory.

    I truly cannot imagine any consumer spending that much on a CPU. This baffles my mind.

    Someone hit me up when Anandtech does a review of 200-$300 CPU's, as anything beyond that better be for a fucking server.
  • Morawka - Wednesday, May 31, 2017 - link

    dude you dont know how many rich kids and benchmarkers there are in the world. As Ian noted, the top end Extreme Edition is always the best selling CPU out of all of them.. That was even true for Broadwell's 10c $1750 CPU last year.
