Announcement Three: Skylake-X's New L3 Cache Architecture

(AKA I Like Big Cache and I Cannot Lie)

SKU madness aside, there's more to this launch than just the number of cores at each price point. Deviating somewhat from their usual pattern, Intel has made some interesting changes to several elements of Skylake-X that are worth discussing. Next up is how Intel is implementing the per-core cache.

In previous generations of HEDT processors (as well as the Xeon processors), Intel implemented a three-stage cache hierarchy before hitting main memory. The L1 and L2 caches were private to each core, while the L3 cache was an inclusive last-level cache covering all cores. At a high level, this means that any data in L2 is duplicated in L3, such that if a cache line is evicted from the L2 it will still be present in the L3 if it is needed again, rather than requiring a trip all the way out to DRAM. The relative sizes of the caches are important as well: with an inclusive L3, the L3 cache is usually several multiples of the L2 in order to store all the L2 data plus additional data of its own. Intel typically had 256 KB of L2 cache per core, and anywhere between 1.5 MB and 3.75 MB of L3 per core, which gave both caches plenty of room and performance. It is worth noting at this point that the L2 cache is closer to the logic of the core, where space is at a premium.
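As a back-of-the-envelope illustration of why an inclusive L3 wants to be many multiples of the L2 (using hypothetical per-core figures in the Broadwell-E style; the core count is arbitrary):

```python
# Under an inclusive L3, every line held in a core's private L2 is
# duplicated in the shared L3, so that duplicated portion of the L3
# holds no unique data.
cores = 10
l2_per_core_kb = 256                 # private L2 per core
l3_per_core_kb = 2560                # ~2.5 MB of shared L3 per core

total_l3 = cores * l3_per_core_kb
duplicated = cores * l2_per_core_kb  # shadow copies of every L2 line
unique_l3 = total_l3 - duplicated

print(duplicated / total_l3)         # 0.1 -> only 10% of the L3 is duplicates
print(unique_l3)                     # 23040 KB of unique capacity remains
```

With a 10:1 L3:L2 ratio the duplication overhead is tolerable; shrink that ratio toward 1:1, as Skylake-X does, and inclusion stops making sense.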

With Skylake-X, this cache arrangement changes. When Skylake-S was originally launched, we noted that the L2 cache had a lower associativity as it allowed for more modularity, and this is that principle in action. Skylake-X processors will have their private L2 cache increased from 256 KB to 1 MB, a four-fold increase. This comes at the expense of the L3 cache, which is reduced from ~2.5MB/core to 1.375MB/core.
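A quick sketch of the per-core arithmetic. This is a simplification (non-inclusive caches do not perfectly add their capacities), but it shows that the total effective capacity per core barely moves, while the balance shifts heavily toward the faster L2:

```python
# Per-core cache capacity in KB, before and after the change.
old_l2, old_l3 = 256, 2560    # previous HEDT style: 256 KB L2 + ~2.5 MB L3
new_l2, new_l3 = 1024, 1408   # Skylake-X: 1 MB L2 + 1.375 MB L3

# With an inclusive L3, every L2 line is duplicated in L3, so the
# effective unique capacity is just the L3.
old_effective = old_l3
# With a non-inclusive L3, the two levels mostly hold distinct lines,
# so the capacities roughly add.
new_effective = new_l2 + new_l3

print(old_effective, new_effective)   # 2560 vs 2432 KB per core
```

Nearly the same total footprint, but far more of it now sits at L2 latency.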

With such a large L2 cache, the L2-to-L3 relationship is no longer inclusive but 'non-inclusive'. Intel is using this terminology rather than 'exclusive' or 'fully-exclusive', as the L3 will still have some features that aren't present in a pure victim cache, such as prefetching. What this will mean, however, is more work for snooping and for keeping track of where cache lines are. Cores will snoop other cores' L2 caches to find updated data, with DRAM as a backup (which may be out of date). In previous generations the L3 cache was always that backup, but now this changes.

The good element of this design is that a larger L2 will increase the hit rate and decrease the miss rate. Depending on the level of associativity (which has not been disclosed yet, at least not in the basic slide decks), a general rule of thumb I have heard is that doubling the cache size decreases the miss rate by a factor of sqrt(2), and is good for a 3-5% IPC uplift in a regular workflow. Thus here's a conundrum for you: if the quadrupled L2 halves the miss rate, leading to perhaps a 6-10% IPC increase, then the performance is not the same as Skylake-S. It may be the same microarchitecture outside the caches, but we get a situation where performance will differ.
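To put numbers on that rule of thumb (the 5% baseline L2 miss rate here is purely hypothetical, chosen only to illustrate the scaling):

```python
import math

def scaled_miss_rate(base_miss_rate, size_ratio):
    # Rule of thumb from the text: doubling a cache divides the miss rate
    # by sqrt(2), so scaling by `size_ratio` divides it by sqrt(size_ratio).
    return base_miss_rate / math.sqrt(size_ratio)

base = 0.05                          # hypothetical 5% L2 miss rate
print(scaled_miss_rate(base, 2))     # one doubling: ~0.0354 (5% / sqrt(2))
print(scaled_miss_rate(base, 4))     # 256 KB -> 1 MB: 0.025, i.e. halved
```

Two doublings compound to a factor-of-2 miss-rate reduction, which is where the quadrupled L2 gets its estimated uplift.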

Fundamental Realisation: Skylake-S IPC and Skylake-X IPC will be different.

This is something that fundamentally requires in-depth testing. Combine this with the change in the L3 cache, and it is hard to predict the outcome without being a silicon design expert. I am not one of those, but it's something I want to look into as we approach the actual Skylake-X launch.

More things to note on the cache structure. There are many 'ways' to do it, and one that I initially imagined is a partitioned cache strategy: the cache layout could be the same as previous generations, but partitions of the L3 would be designated as L2. This makes life difficult, because then you have a partition of the L2 at the same latency as the L3, and that brings a lot of headaches if the L2 latency has a wide variation. This method would be easy for silicon layout, but hard to implement. Looking at the HCC silicon representation in our slide deck, it's clear that there is no monolithic L3 covering all the cores – each core has its own partition. That being the case, we now have an L2 at approximately the same size as the L3, at least per core. Given these two points, I fully suspect that Intel is running a physical L2 at 1 MB, which will give the design the high hit rate and consistent low latency it needs. This will be one feather in the cap for Intel.


203 Comments


  • ddriver - Tuesday, May 30, 2017 - link

    They can do anything between 8 and 16 in the threadripper design, if the market should call for it.
  • ddriver - Tuesday, May 30, 2017 - link

    They have 4, 6 and 8 core Zen dies, but they gotta have some with 5 or 7 working cores. Those could be sold as 4 and 6 core parts with one disabled core, or they can be slapped on the same chip for 10 and 14 core products, no working cores wasted.
  • ilt24 - Thursday, June 1, 2017 - link

    ddriver..."They have 4 6 and 8 core zen dies"

    Actually AMD only has one 8 core Zen die. All of this year's Ryzen 7/5/3 chips are made from that die; they just disable cores to get the lower core counts. Threadripper is made from a pair of these dies in an MCM package, with some cores disabled for some SKUs. The EPYC processor will be made from 4 of these dies in a single package. If AMD for some reason wanted to match Intel's 18 cores they would need to make a three-die MCM chip with 6 cores disabled.
  • ddriver - Friday, June 2, 2017 - link

    "Cores" as in "active/working cores".

    I doubt they'll be making 3-die MCM solutions; that is too asymmetric. Also, they would be throwing I/O away, as the HEDT socket doesn't have the pins to facilitate it.

    It is unlikely that AMD will have an 18 core SKU in either HEDT or server. Threadripper will stop at 16. Epyc will start at 20, if the market calls for it; that's 4 dies with 5 active cores each. I doubt AMD will produce asymmetric designs. So that's 20, 24, 28 and 32 cores for Epyc, and half of that for Threadripper: 10, 12, 14 and 16 cores.

    Lacking an 18 core solution is not a big whoop, and most certainly not worth the R&D money.
  • ilt24 - Friday, June 2, 2017 - link

    "I doubt they'll be making 3 die MCM solutions,"

    I agree, which was why I said "If AMD for some reason wanted..."

    "Epyc will start at 20"

    I think AMD will also have lower core count EPYC chips, as a good part of the market uses them. ...and a 16 core version made from a pair of dies will cost them quite a bit less than a 20 core+ version made from 4 dies.
  • XabanakFanatik - Tuesday, May 30, 2017 - link

    The reason the price jumps like that from 8 to 10 cores is that you are paying to move from the 28-lane SKUs into the 44-lane SKUs along with the extra cores. Once there, you are only paying for the increased core count.
  • theuglyman0war - Thursday, June 8, 2017 - link

    That's a tough sell... the 8 core's higher base clock and 44 lanes at $599 would have been the easy purchase. Now if the 16 core Threadripper actually releases at the rumored $849, I can't see any silver lining for Intel except for 18 core bragging rights. The 10 core benchmarks would have to have some kind of 16 core Threadripper-beating magic to justify the $1000 price. All the 8 core Skylake has to do is embarrass the 1800X in benchmarks (and it could also have done so if Intel had just made that 8 core solution an i9 with 44 lanes).
    Not quite enough to be more awesome?
    The new larger L2 caching seems like a big wrench in the works, where direct comparison might easily fall apart if AMD does not have an equivalent advance on Threadripper that trumps Infinity Fabric woes?
    At the end of the day, benchmarks and support for AVX-512 etc. will take some time? I ain't close to pulling a trigger till I C. (tho I B really excited to C!)
  • jjj - Tuesday, May 30, 2017 - link

    Intel reacts to Ryzen and they leave room for AMD to do up to 2x better lol.
  • alamilla - Tuesday, May 30, 2017 - link

    Those prices are HILARIOUS.
    C'mon Intel
  • jjj - Tuesday, May 30, 2017 - link

    I expect the cheapest 16 core Threadripper to allow users to fit the CPU plus mobo in $1k, so $799 for the CPU.

    Intel's play in recent years has been to push ASPs up and offset declining units but AMD has to find the balance between margins and share gains- Intel has no share left to gain. AMD doesn't have yield issues with MCM so they could easily go even lower than 799$ for 16 cores.
    What I am curious about is if AMD has any SKUs with fewer than 12 cores, and how aggressive they get with those.
    And ofc let's see if AMD has a new revision for Ryzen.
    Doubt they launch Threadripper today, before Naples.
