Announcement Three: Skylake-X's New L3 Cache Architecture

(AKA I Like Big Cache and I Cannot Lie)

SKU madness aside, there's more to this launch than just how many cores at what price. Deviating somewhat from its usual pattern, Intel has made some interesting changes to several elements of Skylake-X that are worth discussing. Next up is how Intel implements the per-core cache.

In previous generations of HEDT processors (as well as the Xeon processors), Intel implemented a three-stage cache before hitting main memory. The L1 and L2 caches were private to each core, while the L3 was a last-level cache covering all cores, and that L3 was inclusive of the L2. At a high level, this means that any data in L2 is duplicated in L3, such that if a cache line is evicted from L2 it will still be present in the L3 if it is needed again, rather than requiring a trip all the way out to DRAM. The relative sizes of the caches matter as well: with an L3 that is inclusive of the L2, the L3 is usually several multiples of the L2 in size, in order to store all the L2 data plus plenty of its own. Intel typically had 256 KB of L2 cache per core, and anywhere between 1.5 MB and 3.75 MB of L3 per core, which gave both caches plenty of room and performance. It is worth noting at this point that the L2 cache sits closer to the logic of the core, where space is at a premium.
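The inclusion property described above can be sketched as a toy model. This is purely illustrative (real caches are set-associative hardware with true LRU-like replacement, not Python lists), but it shows why an evicted L2 line is still served from an inclusive L3, and why an inclusive L3 must back-invalidate the L2 when it evicts:

```python
# Toy model of an inclusive L2/L3 pair. Illustrative only: real caches are
# set-associative hardware, not oldest-first Python lists.
class InclusiveCachePair:
    def __init__(self, l2_capacity, l3_capacity):
        self.l2, self.l3 = [], []   # ordered oldest-first (crude FIFO/LRU)
        self.l2_cap, self.l3_cap = l2_capacity, l3_capacity

    def fill(self, line):
        # Inclusion: every line brought into L2 is also placed in L3.
        if line not in self.l3:
            self.l3.append(line)
            if len(self.l3) > self.l3_cap:
                victim = self.l3.pop(0)
                # An inclusive L3 must back-invalidate: evicting a line
                # from L3 forces it out of the L2 as well.
                if victim in self.l2:
                    self.l2.remove(victim)
        if line not in self.l2:
            self.l2.append(line)
            if len(self.l2) > self.l2_cap:
                self.l2.pop(0)      # evicted from L2, but still lives in L3

caches = InclusiveCachePair(l2_capacity=2, l3_capacity=8)
for addr in ["A", "B", "C"]:
    caches.fill(addr)
# "A" has been pushed out of the small L2, but the inclusive L3 still
# holds it, so a later access avoids a trip to DRAM.
assert "A" not in caches.l2 and "A" in caches.l3
```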

With Skylake-X, this cache arrangement changes. When Skylake-S was originally launched, we noted that the L2 cache had a lower associativity because it allowed for more modularity, and this is that principle in action. Skylake-X processors will have their private L2 cache increased from 256 KB to 1 MB, a four-fold increase. This comes at the expense of the L3 cache, which is reduced from ~2.5 MB/core to 1.375 MB/core.
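As a quick sanity check on those figures, the per-core totals can be tallied. The L2 quadruples, but because the L3 slice shrinks, the combined per-core capacity actually drops slightly (the ~2.5 MB old L3 figure is approximate, as noted above):

```python
# Back-of-the-envelope per-core cache budget, using the figures in the text.
# Old arrangement: 256 KB L2 + ~2.5 MB (2560 KB) L3 slice per core.
# Skylake-X:       1 MB (1024 KB) L2 + 1.375 MB (1408 KB) L3 slice per core.
old_l2_kb, old_l3_kb = 256, 2560
new_l2_kb, new_l3_kb = 1024, 1408

print(f"old total per core: {old_l2_kb + old_l3_kb} KB")  # 2816 KB
print(f"new total per core: {new_l2_kb + new_l3_kb} KB")  # 2432 KB
```

So the headline change is not more cache overall, but a rebalancing toward the faster, closer L2.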

With such a large L2 cache, the L2-to-L3 relationship is no longer inclusive but ‘non-inclusive’. Intel is using this terminology rather than ‘exclusive’ or ‘fully exclusive’, as the L3 will still have some features that aren’t present in a pure victim cache, such as prefetching. What this means, however, is more work for snooping and for keeping track of where cache lines are. Cores will snoop other cores’ L2 caches to find updated data, with DRAM as a backup (which may be out of date). In previous generations the L3 cache was always a backup, but now this changes.

The good element of this design is that a larger L2 will increase the hit rate and decrease the miss rate. Depending on the level of associativity (which has not been disclosed yet, at least not in the basic slide decks), a general rule of thumb is that doubling the cache size decreases the miss rate by a factor of sqrt(2), which is good for a 3-5% IPC uplift in a regular workload. So here’s a conundrum for you: if the four-fold larger L2 roughly halves the miss rate, leading to perhaps a 6-10% IPC increase, then it’s not the same performance as Skylake-S. It may be the same microarchitecture outside the caches, but we get a situation where performance will differ.
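That rule of thumb can be sketched numerically. This is a hedged illustration, not a model of the actual silicon: the 10% base miss rate below is an assumed figure chosen purely to make the arithmetic visible.

```python
import math

# Rule of thumb cited above: each doubling of cache size cuts the miss
# rate by a factor of sqrt(2). The base miss rate here is an assumed,
# illustrative number, not a measured Skylake figure.
def miss_rate_after(base_miss_rate: float, size_multiplier: float) -> float:
    doublings = math.log2(size_multiplier)
    return base_miss_rate / (math.sqrt(2) ** doublings)

base = 0.10
# A 4x L2 (256 KB -> 1 MB) is two doublings, so sqrt(2)^2 = 2:
# the miss rate is roughly halved.
print(miss_rate_after(base, 4))   # ~0.05
```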

Fundamental Realisation: Skylake-S IPC and Skylake-X IPC will be different.

This is something that fundamentally requires in-depth testing. Combine this with the change in the L3 cache, and it is hard to predict the outcome without being a silicon design expert. I am not one of those, but it's something I want to look into as we approach the actual Skylake-X launch.

A few more things to note on the cache structure. There are many ways to implement it, and one I initially imagined was a partitioned cache strategy: the physical cache layout stays the same as in previous generations, but partitions of the L3 are designated as L2. This makes life difficult, because a portion of the L2 would then sit at the same latency as the L3, and a wide variation in L2 latency brings a lot of headaches. This method would be easy for silicon layout, but hard to implement well. Looking at the HCC silicon representation in our slide deck, however, it’s clear that there is no monolithic L3 covering all the cores – each core has its own slice. That being the case, we now have an L2 approximately the same size as the L3, at least per core. Given these two points, I fully suspect that Intel is running a true physical L2 at 1 MB, which will give the design the high hit rate and consistent low latency it needs. This will be one feather in the cap for Intel.

Comments

  • Alexvrb - Tuesday, June 6, 2017 - link

    [In the near future:]
    Oh man, they just released a board with THREE M.2 slots! My old board with only TWO (one populated) is now old and outdated!
  • Iketh - Wednesday, June 7, 2017 - link

    You're all technologically ignorant. JKflipflop is most correct here because even tho what ddriver says is true, the cpu must still be designed and traced to work with an existing pin array instead of creating the cpu with a pin array that is efficient to the new cpu architecture. It's not the motherboard anymore, it's the signaling and power routing inside the cpu that matters most.

    In other words, if JKflip had said "Why would you EVER buy a brand new CPU, then immediately castrate its performance across the board by forcing it to route power and signaling in a way that doesn't jive with its architecture?" he would have been correct.
  • theuglyman0war - Thursday, June 8, 2017 - link

    Still on x58 with an i7 980x and to be honest I just keep upgrading my gpu's and resent incremental cpu advancement. It is actually the chipset loss that keeps my eyes wandering to ddr4 pci 3.0 lanes and nvme not to mention my horrible sata 3 speeds on my rog III rampage ex which are hard to get around and not feel ghetto despite the pascal ti sli.
    :(
    Them chipset features sure do add up after a while.
  • sharath.naik - Thursday, June 8, 2017 - link

    JKflipflop, Iketh, you both are brainwashed. If you are not, go ahead and explain how much more you need to pay for boot RAID options with X299. (Or did you not know you will have to pay up to $300 more to unlock features of an X299 motherboard?) If you did not know this, then yes, brainwashed is the only word that can be used for you two.
  • LithiumFirefly - Friday, June 9, 2017 - link

    What completely baffles me is why an Intel fanboy would defend buying a new Intel high-end desktop line after the last one, X99. The X99 PCH I bought only had six chips made for it; four of them are bonkers price and the other two are gimped. The Broadwell-E update was a joke; the older Haswell chips overclocked way better, so they were faster than the newer stuff. Yeah, I'm definitely going to try the new Intel stuff after that. /s
  • melgross - Thursday, June 1, 2017 - link

    You can't just double the core count. Where are they going to put those cores? I assume that the silicon isn't just sitting there waiting for them.
  • mickulty - Saturday, June 3, 2017 - link

    All of AMD's high-end CPUs are based on the same 8-core die, "Zeppelin". Ryzen is one Zeppelin; Threadripper is two connected by Infinity Fabric on a multi-chip module; Naples is four, again connected by Infinity Fabric on an MCM. AMD could very easily put out a chip with more Zeppelins, although maintaining socket compatibility would mean losing some I/O capability.

    Interestingly, this means Ryzen has 32 PCIe lanes on the chip, but only 16 are actually available on AM4. Presumably this has something to do with Bristol Ridge and Raven Ridge AM4 compatibility, since they have fewer lanes.
  • theuglyman0war - Thursday, June 8, 2017 - link

    why not? just make the socket bigger and increase my utility bill ( or at least give me the option to suffer power if I wanna )
    Supposedly processing power is only limited by the size of the universe theoretically. :)
  • theuglyman0war - Thursday, June 8, 2017 - link

    isn't silicon just sand?
  • ddriver - Tuesday, May 30, 2017 - link

    AMD will not, and doesn't need to, launch anything other than 16 core. Intel is simply playing the core count game, much like it played the MHz game back in the days of the Pentium 4. More cores must be better.

    But at that core count you are already limited by thermal design. So if you have more cores, they will be clocked lower. So it kind of defeats the purpose.

    More cores would be beneficial for servers, where the chips are clocked significantly lower, around 2.5 GHz, allowing them to hit the best power/performance ratio by running de facto underclocked cores.

    But that won't do much good in a HEDT scenario. And AMD does appear to have a slight IPC/watt advantage. Not to mention offering significantly better value due to better price/performance ratio.

    So even if Intel were to launch an 18 core design, that's just a desperate "we got two more cores" that will do little to impress potential customers in that market niche. It will be underclocked and expensive, and even if it manages to take a tangible lead against a 16 core Threadripper, it will not be worth the money.
