Section by Andrei Frumusanu

Cache Architecture: The Effect of Increasing L2 and L3

Although the Willow Cove core doesn’t bring many improvements to the core microarchitecture itself, one big update is the new memory subsystem, thanks to a significant change in the design’s caches.

Intel here has made some big changes in the L2 caches as well as the L3 cache slices: they’ve both grown considerably bigger and have had their cache line exclusivity altered.

Core Cache Comparison

| AnandTech | Willow Cove | Sunny Cove | Cannon Lake | Skylake | AMD Zen 2 |
|---|---|---|---|---|---|
| L1-D | 48 KB, 12-way | 48 KB, 12-way | 32 KB, 8-way | 32 KB, 8-way | 32 KB, 8-way |
| L1-I | 32 KB, 8-way | 32 KB, 8-way | 32 KB, 8-way | 32 KB, 8-way | 32 KB, 8-way |
| L2 | 1280 KB, 20-way | 512 KB, 8-way | 256 KB, 4-way | 256 KB, 4-way | 512 KB, 8-way |
| L3/core (Max. Total) | 3 MB (<=12MB), 12-way | 2 MB (<=8MB), 16-way | 2 MB (<=8MB), 16-way | 2 MB (<=20MB), 16-way | 4 MB (<=16MB), 16-way |
| uOp Cache | 2304 | 2304 | 1536 | 1536 | 4096 |

The L1-D and L1-I caches on Willow Cove remain the same as on the predecessor Sunny Cove design: a 48KB 12-way associative data cache and a 32KB 8-way associative instruction cache.

Where things differ significantly is in the L2. This time around Intel has completely redesigned this part of the core, increasing the capacity by 150%, from 512KB to 1280KB. Furthermore, the actual usable capacity has increased even more between generations, as the new design moves from being inclusive of the L1 caches to a non-inclusive design.
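To put rough numbers on the capacity claims, here is a small sketch using the sizes from the comparison table. The `unique_capacity_kb` helper is our own idealisation and not an Intel figure: it assumes an inclusive L2 duplicates every L1 line, while a non-inclusive L2 can, at best, hold only lines that are not in the L1s.

```python
# Cache sizes in KB, taken from the comparison table.
L1D_KB, L1I_KB = 48, 32
SUNNY_L2_KB, WILLOW_L2_KB = 512, 1280


def percent_increase(old, new):
    """Relative growth from old to new, in percent."""
    return (new - old) / old * 100.0


def unique_capacity_kb(l2_kb, l1_total_kb, inclusive):
    """Idealised upper bound on distinct KB held across L1 + L2.

    Assumption: an inclusive L2 mirrors every L1 line, so the L1s add
    nothing; a non-inclusive L2 is free (in the best case) to hold only
    lines that are absent from the L1s.
    """
    return l2_kb if inclusive else l2_kb + l1_total_kb


raw_growth = percent_increase(SUNNY_L2_KB, WILLOW_L2_KB)
sunny_unique = unique_capacity_kb(SUNNY_L2_KB, L1D_KB + L1I_KB, inclusive=True)
willow_unique = unique_capacity_kb(WILLOW_L2_KB, L1D_KB + L1I_KB, inclusive=False)

print(f"Raw L2 growth: {raw_growth:.0f}%")
print(f"Best-case unique L1+L2 capacity: {sunny_unique} KB -> {willow_unique} KB "
      f"({willow_unique / sunny_unique:.2f}x)")
```

Under these assumptions the best-case unique capacity grows from 512KB to 1360KB, roughly 2.7x rather than the raw 2.5x. Real behaviour depends on the replacement and fill policies, which aren't covered here.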

The trade-off made in growing the cache by this much lies in the associativity, which increases from 8-way to 20-way; this likely decreases conflict misses for the structure.

On the L3 side, there’s also been a change in the microarchitecture, as the cache slice size per core increases from 2MB to 3MB, totalling 12MB for a 4-core Tiger Lake design. Here Intel actually reduced the associativity from 16-way to 12-way, likely increasing cache line conflict misses and decreasing access parallelism.
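One way to read these associativity changes is to work out the number of sets each configuration implies. The sketch below assumes 64-byte cache lines (our assumption, not stated above) and a simple size = sets x ways x line-size organisation; the real L3 slices may index and hash addresses differently.

```python
LINE_BYTES = 64  # assumption: 64-byte cache lines


def num_sets(size_kb, ways, line_bytes=LINE_BYTES):
    """Sets implied by size = sets * ways * line_bytes."""
    total_lines, remainder = divmod(size_kb * 1024, line_bytes)
    if remainder or total_lines % ways:
        raise ValueError("size is not a whole number of sets")
    return total_lines // ways


configs = {
    "Sunny Cove L2 (512KB, 8-way)": (512, 8),
    "Willow Cove L2 (1280KB, 20-way)": (1280, 20),
    "Sunny Cove L3 slice (2MB, 16-way)": (2 * 1024, 16),
    "Willow Cove L3 slice (3MB, 12-way)": (3 * 1024, 12),
}

for name, (size_kb, ways) in configs.items():
    print(f"{name}: {num_sets(size_kb, ways)} sets")
```

Under these assumptions both L2 designs come out at 1024 sets, so the extra capacity is accounted for entirely by the added ways. The L3 slice goes from 2048 to 4096 sets while dropping from 16 to 12 ways.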

When looking at the i7-1185G7 in our custom latency test tool, we immediately note the cache structure changes when comparing the results to a previous generation design such as the Ice Lake based i7-1065G7.

The first thing to note about the results is the frequency of the cores as well as the systems’ DRAM configurations: the Tiger Lake part clocked up to 4800MHz and featured LPDDR4X-4266 with 36-39-39 timings, while the Ice Lake figures were measured on a Surface Laptop 3, clocking at 3900MHz with LPDDR4X-3733 at 32-34-34 timings.

On the L1 side of things, as expected, we don’t see much change in latency beyond the clock frequency increase, which brings access times down from 1.3ns to 1.04ns.

Moving onto the L2 cache is where things become interesting. Absolute access times go down from 3.3ns to 2.9ns, and the Willow Cove core now sustains this access time to a greater depth, up to 1.25MB, exactly as we’d expect given the cache’s larger structure this generation.

The access latencies don’t extend exactly to 12MB because starting from 8MB we’re exceeding the coverage of the L2 TLB at which point the core has to page-walk, incurring heavier latency penalties.

Intel hasn’t changed the TLBs this generation, still maintaining a 64-page L1 TLB which means that starting from 256KB depth (at 4KB pages), we’re seeing an increase in access times for access patterns which miss the first level TLB.
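The TLB figures above follow from simple reach arithmetic: reach is the number of entries multiplied by the page size. The sketch below uses the 64-entry L1 TLB and 4KB pages mentioned in the text. The L2 TLB entry count is not given here, so it is only inferred from the 8MB coverage figure.

```python
PAGE_KB = 4          # 4KB pages, as stated in the text
L1_TLB_ENTRIES = 64  # 64-page L1 TLB, as stated in the text


def tlb_reach_kb(entries, page_kb=PAGE_KB):
    """Memory covered by a TLB with the given number of entries."""
    return entries * page_kb


def entries_for_reach(reach_kb, page_kb=PAGE_KB):
    """Entry count implied by a given coverage (an inference, not a spec)."""
    return reach_kb // page_kb


l1_reach_kb = tlb_reach_kb(L1_TLB_ENTRIES)
implied_l2_entries = entries_for_reach(8 * 1024)  # 8MB coverage from the text

print(f"L1 TLB reach: {l1_reach_kb} KB")
print(f"L2 TLB entries implied by 8MB reach: {implied_l2_entries}")
```

This gives the 256KB point at which first-level TLB misses start, and implies an L2 TLB of about 2048 entries if all of them map 4KB pages.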

On the L3 we’re getting some interesting results which are both positive and negative. The positive is of course the vastly increased depth of the cache, which now sees good access latencies extend up to around the 10-12MB mark. What’s seemingly not so great is that the absolute latency figures aren’t really any different to Ice Lake, ending up nearly identical even though the Tiger Lake design clocks up to 23% higher in frequency. This is a sign that the cycle-access latencies of the design have gone up quite a bit this generation.

At greater depths reaching DRAM, things are massively improved for the new Tiger Lake design: full random access latency at an equal 160MB depth in the graphs improves from 130ns to 98ns. Admittedly, we’re using different DRAM configurations between the two test platforms, and the Tiger Lake system is using 14% higher clocked memory, but it does have worse timings. The actual latency improvements are well beyond the theoretical DRAM access latency difference, so what I think is happening here is that Intel has made some improvements to their memory subsystem and memory controllers.
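As a rough sanity check on the "theoretical DRAM access latency difference", the first timing value can be converted into nanoseconds. This is a simplified sketch: it assumes the first number in each timing triple is a CAS latency counted in memory-clock cycles, and that the memory clock is half the transfer rate. Real LPDDR4X access latency involves more than this single term.

```python
def cas_latency_ns(transfer_rate_mtps, cas_cycles):
    """CAS latency in ns, assuming clock = transfer rate / 2 (double data rate)."""
    clock_mhz = transfer_rate_mtps / 2
    return cas_cycles / clock_mhz * 1000.0


# Configurations quoted in the text (first timing value only).
tgl_cas_ns = cas_latency_ns(4266, 36)   # Tiger Lake: LPDDR4X-4266, 36-39-39
icl_cas_ns = cas_latency_ns(3733, 32)   # Ice Lake:   LPDDR4X-3733, 32-34-34

measured_gain_ns = 130 - 98             # full random latency at 160MB depth
data_rate_gain_pct = (4266 / 3733 - 1) * 100

print(f"Tiger Lake CAS: {tgl_cas_ns:.2f} ns, Ice Lake CAS: {icl_cas_ns:.2f} ns")
print(f"Data rate advantage: {data_rate_gain_pct:.1f}%")
print(f"Measured latency improvement: {measured_gain_ns} ns")
```

Under these assumptions the two CAS latencies land within about 0.3ns of each other (roughly 16.9ns versus 17.1ns), which is tiny next to the 32ns measured improvement.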

We’re seeing a slight change in the access pattern latencies compared to Ice Lake, especially in the “R per R page” pattern, which remains within a single memory page before moving onto the next, with access latencies being 30% better than on Ice Lake. This does point to some actual structural changes on the memory controller side, as the prefetcher behaviour otherwise doesn’t see any changes at all, with things being pretty much similar to what we’ve seen on Skylake.
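To illustrate how a full random pattern differs from an "R per R page" pattern, here is a minimal sketch of how such pointer-chasing chains could be built. This is our reading of the pattern descriptions, not the actual test tool: the function names, the 64-byte element size, and the chain layout are all assumptions. A real latency test would time dependent loads in native code; this only shows the structure of the access order.

```python
import random

PAGE_BYTES = 4096    # 4KB pages, as in the TLB discussion
ELEMENT_BYTES = 64   # assumption: one chain element per 64-byte cache line
SLOTS_PER_PAGE = PAGE_BYTES // ELEMENT_BYTES


def chain_from_order(order):
    """Turn a visiting order into a next-index table forming one closed cycle."""
    nxt = [0] * len(order)
    for i, slot in enumerate(order):
        nxt[slot] = order[(i + 1) % len(order)]
    return nxt


def full_random_chain(num_slots, rng):
    """Visit every slot in a fully random order."""
    order = list(range(num_slots))
    rng.shuffle(order)
    return chain_from_order(order)


def random_per_random_page_chain(num_pages, rng):
    """Visit pages in random order, exhausting each page randomly before moving on."""
    pages = list(range(num_pages))
    rng.shuffle(pages)
    order = []
    for page in pages:
        slots = list(range(page * SLOTS_PER_PAGE, (page + 1) * SLOTS_PER_PAGE))
        rng.shuffle(slots)
        order.extend(slots)
    return chain_from_order(order)


def walk(nxt):
    """Follow the chain once around; return (slots visited, page switches)."""
    visited, switches, cur = 0, 0, 0
    while True:
        following = nxt[cur]
        visited += 1
        if following // SLOTS_PER_PAGE != cur // SLOTS_PER_PAGE:
            switches += 1
        cur = following
        if cur == 0:
            break
    return visited, switches


rng = random.Random(0)
NUM_PAGES = 8
full = walk(full_random_chain(NUM_PAGES * SLOTS_PER_PAGE, rng))
paged = walk(random_per_random_page_chain(NUM_PAGES, rng))
print(f"full random:      visited={full[0]}, page switches={full[1]}")
print(f"R per R page:     visited={paged[0]}, page switches={paged[1]}")
```

Both chains touch every slot exactly once per lap, but the per-page chain only crosses a page boundary once per page. That keeps the TLB and DRAM page behaviour far friendlier than the fully random walk.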

What’s also interesting for the new design is that straightforward linear streaming patterns have seen a slight degradation, increasing from 3.516ns to 4.277ns on the new core. This is likely a side-effect of the added cache cycles in the lower level caches of the new Willow Cove core.

Translating the latency graph from nanoseconds to core cycles, we’re seeing the generational structural changes between the Sunny Cove and Willow Cove designs. 

Core Cache Latency (in core cycles)

| AnandTech | Willow Cove | Sunny Cove | Cannon Lake | Skylake | AMD Zen 2 |
|---|---|---|---|---|---|
| L1 | 5 | 5 | 4 | 4 | 4 |
| L2 | 14 | 13 | 12 | ~12 | 12 |
| L3 | 39-45 | 30-36 | | 26-37 | 34 |

The L1D cache remains the same at 5 cycles latency, which is still a 1-cycle degradation over Skylake cores.

The L2 seemingly has gone up from 13 cycles to 14 cycles in Willow Cove, which isn’t all that bad considering it is now 2.5x larger and its associativity has gone up. It’s interesting to contrast this against other similarly sized caches in the industry: Arm’s Neoverse N1 core has a 1MB cache coming in at 11-cycle latency, whilst their new X1 core shaves this down to 10 cycles. Of course, Intel’s design clocks much higher, but the competitor’s design would still end up with better absolute access times.

The L3 cache cycle latency is a bit disappointing, as we’re seeing essentially a +9 cycle degradation over the older design. This explains the earlier absolute access latencies, which essentially remained the same even though the core clocks 23% higher.
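The link between the cycle counts and the earlier nanosecond figures is a division by clock frequency. The sketch below assumes both cores ran at their quoted peak clocks (3900MHz and 4800MHz) throughout the test, which is our simplification.

```python
ICL_MHZ = 3900  # Ice Lake i7-1065G7 (Sunny Cove), peak clock from the text
TGL_MHZ = 4800  # Tiger Lake i7-1185G7 (Willow Cove), peak clock from the text


def cycles_to_ns(cycles, freq_mhz):
    """Convert a latency in core cycles to nanoseconds at a given clock."""
    return cycles / freq_mhz * 1000.0


# (Sunny Cove cycles, Willow Cove cycles); L3 uses the low end of each range.
levels = {"L1": (5, 5), "L2": (13, 14), "L3": (30, 39)}

for name, (sunny_cycles, willow_cycles) in levels.items():
    sunny_ns = cycles_to_ns(sunny_cycles, ICL_MHZ)
    willow_ns = cycles_to_ns(willow_cycles, TGL_MHZ)
    print(f"{name}: {sunny_ns:.2f} ns -> {willow_ns:.2f} ns")

clock_gain_pct = (TGL_MHZ / ICL_MHZ - 1) * 100
print(f"Clock advantage: {clock_gain_pct:.0f}%")
```

This reproduces the roughly 1.3ns to 1.04ns L1 and 3.3ns to 2.9ns L2 figures. For the L3 it gives about 7.7ns versus 8.1ns, so the extra 9 cycles consume the whole 23% clock advantage.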

Finally, taking a quick glance at the single-core bandwidth figures, we’re looking at whether there have been any significant structural changes in this aspect of the design.

On the L1 side, things are a bit odd, as the figures don’t scale up as expected with the clock frequency: pure load and store bandwidth are indeed higher, but the memory copy patterns improve less than expected. In the L2 and L3 regions we can clearly see the increased depth of the caches. The L2 scales well, with a near 19% increase in bandwidth which is in line with the clock uptick.

The L3 doesn’t scale that well as memory copies between cache lines here are only 5% faster than on Ice Lake, likely due to the increased access latencies of the caches.

In the DRAM region we’re actually seeing a large change in behaviour of the new microarchitecture, with vastly improved load bandwidth from a single core, increasing from 14.8GB/s to 21GB/s. Pure store bandwidth goes down slightly from 14.8GB/s to 13.5GB/s, but that’s not that important a metric for x86, as the core first has to read out the memory before writing to it, as opposed to some of the non-temporal write optimisations we’ve seen from Arm processors.

More importantly, memory copies between cache lines and memory read-writes within a cache line have respectively improved from 14.8GB/s and 28GB/s to 20GB/s and 34.5GB/s. That’s a 35% improvement in copy bandwidth which is quite significant.
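For reference, the relative changes behind the DRAM bandwidth figures quoted above can be worked out as follows. The values are the ones given in the text; the dictionary keys are just our shorthand labels.

```python
def percent_change(old, new):
    """Relative change from old to new, in percent."""
    return (new - old) / old * 100.0


# Single-core DRAM bandwidth in GB/s: (Ice Lake, Tiger Lake), from the text.
dram_bandwidth = {
    "load": (14.8, 21.0),
    "store": (14.8, 13.5),
    "copy between cache lines": (14.8, 20.0),
    "read-write within a cache line": (28.0, 34.5),
}

changes = {name: percent_change(old, new)
           for name, (old, new) in dram_bandwidth.items()}

for name, pct in changes.items():
    print(f"{name}: {pct:+.1f}%")
```

This gives roughly +42% for loads, -9% for stores, +35% for copies, and +23% for read-writes within a cache line.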

Overall, the new Willow Cove cores and the Tiger Lake memory subsystem seem to be something of a mixed bag. The increased cache sizes are certainly welcome for workloads that have a larger memory footprint; however, Intel’s L3 cache changes seem to have come with some larger compromises when it comes to latency. On the positive side, DRAM access latencies and bandwidth seem to have been drastically improved in the new design, and here it seems Intel made some good improvements in the fabric as well as the memory controllers of Tiger Lake.


253 Comments


  • blppt - Saturday, September 26, 2020 - link

    Sure, the box sitting right next to my desk doesn't exist. Nor the 10 or so AMD cards I've bought over the past 20 years.

    1 5970
    2 7970s (for CFX)
    1 Sapphire 290x (BF4 edition, ridiculously loud under load)
    2 XFX 290 (much better cooler than the BF4 290x) mistakenly bought when I thought it would accept a flash to 290x, got the wrong builds, for CFX)
    2 290x 8gb sapphire custom edition (for CFX, much, much quieter than the 290x)
    1 Vega 64 watercooled (actually turned out to be useful for a Hackintosh build)
    1 5700xt stock edition

    Yeah, i just made this stuff up off the top of my head. I guarantee I've had more experience with AMD videocards than the average gamer. Remember the separate CFX CAP profiles? I sure do.

    So please, tell me again how I'm only a Nvidia owner.
  • Santoval - Sunday, September 20, 2020 - link

    If the top-end Big Navi is going to be 30-40% faster than the 2080 Ti then the 3080 (and later on the 3080 Ti, which will fit between the 3080 and the 3090) will be *way* beyond it in performance, in a continuation of the status quo of the last several graphics card generations. In fact it will be even worse this generation, since Big Navi needs to be 52% faster than the 2080 Ti to even match the 3070 in FP32 performance.

    Sure, it might have double the memory of the 3070, but how much will that matter if it's going to be 15 - 20% slower than a supposed "lower grade" Nvidia card? In other words "30-40% faster than the 2080 Ti" is not enough to compete with Ampere.

    By the way, we have no idea how well Big Navi and the rest of the RDNA2 cards will perform in ray-tracing, but I am not sure how that matters to most people. *If* the top-end Big Navi has 16 GB of RAM, it costs just as much as the 3070 and is slightly (up to 5-10%) slower than it in FP32 performance but handily outperforms it in ray-tracing performance then it might be an attractive buy. But I doubt any margins will be left for AMD if they sell a 16 GB card for $500.

    If it is 15-20% slower and costs $100 more noone but those who absolutely want 16 GB of graphics RAM will buy it; and if the top-end card only has 12 GB of RAM there goes the large memory incentive as well..
  • Spunjji - Sunday, September 20, 2020 - link

    @Santoval, why are you speaking as if the 3080's performance characteristics are not already known? We have the benchmarks in now.

    More importantly, why are you making the assumption that AMD need to beat Nvidia's theoretical FP32 performance when it was always obvious (and now extremely clear) that it has very little bearing on the product's actual performance in games?

    The rest of your speculation is knocked out of what by that. The likelihood of an 80CU RDNA 2 card underperforming the 3070 is nil. The likelihood of it underperforming the 3080 (which performs like twice a 5700, non-XT) is also low.
  • Byte - Monday, September 21, 2020 - link

    Nvidia probably has a good idea how it performs with access to PS5/Xbox, they know they had to be aggressive this round with clock speeds and pricing. As we can see 3080 is almost maxed, o/c headroom like that of AMD chips, and price is reasonable decent, in line with 1080 launch prices before minepocalypse.
  • TimSyd - Saturday, September 19, 2020 - link

    Ahh don't ya just love the fresh smell of TROLL
  • evernessince - Sunday, September 20, 2020 - link

    The 5700XT is RDNA1 and it's 1/3rd the size of the 2080 Ti. 1/3rd the size and only 30% less performance. Now imagine a GPU twice the size of the 5700XT, thus having twice the performance. Now add in the node shrink and new architecture.

    I wouldn't be surprised if the 6700XT beat the 2080 Ti, let alone AMD's bigger Navi 2 GPUs.
  • Cooe - Friday, December 25, 2020 - link

    Hahahaha. "Only matching a 2080 Ti". How's it feel to be an idiot?
  • tipoo - Friday, September 18, 2020 - link

    I'd again ask you why a laptop SoC would have an answer for a big GPU. That's not what this product is.
  • dotjaz - Friday, September 18, 2020 - link

    "This Intel Tiger" doesn't need an answer for Big Navi, no laptop chip needs one at all. Big Navi is 300W+, no way it's going in a laptop.

    RDNA2+ will trickle down to mobile APU eventually, but we don't know if Van Gogh can beat TGL yet, I'm betting not because it's likely a 7-15W part with weaker Quadcore Zen2.

    Proper RDNA2+ APU won't be out until 2022/Zen4. By then Intel will have the next gen Xe.
  • Santoval - Sunday, September 20, 2020 - link

    Intel's next gen Xe (in Alder Lake) is going to be a minor upgrade to the original Xe. Not a redesign, just an optimization to target higher clocks. The optimization will largely (or only) happen at the node level, since it will be fabbed with second gen SuperFin (formerly 10nm+++), which is supposed to be (assuming no further 7nm delays) Intel's last 10nm node variant.
    How well will that work, and thus how well 2nd gen Xe will perform, will depend on how high Intel's 2nd gen SuperFin will clock. At best 150 - 200 MHz higher clocks can probably be expected.
