CPU Tests: Microbenchmarks

Core-to-Core Latency

As the core count of modern CPUs grows, we are reaching a point where the time to access one core from another is no longer a constant. Even before the advent of heterogeneous SoC designs, processors built on large rings or meshes could have different latencies between the nearest core and the furthest core. This rings true especially in multi-socket server environments.

But modern CPUs, even desktop and consumer CPUs, can have variable access latency to get to another core. For example, in the first generation Threadripper CPUs, we had four chips on the package, each with eight cores, and the core-to-core latency differed depending on whether the access was on-die or off-die. This gets more complex with products like Lakefield, which has two different communication buses depending on which core is talking to which.

If you are a regular reader of AnandTech’s CPU reviews, you will recognize our Core-to-Core latency test. It’s a great way to show exactly how groups of cores are laid out on the silicon. This is a custom in-house test built by Andrei, and we know there are competing tests out there, but we feel ours is the most accurate representation of how quickly an access between two cores can happen.

The core-to-core numbers are interesting, being worse (higher) than the previous generation across the board. Here we are seeing, mostly, 28-30 nanoseconds, compared to 18-24 nanoseconds with the 10700K. This is part of the L3 latency regression, as shown in our next tests.

One pair of threads here is very fast to access all cores, some 5 ns faster than any other, which again makes the layout more puzzling.

Update 1: With microcode 0x34, we saw no update to the core-to-core latencies.

Cache-to-DRAM Latency

This is another in-house test built by Andrei, which showcases the access latency at all the points in the cache hierarchy for a single core. We start at 2 KiB, and probe the latency all the way through to 256 MB, which for most CPUs sits inside the DRAM (before you start saying the 64-core Threadripper has 256 MB of L3, it’s only 16 MB per four-core CCX, so at 20 MB you are in DRAM).

Part of this test helps us understand the range of latencies for accessing a given level of cache, but also the transition between the cache levels gives insight into how different parts of the cache microarchitecture work, such as TLBs. As CPU microarchitects look at interesting and novel ways to design caches upon caches inside caches, this basic test proves to be very valuable.

Looking at the rough graph of the 11700K and the general boundaries of the cache hierarchies, we again see the microarchitectural changes that first debuted in Intel’s Sunny Cove cores, such as the increase of the L1D cache from 32 KB to 48 KB, as well as the doubling of the L2 cache from 256 KB to 512 KB.

The L3 cache on these parts looks to be unchanged from a capacity perspective, featuring the same 16 MB shared amongst the eight cores of the chip.

On the DRAM side of things, we’re not seeing much change, albeit there is a small 2.1 ns generational regression at the full random 128 MB measurement point. We’re using identical RAM sticks at the same timings for both measurements here.

It should be noted that these slight regressions are also found across the cache hierarchy: although the new CPU is clocked slightly higher here, it shows worse absolute latency than its predecessor. It’s also worth noting that AMD’s newest Zen 3 based designs showcase lower latency across the board.

With the new graph of the Core i7-11700K on microcode 0x34, the same cache structures are observed; however, we are seeing better performance from the L3.

The L1 cache structure is the same, and the L2 is of a similar latency. In our previous test the L3 latency was 50.9 cycles, but with the new microcode it is now at 45.1 cycles, more in line with the L3 cache on Comet Lake.

Out at DRAM, our 128 MB point reduced from 82.4 nanoseconds to 72.8 nanoseconds, a 12% reduction, not the 40% reduction that other media outlets are reporting; we feel our tools are more accurate. Similarly, for DRAM bandwidth, we are seeing a 12% memory bandwidth increase between 0x2C and 0x34, not the 50% others are claiming. (Microcode 0x1B, however, was significantly lower than this, resulting in a 50% bandwidth increase from 0x1B to 0x34.)

In the previous edition of our article, we questioned why the L3 cycle count showed a larger regression than estimated. With the updated microcode, the smaller difference is still a regression, but more in line with our expectations. We are waiting to hear back from Intel on what in the microcode caused this change.

Frequency Ramping

Both AMD and Intel have, over the past few years, introduced features to their processors that shorten the time it takes a CPU to move from idle into a high-powered state. The effect of this is that users can reach peak performance sooner, but the biggest knock-on effect is on battery life in mobile devices: if a system can turbo up and turbo down quickly, it can stay in its lowest, most efficient power state for as long as possible.

Intel’s technology is called SpeedShift, although it was not enabled until Skylake.

One of the issues with this technology, though, is that sometimes the adjustments in frequency can be so fast that software cannot detect them. If the frequency is changing on the order of microseconds, but your software is only probing frequency in milliseconds (or seconds), then quick changes will be missed. Not only that, but as an observer probing the frequency, you could be affecting the actual turbo performance. When the CPU changes frequency, it essentially has to pause all compute while it aligns the frequency of the whole core.

We wrote an extensive review analysis piece on this, called ‘Reaching for Turbo: Aligning Perception with AMD’s Frequency Metrics’, due to an issue where users were not observing the peak turbo speeds for AMD’s processors.

We got around the issue by making the frequency probing the workload causing the turbo. The software is able to detect frequency adjustments on a microsecond scale, so we can see how well a system can get to those boost frequencies. Our Frequency Ramp tool has already been in use in a number of reviews.

Our ramp test shows a jump straight from 800 MHz up to 4900 MHz in around 17 milliseconds, or about one frame at 60 Hz.

541 Comments

  • TheinsanegamerN - Friday, March 5, 2021 - link

That 14nm chip pulls over twice the power of the 7nm 16-core chip and is consistently slower than the 7nm 8-core chip.

    It's not so much "right behind them" but rather "barely keeping up while burning through a nuclear reactor's power output".
  • CiccioB - Friday, March 5, 2021 - link

    Twice the power for being slower?
    If you are referring to the 290W power consumption in the AVX-512 test, I'm sorry to inform you that a 7nm CPU with twice the cores would not reach those performance and perf/W numbers in that test.

    If you are talking about the 140-160W usage in other "normal" tests, I'm sorry to inform you that a 16-core 7nm CPU does not consume 80W.

    So stop vomiting meaningless numbers. This is a 14nm CPU and for the process it is built on, it is doing miracles. If Intel could ever use an advanced PP like TSMC's 7nm, Zen would still be the underdog.

    For the future, just hope that TSMC's 5nm is good, early, low cost and really high yielding, because if Intel comes out with a decent 7nm I think AMD will not look all that advanced (six years to surpass a six-year-old architecture, and only by using a more advanced PP that unfortunately doesn't allow for great deliveries).
  • blppt - Friday, March 5, 2021 - link

    "So stop vomiting meaningless numbers. This is a 14nm CPU and for the process it is built on, it is doing miracles. If Intel could ever use an advanced PP like TSMC's 7nm, Zen would still be the underdog."

    Not necessarily. Intel apparently is still behind in IPC/single-thread performance, as evidenced by the Cinebench results, so whilst 7nm would let it run with less power (theoretically), it would still lose to its main competitor, the 5800X.
  • CiccioB - Friday, March 5, 2021 - link

    You have missed that within a 14nm die area you cannot improve the architecture that much.
    You are still thinking that CPUs designed for 7nm, with all the advantages it would bring, would still be like Skylake, which is a six-year-old architecture.
    A PP like TSMC's 7nm would bring a completely new architecture that would blow Zen away.

    Zen is good because it is based on a much better PP than what Intel has now, but it still struggles to beat Skylake. And to do that, by using such an advanced but production-limited PP, it has sacrificed high stock availability right in the period where demand is much higher than supply.
    Intel can fill the remaining market with whatever it has, be it the 9xxx, 10xxx or now 11xxx generations.
  • barich - Friday, March 5, 2021 - link

    Yes, Intel probably would beat AMD with an imaginary all-new architecture on TSMC's 7nm process. Similarly, I would be a hell of a basketball player if I were a foot taller and had any motor skills.

    Here in reality, Intel has worse performance and worse efficiency. As a consumer, that's what matters to me. What Intel could do with a bunch of "ifs" is irrelevant. I haven't owned an AMD CPU since my Athlon 64 was replaced by a Core 2 Duo. But there's no way my new build this year isn't going to be AMD.
  • CiccioB - Saturday, March 6, 2021 - link

    Yes, what counts is the results, you are right.
    But by that token I can't call it a miracle when I see Zen 3's results: with a much more advanced PP it can only beat Skylake by a few percent, and the only real advantage it has is lower power consumption, due to the much better PP versus this six-year-old one.

    If you look at the real power consumption, that is, not the AVX-512 tests where RKL disintegrates Zen on perf/W despite the high power requirements, you'll see that this chip is not that power hungry (though more power hungry than Zen). That makes me think that with a better PP this same architecture would be another thing completely, as are the 10nm Tiger Lake parts, which however suffer from not-so-good power consumption at the higher frequencies required by desktop SKUs.

    As we are not that distant from finally having something decent that is not the 14nm PP, I will really not call my thoughts "imaginary". AMD will not be able to move to 5nm so soon, and seeing what this architecture can do despite the 14nm PP, I think the future is going to be more interesting than what you hope it to be (that is, AMD keeps figuring it has better CPUs while not having them on the shelves, but what counts for you are... yes, the results... and for those Intel is outselling AMD 5:1).
  • blppt - Saturday, March 6, 2021 - link

    "A PP like TSMC 7nm would bring a completely new architecture that would blow Zen away."

    Based on what, exactly? We've seen die shrinks before without amazing architectural advances from both Intel and AMD. You have an awful lot of confidence in something that doesn't exist.
  • CiccioB - Saturday, March 6, 2021 - link

    Based on the fact that with a 7nm PP, AMD still struggles to beat the Skylake architecture, which is six years old and was born on... oh yes, 14nm.
  • schujj07 - Saturday, March 6, 2021 - link

    Zen2 was already faster than Skylake and its derivatives clock for clock by about 7%. While Comet Lake had higher single-threaded performance than Zen2, it did so by throwing efficiency and power draw out the window and going for absolute performance. That made it such that Comet Lake could compete in ST applications, but it still lost in MT applications against AMD CPUs with the same thread count. Going for absolute performance has been a double-edged sword for Intel, as the newer architectures haven't been able to clock as high. Despite the higher IPC of the newer architectures, absolute performance was no better than a wash due to 20% lower clock speeds.

    Zen3 now has absolute performance dominance over any Skylake architecture CPU. It doesn't "just" beat the older CPU, as in like 2% faster. It is upwards of 20% faster clock for clock and 10%+ faster in absolute ST performance.
  • blppt - Saturday, March 6, 2021 - link

    "Based on the fact that with a 7nm PP, AMD still struggles to beat the Skylake architecture, which is six years old and was born on... oh yes, 14nm."

    Struggles? The slightly older AMD chip (5800X) beats the newest and greatest out of Intel, whilst consuming less power, AND hitting lower peak turbo speeds.

    That is complete domination. You could make the same argument about how Intel hadn't even made any significant gains over Sandy Bridge until Skylake, and that was *2* die shrinks.
