CPU Tests: Microbenchmarks

Core-to-Core Latency

As the core count of modern CPUs is growing, we are reaching a time when the time to access each core from a different core is no longer a constant. Even before the advent of heterogeneous SoC designs, processors built on large rings or meshes can have different latencies to access the nearest core compared to the furthest core. This rings true especially in multi-socket server environments.

But modern CPUs, even desktop and consumer CPUs, can have variable access latency to get to another core. For example, in the first generation Threadripper CPUs, we had four chips on the package, each with 8 threads, and each with a different core-to-core latency depending on if it was on-die or off-die. This gets more complex with products like Lakefield, which has two different communication buses depending on which core is talking to which.

If you are a regular reader of AnandTech’s CPU reviews, you will recognize our Core-to-Core latency test. It’s a great way to show exactly how groups of cores are laid out on the silicon. This is a custom in-house test built by Andrei, and we know there are competing tests out there, but we feel ours is the most accurate to how quick an access between two cores can happen.

The core-to-core numbers are interesting, being worse (higher) than the previous generation across the board. Here we are seeing, mostly, 28-30 nanoseconds, compared to 18-24 nanoseconds with the 10700K. This is part of the L3 latency regression, as shown in our next tests.

One pair of threads here are very fast to access all cores, some 5 ns faster than any others, which again makes the layout more puzzling. 

Update 1: With microcode 0x34, we saw no update to the core-to-core latencies.

Cache-to-DRAM Latency

This is another in-house test built by Andrei, which showcases the access latency at all the points in the cache hierarchy for a single core. We start at 2 KiB, and probe the latency all the way through to 256 MB, which for most CPUs sits inside the DRAM (before you start saying 64-core TR has 256 MB of L3, it’s only 16 MB per core, so at 20 MB you are in DRAM).

Part of this test helps us understand the range of latencies for accessing a given level of cache, but also the transition between the cache levels gives insight into how different parts of the cache microarchitecture work, such as TLBs. As CPU microarchitects look at interesting and novel ways to design caches upon caches inside caches, this basic test proves to be very valuable.

Looking at the rough graph of the 11700K and the general boundaries of the cache hierarchies, we again see the changes of the microarchitecture that had first debuted in Intel’s Sunny Cove cores, such as the move from an L1D cache from 32KB to 48KB, as well as the doubling of the L2 cache from 256KB to 512KB.

The L3 cache on these parts look to be unchanged from a capacity perspective, featuring the same 16MB which is shared amongst the 8 cores of the chip.

On the DRAM side of things, we’re not seeing much change, albeit there is a small 2.1ns generational regression at the full random 128MB measurement point. We’re using identical RAM sticks at the same timings between the measurements here.

It’s to be noted that these slight regressions are also found across the cache hierarchies, with the new CPU, although it’s clocked slightly higher here, shows worse absolute latency than its predecessor, it’s also to be noted that AMD’s newest Zen3 based designs showcase also lower latency across the board.

With the new graph of the Core i7-11700K with microcode 0x34, the same cache structures are observed, however we are seeing better performance with L3.

The L1 cache structure is the same, and the L2 is of a similar latency. In our previous test, the L3 latency was 50.9 cycles, but with the new microcode is now at 45.1 cycles, and is now more in line with the L3 cache on Comet Lake.

Out at DRAM, our 128 MB point reduced from 82.4 nanoseconds to 72.8 nanoseconds, which is a 12% reduction, but not the +40% reduction that other media outlets are reporting as we feel our tools are more accurate. Similarly, for DRAM bandwidth, we are seeing a +12% memory bandwidth increase between 0x2C and 0x34, not the +50% bandwidth others are claiming. (BIOS 0x1B however, was significantly lower than this, resulting in a +50% bandwidth increase from 0x1B to 0x34.)

In the previous edition of our article, we questioned the previous L3 cycle being a larger than estimated regression. With the updated microcode, the smaller difference is still a regression, but more in line with our expectations. We are waiting to hear back from Intel what differences in the microcode encouraged this change.

Frequency Ramping

Both AMD and Intel over the past few years have introduced features to their processors that speed up the time from when a CPU moves from idle into a high powered state. The effect of this means that users can get peak performance quicker, but the biggest knock-on effect for this is with battery life in mobile devices, especially if a system can turbo up quick and turbo down quick, ensuring that it stays in the lowest and most efficient power state for as long as possible.

Intel’s technology is called SpeedShift, although SpeedShift was not enabled until Skylake.

One of the issues though with this technology is that sometimes the adjustments in frequency can be so fast, software cannot detect them. If the frequency is changing on the order of microseconds, but your software is only probing frequency in milliseconds (or seconds), then quick changes will be missed. Not only that, as an observer probing the frequency, you could be affecting the actual turbo performance. When the CPU is changing frequency, it essentially has to pause all compute while it aligns the frequency rate of the whole core.

We wrote an extensive review analysis piece on this, called ‘Reaching for Turbo: Aligning Perception with AMD’s Frequency Metrics’, due to an issue where users were not observing the peak turbo speeds for AMD’s processors.

We got around the issue by making the frequency probing the workload causing the turbo. The software is able to detect frequency adjustments on a microsecond scale, so we can see how well a system can get to those boost frequencies. Our Frequency Ramp tool has already been in use in a number of reviews.

Our ramp test shows a jump straight from 800 MHz up to 4900 MHz in around 17 milliseconds, or a frame at 60 Hz. 

Power Consumption: Hot Hot HOT CPU Tests: Office and Science
Comments Locked

541 Comments

View All Comments

  • zzzxtreme - Sunday, March 7, 2021 - link

    I wished you would have tested the XE graphics
  • Fman4 - Monday, March 8, 2021 - link

    Am I the only one find that OP plugged 4 RAMs on an X570 ITX motherboard?
  • Fman4 - Monday, March 8, 2021 - link

    @Dr. Ian Cutress
  • zodiacfml - Monday, March 8, 2021 - link

    bored. just here to say this is unsurprising though this strongly reminds me of the time where AMD is releasing new, well designed CPUs but two process node generations behind intel. I think AMD was 32nm and 28nm while Intel is 22 and 14nm. most comments were really harsh with AMD but I reasoned that it is simply due to the manufacturing superiority of Intel
  • blppt - Monday, March 8, 2021 - link

    Bulldozer and Piledriver are not the examples I would put up for "well designed".
  • GeoffreyA - Tuesday, March 9, 2021 - link

    Still, within that mess, AMD did a pretty good job raising Bulldozer's IPC and cutting down its power each generation. But the foundation being fatally flawed, it was hopeless. I believe it taught them a lot about cutting power and so on, and when they poured that into Zen, we saw the result. Bulldozer was a fantastic training ground, if one looks at it humorously.
  • Oxford Guy - Tuesday, March 9, 2021 - link

    No, AMD did an extremely poor job.

    Firstly, Bulldozer had worse IPC than Phenom. No engineers with brains release a CPU to replace the entire line while giving it worse IPC. The trap of going for high clocks was a lesson shown to the entire industry via Netburst. AMD's engineers knew all about it, yet someone at the company decided to try Netburst 2.0.

    Secondly, AMD was so sloppy and lazy that Piledriver shipped with a performance regression in AVX. It was worse to use AVX than to not use it. How incredibly incompetent can the company have been? It doesn't take a high IQ to understand that one doesn't ship broken AVX.

    AMD then refused to replace Piledriver until Zen came out. It tinkered half-heartedly with APU rubbish and focused on pushing junk like Jaguar.

    While it's true that the extreme failure of AMD (the construction core line) is due, to a large degree, to Intel abusing its monopoly to starve AMD of customers and cash — cash it needed to do R&D, one does not release a new chip with worse IPC and then very shortly after break AVX and refuse to stop feeding that junk to customers for many years. Just tinkering with Phenom would have been better (Phenom 3).

    As for the foundation claim... we have no idea how well the CMT concept could have worked out with competent engineering. Remember, they literally broke AVX in the Piledriver revision that was supposed to fix Bulldozer enough to make it sellable. Operations caching could have been stronger. The L3 cache was almost as slow as main memory. The RAM controller was weak, just like Phenom's. Etc.

    We paid for Intel's monopoly and we're still paying today. Only its monopoly and the lack of adequate competition is enabling the company to be so profitable despite failing so badly. Relying on two companies (or one 1/2, when it comes to R&D money ratio and other factors) to deliver adequate competition doesn't work.

    Google and Microsoft = Google owns the clearnet. Apparently, they have some sort of cooperation agreement which helps to explain why Bing has such a tiny index and such a poor-quality search.

    TSMC and Samsung = Can't meet demand.

    AMD and Nvidia = Nvidia keeps breaking profit records while utterly failing to meet demand. Both companies refuse to stop making their cards attractive for mining and have for a long long time. AMD refused to adequately compete beyond the lower midrange (Polaris forever, or you can buy a 'console'!) for a long time, leaving us to pay through the nose for Nvidia's prices. AMD literally competes against the PC market by pushing the console scam. Consoles are gaming PCs in disguise and they're parasitic in multiple ways, including in terms of wafer allocations. AMD's many many years of refusal to compete with Nvidia beyond the Polaris price point caused so much pent-up demand and now the company can enjoy the artificially high price points from that. It let Nvidia keep raising prices to get consumers used to that. Now that it has finally been forced to improve the 'consoles' beyond the garbage-tier Jaguar CPU it has to offer a bit more value to the PC gaming market. And so, after all these years, we have something decent that one can't buy. I can go on about this so-called competition but why bother. People will go to the most extravagant lengths to excuse the problem of lack of adequate competition — like the person who recently said it's easier to create Google's empire from scratch than it is to make a competitive GPU and sell it as a third GPU company.

    There are plenty of other areas in tech with inadequate competition, too.
  • blppt - Tuesday, March 9, 2021 - link

    "AMD then refused to replace Piledriver until Zen came out. It tinkered half-heartedly with APU rubbish and focused on pushing junk like Jaguar."

    To be fair, AMD had put a LOT of time, money and effort into Bulldozer/Piledriver, and were never a company with bottomless wells of cash to toss an architecture out immediately. Plus, Zen took a long time to design and finalize---thankfully, they made literally ALL the right moves in designing it, including hiring the brilliant Jim Keller.

    I think if Zen had been another BD like failure, that would have been the almost the end of AMD in the cpu market (leaving them basically as ATI was) The consoles likely would have gone with Intel or ARM for their next iteration. AMD once again spent tons of money that they don't have as disposable income in designing Zen. Two failures in a row would have been disastrous.

    Heck, the consoles might go with their own custom ARM design for PS6/Xbox(whatever) anyways.
  • GeoffreyA - Wednesday, March 10, 2021 - link

    blppt. Agreed, that would have been the end of AMD.
  • Oxford Guy - Wednesday, March 10, 2021 - link

    AMD did not put a lot of resources into fixing Bulldozer.

    It shipped Piledriver with broken AVX and never bothered to replace Piledriver on the desktop until Zen.

    Inexcusable. It shipped Steamroller and Excavator in cost-cut mode, cutting cores, cutting clocks, cutting the socket standards, and cutting cache. It used a dense library to save money by keeping the die small and used the inferior 28nm bulk process.

    Pathetic in basically every respect.

Log in

Don't have an account? Sign up now