CPU Tests: Microbenchmarks

Core-to-Core Latency

As the core count of modern CPUs is growing, we are reaching a time when the time to access each core from a different core is no longer a constant. Even before the advent of heterogeneous SoC designs, processors built on large rings or meshes can have different latencies to access the nearest core compared to the furthest core. This rings true especially in multi-socket server environments.

But modern CPUs, even desktop and consumer CPUs, can have variable access latency to get to another core. For example, in the first generation Threadripper CPUs, we had four chips on the package, each with 8 threads, and each with a different core-to-core latency depending on if it was on-die or off-die. This gets more complex with products like Lakefield, which has two different communication buses depending on which core is talking to which.

If you are a regular reader of AnandTech’s CPU reviews, you will recognize our Core-to-Core latency test. It’s a great way to show exactly how groups of cores are laid out on the silicon. This is a custom in-house test built by Andrei, and we know there are competing tests out there, but we feel ours is the most accurate to how quick an access between two cores can happen.

Broadwell is a familiar design, with all four cores connected in a ring-bus topology.

Cache-to-DRAM Latency

This is another in-house test built by Andrei, which showcases the access latency at all the points in the cache hierarchy for a single core. We start at 2 KiB, and probe the latency all the way through to 256 MB, which for most CPUs sits inside the DRAM (before you start saying 64-core TR has 256 MB of L3, it’s only 16 MB per core, so at 20 MB you are in DRAM).

Part of this test helps us understand the range of latencies for accessing a given level of cache, but also the transition between the cache levels gives insight into how different parts of the cache microarchitecture work, such as TLBs. As CPU microarchitects look at interesting and novel ways to design caches upon caches inside caches, this basic test proves to be very valuable.

Our data shows a 4-cycle L1, a 12-cycle L2, a 26-50 cycle L3, while the eDRAM has a wide range from 50-150 cycles. This is still quicker than main memory, which goes to 200+ cycles.

Frequency Ramping

Both AMD and Intel over the past few years have introduced features to their processors that speed up the time from when a CPU moves from idle into a high powered state. The effect of this means that users can get peak performance quicker, but the biggest knock-on effect for this is with battery life in mobile devices, especially if a system can turbo up quick and turbo down quick, ensuring that it stays in the lowest and most efficient power state for as long as possible.

Intel’s technology is called SpeedShift, although SpeedShift was not enabled until Skylake.

One of the issues though with this technology is that sometimes the adjustments in frequency can be so fast, software cannot detect them. If the frequency is changing on the order of microseconds, but your software is only probing frequency in milliseconds (or seconds), then quick changes will be missed. Not only that, as an observer probing the frequency, you could be affecting the actual turbo performance. When the CPU is changing frequency, it essentially has to pause all compute while it aligns the frequency rate of the whole core.

We wrote an extensive review analysis piece on this, called ‘Reaching for Turbo: Aligning Perception with AMD’s Frequency Metrics’, due to an issue where users were not observing the peak turbo speeds for AMD’s processors.

We got around the issue by making the frequency probing the workload causing the turbo. The software is able to detect frequency adjustments on a microsecond scale, so we can see how well a system can get to those boost frequencies. Our Frequency Ramp tool has already been in use in a number of reviews.

From an idle frequency of 800 MHz, It takes ~32 ms for Intel to boost to 2.0 GHz, then another ~32 ms to get to 3.7 GHz. We’re essentially looking at 4 frames at 60 Hz to hit those high frequencies.

A y-Cruncher Sprint

The y-cruncher website has a large about of benchmark data showing how different CPUs perform to calculate specific values of pi. Below these there are a few CPUs where it shows the time to compute moving from 25 million digits to 50 million, 100 million, 250 million, and all the way up to 10 billion, to showcase how the performance scales with digits (assuming everything is in memory). This range of results, from 25 million to 250 billion, is something I’ve dubbed a ‘sprint’.

I have written some code in order to perform a sprint on every CPU we test. It detects the DRAM, works out the biggest value that can be calculated with that amount of memory, and works up from 25 million digits. For the tests that go up to the ~25 billion digits, it only adds an extra 15 minutes to the suite for an 8-core Ryzen CPU.

With this test, we can see the effect of increasing memory requirements on the workload and the scaling factor for a workload such as this.

  • MT   25m: 1.617s
  • MT   50m: 3.639s
  • MT  100m: 8.156s
  • MT  250m: 24.050s
  • MT  500m: 53.525s
  • MT 1000m: 118.651s
  • MT 2500m: 341.330s

The scaling here isn’t linear – moving from 25m to 2.5b, we should see a 100x time increase, but instead it is 211x.

CPU Tests: SPEC Gaming Tests: Chernobylite
Comments Locked

120 Comments

View All Comments

  • krowes - Monday, November 2, 2020 - link

    CL22 memory for the Ryzen setup? Makes absolutely no sense.
  • Ian Cutress - Tuesday, November 3, 2020 - link

    That's JEDEC standard.
  • Khenglish - Monday, November 2, 2020 - link

    Was anyone else bothered by the fact that Intel's highest performing single thread CPU is the 1185G7, which is only accessible in 28W tiny BGA laptops?

    Also the 128mb edram cache does seem to make on average a 10% improvement over the edramless 4790S at the same TDP. I would love to see edram on more cpus. It's so rare to need more than 8 cores. I'd rather have 8 cores with edram than 16+ cores and no edram.
  • ichaya - Monday, November 2, 2020 - link

    There's definitely a cost trade-off involved, but with an I/O die since Zen 2, it seems like AMD could just spin up a different I/O die, and justify the cost easily by selling to HEDT/Workstation/DC.
  • Notmyusualid - Wednesday, November 4, 2020 - link

    Chalk me up as 'bothered'.
  • zodiacfml - Monday, November 2, 2020 - link

    Yeah but Intel is about squeezing the last dollar in its products for a couple of years now.
  • Endymio - Monday, November 2, 2020 - link

    CPU register-> 3 levels of cache -> eDRAM -> DRAM -> Optane -> SSD -> Hard Drive.

    The human brain gets by with 2 levels of storage. I really don't feel that computers should require 9. The entire approach needs rethinking.
  • Tomatotech - Tuesday, November 3, 2020 - link

    You remember everything without writing down anything? You remarkable person.

    The rest of us rely on written materials, textbooks, reference libraries, wikipedia, and the internet to remember stuff. If you jot down all the levels of hierarchical storage available to the average degree-educated person, it's probably somewhere around 9 too depending on how you count it.

    Not everything you need to find out is on the internet or in books either. Data storage and retrieval also includes things like having to ask your brother for Aunt Jenny's number so you can ring Aunt Jenny and ask her some detail about early family life, and of course Aunt Jenny will tell you to go and ring Uncle Jonny, but she doesn't have Jonny's number, wait a moment while she asks Max for it and so on.
  • eastcoast_pete - Tuesday, November 3, 2020 - link

    You realize that the closer the cache is to actual processor speed, the more demanding the manufacturing gets and the more die area it eats. That's why there aren't any (consumer) CPUs with 1 or more MB of L1 Cache. Also, as Tomatotech wrote, we humans use mnemonic assists all the time, so the analogy short-term/long-term memory is incomplete. Writing and even drawing was invented to allow for longer-term storage and easier distribution of information. Lastly, at least IMO, it boils down to cost vs. benefit/performance as to how many levels of memory storage are best, and depends on the usage scenario.
  • Oxford Guy - Monday, November 2, 2020 - link

    Peter Bright of Ars in 2015:

    "Intel’s Skylake lineup is robbing us of the performance king we deserve. The one Skylake processor I want is the one that Intel isn't selling.

    in games the performance was remarkable. The 65W 3.3-3.7GHz i7-5775C beat the 91W 4-4.2GHz Skylake i7-6700K. The Skylake processor has a higher clock speed, it has a higher power budget, and its improved core means that it executes more instructions per cycle, but that enormous L4 cache meant that the Broadwell could offset its disadvantages and then some. In CPU-bound games such as Project Cars and Civilization: Beyond Earth, the older chip managed to pull ahead of its newer successor.

    in memory-intensive workloads, such as some games and scientific applications, the cache is better than 21 percent more clock speed and 40 percent more power. That's the kind of gain that doesn't come along very often in our dismal post-Moore's law world.

    Those 5775C results tantalized us with the prospect of a comparable Skylake part. Pair that ginormous cache with Intel's latest-and-greatest core and raise the speed limit on the clock speed by giving it a 90-odd W power envelope, and one can't help but imagine that the result would be a fine processor for gaming and workstations alike. But imagine is all we can do because Intel isn't releasing such a chip. There won't be socketed, desktop-oriented eDRAM parts because, well, who knows why.

    Intel could have had a Skylake processor that was exciting to gamers and anyone else with performance-critical workloads. For the right task, that extra memory can do the work of a 20 percent overclock, without running anything out of spec. It would have been the must-have part for enthusiasts everywhere. And I'm tremendously disappointed that the company isn't going to make it."

    In addition to Bright's comments I remember Anandtech's article that showed the 5675C beating or equalling the 5775C in one or more gaming tests, apparently largely due to the throttling due to Intel's decision to hobble Broadwell with such a low TDP.

Log in

Don't have an account? Sign up now