A Broadwell Retrospective Review in 2020: Is eDRAM Still Worth It?

Name: A Broadwell Retrospective Review in 2020: Is eDRAM Still Worth It?
Item: A Broadwell Retrospective Review in 2020: Is eDRAM Still Worth It?
Author: Dr. Ian Cutress

by Dr. Ian Cutress on November 2, 2020 11:00 AM EST

120 Comments | Add A Comment

120 Comments

CPU Tests: Microbenchmarks

Core-to-Core Latency

As the core count of modern CPUs is growing, we are reaching a time when the time to access each core from a different core is no longer a constant. Even before the advent of heterogeneous SoC designs, processors built on large rings or meshes can have different latencies to access the nearest core compared to the furthest core. This rings true especially in multi-socket server environments.

But modern CPUs, even desktop and consumer CPUs, can have variable access latency to get to another core. For example, in the first generation Threadripper CPUs, we had four chips on the package, each with 8 threads, and each with a different core-to-core latency depending on if it was on-die or off-die. This gets more complex with products like Lakefield, which has two different communication buses depending on which core is talking to which.

If you are a regular reader of AnandTech’s CPU reviews, you will recognize our Core-to-Core latency test. It’s a great way to show exactly how groups of cores are laid out on the silicon. This is a custom in-house test built by Andrei, and we know there are competing tests out there, but we feel ours is the most accurate to how quick an access between two cores can happen.

Broadwell is a familiar design, with all four cores connected in a ring-bus topology.

Cache-to-DRAM Latency

This is another in-house test built by Andrei, which showcases the access latency at all the points in the cache hierarchy for a single core. We start at 2 KiB, and probe the latency all the way through to 256 MB, which for most CPUs sits inside the DRAM (before you start saying 64-core TR has 256 MB of L3, it’s only 16 MB per core, so at 20 MB you are in DRAM).

Part of this test helps us understand the range of latencies for accessing a given level of cache, but also the transition between the cache levels gives insight into how different parts of the cache microarchitecture work, such as TLBs. As CPU microarchitects look at interesting and novel ways to design caches upon caches inside caches, this basic test proves to be very valuable.

Our data shows a 4-cycle L1, a 12-cycle L2, a 26-50 cycle L3, while the eDRAM has a wide range from 50-150 cycles. This is still quicker than main memory, which goes to 200+ cycles.

Frequency Ramping

Both AMD and Intel over the past few years have introduced features to their processors that speed up the time from when a CPU moves from idle into a high powered state. The effect of this means that users can get peak performance quicker, but the biggest knock-on effect for this is with battery life in mobile devices, especially if a system can turbo up quick and turbo down quick, ensuring that it stays in the lowest and most efficient power state for as long as possible.

Intel’s technology is called SpeedShift, although SpeedShift was not enabled until Skylake.

One of the issues though with this technology is that sometimes the adjustments in frequency can be so fast, software cannot detect them. If the frequency is changing on the order of microseconds, but your software is only probing frequency in milliseconds (or seconds), then quick changes will be missed. Not only that, as an observer probing the frequency, you could be affecting the actual turbo performance. When the CPU is changing frequency, it essentially has to pause all compute while it aligns the frequency rate of the whole core.

We wrote an extensive review analysis piece on this, called ‘Reaching for Turbo: Aligning Perception with AMD’s Frequency Metrics’, due to an issue where users were not observing the peak turbo speeds for AMD’s processors.

We got around the issue by making the frequency probing the workload causing the turbo. The software is able to detect frequency adjustments on a microsecond scale, so we can see how well a system can get to those boost frequencies. Our Frequency Ramp tool has already been in use in a number of reviews.

From an idle frequency of 800 MHz, It takes ~32 ms for Intel to boost to 2.0 GHz, then another ~32 ms to get to 3.7 GHz. We’re essentially looking at 4 frames at 60 Hz to hit those high frequencies.

A y-Cruncher Sprint

The y-cruncher website has a large about of benchmark data showing how different CPUs perform to calculate specific values of pi. Below these there are a few CPUs where it shows the time to compute moving from 25 million digits to 50 million, 100 million, 250 million, and all the way up to 10 billion, to showcase how the performance scales with digits (assuming everything is in memory). This range of results, from 25 million to 250 billion, is something I’ve dubbed a ‘sprint’.

I have written some code in order to perform a sprint on every CPU we test. It detects the DRAM, works out the biggest value that can be calculated with that amount of memory, and works up from 25 million digits. For the tests that go up to the ~25 billion digits, it only adds an extra 15 minutes to the suite for an 8-core Ryzen CPU.

With this test, we can see the effect of increasing memory requirements on the workload and the scaling factor for a workload such as this.

MT 25m: 1.617s
MT 50m: 3.639s
MT 100m: 8.156s
MT 250m: 24.050s
MT 500m: 53.525s
MT 1000m: 118.651s
MT 2500m: 341.330s

The scaling here isn’t linear – moving from 25m to 2.5b, we should see a 100x time increase, but instead it is 211x.

CPU Tests: SPEC Gaming Tests: Chernobylite

PRINT THIS ARTICLE

Post Your Comment
Please log in or sign up to comment.

Comments Locked

120 Comments

View All Comments

dsplover - Tuesday, November 3, 2020 - link
For Digital s Audio applications the i7-5775C @ 3.3GHz was incredible when disabling the Iris GFX turning the cache over to audio, then running s discrete GFX card.

Bested my i7 4790k’s.
Tried OC’ing but even with the kick but Supermicro H70 it was unstable as the Ring Bus/L4 would also clock up and choked @ 2050MHz.

This rig allowed really tight low latency timings and I prayed they would release future designs with a larger cache.
AMD beat them to to it w/Matisse which was good for 8 core only.

The new 5000s are going to be Digital Audio dreams @ low wattage.

Intel just keeps lagging behind.
ironicom - Tuesday, November 3, 2020 - link
fps is irrelevant in civ; turn time and load time are what matter.
vorsgren - Tuesday, November 3, 2020 - link
Thanks for using my benchmark! Hope it was usefull!
Nictron - Wednesday, November 4, 2020 - link
Which benchmark was that?
erotomania - Wednesday, November 4, 2020 - link
Google the username.
vorsgren - Wednesday, November 4, 2020 - link
http://www.bay12forums.com/smf/index.php?topic=173...
Oxford Guy - Thursday, November 5, 2020 - link
"The Intel skew on this site is getting silly its becoming an Intel promo machine!"

Yes. An article that exposes how much Intel was able to get away with sandbagging because of our tech world's lack of adequate competition (seen in MANY tech areas to the point where it's more the norm than the exception) — clearly such an article is showing Intel in a good light.

If you were an Intel shareholder.

For everyone else (the majority of the readers), the article condemns Intel for intentionally hobbing Skylake's gaming performance. ArsTechnica produced an article about this five years ago when it became clear that Skylake wasn't going to have EDRAM.

The ridiculousness of the situation (how Intel got away with charging premium prices for horribly hobbled parts — $10 worth of EDRAM missing, no less) really shows the world's economic system particularly poorly. For all the alleged capitalism in tech, there certainly isn't much competition. That's why Intel didn't have to ship Skylake with EDRAM. Monopolization (and near-monopoly) enables companies to do what they want to do more than anything else: sell less for more. As long as regulators are toothless and/or incompetent the situation won't improve much.
erikvanvelzen - Saturday, November 7, 2020 - link
Ever since the Pentium 4 Extreme Edition I've wondered why intel does not permanently offer a top product with a large L3 or L4 cache.
abufrejoval - Monday, November 9, 2020 - link
Just picked up a NUC8i7BEH last week (quad i7, 48EU GT3e with 128MB eDRAM), because they dropped below €300 including VAT: A pretty incredible value at that price point and extremely compatible with just about any software you can throw at it.

Yes, Tiger Lake NUC11 would be better on paper and I have tried getting a Ryzen 7-4800U (as PN50-BBR748MD), but I've never heard of one actually shipped.

It's my second NUC8i7BEH, I had gotten another a month or two previously, while it was still at €450, but decided to swap that against a hexa-core NUC10i7FNH (24EU no eDRAM) at the same price, before the 14-days zero-cost return period was up. GT3e+quad-core vs. GT2+hexa-core was a tough call to make, but acutally both run really mostly server loads anyway. But at €300/quad vs €450/hexa the GT3e is quite simply for free, when the silicon die area for the GT3e/quad is in all likelyhood much greater than for the GT2/hexa, even without counting the eDRAM.

My Whiskey-lake has 200MHz less top clock than the Comet-lake, but that doesn't show in single core results, where the L4 seems to put Whiskey consistently into a small lead.

GT3e doesn't quite manage to double graphics performance over GT2, but I am not planning to use either for gaming. Both do fairly well at 4k on anything 2D, even Google Map's 3D renders do pretty well.

BTW: While Google Earth Pro's Flight simulator actually gives a fairly accurate representation of the area where I live, it doesn't do great on FPS, even with an Nvidia GPU. By contrast Microsoft latest and greatest is a huge disappointment when it comes to terrain accuracy (buildings are pure fantasy, not related at all to what's actually there), but delivers ok FPS on my RTX2080ti. No, I didn't try FlightSim on the NUCs...

However, the 3D rendering pipeline Google has put into the browser variant of Google Maps, beats the socks off both Google Earth Pro and Microsoft Flight: With Chrome leading over Firefox significantly, the 3D modelled environment is mind-boggling even on the GT2 at 4k resolutions, it's buttery smooth on GT3e. A browser based flight simulator might actually give the best experience overall, quite hard to believe in a way.

It has me appreciate how good even iGPU graphics could be, if code was properly tuned to make do with what's there.

And it exposes just how bad Microsoft Flight is with nothing but Bing map data unterneath: Those €120 were a full waste of money, but I just saved those from buying the second NUC8 later.
mrtunakarya - Wednesday, December 9, 2020 - link
<a href="https://www.mrtunakarya.com/?m=1">Nice<...

A Broadwell Retrospective Review in 2020: Is eDRAM Still Worth It?

CPU Tests: Microbenchmarks

Core-to-Core Latency

Cache-to-DRAM Latency

Frequency Ramping

A y-Cruncher Sprint

Post Your Comment

120 Comments

View All Comments

dsplover - Tuesday, November 3, 2020 - link

ironicom - Tuesday, November 3, 2020 - link

vorsgren - Tuesday, November 3, 2020 - link

Nictron - Wednesday, November 4, 2020 - link

erotomania - Wednesday, November 4, 2020 - link

vorsgren - Wednesday, November 4, 2020 - link

Oxford Guy - Thursday, November 5, 2020 - link

erikvanvelzen - Saturday, November 7, 2020 - link

abufrejoval - Monday, November 9, 2020 - link

mrtunakarya - Wednesday, December 9, 2020 - link

Log in

Don't have an account? Sign up now