Frequency: Going Above 5.0 GHz

One of the major highlights that AMD is promoting with the new Zen 3 core and Ryzen 5000 processors is how the company has kept the same power envelope and yet delivered more frequency, more performance per MHz, and ultimately more performance, despite using the same TSMC N7 manufacturing process node. The updated efficiency of the core, assuming the design can scale in frequency and voltage, naturally leads to those higher frequency numbers. One of AMD’s hurdles in competing against Intel of late has been, IPC differences aside, the higher frequency of Intel’s 14nm process. With Zen 3, we are seeing AMD drive those higher numbers – and some numbers higher than on the box.

When AMD announced the top 16-core processor, the Ryzen 9 5950X, it gave a base frequency of 3400 MHz and a turbo frequency of 4900 MHz. That turbo value sits tantalizingly close to the ‘magic’ number of 5000 MHz, which would have given AMD an additional angle in its marketing strategy and promotional toolkit. Ultimately, shipping a 5000 MHz version comes down to binning – AMD has detailed analysis of the chiplets it makes at TSMC, and it can see how many chiplets hit that mark. The question then becomes whether there would be enough to satisfy demand, or whether those chiplets are better suited to higher-efficiency future EPYC products, where the margins are higher.

We have seen what happens when a processor launches that can’t be built in the numbers required: Intel’s Core i9-10900K, with its 5.3 GHz turbo, was a super high frequency part that couldn’t be produced in sufficient volume to meet demand, so Intel launched the Core i9-10850K – an identical chip except down to 5.1 GHz, which was an easier target to meet.

If you’ve read this far in the review, you have already seen that we are quoting the Ryzen 9 5950X going above 5.0 GHz. Despite having an official single-core turbo of 4.9 GHz, the processor carries an internal ‘peak’ frequency metric of 5025 MHz, achievable assuming there is sufficient thermal and power headroom – in effect, this should be its official turbo value. In combination with the default precision boost behavior, we saw a very regular and sustained 5050 MHz.

We quizzed AMD on this. We were told that the 4.9 GHz single-core turbo value is meant to cover all situations, across BIOS versions, motherboards, and the quality of the silicon inside. The company is happy to let the base precision boost algorithms (into which eXtreme Frequency Range/XFR was rolled) enable something higher than 4.9 GHz where possible, and it confirmed that with a standard high-end AM4 build and this processor, 5025/5050 MHz should be easily achievable with a large proportion of 5950X retail hardware.

So Why Does AMD Not Promote 5.0 GHz?

From the standpoint of ‘I’ve dealt with press relations from these companies for over 10 years’, I suspect the real answer for AMD not promoting 5.0 GHz is more about sculpting the holistic view of Zen 3 and Ryzen 5000.

If the company were to promote the Ryzen 9 5950X as AMD’s second ever processor to go above 5.0 GHz (the first was the FX-9590 back in 2013), or as reaching 5.0 GHz on 7nm, that achievement would necessarily overshadow all of AMD’s other achievements in Zen 3. Rather than pointing to the new core, the increased IPC, or the efficiency of the new processor, everyone would point to the 5.0 GHz figure instead. Promoting that value would effectively mask the ability of AMD (and the press) to discuss the other major wins – the 5.0 GHz headline would become a poisoned chalice. Not only that, but it might spur users to purchase these parts at a higher rate; that sounds like a win from both a revenue and gross margin perspective, but it ties back into whether AMD can produce enough chiplets at this frequency, or whether it would rather use them in other, higher-margin products.

Of course, some of this is vanity. AMD would rather speak to its engineering expertise and successes, its teams of engineers, and dive into the specific performance wins, especially for a product where the claims of absolute performance leadership are in and of themselves a strong statement. Users might conclude that reaching 5.0 GHz was the only reason for that leadership, and that is ultimately not the narrative AMD wants to cultivate.

It also leaves the door open to a future product that will certainly say 5.0 GHz on the box. When AMD has extracted the marketing value of its increased IPC and efficiency, it can open that window and reap another focused review cycle.

In short: effective marketing is a skill, especially when there are multiple angles that can be leveraged as promotional tools. How you layer those communications can drastically affect and amplify product perception, and the order in which you play those cards can make or break a product cycle.

From a member of the press’ perspective, the more I interact with communications teams, the more I understand how they think.

Frequency Reporting

With all that being said, we need an updated table showing our measured peak and all-core turbo frequencies for the Ryzen 5000 series. As part of our power testing of each of the four processors, we hoover up all the data for per-core power and per-core frequency as we scale from idle to full-CPU load. Part of that data shows:

Ryzen 5000 Series Measured Data

AnandTech       Listed 1T   Firmware 1T*   Data 1T   Listed Base   Data nT   TDP (W)   Data (W)   nT W/core
Ryzen 9 5950X   4900        5025           5050      3400          3775      105       142        6.12
Ryzen 9 5900X   4800        4925           4950      3700          4150      105       142        7.85
Ryzen 7 5800X   4700        4825           4825      3800          4450      105       140        14.55
Ryzen 5 5600X   4600        4625           4650      3700          4450      65        76         10.20

Listed 1T: The official number on the box
*Firmware 1T: 'Maximum Frequency' as listed in CPU registers in AGESA 1100

The main takeaway from this data, aside from those measured turbo values, is that one of AMD’s new Zen 3 cores can hit 4000 MHz at around 7 W, as indicated by the per-core values on the 5950X and 5900X. For the future AMD Milan EPYC enterprise processors, this is vital information for seeing where exactly some of those processors will end up within any given power budget (such as 225 W or 280 W).
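Note that the per-core wattages in the table are measured telemetry values, not simply package power divided by core count – the difference implies a sizeable chunk of power going to the IO die and uncore. A minimal sketch using the table’s figures (the ‘non-core’ label is our shorthand for that remainder, which also bundles in measurement slack, not an AMD-published figure):

```python
# Implied non-core (IO die + uncore) power: full-load package power minus
# the sum of measured per-core power. Figures from the measured data table.
chips = {
    # name: (cores, package W under all-core load, measured W per core)
    "Ryzen 9 5950X": (16, 142, 6.12),
    "Ryzen 9 5900X": (12, 142, 7.85),
    "Ryzen 7 5800X": (8, 140, 14.55),
    "Ryzen 5 5600X": (6, 76, 10.20),
}

for name, (cores, package_w, w_per_core) in chips.items():
    non_core = package_w - cores * w_per_core
    print(f"{name}: ~{non_core:.1f} W outside the cores")
```

For the 5950X this works out to roughly 44 W not attributable to the cores themselves, which is why a naive 142/16 division overstates per-core power.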

Also of note are the last two processors – both report a 4450 MHz all-core turbo frequency, however the 5800X is doing it at 14.55 W per core while the 5600X needs only 10.20 W per core. It seems that the voltage of the 5800X is a lot higher than on the other processors, forcing higher thermals – we measured 90ºC at full load after 30 seconds (compared to 73ºC on the 5600X or 64ºC on the 5950X), which might be stunting the frequency here. The motherboard might be over-egging the voltage a little, going well above what the cores actually require.

Moving back to the halo chip, we can compare the loaded Core Frequency scaling of the new Ryzen 9 5950X with Zen 3 cores against the previous generation Ryzen 9 3950X with Zen 2 cores. It looks a little something like this.

Note that the 3950X numbers are updated from our original 3950X review, given the wide variety of BIOS updates since. Both CPUs exhibit a quick drop-off from single-core loading, and between 3-8 core load frequency remains steady, with the new processor anywhere from 400-450 MHz higher. As we scale beyond eight cores the two parts converge at 14-core load, and at full CPU load our Ryzen 9 5950X is 125 MHz lower than the 3950X.

Should we read much into this? The listed base frequency of the Ryzen 9 5950X is 100 MHz lower than that of the Ryzen 9 3950X (3400 MHz vs 3500 MHz), and we’re seeing a 125 MHz all-core difference. This potentially indicates that Zen 3 has a higher current density when all cores are active, and that due to the characteristics of the silicon and the core design (such as the wider core and faster load/store), this frequency difference is needed to stay within the same power when all cores are loaded. Naturally the benefit of Zen 3 is its higher performance per core, which should easily outweigh the 125 MHz difference. The benchmarks over the next dozen pages will showcase this.
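To put that deficit in context, a back-of-the-envelope calculation (all-core frequencies from our measured data) shows how little IPC uplift Zen 3 needs to break even at the lower clock:

```python
# All-core measured frequencies from the table earlier on this page.
freq_5950x = 3775  # MHz, Zen 3, measured all-core
freq_3950x = 3900  # MHz, Zen 2, measured 125 MHz higher at full load

# Fractional IPC uplift needed for the 5950X to match the 3950X
# per core, despite the lower all-core clock.
break_even = freq_3950x / freq_5950x - 1
print(f"Break-even IPC uplift: {break_even:.1%}")  # prints 3.3%
```

Any per-core IPC gain above roughly 3.3% therefore leaves Zen 3 ahead at all-core load, and AMD’s claimed Zen 3 IPC uplift is well beyond that.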

339 Comments

  • TheinsanegamerN - Tuesday, November 10, 2020 - link

    However AMD's boost algorithm is very temperature sensitive. Those coolers may work fine, but if they get to the 70C range you're losing max performance to higher temperatures.
  • Andrew LB - Sunday, December 13, 2020 - link

    Blah blah....

    Ryzen 5800x @ 3.6-4.7ghz : 219w and 82'c.
    Ryzen 5800x @ 4.7ghz locked: 231w and 88'c.

    Fractal Celsius+ S28 Prisma 280mm AIO CPU cooler at full fan and pump speed
    https://www.kitguru.net/components/cpu/luke-hill/a...

    If you actually set your voltages on Intel chips they stay cool. My i7-10700k @ 5.0ghz all-core locked never goes above 70'c.
  • Count Rushmore - Friday, November 6, 2020 - link

    It took 3 days... finally the article load-up.
    AT seriously need to upgrade their server (or I need to stop using IE6).
  • name99 - Friday, November 6, 2020 - link

    "AMD wouldn’t exactly detail what this means but we suspect that this could allude to now two branch predictions per cycle instead of just one"

    So imagine you have wide OoO CPU. How do you design fetch? The current state of the art (and presumably AMD have aspects of this, though perhaps not the *entire* package) goes as follows:

    Instructions come as runs of sequential instructions separated by branches. At a branch you may HAVE to fetch instructions from a new address (think call, goto, return) or you may perhaps continue to the next address (think non-taken branch).
    So an intermediate complexity fetch engine will bring in blobs of instructions, up to (say 6 or 8) with the run of instructions terminating at
    - I've scooped up N or
    - I've hit a branch or
    - I've hit the end of a cache line.

    Basically every cycle should consist of pulling in the longest run of instructions possible subject to the above rules.

    The way really advanced fetch works is totally decoupled from the rest of the CPU. Every cycle the fetch engine predicts the next fetch address (from some hierarchy of : check the link stack, check the BTB, increment the PC), and fetches as much as possible from that address. These are stuck in a queue connected to decode, and ideally that queue would never run dry.

    BUT: on average there is about a branch every 6 instructions.
    Now suppose you want to sustain, let's say, 8-wide. That means that you might set N at 8, but most of the time you'll fetch 6 or so instructions because you'll bail out based on hitting a branch before you have a full 8 instructions in your scoop. So you're mostly unable to go beyond an IPC of 6, even if *everything* else is ideal.

    BUT most branches are conditional. And a good half of those are not taken. This means that if you can generate TWO branch predictions per cycle then much of the time the first branch will not be taken, can be ignored, and fetch can continue in a straight line past it. Big win! Half the time you can pull in only 6 instructions, but the other half you could pull in maybe 12 instructions. Basically, if you want to sustain 8 wide, you'd probably want to pull in at least 10 or 12 instructions under best case conditions, to help fill up the queue for the cases where you pull in less than 8 instructions (first branch is taken, or you reach the end of the cache line).
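    A toy Monte Carlo makes that arithmetic concrete (parameters are the illustrative numbers from the discussion – a branch every 6 instructions on average, half of conditionals not taken – not any real core's behavior):

```python
import random

def avg_fetch(width, two_preds, trials=200_000, p_branch=1/6, p_not_taken=0.5):
    """Average instructions fetched per cycle for a toy fetch engine.

    Fetch stops at a taken branch or at 'width' instructions. With two
    predictions per cycle, one predicted-not-taken branch can be skipped
    over per cycle before fetch must stop.
    """
    random.seed(0)  # deterministic toy run
    total = 0
    for _ in range(trials):
        fetched = 0
        skips = 1 if two_preds else 0
        while fetched < width:
            fetched += 1
            if random.random() < p_branch:                  # hit a branch
                if skips > 0 and random.random() < p_not_taken:
                    skips -= 1                              # not taken: continue
                else:
                    break                                   # fetch ends here
        total += fetched
    return total / trials

print(f"one prediction,  width 8:  {avg_fetch(8, False):.2f} instr/cycle")
print(f"two predictions, width 12: {avg_fetch(12, True):.2f} instr/cycle")
```

    The single-prediction case lands around the "6 or so" figure mentioned above (a bit lower, since this toy also stops on not-taken branches), and the two-prediction case pulls usefully further ahead.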

    Now there are some technicalities here.
    One is "how does fetch know where the branches are, to know when to stop fetching". This is usually done via pre-decode bits living in the I-cache, and set by a kinda decode when the line is first pulled into the I-cache. (I think x86 also does this, but I have no idea how. It's obviously much easier for a sane ISA like ARM, POWER, even z.)
    Second, and more interesting, is that you're actually performing two DIFFERENT TYPES of prediction, which makes it somewhat easier from a bandwidth point of view. The prediction on the first branch is purely "taken/not taken", and all you care about is "not taken"; the prediction on the second branch is more sophisticated because if you predict taken you also have to predict the target, which means dealing with the BTB or link stack.

    But you don't have to predict TWO DIFFERENT "next fetch addresses" per cycle, which makes it somewhat easier.
    Note also that any CPU that uses two level branch prediction is, I think, already doing two branch predictions per cycle, even if it doesn't look like it. Think about it: how do you USE a large (but slow) second level pool of branch prediction information?
    You run the async fetch engine primarily from the first level; and this gives a constant stream of "runs of instructions, separated by branches" with zero delay cycles between runs. Great, zero cycle branches, we all want that. BUT for the predictors to generate a new result in a single cycle they can't be too large.
    So you also run a separate engine, delayed a cycle or two, based on the larger pool of second level branch data, checking the predictions of the async engine. If there's a disagreement you flush whatever was fetched past that point (which hopefully is still just in the fetch queue...) and resteer. This will give you a one (or three or four) cycle bubble in the fetch stream, which is not ideal, but
    - it doesn't happen that often
    - it's a lot better catching a bad prediction very early in fetch, rather than much later in execution
    - hopefully the fetch queue is full enough, and filled fast enough, that perhaps it's not even drained by the time decode has walked along it to the point at which the re-steer occurred...

    This second (checking) branch prediction doesn't ever get mentioned, but it is there behind the scenes, even when the CPU is ostensibly doing only a single prediction per cycle.

    There are other crazy things that happen in modern fetch engines (which are basically in themselves as complicated as a whole CPU from 20 years ago).

    One interesting idea is to use the same data that is informing the async fetch engine to inform prefetch. The idea is that you now have essentially two fetch engines running. One is as I described above; the second ONLY cares about the stream of TAKEN branches, and follows that stream as rapidly as possible, ensuring that each line referenced by this stream is being pulled into the I-cache. (You will recognize this as something like a very specialized form of run-ahead.)
    In principle this should be perfect -- the I prefetcher and branch-prediction are both trying to solve the *exact* same problem, so pooling their resources should be optimal! In practice, so far this hasn't yet been perfected; the best simulations using this idea are a very few percent behind the best simulations using a different I prefetch technology. But IMHO this is mostly a consequence of this being a fairly new idea that has so far been explored mainly by using pre-existing branch predictors, rather than designing a branch predictor store that's optimal for both tasks.
    The main difference is that what matters for prefetching is "far future" branches, branches somewhat beyond where I am now, so that there's plenty of time to pull in the line all the way from RAM. And existing branch predictors have had no incentive to hold onto that sort of far future prediction state. HOWEVER
    A second interesting idea is what IBM has been doing for two or three years now. They store branch prediction state in what they call an L2 storage but, to keep things clear, I'll call a cold cache. This is stale/far future branch prediction data that is unused for a while but, on triggering events, that cold cache data will be swapped into the branch prediction storage so that the branch predictors are ready to go for the new context in which they find themselves.

    I don't believe IBM use this to drive their I-prefetcher, but obviously it is a great solution to the problem I described above and I suspect this will be where all the performance CPUs eventually find themselves over the next few years. (Apple and IBM probably first, because Apple is Apple, and IBM has the hard part of the solution already in place; then ARM because they're smart and trying hard; then AMD because they're also smart but their technology cycles are slower than ARM's; and finally Intel because, well, they're Intel and have been running on fumes for a few years now.)
    (Note of course this only solves I-prefetch, which is nice and important; but D-prefetch remains as a difficult and different problem.)
  • name99 - Friday, November 6, 2020 - link

    Oh, one more thing. I referred to "width" of the CPU above. This becomes an ever vaguer term every year. The basic points are two:

    - when OoO started, it seemed reasonable to scale every step of the pipeline together. Make the CPU 4-wide. So it can fetch up to 4 instructions/cycle. decode up to 4, issue up to 4, retire up to 4. BUT if you do this you're losing performance every step of the way. Every cycle that fetches only 3 instructions can never make that up; likewise every cycle that only issues 3 instructions.

    - so once you have enough transistors available for better designs, you need to ask yourself what's the RATE-LIMITING step? For x86 that's probably in fetch and decode, but let's consider sane ISAs like ARM. There the rate limiting step is probably register rename. So lets assume your max rename bandwidth is 6 instructions/cycle. You actually want to run the rest of your machinery at something like 7 or 8 wide because (by definition) you CAN do so (they are not rate limiting, so they can be grown). And by running them wider you can ensure that the inevitable hiccups along the way are mostly hidden by queues, and your rename machinery is running at full speed, 6-wide each and every cycle, rather than frequently running at 5 or 4 wide because of some unfortunate glitch upstream.
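    That rate-limiting argument can be sketched with a toy model (all numbers illustrative, not any real design: rename capped at 6/cycle, fetch delivering its full width 75% of cycles and nothing on "hiccup" cycles, with a decode queue in between):

```python
import random

def renamed_per_cycle(fetch_width, rename_width=6, cycles=100_000, queue_cap=32):
    """Toy model: a fetch stage with occasional hiccups feeds a queue,
    and rename drains up to rename_width instructions per cycle."""
    random.seed(1)  # deterministic toy run
    queue = 0
    renamed = 0
    for _ in range(cycles):
        # 25% of cycles fetch delivers nothing; otherwise its full width.
        delivered = 0 if random.random() < 0.25 else fetch_width
        queue = min(queue_cap, queue + delivered)
        done = min(queue, rename_width)    # rename is the rate limiter
        renamed += done
        queue -= done
    return renamed / cycles

print(f"fetch 6-wide: rename sustains {renamed_per_cycle(6):.2f}/cycle")
print(f"fetch 8-wide: rename sustains {renamed_per_cycle(8):.2f}/cycle")
```

    With fetch matched to rename at 6-wide, every hiccup is lost rename throughput; overbuilding fetch to 8-wide lets the queue absorb the hiccups and keeps rename much closer to its 6/cycle ceiling.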
  • Spunjji - Monday, November 9, 2020 - link

    These were interesting posts. Thank you!
  • GeoffreyA - Monday, November 9, 2020 - link

    Yes, excellent posts. Thanks.

    Touching on width, I was expecting Zen 3 to add another decoder and take it up to 5-wide decode (like Skylake onwards). Zen 3's keeping it at 4 makes good sense though, considering their constraint of not raising power. Another decoder might have raised IPC but would have likely picked up power quite a bit.
  • ignizkrizalid - Saturday, November 7, 2020 - link

    Rip Intel no matter how hard you try squeezing Intel sometimes on top within your graphics! stupid site bias and unreliable if this site was to be truth why not do a live video comparison side by side using 3600 or 4000Mhz ram so we can see the actual numbers and be 100% assured the graphic table is not manipulated in any way, yea I know you will never do it! personally I don't trust these "reviews" that can be manipulated as desired, I respect live video comparison with nothing to hide to the public. Rip Intel Rip Intel.
  • Spunjji - Monday, November 9, 2020 - link

    I... don't think this makes an awful lot of sense, tbh.
  • MDD1963 - Saturday, November 7, 2020 - link

    It would be interesting to also see the various results of the 10900K the way most people actually run them on Z490 boards, i.e., with higher RAM clocks, MCE enabled, etc.; do the equivalent tuning with the 5000 series, and I'm sure they will run with faster than DDR4-3200 memory, plus perhaps a small all-core overclock.
