The High-Level Zen Overview

AMD is keen to stress that the Zen project had three main goals: core, cache and power. The power goal was an aggressive one, not in the sense of aiming for a mobile-first design, but in that efficiency at high performance levels was key to being competitive again. It is worth noting that AMD did not mention 'die size' among the three main goals, which is usually a requirement as well. Arguably you can build a massive core design that runs at high performance and low latency, but it comes at the expense of die area, which makes such a design less economical as a product (if AMD had to rely on 500mm2 consumer dies at 14nm, they would be priced far too high). Nevertheless, power was the main concern, rather than pure performance or functionality, which have been AMD's typical targets in the past. This shifting of the goal posts was part of the process of creating Zen.

This slide lists a number of features we will return to later in this piece, grouped under those three main goals of core, cache and power.

For the core, bigger and wider everything was to be expected, but maintaining low latency at the same time can be difficult. Features such as the micro-op cache help most instruction streams improve in performance by bypassing parts of potentially long, repetitive decode work, while the larger dispatch width, larger retire queue, larger schedulers and better branch prediction mean that higher throughput can be sustained for longer and in the fastest order possible. Add in two threads per core, and keeping the functional units occupied with full queues also improves multi-threaded performance.
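To make the micro-op cache point concrete, here is a minimal sketch in plain C (nothing AMD-specific; the function name and sizes are made up): a compact hot loop like this decodes to only a handful of micro-ops, which is exactly the kind of instruction stream an op cache can replay each iteration without re-decoding.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* Sketch only: a small, hot loop of the sort a micro-op cache serves well. */
static uint64_t sum_array(const uint32_t *data, size_t n)
{
    uint64_t sum = 0;
    for (size_t i = 0; i < n; i++)  /* few instructions per iteration: decoded */
        sum += data[i];             /* once, then replayed from the op cache   */
    return sum;
}

int main(void)
{
    enum { N = 1 << 16 };
    static uint32_t data[N];
    for (size_t i = 0; i < N; i++)
        data[i] = (uint32_t)i;
    printf("sum = %llu\n", (unsigned long long)sum_array(data, N));
    return 0;
}
```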

For the caches, faster prefetch and better algorithms ensure that data is ready in each of the caches when a thread needs it. Faster caches were AMD's target, and while the company is not disclosing latencies or bandwidth at this time, we are told that L1/L2 bandwidth is doubled, with L3 bandwidth up to 5x.
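As a hedged illustration of what a prefetcher has to work with (generic C, not tied to anything AMD has disclosed): hardware prefetch generally does well on the predictable sequential walk below and poorly on the dependent pointer chase, and narrowing that gap is what better prefetch algorithms are after.

```c
#include <stddef.h>
#include <stdint.h>

typedef struct node { struct node *next; uint64_t value; } node_t;

/* Sequential walk: addresses follow a fixed stride, so a hardware prefetcher
 * can fetch upcoming cache lines before the loop asks for them. */
uint64_t sum_sequential(const uint64_t *data, size_t n)
{
    uint64_t sum = 0;
    for (size_t i = 0; i < n; i++)
        sum += data[i];
    return sum;
}

/* Pointer chase: each address depends on the previous load, so there is
 * little for a prefetcher to predict and memory latency is exposed. */
uint64_t sum_chase(const node_t *head)
{
    uint64_t sum = 0;
    for (const node_t *p = head; p != NULL; p = p->next)
        sum += p->value;
    return sum;
}
```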

For power, AMD has taken what it learned with Carrizo and moved it forward. This involves more aggressive monitoring of critical paths around the core, and better control of frequency and power in various regions of the silicon. Zen will have more clock regions (it seems various parts of the front-end and back-end can be gated as needed), along with features that help improve power efficiency, such as the micro-op cache, the Stack Engine (a dedicated low-power address manipulation unit) and move elimination (a low-power method for register adjustment: pointers to registers are adjusted rather than sending the move through the higher-power scheduler).
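For a sense of the instruction patterns the Stack Engine and move elimination target, here is a hypothetical C sketch; the comments describe what a typical x86-64 compiler emits for it, not AMD's implementation.

```c
#include <stdint.h>

uint64_t scale(uint64_t x);  /* external on purpose, so the call is not inlined */

uint64_t demo(uint64_t a, uint64_t b)
{
    /* To keep 'b' alive across the call, a typical x86-64 compiler moves it
     * into a callee-saved register (a reg-to-reg MOV) and saves that register
     * with push/pop. Move elimination lets the rename stage satisfy the MOV by
     * re-pointing a register mapping instead of using an ALU, while a stack
     * engine tracks the push/pop stack-pointer adjustments away from the main
     * schedulers. */
    uint64_t t = scale(a);
    return t + b;
}
```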

The Big Core Diagram

We saw this diagram last year, showing some of the bigger features AMD wants to promote:

The improved branch predictor allows for 2 branches per Branch Target Buffer (BTB) entry, but in the event of tagged instructions will filter through the micro-op cache. On the other side, the decoder can dispatch 4 instructions per cycle; however, some of those instructions can be fused into the micro-op queue. Fused instructions still come out of the queue as two micro-ops, but take up less buffer space as a result.
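As a generic example of where that helps (an illustrative sketch, not taken from AMD material): loop and branch conditions compile to compare-plus-conditional-jump pairs, which are the classic candidates for this kind of fusion.

```c
#include <stddef.h>

/* Each iteration's exit test (i < n) and the body's test (v[i] != 0) compile
 * to CMP/TEST followed by a conditional jump; adjacent pairs like these are
 * what a front end can fuse so they occupy less queue space. */
size_t count_nonzero(const int *v, size_t n)
{
    size_t count = 0;
    for (size_t i = 0; i < n; i++) {
        if (v[i] != 0)
            count++;
    }
    return count;
}
```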

As mentioned earlier, the INT and FP pipes and schedulers are separated. The INT rename space is 168 registers deep, feeding into six 14-entry scheduling queues. The FP side employs a 160-entry register file, and both the FP and INT sections feed into a 192-entry retire queue. The retire queue can operate at 8 instructions per cycle, up from 4 per cycle in previous AMD microarchitectures.

The load/store units are improved, supporting 72 out-of-order loads, similar to Skylake. We’ll discuss this a bit later. On the FP side there are four pipes (compared to three in previous designs) which support 128-bit FMAC instructions. These can be combined for one 256-bit AVX operation, but beyond that the work has to be scheduled over multiple instructions.
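To illustrate the FP width point, here is a sketch using standard x86 FMA intrinsics (not AMD sample code): the 256-bit fused multiply-add below is the sort of operation that, per the description above, is handled by combining the 128-bit pipes, so it executes correctly but without the per-instruction width advantage of a native 256-bit unit. It assumes a compiler with AVX/FMA enabled (e.g. -mfma).

```c
#include <stddef.h>
#include <immintrin.h>

/* dst[i] = a[i] * b[i] + c[i], eight single-precision lanes per iteration. */
void fma_arrays(float *dst, const float *a, const float *b,
                const float *c, size_t n)
{
    size_t i = 0;
    for (; i + 8 <= n; i += 8) {
        __m256 va = _mm256_loadu_ps(a + i);
        __m256 vb = _mm256_loadu_ps(b + i);
        __m256 vc = _mm256_loadu_ps(c + i);
        _mm256_storeu_ps(dst + i, _mm256_fmadd_ps(va, vb, vc));
    }
    for (; i < n; i++)   /* scalar tail for any leftover elements */
        dst[i] = a[i] * b[i] + c[i];
}
```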

Comments

  • mikeZZZ - Friday, March 3, 2017 - link

    AnandTech, can we please run closer-to-real-life scenarios, such as a gaming benchmark with a file compression benchmark running at the same time? Even gaming enthusiasts run more than one program at a time. For example, file decompression in the background while playing a game, or a baseball game streaming in a small window while playing a game. You already have many individual benchmarks, so why not go the extra but significant step of running two at once? We know this favors the higher core count CPUs (maybe even the Ryzen 7 1700 over all other lower core count CPUs), but that is closer to real life and should be very meaningful to someone wanting to make an informed purchase.
  • ValiumMm - Saturday, March 4, 2017 - link

    Would also like to see this
  • UrQuan3 - Friday, March 3, 2017 - link

    Just want to put out a quick comment about benchmarking with Handbrake. In dealing with Broadwell-E, and especially ThunderX, I've found that Handbrake often doesn't scale well past about 10 cores, and really doesn't scale well past 16 or so. What seems to happen is that the single-threaded parts of Handbrake tend to dominate the encode time. In extreme cases, ultra-fast and placebo will take almost the same amount of time, as x264 consumes input faster than the rest of Handbrake can generate it. On ThunderX, I've found I can complete four 1080p placebo encodes in the same amount of time that I can complete one. I would expect a similar result on a 48-core Intel, though I do not have access to one beyond 24 cores. Turbo boost would hide this effect a bit.

    I am not knocking using Handbrake for benchmarking. The Handbrake and ray-trace results are the two that I care about most. I just thought I'd give a heads up about this limitation. You can check CPU usage statistics to get an indication of when you are running up against this limit.

    Oh, and I am very excited to see multiple ray-tracers in your runs. Please continue.
  • Meteor2 - Saturday, March 4, 2017 - link

    Presumably though you can have several x264 jobs running simultaneously on that hardware? So while your time to encode a certain piece doesn't decrease, you have more total throughput (e.g. encoding several different bitrates for adaptive streaming). Should give good efficiency too on a larger Broadwell-E or a ThunderX.
  • UrQuan3 - Tuesday, March 7, 2017 - link

    Exactly. It's the first time I've thought about installing a queue manager for a single computer.
  • jade5419 - Saturday, March 4, 2017 - link

    I agree with this. In my experience Handbrake has a core / thread limit.

    I have a Z600 system with dual Xeon 5570 @ 2.93GHz, 6 core / 12 threads (total 24 threads), 48GB of RAM and a Z620 system with dual Xeon E5-2690 @ 2.9GHz 8 core / 16 threads (total 32 threads), 64GB RAM.

    The two systems transcode video at the same speed using Handbrake 1.0.3. Monitoring CPU usage shows all threads of the Z600 at 100% utilization whereas the CPU utilization on the Z620 is approximately 80%.
  • Notmyusualid - Sunday, March 5, 2017 - link

    Ever tried running GTA5 on 28 cores?

    It doesn't work. You have to adjust the game launcher's core affinity to < 26 cores or it won't even load.

    Given this discovery, I expect there are many more applications out there, that may crap-out as we see more and more cores come into the mainstream.

    Just a thought.
  • mapesdhs - Sunday, March 5, 2017 - link

    I'd love to know why this happens. I'm guessing something dumb within Windows.
  • Outlander_04 - Friday, March 3, 2017 - link

    There is more than enough good news to make me want to buy a 6-core Ryzen when they become available.
    Likely that will be the sweet spot for gamers.
  • 0ldman79 - Saturday, March 4, 2017 - link

    I'm looking forward to seeing Ryzen updated in the bench.

    There aren't any apps or benchmarks that cross over between the FX series and the Ryzen series, so we can't do any side by side comparison.

    Great review guys. Looking forward to the six core Ryzen. I think just like the FX series the six core will be the sweet spot.
