At its own side event this week, AMD invited select members of the press and analysts to discuss the next layer of Zen details. In this piece, we cover the microarchitecture announcements that were made and look at how they compare to previous generations of AMD core designs.

AMD Zen

Prediction, Decode, Queues and Execution

First up, let’s dive right into the block diagram as shown:

If we focus purely on the left to start, we can see most of the high-level microarchitecture details, including the basic caches, the new inclusion of an op-cache, some details about decoders and dispatch, scheduler arrangements, execution ports and load/store arrangements. A number of slides later in the presentation talk about cache bandwidth.

Firstly, one of the bigger deviations from previous AMD microarchitecture designs is the presence of a micro-op cache (it is worth noting that these slides sometimes say op when they mean micro-op, creating a little confusion). AMD’s Bulldozer design did not have an operation cache, so frequently used micro-ops had to be regenerated by fetching from other caches and decoding again. Intel has been implementing a similar arrangement for several generations to great effect (some put it as a major stepping stone for Sandy Bridge, where it first appeared), so to see one here is quite promising for AMD. We weren’t told the size or organization of this buffer, and AMD will perhaps give that information in due course.
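To make the idea concrete, here is a minimal toy model of a micro-op cache: decoded micro-ops are stored by instruction address so that a hot loop skips the decoders on later passes. Everything in it is an assumption for illustration; the `ToyOpCache` class, its capacity, the LRU replacement policy and the `toy_decode` helper are hypothetical and say nothing about how AMD's actual design is organized.

```python
from collections import OrderedDict


class ToyOpCache:
    """Toy LRU cache mapping an instruction address to its decoded micro-ops."""

    def __init__(self, capacity):
        self.capacity = capacity          # hypothetical entry count, not an AMD figure
        self.entries = OrderedDict()
        self.hits = 0
        self.misses = 0

    def fetch(self, address, decode):
        if address in self.entries:
            self.hits += 1
            self.entries.move_to_end(address)    # mark as most recently used
            return self.entries[address]
        self.misses += 1
        micro_ops = decode(address)              # slow path: run the decoders
        self.entries[address] = micro_ops
        if len(self.entries) > self.capacity:
            self.entries.popitem(last=False)     # evict the least recently used entry
        return micro_ops


def toy_decode(address):
    # Stand-in decoder: pretend every instruction becomes exactly one micro-op.
    return [f"uop@{address:#x}"]


cache = ToyOpCache(capacity=4)
loop_body = [0x100, 0x104, 0x108]
for _ in range(10):                  # a hot loop executed ten times
    for addr in loop_body:
        cache.fetch(addr, toy_decode)

print(cache.hits, cache.misses)      # prints "27 3"
```

Only the first pass through the loop touches the decoders; the remaining nine passes are served from the cache, which is the behaviour that saves both decode bandwidth and power in a real design.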

Aside from the as-expected ‘branch predictor enhancements’, which are as vague as they sound, AMD has not disclosed the decoder arrangements in Zen at this time, but has stated that they can decode four instructions per cycle to feed into the operations queue. This queue, with the help of the op-cache, can deliver six ops per cycle to the schedulers. The queue can dispatch more per cycle than the decoders accept because a single decoded instruction can break down into two micro-ops (which makes the instruction vs. micro-op definitions even muddier). Nevertheless, this micro-op queue helps feed the separate integer and floating point segments of the CPU. Unlike Intel, which uses a combined scheduler for INT/FP, AMD’s diagram suggests that the two sides will remain separate, each with its own schedulers, at this time.
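The relationship between the four-wide decode and the six-wide dispatch can be sketched with a little arithmetic. This is a simplified model under stated assumptions: every instruction becomes either one or two micro-ops, the op-cache contribution is ignored, and the function name and `split_fraction` parameter are hypothetical.

```python
def micro_ops_per_cycle(split_fraction, decode_width=4, dispatch_width=6):
    """Micro-ops available per cycle when a fraction of instructions splits in two.

    decode_width and dispatch_width are the figures quoted in the article;
    split_fraction is an illustrative knob, not a measured value.
    """
    if not 0.0 <= split_fraction <= 1.0:
        raise ValueError("split_fraction must be between 0 and 1")
    produced = decode_width * (1.0 + split_fraction)
    return min(produced, float(dispatch_width))   # dispatch is the ceiling


for fraction in (0.0, 0.25, 0.5, 1.0):
    print(fraction, micro_ops_per_cycle(fraction))
```

Under these assumptions the dispatch width is saturated once half of the decoded instructions split into two micro-ops; below that point the decoders are the limit.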

The INT side of the core will funnel the ALU operations as well as the AGU/load and store ops. The load/store units can perform two 16-byte loads and one 16-byte store per cycle, making use of the 32 KB 8-way set associative write-back L1 data cache. AMD has explicitly made this a write-back cache rather than the write-through cache we saw in Bulldozer, which was a source of a lot of idle time in particular code paths. AMD also states that loads and stores will have lower latency within the caches, but has not said by how much.
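As a rough sketch of what those L1 data cache figures imply, the snippet below works out the number of sets and the peak bytes moved per cycle. The cache size, associativity and load/store widths come from the text above; the 64-byte line size and the 3 GHz clock are assumptions made purely for illustration, since AMD has not confirmed either here.

```python
L1D_BYTES = 32 * 1024    # from the article
L1D_WAYS = 8             # from the article
LINE_BYTES = 64          # ASSUMPTION: common x86 line size, not stated by AMD here
CLOCK_HZ = 3.0e9         # ASSUMPTION: hypothetical 3 GHz clock for illustration


def cache_sets(size_bytes, ways, line_bytes):
    """Number of sets in a set-associative cache."""
    return size_bytes // (ways * line_bytes)


def peak_bytes_per_cycle(loads=2, stores=1, width_bytes=16):
    """Peak L1D traffic per cycle: two 16-byte loads plus one 16-byte store."""
    return (loads + stores) * width_bytes


sets = cache_sets(L1D_BYTES, L1D_WAYS, LINE_BYTES)
per_cycle = peak_bytes_per_cycle()
gb_per_s = per_cycle * CLOCK_HZ / 1e9

print(f"{sets} sets, {per_cycle} bytes/cycle, {gb_per_s:.0f} GB/s peak at the assumed clock")
```

These are theoretical ceilings per core; sustained numbers depend on the latencies AMD has yet to disclose.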

The FP side of the core will afford two multiply ports and two add ports, which should allow for two joined FMAC operations or one 256-bit AVX operation per cycle. The combination of the INT and FP segments means that AMD is going for a wide core and looking to exploit a significant amount of instruction level parallelism. How much it will be able to exploit depends on the caches and the reorder buffers. No real data on the buffers has been given at this time, except that the cores will have a 75% larger instruction scheduler window for ordering operations and a 50% wider issue width for potential throughput. The wider cores, all other things being sufficient, will also allow AMD’s implementation of simultaneous multithreading to take advantage of multiple threads that individually have a linear, naturally low IPC.
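For a sense of scale, peak floating point throughput can be estimated from the port description. This is a back-of-the-envelope sketch that assumes each FMAC pairing is 128 bits wide, which is inferred from the "two FMACs or one 256-bit AVX operation" statement rather than confirmed by AMD; the function name is hypothetical.

```python
def fma_flops_per_cycle(fma_units=2, vector_bits=128, element_bits=32):
    """Peak FLOPs per cycle, counting each fused multiply-add as two operations.

    vector_bits=128 is an ASSUMPTION inferred from the article's description.
    """
    lanes = vector_bits // element_bits
    return fma_units * lanes * 2


single_precision = fma_flops_per_cycle(element_bits=32)
double_precision = fma_flops_per_cycle(element_bits=64)

print(single_precision, double_precision)   # prints "16 8"
```

Treating one 256-bit AVX operation as occupying both 128-bit pairings gives the same totals, which is why the two cases in the text are equivalent in peak terms.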


216 Comments

  • looncraz - Thursday, August 18, 2016

    Really, this design is like nothing Intel has.

    Intel uses a unified scheduler, and it looks from the diagram that AMD is using seven schedulers... which is just insane. Beyond both using SMT schemes and executing x86, they are very different designs.
  • e36Jeff - Thursday, August 18, 2016

    Just a quick FYI, Intel is licencing the SMT technology from Sun, as they hold the US patents for it. So Intel, just like AMD, is copying Sun.
  • svan1971 - Thursday, August 18, 2016

    Wow that was a hell of a burn on AMD zingy... Nothing better than rooting for Goliath huh.
  • farmergann - Thursday, August 18, 2016

    Zen is actually an enlarged evolution of the Jag cores with doubled up pipelines and SMT. Don't take my word for it either, study the link below and pay attention to what we learn about Zen. Jag/Puma+ are actually better cores than their Intel competitors despite a huge node disadvantage. AMD is back.
    http://www.realworldtech.com/jaguar/
  • msx68k - Thursday, August 18, 2016

    AMD did not copy anything from Intel, because Intel did not invent the SMT technique. SMT was developed by IBM in the '60s, while CMT was developed by DEC in the '90s, and both are processor design techniques, something like RISC or CISC.
  • The_Countess - Friday, August 19, 2016

    Like Intel copied the short pipeline of the Athlon 64, the on-die memory controller, and the larger L1 and L2 caches, in addition to the already mentioned AMD64.
  • medi03 - Friday, August 19, 2016

    That's one silly statement.
    That's the way progress works. When there is a good idea to (re-)use, you do it. Nothing wrong with it.
  • stimudent - Friday, August 19, 2016

    Think or research about what you're about to say before posting.
  • SanX - Friday, August 19, 2016

    Doubt about that. Somebody is just pumping AMD stock. Typical bluff, none of these 200 journos have a clue about all this cache speed exchange etc., they understand only cash speed exchange. The 40% increase in processor performance they claim will actually be 20% or even 10%. And compared to Intel in 2017 - 0%. You cannot jump by a factor of 2 anymore, Moore's law is dead. And a 10-20% difference in computing means EQUAL, and all that Zen noise means NOTHING.
  • looncraz - Friday, August 19, 2016

    In order for the feat they demonstrated to be real, they had to have exceeded 40% IPC over Excavator, unless their SMT is scaling unusually well.

    An FX-8350 at 3GHz would take well more than twice as long. Even an FX-8350 at 4GHz would probably take twice as long.
