AMD Zen Microarchitecture: Dual Schedulers, Micro-Op Cache and Memory Hierarchy Revealed

Author: Dr. Ian Cutress

by Ian Cutress on August 18, 2016 9:00 AM EST


At their own side event this week, AMD invited select members of the press and analysts to come and discuss the next layer of Zen details. In this piece, we’re discussing the microarchitecture announcements that were made, and looking at how the design compares to previous generations of AMD cores.

AMD Zen

Prediction, Decode, Queues and Execution

First up, let’s dive right into the block diagram as shown:

If we focus purely on the left to start, we can see most of the high-level microarchitecture details including basic caches, the new inclusion of an op-cache, some details about decoders and dispatch, scheduler arrangements, execution ports and load/store arrangements. A number of slides later in the presentation talk about cache bandwidth.

Firstly, one of the bigger deviations from previous AMD microarchitecture designs is the presence of a micro-op cache (it is worth noting that these slides sometimes say op when they mean micro-op, which creates a little confusion). AMD’s Bulldozer design did not have an operation cache, so frequently used instructions had to be fetched from the other caches and decoded again each time they were needed. Intel has been implementing a similar arrangement for several generations to great effect (some put it as a major stepping stone for Sandy Bridge), so to see one here is quite promising for AMD. We weren’t told the scale or extent of this buffer, and AMD will perhaps give that information in due course.

Aside from the as-expected ‘branch predictor enhancements’, which are as vague as they sound, AMD has not disclosed the decoder arrangements in Zen at this time, but has stated that the decoders can handle four instructions per cycle to feed into the operations queue. This queue, with the help of the op-cache, can deliver six ops per cycle to the schedulers. The queue can dispatch more per cycle than the decoders accept because a single decoded instruction can split into two micro-ops (which makes the instruction vs. micro-op definitions even muddier). Nevertheless, this micro-op queue feeds the separate integer and floating point segments of the CPU. Unlike Intel, which uses a combined scheduler for INT/FP, AMD’s diagram suggests that the two segments will remain separate, each with its own scheduler, at this time.
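To make the decode versus dispatch arithmetic concrete, here is a minimal sketch of a single cycle. The function name and the per-instruction micro-op counts are assumptions for illustration only; AMD has disclosed just the two widths (four instructions decoded, six ops dispatched), and the sketch ignores any ops left waiting in the queue or supplied by the op-cache.

```python
DECODE_WIDTH = 4    # instructions decoded per cycle (AMD's stated figure)
DISPATCH_WIDTH = 6  # ops delivered to the schedulers per cycle (AMD's stated figure)


def micro_ops_dispatched(uops_per_instruction):
    """Return how many micro-ops reach the schedulers in one cycle.

    uops_per_instruction: micro-op count for each instruction decoded this
    cycle. These counts are hypothetical; AMD has not published a breakdown.
    """
    if len(uops_per_instruction) > DECODE_WIDTH:
        raise ValueError("decoder accepts at most 4 instructions per cycle")
    produced = sum(uops_per_instruction)
    # Dispatch is capped by the queue's output width, not by the decoder.
    return min(produced, DISPATCH_WIDTH)


if __name__ == "__main__":
    print(micro_ops_dispatched([1, 1, 1, 1]))  # four simple instructions
    print(micro_ops_dispatched([2, 2, 1, 1]))  # two instructions split in two
```

With four single-op instructions the dispatch width is not the limit; it only comes into play once some instructions split into two micro-ops.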

The INT side of the core will funnel the ALU operations as well as the AGU/load and store ops. The load/store units can perform two 16-byte loads and one 16-byte store per cycle, making use of the 32 KB, 8-way set associative, write-back L1 data cache. AMD has explicitly made this a write-back cache rather than the write-through cache we saw in Bulldozer, which was a source of a lot of idle time in particular code paths. AMD also states that loads and stores will have lower latency within the caches, but has not said by how much.
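The load/store figures translate into per-cycle bandwidth with some simple arithmetic, sketched below. The 64-byte cache line and the 3.0 GHz clock are assumptions for illustration; AMD has not stated either for Zen at this point, and the function names are hypothetical.

```python
LOAD_PORTS = 2         # 16-byte loads per cycle (from AMD's slide)
STORE_PORTS = 1        # 16-byte stores per cycle (from AMD's slide)
BYTES_PER_ACCESS = 16


def l1d_bytes_per_cycle():
    """Return (load, store, total) bytes moved per cycle at peak."""
    loads = LOAD_PORTS * BYTES_PER_ACCESS
    stores = STORE_PORTS * BYTES_PER_ACCESS
    return loads, stores, loads + stores


def l1d_sets(size_bytes=32 * 1024, ways=8, line_bytes=64):
    """Number of sets in the L1D. line_bytes=64 is an assumption."""
    return size_bytes // (ways * line_bytes)


def peak_gb_per_s(bytes_per_cycle, clock_ghz):
    """Peak bandwidth in GB/s for a given (hypothetical) clock speed."""
    return bytes_per_cycle * clock_ghz


if __name__ == "__main__":
    loads, stores, total = l1d_bytes_per_cycle()
    print(loads, stores, total)
    print(l1d_sets())
    print(peak_gb_per_s(total, 3.0))  # 3.0 GHz is a made-up clock
```

These are peak numbers only; sustained throughput depends on the cache latencies that AMD has yet to disclose.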

The FP side of the core will afford two multiply ports and two add ports, which should allow for two joined FMAC operations or one 256-bit AVX operation per cycle. The combination of the INT and FP segments means that AMD is going for a wide core and looking to exploit a significant amount of instruction level parallelism. How much it will be able to exploit depends on the caches and the reorder buffers. No real data on the buffers has been given at this time, except that the cores will have a 75% larger instruction scheduler window for ordering operations and a 50% wider issue width for potential throughput. The wider cores, all other things being sufficient, will also allow AMD’s implementation of simultaneous multithreading to take advantage of multiple threads whose code is linear and naturally low in IPC.
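As a back-of-the-envelope check on the FP figures, the sketch below pairs each multiply port with an add port to form an FMAC pipe. It assumes each pipe is 128 bits wide, which is inferred from the "two FMACs or one 256-bit AVX" statement and has not been confirmed by AMD; the function names are hypothetical.

```python
def fp_flops_per_cycle(fmac_pipes=2, pipe_bits=128, element_bits=32):
    """Peak floating point operations per cycle.

    pipe_bits=128 is an assumption inferred from the slide, not a
    disclosed figure. Each FMAC counts as two operations per lane
    (one multiply and one add).
    """
    lanes = pipe_bits // element_bits
    return fmac_pipes * lanes * 2


def avx256_ops_per_cycle(fmac_pipes=2, pipe_bits=128):
    """How many 256-bit operations fit across the paired pipes per cycle."""
    return (fmac_pipes * pipe_bits) // 256


if __name__ == "__main__":
    print(fp_flops_per_cycle(element_bits=32))  # single precision
    print(fp_flops_per_cycle(element_bits=64))  # double precision
    print(avx256_ops_per_cycle())
```

Under these assumptions a 256-bit AVX instruction occupies both pipes for a cycle, which matches the one-per-cycle figure in the text.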

Deciphering the New Cache Hierarchy: L1, 512 KB L2, 8 or 16 MB L3


216 Comments


JoeyJoJo123 - Thursday, August 18, 2016 - link
The ignorance... It hurts...

Original x86 (32-bit) was Intel-AMD developed.
AMD then developed x86-64, or x86 64-bit, and Intel continues to license it to this day.

There's no copying here. Someone did it first, then others are licensing that IP from them.

See: https://en.wikipedia.org/wiki/X86-64
>x86-64 (also known as x64, x86_64 and AMD64) is the 64-bit version of the x86 instruction set.
>The original specification, created by AMD and released in 2000, has been implemented by AMD, Intel and VIA.

That's why you might sometimes see driver versions labeled AMD64, and be puzzled as to why the 64-bit driver you downloaded states AMD64 in the name despite you being on an Intel 64-bit CPU. It's because it was an AMD-first technology, but it's usable on any x86-64 processor.
Bateluer - Thursday, August 18, 2016 - link
Intel simply paid for the license to copy the technology AMD designed. They still copied it, just legally paid for the right to do so.
Klimax - Saturday, August 20, 2016 - link
Actually, not exactly correct. Intel was forced by Microsoft to adopt AMD's solution, despite Intel having its own parallel implementation, which was different. And Intel's version is still a bit different from AMD's. (Some instructions differ between implementations, mostly relevant only to the OS.)
xenol - Thursday, August 18, 2016 - link
IBM made the dual-core on a single die design.
ExarKun333 - Thursday, August 18, 2016 - link
In many ways, Intel's 64-bit was superior to AMD's, but x86-64 was more backward compatible. I can see it both ways....different solutions to the same problem. Both companies have pushed each other...
TheMightyRat - Thursday, August 18, 2016 - link
How is IA64 superior to AMD64?
AMD64 can run 32-bit software without a performance hit and still runs 64-bit software about as well as its Intel counterpart.
IA64 Itanium runs 64-bit software much slower than a 64-bit Pentium 4 at the same clock and takes a massive performance hit in 32-bit emulation (1/3 as fast). Aren't both of them based on Netburst?
EM64T only has more codes than AMD64 as it also implements both AMD64 and IA64, which is no longer used in modern server software.
Klimax - Saturday, August 20, 2016 - link
He was talking about Intel's x64, which was the backup plan in case Itanium failed.
Myrandex - Thursday, August 25, 2016 - link
I don't think Itaniums were Netburst in architecture; it seemed to be a totally different architecture.
Gigaplex - Thursday, August 18, 2016 - link
Itanium was novel but turned out to be a poor performer. It relied too much on good compilers optimising the instruction order.
KPOM - Friday, August 19, 2016 - link
Wasn't Itanium based on "Very Long Instruction Word" architecture? Hence the long pipelines and reliance on clock speed? The Pentium M from Intel Israel righted Intel's ship and allowed them to take leadership of the x86 architecture back from AMD.
