AMD Zen Microarchitecture Part 2: Extracting Instruction-Level Parallelism

Author: Dr. Ian Cutress

by Ian Cutress on August 23, 2016 8:45 PM EST


Hot Chips is an annual conference that allows semiconductor companies to present their latest and greatest ideas or forthcoming products in an academic-style environment, and is predominantly aimed at the professional semiconductor engineer. This year has a number of talks about power management, upcoming IBM CPUs, upcoming Intel CPUs and upcoming NVIDIA SoCs, and the final talk of the final day is from AMD, discussing Zen in even more depth than the previous week. While we were unable to attend the event in person, we managed to get some hands-on time with the information and put questions to Mike Clark, AMD Senior Fellow and design engineer.

What We Learned Last Week: L1/L2/L3 Caches and the Micro-Op Buffer

In AMD’s initial presentation for the general media, we were given a sense of the microarchitecture layout. We have already covered that material, but a number of highlights are worth repeating.

AMD Zen Microarchitecture: Dual Schedulers, Micro-op Cache and Memory Hierarchy Revealed
AMD Server CPUs and Motherboard Analysis
Unpacking AMD's Zen Benchmark: Is Zen actually 2% Faster than Broadwell?

First up, and the most important, was the announcement of the inclusion of a micro-op cache. This keeps the decoded micro-ops of frequently used instructions close to the micro-op queue, saving a trip back through the caches and decoders to fetch and decode them again. Typically micro-op caches are still relatively small, and while AMD isn’t giving any information on size and associativity, we know that Intel’s version can support 1536 uOps with 8-way associativity; we expect AMD’s to be similar, though there are many options in play.
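To put the 1536 uOp, 8-way figure in context, here is a rough sketch of how such a structure could break down into sets and lines. The value of 6 uOps per line is an assumption drawn from Intel's public descriptions of its own micro-op cache, not from anything AMD has disclosed, and all names are illustrative.

```python
# Hypothetical breakdown of a 1536-entry, 8-way micro-op cache.
TOTAL_UOPS = 1536      # capacity quoted for Intel's design
WAYS = 8               # associativity quoted for Intel's design
UOPS_PER_LINE = 6      # ASSUMPTION: not stated in the article

lines = TOTAL_UOPS // UOPS_PER_LINE   # number of cache lines
sets = lines // WAYS                  # number of sets

print(f"{lines} lines arranged as {sets} sets x {WAYS} ways")
```

Under that assumption the structure holds 256 lines in 32 sets, which illustrates why these caches are described as relatively small.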

Second is the cache structure. We were given details of the L1, L2 and L3 cache sizes, along with associativity, to compare against former microarchitectures as well as Intel’s offerings.

CPU Cache Comparison
Cache	Zen HEDT	Bulldozer HEDT	Excavator	Skylake	Broadwell HEDT
L1-I size	64KB/core	64KB/module	96KB/module	32KB/core	32KB/core
L1-I associativity	4-way	2-way	3-way	8-way	8-way
L1-D size	32KB/core	16KB/thread	32KB/thread	32KB/core	32KB/core
L1-D associativity	8-way	4-way	8-way	8-way	8-way
L2 size	512KB/core	1MB/thread	512KB/thread	256KB/core	256KB/core
L2 associativity	8-way	16-way	16-way	4-way	8-way
L3 size	2MB/core	1MB/thread	-	>2MB/core	1.5-3MB/core
L3 associativity	16-way	64-way	-	16-way	16/20-way
L3 type	Victim	Victim	-	Write-back	Write-back

In this case, AMD has given Zen a 64KB L1 Instruction cache per core with 4-way associativity, with a lop-sided 32KB L1 Data cache per core with 8-way associativity. The size and associativity determine how frequently a cache access misses, and both are typically a trade-off against die area and power (larger caches require more die area, more associativity usually costs power). The instruction cache can deliver a 32-byte fetch per cycle, while the data cache allows for two 16-byte loads and one 16-byte store per cycle. AMD stated that allowing two D-cache loads per cycle better matches most workloads, which tend to have more loads than stores.
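As a rough illustration of how size and associativity translate into cache geometry, the sketch below derives the number of sets and index bits for the Zen caches listed above. It assumes 64-byte cache lines, which the article does not state; the `Cache` class and its fields are illustrative names only.

```python
from dataclasses import dataclass

LINE_BYTES = 64  # ASSUMPTION: 64-byte cache lines (not given in the article)
KB = 1024


@dataclass(frozen=True)
class Cache:
    name: str
    size_bytes: int
    ways: int

    @property
    def sets(self) -> int:
        # Each set holds `ways` lines, so sets = size / (ways * line size).
        return self.size_bytes // (self.ways * LINE_BYTES)

    @property
    def index_bits(self) -> int:
        # Address bits needed to select a set (sets is a power of two here).
        return (self.sets - 1).bit_length()


ZEN = [
    Cache("L1-I", 64 * KB, 4),
    Cache("L1-D", 32 * KB, 8),
    Cache("L2", 512 * KB, 8),
    Cache("L3 (per 4-core module)", 8 * 1024 * KB, 16),
]

for c in ZEN:
    print(f"{c.name}: {c.sets} sets, {c.index_bits} index bits")
```

With these assumptions the larger but less associative L1-I ends up with four times as many sets as the L1-D (256 versus 64), which is one way to see the different trade-offs made for instructions and data.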

The L2 is a large 512 KB, 8-way cache per core. This is double the size of Intel’s 256 KB 4-way cache in Skylake or 256 KB 8-way cache in Broadwell. As a rule of thumb, doubling the cache size cuts the miss rate by a factor of about 1.414 (the square root of 2), reducing the need to go further out to find data, but it comes at the expense of die area. This will have a big impact on a lot of performance metrics, and AMD is promoting faster cache-to-cache transfers than previous generations. Both the L1 and L2 caches are write-back caches, improving over the L1 write-through cache in Bulldozer.
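The square-root rule of thumb mentioned above is usually expressed in terms of miss rate rather than hit rate. The following sketch applies it to a 256 KB versus 512 KB L2; the 10% baseline miss rate is a made-up number for illustration, and real workloads can deviate substantially from this rule.

```python
import math


def scaled_miss_rate(base_miss_rate, base_size_kb, new_size_kb, exponent=0.5):
    """Empirical power-law scaling: miss rate ~ size ** -exponent."""
    return base_miss_rate * (base_size_kb / new_size_kb) ** exponent


BASE_MISS = 0.10  # HYPOTHETICAL miss rate for a 256 KB L2
m256 = BASE_MISS
m512 = scaled_miss_rate(BASE_MISS, 256, 512)

print(f"256 KB: miss {m256:.1%}, hit {1 - m256:.1%}")
print(f"512 KB: miss {m512:.1%}, hit {1 - m512:.1%}")
print(f"miss-rate reduction factor: {m256 / m512:.3f}")
```

Note that the hit rate only moves from 90% to roughly 93% in this example; it is the number of misses, and hence trips to the L3 or memory, that shrinks by about 1.414x.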

The L3 cache is an 8MB 16-way cache, although last week AMD did not specify how many cores share it. From the data release today, we can confirm rumors that this 8 MB cache is split over a four-core module, affording 2 MB of L3 cache per core or 16 MB of L3 cache for the whole 8-core Zen CPU. These two 8 MB caches are separate, so each acts as the last-level cache for its 4-core module, with the appropriate hooks into the other L3 to check whether it holds the data needed. As part of the talk today we also learned that the L3 is a pure victim cache for L1/L2 victims, rather than a cache for prefetch/demand data, which tempers the expectations a little, but the large L2 will make up for this. We’ll discuss it as part of today’s announcement.
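To make the victim-cache behavior concrete, here is a minimal sketch in which the L3 is filled only by lines evicted from the L2, never by demand fills. It is a simplification under stated assumptions: both levels are modeled as tiny fully associative LRU caches with strict exclusivity, whereas the real hardware is set-associative. The `LRU` class and `access` function are hypothetical names.

```python
from collections import OrderedDict


class LRU:
    """Tiny fully associative LRU cache holding line addresses."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.lines = OrderedDict()

    def __contains__(self, addr):
        return addr in self.lines

    def touch(self, addr):
        self.lines.move_to_end(addr)  # mark as most recently used

    def remove(self, addr):
        del self.lines[addr]

    def insert(self, addr):
        """Insert addr as most recent; return the evicted address or None."""
        self.lines[addr] = True
        self.lines.move_to_end(addr)
        if len(self.lines) > self.capacity:
            victim, _ = self.lines.popitem(last=False)  # least recently used
            return victim
        return None


def access(l2, l3, addr):
    """Look up addr; the L3 is only ever filled with L2 victims."""
    if addr in l2:
        l2.touch(addr)
        return "L2 hit"
    if addr in l3:
        l3.remove(addr)      # exclusive: the line moves back up to the L2
        result = "L3 hit"
    else:
        result = "miss"      # demand fill from memory bypasses the L3
    victim = l2.insert(addr)
    if victim is not None:
        l3.insert(victim)    # the L2 victim is caught by the L3
    return result


l2, l3 = LRU(capacity=2), LRU(capacity=4)
results = [access(l2, l3, a) for a in (1, 2, 3, 1)]
print(results)
```

In this run, line 1 is evicted from the L2 when line 3 arrives, lands in the L3, and is then found there on the final access, while no line is ever held in both levels at once.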

The mid-week release also gave insight into the dual schedulers, one for INT and another for FP, which is different to Intel’s joint scheduler/buffer implementation. The talk at Hot Chips goes into detail about how the dispatch and schedulers operate.

The New Information

As part of the Hot Chips presentation, AMD is reaffirming its commitment to at least +40% IPC improvement over Excavator. This has specifically been listed as a throughput goal at an equivalent energy per cycle, resulting in an increase in efficiency. Obviously a number of benefits come from moving from the 28nm TSMC process to GloFo’s 14nm FinFET process, which is used via a Samsung licence. Both the smaller node and FinFET improvements have been well documented so we won’t go over them here, but AMD is stating that Zen is much more than this: a direct improvement to immediate performance, not just efficiency. While Zen is initially a high-performance x86 core at heart, it is designed to scale all the way from notebooks to supercomputers, or from where the Cat cores (such as Jaguar and Puma) were all the way up to the old Opterons and beyond, all with at least +40% IPC.
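The link between "+40% IPC at equivalent energy per cycle" and efficiency can be worked through with simple ratios. The sketch below normalizes Excavator to 1.0 and assumes, purely for illustration, that both designs run at the same clock speed; the variable names are not AMD's.

```python
# Everything is relative to Excavator = 1.0.
IPC_GAIN = 1.40               # AMD's stated minimum IPC uplift
REL_ENERGY_PER_CYCLE = 1.0    # "equivalent energy per cycle"
REL_CLOCK = 1.0               # ASSUMPTION: same frequency for both designs

# Energy per instruction = energy per cycle / instructions per cycle.
rel_energy_per_instruction = REL_ENERGY_PER_CYCLE / IPC_GAIN

# Performance = IPC * clock; power = energy per cycle * clock.
rel_performance = IPC_GAIN * REL_CLOCK
rel_power = REL_ENERGY_PER_CYCLE * REL_CLOCK
rel_perf_per_watt = rel_performance / rel_power

print(f"energy per instruction: {rel_energy_per_instruction:.3f}x")
print(f"performance per watt:   {rel_perf_per_watt:.2f}x")
```

Under these assumptions each instruction costs roughly 29% less energy, and performance per watt rises by the same 40% as IPC, before counting any gains from the process node.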

The first image out of the presentation is the CPU Complex (a CCX), which shows the Zen core design as a four-core cluster with caches. This shows the L2/L3 cache breakdown, and also confirms 2MB of L3 per core with 8 MB of L3 per CCX. It also states that the L3 is mostly exclusive of the L2 cache, which stems from the L3 acting as a victim cache for L2 data. AMD is stating that the protocols involved in the L3 cache design allow each core to access the L3 of every other core, with latencies that fall within a range around an average.
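The capacity figures for a CCX can be tallied as follows. Because the L3 is described as mostly exclusive of the L2, the sum of the two is treated here as an upper bound on unique cached data per CCX rather than an exact figure; the two-CCX count reflects the 8-core part discussed above, and the constant names are illustrative.

```python
KB = 1024
MB = 1024 * KB

CORES_PER_CCX = 4
CCX_PER_CPU = 2            # the 8-core Zen CPU discussed in the article
L2_PER_CORE = 512 * KB
L3_PER_CORE = 2 * MB

l2_per_ccx = CORES_PER_CCX * L2_PER_CORE
l3_per_ccx = CORES_PER_CCX * L3_PER_CORE
l3_per_cpu = CCX_PER_CPU * l3_per_ccx

# "Mostly exclusive" means little duplication between L2 and L3, so the
# sum is an upper bound on unique data held per CCX, not a guarantee.
unique_upper_bound_per_ccx = l2_per_ccx + l3_per_ccx

print(f"L2 per CCX: {l2_per_ccx // MB} MB")
print(f"L3 per CCX: {l3_per_ccx // MB} MB, per CPU: {l3_per_cpu // MB} MB")
print(f"Upper bound on unique L2+L3 data per CCX: "
      f"{unique_upper_bound_per_ccx // MB} MB")
```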

Over the next few pages, we’ll go through the slides. They detail more information about the application of Simultaneous Multithreading (SMT), New Instructions, the size of various queues and buffers, the back-end of the design, the front-end of the design, fetch, decode, execute, load/store and retire segments.


106 Comments


Krysto - Wednesday, August 24, 2016 - link
I think PCs in general run better on four cores than on two, even if most apps themselves can't take advantage of them, although I think in the next 5 years most new games will take advantage of 8 threads. But otherwise, it's just good for multitasking.
tarqsharq - Wednesday, August 24, 2016 - link
I had an argument with one fellow on the internet regarding i7 being plenty for whatever I was doing in terms of core count. But streaming a show on one monitor while playing Overwatch was hitting 70%+ CPU usage, with all logical cores being 60-70% utilized consistently, with spikes up to 90%+.

That was on my i7-4770K to be specific, running 1080P on a 144hz monitor for Overwatch, and Crunchyroll for 1080P anime stream on the second monitor.

So some games combined with slight multitasking is already taxing the 4C/8T environment.
galta - Wednesday, August 24, 2016 - link
And how much multitasking are we really using? If I had to guess, I would say not much, on average.
You might have some folks here and there using it, but regular users need something between two and four cores, just as you said.
You have the OS, the software you're using, be it a game or not, plus everything that's running behind the scenes, including Windows inefficiencies, and that's it. But for some weird guy that spends his day on 7zip, more than 4 cores brings no extra power.
This is the reason why, no matter how excited we might get with 10 cores (I would love one, even if for bragging rights only), our i5s are enough for what we do.
Maybe in 5 years from now games will be multithreaded, but I'm not holding my breath: something similar was said 5 years ago, and here we are.
At the end of the day, we still need improvement in per core performance.
looncraz - Wednesday, August 24, 2016 - link
Browsers are becoming better and better at using more cores... and we're all running tens of processes in the background, some of which fire interrupts on a CPU. More cores allows for more going on at the same time without interruptions. You can actually feel this moving to an eight-core FX-8350 from a quad core i5... those eight cores provide a somewhat smoother multi-tasking environment, despite each core being slower and the overall performance being lower.

Humans are simply sensitive to changes in timing - more cores and more threads reduces the variability in timing, which improves perceived performance.
galta - Thursday, August 25, 2016 - link
Hum....
I don't know many people who share your opinion about FX-8350 vs i5.
Anyway, we have been multitasking for a while, at least to some extent: OS, Word, anti-virus, browser. The question is: for this light multitasking, are we better off with several cores with poor performance/core, or with fewer cores but with great performance/core?
Reviews and actual people generally prefer the latter.
As of browsers, great news that they are improving, but download/upload speed is by far the most important factor in users experience.
Alexvrb - Sunday, August 28, 2016 - link
Download speed is fine for web browsing if you've got something faster than DSL. How much data exactly do you think you're consuming while browsing the web? Outside of streaming videos you won't use up a ton of bandwidth.
Cooe - Thursday, May 6, 2021 - link
I know this is ANCIENT, but how the hell did you not realize that multi-core optimization was so bad only because nobody could afford greater than >4 core CPU's pre-Zen??? Modern games run freaking TERRIBLE now on 4c/4t i5's.
Notmyusualid - Wednesday, August 24, 2016 - link
No, nope, nej, and nein.

I see (FEEL) tangible improvements in my computing ever since I dropped 2 cores for 4.

And it looks like others below agree....
galta - Thursday, August 25, 2016 - link
I believe you do, for the sweet spot is now around 4 cores, as I said before.
The question is: do you believe that your experience will improve significantly if you move to 6 or 8 cores?
Probably not, unless you spend your day zipping files or rendering images.
Alexvrb - Sunday, August 28, 2016 - link
They said the same thing about quad cores, and dual cores before that. AMD has to get on top of the curve, not behind it. They'll offer quad cores for more mainstream systems, and 8 for performance rigs. More for servers, and potentially less for low-power and/or low-cost.
