Hot Chips 2018: Samsung’s Exynos-M3 CPU Architecture Deep Dive

Author: Andrei Frumusanu

by Andrei Frumusanu on August 20, 2018 1:00 PM EST

As part of this year’s first conference talks at HotChips 2018 at the Flint Center for the Performing Arts in Cupertino, California, we’ve had the pleasure to finally hear Samsung’s official microarchitecture disclosure on this year’s most polarising new CPU design, the Exynos M3.

We published our first exclusive report on the details of the new microarchitecture back in January. It was clear from that point on that the design was a big one: Samsung had gone for a huge push in terms of performance, resulting in one of the biggest generational jumps of any silicon CPU designer in recent history.

Over the coming months much of the hype for the new Exynos 9810 with its M3 cores fizzled out, with each bit of additional testing revealing less and less enticing results. This started with some questionable early benchmarks at the release of the Galaxy S9, continued through our extremely in-depth Galaxy S9 device and SoC review, and later moved on to DIY improvements attempting to resolve some of the lower-hanging fruit among the software issues which hampered the real-world performance of the Exynos Galaxy S9. Throughout these pieces, of course, we had little in the way of official information from Samsung; until today we still didn’t know much about how the M3 microarchitecture actually worked.

Rewinding back two years, we briefly covered Samsung’s initial microarchitecture disclosure of the Exynos M1 at HotChips 2016, which was a great thing to see. While I always prefer to stay on-topic in articles, we’ve never really taken the opportunity to talk that much about Samsung’s design teams – understanding the teams that create these products also gives us a great deal of insight into the products. This has been especially evident over the last few years, with us understanding more about the workings inside of Arm’s Cambridge, Sophia Antipolis and Austin Cortex-A CPU design centres.

Samsung’s CPU IP is developed in Austin, Texas, at “Samsung’s Austin R&D Center”, or SARC. The centre was founded in 2010 with the goal of establishing in-house IP for Samsung’s S.LSI division and Exynos chipsets. Staffed with ex-AMD, ex-Intel and various other talented industry veterans, the centre has produced memory controllers and custom interconnects and, alongside them, the more visible IP: Samsung’s first custom CPUs.

The Exynos M1 is said to have started its design cycle sometime in 2012 and saw a quite short 3-year development phase, going from scratch to first tape-out. It made its first appearance in the Exynos 8890 in the 2016 Galaxy S7. Over the years SARC has been expanding, and in 2017 the Advanced Computing Lab (ACL) in San Jose was opened and added to SARC’s joint charter, adding custom GPU IP to its design portfolio that we hope to see productised in a couple of years.

The Exynos M1 being designed from scratch, it’s natural to expect that follow-up generations would be using it as the starting point for further development. Following the tape-out of the M1, the SARC team started off the M3 design with the existing M1 RTL back in Q1 of 2015. At first, this was meant to be an incremental development. However, there was a larger change of plans later on in Q1 2016, as goals were set higher for a much larger performance push.

The existing improvements were forked in Q3 2015 into what became the M2, which was initially meant to only be a 10LPE port of the M1 (which was 14LPP). As a reminder, the M2 had a robust ~20% IPC improvement across workloads, which allowed it to outperform the M1 even though it was clocked 12% slower in production silicon. Samsung had achieved this by implementing some of the originally planned M3 features into the M2, while the new M3 design became more aggressive.
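As a quick sanity check of that claim, here is a minimal sketch that models performance as simply IPC multiplied by clock frequency. That model and the function name `relative_performance` are our own simplifying assumptions rather than anything Samsung presented; the 20% and 12% figures are the ones quoted above.

```python
def relative_performance(ipc_gain: float, clock_change: float) -> float:
    """Performance ratio versus a baseline, assuming performance = IPC x frequency.

    ipc_gain and clock_change are fractional changes (0.20 means +20%,
    -0.12 means 12% slower). This ignores memory effects and workload mix.
    """
    return (1.0 + ipc_gain) * (1.0 + clock_change)


# Figures quoted in the article: ~20% IPC uplift, clocked 12% slower.
m2_vs_m1 = relative_performance(ipc_gain=0.20, clock_change=-0.12)
print(f"M2 relative to M1: {m2_vs_m1:.3f}x")
```

Under this simple model the M2 still comes out a little over 5% ahead of the M1, which is consistent with it outperforming its predecessor despite the lower clock.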

Here Samsung clearly states one of the industry’s most unforgiving aspects: a relentless launch cycle with which the IP and silicon must stay in sync. Many issues that we’ve seen in SoCs from various vendors, both in terms of performance and especially in terms of power consumption, are mostly due to a rush to market in order to catch the commercial release window for new products.

Samsung's Exynos M3 - Introducing Meerkat

Starting off with a high-level overview of the Exynos M3 alongside the original slide of the M1, we see big similarities, yet the M3 brings more to the table. The SARC team increased the microarchitecture width from a 4-wide decode unit to a 6-wide one, the defining characteristic of the new µarch. We see an added integer ALU with multiplier capability, a second load unit and a vastly expanded floating point / SIMD complex with up to triple the compute capacity.

We never really had a microarchitecture disclosure of the M2 and there’s also no specific compiler machine model tied to it, but one of the changes revealed today is that its reorder buffer saw a minor adjustment from 96 to 100 entries. As mentioned in our first µarch disclosure back in January, the M3 vastly expands this to up to 228 entries, making this aspect of the µarch a lot more similar to what we see in Intel’s core designs, although we can’t make a direct density comparison across different ISAs as instruction complexity varies.

When Arm disclosed the A76 µarch details and particularly the 128-entry ROB (which seems quite small in comparison to the M3’s), they said that this was a balance between performance and area/power. In particular we saw a mention that a 7% increase in the ROB capacity only came with a 1% performance gain on average.

Samsung explains that ROB capacity is a choice which is tied to the design of the rest of the microarchitecture and the various buffers and backend scheduler capacities – µarch depth and µarch width are complementary to each other for performance. A wider µarch such as the M3 is able to fill the ROB faster and thus get more performance return out of the larger capacity. Overall the M3 is said to have been made with a different set of decisions than the M1/M2, aiming for the higher performance while taking into account the associated costs.
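To put the relationship between width and depth into numbers, the sketch below estimates how many cycles of full-width dispatch each reorder buffer can absorb before it is full. It assumes one ROB entry per decoded instruction and sustained dispatch at the full decode width, which real code rarely achieves; the helper name `cycles_to_fill` is ours, while the entry counts and widths are those quoted in this article.

```python
def cycles_to_fill(rob_entries: int, decode_width: int) -> float:
    """Cycles needed to fill an empty ROB when dispatching at full width.

    Assumes one ROB entry per instruction and no retirement in the meantime,
    i.e. the head of the ROB is blocked (for example on a cache miss).
    """
    return rob_entries / decode_width


m1_cycles = cycles_to_fill(rob_entries=96, decode_width=4)
m3_cycles = cycles_to_fill(rob_entries=228, decode_width=6)

print(f"M1: {m1_cycles:.0f} cycles of 4-wide dispatch")
print(f"M3: {m3_cycles:.0f} cycles of 6-wide dispatch")
```

In this simplified view the M3's window covers more cycles of peak-rate dispatch than the M1's even though it fills 50% faster per cycle, which illustrates why a wider design needs a deeper buffer just to cover the same stall.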

A Much Larger Front-End

Diving into more details on the front-end, we see the various improvements on the branch predictor and fetch units. The M1’s branch predictor differed from other µarchs in that it was able to take two branches per cycle and had two branch ports on the backend. The M3 seems to maintain this width, but doubles up the µBTB from 64 to 128 entries. The mainBTB remains at 4K entries but has had latencies improved for taken branches.

On top of this the branch predictor quality overall has been increased, resulting in a 15% average reduction in missed branches. It was interesting to see Samsung publish an actual MPKI (misses per kilo instructions) value here, as it's something we haven’t seen acknowledged by Arm (or any vendor?) to date. Here Samsung monitors a continuously expanding suite of 4000-6000 code traces from various applications and use-cases, against which it validates its performance during development.
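For readers unfamiliar with the metric, here is a small sketch of how MPKI is computed and what a 15% reduction means. The trace counts below are invented purely for illustration and are not Samsung's figures; the function names are ours as well.

```python
def mpki(mispredicts: int, instructions: int) -> float:
    """Branch mispredictions per 1000 retired instructions."""
    return mispredicts * 1000 / instructions


def apply_reduction(baseline_mpki: float, reduction: float) -> float:
    """MPKI after a fractional reduction in missed branches (0.15 means 15%)."""
    return baseline_mpki * (1.0 - reduction)


# Hypothetical trace: 5,000 mispredicted branches in 1,000,000 instructions.
baseline = mpki(mispredicts=5_000, instructions=1_000_000)
improved = apply_reduction(baseline, reduction=0.15)

print(f"baseline MPKI: {baseline:.2f}")
print(f"after 15% fewer misses: {improved:.2f}")
```

Because MPKI is normalised per instruction rather than per branch, it folds in how branch-heavy a workload is, which is why it is usually averaged across a large suite of traces.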

The branch predictor and fetch units respectively feed a decoupled address queue and a decoupled instruction queue; this is likely done so that the units can be clock gated in the implementation.

The fetch unit’s bandwidth has been doubled and now can read up to 48 bytes per cycle which corresponds to 12 32b instructions per cycle – this results in a 2:1 ratio of fetch versus decode capacity which is an increase over the 1.5:1 ratio (24B/c, 4 decode) in the M1. Samsung explains that the big increase is needed to combat the increasingly big problem of branch bubbles on wider microarchitectures. They admit that on average, the distance between taken branches is less than 12 instructions, but the larger width helps a lot for temporary bursts of instructions.
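The ratios above follow directly from the fetch bandwidth and the fixed 32-bit instruction size. The sketch below reproduces the arithmetic; the function name is ours, and it assumes every fetched byte belongs to a useful 4-byte instruction, which is the best case.

```python
INSTRUCTION_BYTES = 4  # fixed-width 32-bit instructions


def fetch_to_decode_ratio(fetch_bytes_per_cycle: int, decode_width: int) -> float:
    """Peak instructions fetched per cycle divided by instructions decoded per cycle."""
    fetched_instructions = fetch_bytes_per_cycle // INSTRUCTION_BYTES
    return fetched_instructions / decode_width


m1_ratio = fetch_to_decode_ratio(fetch_bytes_per_cycle=24, decode_width=4)
m3_ratio = fetch_to_decode_ratio(fetch_bytes_per_cycle=48, decode_width=6)

print(f"M1: {24 // INSTRUCTION_BYTES} instructions/cycle fetched, ratio {m1_ratio}:1")
print(f"M3: {48 // INSTRUCTION_BYTES} instructions/cycle fetched, ratio {m3_ratio}:1")
```

These are peak figures; taken branches and cache line boundaries will cut individual fetches short, so the sustained ratio is lower.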

While this change has a high instantaneous power utilisation, it has an overall net positive effect on power: when the instruction queue (which is now double the depth) gets filled faster than the decode unit can drain it, the fetch unit can be clock gated. Here the overall energy efficiency is tied more to branch predictor quality, as it doesn’t actually matter when an instruction is fetched; with a larger “batch” such as the one implemented here, however, there might be more overhead in thrown-away instructions when a branch path is wrongly taken.

The instruction cache / L1I comes at 64KB. We’re not sure if this is an increase over the M2 as it’s hard to measure, but it definitely is double the size of the M1 µarch.

The instruction translation lookaside buffer (ITLB) has grown from 256 to 512 entries. It’s to be noted here that Samsung is employing a three-level hierarchy, as opposed to the two-level hierarchy we see in Arm’s processors. The A75 and A76 have first-level µITLBs of 32 and 48 entries respectively, with a clustered mainTLB of a total of 1280 entries, consisting of 1024 entries for pages up to 64KB and a secondary 256-entry table for pages >=1MB.

Samsung has a first-level data and instruction TLB as well, but doesn’t disclose the size of the L1 ITLB. The last-level unified TLB for Samsung is also larger at 4096 entries; we’ll come back to it on a later slide.
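One way to get a feel for these entry counts is to convert them into "TLB reach", the amount of memory that can be mapped without a miss. The sketch below does this for the two Samsung structures whose sizes were disclosed. It assumes 4KB pages and one page per entry, which is our assumption; larger pages would increase the reach proportionally.

```python
KIB = 1024
MIB = 1024 * KIB


def tlb_reach_bytes(entries: int, page_bytes: int = 4 * KIB) -> int:
    """Memory covered by a TLB if every entry maps one page of page_bytes."""
    return entries * page_bytes


itlb_reach = tlb_reach_bytes(entries=512)          # M3 ITLB
unified_reach = tlb_reach_bytes(entries=4096)      # M3 last-level unified TLB

print(f"512-entry ITLB: {itlb_reach // MIB} MiB of code with 4KB pages")
print(f"4096-entry unified TLB: {unified_reach // MIB} MiB with 4KB pages")
```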

45 Comments

name99 - Wednesday, August 22, 2018 - link
The issue is not technology (everyone knows CDMA basics because you need it on the data side for WCDMA aka 3G). The issue is licensing.
Maybe QC changes the licensing terms so that doing it in-house was simply not worth the cost? This seems very much in-line with the world-wide lawsuits against QC for various anti-competitive behavior.
petar_b - Saturday, November 3, 2018 - link
I think when they use Exynos they have less royalty fees to pay since it's their own processor. EU market is smaller than the US or Asia so they compensate by providing a chip with cheaper costs.

PS - CDMA, modem and all other reasons mentioned are irrelevant; other brands use Qualcomm in EU and all works perfectly.
AlB80 - Monday, August 20, 2018 - link
Unused cores do not consume energy at all regardless of voltage. Thus, to lose efficiency, there would have to be two threads on the performance cores: one that requires 3 GHz and a second that does not.
linuxgeex - Tuesday, August 21, 2018 - link
While technically you're right that the core itself doesn't draw power when it's powered down, there's other factors to consider:
1) the power cost to power it down and back up when you need it again. If you use the core intermittently then the cost of powering it up and down can exceed the cost of operating it for the workload, and then it may be more efficient to keep those jobs on the less efficient core(s)
2) the uncore, ie the busses and support logic that don't get powered down, can use nearly as much power as the cores themselves. In some CPUs, ie Ryzen2 TRW 2990WX the uncore actually consumes 76% of the die power when there's only 2 cores out of 32 active.
name99 - Monday, August 20, 2018 - link
"When Arm disclosed the A76 µarch details and particularly the 128-entry ROB (which in comparison seems quite small to the M3), they said that this was a balance between performance and area/power. In particular we saw a mention that a 7% increase in the ROB capacity only came with a 1% performance gain on average."

ROB per se costs very little, it's just a queue.
What is expensive is the physical register file, and the fact that it more or less scales with the size of the ROB (since most instructions generate one value, which consumes one physical register).
The trick, then, is what can you do to increase the size of the ROB (which allows you to do more work during dead periods while your ROB-blocking instruction at the head of the queue is waiting on DRAM) without paying the cost of the register file?

There is a bag of tricks for this, and as you make your CPU more advanced, you use more and more of them. They include
- clustering (so you duplicate the register file, and have half the execution units use one of the register files, the other half use the other register file. This works because there are quadratic aspects to the register file, so cutting some things in half [even if they are then duplicated] reduces area/power by a factor of four).
- various "resource amplification" techniques getting more use out of what you have. These might be giving your register file one fewer read ports, but then having a smart allocator that can cope if the reads are oversubscribed. Or it might be delayed register allocation and early register release (so the register is held for a shorter time). Or it might be various forms of instruction fusion.
- you can try to set aside instructions that you believe will be dependent on the blocking instruction, so that they do not even get allocated a register until the blocking instruction completes. There has been some very interesting recent work on how you can do this without requiring long range communication, so while this has been talked about for 20 years, we might soon see actual implementations.
- a variant on the above is you can measure how "critical" instructions are (ie does the rest of the computation get delayed if this one instruction gets delayed?). Based on this knowledge, you can send through critical instructions when resources are scarce, and delay non-critical until more resources are available.

The larger point here is that the reason Samsung et al are happy to tell you the info they are telling you (number of ROB slots, numbers of execution units, etc) is because this stuff is thoroughly uninteresting and uncompetitive. The competitive stuff is the sort of thing I have described above --- how does company A get 1.5x the performance from a certain level of HW versus company B --- and that's what no-one is ever willing to talk about...
Best you can do is see what look like good ideas in the academic literature of a few years ago and then assume that at least some of them have been picked up.
jospoortvliet - Wednesday, August 29, 2018 - link
Mja this stuff might be uninteresting for deep techies but for many it is nice to get some degree of comparison between the various CPU's being built.
name99 - Monday, August 20, 2018 - link
"The fetch unit’s bandwidth has been doubled and now can read up to 48 bytes per cycle which corresponds to 12 32b instructions per cycle – this results in a 2:1 ratio of fetch versus decode capacity which is an increase over the 1.5:1 ratio (24B/c, 4 decode) in the M1. Samsung explains that the big increase is needed to combat the increasingly big problem of branch bubbles on wider microarchitectures. They admit that on average, the distance between taken branches is less than 12 instructions, but the larger width helps a lot for temporary bursts of instructions."

One point you missed is that it's generally not worth the (area+power) cost to allow two lines per cycle to be read from your I-cache. So that 12-wide fetch is a maximum, which can only be reached if the starting address is one of the first four (of sixteen) instructions in the line. Every later instruction gives you a shorter fetch because you only extend to the end of the line.
Averaged across all possibilities, your average fetch width is something like nine, which is still larger than the six sustained you need, but not quite as extravagant as it seems.

It's also the case that the way these multi-level branch predictors work (at least sometimes, who knows exactly what SS are doing) is that a first fast prediction is made, then the next cycle or two, that prediction is confirmed against the larger prediction data structures. If the later prediction disagrees, you get a flush --- but ideally you can just flush what's in the I-queue before it hit decode, you don't have to flush the entire pipeline. Point is, however, you are now also using up some fraction of that 9-wide-fetch (on average) on fetches that get tossed before they even hit decode :-(
And for this to work well, you want the I-queue to ALWAYS be somewhat fullish, so you want it filling up fast, right after any sort of flush event.

So all things considered, 12-wide fetch is probably optimal, not at all extravagant.
jospoortvliet - Wednesday, August 29, 2018 - link
Keep commenting I love this ;-)
eastcoast_pete - Monday, August 20, 2018 - link
Andrei, Thanks for the in-depth coverage, especially the added information from your own deep dive from a little while ago! I wonder if Samsung even commented on the software-induced self-inflicted injury that really hogtied what, by its specs, should have been a faster alternative to the A75/845. Also, did anybody from Samsung thank you for showing them how to partially improve the M3's performance in your deep dive, at least informally/in the hallway? Somehow, the M3 is a very "Samsung" product: okay - great hardware, flawed - awful software.
All that being said: any mentioning or rumors on "Windows on Exynos"?
Andrei Frumusanu - Monday, August 20, 2018 - link
Today's disclosures are just on the µarch - the CPU design teams are not in charge of any of the software, which is the responsibility of S.LSI and Samsung Mobile.
