Hot Chips 2018: Samsung’s Exynos-M3 CPU Architecture Deep Dive

Author: Andrei Frumusanu

by Andrei Frumusanu on August 20, 2018 1:00 PM EST


Physical Layout: Making Sense of the Silicon Blocks

Exynos M1 Core Layout

Exynos M3 Core Layout

Samsung delights us with this year's disclosure by breaking down the core’s floor plan in this slide. I’m pretty happy to have been almost accurate in delimiting the various functional blocks in the original review article, working from the medium-resolution die shot I had at hand.

Here are some short explanations of the terms:

pL2: Private L2 cache, here we see the 512KB cache implemented in what seems to be two banks/slices.
FPB: Floating point data path; the FP and ASIMD execution units themselves.
FRS: Floating point schedulers as well as the FP/vector physical register file memories.
MC: Mid-core, the decoders and rename units.
DFX: This is debug/test logic and stands for “design for X” such as DFD (Design for debug), DFT (Design for test), DFM (Design for manufacturability), and other miscellaneous logic.
LS: Load/store unit along with the 64KB of L1 data cache memories.
IXU: Integer execution unit; contains the execution units, schedulers and integer physical register file memories.
TBW: Transparent buffer writes, includes the TLB structures.
FE: The front-end including branch predictors, fetch units and the 64KB L1 instruction cache memories.

Exynos 9810 Floor Plan. Image Credit TechInsights

Overall, compared to the M1, almost all of the functional units in the M3 have vastly increased in size. The end product comes to 2.52mm² for the core’s functional blocks, plus another 0.98mm² for the 512KB L2 cache and logic.
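As a quick sanity check on those figures, the per-core total can be summed directly. The sketch below is only arithmetic on the two numbers quoted above; the four-core multiplication assumes each core and its L2 are simply replicated, and it leaves out the L3 and any other cluster logic, for which no area is given here.

```python
# Area figures quoted in the text, in mm^2.
CORE_LOGIC_MM2 = 2.52    # functional blocks of one M3 core
L2_MM2 = 0.98            # 512KB private L2 cache and its logic
CORES_PER_CLUSTER = 4    # assumption: the four M3 cores are identical copies


def core_area_mm2() -> float:
    """Area of one core including its private L2."""
    return round(CORE_LOGIC_MM2 + L2_MM2, 2)


def cluster_core_area_mm2() -> float:
    """Area of four cores with their L2s; shared L3 and other logic excluded."""
    return round(CORES_PER_CLUSTER * core_area_mm2(), 2)


print(f"One core + L2: {core_area_mm2():.2f} mm^2")
print(f"Four cores + L2s: {cluster_core_area_mm2():.2f} mm^2")
```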

Exynos 9810 Floor Plan. Image Credit TechInsights

Here Samsung showcases the whole cluster floor plan, again marking the 4 cores laid out next to each other in a row, with the L2 and L3 slices also placed in an orderly fashion next to each other. This layout seems to have saved some layout effort, as each block is designed once and then simply replicated 4 times.

59% Higher IPC Across Variety of Workloads

Finally, Samsung talks a bit about their performance profiling infrastructure and how they run a wide variety of workload traces through the RTL and model simulators in order to evaluate design choices, find mistakes, and fine-tune the µarch.

In this slide we finally have an official figure for the core's IPC increase: ~59%. I had estimated >50% at the beginning of the year, so I'm glad to see that work out in the end. As we see in the graph, the increase is naturally not uniform across all workloads: we see limited increases of only 25% in high-ILP workloads, down to next to no increase in what are likely MLP workloads. Conversely, there are also a lot of mixed workloads where the IPC increase is >80%.
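To put the IPC figure in context, a first-order model treats performance as IPC multiplied by clock frequency. The sketch below applies that model to the ~59% average; it assumes the M2's 2.3GHz peak corresponds to 2314MHz and the M3's peak to 2704MHz, and it ignores memory latency and throttling, so it should be read as a rough upper bound rather than a prediction of any benchmark score.

```python
def relative_performance(ipc_gain: float, old_mhz: float, new_mhz: float) -> float:
    """First-order model: performance scales with IPC x clock frequency."""
    return (1.0 + ipc_gain) * (new_mhz / old_mhz)


AVG_IPC_GAIN = 0.59   # Samsung's average figure across its workload traces
M2_PEAK_MHZ = 2314    # assumption: the M2's "2.3GHz" peak
M3_PEAK_MHZ = 2704    # the M3's top frequency step

iso_clock = relative_performance(AVG_IPC_GAIN, M2_PEAK_MHZ, M2_PEAK_MHZ)
peak = relative_performance(AVG_IPC_GAIN, M2_PEAK_MHZ, M3_PEAK_MHZ)

print(f"Same clock: {iso_clock:.2f}x the M2")
print(f"Peak clock: {peak:.2f}x the M2")
```

Because the 59% is an average, individual workloads would land anywhere between the near-zero and >80% cases visible in the graph.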

Performance & Efficiency: Samsung's Data and AnandTech's Data

The next slides showcase a snippet of the performance improvements in GeekBench 4 between the M2, M3, and the A75, representing commercial performance on the Exynos 8895, Exynos 9810, and Snapdragon 845 respectively.

Again, we’ve already extensively covered the performance aspects of the SoC and microarchitecture in past articles.

To add to today’s µarchitecture article, I’m also including some new SPEC scores which improve on the original review data. The improvement comes from DVFS tweaking, further scheduler enhancements, a more synthetic testing environment, and more care in coping with the higher power draw at the M3’s maximum frequencies.


I won’t go over the details of the scores, but the performance improvements under the new conditions more closely represent the kind of big jump Samsung showcases in GB4.

Power efficiency has been a big topic for the M3, and here it is quite telling that they chose to omit results of competing solutions. As we’ve covered in our reviews, Samsung’s high boost clock of up to 2.7GHz comes at the price of very high required voltages and steeply rising power draw. Here, even though the M3 showcases leading-edge performance, it ends up less efficient than the Exynos 8895’s M2. The figures here represent active system power, meaning CPU, memory controller, and DRAM, much in the same way we measure it here at AT.
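Why high voltages hurt so much can be illustrated with the textbook CMOS dynamic power approximation, P ≈ C·V²·f. The sketch below applies it at the M3's three top frequency steps (1794, 2314 and 2704MHz). The voltages are hypothetical placeholders chosen only to show the trend; they are not measured or published values, and leakage power is ignored.

```python
def dynamic_power(c_eff: float, volts: float, mhz: float) -> float:
    """CMOS dynamic power approximation P ~ C * V^2 * f, in arbitrary units."""
    return c_eff * volts ** 2 * mhz


# (MHz, volts): the frequencies are the M3's three top steps; the voltages are
# HYPOTHETICAL placeholders, not measurements.
OPERATING_POINTS = [(1794, 0.80), (2314, 0.95), (2704, 1.10)]


def scaling_table(points=OPERATING_POINTS):
    """Return (mhz, relative frequency, relative power) versus the first point."""
    base_mhz, base_volts = points[0]
    base_power = dynamic_power(1.0, base_volts, base_mhz)
    return [
        (mhz, mhz / base_mhz, dynamic_power(1.0, volts, mhz) / base_power)
        for mhz, volts in points
    ]


for mhz, rel_freq, rel_power in scaling_table():
    print(f"{mhz} MHz: {rel_freq:.2f}x frequency, {rel_power:.2f}x power")
```

Under these assumed voltages, power grows much faster than frequency, which is the general shape behind the efficiency loss at the top clock.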

Reducing the clock to the same 2.3GHz as the M2, we see the M3 take the lead in efficiency, as per Samsung’s presentation.

To add to Samsung's data and give more context, I’m reposting the revised benchmark and efficiency overview from our own independently performed analysis of the platform. The chart below showcases the energy used to finish the workload suite, alongside the average power consumption during the test. The left bars represent the consumed energy in Joules; the shorter the bars (the less energy), the more efficient the platform. The right bars represent the performance score, with longer bars denoting better performance.
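The relationship between the two metrics in the chart is simply energy = average power × runtime. The sketch below uses made-up power and runtime values (they are not taken from the chart) to show how a platform that draws more power can still finish faster yet consume more total energy for the same fixed workload.

```python
def energy_joules(avg_power_w: float, runtime_s: float) -> float:
    """Energy to complete a fixed workload: average power times runtime."""
    return avg_power_w * runtime_s


# Hypothetical platforms as (average watts, seconds to finish the same suite).
# These values are illustrative only and do not come from the chart.
RUNS = {
    "fast but power-hungry": (4.0, 100.0),
    "slower but frugal": (2.0, 150.0),
}

for name, (watts, seconds) in RUNS.items():
    joules = energy_joules(watts, seconds)
    print(f"{name}: {seconds:.0f} s at {watts:.1f} W = {joules:.0f} J")
```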

I’ve also re-tested the workloads at the three top frequencies of the M3 (1794, 2314 and 2704MHz), giving us a wider idea of how efficiency scales with performance.

Overall, the M3 offers quite a wide dynamic range in its results. At (almost) equivalent peak performance to the competing A75 results for this generation, the M3 is able to post a good efficiency advantage. This lower performance point of the M3 still outperforms the 2.3GHz maximum performance of the M2, all while having significant power and energy efficiency advantages.

Clocked up to 2.3GHz, the M3 more clearly outperforms the A75, albeit at an efficiency hit in the integer workloads, while the FP workloads closely match the Arm competition.

Finally, the 2.7GHz results widen the performance gap, but come at a great cost in efficiency, using up more energy than any other recent SoC.

The fact that the E9810 had a cluster of 4 M3 cores running on the same frequency and voltage plane came at a cost to overall efficiency. Secondary threads that didn’t require the peak performance driven by a larger primary thread, but whose requirements were still bigger than the capacity of the little cores, had to take the large efficiency hit of running at the same bad efficiency point as the biggest thread in the cluster. This adds to the bad battery life scores we’ve come to measure.
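The shared-plane problem can be sketched as a simple selection rule: with one frequency and voltage plane, every core runs at whatever step the most demanding thread needs. The code below is a toy model, not Samsung's DVFS logic; it considers only the three top M3 steps, the per-thread demands are hypothetical, and the per-core alternative is shown purely for contrast.

```python
FREQ_STEPS_MHZ = (1794, 2314, 2704)  # the three top M3 frequency steps


def lowest_step_for(demand_mhz: float, steps=FREQ_STEPS_MHZ) -> int:
    """Lowest available step that meets a thread's demand (capped at the top step)."""
    for step in steps:
        if step >= demand_mhz:
            return step
    return steps[-1]


def shared_plane(demands_mhz):
    """One plane for the cluster: every core runs at the hungriest thread's step."""
    step = lowest_step_for(max(demands_mhz))
    return [step] * len(demands_mhz)


def per_core_planes(demands_mhz):
    """Hypothetical alternative for contrast: each core picks its own step."""
    return [lowest_step_for(demand) for demand in demands_mhz]


demands = [2600, 1500, 1700]  # hypothetical per-thread demands in MHz
print("shared plane:   ", shared_plane(demands))
print("per-core planes:", per_core_planes(demands))
```

In the shared case the two lighter threads are dragged up to the top step, which is the inefficiency described above.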

I’ve been able to resolve the scheduling issues in a custom kernel, improving the web browsing score further to 9h; however, there are still compromises that just can’t be resolved due to how the SoC operates. Here I expect Samsung to depart from the 4 “huge core” topology for the next-generation M4 and introduce something that will be a lot more power efficient in diverse multi-threaded scenarios.

45 Comments


HStewart - Monday, August 20, 2018 - link
One thing I'm curious about is why Samsung uses this CPU in European products and uses SnapDragon in the US and Japanese markets. The only thing I can think of is regulations for security on the chips.
Great_Scott - Monday, August 20, 2018 - link
My understanding is that the US version uses the SnapDragon for the integral modem.

I'm not exactly sure why this is, it might be an artifact of needing legacy CDMA support for Verizon/Sprint, which isn't needed for a world-phone.
beginner99 - Tuesday, August 21, 2018 - link
That is what I thought as well. CDMA. Not needed for rest of the world.
abufrejoval - Friday, August 24, 2018 - link
Unless you assume that some people actually travel. You might even argue that the average European is more likely to enter CDMA space in his lifetime than the other way around: Most US citizens only ever leave the country to fight a war, I keep hearing.

I prefer to choose myself rather than have choices mandated to me based on where I tend to be most of the time.
Ej24 - Friday, August 24, 2018 - link
Samsung's own Shannon modem works on CDMA. My Galaxy S6 on Verizon Wireless is proof of that. It's probably legal reasons: patents, licensing, whatever. It's annoying, because Qualcomm has put out some duds and we don't have a choice.
az060693 - Monday, August 20, 2018 - link
It's supposedly due to a 1993 patent licensing deal- https://www.androidcentral.com/qualcomm-licensing-...

Though in certain phone generations, it might also have been due to supply issues.
HStewart - Monday, August 20, 2018 - link
That sounds about right, but I believe the Qualcomm version is faster than the Samsung version. This is a problem with having the modem built into the chip: you have to use a different CPU depending on which modem is in the SoC.
bebby - Tuesday, August 21, 2018 - link
There is another more or less official reason - to reduce risks - instead of only relying on one chip, they use 2, one in-house designed and one external, in case the internal chip has issues. Samsung anyhow manufactures both in-house.
name99 - Monday, August 20, 2018 - link
At least one reason they use Snapdragon is for markets that still require voice CDMA, which QC modems provide.
aryonoco - Monday, August 20, 2018 - link
The S6 showed that Samsung has no problems making integrated CDMA modems.
