The Samsung Exynos 7420 Deep Dive - Inside A Modern 14nm SoC

Author: Andrei Frumusanu

by Andrei Frumusanu on June 29, 2015 6:00 AM EST


Power Management

Power management of previous big.LITTLE SoCs from Samsung was disappointing, as it showed few signs of optimization for efficiency and a general lack of attention to detail. The Exynos 7420 improves on this in several areas, some of which are tied to the 14nm process and others to software improvements.

Power management on modern ARM CPUs works in a few different ways. Firstly, DVFS (Dynamic Voltage and Frequency Scaling) mechanisms try to optimize power efficiency by running at the lowest possible frequency state that does not impact performance. Because lower frequency states require a lower operating voltage, they intrinsically use less energy for a given fixed workload. Switching between these P-states (performance states) is arbitrated by a so-called CPU frequency governor, which works within the Linux kernel’s CPUFreq framework.

Since Android 4.1 Jelly Bean, Google has standardized the “interactive” CPU governor as a part of Android, and the vast majority of devices adopt it as the default governor, although vendors may modify it. The interactive governor is a relatively simple concept: at a fixed sampling interval (20ms), it checks the load of the CPU. If the load exceeds the target load at the current frequency, it changes to a frequency that accommodates the current load within the target load threshold. The target load threshold is a parameter that describes what percentage of CPU capacity we want the CPU to be at when scaling up to a certain P-state. If the load spikes quickly and far exceeds the target load, a secondary threshold called the high-speed load threshold forcefully scales the CPUs to a fixed higher frequency, which on the Exynos 7420 is 900MHz for the A53 cores and 1200MHz for the A57 cores. If the load has been stable and the newly computed target frequency is consistently lower for 4 sample periods, meaning 80ms, the governor scales the frequency back to a lower state.
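The scaling loop described above can be sketched roughly as follows. This is a simplified model of the behaviour as described in this article, not the kernel's actual interactive governor code: the class name, the frequency table and the 95% high-speed load value are assumptions made for illustration, while the 1200MHz high-speed frequency and the 4-sample down-scaling delay come from the text, and the 75% target load is taken from the 25% idle headroom mentioned later in this article.

```python
class InteractiveGovernorSketch:
    """Toy model of the interactive governor logic described in the article."""

    def __init__(self, freqs_mhz, target_load=0.75, go_hispeed_load=0.95,
                 hispeed_freq_mhz=1200, down_samples=4):
        self.freqs_mhz = sorted(freqs_mhz)        # hypothetical P-state table
        self.target_load = target_load            # 75%: assumed from the 25% idle headroom
        self.go_hispeed_load = go_hispeed_load    # hypothetical value
        self.hispeed_freq_mhz = hispeed_freq_mhz  # A57 value from the article
        self.down_samples = down_samples          # 4 x 20 ms = 80 ms
        self.cur_mhz = self.freqs_mhz[0]
        self._lower_streak = 0

    def _freq_for(self, demand_mhz):
        # Lowest P-state at which the demand stays within the target load.
        needed = demand_mhz / self.target_load
        for f in self.freqs_mhz:
            if f >= needed:
                return f
        return self.freqs_mhz[-1]

    def sample(self, load):
        """Process one sampling period; load is the busy fraction (0..1)
        measured at the current frequency. Returns the new frequency."""
        target = self._freq_for(load * self.cur_mhz)
        # Load spike: jump straight to at least the high-speed frequency.
        if load >= self.go_hispeed_load and self.cur_mhz < self.hispeed_freq_mhz:
            target = max(target, self.hispeed_freq_mhz)
        if target > self.cur_mhz:
            self.cur_mhz = target                 # scale up immediately
            self._lower_streak = 0
        elif target < self.cur_mhz:
            self._lower_streak += 1               # only scale down after a
            if self._lower_streak >= self.down_samples:  # stable run of samples
                self.cur_mhz = target
                self._lower_streak = 0
        else:
            self._lower_streak = 0
        return self.cur_mhz


gov = InteractiveGovernorSketch([800, 1000, 1200, 1500, 1800, 2100])
history = [gov.sample(1.0)] + [gov.sample(0.3) for _ in range(4)]
print(history)
```

In this toy run the governor jumps up on the first fully loaded sample and only drops back once four consecutive samples have asked for a lower frequency.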

Samsung tries to optimize the interactive governor to improve big.LITTLE scaling by introducing some new operating modes which alter the configurables of the interactive scaling logic on the fly. For example, if only a single big CPU passes a load threshold of 95%, the governor enters “single-load” mode, which reduces the scaling thresholds for easier increases in frequency and also sets up a quality-of-service minimum frequency request for the small cores. I’m not too sure why they forcefully raise the frequency on the small cores when load is high on the big cores, but Samsung must have profiled frequency scaling and decided that this is a beneficial change. Another mode, triggered on top of the single-load mode, applies when the cumulative load across all 4 CPUs exceeds a certain threshold. This multi-load mode again changes the scaling parameters, making them more lax so that the CPUs scale up more easily.

These changes had already been implemented in the Exynos 5433 as well, but were never effectively used, as the parameters remained at their default values and thus brought no improvement to the scaling mechanism. The Galaxy Alpha's 5430 did have the settings correctly set up, but then again Meizu's MX4Pro didn't, meaning we're either seeing an unlikely deliberate design decision or, what I find the more likely and reasonable explanation, an oversight on the part of the software teams.

It looks like these modifications are mostly aimed at improving the performance and reaction time of the DVFS scaling, and it looks like, due to these changes, the Exynos 7420 behaves much better in that regard. Samsung’s handling of frequency scaling is generally very good, as the governor does well in its task. There are also a large number of QoS (Quality-of-Service) mechanisms, used by a variety of drivers, which are able to instantly request that the CPU transition to a minimum frequency. One example is the screen touch booster: this is an independent scaling mechanism that is able to control the CPU frequency of both clusters, as well as to tell the scheduler to force migrations onto the big cores for better reaction time and UI fluidity as soon as the display driver receives an interrupt request from the touch controller. Another scenario would be IP blocks in the media pipeline – blocks such as the 2D composer or the hardware video accelerator are predictable in terms of the required memory bandwidth and CPU capacity, so their drivers will dynamically put performance floors on the device’s DVFS mechanism to guarantee throughput. Samsung goes as far as to also use a QoS system for I/O bandwidth for the NAND, modem and WiFi, as well as for IPC (inter-process communication).
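The performance-floor idea can be illustrated with a small sketch: each driver registers a minimum frequency, and the frequency actually applied is the governor's choice raised to the highest active floor. The class, the client names and the frequency values below are hypothetical and only illustrate the concept; they are not the kernel's QoS interface or Samsung's driver code.

```python
class FreqQosSketch:
    """Tracks minimum-frequency requests from several hypothetical clients."""

    def __init__(self):
        self._floors = {}

    def request(self, client, min_mhz):
        # A driver asks for a performance floor (replaces its previous request).
        self._floors[client] = min_mhz

    def release(self, client):
        # The driver no longer needs its floor.
        self._floors.pop(client, None)

    def floor(self):
        # The strictest (highest) active floor wins; 0 means no constraint.
        return max(self._floors.values(), default=0)


def effective_freq(governor_mhz, qos):
    # The governor's choice is never allowed to drop below the QoS floor.
    return max(governor_mhz, qos.floor())


qos = FreqQosSketch()
qos.request("touch_booster", 1200)   # hypothetical floor on a touch interrupt
qos.request("video_decoder", 1000)   # hypothetical floor for media playback
print(effective_freq(800, qos))      # floor from the touch booster applies
qos.release("touch_booster")
print(effective_freq(800, qos))      # only the video decoder floor remains
```

The point of the design is that the governor keeps running unchanged; the floors simply clamp its output from below for as long as a driver holds a request.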

Of course, beyond DVFS scaling as a power management mechanism, all modern devices also offer clock- and power-gating. For the CPU this is again controlled by the kernel, within a mechanism called the CPUIdle framework. In the past, before hardware had such power-saving mechanisms, idling a system usually meant that it was running infinite loops of NOPs (no operation) until it got interrupted to do some actual work. Today, instead of running inefficient idle loops, the scheduler calls the CPUIdle governor, telling it to do “nothing”. The CPUIdle governor accumulates statistics on how long each idle period is and, based on this data, is able to choose from a variety of deeper or shallower hardware idle states. On ARM CPUs since the A15/A7 this is mostly consolidated into 3 so-called C-states: a clock-gating state called WFI (Wait-for-interrupt), an individual core power-gating state and a cluster power-gating state.

WFI is an instruction-level, architectural power-management state with extremely low latency that stops the clock to a given CPU. Stopping the clock avoids the CPU’s dynamic power consumption, so this is a crucial part of doing “nothing” in mobile CPUs. Individual core power-gating states go further and turn off power to the CPU entirely. This is a deeper state, as the CPU needs to save its state upon entry and restore it upon waking up. On the 7420 we’re talking about exit latencies of 100µs. Because of the overhead of restoring the CPU state, it’s also not worth entering these modes for short idle periods; the minimum worthwhile idle duration is called the residency time. For the A57 cores this residency threshold is 2000µs and for the A53 cores 750µs. When all cores within a cluster are idle, the whole cluster is allowed to be powered down. This of course has larger overhead, with larger exit latencies (300µs) and greater minimal residency times (5ms). The cluster power-down is largely used on the big cluster, as the small cluster is only allowed to power itself down when the screen is off. A very low-hanging fruit which has finally been picked by Samsung is to have optimized configuration values for each cluster. Previous Samsung SoCs would oddly just use a single driver with the same settings for both clusters, which didn’t make much sense and likely impacted CPU idle efficiency.
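A minimal sketch of how an idle governor could pick between these three states is shown below. It uses the residency and exit-latency figures quoted above, but the selection function, the WFI numbers (placeholders standing in for "extremely low") and the reuse of the 100µs, 300µs and 5ms values for both clusters are assumptions for illustration; the real CPUIdle governors use more elaborate prediction.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class IdleState:
    name: str
    exit_latency_us: int
    target_residency_us: int


def make_states(core_residency_us):
    # States are ordered from shallowest to deepest.
    return [
        IdleState("WFI", 1, 1),  # placeholder values, not from the article
        IdleState("core power-down", 100, core_residency_us),
        IdleState("cluster power-down", 300, 5000),
    ]


A57_STATES = make_states(2000)  # big cores: 2000 us residency
A53_STATES = make_states(750)   # little cores: 750 us residency


def select_state(states, predicted_idle_us, latency_limit_us, cluster_idle=False):
    """Pick the deepest state whose residency fits the predicted idle time
    and whose exit latency is acceptable."""
    chosen = states[0]  # WFI is always available
    for state in states[1:]:
        if state.name == "cluster power-down" and not cluster_idle:
            continue  # needs every core in the cluster to be idle
        if (state.target_residency_us <= predicted_idle_us
                and state.exit_latency_us <= latency_limit_us):
            chosen = state
    return chosen


# A predicted 1 ms idle period is too short for an A57 core to power down,
# but long enough for an A53 core.
print(select_state(A57_STATES, 1000, 1000).name)
print(select_state(A53_STATES, 1000, 1000).name)
```

The different residency values per cluster are exactly what makes the same 1ms idle period resolve to different states on the big and little cores.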

The 14nm process seems to have introduced a change in the dynamic between the two CPU clusters, as the efficiency of each cluster has scaled differently. This has a significant impact on the way the GTS scheduler settings are set up, as the new chipset’s power efficiency curves are closer to each other than on the Exynos 5433. To demonstrate this, I took the SPECint2000 scores of each cluster to determine the IPC difference between the two architectures, and then used this as a ratio to normalize the A57 perf/W curve to the A53’s clocks. On the first set of charts the vertical axis is just an arbitrary normalized value of MHz/mW for the A53 cores, and the A57 curve uses a multiplier ratio of 2.09 to scale the efficiency value and thus represent the IPC increase of the larger architecture.
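The normalization amounts to a single multiplication, sketched below. Only the 2.09 ratio comes from the text; the function name and the frequency and power figures in the usage lines are made-up sample values, not measurements.

```python
A57_IPC_RATIO = 2.09  # A57 vs. A53 per-clock ratio derived from SPECint2000


def normalized_efficiency(freq_mhz, power_mw, ipc_ratio=1.0):
    """Efficiency in A53-equivalent MHz per mW."""
    return freq_mhz / power_mw * ipc_ratio


# Hypothetical sample points, for illustration only.
a53_point = normalized_efficiency(1000, 500)
a57_point = normalized_efficiency(1000, 1000, A57_IPC_RATIO)
print(a53_point, a57_point)
```

With the ratio applied, both clusters sit on the same vertical scale, so a higher value means more A53-equivalent work per milliwatt regardless of which core type produced it.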

I’ll get back to actual perf/W charts in just a bit, but first I want to explain why the perf/MHz/W curves are an important metric we can deduce a lot from. Currently the Linux kernel and GTS mechanism see load on a frequency-invariant scale; what this means is that if a process takes up 50% of the CPU while it’s running at 500MHz and its maximal scaling frequency is 1GHz, the scheduler will account the task as a 25% load on that CPU. This mechanism is meant to normalize current load to the maximum possible capacity of a CPU, and not just the current one.
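The frequency-invariant accounting from the example above works out as follows; the function name is made up for illustration.

```python
def frequency_invariant_load(busy_fraction, cur_freq_mhz, max_freq_mhz):
    """Scale the load measured at the current frequency to the CPU's
    maximum capacity."""
    return busy_fraction * cur_freq_mhz / max_freq_mhz


# 50% busy at 500 MHz on a CPU that can reach 1 GHz counts as a 25% load.
print(frequency_invariant_load(0.50, 500, 1000))
```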

The trigger points that determine thread migrations in GTS are called the up- and down-thresholds, which are thresholds on the load scales of the CPUs. For the Exynos 5433 Samsung used 50% and 25% as the up- and down-thresholds. When a thread exceeded 50% of the A53’s capacity it would be migrated over to the big cores, and once on the big core, if the task fell below 25% of the CPU’s capacity, it would migrate back down. On the 7420 these values are set up slightly differently, as Samsung configured the default values at 46.7% and 20.8%. At first I was confused to see such specific values and didn’t fully understand why they were set up as such until I calculated the actual performance/W curves of both CPU clusters.
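The threshold logic can be sketched as a small decision function. The threshold values are the ones quoted above; the function and cluster names are illustrative, and the load is assumed to be the frequency-invariant load of the thread on whichever cluster it currently runs on.

```python
THRESHOLDS = {
    "Exynos 5433": (0.50, 0.25),    # (up-threshold, down-threshold)
    "Exynos 7420": (0.467, 0.208),
}


def next_cluster(current, load, soc="Exynos 7420"):
    """Return the cluster a thread should run on after this load update."""
    up, down = THRESHOLDS[soc]
    if current == "little" and load > up:
        return "big"       # too heavy for the A53 cluster
    if current == "big" and load < down:
        return "little"    # light enough to move back down
    return current         # otherwise stay put


print(next_cluster("little", 0.48))                  # migrates up on the 7420
print(next_cluster("little", 0.48, "Exynos 5433"))   # stays on the 5433
```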

One will have noticed the arrows I put on the graphs – these represent the theoretical point where a thread should migrate up to the big cores, or down to the little cores. Very attentive readers will notice that the up-threshold arrows aren’t at the mentioned 50% and 46.7% frequency points of the little cores. This is because the CPU frequency governor should actually be able to scale up the frequency faster than the task can trigger a scheduler migration by hitting the normalized up-threshold. For example, the 50% up-threshold of the 5433 would mean a 100% load at 800MHz on the A53 cores, but that will realistically never happen, as the CPU will have scaled up to a higher frequency by then. The 5433 governor will try to maintain 10% of idle capacity when scaling to a frequency while the 7420 seeks 25%, meaning the latter has more lax settings which make it scale higher in frequency even though the load doesn’t require it. The result is that the average effective performance/capacity point where threads on the little CPUs will migrate to the big cores is slightly below 900MHz for the 5433 and just above 1100MHz for the 7420.
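One way to read the 5433 figure is as simple arithmetic: if the up-threshold corresponds to a full load at 800MHz and the governor keeps 10% of capacity idle, the same amount of work ends up running at about 800 / 0.9 MHz. The sketch below reproduces only that step; it is an interpretation of the paragraph above rather than the exact method used for the charts, and the function name is made up.

```python
def settled_freq_mhz(full_load_freq_mhz, idle_headroom):
    """Frequency at which the governor settles if it keeps idle_headroom
    (a fraction, e.g. 0.10) of capacity free for a load that would fully
    occupy the CPU at full_load_freq_mhz."""
    return full_load_freq_mhz / (1.0 - idle_headroom)


# Exynos 5433: 100% load at 800 MHz with 10% idle headroom.
print(round(settled_freq_mhz(800, 0.10), 1))
```

The result lands slightly below 900MHz, in line with the 5433 figure quoted in the text.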

For the down-threshold of the big cores the logic is a tad simpler, because the scaling-down mechanism of the frequency governor is slower than the scheduler’s migration mechanism. This means that the arrow depicted in the graphs marks the lowest point at which a thread will migrate down, and a down-migration might happen at any of the higher frequencies.

When plotting the efficiency points on an axis depicting the absolute performance of the cores, we get a much clearer picture of what big.LITTLE is supposed to achieve. And this is where we see a large difference between the 5433 and 7420: the way the Note 4 is currently set up makes it migrate threads up sooner than the Galaxy S6 does, and the efficiency degradation when doing so is much greater. An optimal implementation would be a device where the up- and down-migration points are as close as possible to each other on the efficiency axis, while having a slight jump on the performance axis that acts as a hysteresis to avoid migrations when a load falls in between the two performance curves.

It seems to me that Samsung paid much better attention to efficiency optimizations in the Exynos 7420’s software, as it fixes many of the weird configuration issues of the Note 4 Exynos. The Exynos 7420 joins the Exynos 5430 (and MediaTek’s MT6595, which I’ll hopefully address sometime soon) as one of the rare SoCs which are able to rein in ARM’s big CPU core designs in a small form-factor mobile device and effectively use big.LITTLE without major downsides. While Samsung’s software stack could definitely improve with features such as full energy awareness inside the scheduler, it's no longer as misconfigured and as bad as I described it in the Exynos 5433 review.

In terms of maximum power consumption, I think 1.9GHz would have been a slightly more reasonable cap for the A57 cores, as on some occasions, such as updating many apps or visiting a very heavy ad-ridden site, the device can load up the big CPUs to their full capacity and run a bit hot. But it’s a rare occasion, and the vast majority of processing time is spent at the lower frequencies. It will be interesting to see what ARM's A72 processor core will be able to achieve in terms of performance and power efficiency. For 2015, though, it seems Samsung's A57 SoC still remains king due to its process node advantage.


114 Comments


lilmoe - Monday, June 29, 2015 - link
I've waited long for this piece. Thanks for the hard work.

Now for the rant. All I've read in this article leads me to think that the tech blog community is partially to blame for the stupid benchmark and resolution race that has negative effects on consumer perception. Some bloggers (NOT consumers) now swear by 1440p when the difference is so minimal, you literally have to be 1 inch far from the screen to even slightly tell, and nowhere near the transition from 720p to 1080p on ~5". A more balanced SoC, with better thermal and voltage limits, and a less stressful 1080p screen would have made the GS6 a much better device in both performance and battery life...

I see the GS6 sell like hotcakes every time I'm at the mall or electronics shops, but no one, NO ONE, knows (or cares) about 14nm or 1440p. All people care about is design and build. THEN when it starts to sink in, they start caring about fast, consistent performance, battery life, a good speaker, and a nice (BRIGHT) screen (which 1080p is already more than capable of delivering, and till now, Apple is STILL getting away with 720p).

The only one who "gets it" (except for the battery life) is Apple. This is mainly for the sole reason that Apple can afford relying more on its brand name (and a couple of stupid buzzwords) rather than numbers and benchmarks. But GOD do you I HATE iOS, their ecosystem, and limiting ways of doing things.

Yup. It's best to skip all this and wait for what they'll bring with the GS7. Samsung tried hard to merge what I believe is an engineering marvel that is the GS5 with the great design and aesthetics of the iPhone 6 and, IMHO as a Samsung fan, didn't totally deliver with the GS6.
gnx - Monday, June 29, 2015 - link
+1 And hopefully they'll revamp Touchwiz too.
ruturaj1989@gmail.com - Monday, June 29, 2015 - link
I guess VR was one of the reason for 1440p.
larryvand - Monday, June 29, 2015 - link
VR on a 720p screen is a total FAIL. My S6 Edge does VR like nothing else. Best phone on the planet.
jjj - Monday, June 29, 2015 - link
Apple's screen res is about costs, they couldn't go from 4 inch devices to 4.7 and 5.5 with higher res without harming the margins. Plus they always save some upgrades for the next cycle to give people a reason to buy.
phoenix_rizzen - Monday, June 29, 2015 - link
Sony "gets it" as well. The Z3 is only 1080p, and the Z3 Compact is 720p. Both use the same SoC, running at the same speeds. So you can get a large screen with good performance and battery life, or you can get a smaller screen with better performance and battery life.

It's just too bad they don't market their phones as much in North America as they used to.
lilmoe - Monday, June 29, 2015 - link
Make Sony put a good OLED screen on their phones and I'm in. The Z3 is an amazing phone, yes, but damn that screen knows NO blacks.
YoloPascual - Monday, June 29, 2015 - link
TLDR; Mediatek the real MVP
Refuge - Monday, June 29, 2015 - link
lol fuck it +1
gnx - Monday, June 29, 2015 - link
Fabulous review! Way above my pay-grade, but nice to read and understand what makes the differences that end-users like myself experience.

One question, Andrei. I know this may not be up your alley, but any reason why other OEMs aren't buying the Exynos 7420, with Snapdragon 810 now confirmed as a miss? Is it that Samsung has its hands full with just production for its sibling's SGS6s? (Or perhaps also producing for Apple?) Or is there also a steep price difference? Or is it that there are inherent reasons for OEMs like Xiaomi or Meizu or even HTC to not use Samsung parts?
