GPU Power Consumption and Thermal Stability


GPU Power Consumption

The Kirin 960 adopts ARM’s latest Mali-G71 GPU, and unlike previous Kirin SoCs, which tried to balance performance and power consumption by using fewer GPU cores, the 960’s 8-core configuration shows a clear focus on increasing peak performance. More cores also mean more power, however, which raises concerns about sustained performance.

We measure GPU power consumption with a method similar to the one we use for the CPU: running the GFXBench Manhattan 3.1 and T-Rex performance tests offscreen, we calculate system load power by subtracting each device’s idle power from its total active power during the test, using the device’s onboard fuel gauge to collect data.
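A minimal sketch of the calculation (our logging tooling is more involved, but the math reduces to this):

```python
# Sketch of our GPU power methodology: average the fuel-gauge samples taken
# while the offscreen benchmark runs, subtract previously measured idle power,
# and divide average FPS by the resulting system load power.

def system_load_power(active_samples_w, idle_power_w):
    """Average active power minus idle power = power attributable to the load."""
    return sum(active_samples_w) / len(active_samples_w) - idle_power_w

def efficiency_fps_per_watt(avg_fps, load_power_w):
    return avg_fps / load_power_w

# Example using the Mate 9's Manhattan 3.1 results from the table below:
# 32.49 fps at 8.63W of system load power.
print(round(efficiency_fps_per_watt(32.49, 8.63), 2))
# -> 3.76, matching the table's 3.77 fps/W within rounding of the raw data
```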

GFXBench Manhattan 3.1 Offscreen Power Efficiency (System Load Power)

| Device (SoC) | Mfc. Process | FPS | Avg. Power (W) | Perf/W Efficiency |
|---|---|---|---|---|
| LeEco Le Pro3 (Snapdragon 821) | 14LPP | 33.04 | 4.18 | 7.90 fps/W |
| Galaxy S7 (Snapdragon 820) | 14LPP | 30.98 | 3.98 | 7.78 fps/W |
| Xiaomi Redmi Note 3 (Snapdragon 650) | 28HPm | 9.93 | 2.17 | 4.58 fps/W |
| Meizu PRO 6 (Helio X25) | 20Soc | 9.42 | 2.19 | 4.30 fps/W |
| Meizu PRO 5 (Exynos 7420) | 14LPE | 14.45 | 3.47 | 4.16 fps/W |
| Nexus 6P (Snapdragon 810 v2.1) | 20Soc | 21.94 | 5.44 | 4.03 fps/W |
| Huawei Mate 8 (Kirin 950) | 16FF+ | 10.37 | 2.75 | 3.77 fps/W |
| Huawei Mate 9 (Kirin 960) | 16FFC | 32.49 | 8.63 | 3.77 fps/W |
| Galaxy S6 (Exynos 7420) | 14LPE | 16.62 | 4.63 | 3.59 fps/W |
| Huawei P9 (Kirin 955) | 16FF+ | 10.59 | 2.98 | 3.55 fps/W |

The Mate 9’s 8.63W average is easily the highest of the group and simply unacceptable for an SoC targeted at smartphones. With the GPU consuming this much power, it’s basically impossible for the GPU and even a single A73 CPU core to run at their highest operating points simultaneously without exceeding a 10W TDP, a value more suitable for a large tablet. The Mate 9 allows its GPU to reach 1037MHz too, which is a little silly. For comparison, the Exynos 7420 on Samsung’s 14LPE FinFET process, which also uses an 8-core Mali GPU (albeit an older Mali-T760), only goes up to 772MHz, keeping its average power below 5W.
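As a rough budget check, assuming a single A73 core draws on the order of 1.5W at its top operating point (an illustrative figure, not a measurement from the table above):

```python
# Rough TDP budget check: GPU at its peak operating point plus a single
# A73 core. The ~1.5W core figure is an illustrative assumption.
gpu_w, one_a73_w, tdp_w = 8.63, 1.5, 10.0
print(gpu_w + one_a73_w > tdp_w)  # -> True: the 10W budget is already blown
```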

The Mate 9’s average power is 3.1x higher than the Mate 8’s, but because peak performance goes up by the same amount, efficiency is essentially unchanged. Qualcomm’s Adreno 530 GPU in Snapdragon 820/821 is easily the most efficient with this workload, and despite achieving about the same performance as the Kirin 960, it uses less than half the power.
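A quick sanity check using the Manhattan 3.1 numbers from the table above:

```python
# Kirin 950 (Mate 8) vs. Kirin 960 (Mate 9), values from the table above
mate8_fps, mate8_w = 10.37, 2.75
mate9_fps, mate9_w = 32.49, 8.63

print(round(mate9_w / mate8_w, 2))      # -> 3.14x the power
print(round(mate9_fps / mate8_fps, 2))  # -> 3.13x the performance
# Power and performance scale by the same ~3.1x, so fps/W is unchanged.
```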

GFXBench T-Rex Offscreen Power Efficiency (System Load Power)

| Device (SoC) | Mfc. Process | FPS | Avg. Power (W) | Perf/W Efficiency |
|---|---|---|---|---|
| LeEco Le Pro3 (Snapdragon 821) | 14LPP | 94.97 | 3.91 | 24.26 fps/W |
| Galaxy S7 (Snapdragon 820) | 14LPP | 90.59 | 4.18 | 21.67 fps/W |
| Galaxy S7 (Exynos 8890) | 14LPP | 87.00 | 4.70 | 18.51 fps/W |
| Xiaomi Mi5 Pro (Snapdragon 820) | 14LPP | 91.00 | 5.03 | 18.20 fps/W |
| Apple iPhone 6s Plus (A9) [OpenGL] | 16FF+ | 79.40 | 4.91 | 16.14 fps/W |
| Xiaomi Redmi Note 3 (Snapdragon 650) | 28HPm | 34.43 | 2.26 | 15.23 fps/W |
| Meizu PRO 5 (Exynos 7420) | 14LPE | 55.67 | 3.83 | 14.54 fps/W |
| Xiaomi Mi Note Pro (Snapdragon 810 v2.1) | 20Soc | 57.60 | 4.40 | 13.11 fps/W |
| Nexus 6P (Snapdragon 810 v2.1) | 20Soc | 58.97 | 4.70 | 12.54 fps/W |
| Galaxy S6 (Exynos 7420) | 14LPE | 58.07 | 4.79 | 12.12 fps/W |
| Huawei Mate 8 (Kirin 950) | 16FF+ | 41.69 | 3.58 | 11.64 fps/W |
| Meizu PRO 6 (Helio X25) | 20Soc | 32.46 | 2.84 | 11.43 fps/W |
| Huawei P9 (Kirin 955) | 16FF+ | 40.42 | 3.68 | 10.98 fps/W |
| Huawei Mate 9 (Kirin 960) | 16FFC | 99.16 | 9.51 | 10.42 fps/W |

Things only get worse for the Kirin 960 in T-Rex, where average power increases to 9.51W and GPU efficiency drops to the lowest value of any device we’ve tested. As another comparison point, the Exynos 8890 in Samsung’s Galaxy S7, which uses a wider 12-core Mali-T880 GPU at up to 650MHz, averages 4.7W and is only 12% slower, making it 78% more efficient.
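Both figures fall straight out of the T-Rex table above:

```python
# Exynos 8890 (Galaxy S7) vs. Kirin 960 (Mate 9), values from the table above
print(round((99.16 - 87.00) / 99.16 * 100))  # -> 12 (% slower)
print(round((18.51 / 10.42 - 1) * 100))      # -> 78 (% more efficient)
```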

All of the flagship SoCs we’ve tested from Apple, Qualcomm, and Samsung manage to stay below a 5W ceiling in this test, and even then they are unable to sustain peak performance for very long before throttling back because of heat buildup. Ideally, we’d like to see phones remain below 4W in this test; pushing above 5W simply does not make sense.

[Chart: GFXBench Manhattan ES 3.1 / Metal Battery Life]

The Kirin 960’s higher power consumption has a negative impact on the Mate 9’s battery life while gaming. It runs for 1 hour less than the Mate 8, a 22% reduction that would be even more pronounced if the Mate 9 did not throttle back its GPU frequency during the test. Ultimately, the Mate 9’s runtime is similar to other flagship phones (with smaller batteries), while providing similar or better performance. To reconcile the Kirin 960’s high GPU power consumption with the Mate 9’s acceptable battery life in our gaming test, we need to look more closely at its behavior over the duration of the test.

GPU Thermal Stability

The Mate 9 only maintains peak performance for about 1 minute before reducing GPU frequency, with the frame rate dropping to 21fps after 8 minutes, a 38% reduction relative to the peak value. It reaches equilibrium after about 30 minutes, with the frame rate hovering around 19fps. This is still better than the phones using Kirin 950/955, which peak at 11.5fps with sustained performance hovering between 9 and 11fps, and it’s as good as or better than phones using Qualcomm’s Snapdragon 820/821 SoCs. The Moto Z Force Droid, for example, sustains a peak performance of almost 18fps for 12 minutes before gradually reaching a steady-state frame rate of 14.5fps, and the LeEco Le Pro3 sustains 19fps after dropping from a peak value of 33fps.

In the lower chart, which shows how the Mate 9’s GPU frequency and power consumption change during the first 15 minutes of the gaming battery test, we can see that once GPU frequency drops to 533MHz, average power consumption falls below 4W, a sustainable value that still delivers performance on par with other flagship SoCs after they’ve throttled back too. This suggests that Huawei/HiSilicon should have chosen a more sensible peak operating point of 650MHz to 700MHz for the Kirin 960’s GPU. The only reason to push GPU frequency to 1037MHz (at least in a phone or tablet) is to look better on a spec sheet and post higher peak scores in benchmarks.
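A hypothetical sketch of what a saner configuration could look like. The operating-point ladder below is invented for illustration; only the 1037MHz peak and the 533MHz throttle step are values observed in our testing:

```python
# Hypothetical sketch: capping the GPU DVFS governor so it never requests an
# unsustainable frequency. The OPP list is illustrative, not Kirin 960's
# actual frequency table.

GPU_OPPS_MHZ = [266, 400, 533, 690, 807, 960, 1037]  # illustrative ladder
SUSTAINABLE_CAP_MHZ = 690  # within the 650-700MHz range argued for above

def select_frequency(requested_mhz):
    """Clamp the governor's request to the highest sustainable OPP."""
    candidates = [f for f in GPU_OPPS_MHZ
                  if f <= min(requested_mhz, SUSTAINABLE_CAP_MHZ)]
    return max(candidates) if candidates else GPU_OPPS_MHZ[0]

print(select_frequency(1037))  # -> 690: peak requests land on the cap
print(select_frequency(500))   # -> 400: low requests are unaffected
```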

Lowering the peak GPU frequency would not fix the Kirin 960’s poor GPU efficiency, however. Because we do not have any other Mali-G71 examples at this time, we cannot say whether this is indicative of ARM’s new GPU microarchitecture (I suspect not) or the result of HiSilicon’s implementation and process choices.

Comments

  • Eden-K121D - Tuesday, March 14, 2017

    Samsung only
  • Meteor2 - Wednesday, March 15, 2017

    I think the 820 acquitted itself well here. The 835 could be even better.
  • name99 - Tuesday, March 14, 2017

    "Despite the substantial microarchitectural differences between the A73 and A72, the A73’s integer IPC is only 11% higher than the A72’s."

    Well, sure, if you're judging by Intel standards...
    Apple has been able to sustain about a 15% increase in IPC from A7 through A8, A9, and A10, while also ramping up frequency aggressively, maintaining power, and reducing throttling. But sure, not a BAD showing by ARM; the real question is whether they will keep delivering this sort of improvement at least annually.

    Of more technical interest:
    - the largest jump is in mcf. This is a strongly memory-bound benchmark, which suggests a substantially improved prefetcher. In particular, simplistic prefetchers struggle with it, suggesting a move beyond just next-line and stride prefetchers (or at least the smarts to track where these are doing more harm than good and switch them off; see the toy sketch after this list). People agree?

    - twolf appears to have the hardest branches to predict of the set, with vpr coming up second. So it's POSSIBLE (?) that their relative shortcomings reflect changes in the branch/fetch engine that benefit most apps but hurt specifically weird branching patterns?

    One thing that ARM has not made clear is where instruction fusion occurs, and so how it impacts the two-wide decode limit. If, for example, fusion is handled (to some extent anyway) as a pre-decode operation when lines are pulled into L1I, and if fusion possibilities are being aggressively pursued [basically all the ideas that people have floated: compare+branch, large immediate calculation, op+storage (?), short (+8) branch+op => predication like POWER8 (?)], there could be a SUBSTANTIAL fraction of fused instructions going through the system, so that the 2-wide decode is basically as good as the 3-wide of A72?
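    A toy illustration of the compare+branch case, assuming fusion is detected at decode; nothing here reflects ARM's actual implementation:

    ```python
    # Toy pre-decode fusion pass: pairs an AArch64 CMP with an immediately
    # following conditional branch into one macro-op (illustrative only).
    def fuse(instructions):
        fused, i = [], 0
        while i < len(instructions):
            op = instructions[i]
            nxt = instructions[i + 1] if i + 1 < len(instructions) else None
            if op.startswith("cmp") and nxt and nxt.startswith("b."):
                fused.append(f"{op} + {nxt}")  # one slot through decode/rename
                i += 2
            else:
                fused.append(op)
                i += 1
        return fused

    # With aggressive fusion, a 2-wide decoder can process more architectural
    # instructions per cycle than its nominal width suggests.
    print(fuse(["cmp x0, #0", "b.eq done", "add x1, x1, #1"]))
    # -> ['cmp x0, #0 + b.eq done', 'add x1, x1, #1']
    ```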
  • fanofanand - Wednesday, March 15, 2017

    Once WinArm (or whatever they want to call it) is released, we will FINALLY be able to compare apples to apples when it comes to these designs. Right now there are mountains of speculation, but few people actually know where things are at. We will see just how performant Apple's cores are once they can be accurately compared to Ryzen/Core designs. I have the feeling a lot of Apple worshippers are going to be sorely disappointed. Time will tell.
  • name99 - Wednesday, March 15, 2017

    We can compare Apple's ARM cores to the Intel cores in Apple laptops today, with both GeekBench and Safari. The best matchup I can find is this:
    https://browser.primatelabs.com/v4/cpu/compare/177...

    (I'd prefer to compare against the MacBook 12" 2016 edition with Skylake, but for some reason there seem to be no GB4 results for that.)

    This compares an iPhone (so ~5W max power?) against a Broadwell that turbos up to 3.1GHz (GB tends to run everything at max turbo speed because it allows the core to cool between the [short] tests), and with a TDP of 15W.

    Even so, the performance is comparable. When you normalize for frequency, the A10 comes out about 20% better in IPC than Broadwell, which probably drops to maybe 15% better against Skylake.
    Of course that A10 runs at a lower (peak) frequency --- but also at much lower power.

    There's every reason to believe that the A10X will absolutely beat the equivalent Skylake chip in this class (not just m-class but also U-class), running at a frequency of ?between 3 and 3.5GHz? while retaining that 15-20% IPC advantage over Skylake, and at a power of ?<10W?
    Hopefully we'll see in a few weeks --- the new iPads should be released either end March or beginning April.

    Point is --- I don't see why we need to wait for WinARM server --- especially since MS has made no commitment to selling WinARM to the public; all they've committed to is using ARM for Azure.
    Comparing GB4 or Safari on Apple devices gives us comparable compilers, comparable browsers, comparable OSs, comparable hardware design skill. I don't see what a Windows equivalent brings to the table that adds more value.
  • joms_us - Wednesday, March 15, 2017

    Bwahaha keep dreamin iTard, GB is your most trusted benchmark. =D

    Why don't you run both machines, with an A10 and a Celeron released in 2010? You will see how pathetic your A10 is in real-world apps.
  • name99 - Wednesday, March 15, 2017

    When I was 10 years old, I was in the car and my father and his friend were discussing some technical chemistry. I was bored with this professional talk of pH and fractionation and synthesis, so after my father described some particular reagent he'd mixed up, I chimed in with "and then you drank it?", to which my father said "Oh be quiet. Listen to the adults and you might learn something." While some might have treated this as a horrible insult, the cause of all their later failures in life, I personally took it as serious advice and tried (somewhat successfully) to abide by it, to my great benefit.
    Thanks Dad!

    Relevance to this thread is an exercise left to the reader.
  • joms_us - Wednesday, March 15, 2017

    Even the latest Ryzen is just barely equal to or faster than Skylake clock for clock, so what makes you think a worthless low-powered mobile chip will surpass them? The A10 is not even better than the SD821 in real-world app comparisons. Again, real-world apps, not AnTuTu, not Geekbench.
  • zodiacfml - Wednesday, March 15, 2017

    Intel's chips are smaller than Apple's. Apple also has the luxury to spend much on the SoC.
  • Andrei Frumusanu - Tuesday, March 14, 2017

    Stamp of approval.
