Kirin 950 - Performance

At the heart of the Mate 8 we find HiSilicon’s new Kirin 950 SoC. We were able to attend the official unveiling earlier in November and covered the new high-end chipset in our report on the announcement. The Kirin 950 is an important chipset for several reasons:

First of all, it seems that for the majority of markets Huawei insists on using its own in-house silicon even when it is evident that it may not be the best choice for the device. In past Huawei devices we’ve seen this as one of the root causes of disappointing battery life, as former chipsets displayed rather poor power efficiency. We also saw a regression in performance in last year’s devices as HiSilicon opted to rely on ARM's Cortex A53 cores to drive performance, rather than stick with 2014’s higher-performing A15 cores or release an A57 chipset to compete with Qualcomm’s Snapdragon 810/808 or Samsung’s Exynos 7420. This of course put Huawei’s flagship devices at a disadvantage, as it's a SoC's performance and efficiency that largely determine the competitive positioning of a smartphone.

The second factor (or collection of factors) that makes the Kirin 950 an important chipset lies in its raw specifications. This is the first shipping smartphone SoC equipped with ARM’s new Cortex A72 processor. The A72 promises large efficiency gains over the A57, which left the impression of a rather disappointing core because its power efficiency seemed to have suffered in comparison to A15 implementations of the past generation. Only Samsung was able to deliver satisfactory results with the Exynos 7420, something we’ll come back to as we get to the power efficiency numbers of the Kirin 950.

HiSilicon High-End Kirin Lineup

| SoC | Kirin 925 (Hi3620) | Kirin 935 (Hi3630) | Kirin 950 (Hi3650) |
|---|---|---|---|
| CPU | 4x Cortex A7 @ 1.3 GHz + 4x Cortex A15 @ 1.8 GHz | 4x Cortex A53 @ 1.5 GHz + 4x Cortex A53 @ 2.2 GHz | 4x Cortex A53 @ 1.8 GHz + 4x Cortex A72 @ 2.3 GHz |
| Memory Controller | 2x 32-bit LPDDR3 @ 800 MHz, 12.8 GB/s b/w | 2x 32-bit LPDDR3 @ 800 MHz, 12.8 GB/s b/w | 2x 32-bit LPDDR3 or LPDDR4 @ 1333 MHz (hybrid controller), 21.3 GB/s b/w |
| GPU | Mali T628MP4 @ 600 MHz | Mali T628MP4 @ 680 MHz | Mali T880MP4 @ 900 MHz |
| Encode/Decode | 1080p H.264 Decode & Encode | 1080p H.264 Decode & Encode | 1080p60 H.264 Decode & Encode, 2160p30 HEVC Decode |
| Integrated Modem | Balong Integrated UE Cat. 6 LTE | Balong Integrated UE Cat. 6 LTE | Balong Integrated UE Cat. 6 LTE |

To recap the specifications of the Kirin 950: This is a big.LITTLE SoC with 4x Cortex A72 cores at 2.3GHz and a further 4x Cortex A53 cores at 1.8GHz. HiSilicon touts a brand new memory controller that is able to support both LPDDR3 and LPDDR4. In the case of the Mate 8 we find 3 or 4GB of LPDDR4 running at 1333MHz, slightly lower than what we’ve seen on competing SoCs such as those from Samsung, Qualcomm and Nvidia.

One characteristic that might be defining for the Kirin 950 is that it still uses a CCI-400 fabric between the two CPU clusters and the chip’s NoC (Network-on-Chip). I mentioned a few weeks ago that this was a somewhat surprising revelation, as I was expecting the newer CCI-500 generation interconnect to make its debut. HiSilicon explained to us that this was due to time-to-market constraints and that the CCI-500 wasn't fully ready to be implemented in a SoC when the Kirin 950 was being finalized. The effect of this is that we might be looking at slightly lower memory performance than what we’ll be able to see from other upcoming A72 designs.

Of course this is also the first SoC to market to integrate ARM’s new Mali T880 GPU IP. We’ll be analysing the GPU performance and efficiency later on in the article; for now we’ll be focusing on CPU performance and power.

We didn’t see many consumer A57 designs from vendors, HiSilicon only used the core in its server and enterprise SoCs, and the Kirin 925 is two generations behind (we’ve covered the A15 vs A57 architectural and performance differences in our review of the Exynos 5433). I therefore chose to use the Exynos 7420 as the main comparison SoC against the Kirin 950, as it gives us the closest possible apples-to-apples evaluation of the new CPU architecture.

We start with SPECint2000 estimated benchmark numbers. Developed by the Standard Performance Evaluation Corporation, SPECint2000 is the integer component of their larger SPEC CPU2000 benchmark. Designed around the turn of the century, SPEC CPU2000 has officially been retired for PC processors, but mobile processors are roughly a decade behind their PC counterparts in performance. Keeping that in mind, it still provides an excellent benchmark for today's mobile phones and allows us to do single-threaded architectural comparisons between the competing CPU designs out there. The scores we publish are only estimates and should not be taken as officially validated numbers (which would require the test to be supervised by SPEC). Nevertheless, we try our best in choosing compiler flags and making the tests pass internal validation.

SPECint2000 - Estimated Scores (ARMv8 / AArch64)

| Test | Exynos 7420 | Kirin 950 | % Advantage |
|---|---|---|---|
| 164.gzip | 927 | 1085 | 17% |
| 175.vpr | 2592 | 3589 | 38% |
| 176.gcc | 1279 | 1833 | 43% |
| 181.mcf | 927 | 674 | -27% |
| 186.crafty | 1562 | 2083 | 33% |
| 197.parser | 900 | 1208 | 34% |
| 252.eon | 2407 | 3333 | 38% |
| 253.perlbmk | 1232 | 1666 | 35% |
| 254.gap | 1208 | 1641 | 36% |
| 255.vortex | 1472 | 1844 | 25% |
| 256.bzip2 | 1048 | 1219 | 16% |
| 300.twolf | 2097 | 2777 | 32% |

As we see, the Kirin 950 shows significant improvements in all but one test (181.mcf, where it regresses by 27%), averaging about 27% higher performance than the Exynos 7420. 176.gcc is the test that benefits most from the new architecture, with a large 43% improvement. We also have to keep in mind that the Kirin’s A72 is clocked at 2304MHz, which is roughly 200MHz or 10% higher than the Exynos 7420’s 2100MHz A57.
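The average quoted here can be re-derived from the estimated scores in the SPECint2000 table. The following is a minimal sketch, assuming the average is a plain arithmetic mean of the per-test advantages; the variable and function names are illustrative only.

```python
# Estimated SPECint2000 scores from the table: (Exynos 7420, Kirin 950)
SPECINT2000 = {
    "164.gzip": (927, 1085),
    "175.vpr": (2592, 3589),
    "176.gcc": (1279, 1833),
    "181.mcf": (927, 674),
    "186.crafty": (1562, 2083),
    "197.parser": (900, 1208),
    "252.eon": (2407, 3333),
    "253.perlbmk": (1232, 1666),
    "254.gap": (1208, 1641),
    "255.vortex": (1472, 1844),
    "256.bzip2": (1048, 1219),
    "300.twolf": (2097, 2777),
}


def advantage_pct(baseline, new):
    """Percentage advantage of `new` over `baseline`."""
    return (new / baseline - 1.0) * 100.0


per_test = {name: advantage_pct(exynos, kirin)
            for name, (exynos, kirin) in SPECINT2000.items()}

# Assumption: the article's average is an arithmetic mean of the percentages.
mean_advantage = sum(per_test.values()) / len(per_test)
best_test = max(per_test, key=per_test.get)
worst_test = min(per_test, key=per_test.get)

print(f"mean advantage: {mean_advantage:.1f}%")
print(f"largest gain: {best_test}, largest loss: {worst_test}")
```

A geometric mean would be the more conventional way to aggregate SPEC ratios and would give a slightly different figure; the arithmetic mean is used here only because it reproduces the number in the text.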

SPEC2000 32b Estimated Ratio/MHz

It’s been a long time since we had an IPC comparison between chipsets, so I went back and ran a 32-bit variant of SPEC (for a valid apples-to-apples comparison against older chipsets) across a range of SoCs from the Cortex A9 generation onward. At an estimated 0.77 SPECint2000 ratio score per MHz the Kirin 950 ends up at the top of the table, but it’s not clear how this compares to the Cortex A57, as we see quite different IPC scores among the A57 chipsets at hand. This is an interesting result: it shows that even though two SoCs can have the same high-level specifications on paper, implementation differences (be it software or hardware) can lead to quite different performance, even among products of a single vendor.

Going back to the official IPC increases that ARM shared with us, the reported ~16% increase in integer performance per cycle seems to be quite spot-on when compared to the results of our tests.
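As a rough cross-check of that per-cycle figure, the clock difference can be divided out of the average score advantage. This is only a sketch under stated assumptions: it reuses the ~27% average from the AArch64 SPECint2000 table rather than the 32-bit run, and it assumes performance scales linearly with frequency. The function name is illustrative.

```python
KIRIN_950_MHZ = 2304      # A72 clock quoted in the text
EXYNOS_7420_MHZ = 2100    # A57 clock quoted in the text
AVG_SCORE_ADVANTAGE = 0.27  # ~27% average SPECint2000 advantage from the text


def per_clock_gain(score_advantage, new_mhz, base_mhz):
    """Score advantage with the clock-speed ratio divided out.

    Assumes performance scales linearly with frequency, which is
    only an approximation for memory-bound tests.
    """
    return (1.0 + score_advantage) / (new_mhz / base_mhz) - 1.0


ipc_gain_pct = per_clock_gain(AVG_SCORE_ADVANTAGE,
                              KIRIN_950_MHZ, EXYNOS_7420_MHZ) * 100.0
print(f"estimated per-clock gain: {ipc_gain_pct:.1f}%")
```

Under these assumptions the result lands just under 16%, in line with the figure ARM reported.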

To have a further data-point, we also analyse the overall increases in GeekBench 3’s test scores:

Geekbench 3 - Integer Performance

| Test | Exynos 7420 | Kirin 950 | % Advantage |
|---|---|---|---|
| AES ST | 607.7 MB/s | 928.3 MB/s | 53% |
| AES MT | 2.57 GB/s | 3.64 GB/s | 42% |
| Twofish ST | 94.1 MB/s | 109.6 MB/s | 16% |
| Twofish MT | 484.5 MB/s | 477.4 MB/s | -1% |
| SHA1 ST | 693.7 MB/s | 864.1 MB/s | 25% |
| SHA1 MT | 2.88 GB/s | 2.51 GB/s | -13% |
| SHA2 ST | 91.6 MB/s | 101.6 MB/s | 11% |
| SHA2 MT | 450.8 MB/s | 531.0 MB/s | 18% |
| BZip2Comp ST | 5.56 MB/s | 6.72 MB/s | 21% |
| BZip2Comp MT | 23.5 MB/s | 25.7 MB/s | 9% |
| Bzip2Decomp ST | 8.49 MB/s | 9.92 MB/s | 17% |
| Bzip2Decomp MT | 37.5 MB/s | 41.0 MB/s | 9% |
| JPG Comp ST | 20.6 MP/s | 22.0 MP/s | 7% |
| JPG Comp MT | 103.7 MP/s | 95.4 MP/s | -8% |
| JPG Decomp ST | 47.2 MP/s | 52.0 MP/s | 10% |
| JPG Decomp MT | 188.5 MP/s | 167.8 MP/s | -11% |
| PNG Comp ST | 1.19 MP/s | 1.45 MP/s | 22% |
| PNG Comp MT | 5.54 MP/s | 6.57 MP/s | 19% |
| PNG Decomp ST | 20.3 MP/s | 23.4 MP/s | 15% |
| PNG Decomp MT | 96.9 MP/s | 92.8 MP/s | -4% |
| Sobel ST | 55.8 MP/s | 61.6 MP/s | 10% |
| Sobel MT | 260.8 MP/s | 271.0 MP/s | 4% |
| Lua ST | 1.25 MB/s | 1.74 MB/s | 39% |
| Lua MT | 5.57 MB/s | 7.14 MB/s | 28% |
| Dijkstra ST | 3.88 Mpairs/s | 4.32 Mpairs/s | 11% |
| Dijkstra MT | 17.5 Mpairs/s | 17.0 Mpairs/s | -3% |

The advantages in GeekBench are slightly lower, as we see an average increase of 13.3% across all integer tests and 10.8% across non-cryptographic tests. These increases seem to be more in line with the frequency increase of the cores than with the micro-architectural improvements. This might indicate that GeekBench is less sensitive to the particular improvements of the A72, or that there are other architectural factors at play which make the two SoCs perform differently across workloads.
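Both averages can be reproduced from the "% Advantage" column of the integer table. The sketch below assumes a plain arithmetic mean over the rounded percentages and treats AES, Twofish, SHA1 and SHA2 as the cryptographic sub-tests; that grouping is my reading of the text rather than something the table states.

```python
# (sub-test, ST % advantage, MT % advantage) from the integer table
GEEKBENCH_INT = [
    ("AES", 53, 42),
    ("Twofish", 16, -1),
    ("SHA1", 25, -13),
    ("SHA2", 11, 18),
    ("BZip2Comp", 21, 9),
    ("Bzip2Decomp", 17, 9),
    ("JPG Comp", 7, -8),
    ("JPG Decomp", 10, -11),
    ("PNG Comp", 22, 19),
    ("PNG Decomp", 15, -4),
    ("Sobel", 10, 4),
    ("Lua", 39, 28),
    ("Dijkstra", 11, -3),
]

# Assumption: these four are the "cryptographic" tests the text excludes.
CRYPTO = {"AES", "Twofish", "SHA1", "SHA2"}


def mean(values):
    values = list(values)
    return sum(values) / len(values)


all_results = [pct for _, st, mt in GEEKBENCH_INT for pct in (st, mt)]
non_crypto = [pct for name, st, mt in GEEKBENCH_INT
              if name not in CRYPTO for pct in (st, mt)]

mean_all = mean(all_results)
mean_non_crypto = mean(non_crypto)
print(f"all integer tests: {mean_all:.1f}%")
print(f"non-cryptographic: {mean_non_crypto:.1f}%")
```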

Geekbench 3 - Floating Point Performance

| Test | Exynos 7420 | Kirin 950 | % Advantage |
|---|---|---|---|
| BlackScholes ST | 5.51 Mnodes/s | 8.39 Mnodes/s | 52% |
| BlackScholes MT | 28.3 Mnodes/s | 37.4 Mnodes/s | 32% |
| Mandelbrot ST | 1.23 GFLOPS | 2.08 GFLOPS | 69% |
| Mandelbrot MT | 6.09 GFLOPS | 8.73 GFLOPS | 43% |
| Sharpen Filter ST | 1.19 GFLOPS | 1.40 GFLOPS | 18% |
| Sharpen Filter MT | 6.00 GFLOPS | 6.10 GFLOPS | 2% |
| Blur Filter ST | 1.38 GFLOPS | 1.61 GFLOPS | 17% |
| Blur Filter MT | 7.45 GFLOPS | 6.87 GFLOPS | -8% |
| SGEMM ST | 2.67 GFLOPS | 3.00 GFLOPS | 12% |
| SGEMM MT | 9.09 GFLOPS | 10.3 GFLOPS | 13% |
| DGEMM ST | 1.26 GFLOPS | 1.35 GFLOPS | 7% |
| DGEMM MT | 4.06 GFLOPS | 5.1 GFLOPS | 26% |
| SFFT ST | 1.44 GFLOPS | 1.53 GFLOPS | 6% |
| SFFT MT | 5.29 GFLOPS | 6.63 GFLOPS | 25% |
| DFFT ST | 1.17 GFLOPS | 1.23 GFLOPS | 5% |
| DFFT MT | 3.76 GFLOPS | 4.37 GFLOPS | 16% |
| N-Body ST | 528.8 Kpairs/s | 756.2 Kpairs/s | 43% |
| N-Body MT | 2.01 Mpairs/s | 2.92 Mpairs/s | 45% |
| Ray Trace ST | 1.98 MP/s | 2.70 MP/s | 36% |
| Ray Trace MT | 8.30 MP/s | 10.8 MP/s | 30% |

In the floating point tests the advantage is larger as we see an average increase of 22.8% throughout all of the tests, with some sub-tests seeing high increases of up to 45% and even 69% in Mandelbrot. The larger floating-point increases are natural as the A72's new microarchitecture introduces comparatively larger changes to its floating-point execution pipelines than it does to its integer pipelines.

Geekbench 3 Memory Bandwidth Comparison (1 thread)

| SoC | Stream Copy | Stream Scale | Stream Add | Stream Triad |
|---|---|---|---|---|
| Exynos 7420 - A57 (2100 MHz) | 7.78 GB/s | 7.23 GB/s | 6.61 GB/s | 6.52 GB/s |
| Kirin 950 - A72 (2300 MHz) | 9.15 GB/s | 9.09 GB/s | 8.97 GB/s | 8.21 GB/s |
| A72 > A57 Advantage | 18% | 26% | 36% | 26% |

Memory bandwidth is the area where ARM claims the A72 architecture has seen the largest improvements, and indeed we see this both in GeekBench's STREAM memory tests and in our own internal tests. Write bandwidth in particular seems to have increased a lot. L2 bandwidth has also increased substantially, reaching ARM’s projected 50% and higher depending on the instructions and access pattern used. With NEON instructions, for example, two-thread concurrent read+write reaches up to ~38GB/s on the Kirin 950, while the Exynos only reaches up to ~23GB/s and the Snapdragon 810 ~19GB/s.

Overall the Kirin 950’s Cortex A72 seems to perform as advertised and gives a robust boost in performance thanks to both architectural improvements as well as higher clocks due to the new process node.

Kirin 950 - Power & Efficiency

Besides the performance improvements of the Cortex A72, we're also supposed to see a micro-architectural power reduction. Both factors combined are promised to bring large power efficiency improvements of up to 30%. HiSilicon was extremely upfront in presenting us with some unique figures on exactly how much the Cortex A72 improves compared to the Cortex A57. Indeed, beyond the IPC increase we also see a flat 20% power reduction at the same frequency and manufacturing process.

Process is one factor we can’t isolate, as the Kirin 950 is one of the first SoCs to ship on TSMC’s new 16nm FinFET+ manufacturing node (16FF+ for short). We’ll get back to the manufacturing node in a bit, but first let’s have a rough look at the device’s power consumption.

Because we were sampled early for the Mate 8, I wasn’t able to take advantage of root for more controlled DVFS measurements with external equipment, as bootloader unlock codes had yet to be made available by Huawei. As such, I had to rely on restricted measurements done via the device’s own fuel-gauge, making the following figures rough estimates rather than the more detailed figures published in recent articles.

System Active Power - CPU Load + Per Core Increments (mW)

| SoC | 1 Core | 2 Cores | 3 Cores | 4 Cores |
|---|---|---|---|---|
| Kirin 925, Cortex A15 @ 1.8GHz | 2144 | 3112 (+969) | 4089 (+977) | 5022 (+933) |
| Kirin 935, Cortex A53 @ 2.2GHz | 1062 | 1769 (+707) | 2587 (+818) | 3311 (+724) |
| Kirin 950, Cortex A72 @ 2.3GHz | 1387 | 2255 (+868) | 3051 (+796) | 3734 (+683) |
| Exynos 7420, Cortex A57 @ 2.1GHz | 1619 | 2969 (+1350) | 4186 (+1217) | 5486 (+1300) |

Using the same power-virus at varying thread-counts across all devices and SoCs, we end up with the above table of power consumption. Immediately the Kirin 950’s power figures stand out as being much better than what we’ve come to expect from ARM’s big cores. In terms of maximum system load power (total power consumption during a scenario minus average idle power), the Kirin 950 only reaches 3.7W at full frequency on all big cores. The same scenario on the Exynos 7420, for example, reaches a much higher 5.5W. Looking at the per-core increases, each additional Kirin 950 core adds only about 700-900mW of power. The diminishing power with thread count is something that I’ve observed in the past with some SoCs and CPU microarchitectures, so it seems to be a characteristic of the power virus I’m using: it starves the cluster of resources and makes each additional thread/core more bottlenecked.
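The per-core increments in the table are simply the differences between consecutive load-power readings. A minimal sketch of that arithmetic, using the system active power values from the table (the dictionary keys and function name are illustrative):

```python
# System active power in mW at 1, 2, 3 and 4 loaded big cores (from the table)
SYSTEM_ACTIVE_POWER_MW = {
    "Kirin 925 (A15 @ 1.8GHz)": [2144, 3112, 4089, 5022],
    "Kirin 935 (A53 @ 2.2GHz)": [1062, 1769, 2587, 3311],
    "Kirin 950 (A72 @ 2.3GHz)": [1387, 2255, 3051, 3734],
    "Exynos 7420 (A57 @ 2.1GHz)": [1619, 2969, 4186, 5486],
}


def per_core_increments(totals_mw):
    """Power added by each additional loaded core."""
    return [after - before for before, after in zip(totals_mw, totals_mw[1:])]


increments = {soc: per_core_increments(totals)
              for soc, totals in SYSTEM_ACTIVE_POWER_MW.items()}

for soc, steps in increments.items():
    # The published table may differ by 1 mW where it was derived
    # from unrounded measurements (e.g. the Kirin 925's second core).
    print(f"{soc}: " + ", ".join(f"+{step} mW" for step in steps))
```

The first-core figure is not a pure CPU number, since it also carries the cost of waking shared blocks such as the memory controller, which is why the increments are the more useful per-core estimate.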

As opposed to past HiSilicon SoCs, I was luckily able to extract the voltage tables of the Kirin 950. Through some calculations as well as reported power curves by ARM’s Intelligent Power Allocation (IPA) driver, I was able to create an estimated power curve of the Kirin 950’s big cluster.

In the past I’ve been rather harsh on the Kirin 925’s power consumption and blamed it on the CPU cores, but it was only with the more in-depth investigation of the Kirin 930 and 935’s power curves that I discovered the real culprit for the bad power consumption lay with non-CPU blocks such as HiSilicon’s memory controller.

When accounting for this and looking back at the per-CPU power figures of the Kirin 925 we see that the Cortex A15 only consumed a rather reasonable ~950mW of power at 1.8GHz on 28nm which falls in line with the ~750mW average of the Exynos 5430’s 20nm A15 cores.

Luckily the Kirin 950 is said to have fixed the large power overhead from the memory controller and thus is able to shave a good amount of power off the effective 1-core power figures. While I still have to confirm this with an actual measured power curve once I have the proper software tools for it, we’ll for now assume good faith in HiSilicon and estimate that the new overhead is as low as that of other vendors.

The new numbers put A57-based SoCs such as the Exynos 7420 in perspective, as the Kirin 950 now makes the competing SoC look comparatively unreasonable in terms of its peak power consumption. It’s odd to see this now as in hindsight 2015’s SoC generation may have been a one-off phenomenon where power consumption has risen drastically due to the switch to ARM’s first 64-bit micro-architecture.

To better visualize the improvement in power efficiency I graphed the average power curves of each SoC and normalized frequencies with the performance per clock value calculated by each SoC’s SPECint2000 score. Performance per clock remains stable across frequency across most architectures (Krait is an exception due to an asynchronous L2), so this is a valid estimate of overall power efficiency across most SoCs.
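The normalization described here can be sketched as follows: each frequency point is converted to an estimated performance value using the SoC's SPECint2000 ratio per MHz, and efficiency is that performance divided by power. Only the 0.77 ratio/MHz figure comes from this article; the power curve points below are placeholders invented for illustration and are not measurements.

```python
# SPECint2000 ratio per MHz for the Kirin 950, as quoted earlier in the article
KIRIN_950_RATIO_PER_MHZ = 0.77

# HYPOTHETICAL (frequency in MHz, power in W) points, NOT measured data.
# Substitute a real power curve to obtain meaningful results.
PLACEHOLDER_POWER_CURVE = [(1000, 0.2), (2000, 0.8)]


def efficiency_curve(ratio_per_mhz, power_curve):
    """Return (estimated performance, performance per watt) for each point.

    Assumes performance per clock stays constant across frequencies,
    as the text argues is the case for most architectures.
    """
    points = []
    for freq_mhz, power_w in power_curve:
        performance = ratio_per_mhz * freq_mhz
        points.append((performance, performance / power_w))
    return points


curve = efficiency_curve(KIRIN_950_RATIO_PER_MHZ, PLACEHOLDER_POWER_CURVE)
for performance, perf_per_watt in curve:
    print(f"perf {performance:.0f} -> {perf_per_watt:.0f} per W")
```

Plotting performance on the x-axis instead of raw frequency is what allows SoCs with different per-clock performance to be compared at the same performance point.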

The Kirin 950 is able to stand out from 2015’s SoCs both in terms of performance and efficiency. The most surprising data-point is that at the Exynos 7420's maximum performance point the Kirin 950’s A72 is almost twice as efficient, an eyebrow-raising difference that I really did not expect. It’s here that the Kirin’s advantage in manufacturing process shows: beyond the micro-architectural improvement, HiSilicon is also able to take advantage of lower operating voltages.

As you notice in the graph, I included two curves of the Kirin 950’s A72 cores – one of the estimated actual power efficiency derived through the power curve of my test device as well as a “target” curve. 

The Kirin 950 uses a Cortex M3 microcontroller that uses hardware performance monitors to measure the silicon's physical characteristics and, based on that, determines the operating voltages of the various SoC blocks. This seems to be a similar mechanism to the closed-loop voltage regulation mechanisms found on Qualcomm SoCs starting with the Snapdragon 810 or on Nvidia’s Tegra systems. I’m not quite sure if the voltages change during run-time or if they're fixed at boot, as I currently presume. While extracting the voltage tables for the SoC during boot I noticed that the actual applied voltage on the A72 cores no longer scales down with reduced frequency.

I reached out to HiSilicon on the matter and was given the explanation that early SoCs run a more conservative voltage scaling policy related to a still "unstable" process, and that future production units will steadily lower the operating voltages towards the target values as yields increase to more stable levels. As such, it’s possible that users of future devices will be able to enjoy higher efficiency at the lower frequencies of the A72. I would just like to add that the overall power difference in such a case would be rather small, as the device’s DVFS is set up to operate most of the time at frequencies above 1200MHz.

Overall the Kirin 950 and ARM's Cortex A72 are able to show some solid performance improvements, but it’s mostly in terms of efficiency where HiSilicon’s new chipset seems to excel. The gains effectively not only allow the chip-maker to catch up to vendors like Qualcomm and Samsung in performance, but also have enough of an efficiency lead over current designs such as the Exynos 7420 and Snapdragon 810 that we should be seeing the Kirin 950 able to compete against upcoming designs such as the Exynos 8890 and Qualcomm’s Snapdragon 820.

I’m also very excited to see peak power finally go back to sub-1W per core for sub-4W CPU TDPs, something we weren’t able to enjoy in 2015 in the high-performance segment. In the frequency drivers of the SoC I saw that HiSilicon had prepared a 2.5GHz bin of the SoC (Kirin 955?) so it’s possible we’ll again see a slight frequency bump much like the one we saw with the Kirin 925 or Kirin 935. If HiSilicon also manages to lower the operating voltages closer to the targeted values then it seems that future Huawei devices this year should find themselves in an excellent competitive position when it comes to performance and efficiency.


113 Comments


  • syxbit - Tuesday, January 05, 2016

    Despite only using the MP4 variant of the Mali-T880, I wish Google had used this SoC in the Nexus 6P.
    The Snapdragon 810 is garbage, and the only bad thing about my phone. It gets very warm when simply browsing webpages.....
  • jjj - Tuesday, January 05, 2016

    Page 2 you do 810vs 820 FP comparison, wrong paste.
  • tuxRoller - Tuesday, January 05, 2016

    Unless your browsing includes js/webgl heavy games, out benchmarks something is wrong with your phone.
    The only time I've noticed my gf's n6p getting noticeably warm was during benchmarks.
  • IanHagen - Wednesday, January 06, 2016

    Is your girlfriend using her phone outdoors in Winnipeg, by chance?
  • tuxRoller - Wednesday, January 06, 2016

    Oymyakon. Why do you ask? Do you think that's significant?
  • IanHagen - Thursday, January 07, 2016

    Well tell her to be careful! The ground might be slippery with all this ice and fish laying around, she might slip and drop her brand new phone. That would be a problem. Erm... what was it we were talking about again?
  • tuxRoller - Tuesday, January 12, 2016

    Oh, don't worry! The fish aren't a problem any longer. It's really the roving superpacks of wolves you need to watch out for...also, her phone has a case:)
  • Ethos Evoss - Wednesday, January 06, 2016

    Ppl never learn that ShtDragon is epic fail..
    Now SD 820 is going to be tested on LeTV on purpose in case of ignite..
    LeTV not realising qualcom used them as experiment and others manufs. will be watching.. bcos nobody wants to implement SD anymore..
  • syxbit - Wednesday, January 06, 2016

    SD800 and SD801 were brilliant chips, and SD805 was pretty good, but came out when others were further along with 64-bit. People have hope that SD820 will bring them back on track. However, if SD820 is bad, QCOM will risk losing permanent trust with everyone. They really have to nail SD820. Even Google was rumoured to be so unhappy with QCOM after SD810 that they began looking into designing their own SoC designs.

    SD810 was so horrifically bad, that they really owe everyone an apology. SD810 singlehandedly caused most flagship 2015 devices to suck, overheat, and have bad battery life. Worse still, QCOM denied the problems, claiming that people were just just spreading FUD about overheating SD810.
    Just look at QCOM's stock. They're at almost half their value from a year ago. All attributable to SD810. They need to apologize, and admit that SD810 was awful before people will believe their claims for SD820.

    However, most people believe that QCOM's custom SoC were great, so there's hope for SD820.
  • extide - Thursday, January 07, 2016

    Well, the benches for the 820 are out ... and they look pretty good so I am pretty confident that the 820 will not be a disappointment.
