HiSilicon Kirin 970 - Android SoC Power & Performance Overview

Name: HiSilicon Kirin 970 - Android SoC Power & Performance Overview
Item: HiSilicon Kirin 970 - Android SoC Power & Performance Overview
Author: Andrei Frumusanu

by Andrei Frumusanu on January 22, 2018 9:15 AM EST

116 Comments | Add A Comment

116 Comments

CPU Performance: SPEC2006

SPEC2006 has been a natural goal to aim for as a keystone analysis benchmark as it’s a respected industry standard benchmark that even silicon vendors use for architecture analysis and development. As we saw SPEC2017 released last year SPEC2006 is getting officially retired on January 9^th, a funny coincidence as we now finally start using it.

As Android SoCs improve in power efficiency and performance it’s now becoming more practical to use SPEC2006 on consumer smartphones. The main concerns of the past were memory usage for subtests such as MCF, but more importantly sheer test runtimes for battery powered devices. For a couple of weeks I’ve been busy in porting over SPEC2006 to a custom Android application harness.

The results are quite remarkable as we see both the generational performance as well as efficiency improvements from the various Android SoC vendors. The Kirin 970 in particular closes in on the efficiency of the Snapdragon 835, leapfrogging the Kirin 960 and Exynos SoCs. We also see a non-improvement in absolute performance as the Kirin 970 showcases a slight performance degradation over the Kirin 960 – with all SoC vendors showing just meagre performance gains over the past generation.

Going Into The Details

Our new SPEC2006 harness is compiled using the official Android NDK. For this article the NDK version used in r16rc1 and Clang/LLVM were used as the compilers with just the –Ofast optimization flags (alongside applicable test portability flags). Clang was chosen over of GCC because Google has deprecated GCC in the NDK toolchain and will be removing the compiler altogether in 2018, making it unlikely that we’ll revisit GCC results in the future. It should be noted that in my testing GCC 4.9 still produced faster code in some SPEC subtests when compared to Clang. Nevertheless the choice of Clang should in the future also facilitate better Androids-to-Apples comparisons in the future. While there are arguments that SPEC scores should be published with the best compiler flags for each architecture I wanted a more apples-to-apples approach using identical binaries (Which is also what we expect to see distributed among real applications). As such for this article the I’ve chosen to pass to the compiler the –mcpu=cortex-a53 flag as it gave the best average overall score among all tested CPUs. The only exception was the Exynos M2 which profited from an additional 14% performance boost in perlbench when compiled with its corresponding CPU architecture target flag.

As the following SPEC scores are not submitted to the SPEC website we have to disclose that they represent only estimated values and thus are not officially validated submissions.

Alongside the full suite for CINT2006 we are also running the C/C++ subtests of CFP2006. Unfortunately 10 out of the 17 tests in the CFP2006 suite are written in Fortran and can only be compiled with hardship with GCC on Android and the NDK Clang lacks a Fortran front-end.

As an overview of the various run subtests, here are the various application areas and descriptions as listed on the official SPEC website:

SPEC2006 C/C++ Benchmarks
Suite	Benchmark	Application Area	Description
SPECint2006 (Complete Suite)	400.perlbench	Programming Language	Derived from Perl V5.8.7. The workload includes SpamAssassin, MHonArc (an email indexer), and specdiff (SPEC's tool that checks benchmark outputs).
	401.bzip2	Compression	Julian Seward's bzip2 version 1.0.3, modified to do most work in memory, rather than doing I/O.
	403.gcc	C Compiler	Based on gcc Version 3.2, generates code for Opteron.
	429.mcf	Combinatorial Optimization	Vehicle scheduling. Uses a network simplex algorithm (which is also used in commercial products) to schedule public transport.
	445.gobmk	Artificial Intelligence: Go	Plays the game of Go, a simply described but deeply complex game.
	456.hmmer	Search Gene Sequence	Protein sequence analysis using profile hidden Markov models (profile HMMs)
	458.sjeng	Artificial Intelligence: chess	A highly-ranked chess program that also plays several chess variants.
	462.libquantum	Physics / Quantum Computing	Simulates a quantum computer, running Shor's polynomial-time factorization algorithm.
	464.h264ref	Video Compression	A reference implementation of H.264/AVC, encodes a videostream using 2 parameter sets. The H.264/AVC standard is expected to replace MPEG2
	471.omnetpp	Discrete Event Simulation	Uses the OMNet++ discrete event simulator to model a large Ethernet campus network.
	473.astar	Path-finding Algorithms	Pathfinding library for 2D maps, including the well known A* algorithm.
	483.xalancbmk	XML Processing	A modified version of Xalan-C++, which transforms XML documents to other document types.
SPECfp2006 (C/C++ Subtests)	433.milc	Physics / Quantum Chromodynamics	A gauge field generating program for lattice gauge theory programs with dynamical quarks.
	444.namd	Biology / Molecular Dynamics	Simulates large biomolecular systems. The test case has 92,224 atoms of apolipoprotein A-I.
	447.dealII	Finite Element Analysis	deal.II is a C++ program library targeted at adaptive finite elements and error estimation. The testcase solves a Helmholtz-type equation with non-constant coefficients.
	450.soplex	Linear Programming, Optimization	Solves a linear program using a simplex algorithm and sparse linear algebra. Test cases include railroad planning and military airlift models.
	453.povray	Image Ray-tracing	Image rendering. The testcase is a 1280x1024 anti-aliased image of a landscape with some abstract objects with textures using a Perlin noise function.
	470.lbm	Fluid Dynamics	Implements the "Lattice-Boltzmann Method" to simulate incompressible fluids in 3D
	482.sphinx3	Speech recognition	A widely-known speech recognition system from Carnegie Mellon University

It’s important to note one extremely distinguishing aspect of SPEC CPU versus other CPU benchmarks such as GeekBench: it’s not just a CPU benchmark, but rather a system benchmark. While benchmarks such as GeekBench serve as a good quick view of basic workloads, the vastly greater workload and codebase size of SPEC CPU stresses the memory subsystem to a much greater degree. To demonstrate this we can see the individual subtest performance differences when solely limiting the memory controller frequency, in this case on the Mate 10 Pro with the Kirin 970.

An increase in main memory latency from just 80ns to 115ns (Random access within access window) can have dramatic effects on many of the more memory access sensitive tests in SPEC CPU. Meanwhile the same handicap essentially has no effect on the GeekBench 4 single-threaded scores and only marginal effect on some subtests of the multi-threaded scores.

In general the benchmarks can be grouped in three categories: memory-bound, balanced memory and execution-bound, and finally execution bound benchmarks. From the memory latency sensitivity chart it’s relatively easy to find out which benchmarks belong to which category based on the performance degradation. The worst memory bound benchmarks include the infamous 429.mcf but alongside we also see 433.milc, 450.soplex, 470.lbm and 482.sphinx3. The least affected such as 400.perlbench, 445.gobmk, 456.hmmer, 464.h264ref, 444.namd, 453.povray and with even 458.sjeng and 462.libquantum slightly increasing in performance pointing out to very saturated execution units. The remaining benchmarks are more balanced and see a reduced impact on the performance. Of course this is an oversimplification and the results will differ between architectures and platforms, but it gives us a solid hint in terms of separation between execution and memory-access bound tests.

As well as tracking performance (SPECspeed) I also included a power tracking mechanisms which relies on the device’s fuel-gauge for current measurements. The values published here represent only the active power of the platform, meaning it subtracts idle power from total absolute load power during the workloads to compensate for platform components such as the displays. Again I have to emphasize that the power and energy figures don't just represent the CPU, but the SoC system as a whole, including interconnects, memory controllers, DRAM, and PMIC overhead.

Alongside the current generation SoCs I also included a few predecessors to be able to track the progress that has happened over the last 2 years in the Android space and over CPU microarchitecture generations. Because the runtime of all benchmarks is in excess of 5 hours for the fastest devices we are actively cooling the phones with an external fan to ensure consistent DVFS frequencies across all of the subtests and that we don’t favour the early tests.

The Kirin 970 - Overview SPEC2006 - The Results

PRINT THIS ARTICLE

Post Your Comment
Please log in or sign up to comment.

Comments Locked

116 Comments

View All Comments

GreenReaper - Thursday, January 25, 2018 - link
If it can do 2160p60 Decode then I'd imagine that of course it can do 2160p30 Decode, just as it can do 1080p60/30 decode. You list the maximum in a category.
yhselp - Tuesday, January 23, 2018 - link
What a wonderful article: a joy to read, thoughtful, and very, very insightful. Thank you, Andrei. Here's to more coverage like that in the future.

It looks like the K970 could be used in smaller form factors. If Huawei were to make a premium, bezel-less ~ 4.8" 18:9 model power by K970, it would be wonderful - a premium, Android phone about the size of the iPhone SE.

Even though Samsung and Qualcomm (S820) have custom CPUs, it feels like their designs are much closer to stock ARM than Apple's CPUs. Why are they not making wider designs? Is it a matter of inability or unwillingness?
Raqia - Tuesday, January 23, 2018 - link
Props for a nice article with data rich diagrams filled with interesting metrics as well as the efforts to normalize tests now and into the future w/ LLVM + SPECINT 06. (Adding the units after the numbers in the chart and "avg. watts" to the rightward pointing legend at the bottom would have helped me grok faster...) Phones are far from general purpose compute devices and their CPUs are mainly involved in directing program flow rather than actual computation, so leaning more heavily into memory operations with the larger data sets of SPECINT is a more appropriate CPU test than Geekbench. The main IPC uplift from OoOE comes from the prefetching and execution in parallel of the highest latency load and store operations and a good memory/cache subsystem does wonders for IPC in actual workloads. Qualcomm's Hexagon DSP has

It would be interesting to see the 810 here, but its CPU figures would presumably blow off the chart. A modem or wifi test would also be interesting (care for a donation toward the aforementioned harness?), but likely a blowout in the good direction for the Qualcomm chipsets.
Andrei Frumusanu - Friday, January 26, 2018 - link
Apologies for the chart labels, I did them in Excel and it doesn't allow for editing the secondary label position (watts after J/spec).

The Snapdragon 810 devices wouldn't have been able to sustain their peak performance states for SPEC so I didn't even try to run it.

Unless your donation is >$60k, modem testing is far beyond the reach of AT because of the sheer cost of the equipment needed to do this properly.
jbradfor - Wednesday, January 24, 2018 - link
Andrei, two questions on the Master Lu tests. First, is there a chance you could run it on the 835 GPU as well and compare? Second, do these power number include DRAM power, or are they SoC only? If they do not include DRAM power, any chance you could measure that as well?
Andrei Frumusanu - Friday, January 26, 2018 - link
The Master Lu uses the SNPE framework and currently doesn't have the option to chose computing target on the SoC. The GPU isn't any or much faster than the DSP and is less efficient.

The power figures are the active power of the whole platform (total power minus idle power) so they include everything.
jbradfor - Monday, January 29, 2018 - link
Thanks. Do you have the capability of measuring just the SoC power separately from the DRAM power?
ReturnFire - Wednesday, January 24, 2018 - link
Great article Andrei. So glad there is new mobile stuff on AT. Fingers crossed for more 2018 flagship / soc articles!
KarlKastor - Thursday, January 25, 2018 - link
"AnandTech is also partly guilty here; you have to just look at the top of the page: I really shouldn’t have published those performance benchmarks as they’re outright misleading and rewarding the misplaced design decisions made by the silicon vendors. I’m still not sure what to do here and to whom the onus falls onto."
That is pretty easy. Post sustained performace values and not only peak power. Just run the benchmarks ten times in a row, it's not that difficult.
If in every review sustained performance is shown, the people will realize this theater.

And it is a big problem. Burst GPU performance is useless. No one plays a game for half a minute.
Burst CPU performance ist perhaps a different matter. It helps to optimize the overall snappiness.
Andrei Frumusanu - Friday, January 26, 2018 - link
I'm planning to switch to this in future reviews.

HiSilicon Kirin 970 - Android SoC Power & Performance Overview

CPU Performance: SPEC2006

Going Into The Details

Post Your Comment

116 Comments

View All Comments

GreenReaper - Thursday, January 25, 2018 - link

yhselp - Tuesday, January 23, 2018 - link

Raqia - Tuesday, January 23, 2018 - link

Andrei Frumusanu - Friday, January 26, 2018 - link

jbradfor - Wednesday, January 24, 2018 - link

Andrei Frumusanu - Friday, January 26, 2018 - link

jbradfor - Monday, January 29, 2018 - link

ReturnFire - Wednesday, January 24, 2018 - link

KarlKastor - Thursday, January 25, 2018 - link

Andrei Frumusanu - Friday, January 26, 2018 - link

Log in

Don't have an account? Sign up now