HiSilicon Kirin 960: A Closer Look at Performance and Power

Author: Matt Humrick

by Matt Humrick on March 14, 2017 7:00 AM EST

Final Words

HiSilicon’s Kirin 950 delivered impressive performance and efficiency, raising our expectations for its successor. And on paper at least, the Kirin 960 seems better in every way. It incorporates ARM’s latest IP, including A73 CPUs, the new Mali-G71 GPU with more cores, and a CCI-550 interconnect. It offers other improvements too, such as UFS 2.1 support and a new modem that supports higher LTE speeds. But when it comes to performance and efficiency, the Kirin 960 improves in some areas and regresses in others.

The Kirin 960’s A73 CPU cores are marginally faster than the 950’s A72 cores when handling integer workloads, with a more noticeable lead over Qualcomm’s Kryo and the older A57. When looking at floating-point IPC, the opposite is true, with Qualcomm’s Kryo and Kirin 950’s A72 cores posting better results than the 960’s A73.

Some of this performance regression may be explained by Kirin 960’s memory performance. Both latency and read bandwidth improve for its larger 64KB L1 cache, but write bandwidth is lower than Kirin 950. The 960’s L2 cache bandwidth is also lower for both read and write. Its latency to main memory improves by 25%, however, and bandwidth improves by an impressive 69%.
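The percentages above mix two kinds of metric: for latency a lower number is better, while for bandwidth a higher number is better. The Python sketch below shows how such improvements are conventionally computed, assuming the article follows the usual definitions. The function names are hypothetical, and the input values are placeholders chosen only to reproduce the 25% and 69% figures; they are not the article's measurements.

```python
def latency_improvement(old_ns: float, new_ns: float) -> float:
    """Percent reduction in latency (lower is better)."""
    return (old_ns - new_ns) / old_ns * 100.0


def bandwidth_improvement(old_gbps: float, new_gbps: float) -> float:
    """Percent increase in bandwidth (higher is better)."""
    return (new_gbps - old_gbps) / old_gbps * 100.0


# Placeholder inputs (NOT measured values), picked to match the quoted percentages.
print(round(latency_improvement(100.0, 75.0)))   # 25
print(round(bandwidth_improvement(10.0, 16.9)))  # 69
```

Note the asymmetry: a 25% latency reduction means the new value is 0.75x the old one, while a 69% bandwidth increase means the new value is 1.69x the old one.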

What’s really disappointing (and puzzling) about Kirin 960, though, is that its CPU efficiency is actually worse than the 950’s. ARM did a lot of work to reduce the A73’s power consumption relative to the A72, but the Kirin 960’s A73 cores see a substantial power increase over the 950’s A72 cores. The poor efficiency numbers are likely a combination of HiSilicon’s specific implementation and the switch to the 16FFC process. This was definitely an unexpected result considering the Mate 9’s excellent battery life. Fortunately, Huawei was able to save power elsewhere, such as the display, to make up for the SoC’s power increase, but it’s difficult not to think about how much better the battery life could have been.
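One way to see why a small speedup cannot offset a large power increase is to compare the energy needed to finish the same fixed workload, since energy is average power multiplied by runtime. The sketch below is illustrative only: the function name and the 10% and 40% figures are hypothetical stand-ins, not numbers from this review.

```python
def relative_energy(speedup: float, power_ratio: float) -> float:
    """Energy used by the new chip relative to the old one for the same work.

    speedup     = old_runtime / new_runtime  (>1 means the new chip is faster)
    power_ratio = new_power / old_power      (>1 means the new chip draws more)

    Energy = power * time, so the ratio of energies is power_ratio / speedup.
    """
    if speedup <= 0 or power_ratio <= 0:
        raise ValueError("speedup and power_ratio must be positive")
    return power_ratio / speedup


# Hypothetical example: 10% faster while drawing 40% more power.
print(round(relative_energy(1.10, 1.40), 2))  # 1.27
```

A result above 1.0 means the newer core spends more energy per task, which is the sense in which efficiency regresses even when performance improves.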

Power consumption for Kirin 960’s GPU is even worse, with peak power numbers that are entirely inappropriate for a smartphone. Part of the problem is poor efficiency, again likely a combination of implementation and process, which is only made worse by an overly aggressive 1037MHz peak operating point that only serves to improve the spec sheet and benchmark results.
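A rough way to see why a very high peak operating point is costly is the standard first-order model for dynamic power in CMOS logic, P ≈ C · V² · f: reaching a higher frequency usually requires a higher voltage, so power grows faster than clock speed. The sketch below applies that model. Only the 1037MHz peak comes from the article; the lower operating point and the voltage ratio are hypothetical values for illustration, and the model ignores leakage power.

```python
def relative_dynamic_power(freq_ratio: float, voltage_ratio: float) -> float:
    """First-order dynamic power ratio, P ~ C * V^2 * f (capacitance cancels)."""
    return freq_ratio * voltage_ratio ** 2


PEAK_MHZ = 1037.0      # peak GPU operating point quoted in the article
LOWER_MHZ = 800.0      # hypothetical, more conservative operating point
VOLTAGE_RATIO = 1.15   # hypothetical: 15% more voltage needed to reach the peak

freq_ratio = PEAK_MHZ / LOWER_MHZ
power_ratio = relative_dynamic_power(freq_ratio, VOLTAGE_RATIO)

print(f"frequency:     +{(freq_ratio - 1) * 100:.0f}%")   # +30%
print(f"dynamic power: +{(power_ratio - 1) * 100:.0f}%")  # +71%
```

Under these assumed inputs, roughly 30% more clock speed costs roughly 71% more dynamic power, which illustrates how a peak frequency chosen for benchmark results can worsen efficiency.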

The Kirin 960 is difficult to categorize. It’s definitely not a clear upgrade over the 950, but it does just enough things right that we cannot dismiss it outright either. For example, its generally improved integer performance and lower system memory latency give it an advantage over the 950 in many real-world workloads. We cannot completely condemn its GPU either, because its sustained performance, at least in the Mate 9’s large aluminum chassis, is on par with or better than competing flagship phones, as is its battery life when gaming. Certainly the Mate 9 proves that Kirin 960 is a viable flagship SoC as long as Huawei puts in the effort to work around its flaws. But with a new generation of 10nm SoCs just around the corner, those flaws will only become more apparent.

86 Comments

BedfordTim - Tuesday, March 14, 2017 - link
I suspect it comes down to cost and usage. The iPhone cores are roughly four times the size of an A73.
name99 - Tuesday, March 14, 2017 - link
True. But the iPhone cores are still small ENOUGH. The main CPU complex on an A10 (two big cores, two small cores, and L2) is maybe 15 mm^2.
ARM STILL seems to be optimizing for core area, and then spending that same core area anyway in octacores and decacores. It makes no sense to me.

Obviously part of it is that Apple must be throwing a huge number of engineers at the problem. But that's not enough; there has to be some truly incredible project management involved to keep all those different teams in sync, and I don't think anyone has a clue how they have done that.
They certainly don't seem to be suffering from any sort of "mythical man-month" Fred Brooks problems so far...

My personal suspicion is that, by luck or by hiring the best senior engineer in the world, they STARTED OFF at a place that is pretty much optimal for the trajectory they wanted.
They designed a good 3-wide core, then (as far as anyone can tell) converted that to a 6-wide core by clustering and (this is IMPORTANT) not worrying about all the naysayers who said that a very wide core could not be clocked very high.

Once they had the basic 6-wide core in place, they've had a superb platform on top of which different engineers can figure out improved sub-systems and just slot them in when ready. So we had the FP pipeline redesigned for lower latency, we had an extra NEON functional unit added, we've doubtless had constant improvements to branch prediction, I-fetching, pre-fetching, cache placement and replacement; and so on --- but these are all (more or less) "easy" to optimize given a good foundation on which to build.

I suspect, also, that unlike some in the industry, they have been extremely open to new ideas from academia, so that there's an implementation turnaround time of maybe two years or so from encountering a good idea (say a new design for a cluster predictor) through simulating it to validate its value, to implementing it.
I'm guessing that management (again unlike most companies) is willing to entertain a constant stream of ideas (from engineers, from reading the literature, from talking to academics) and to ACCEPT and NOT COMPLAIN about the cost of writing the simulations, in the full understanding that only 5 or 10% of simulated ideas are worth emulating. My guess is that they've managed to increase frequency rapidly (in spite of the 6-wide width) by implementing a constant stream of the various ideas that have been published (and generally mocked or ignored by the industry) for ways to scale things like load-store queues, issue, and rename --- the standard frequency/power pain-points in OoO design.

Meanwhile ARM seems to suffer from terminal effort-wasting. Apple has a great design, which they have been improving every year. ARM's response, meanwhile, has been to hop like a jack rabbit from A57 to A72 to A73, with no obvious conceptual progression. If each design spends time revising basics like the decoder and the optimal pipeline width, there's little time left to perform the huge number of experiments that I think Apple perform to keep honing the branch predictors, the instruction fusion, the pre-fetchers, and so on.

It reminds me of a piece of under-appreciated software, namely Mathematica, which started off with a ridiculously good foundation and horrible performance. But because the foundation was so good, every release had to waste very little time re-inventing the wheel; it could just keep adding and adding, until the result is just unbelievable.
Meteor2 - Wednesday, March 15, 2017 - link
Didn't Jim Keller have something to do with their current architecture?

And yes, Apple seems to have excellent project management. Really, they have every stage of every process nailed. They're not the biggest company in the world by accident.
Meteor2 - Wednesday, March 15, 2017 - link
Also don't forget that (like Intel) ARM has multiple design teams. A72 and A73 are from separate teams; from that perspective, ARM's design progression does make sense. The original A73 'deepdive' by Andrei explained it very well.
name99 - Wednesday, March 15, 2017 - link
This is a facet of what I said about project management.
The issue is not WHY there are separate CPU design teams --- no-one outside the companies cares about the political compromises that landed up at that point.
The issue is --- are separate design teams and restarting each design from scratch a good fit to the modern CPU world?

It seems to me that the question has been empirically answered as no, and that every company that follows this policy (which seems to include IBM, don't know about QC or the GPU design teams) really ought to rethink. We don't recreate compilers, or browsers, or OS's every few years from scratch, but we seem to have taken it for granted that doing so for CPUs made sense.

I'm not sure this hypothesis explains everything --- no-one outside Apple (and few inside) has the knowledge necessary to answer the question. But I do wonder if the biggest part of Apple's success came from their being a SW company, and thus looking at CPU design as a question of CONSTANTLY IMPROVING a good base, rather than as a question of re-inventing the wheel every few years the way the competition has always done things.
Meteor2 - Wednesday, March 15, 2017 - link
Part of having separate teams is to engender competition; another is to hedge bets and allow risk-taking. Core replacing Netburst is the standard example, I suppose. I'm sure there are others but they aren't coming to mind at the moment... Does replacing Windows CE with Windows 10 count?
Meteor2 - Wednesday, March 15, 2017 - link
Methinks it's more to do with Safari having some serious optimisations for browser benchmarks baked in deep.

I'd like to see the A10 subjected to GB4 and SpecInt.
name99 - Wednesday, March 15, 2017 - link
The A10 GeekBench numbers are hardly secret. Believe me, they won't make you happy.
SPEC numbers, yeah, we're still waiting on those...
name99 - Wednesday, March 15, 2017 - link
Here's an example:
https://browser.primatelabs.com/v4/cpu/959859
Summary:

Single-Core Score 3515
Crypto Score 2425
Integer Score 3876
Floating Point Score 3365
Memory Score 3199

The even briefer summary is that basically every sub-benchmark has A10 at 1.5x to 2x the Kirin 960 score. FP is even more brutal with some scores at 3x, and SGEMM at ~4.5x.

(And that's the A10... The A10X will likely be out within a month, likely fabbed on TSMC 10nm, likely an additional ~50% faster...)
Meteor2 - Wednesday, March 15, 2017 - link
Thanks. Would love to see those numbers in Anandtech charts, and normalised for power.
