Final Words

HiSilicon’s Kirin 950 delivered impressive performance and efficiency, raising our expectations for its successor. And on paper at least, the Kirin 960 seems better in every way. It incorporates ARM’s latest IP, including A73 CPUs, the new Mali-G71 GPU with more cores, and a CCI-550 interconnect. It offers other improvements too, such as a new modem capable of higher LTE speeds and support for UFS 2.1 storage. But when it comes to performance and efficiency, the Kirin 960 improves in some areas and regresses in others.

The Kirin 960’s A73 CPU cores are marginally faster than the 950’s A72 cores when handling integer workloads, with a more noticeable lead over Qualcomm’s Kryo and the older A57. When looking at floating-point IPC, the opposite is true, with Qualcomm’s Kryo and Kirin 950’s A72 cores posting better results than the 960’s A73.

Some of this performance regression may be explained by the Kirin 960’s memory performance. Both latency and read bandwidth improve for its larger 64KB L1 cache, but write bandwidth is lower than the Kirin 950’s. The 960’s L2 cache bandwidth is also lower for both reads and writes. Its latency to main memory improves by 25%, however, and bandwidth improves by an impressive 69%.
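For readers who want to probe this behavior themselves, the usual technique is a pointer chase: walking a randomly permuted ring of indices serializes every load, so the average time per iteration approximates load-to-use latency, and sweeping the working-set size steps the walk through L1, L2, and main memory. The C sketch below is a minimal, illustrative version of the idea, not the harness behind the charts in this review:

```c
/* Minimal pointer-chase latency sketch (illustrative; not this review's harness).
   Build: cc -O2 latency.c */
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define ITERS 10000000UL

static double chase_ns(const size_t *ring) {
    size_t idx = 0;
    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (unsigned long i = 0; i < ITERS; i++)
        idx = ring[idx];                      /* each load depends on the last */
    clock_gettime(CLOCK_MONOTONIC, &t1);
    volatile size_t sink = idx;               /* keep the dependency chain live */
    (void)sink;
    double ns = (double)(t1.tv_sec - t0.tv_sec) * 1e9
              + (double)(t1.tv_nsec - t0.tv_nsec);
    return ns / (double)ITERS;                /* average load-to-use latency */
}

int main(void) {
    /* Sweep working sets from 16KB (L1-resident) to 16MB (DRAM-bound). */
    for (size_t kb = 16; kb <= 16384; kb *= 2) {
        size_t n = kb * 1024 / sizeof(size_t);
        size_t *ring = malloc(n * sizeof(size_t));
        if (!ring) return 1;
        for (size_t i = 0; i < n; i++) ring[i] = i;
        /* Sattolo's algorithm: one random cycle covering every slot,
           which defeats next-line and stride prefetchers. */
        for (size_t i = n - 1; i > 0; i--) {
            size_t j = (size_t)rand() % i;
            size_t t = ring[i]; ring[i] = ring[j]; ring[j] = t;
        }
        printf("%6zu KB: %5.1f ns per load\n", kb, chase_ns(ring));
        free(ring);
    }
    return 0;
}
```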

What’s really disappointing (and puzzling) about the Kirin 960, though, is that its CPU efficiency is actually worse than the 950’s. ARM did a lot of work to reduce the A73’s power consumption relative to the A72, but the Kirin 960’s A73 cores see a substantial power increase over the 950’s A72 cores. The poor efficiency numbers are likely a combination of HiSilicon’s specific implementation and the switch to the 16FFC process. This was definitely an unexpected result considering the Mate 9’s excellent battery life. Fortunately, Huawei was able to save power elsewhere, such as in the display, to make up for the SoC’s power increase, but it’s difficult not to think about how much better the battery life could have been.

Power consumption for the Kirin 960’s GPU is even worse, with peak power numbers that are entirely inappropriate for a smartphone. Part of the problem is poor efficiency, again likely a combination of implementation and process, compounded by an overly aggressive 1037MHz peak operating point that serves only to improve the spec sheet and benchmark results.

The Kirin 960 is difficult to categorize. It’s definitely not a clear upgrade over the 950, but it does just enough things right that we cannot dismiss it outright either. For example, its generally improved integer performance and lower system memory latency give it an advantage over the 950 in many real-world workloads. We cannot completely condemn its GPU either, because its sustained performance, at least in the Mate 9’s large aluminum chassis, is on par with or better than competing flagship phones, as is its battery life when gaming. Certainly the Mate 9 proves that Kirin 960 is a viable flagship SoC as long as Huawei puts in the effort to work around its flaws. But with a new generation of 10nm SoCs just around the corner, those flaws will only become more apparent.

Comments

  • Eden-K121D - Tuesday, March 14, 2017

    Samsung only
  • Meteor2 - Wednesday, March 15, 2017

    I think the 820 acquitted itself well here. The 835 could be even better.
  • name99 - Tuesday, March 14, 2017

    "Despite the substantial microarchitectural differences between the A73 and A72, the A73’s integer IPC is only 11% higher than the A72’s."

    Well, sure, if you're judging by Intel standards...
    Apple has been able to sustain about a 15% increase in IPC from A7 through A8, A9, and A10, while also ramping up frequency aggressively, maintaining power, and reducing throttling. But sure, not a BAD showing by ARM; the real issue is whether they can keep delivering this sort of improvement at least annually.

    Of more technical interest:
    - the largest jump is in mcf. This is a strongly memory-bound benchmark, which suggests a substantially improved prefetcher. In particular, simplistic prefetchers struggle with it, suggesting a move beyond just next-line and stride prefetchers (or at least the smarts to track where these are doing more harm than good and switch them off). People agree? (See the sketch at the end of this comment.)

    - twolf appears to have the hardest branches to predict of the set, with vpr coming in second. So it's POSSIBLE (?) that their relative shortcomings reflect changes in the branch/fetch engine that benefit most apps but hurt specifically weird branching patterns?

    One thing that ARM has not made clear is where instruction fusion occurs, and thus how it impacts the two-wide decode limit. If, for example, fusion is handled (to some extent anyway) as a pre-decode operation when lines are pulled into L1I, and if fusion possibilities are being aggressively pursued [basically all the ideas that people have floated --- compare+branch, large immediate calculation, op+storage (?), short (+8) branch+op => predication like POWER8 (?)], there could be a SUBSTANTIAL fraction of fused instructions going through the system, so that the 2-wide decode is basically as good as the A72's 3-wide?
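    To make the mcf point concrete, here is a minimal, illustrative sketch (my own construction, not SPEC source) contrasting a stride walk, which next-line/stride prefetchers cover easily, with the data-dependent pointer chase that mcf's traversals produce, where the next address is unknown until the current load returns:

    ```c
    /* Illustrative contrast of access patterns (a sketch, not SPEC code). */
    #include <stdio.h>
    #include <stdlib.h>

    /* Stride walk: addresses are computable in advance, so next-line and
       stride prefetchers hide most of the miss latency. */
    long stride_walk(const long *a, size_t n, size_t stride) {
        long sum = 0;
        for (size_t i = 0; i < n; i += stride)
            sum += a[i];
        return sum;
    }

    struct node { struct node *next; long payload; };

    /* Pointer chase: the next address is the result of the current load,
       so misses serialize unless the prefetcher can predict the pointer
       stream (mcf's graph traversals look like this). */
    long pointer_chase(const struct node *p) {
        long sum = 0;
        while (p) {
            sum += p->payload;
            p = p->next;
        }
        return sum;
    }

    int main(void) {
        enum { N = 1024 };
        long a[N];
        struct node *nodes = malloc(N * sizeof *nodes);
        if (!nodes) return 1;
        for (int i = 0; i < N; i++) {
            a[i] = i;
            nodes[i].payload = i;
            nodes[i].next = (i + 1 < N) ? &nodes[i + 1] : NULL;
        }
        /* A real test would also shuffle the node order in memory. */
        printf("stride: %ld  chase: %ld\n",
               stride_walk(a, N, 8), pointer_chase(nodes));
        free(nodes);
        return 0;
    }
    ```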
  • fanofanand - Wednesday, March 15, 2017

    Once WinArm (or whatever they want to call it) is released, we will FINALLY be able to compare apples to apples when it comes to these designs. Right now there are mountains of speculation, but few people actually know where things are at. We will see just how performant Apple's cores are once they can be accurately compared to Ryzen/Core designs. I have the feeling a lot of Apple worshippers are going to be sorely disappointed. Time will tell.
  • name99 - Wednesday, March 15, 2017

    We can compare Apple's ARM cores to the Intel cores in Apple laptops today, with both GeekBench and Safari. The best matchup I can find is this:
    https://browser.primatelabs.com/v4/cpu/compare/177...

    (I'd prefer to compare against the MacBook 12" 2016 edition with Skylake, but for some reason there seem to be no GB4 results for that.)

    This compares an iPhone (so ~5W max power?) against a Broadwell that turbos up to 3.1 GHz (GB tends to run everything at the max turbo speed because it allows the core to cool between the [short] tests), and with a TDP of 15W.

    Even so, the performance is comparable. When you normalize for frequency, you get that the A10 has about 20% better IPC than Broadwell, which would drop to maybe 15% better IPC against Skylake.
    Of course that A10 runs at a lower (peak) frequency --- but also at much lower power.
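    To sanity-check that normalization, divide each score by its clock and compare the ratios. A tiny sketch with stand-in placeholder numbers (not actual GB4 results):

    ```c
    #include <stdio.h>

    int main(void) {
        /* All four numbers are illustrative placeholders, not real GB4 data. */
        double score_a10 = 3500.0, ghz_a10 = 2.34;  /* hypothetical iPhone 7 ST score/clock */
        double score_bdw = 3900.0, ghz_bdw = 3.10;  /* hypothetical Broadwell at max turbo */
        double ratio = (score_a10 / ghz_a10) / (score_bdw / ghz_bdw);
        printf("score-per-GHz ratio: %.2f\n", ratio); /* ~1.19: roughly 20%% ahead per clock */
        return 0;
    }
    ```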

    There's every reason to believe that the A10X will beat absolutely the equivalent Skylake chip in this class (not just m-class but also U-class), running at a frequency of ?between 3 and 3.5GHz? while retaining that 15-20% IPC advantage over Skylake and at a power of ?<10W?
    Hopefully we'll see in a few weeks --- the new iPads should be released either end March or beginning April.

    Point is --- I don't see why we need to wait for WinARM server --- especially since MS has made no commitment to selling WinARM to the public; all they've committed to is using ARM for Azure.
    Comparing GB4 or Safari on Apple devices gives us comparable compilers, comparable browsers, comparable OSs, and comparable hardware design skill. I don't see what a Windows equivalent brings to the table that adds more value.
  • joms_us - Wednesday, March 15, 2017

    Bwahaha keep dreamin iTard, GB is your most trusted benchmark. =D

    Why don't you run two machines, one with an A10 and one with a Celeron released in 2010? You will see how pathetic your A10 is in real-world apps.
  • name99 - Wednesday, March 15, 2017

    When I was 10 years old, I was in the car and my father and his friend were discussing some technical chemistry. I was bored with this professional talk of pH and fractionation and synthesis, so after my father described some particular reagent he'd mixed up, I chimed in with "and then you drank it?", to which my father said "Oh be quiet. Listen to the adults and you might learn something." While some might have treated this as a horrible insult, the cause of all their later failures in life, I personally took it as serious advice and tried (somewhat successfully) to abide by it, to my great benefit.
    Thanks Dad!

    Relevance to this thread is an exercise left to the reader.
  • joms_us - Wednesday, March 15, 2017

    Even the latest Ryzen is just barely equal to or faster than Skylake clock for clock, so what makes you think a worthless low-powered mobile chip will surpass them? The A10 is not even better than the SD821 in real-world app comparisons. Again, real-world apps, not AnTuTu, not Geekbench.
  • zodiacfml - Wednesday, March 15, 2017

    Intel's chips are smaller than Apple's. Apple also has the luxury of spending a lot on the SoC.
  • Andrei Frumusanu - Tuesday, March 14, 2017

    Stamp of approval.
