Kirin 980 Second Generation NPU - NNAPI Tested

We tested the first-generation Kirin NPU back in January in our Kirin 970 review. Back then, we were quite limited in the benchmarking tests we were able to run, and I mostly relied on Master Lu’s AI test. That benchmark is still around, and we’ve also used it in performance testing Apple’s new A12 neural engine. Unfortunately for the Mate 20s, the benchmark isn’t compatible yet, as it seemingly doesn’t use HiSilicon’s HiAI API on the phones and falls back to a CPU implementation for processing.

Google finalised the NNAPI back in Android 8.1, and as these things usually go, we first need an API to come out before we can see applications make use of exotic new features such as dedicated neural inferencing engines.

“AI-Benchmark” is a new tool developed by Andrey Ignatov from the Computer Vision Lab at ETH Zürich in Switzerland. The new benchmark application is, as far as I’m aware, one of the first to make extensive use of Android’s new NNAPI, rather than relying on each SoC vendor’s own SDK tools and APIs. This is an important distinction from AIMark, as AI-Benchmark should be better able to accurately represent the NN performance that an application using the NNAPI can expect.

Andrey extensively documents the workloads, such as the NN models used as well as their function, and has also published a paper on his methods and findings.

One thing to keep in mind is that the NNAPI isn’t just some universal translation layer that is able to magically run a neural network model on an NPU: both the API and the SoC vendor’s underlying driver must support the exposed functions and be able to run them on the IP block. The distinction here lies between models which use features that are to date not yet supported by the NNAPI, and thus have to fall back to a CPU implementation, and models which can be hardware accelerated and operate on quantized INT8 or FP16 data. There are also models relying on FP32 data, and here again, depending on the underlying driver, these can run either on the CPU or, for example, on the GPU.
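To give a feel for what "quantized INT8" means in practice, here is a minimal sketch of the affine quantization scheme commonly used by TensorFlow Lite-style models. The value range `[-1.0, 1.0]` and all parameters are illustrative, not taken from any specific model in the benchmark:

```python
# Affine (asymmetric) quantization: real_value = scale * (q - zero_point).
# The [rmin, rmax] range below is illustrative, not from any specific model.

def quant_params(rmin, rmax, qmin=0, qmax=255):
    """Derive scale and zero-point for a given real-valued range."""
    scale = (rmax - rmin) / (qmax - qmin)
    zero_point = round(qmin - rmin / scale)
    return scale, zero_point

def quantize(x, scale, zero_point, qmin=0, qmax=255):
    q = round(x / scale) + zero_point
    return max(qmin, min(qmax, q))  # clamp to the 8-bit range

def dequantize(q, scale, zero_point):
    return scale * (q - zero_point)

scale, zp = quant_params(-1.0, 1.0)
q = quantize(0.5, scale, zp)
x = dequantize(q, scale, zp)  # close to 0.5, within one quantization step
```

The appeal for an NPU is that all the heavy multiply-accumulate work happens on 8-bit integers, at the cost of the small rounding error visible in the round-trip above.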

For the time being, I’m refraining from using the app’s overall scores and will simply rely on individual comparisons of each test’s inference time. Another presentational difference is that we’ll go through the test results based on the targeted model acceleration type.

[Charts: AIBenchmark 1a - The Life (CPU); 6 - Ms.Universe (CPU); 7 - Berlin Driving (CPU)]

The first three CPU tests rely on models with functions that are not yet supported by the NNAPI. Here what matters for performance is simply CPU performance, as well as performance response time. I mention the latter because the workload is transactional in nature, and we are just testing a single image inference. This means that mechanisms such as DVFS and scheduler responsiveness can have a huge impact on the results. This is best demonstrated by the fact that my custom kernel for the Exynos 9810 in the Galaxy S9 performs significantly better than the stock kernel of the same chip in the Note9 in the results above.
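The methodological point above can be mimicked with a toy timing harness: a single cold measurement is exactly the number a transactional benchmark records, while the median of repeated runs filters out scheduler and DVFS ramp-up noise. The dummy workload and run count here are purely illustrative:

```python
import time
import statistics

def dummy_inference():
    # Stand-in for a single-image NN inference running on the CPU.
    return sum(i * i for i in range(50_000))

def single_shot_ms():
    t0 = time.perf_counter()
    dummy_inference()
    return (time.perf_counter() - t0) * 1000

# One cold, single-shot measurement: the number a transactional benchmark sees,
# taken before the CPU governor has had any reason to ramp clocks up.
cold = single_shot_ms()

# Repeated measurements: the median is far less sensitive to DVFS/scheduler noise.
runs = [single_shot_ms() for _ in range(20)]
steady = statistics.median(runs)

print(f"single-shot: {cold:.2f} ms, median of 20: {steady:.2f} ms")
```

On a real device the gap between the two numbers is driven by how quickly the governor responds to the burst of load, which is why kernel tuning shows up so strongly in these results.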

Still, comparing the Huawei P20 Pro (the most up-to-date software stack with the Kirin 970) to the new Mate 20, we see some really impressive results from the latter. This showcases both the performance of the A76 cores and possible improvements in HiSilicon’s DVFS/scheduler.

[Charts: AIBenchmark 1c - The Life (INT8); 3 - Pioneers (INT8); 5 - Cartoons (INT8)]

Moving onto the next set of tests, these are based on 8-bit integer quantized NN models. Unfortunately for the Huawei phones, HiSilicon’s NNAPI driver still doesn’t seem to expose acceleration to the hardware. Andrey shared with me that, in communications with Huawei, they said they plan to rectify this in a future version of the driver.

Effectively, these tests also don’t use the NPU on the Kirins, and it’s again a showcase of the CPU performance.

On the Qualcomm devices, we see the OnePlus 6 and Pixel 3 far ahead in performance, even compared to the Galaxy S9+ with the same chipset. The reason for this is that both of these phones are running a newer NNAPI driver from Qualcomm, which came along with the Android 9/P BSP update. Here, acceleration is facilitated through the HVX DSPs.

[Charts: AIBenchmark 1b - The Life (FP16); 2 - Zoo (FP16); 4 - Masterpiece (FP16)]

Moving on to the FP16 tests, here we finally see the Huawei devices make use of the NPU and post some leading scores on both the old and new generation SoCs. Here the Kirin 980’s promised >2x NPU performance improvement finally materialises, with the Mate 20 showcasing a big lead.
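For reference, here is a small sketch of what storing and computing on FP16 rather than FP32 costs in precision, using Python's `struct` half-float support purely for illustration; the weight value is made up:

```python
import struct

def to_fp16(x):
    """Round-trip a Python float through IEEE 754 half precision."""
    return struct.unpack('<e', struct.pack('<e', x))[0]

# FP16 has a 10-bit mantissa, giving roughly 3 decimal digits of precision,
# versus roughly 7 for FP32 - usually plenty for trained NN weights.
w = 0.123456789          # hypothetical FP32 weight
print(to_fp16(w))        # nearest representable half float, ~0.1235

# Values that fit exactly in half precision survive unchanged.
assert to_fp16(0.5) == 0.5
```

This halved precision (and halved storage/bandwidth) is what lets an NPU like the Kirin's run FP16 models at full speed while FP32 models fall back to the CPU or GPU.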

I’m not sure if the other devices are running the workloads on the CPU or on the GPU, and the OnePlus 6 seems to suffer from some very odd regression in its NNAPI drivers that makes it perform an order of magnitude worse than other platforms.

[Chart: AIBenchmark 8 - Berlin Driving (FP32)]

Finally, on the last FP32 model test, most phones should be running the workload on the CPU again. Here the improvement on the part of the Mate 20 is more limited.

Overall, AI-Benchmark was at least able to validate some of Huawei’s NPU performance claims, though the real conclusion we should draw from these results is that most devices’ NNAPI drivers are currently inherently immature and still very limited in their functionality, which is a sad contrast to where Apple’s CoreML ecosystem stands today.

I refer back to my conclusion from earlier in the year regarding the Kirin 970: I still don’t see the NPU as something that is obviously beneficial to users, simply because we just don’t have the software applications available to make use of the hardware. I’m not sure to what extent Huawei uses the NPU for camera processing, but beyond such first-party use-cases, NPUs currently still seem mostly inconsequential to the device experience.

Comments

  • name99 - Friday, November 16, 2018 - link

    Andrei you are concentrating on the wrong thing. I don't care about the inadequacies of GB4's memory bandwidth test, or the device uncore, I care about the DRAM part of this.

    I understand you and anonomouse are both claiming that LPDDR4-2133 means 4266 MT/s.
    OK, if that's true it's a dumb naming convention, but whatever. The point is, this claim goes directly against the entire thrust of the AnandTech DDR5 article from a few days ago that I keep referring to, which states very clearly that something like DDR4-3200 means 3200MT/s.

    THAT is the discrepancy I am trying to resolve.
  • ternnence - Friday, November 16, 2018 - link

    name99, for mobile, LPDDR4X has a 4266 spec; desktop DDR4, however, rarely reaches such frequencies. So it is not that LPDDR4-2133 has 4266MT/s, it is that LPDDR4-4266 has 4266MT/s.
  • ternnence - Friday, November 16, 2018 - link

    FYI, you can check this site: https://www.samsung.com/semiconductor/dram/lpddr4x...
  • name99 - Friday, November 16, 2018 - link

    FWIW, Wikipedia sees things the same way:
    https://en.wikipedia.org/wiki/DDR4_SDRAM
    e.g. DDR4-2133 means 2133MT/s

    This follows the exact same pattern as all previous SDRAM numbering. Up to DDR3 the multiplier was 2 (DDR), 4(DDR2) or 8(DDR3); with DDR4 the multiplier stays at 8 but the base clock doubles so from min of 100MHz it's now min of 200MHz.

    But these are internal details; the part that matters is that most authorities seem to agree that DDR4-2133 means 2133MT/s, each transaction normally 64-bits wide.

    Now there are SOME people claiming no, DDR4-2133 means 4266 MT/s
    - https://www.androidauthority.com/lpddr4-everything...
    claims this (but couches the claim in so much nonsensical techno-double-speak that I don't especially trust them)
    - so do you and anonomouse.

    So, like I said, WTF is going on here? We have a large pool of sources saying the sky is blue, and a different pool insisting that, no, the sky is green.
  • anonomouse - Friday, November 16, 2018 - link

    I never claimed that DDR4-2133 means 4266MT/s. I am instead claiming that there is no LPDDR4-2133.
  • anonomouse - Friday, November 16, 2018 - link

    I think the discrepancy here is just that you/they are mixing the naming conventions. DDR4-3200 means 3200MT/s. After an admittedly brief and cursory search, I don't see any references to Micron using the term LPDDR4-2133. I instead see every indication that they have LPDDR4 running at 2133MHz. Perhaps people here and there are mixing up the terminology, but when in doubt you may as well just look at the actual memory clock or bandwidth being listed, as that's ultimately what's important.
  • name99 - Friday, November 16, 2018 - link

    Yeah, I think you are correct. After looking in a few different places I think the following are all true:
    - The DDR4 guys tend to talk about MT/s and give the sorts of numbers I gave
    - The LPDDR4 guys tend to talk about Mb/s per pin (same as MT/s, but just shows a different culture) and tend to be working with substantially higher numbers.

    I *THINK* (corrections welcome) that
    (a) the way LPDDR4 is mounted (no DIMMs and sockets, rather it's direct mounting, either on the SoC as PoP, or extremely close to it on a dedicated substrate), allows for substantially higher frequencies than DDR4.
    (b) one's natural instinct (mine, and likely other people's) is that "of course DDR4 runs faster [fewer power concerns, etc]" so when you see LPDDR4 running faster (at say "4266") you assume this has to mean some sort of "silent" multiplication by 2, and what's actually meant is the equivalent of DDR4-2133 at 2133MT/s.
    (c) It certainly doesn't help that Micron at least is calling the 4266MT/s LPDDR4 as having a "2133MHz clock". I have no idea what that is supposed to mean given that the DDR4 "clock" runs at 1/8th transaction speed, so for DDR4 the clock of a 4266MT/s device would be 533MHz.

    So I think we have established that the actual speeds ARE 4266MT/s (or so) for LPDDR4.
    Left unresolved
    - these are generally higher than DDR4? Meaning that, sooner or later, PC users are going to have to choose between flexible RAM (DIMMs and sockets) or high speed RAM (PoP mounting, or superclose to the SoC on a substrate --- look at the A12X)?

    - Why is Micron calling something like LPDDR4-4266 as having a 2133MHz clock? What does that refer to? I would assume that, like normal DDRx, the "low frequency clock" (what I've said would be 533MHz) is the speed for control transactions, and the 8x speed (4266Mb/s per pin) is the speed for bulk data flow?
  • ternnence - Friday, November 16, 2018 - link

    Where do you get this "Micron lists their LPDDR4, for example, as LPDDR4-2133, NOT as LPDDR4-4266"? Just check Micron's official site: they mark their 2133MHz RAM as LPDDR4-4266, not LPDDR4-2133.
  • ternnence - Friday, November 16, 2018 - link

    DDR means double data rate. A 2133MHz clock means the RAM operates 2133 million times per second, but each operation produces two data outputs. MT/s means million transfers per second. So LPDDR4-4266 = 4266 million transfers per second = a 2133MHz clock.
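    The arithmetic in the comment above can be written out explicitly. This is just an editorial sketch: the 64-bit bus width is the standard width of a single DDR4 channel, and none of the numbers are tied to a specific part:

    ```python
    def transfer_rate_mts(io_clock_mhz, transfers_per_clock=2):
        """DDR: two data transfers per I/O clock cycle."""
        return io_clock_mhz * transfers_per_clock

    def bandwidth_gbs(mts, bus_width_bits=64):
        """Peak bandwidth in GB/s for a single channel."""
        return mts * bus_width_bits / 8 / 1000

    # LPDDR4-4266: a 2133MHz I/O clock, two transfers per cycle.
    assert transfer_rate_mts(2133) == 4266
    print(bandwidth_gbs(4266, 64))  # ~34.1 GB/s for one 64-bit channel
    ```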
  • name99 - Friday, November 16, 2018 - link

    The Micron datasheets, for example, numdram.pdf,
    https://www.micron.com/~/media/documents/products/...
    do exactly this.
