Inference Performance: APIs, Where Art Thou?

Having covered the new CPU complexes of both the new Exynos and Snapdragon SoCs, up next are the new-generation neural processing engines in each chip.

The Snapdragon 855 brings big performance improvements to the table thanks to a doubling of the HVX units inside the Hexagon 690 DSP. In the last two generations of Snapdragon chips, the HVX units were the IP blocks that took the brunt of integer neural network inferencing work, an area this IP is particularly adept at.

The new tensor accelerator inside the Hexagon 690 was shown off by Qualcomm at its preview event back in January. Unfortunately, the new block is currently only accessible through Qualcomm's own SDK tools, and it won't offer acceleration for NNAPI workloads until later in the year with Android Q.
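For context, this is roughly how a third-party Android app reaches NNAPI acceleration today: through TensorFlow Lite's NNAPI delegate rather than a vendor SDK, with the vendor's NNAPI driver deciding which block actually executes the model. A minimal sketch (the model file name is a placeholder, not anything used by AI-Benchmark):

```kotlin
import org.tensorflow.lite.Interpreter
import org.tensorflow.lite.nnapi.NnApiDelegate
import java.io.File

// Minimal sketch: route a TensorFlow Lite model through the NNAPI delegate.
// Which IP block (DSP, GPU, NPU or CPU) actually executes the model is
// decided by the vendor's NNAPI driver, not by the app.
fun buildNnapiInterpreter(modelFile: File): Interpreter {
    val nnApiDelegate = NnApiDelegate()
    val options = Interpreter.Options().addDelegate(nnApiDelegate)
    return Interpreter(modelFile, options)
}

// Hypothetical usage, assuming "model.tflite" has been copied to app storage:
// val interpreter = buildNnapiInterpreter(File(context.filesDir, "model.tflite"))
```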

Looking at a compatibility matrix of which workload types can be accelerated by which hardware blocks through NNAPI reveals a quite sad state of affairs:

NNAPI SoC Block Usage Estimates
SoC \ Model Type   INT8   FP16   FP32
Exynos 9820        GPU    GPU    GPU
Exynos 9810        GPU?   GPU    CPU
Snapdragon 855     DSP    GPU    GPU
Snapdragon 845     DSP    GPU    GPU
Kirin 980          GPU?   NPU    CPU

What stands out in particular is Samsung’s new Exynos 9820 chipset. Even though the SoC comes with an NPU that on paper is extremely powerful, the software side of things makes it as if the block didn’t exist. Currently Samsung doesn’t even publicly offer a proprietary SDK for the new NPU, much less NNAPI drivers. I’ve been told that Samsung plans to address this later in the year, but how exactly the Galaxy S10 will profit from this future functionality is quite unclear.

For Qualcomm, as the HVX units are integer-only, only quantised INT8 inference models can be accelerated by the block, with FP16 and FP32 workloads falling back to what should be GPU acceleration. It should be noted that my matrix here could be wrong: we’re dealing with abstraction layers, and depending on the model features required, the drivers could run models on different IP blocks.
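To illustrate the INT8-versus-float split, a small sketch of how an app can check whether a TensorFlow Lite model is quantised and, for float models, ask the runtime to relax FP32 arithmetic to FP16 where the backend allows it. Which IP block ultimately runs the model remains up to the NNAPI driver, and the helper names here are my own:

```kotlin
import org.tensorflow.lite.DataType
import org.tensorflow.lite.Interpreter

// Sketch: check whether a loaded TensorFlow Lite model is an 8-bit quantised
// model by inspecting its input tensor type. Quantised models are the only
// ones the integer-only Hexagon blocks can take; float models will fall back
// to the GPU or CPU depending on the driver.
fun isQuantisedModel(interpreter: Interpreter): Boolean =
    interpreter.getInputTensor(0).dataType() == DataType.UINT8

// For float models, the app can at least ask the runtime to relax FP32
// arithmetic to FP16 where the backend supports it.
fun fp16RelaxedOptions(): Interpreter.Options =
    Interpreter.Options().setAllowFp16PrecisionForFp32(true)
```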

Finally, HiSilicon’s Kirin 980 currently only offers NNAPI acceleration for FP16 models on the NPU, with INT8 and FP32 models falling back to the CPU. The device is seemingly not using Arm’s NNAPI drivers for the Mali GPU, or at least not taking advantage of INT8 acceleration in the same way Samsung’s GPU drivers do.

Before we even get to the benchmark figures, it’s clear that the results will be a mess with various SoCs performing quite differently depending on the workload.

For the benchmark, we’re using a brand-new version of Andrey Ignatov’s AI-Benchmark, namely the just-released version 3.0. The new version tunes the models and introduces a new Pro Mode which, most interestingly, is now able to measure sustained inference throughput. This latter point is important, as performance can differ greatly between one-shot inferences and back-to-back inferences. In the former case, software and DVFS can vastly overshadow the actual performance capability of the hardware, as in many cases we’re dealing with timings in the tens or hundreds of milliseconds.

Going forward we’ll be taking advantage of the new benchmark’s flexibility and posting both instantaneous single inference times as well as sequential throughput inference times, better showcasing and separating the impact of software and hardware capabilities.
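A rough sketch of the difference between the two measurements (the buffer handling and the run count are placeholder assumptions on my part, not AI-Benchmark’s actual methodology):

```kotlin
import org.tensorflow.lite.Interpreter
import java.nio.ByteBuffer

// Sketch of the two measurements: one "cold" single inference, where driver
// overhead and DVFS ramp-up dominate, versus the average of many
// back-to-back runs, which better reflects sustained hardware throughput.
fun measureInferenceTimes(
    interpreter: Interpreter,
    input: ByteBuffer,
    output: ByteBuffer,
    runs: Int = 50
): Pair<Double, Double> {
    // Single-shot inference time in milliseconds.
    input.rewind(); output.rewind()
    val singleStart = System.nanoTime()
    interpreter.run(input, output)
    val singleMs = (System.nanoTime() - singleStart) / 1e6

    // Average time over many sequential inferences.
    var totalNs = 0L
    repeat(runs) {
        input.rewind(); output.rewind()
        val start = System.nanoTime()
        interpreter.run(input, output)
        totalNs += System.nanoTime() - start
    }
    val sustainedMs = totalNs / runs / 1e6
    return singleMs to sustainedMs
}
```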

There’s a lot of data here, so for the sake of brevity I’ll simply put up all the results and we’ll go over the general analysis at the end:

[Charts: AI-Benchmark 3 results - 1a/1b/1c "The Life" (CPU FP, NNAPI INT8, NNAPI FP16); 2a/2b/2c "Zoo" (NNAPI INT8, CPU FP, NNAPI FP16); 3a/3b/3c "Pioneers" (CPU INT, NNAPI INT8, NNAPI FP16); 4 "Let's Play!" (CPU FP); 5a/5b "Masterpiece" (NNAPI INT8, NNAPI FP16); 6b "Cartoons" (NNAPI FP16); 7a/7b "Ms.Universe" (CPU INT, CPU FP); 8 "Blur iT!" (CPU FP); 9 "Berlin Driving" (NNAPI FP16); 10a/10b "WESPE-dn" (NNAPI FP16, NNAPI FP32)]

As initially predicted, the results vary wildly across the different SoCs.

The new tests also include workloads that solely use TensorFlow libraries on the CPU, so the results not only showcase NNAPI accelerator offloading but can also serve as a CPU benchmark.

In the CPU-only tests, we see the Snapdragon 855 and Exynos 9820 in the lead; however, there’s a notable difference between the two when it comes to instantaneous versus sequential performance. The Snapdragon 855 is able to post significantly better single-inference figures than the Exynos, although the latter catches up in longer-duration workloads. Inherently, this is a software difference between the two chips: although Samsung has improved scheduler responsiveness in the new chip, it still lags behind the Qualcomm variant.

In INT8 workloads there is no contest, as Qualcomm is far ahead of the competition in the NNAPI benchmarks simply because it’s the only vendor able to offload these to an actual accelerator. Samsung’s Exynos 9820 performance here has also drastically improved thanks to the new Mali G76’s INT8 dot-product instructions. It’s odd that the same GPU in the Kirin 980 doesn’t show the same improvements, which could be due to out-of-date Arm GPU NNAPI drivers on the Mate 20.

The FP16 performance crown most often goes to the Kirin 980’s NPU, but in some workloads it seems to fall back to the GPU, and in those cases Qualcomm’s GPU clearly has the lead.

Finally, for FP32 workloads it’s again the Qualcomm GPU that takes an undisputed lead in performance.

Overall, machine inferencing performance today is an absolute mess. Amidst all the chaos, Qualcomm seems to be the only SoC supplier able to deliver consistently good performance, and its software stack is clearly the best. Things will evolve over the coming months, and it will be interesting to see what Samsung can achieve with its custom SDK and NNAPI support for the Exynos NPU. But much like Huawei’s Kirin NPU, it’s all just marketing until we actually see the software deliver on the hardware’s capabilities, something which may take longer than the hardware’s first year of active lifespan.

Comments

  • Andrei Frumusanu - Monday, April 1, 2019 - link

    We don't have any good methodology on things like signal, network (does any site test this *accurately*?).

    As for the UI bits, it's something I wanted to have in the piece but also didn't want to further delay the article another week. In general, OneUI is by far Samsung's best user interface and has fantastic features without them feeling like gimmicks. It's currently, in my opinion, the best variant of Android, though I'm sure some Google users will get angry at me for saying that.
  • GreenMeters - Monday, April 1, 2019 - link

    "If you’re a reader in the US or other Snapdragon markets, you can stop reading here and feel happy about your purchase or go ahead and buy the Galaxy S10+."

    Unfortunately, no, you can't feel happy about it, because once again the Snapdragon variant has its bootloader locked. So your expensive purchase that could easily have a 5+ year lifespan with an open source OS providing up-to-date security and features is now artificially limited to 2 years of Samsung's lousy support.
  • XelaChang - Monday, April 1, 2019 - link

    Quite disappointing for Exynos, especially the audio. Going to look into the Huawei P30 instead.
  • Quantumz0d - Monday, April 1, 2019 - link

    Hello Andrei, huge thanks for the solid piece. I don't think there are any editors out there who do this type of analysis. The most superb part was the battery analysis, just fantastic. I remember your piece on the Note 9 as well.

    Because smartphones with soldered/sealed batteries are a pain, with a 2-year EOL of cycles due to aggressive current/power/volts/cycles. When you cover the LG, maybe kindly have a look at their Qnovo. Replacing the battery as an end user is so bad: it ruins the IP rating and parts are hard to source. Samsung improving this is really good news.

    Next, the camera hole points are all valid. It's worse than a notch with that absurdly thicc status bar and the stupid icons on the right side. An eyesore with dead pixels. Samsung showed an under-screen camera in China; perhaps the vertical integration you mentioned due to the Exynos applies here as well, perhaps the cost as well.

    One UI perhaps feels polished, but it's too childish/kid-friendly to me and excessively rounded like iOS instead of stock Pie/Q, which is bad IMO.

    Still have to read up on the camera/display. Also I think you should mention one great advantage that the Exynos has - Bootloader Unlock. Without that, the QSD version is just a paperweight, zero ownership, zero tuning. IMO a brick.

    Also good to hear about the speaker system performance, Apple mentions it always its surprising how they didn't yet offered a good quality, finally those AKG buds are very very bad. I heard them, their tiny driver is horrible in the low end and mid range, it's shameful. I'm not up to date with recent audio progress but at $100 we can get RHA MA 750 / Final Audio / iBasso IT01 / TFZ King II / Mee Audio P1 / FiiO F9 and Pro / Dunu Titan1 and a ton of IEMs with far superior quality.

    I hope they get the damn AKM chips into their phones, compete with LG's ESS, and take the audio seriously. It's a shame that LG doesn't advertise ESS anymore, only the Meridian collaboration.

    Finally, the audio DAC part: sometimes being 100% accurate doesn't necessarily mean best. With my iPod 5.5G Wolfson DAC (before the CL merger), many people say the iPod 6G+ Classics are better due to the ball roll off they mention on the Wolfson 5.5G DAC. I have both of them running the same OS (aftermarket stable Linux-based Rockbox) and the Cirrus Logic G classic sounds fatiguing to me, metallic and lacking thump vs the 5.5G. I think maybe your impression is also similar. I heard the 835's Acoustic in my car with my friend's Note 8 US version and it was hollow and lacked any texture and rumble. The iPod beats it by a HUGE margin, both the 6G and 5.5G, with the 5.5G being better. The V30's ESS sounds more balanced vs the 5.5G, as in clearer at the expense of soundstage (in the car more significant) and sharpness being higher, but retains excellent sub-bass. All this is subjective. Just to let you know.

    Thanks.
  • Quantumz0d - Monday, April 1, 2019 - link

    Correction.

    > Apple mentions it always its surprising how they didn't yet.

    Apple stands at the top, with some of the best speakers on an iPad/iPhone.
  • Andrei Frumusanu - Tuesday, April 2, 2019 - link

    The iPhone XS improved it, but the S10 beats it handily in speaker quality.
  • Quantumz0d - Tuesday, April 2, 2019 - link

    Wow, that's really surprising and great news. Thank you for the information. I'll stop by a Best Buy near me and check it out.

    Perhaps they'll improve on their new 855-based Tab S5 (hopefully with a headphone jack, unlike the S5e), because the Tab S4 is outright beaten to a pulp by the 2017 iPad Pro.
  • s.yu - Friday, April 5, 2019 - link

    Wow beating Apple at audio is definitely something special.
  • Quantumz0d - Monday, April 1, 2019 - link

    Damn another typo

    >Ball roll off

    It's bass roll-off. And "6G" is missing before "classic sounds".
  • watersb - Tuesday, April 2, 2019 - link

    What an incredible opportunity to compare two leading SoC architectures.
