Machine Learning Inference Performance

AIMark 3

AIMark makes use of various vendor SDKs to implement the benchmarks. This means that the end-results really aren’t a proper apples-to-apples comparison, however it represents an approach that actually will be used by some vendors in their in-house applications or even some rare third-party app.

鲁大师 / Master Lu - AIMark 3 - InceptionV3 鲁大师 / Master Lu - AIMark 3 - ResNet34 鲁大师 / Master Lu - AIMark 3 - MobileNet-SSD 鲁大师 / Master Lu - AIMark 3 - DeepLabV3

In AIMark 3, the benchmark uses each vendor’s proprietary SDK in order to accelerate the NN workloads most optimally. For Qualcomm’s devices, this means that seemingly the benchmark is also able to take advantage of the new Tensor cores. Here, the performance improvements of the new Snapdragon 865 chip is outstanding, posting in 2-3x performance compared to its predecessor.

AIBenchmark 3

AIBenchmark takes a different approach to benchmarking. Here the test uses the hardware agnostic NNAPI in order to accelerate inferencing, meaning it doesn’t use any proprietary aspects of a given hardware except for the drivers that actually enable the abstraction between software and hardware. This approach is more apples-to-apples, but also means that we can’t do cross-platform comparisons, like testing iPhones.

We’re publishing one-shot inference times. The difference here to sustained performance inference times is that these figures have more timing overhead on the part of the software stack from initialising the test to actually executing the computation.

AIBenchmark 3 - NNAPI CPU

We’re segregating the AIBenchmark scores by execution block, starting off with the regular CPU workloads that simply use TensorFlow libraries and do not attempt to run on specialized hardware blocks.

AIBenchmark 3 - 1 - The Life - CPU/FP AIBenchmark 3 - 2 - Zoo - CPU/FP AIBenchmark 3 - 3 - Pioneers - CPU/INT AIBenchmark 3 - 4 - Let's Play - CPU/FP AIBenchmark 3 - 7 - Ms. Universe - CPU/FP AIBenchmark 3 - 7 - Ms. Universe - CPU/INT AIBenchmark 3 - 8 - Blur iT! - CPU/FP

Starting off with the CPU accelerated benchmarks, we’re seeing some large improvements of the Snapdragon 865. It’s particularly the FP workloads that are seeing some big performance increases, and it seems these improvements are likely linked to the microarchitectural improvements of the A77.

AIBenchmark 3 - NNAPI INT8

AIBenchmark 3 - 1 - The Life - INT8 AIBenchmark 3 - 2 - Zoo - Int8 AIBenchmark 3 - 3 - Pioneers - INT8 AIBenchmark 3 - 5 - Masterpiece - INT8 AIBenchmark 3 - 6 - Cartoons - INT8

INT8 workload acceleration in AI Benchmark happens on the HVX cores of the DSP rather than the Tensor cores, for which the benchmark currently doesn’t have support for. The performance increases here are relatively in line with what we expect in terms of iterative clock frequency increases of the IP block.

AIBenchmark 3 - NNAPI FP16

AIBenchmark 3 - 1 - The Life - FP16 AIBenchmark 3 - 2 - Zoo - FP16 AIBenchmark 3 - 3 - Pioneers - FP16 AIBenchmark 3 - 5 - Masterpiece - FP16 AIBenchmark 3 - 6 - Cartoons - FP16 AIBenchmark 3 - 9 - Berlin Driving - FP16 AIBenchmark 3 - 10 - WESPE-dn - FP16

FP16 acceleration on the Snapdragon 865 through NNAPI is likely facilitated through the GPU, and we’re seeing iterative improvements in the scores. Huawei’s Mate 30 Pro is in the lead in the vast majority of the tests as it’s able to make use of its NPU which support FP16 acceleration, and its performance here is quite significantly ahead of the Qualcomm chipsets.

AIBenchmark 3 - NNAPI FP32

AIBenchmark 3 - 10 - WESPE-dn - FP32

Finally, the FP32 test should be accelerated by the GPU. Oddly enough here the QRD865 doesn’t fare as well as some of the best S855 devices. It’s to be noted that the results here today were based on an early software stack for the S865 – it’s possible and even very likely that things will improve over the coming months, and the results will be different on commercial devices.

Overall, there’s again a conundrum for us in regards to AI benchmarks today, the tests need to be continuously developed in order to properly support the hardware. The test currently doesn’t make use of the Tensor cores of the Snapdragon 865, so it’s not able to showcase one of the biggest areas of improvement for the chipset. In that sense, benchmarks don’t really mean very much, and the true power of the chipset will only be exhibited by first-party applications such as the camera apps, of the upcoming Snapdragon 865 devices.

System Performance GPU Performance & Power
Comments Locked

178 Comments

View All Comments

  • Bulat Ziganshin - Monday, December 16, 2019 - link

    The Spec2006 tables show that A13 has performance similar to x86 desktop chips, which may be considered as revolution. Can you please add frequencies of the chips (both x86 and Apple) too, at least some estimations? Also, what are the memory configs (freq/CAS/...)? It will be also interesting to see x86 chips in individual SPEC benchmarks so we can analyze what are the weak and string points of Apple architecture.
  • Andrei Frumusanu - Monday, December 16, 2019 - link

    The Apple chips are running near their peak frequencies, with some subtests being slightly throttled due to power. The 9900K was at 5GHz, the 3950X at 4.6-4.65GHz, 3200CL16 on the desktop parts.

    I added the detailed overview over all chips; here's it again: https://images.anandtech.com/doci/15207/SPEC2006_o...
  • unclevagz - Monday, December 16, 2019 - link

    It would be nice if some contemporary x86 laptop chips could be added to that list (Ryzen/Ice Lake/Coffee Lake...) just for ease of comparison between ARM and x86 mobile chips.
  • sam_ - Monday, December 16, 2019 - link

    Any strong reason for these tests being compiled with -mcpu=cortex-a53 on Android/Linux?

    One might expect for SoCs with 8.2 on all cores there may be some uplift from at least targeting cortex-a55, if not cortex-a75?

    When you're expecting to run on a big core, forcing the compiler to target a in-order core which can only execute one ASIMD instruction per cycle seems likely to restrict the perf (unrolling insufficiently etc.). Certainly seems a bit unfair for aarch64 vs. x64 comparison, and probably makes the apple SoCs look better too (assuming XCode isn't targeting a LITTLE core by default). It also likely makes newer bigger cores look worse than they should vs. older cores with smaller OoO windows.

    I get not wanting to target compilation to every CPU individually, but would be interesting to know how much of an effect this has; perhaps this could contribute to the expected IPC gains for FP not being achieved?
  • Andrei Frumusanu - Monday, December 16, 2019 - link

    The tuning models only have very minor impact on the performance results. Whilst using the respective models for each µarch can give another 1-1.5% boost in some tests, as an overall average across all micro-architectures I found that giving the A53 model results in the highest performance. This is compared to not supplying any model at all, or using the common A57 model.

    The A55 model just points to the A53 scheduling model, so they're the same.
  • sam_ - Monday, December 16, 2019 - link

    Hmm, I took a look at LLVM and the scheduling model is indeed the same for A53 and A55, but A55 should enable instruction generation for the various extensions introduced since v8.0. I can believe that for spec 2006 8.1 atomics/SQRDMLAH/fp16/dot product/etc. instructions don't get generated.

    It looks like not much attention has been paid to tweaking the LLVM backend for more recent big cores than A57, beyond getting the features right for instruction generation, so I can believe cortex-a53 still ends up within a couple of percent of more specific tuning. Probably means there's more work to be done on LLVM.

    If it is easy to test I think it would be interesting to try cortex-a57, or maybe exynos-m4 tuning on a77 because these targets do seem to unroll more aggressively than other cortex-X targets with the current LLVM backend.
    I made a toy example on godbolt: https://godbolt.org/z/8i9U5- , though for this particular loop I think a77 would have the vector integer MLA unit saturated with unroll by 2 (and is probably memory bound!), still the other targets would seem more predisposed to exposing instruction level parallelism.
  • Andrei Frumusanu - Tuesday, December 17, 2019 - link

    I pointed out to Arm that there's not much optimisations going on in terms of the models, but they said that they're not putting a lot of effort into that, and instead trying to optimise the general Arm64 target.

    I tested the A57 targets in the past, I'll have a look again on things like the M4 tuning over the coming months as I finally get to port SPEC2017.
  • Quantumz0d - Monday, December 16, 2019 - link

    Sigh another comment on the x86 vs A series. Why dont people understand running an x86 code on ARM will have a massive impact in performance ? How do people think a fanless BGA processor with sub 10W design beat an x86 in realworld just because it has Muh Benchwarrior ? There are so many possible workloads from SIMD, HT/SMT, ALU.

    Having scalability is also the key. Look at x86 AMD and Intel how they do it by making a Large Wafer and having multi SKUs with LGA/PGA (AM4) sockets allowing for maximum robustness.

    ARM is all about efficiency and economical bandwidth and it won't scale like x86 for all workloads. If you add AVX its dead. And Freq scaling with HT/SMT. Add the TSMC N7 which is only fit for mobile SoCs. Ryzen don't scale much into clocks because of this limitation.

    ARM is always Custom if you see as per Vendor. Its bad. Look at MediaTek trash no GPL policy. Huawei as well. Except QcommCAF and Exynos. Its a shame that TI OMAP left.
  • Andrei Frumusanu - Monday, December 16, 2019 - link

    > Why dont people understand running an x86 code on ARM will have a massive impact in performance ?

    Nobody even mentioned anything regarding this, you're going off on a nonsensical rant yet again. For once, please keep the comments section level-headed.
  • Quantumz0d - Monday, December 16, 2019 - link

    What ? Its a genuine point. ARM based 8c processors Windows machines like Surface Pro X can only emulate 32bit x86 code. 64bit isnt here and running both emulation will have am impact (slow) That's what I mean. They need native code to run and rival.

    Rant ? Benches = Realworld right. How come a user is able to see an OP7 Pro breeze through and not lag and offer shitty performance vs an iPhone ? I saw with my own OP3 downclocked on Sultan ROM due to the high clockspeed bug on 82x platform not just me, So many other users. GB score and benches do not only mean performance esp in ARM arena.

    Except for bragging rights, This is pure Whiteknighting.

Log in

Don't have an account? Sign up now