System Performance

System performance on the QRD865 was a bit of a tricky topic, as we’ve seen that the same chipset can differ quite a lot depending on the software implementation done by the vendor. For the performance preview this year, Qualcomm again integrated a “Performance” mode on the test devices, alongside the default scheduler and DVFS behaviour of the BSP delivered to vendors.

There’s a fine line between genuine “Performance” modes, as implemented on commercial devices from vendors such as Samsung and Huawei, which tune the DVFS and scheduler settings to increase performance while remaining reasonable in their aggressiveness, and more absurd “cheating” performance modes, such as the one implemented by OPPO, which simply ramp up the minimum frequencies of the chip.

Qualcomm’s performance mode on the QRD865 walks this fine line: it’s extremely aggressive in that it ramps the chipset up to maximum frequency in ~30ms. It also has the little cores start at a notably higher frequency than in the default mode. Nevertheless, it’s still a legitimate operation mode, although I do not expect very many devices to be configured in this way.
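To put the ~30ms figure in context, a step-wise frequency ramp can be modelled in a few lines of Python. This is only a toy sketch and not Qualcomm's actual governor: the step sizes, sampling intervals, and start and peak frequencies below are all illustrative assumptions, and real schedulers pick frequencies from utilisation signals rather than fixed steps.

```python
def ramp_time_ms(f_start_mhz, f_max_mhz, step_mhz, interval_ms):
    """Time for a hypothetical step-wise governor to reach f_max_mhz.

    Assumes sustained full load, so the governor raises the frequency by
    step_mhz at every sampling interval until it hits the maximum.
    """
    if step_mhz <= 0 or interval_ms <= 0:
        raise ValueError("step_mhz and interval_ms must be positive")
    freq = f_start_mhz
    steps = 0
    while freq < f_max_mhz:
        freq = min(freq + step_mhz, f_max_mhz)  # clamp at the peak frequency
        steps += 1
    return steps * interval_ms


# All numbers below are made-up placeholders, not measured QRD865 values.
aggressive = ramp_time_ms(f_start_mhz=300, f_max_mhz=2800, step_mhz=500, interval_ms=6)
conservative = ramp_time_ms(f_start_mhz=300, f_max_mhz=2800, step_mhz=250, interval_ms=10)
higher_floor = ramp_time_ms(f_start_mhz=1300, f_max_mhz=2800, step_mhz=500, interval_ms=6)

print(f"aggressive profile:   {aggressive} ms to peak")
print(f"conservative profile: {conservative} ms to peak")
print(f"higher start point:   {higher_floor} ms to peak")
```

In this model, larger steps, shorter sampling intervals, and a higher starting frequency all shorten the time to peak, which is the kind of tuning that separates an aggressive configuration from a conservative one.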

The default mode on the other hand is quite similar to what we’ve seen on the Snapdragon 855 QRD last year, but the issue is that this was also rather conservative and many popular devices such as the Galaxy S10 were configured to be more aggressive. Whilst the default config of the QRD865 should be representative of most devices next year, I do expect many of them to do better than the figures represented by this config.

PCMark Work 2.0 - Web Browsing 2.0

Starting off with the web browsing test, we’re seeing a big difference in performance scaling between the two chipsets. The test here is mostly sensitive to the performance scaling of the A55 cores. The QRD865 in the default mode is more conservative than some existing S855 devices, which is why it performs worse in those situations. On the other hand, the QRD865 in Performance mode is extremely aggressive here and achieves the best results amongst our current device range. I expect commercial devices to fall somewhere between the two extremes.

PCMark Work 2.0 - Video Editing

The video editing test nowadays is no longer performance sensitive and most devices fall in the same result range.

PCMark Work 2.0 - Writing 2.0

The writing test is amongst the most important and representative of daily performance of a device, and here the QRD865 does well in both configurations. The Mate 30 Pro with the Kirin 990 is the only other competitive device at this performance level.

PCMark Work 2.0 - Photo Editing 2.0

The Photo Editing test makes use of RenderScript and GPU acceleration, and here it seems the new QRD865 makes some big improvements. Performance is a step-function higher than previous generation devices.

PCMark Work 2.0 - Data Manipulation

Finally, the data manipulation test oddly enough falls in the middle of the pack for both performance modes. I’m not too sure as to why this is, but we’ve seen the test being quite sensitive to scheduler or even OS configurations.

PCMark Work 2.0 - Performance

Generally, the QRD865 phone landed at the top of the rankings in PCMark.

Web Benchmarks

Speedometer 2.0 - OS WebView

WebXPRT 3 - OS WebView

JetStream 2 - OS WebView

The web benchmark results presented here were somewhat disappointing. The QRD865 really didn’t manage to differentiate itself from the rest of the Android pack even though it was supposed to be roughly 20-25% ahead in theory. I’m not sure what the limitation here is, but the 5-10% increases are well below what we had hoped for. For now, it seems like the performance gap to Apple’s chips remains significant.
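The gap between expectation and measurement is simple arithmetic, sketched below in Python. The scores are hypothetical placeholders chosen only to illustrate the calculation; they are not results taken from the charts.

```python
def uplift_percent(new_score, old_score):
    """Relative improvement of new_score over old_score, in percent."""
    if old_score <= 0:
        raise ValueError("old_score must be positive")
    return (new_score / old_score - 1.0) * 100.0


# Hypothetical scores, normalised so the previous generation is 100.
previous_gen = 100.0
measured = 107.5  # assumed value inside the 5-10% range quoted in the text
expected_low, expected_high = 20.0, 25.0  # theoretical uplift range, in percent

measured_uplift = uplift_percent(measured, previous_gen)
shortfall_low = expected_low - measured_uplift
shortfall_high = expected_high - measured_uplift

print(f"measured uplift: {measured_uplift:.1f}%")
print(f"shortfall vs. expectation: {shortfall_low:.1f} to {shortfall_high:.1f} percentage points")
```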

System Performance Conclusion

Overall, we expect system performance of Snapdragon 865 devices to be excellent. Commercial devices will likely differ somewhat in terms of their scores as I do not expect them to be configured exactly the same as the QRD865. I was rather disappointed with the web benchmarks as the improvements were quite meagre – in hindsight it might be a reason as to why Arm didn’t talk about them at all during the Cortex-A77 launch.

178 Comments

  • Bulat Ziganshin - Monday, December 16, 2019 - link

    The Spec2006 tables show that A13 has performance similar to x86 desktop chips, which may be considered a revolution. Can you please add frequencies of the chips (both x86 and Apple) too, at least some estimations? Also, what are the memory configs (freq/CAS/...)? It will be also interesting to see x86 chips in individual SPEC benchmarks so we can analyze what are the weak and strong points of Apple architecture.
  • Andrei Frumusanu - Monday, December 16, 2019 - link

    The Apple chips are running near their peak frequencies, with some subtests being slightly throttled due to power. The 9900K was at 5GHz, the 3950X at 4.6-4.65GHz, 3200CL16 on the desktop parts.

    I added the detailed overview over all chips; here's it again: https://images.anandtech.com/doci/15207/SPEC2006_o...
  • unclevagz - Monday, December 16, 2019 - link

    It would be nice if some contemporary x86 laptop chips could be added to that list (Ryzen/Ice Lake/Coffee Lake...) just for ease of comparison between ARM and x86 mobile chips.
  • sam_ - Monday, December 16, 2019 - link

    Any strong reason for these tests being compiled with -mcpu=cortex-a53 on Android/Linux?

    One might expect for SoCs with 8.2 on all cores there may be some uplift from at least targeting cortex-a55, if not cortex-a75?

    When you're expecting to run on a big core, forcing the compiler to target a in-order core which can only execute one ASIMD instruction per cycle seems likely to restrict the perf (unrolling insufficiently etc.). Certainly seems a bit unfair for aarch64 vs. x64 comparison, and probably makes the apple SoCs look better too (assuming XCode isn't targeting a LITTLE core by default). It also likely makes newer bigger cores look worse than they should vs. older cores with smaller OoO windows.

    I get not wanting to target compilation to every CPU individually, but would be interesting to know how much of an effect this has; perhaps this could contribute to the expected IPC gains for FP not being achieved?
  • Andrei Frumusanu - Monday, December 16, 2019 - link

    The tuning models only have very minor impact on the performance results. Whilst using the respective models for each µarch can give another 1-1.5% boost in some tests, as an overall average across all micro-architectures I found that giving the A53 model results in the highest performance. This is compared to not supplying any model at all, or using the common A57 model.

    The A55 model just points to the A53 scheduling model, so they're the same.
  • sam_ - Monday, December 16, 2019 - link

    Hmm, I took a look at LLVM and the scheduling model is indeed the same for A53 and A55, but A55 should enable instruction generation for the various extensions introduced since v8.0. I can believe that for spec 2006 8.1 atomics/SQRDMLAH/fp16/dot product/etc. instructions don't get generated.

    It looks like not much attention has been paid to tweaking the LLVM backend for more recent big cores than A57, beyond getting the features right for instruction generation, so I can believe cortex-a53 still ends up within a couple of percent of more specific tuning. Probably means there's more work to be done on LLVM.

    If it is easy to test I think it would be interesting to try cortex-a57, or maybe exynos-m4 tuning on a77 because these targets do seem to unroll more aggressively than other cortex-X targets with the current LLVM backend.
    I made a toy example on godbolt: https://godbolt.org/z/8i9U5- , though for this particular loop I think a77 would have the vector integer MLA unit saturated with unroll by 2 (and is probably memory bound!), still the other targets would seem more predisposed to exposing instruction level parallelism.
  • Andrei Frumusanu - Tuesday, December 17, 2019 - link

    I pointed out to Arm that there aren't many optimisations going on in terms of the models, but they said that they're not putting a lot of effort into that, and are instead trying to optimise the general Arm64 target.

    I tested the A57 targets in the past, I'll have a look again on things like the M4 tuning over the coming months as I finally get to port SPEC2017.
  • Quantumz0d - Monday, December 16, 2019 - link

    Sigh another comment on the x86 vs A series. Why dont people understand running an x86 code on ARM will have a massive impact in performance ? How do people think a fanless BGA processor with sub 10W design beat an x86 in realworld just because it has Muh Benchwarrior ? There are so many possible workloads from SIMD, HT/SMT, ALU.

    Having scalability is also the key. Look at x86 AMD and Intel how they do it by making a Large Wafer and having multi SKUs with LGA/PGA (AM4) sockets allowing for maximum robustness.

    ARM is all about efficiency and economical bandwidth and it won't scale like x86 for all workloads. If you add AVX its dead. And Freq scaling with HT/SMT. Add the TSMC N7 which is only fit for mobile SoCs. Ryzen don't scale much into clocks because of this limitation.

    ARM is always Custom if you see as per Vendor. Its bad. Look at MediaTek trash no GPL policy. Huawei as well. Except QcommCAF and Exynos. Its a shame that TI OMAP left.
  • Andrei Frumusanu - Monday, December 16, 2019 - link

    > Why dont people understand running an x86 code on ARM will have a massive impact in performance ?

    Nobody even mentioned anything regarding this, you're going off on a nonsensical rant yet again. For once, please keep the comments section level-headed.
  • Quantumz0d - Monday, December 16, 2019 - link

    What ? Its a genuine point. ARM based 8c processors Windows machines like Surface Pro X can only emulate 32bit x86 code. 64bit isnt here and running both emulation will have an impact (slow) That's what I mean. They need native code to run and rival.

    Rant ? Benches = Realworld right. How come a user is able to see an OP7 Pro breeze through and not lag and offer shitty performance vs an iPhone ? I saw with my own OP3 downclocked on Sultan ROM due to the high clockspeed bug on 82x platform not just me, So many other users. GB score and benches do not only mean performance esp in ARM arena.

    Except for bragging rights, This is pure Whiteknighting.
