SPEC2006 Performance: Reaching Desktop Levels

It’s been a while now since we attempted SPEC on an iOS device – for various reasons we weren’t able to continue with that over the last few years. I know a lot of people were looking forward to us picking back up where we left off, and I’m happy to share that I’ve spent some time getting a full SPEC2006 harness working again.

SPEC2006 is an important industry-standard benchmark and differentiates itself from other workloads in that the datasets it works on are significantly larger and more complex. While GeekBench 4 has established itself as a popular benchmark in the industry – and I do praise the effort put into offering a fully cross-platform benchmark – one does have to take into account that it’s still relatively on the light side in terms of program sizes and the data sizes of its workloads. As such, SPEC2006 is a much more representative benchmark that exposes more details of a given microarchitecture, especially with regard to memory subsystem performance.

The following SPEC figures are declared as estimates, as they were not submitted and officially validated by SPEC. The benchmark libraries were compiled with the following settings:

  • Android: Toolchain: NDK r16 LLVM compiler, Flags: -Ofast, -mcpu=cortex-A53
  • iOS: Toolchain: Xcode 10, Flags: -Ofast

On iOS, 429.mcf was a problem case as the kernel memory allocator generally refuses to allocate the single large 1.8GB chunk that the program requires (even on the new 4GB iPhones). I’ve modified the benchmark to use only half the number of arcs, thus reducing the memory footprint to roughly 1GB. The reduction in runtime has been measured on several platforms and I’ve applied a similar scaling factor to the iOS score – which I estimate to be accurate to within ±5%. The remaining workloads were manually verified and validated for correct execution.
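The scaling step described above can be sketched roughly as follows. This is only an illustration and not the harness that was actually used: the function names and every runtime below are hypothetical, and it assumes the factor is simply the mean full-to-reduced runtime ratio seen on platforms able to run both variants.

```python
from statistics import mean

def scaling_factor(reference_runs):
    """Mean ratio of full-dataset runtime to reduced-dataset runtime.

    reference_runs: (full_seconds, reduced_seconds) pairs measured on
    platforms that can run both variants of the workload.
    """
    return mean(full / reduced for full, reduced in reference_runs)

def estimate_full_runtime(reduced_seconds, factor):
    """Scale a reduced-dataset runtime up to an estimated full runtime."""
    return reduced_seconds * factor

# Hypothetical (full, reduced) runtimes in seconds on three platforms.
reference_runs = [(400.0, 200.0), (630.0, 300.0), (475.0, 250.0)]
factor = scaling_factor(reference_runs)

# Largest relative deviation of any single platform from the mean factor;
# a crude indication of how far off the estimate could be.
spread = max(abs(full / reduced - factor) / factor
             for full, reduced in reference_runs)

# Hypothetical reduced-dataset runtime on the device that cannot run the
# full dataset.
ios_estimate = estimate_full_runtime(150.0, factor)
print(f"factor {factor:.2f}, spread {spread:.1%}, estimate {ios_estimate:.0f} s")
```

The spread between platforms is what bounds the error of such an estimate, which is why the factor needs to be measured on more than one platform.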

The performance measurement was run in a synthetic environment (read: bench fan cooling the phones) where we ensured thermals wouldn’t be an issue for the 1-2 hours it takes to complete a full suite run.

In terms of data presentation, I’m following the format of earlier articles this year, such as the Snapdragon 845 and Exynos 9810 evaluation in our Galaxy S9 review.

When measuring performance and efficiency, it’s important to take three metrics into account. The first, evidently, is the performance and runtime of a benchmark, which in the graphs below is represented on the right axis, with the bars growing from the right. Here, the bigger the figure, the better the SoC/CPU has performed. The labels represent the SPECspeed scores.

On the left axis, the bars represent the energy usage for the given workload. The bars grow from the left, and a longer bar means more energy used by the platform. A platform is more energy efficient when the bars are shorter, meaning less energy used. The labels show the average power used in Watts, which is still an important secondary metric to take into account in thermally constrained devices, as well as the total energy used in Joules, which is the primary efficiency metric.
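To make the relationship between the three metrics concrete, here is a small sketch. A SPEC CPU2006 speed ratio is the reference time divided by the measured runtime, and total energy is average power multiplied by runtime. The devices, runtimes, power figures and the reference time below are all made-up placeholders, not measurements from this review.

```python
def spec_ratio(reference_seconds, measured_seconds):
    """SPEC speed ratio: reference time divided by measured runtime."""
    return reference_seconds / measured_seconds

def energy_joules(average_watts, runtime_seconds):
    """Total energy is average power multiplied by runtime."""
    return average_watts * runtime_seconds

# Placeholder reference time, not a real SPEC value.
REFERENCE_S = 9000.0

# Hypothetical runs of the same workload on two devices.
devices = {
    "A": {"runtime_s": 300.0, "avg_power_w": 4.0},
    "B": {"runtime_s": 600.0, "avg_power_w": 2.5},
}

for name, d in devices.items():
    score = spec_ratio(REFERENCE_S, d["runtime_s"])
    joules = energy_joules(d["avg_power_w"], d["runtime_s"])
    print(f"device {name}: score {score:.1f}, "
          f"{d['avg_power_w']:.2f} W, {joules:.0f} J")
```

In this made-up case device A draws more power yet uses less energy, because it finishes in half the time. That is why Joules serve as the primary efficiency metric while Watts matter mostly for thermals.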

The data is ordered as in the legend, colour-coded by SoC vendor and shaded by generation. I’ve kept the data to the Apple A12, A11, Exynos 9810 (at 2.7 and 2.3GHz), Exynos 8895, Snapdragon 845 and Snapdragon 835. This gives us an overview of all relevant CPU microarchitectures over the last two years.

Starting off with the SPECint2006 workloads:

The A12 clocks in at 5% higher than the A11 in most workloads. However, we have to keep in mind that we can’t really lock the frequencies on iOS devices, so this is just an assumption about the runtime clocks during the benchmarks. In SPECint2006, the A12 performed an average of 24% better than the A11.
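For readers wondering how a single average gain can be formed from per-workload results: SPEC's own overall metrics are geometric means of the per-benchmark ratios. The sketch below assumes that convention. The workload names and scores are hypothetical and are not figures from this review.

```python
from statistics import geometric_mean

# Hypothetical per-workload scores for two chips.
old_chip = {"workload_1": 20.0, "workload_2": 30.0, "workload_3": 25.0}
new_chip = {"workload_1": 22.0, "workload_2": 42.0, "workload_3": 31.0}

def overall_score(scores):
    """Aggregate per-workload scores with a geometric mean."""
    return geometric_mean(list(scores.values()))

def overall_gain(new, old):
    """Relative improvement of the aggregate score (0.10 means +10%)."""
    return overall_score(new) / overall_score(old) - 1.0

gain = overall_gain(new_chip, old_chip)
print(f"overall gain: {gain:.1%}")
```

A geometric mean keeps one outlier workload from dominating the aggregate, and the ratio of two geometric means equals the geometric mean of the per-workload ratios.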

The smallest increases are seen in 456.hmmer and 464.h264ref – these are the two most execution-bottlenecked tests in the suite. As the A12 seemingly did not have any major changes in this regard, the small increase can mainly be attributed to the higher frequency as well as the improvements in the cache hierarchy.

The improvements in 445.gobmk are quite large at 27% – the workload is characterised by bottlenecks in store address events as well as branch mispredictions. I did measure a major change in the way the A12 handles stores across cache lines, and since I’m not seeing significant changes in branch predictor accuracy, the store handling is the more likely source of the gain.

403.gcc in part, and even more so 429.mcf, 471.omnetpp, 473.astar and 483.xalancbmk, are sensitive to the memory subsystem, and this is where the A12 has astounding performance gains of 30 to 42%. It’s clear that the new cache hierarchy and memory subsystem have greatly paid off here, as Apple was able to pull off one of the biggest performance jumps in recent generations.

When looking at power efficiency, the A12 has improved by 12% overall – but we have to remember that we’re talking about 12% less energy at peak performance. The A12 showcasing 24% better performance means we’re comparing two very different points on the performance/power curves of the two SoCs.

In the benchmarks where the performance gains were the largest – the aforementioned memory-limited workloads – we saw power consumption rise quite significantly. So even though 7nm promised power gains, Apple has opted to spend more power than the new process node has saved, and average power across the totality of SPECint2006 went up from ~3.36W on the A11 to ~3.64W on the A12.
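As a back-of-the-envelope check of how higher power and lower energy fit together, the sketch below uses only the SPECint2006 figures quoted above. It assumes energy is average power times runtime and that runtime scales inversely with the aggregate score. That is a simplification, since per-workload averages do not combine exactly this way.

```python
def relative_energy(power_ratio, performance_ratio):
    """Energy ratio new/old, assuming E = P * t and t ~ 1 / performance."""
    return power_ratio / performance_ratio

# Figures quoted in the text: ~3.36 W -> ~3.64 W average power,
# and 24% higher SPECint2006 performance.
power_ratio = 3.64 / 3.36
performance_ratio = 1.24

energy_ratio = relative_energy(power_ratio, performance_ratio)
print(f"power: {power_ratio - 1:+.1%}, energy: {energy_ratio - 1:+.1%}")
```

Under these assumptions, roughly 8% more power combined with 24% more performance works out to 12-13% less energy, which lines up with the 12% efficiency improvement stated above.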

Moving on to SPECfp2006, we are looking at the C and C++ benchmarks, as we have no Fortran compiler in Xcode, and it is incredibly complicated to get one working for Android as it’s not part of the NDK, which only has a deprecated version of GCC.

SPECfp2006 has a lot more tests that are very memory intensive – out of the 7 tests, only 444.namd, 447.dealII, and 453.povray don’t see major performance regressions if the memory subsystem isn’t up to par.

Of course, this heavily favours the A12, as the average gain for SPECfp is 28%. 433.milc absolutely stands out here with a massive 75% gain in performance. The benchmark is characterised by being instruction store limited – again an area of the Vortex µarch where I saw a great improvement. The same analysis applies to 450.soplex – a combination of the superior cache hierarchy and memory store performance improves performance by 42%.

470.lbm is an interesting workload for the Apple CPUs as they showcase multi-factor performance advantages over competing Arm and Samsung cores. Qualcomm’s Snapdragon 820 Kryo CPU, oddly enough, still outperforms the recent Android SoCs. 470.lbm is characterised by extremely large loops in the hottest piece of code. Microarchitectures can optimise such workloads by having (larger) instruction loop buffers, where on a loop iteration the core bypasses the decode stages and fetches the instructions from the buffer. It seems that Apple’s microarchitecture has some such mechanism. The other explanation is the vector execution performance of the Apple cores – lbm’s hot loop makes heavy use of SIMD, and Apple’s 3x execution throughput advantage is also likely a heavy contributor to the performance.

Similar to SPECint, the SPECfp workloads which saw the biggest performance jumps also saw an increase in their power consumption. 433.milc saw an increase from 2.7W to 4.2W, again with a 75% performance increase.

Overall, the power consumption has seen a jump from 3.65W up to 4.27W. The overall energy efficiency has increased in all tests but 482.sphinx3, where the power increase hit the maximum across all SPEC workloads for the A12 at 5.35W. The total energy used for SPECfp2006 on the A12 is 10% lower than on the A11.

I didn’t have time to go back and measure the power for the A10 and A9, but generally they’re in line around 3W for SPEC. I did run the performance benchmarks, and here’s an aggregate performance overview of the A9 through to the A12 along with the most recent Android SoCs, for those who are looking into comparing past Apple generations.

Overall, the new A12 Vortex cores and the architectural improvements to the SoC’s memory subsystem give Apple’s new piece of silicon a much higher performance advantage than Apple’s marketing materials promote. The contrast to the best that Android SoCs have to offer is extremely stark – both in terms of performance and in power efficiency. Apple’s SoCs have better energy efficiency than all recent Android SoCs while having a nearly 2x performance advantage. I wouldn’t be surprised if, were we to normalise for energy used, Apple had a 3x performance lead.

This also gives us a great piece of context for Samsung’s M3 core, which was released this year: the argument that higher power consumption brings higher performance only makes sense when the total energy is kept in check. Here the Exynos 9810 uses twice the energy of last year’s A11 – at a 55% performance deficit.

Meanwhile, Arm’s Cortex A76 is scheduled to arrive inside the Kirin 980 as part of the Huawei Mate 20 in just a couple of weeks – and I’ll be making sure we give the new flagship a proper examination and place it among current SoCs in our performance and efficiency graphs.

What is quite astonishing is just how close Apple’s A11 and A12 are to current desktop CPUs. I haven’t had the opportunity to run things in a more comparable manner, but taking the figures our server editor, Johan De Gelas, published earlier this summer, we see that the A12 outperforms a moderately-clocked Skylake CPU in single-threaded performance. Of course there are compiler considerations and various frequency concerns to take into account, but we’re now talking about very small margins until Apple’s mobile SoCs outperform the fastest desktop CPUs in terms of ST performance. It will be interesting to get more accurate figures on this topic in the coming months.


253 Comments


  • eastcoast_pete - Sunday, October 7, 2018

    Apple's strength (supremacy) in the performance of their SoCs really lies in the fine-tuned match of apps and especially low-level software that make good use of excellent hardware. What happens when that doesn't happen was outlined in detail by Andrei in his reviews of Samsung's Mongoose M3 SoC - to use a famous movie line, it "could've been a contender", but really isn't. Apple's tight integration is the key factor that a more open ecosystem (Android) has a hard time matching; however, Google and (especially) Qualcomm leave a lot of possible performance improvements on the table through really poor collaboration; for example, GPU-assisted computing is AWOL for Android - not a smart move when you try to compete against Apple.
  • varase - Tuesday, October 23, 2018

    I have serious doubts that Android would even run on an A12 SoC - I thought Apple trashed ARMv7 when it went to A11.
  • Strafeb - Saturday, October 6, 2018

    It would be interesting to see comparison of screen efficiency of iPhone XR's low res LCD screen, and also some of LG's pOLED screens like in V40.
  • Alistair - Saturday, October 6, 2018

    The Xeon Platinum 8176 is a 28 core, $9000 Intel server CPU, based on Skylake. In single threaded performance, the iPhone XS outperforms it by 12 percent for integers, despite its lower clock speed. If the iPhone were to run at 3.8GHz, the Apple A12 would outperform Intel's CPU by 64 percent on average for integer tests.

    iPhone XS and A12 numbers from: https://www.anandtech.com/show/13392/the-iphone-xs...

    Xeon numbers from: https://www.anandtech.com/show/12694/assessing-cav...

    spreadsheet: https://docs.google.com/spreadsheets/d/1ipKIh4i56o...

    image of chart: https://i.imgur.com/IAupi9p.jpg

    Think about that, the iPhone's CPU IPC (performance per clock) is already higher in integer performance now. Those tests include: spam filter, compression, compiling, vehicle scheduling, game ai, protein seq. analyses, chess, quantum simulation, video encoding, network sim, pathfinding, and xml processing. Test takes hours to run.
  • SanX - Saturday, October 6, 2018

    Yes, and while Apple and all other mobile processor manufacturers charge $5 per core, Intel $300
  • yeeeeman - Saturday, October 6, 2018

    It might be faster in single thread, but in MT it gets toasted by the Xeon. The Xeon is 9000$ for a few reasons:
    - it is an enterprise chip;
    - it supports ecc;
    - it supports up to 8 cpus on a board;
    - it supports tons of ram, a LOT of memory channels;
    - it has almost 40MB of L3 cache, compared to 8MB in the A12;
    - it has a ring bus architecture meaning all those cores have very low latency between them and to memory;
    - it has CISC instructions, meaning that when you get out of basic phone apps and you start doing scientific/database/HPC stuff, you will see a lot of benefits and performance improvements from executing a single instruction for a specific operation, compared to the RISC nature of A12;
    - it supports AVX512, needed for high performance computing. In this, the A12 would get smashed;
    - and many more;
    So the Xeon 8180 is still a mighty impressive chip and Intel has invested some real thought and experience into making it. Things that Apple doesn't have.
    I get it, it is nice to see Apple having a chip with this much compute power at such a low TDP, and it is due to the fact that x86 chips have a lot of extra stuff added in for legacy. But don't get carried away with this; what Apple is doing now from a uArch point of view is not new. Desktop chips had this stuff 15 years ago. The difference is that Apple works on the latest fabrication process and doesn't care about x86 legacy.
  • Alistair - Saturday, October 6, 2018

    "It might be faster in single thread, but in MT it gets toasted by the Xeon"

    That is totally irrelevant. Obviously Apple could easily make a chip with more cores. Just like Cavium's Thunder. 8 x A12 Vortex cores would beat an 8 core Xeon in integer calculations easily enough.
  • eastcoast_pete - Sunday, October 7, 2018

    Agree on your points re. the XEON. However, I'd still like to see Apple launch CPUs/iGPUs based on their design especially in the laptop space, where Intel still rules and charges premium prices. If nothing else, Apple getting into that game would fan the flames under Intel's chair that AMD is trying to kindle (started to work for desktop CPUs). In the end, we all benefit if Chipzilla either gets off its enormous bottom(line) and innovates more, or gets pushed to the side by superior tech. So, even as a non-Apple user: go Apple, go!
  • Constructor - Sunday, October 7, 2018

    - it has CISC instructions, meaning that when you get out of basic phone apps and you start doing scientific/database/HPC stuff, you will see a lot of benefits and performance improvements from executing a single instruction for a specific operation, compared to the RISC nature of A12;

    CISC instructions generally don't really do much more than RISC ones do – they just have more addressing modes while RISC is almost always register-to-register with separate Load & Store.

    That just doesn't make any difference any more because the bottleneck is not instruction fetching (as it once was in the old times) but actually execution unit pipeline congestion, including that of the Load & Store units.

    - it supports AVX512, needed for high performance computing. In this, the A12 would get smashed;

    There's already a scalable vector extension for ARM which Apple could adopt if that was actually a bottleneck. And even the existing vector units aren't anything to scoff at – the issue is more that Intel CPUs are forced to drop down to half their nominal clock once you actually use AVX512; it could actually be more efficient to optimize the regular vector units for full-speed operation to make up for it.

    So the Xeon 8180 is still a mighty impressive chip and Intel has invested some real thought and experience into making it. Things that Apple doesn't have.

    We actually have no clue what Apple is investing in behind closed doors until they slam it on the table as a finished product ready for sale!
  • tipoo - Thursday, October 18, 2018

    I'm hoping Apple takes the ARM switch as an opportunity to bring an ARM AVX-512 equivalent down to more products, like the iMac.
