SoC Analysis: CPU Performance

Now that we’ve looked at A9X’s design and touched on the differences between the x86 and ARM ISAs, let’s examine A9X’s performance at a lower level.

From a CPU perspective, A9X is just a higher-clocked implementation of the dual-core Twister CPU design we first saw in A9 last year. As a result, the fundamentals of the CPU architecture have not changed relative to A9. However, relative to A8X, A9X drops from three CPU cores to two, so one of the factors we’ll want to look at is how Apple has been affected by moving to two faster cores.

We’ll start things off with Geekbench 3, which gives us a fairly low-level look at CPU performance.

Geekbench 3 - Integer Performance
Test              A9X             A8X             % Advantage
AES ST            1.17 GB/s       0.98 GB/s       19%
AES MT            2.85 GB/s       3.16 GB/s       -10%
Twofish ST        120.7 MB/s      64.0 MB/s       89%
Twofish MT        228.3 MB/s      182.7 MB/s      25%
SHA1 ST           1.03 GB/s       0.53 GB/s       94%
SHA1 MT           1.95 GB/s       1.48 GB/s       32%
SHA2 ST           205.8 MB/s      119.1 MB/s      73%
SHA2 MT           395.5 MB/s      330.6 MB/s      20%
BZip2Comp ST      8.95 MB/s       5.71 MB/s       57%
BZip2Comp MT      17.0 MB/s       16.6 MB/s       2%
Bzip2Decomp ST    14.7 MB/s       8.98 MB/s       64%
Bzip2Decomp MT    28.1 MB/s       25.2 MB/s       12%
JPG Comp ST       33.7 MP/s       20.6 MP/s       64%
JPG Comp MT       64.4 MP/s       60.8 MP/s       6%
JPG Decomp ST     89.2 MP/s       53.0 MP/s       68%
JPG Decomp MT     166.5 MP/s      153.9 MP/s      8%
PNG Comp ST       2.11 MP/s       1.35 MP/s       56%
PNG Comp MT       4.04 MP/s       3.82 MP/s       6%
PNG Decomp ST     31.5 MP/s       18.7 MP/s       68%
PNG Decomp MT     56.9 MP/s       56.3 MP/s       1%
Sobel ST          138.3 MP/s      82.5 MP/s       68%
Sobel MT          258.7 MP/s      225.6 MP/s      15%
Lua ST            3.25 MB/s       1.68 MB/s       93%
Lua MT            6.02 MB/s       4.60 MB/s       31%
Dijkstra ST       10.1 Mpairs/s   6.70 Mpairs/s   51%
Dijkstra MT       17.6 Mpairs/s   16.0 Mpairs/s   10%

The interesting thing about Geekbench is that as a result of being a lower-level test the bulk of its tests scale up well with CPU core counts, as the benchmark can just spawn more threads. Consequently I wasn’t entirely sure what to expect here, as this presents the tri-core A8X with a much better than average scaling opportunity, making it especially harsh on the A9X.
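
To make the scaling point concrete, here is a minimal Python sketch that divides a multi-threaded result by its single-threaded counterpart. The function name `mt_scaling` is my own and not part of Geekbench; the Twofish figures are copied from the integer table above.

```python
def mt_scaling(single_thread: float, multi_thread: float) -> float:
    """Return how many times faster the multi-threaded run is than the single-threaded run."""
    return multi_thread / single_thread


# Twofish throughput in MB/s, copied from the integer table above
a9x_scaling = mt_scaling(120.7, 228.3)  # two Twister cores
a8x_scaling = mt_scaling(64.0, 182.7)   # three Typhoon cores

print(f"A9X: {a9x_scaling:.2f}x on 2 cores, A8X: {a8x_scaling:.2f}x on 3 cores")
```

A value close to the core count suggests the sub-test is spreading its work across every core, which is the situation that favors the tri-core A8X.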

But what the results show us is that even by dropping back down to two CPU cores, A9X does very well overall. The single-threaded results are greatly improved, with A9X offering better than a 50% single-threaded perf gain in the majority of the sub-tests. Meanwhile even with the multi-threaded tests, A9X only loses once, on AES. Otherwise two higher clocked Twister cores are beating three lower clocked Typhoon cores by anywhere between a few percent up to 32%. In this sense Geekbench is something of a worst-case scenario, as real-world software rarely benefits from additional cores this well (this being part of the reason why A8 and A9 did so well relative to quad Cortex-A57 designs), so it’s promising to see that even in this worst-case scenario A9X can deliver meaningful performance gains over A8X.
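
For readers who want to check the "% Advantage" column, the following sketch shows one way it can be reproduced, assuming it is simply the A9X result divided by the A8X result, minus one. The helper name `percent_advantage` is hypothetical, and only a few rows from the table are included.

```python
def percent_advantage(new: float, old: float) -> float:
    """Relative advantage of `new` over `old`, expressed in percent."""
    return (new / old - 1.0) * 100.0


# (A9X, A8X) throughput pairs copied from the integer table above
results = {
    "AES ST": (1.17, 0.98),        # GB/s
    "AES MT": (2.85, 3.16),        # GB/s
    "Twofish ST": (120.7, 64.0),   # MB/s
    "Twofish MT": (228.3, 182.7),  # MB/s
}

for name, (a9x, a8x) in results.items():
    print(f"{name:12s} {percent_advantage(a9x, a8x):+.0f}%")
```

A negative value, as with AES MT, means the three-core A8X comes out ahead in that sub-test.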

Geekbench 3 - Floating Point Performance
Test                A9X             A8X             % Advantage
BlackScholes ST     14.9 Mnodes/s   8.52 Mnodes/s   75%
BlackScholes MT     28.2 Mnodes/s   24.9 Mnodes/s   13%
Mandelbrot ST       2.23 GFLOPS     1.27 GFLOPS     76%
Mandelbrot MT       4.27 GFLOPS     3.66 GFLOPS     17%
Sharpen Filter ST   2.10 GFLOPS     1.08 GFLOPS     94%
Sharpen Filter MT   4.01 GFLOPS     3.12 GFLOPS     29%
Blur Filter ST      2.68 GFLOPS     1.53 GFLOPS     75%
Blur Filter MT      5.08 GFLOPS     4.47 GFLOPS     14%
SGEMM ST            6.77 GFLOPS     4.12 GFLOPS     64%
SGEMM MT            12.7 GFLOPS     11.6 GFLOPS     9%
DGEMM ST            3.32 GFLOPS     2.02 GFLOPS     64%
DGEMM MT            6.21 GFLOPS     5.61 GFLOPS     11%
SFFT ST             3.52 GFLOPS     1.92 GFLOPS     83%
SFFT MT             6.67 GFLOPS     5.40 GFLOPS     24%
DFFT ST             3.21 GFLOPS     1.80 GFLOPS     78%
DFFT MT             6.02 GFLOPS     5.11 GFLOPS     18%
N-Body ST           1.41 Mpairs/s   0.78 Mpairs/s   81%
N-Body MT           2.69 Mpairs/s   2.34 Mpairs/s   15%
Ray Trace ST        4.99 MP/s       2.96 MP/s       69%
Ray Trace MT        9.56 MP/s       8.64 MP/s       11%

The story with Geekbench 3 floating point performance is much the same. Performance never regresses, even in multi-threaded workloads. In lightly threaded floating point workloads A9X is going to walk all over A8X, and in multi-threaded workloads we’re still looking at anywhere between a 9% and a 29% performance gain. This goes to show just how powerful Twister is relative to Typhoon, especially with A9X’s much higher clockspeeds factored in. And it lends a lot of support to Apple’s ongoing design philosophy of favoring a smaller number of high performance (and now higher-clocked) cores.
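
As a rough sanity check on the fewer-but-faster-cores argument, the sketch below applies a deliberately simple model: if a workload scales perfectly with core count, the multi-threaded ratio should be the single-threaded ratio multiplied by 2/3. The perfect-scaling assumption and the name `predicted_mt_ratio` are mine, not Geekbench’s; the Mandelbrot figures come from the floating point table above.

```python
def predicted_mt_ratio(st_ratio: float, new_cores: int = 2, old_cores: int = 3) -> float:
    """Predict the multi-threaded ratio from the single-threaded ratio,
    assuming (as a simplification) perfect scaling with core count."""
    return st_ratio * new_cores / old_cores


# Mandelbrot results in GFLOPS, copied from the floating point table above
st_ratio = 2.23 / 1.27        # A9X vs. A8X, single-threaded
measured_mt = 4.27 / 3.66     # A9X vs. A8X, multi-threaded
predicted_mt = predicted_mt_ratio(st_ratio)

print(f"predicted {predicted_mt:.2f}x, measured {measured_mt:.2f}x")

# Break-even point under this model: each of the two cores must be
# 1.5x faster for the dual-core chip to match the tri-core chip.
print(predicted_mt_ratio(1.5))
```

Under this model, two cores only pull ahead of three once each core is more than 50% faster, which lines up with the single-threaded gains seen in most of the sub-tests.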

SPEC CPU 2006

Moving on, our other lower-level benchmark for this review is SPECint2006. Developed by the Standard Performance Evaluation Corporation, SPECint2006 is the integer component of their larger SPEC CPU 2006 benchmark. As was the case with SPEC CPU 2000 before it, SPEC CPU 2006 is designed by a committee of technology firms to offer a consistent and meaningful cross-platform benchmark that can compare systems of different performance levels and architectures. Among cross-platform benchmarks SPEC CPU is generally held in high regard, and while it is but one collection of benchmarks and like all benchmarks should not be taken as the be-all end-all of benchmarks on its own, it provides us with a very important look at CPU performance that we otherwise cannot get.

SPECint2006 is the successor to the SPECint2000 test we’ve been using periodically for the last couple of years now. Initially released in 2006, SPECint2006 is still SPEC’s current-generation CPU integer benchmark. We’ve wanted to switch to SPECint2006 for some time now, but have been held back by the overall low performance of tablet SoCs, which lacked the speed and memory to run SPECint2006 and to do so in a reasonable amount of time. However now thanks to the greater performance and greater memory of A9X, we’re finally able to run SPEC’s current-generation CPU benchmark on a tablet.

SPECint2006 is composed of 12 sub-benchmarks, testing a wide variety of scenarios from video compression to Perl execution to AI. This is a non-graphical benchmark, and I believe it’s reasonable to argue that the benchmark set itself leans towards server, high-performance computing, and workstation use cases. With that said, even if it’s not a perfect fit for tablet use cases, it offers a lot of real-world tests that give us a good variety of different workloads to benchmark CPUs with. SPECint2006 scores are in turn reported as a ratio, measuring how many times faster a tested system is than the SPEC reference system, a 1997 Sun Ultra Enterprise 2 server, which is based around a 296 MHz UltraSPARC II CPU.
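
The ratio scoring can be sketched in a few lines of Python. The run times below are hypothetical placeholders chosen for illustration, not SPEC’s actual reference times and not measurements from this review; the helper names are also my own. The one detail taken from SPEC’s methodology is that an overall SPECint score is the geometric mean of the per-benchmark ratios.

```python
import math


def spec_ratio(reference_seconds: float, measured_seconds: float) -> float:
    """How many times faster the tested system is than the reference system."""
    return reference_seconds / measured_seconds


def geometric_mean(ratios: list[float]) -> float:
    """Geometric mean, used to combine per-benchmark ratios into one score."""
    return math.exp(sum(math.log(r) for r in ratios) / len(ratios))


# Hypothetical example: the reference machine needs 9000 s and the
# tested system needs 360 s, so the tested system scores a ratio of 25.
example_ratio = spec_ratio(9000.0, 360.0)
print(example_ratio)

# Combining several (again hypothetical) per-benchmark ratios
print(geometric_mean([25.0, 17.6, 20.5]))
```

The geometric mean is used so that no single outlier sub-benchmark dominates the combined figure as it would with an arithmetic mean.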

CINT2006 (Integer Component of SPEC CPU2006):
Benchmark         Language   Application Area                 Description
400.perlbench     C          Programming Language             Derived from Perl V5.8.7. The workload includes SpamAssassin, MHonArc (an email indexer), and specdiff (SPEC's tool that checks benchmark outputs).
401.bzip2         C          Compression                      Julian Seward's bzip2 version 1.0.3, modified to do most work in memory, rather than doing I/O.
403.gcc           C          C Compiler                       Based on gcc Version 3.2, generates code for Opteron.
429.mcf           C          Combinatorial Optimization       Vehicle scheduling. Uses a network simplex algorithm (which is also used in commercial products) to schedule public transport.
445.gobmk         C          Artificial Intelligence: Go      Plays the game of Go, a simply described but deeply complex game.
456.hmmer         C          Search Gene Sequence             Protein sequence analysis using profile hidden Markov models (profile HMMs)
458.sjeng         C          Artificial Intelligence: chess   A highly-ranked chess program that also plays several chess variants.
462.libquantum    C          Physics / Quantum Computing      Simulates a quantum computer, running Shor's polynomial-time factorization algorithm.
464.h264ref       C          Video Compression                A reference implementation of H.264/AVC, encodes a videostream using 2 parameter sets. The H.264/AVC standard is expected to replace MPEG2
471.omnetpp       C++        Discrete Event Simulation        Uses the OMNet++ discrete event simulator to model a large Ethernet campus network.
473.astar         C++        Path-finding Algorithms          Pathfinding library for 2D maps, including the well known A* algorithm.
483.xalancbmk     C++        XML Processing                   A modified version of Xalan-C++, which transforms XML documents to other document types.

Although designed as a CPU-intensive benchmark, it’s important to note that SPECint2006 is officially labeled as “stressing a system's processor, memory subsystem and compiler.” The memory subsystem aspect is fairly self-explanatory – it’s difficult to test a CPU without testing the memory as well except in the cases of trivial workloads that can fit in a CPU’s caches – however the compiler aspect calls for special attention. As SPECint2006 is a cross-platform benchmark in the truest sense of the word, it’s impossible to offer a single binary for all platforms – especially platforms that had yet to be designed in 2006 such as ARMv8 – and, simply put, the moment you begin compiling benchmarks for different systems using different compilers, the performance of the compiler becomes a factor of benchmark performance as well.

As a result, and unlike many of the other benchmarks we run here, it’s important to note that compilers play a big part in SPECint2006 performance, and this is by design. Compiler authors can and do optimize for SPEC CPU, with the ultimate goal of giving the tested CPU the best chance to achieve the best possible performance in this benchmark; the compiler should not hold back the CPU. However in turn, all results must be validated, so overly aggressive compilers that generate bad code will be caught and failed. The end result is that in a cross-platform scenario with different binaries, SPECint2006 isn’t quite as apples-to-apples as our more traditional benchmarks, but it offers us a unique look at cross-platform CPU performance.

For our testing we’re using optimized binaries generated for Apple’s A8X/A9X SoCs and Intel’s Broadwell/Skylake processors respectively. The following compiler flags were used.

Apple ARMv8: Xcode 7 (LLVM), -Ofast

Intel x86: Intel C++ Compiler 16, -xCORE-AVX2 -ipo -mdynamic-no-pic -O3 -no-prec-div -fp-model fast=2 -m32 -opt-prefetch -ansi-alias -stdlib=libstdc++

Finally, of SPECint2006’s 12 sub-benchmarks, our current harness is only able to run 10 on the iPad Pro at this time, as 473.astar and 483.xalancbmk are failing on the iPad. So the following is not a complete run of SPECint2006, and for the purposes of SPEC CPU the results are officially classified as performance estimates.

To start things off, let’s look at the Apple-to-Apple comparison, pitting A9X against A8X.

SPECint_base2006 - Estimated Scores - A9X vs. A8X
Benchmark         A9X     A8X     A9X vs. A8X %
400.perlbench     25.0    14.1    78%
401.bzip2         17.6    11.5    54%
403.gcc           20.5    12.4    65%
429.mcf           18.7    N/A     N/A
445.gobmk         23.4    13.0    80%
456.hmmer         25.1    14.1    79%
458.sjeng         23.6    13.6    73%
462.libquantum    74.6    49.2    52%
464.h264ref       41.3    24.0    72%
471.omnetpp       10.3    8.0     29%

Unsurprisingly, A9X is leaps and bounds ahead here. The smallest gain is with 471.omnetpp, a discrete event simulator, where A9X holds a 29% lead. Otherwise A9X takes a significant lead, beating A8X by upwards of 80% in 445.gobmk, a Go (board game) AI benchmark.

Calling back to our iPhone 6s review for a moment, A9X has a much larger advantage vs. A8X with SPECint2006 as compared to A9 vs. A8 on SPECint2000. A good deal of this has to do with A9X’s significant clockspeed bump versus A8X, but at the same time this also illustrates how the newer SPECint2006 rates A9X and Twister even more highly than A8X/Typhoon. As we’ve seen time and time again, Twister is a much faster CPU core than the already fast Typhoon, and this is a big part of why Apple continues to top our ARM benchmarks.

Last but certainly not least however is our main event, A9X versus Intel’s Core M CPUs. As we’re finally able to run SPECint2006 on an Apple SoC, this is the first chance we’ve had to compare Apple and Intel CPUs using SPEC, so it’s exciting to finally be able to make this comparison.

At the same time, this comparison is not just for academic curiosity; as Apple has significantly improved their CPU design with every generation and has quickly moved to newer manufacturing processes, they have been closing the architecture and manufacturing gap with Intel. Twister and Skylake are fairly similar designs, both implementing a wide execution pipeline with a focus on achieving a high IPC, and in this latest generation of devices, coupling that with a fairly high 2GHz+ clockspeed. Over the years Apple and Intel have approached this problem from different angles (Apple built up from phones to tablets while Intel built down from desktops to tablets), but the end result is that the two have ended up in a similar place in terms of basic architecture design goals. Meanwhile, from a manufacturing standpoint Intel is arguably still roughly a generation ahead with their 14nm FinFET process (naming aside, their transistors are smaller than TSMC’s 16nm FinFET), so Apple is the underdog from this point of view.

The burning question is, of course, whether Apple’s CPU designs are catching up to the performance of Intel’s Core lineup, thanks to the continual iteration of architecture and manufacturing on the Apple side, versus the slower rate of growth we’ve seen over the last few generations with Intel’s Core lineup. The iPad Pro in turn finally gives us the opportunity to try to answer that question, as the faster SoC coupled with a form factor and TDP closer to regular Core M devices gives us the most apples-to-apples comparison yet.

To that end we have assembled a smorgasbord of Core M devices to compare to the iPad Pro and A9X SoC. Perhaps the most apple-to-apple comparison is the iPad Pro versus the 2015 MacBook; though approaching a year old, this is still Apple’s current generation MacBook, with our base model incorporating an older Broadwell-based Core M-5Y31. Also from the Broadwell generation we have an ASUS Transformer Book T300 Chi, which uses a high-end Core M-5Y71, to showcase the performance of Intel’s highest clocked Core M processors. Finally, from the latest Skylake generation we have the ASUS ZenBook UX305CA, which incorporates Intel’s base-tier Core m3-6Y30 CPU.

Finally, it should be noted that to keep testing as close as possible, all of these devices are passively cooled, and that as a result all of these devices are also TDP/heat throttling through much of the SPECint2006 benchmark. Ultimately what we’re measuring here is not the peak performance of each system, but rather its sustained performance under the TDP limitations of their respective designs. If unrestricted, undoubtedly all of these devices would score higher.

SPECint_base2006 - Estimated Scores - A9X vs. Intel Broadwell/Skylake
Benchmark         A9X       Core M-5Y31       Core M-5Y71        Core m3-6Y30      A9X vs
                            (2015 MacBook)    (Asus T300 Chi)    (Asus UX305CA)    MacBook %
Base/Turbo Freq   2.26GHz   0.9/2.4GHz        1.2/2.9GHz         0.9/2.2GHz
400.perlbench     25.0      21.7              28.5               24.4              15%
401.bzip2         17.6      14.6              19.6               15.3              21%
403.gcc           20.5      22.8              31.1               28.2              -10%
429.mcf           18.7      35.9              46.7               38                -48%
445.gobmk         23.4      16.9              23.7               18                38%
456.hmmer         25.1      43.9              61.9               48.1              -43%
458.sjeng         23.6      19.2              26.1               19.3              23%
462.libquantum    74.6      292               476                409               -74%
464.h264ref       41.3      38.4              49.7               37.3              8%
471.omnetpp       10.3      16.3              23.7               20.6              -37%

As this is a fairly dense lineup I’m not going to call out every figure, but let’s focus on a few key areas. First, on A9X versus the Core M-5Y31 (MacBook), the advantage flips between each device as each test hits upon different strengths and weaknesses of each CPU’s architecture. Overall each device wins half of the benchmarks, however the Core M powered MacBook wins by a larger average margin. In other words, the iPad Pro is competitive with the MacBook depending on the test, however on average it ends up trailing in performance.
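
To illustrate how an even split in wins can still leave one side behind on average, here is a short sketch that tallies wins and computes the geometric mean of the A9X-to-MacBook ratios. Treating the geometric mean as the "average margin" is my assumption for this example; the review does not state which average was used. The scores are copied from the table above.

```python
import math

# (A9X, Core M-5Y31) estimated SPECint_base2006 ratios from the table above
scores = {
    "400.perlbench": (25.0, 21.7),
    "401.bzip2": (17.6, 14.6),
    "403.gcc": (20.5, 22.8),
    "429.mcf": (18.7, 35.9),
    "445.gobmk": (23.4, 16.9),
    "456.hmmer": (25.1, 43.9),
    "458.sjeng": (23.6, 19.2),
    "462.libquantum": (74.6, 292.0),
    "464.h264ref": (41.3, 38.4),
    "471.omnetpp": (10.3, 16.3),
}

# Count the sub-benchmarks where A9X posts the higher score
a9x_wins = sum(1 for a9x, core_m in scores.values() if a9x > core_m)

# Geometric mean of the per-benchmark A9X / Core M ratios
log_ratios = [math.log(a9x / core_m) for a9x, core_m in scores.values()]
geomean_ratio = math.exp(sum(log_ratios) / len(log_ratios))

print(f"A9X wins {a9x_wins} of {len(scores)} sub-benchmarks")
print(f"Geometric mean of A9X / Core M ratios: {geomean_ratio:.2f}")
```

A geometric mean below 1.0 indicates that the losses outweigh the wins, even when the win count is tied.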

Relative to the MacBook, the iPad Pro does best in 445.gobmk, the Go benchmark, while its largest deficit is with 462.libquantum. The latter is a particularly interesting case as the benchmark is very easy to vectorize, giving us perhaps our best look at the vector performance of Twister versus Broadwell, and how well their respective compilers can actually vectorize it. The end result has the Intel platforms solidly in the lead here, hinting that Intel still has better vector performance at this time.

Shifting gears to the Asus ZenBook UX305CA and its newer Skylake-based Core m3-6Y30, to little surprise Skylake largely closes the gap with A9X in the benchmarks where Core M was losing, and pulls further ahead in the benchmarks where it was winning. Despite this the two systems split the number of wins at 5 each, but in the cases where the ZenBook is winning it’s very clearly winning. Skylake sees some decent performance improvements relative to the Broadwell CPU in our MacBook, with the exact gains depending on the test, allowing it to widen the gap compared to the A9X. Overall A9X is still competitive in specific scenarios, but on average it definitely trails the Skylake Core m3.

Finally, going back to Broadwell we have the ASUS Transformer Book T300 Chi, which incorporates a high-end Core M-5Y71 processor. This is still officially a 4.5W TDP processor, and as a result this essentially measures Broadwell Core M’s best case performance. With a maximum CPU clockspeed of 2.9GHz as compared to the slower low-end Skylake and Broadwell CPUs, the T300 Chi unsurprisingly beats the iPad Pro in every single benchmark. At best the two are neck-and-neck with Apple’s best benchmark, 445.gobmk, but otherwise it’s a clear and very significant lead for Intel’s fastest Broadwell Core M processor.

In the end, what to take away from this depends on how you want to read the results and what you believe the most important CPU comparison is. As Apple doesn’t use multiple bins/clockspeeds of A9X processors, this muddles the comparison some since there’s a significant difference in performance between Intel’s fastest and slowest Core M processors, and at the same time Intel’s official list prices put every CPU except the top-bin Core m7-6Y75 at the same price of $281.

Ultimately I think it’s reasonable to say that Intel’s Core M processors hold a CPU performance edge over iPad Pro and the A9X SoC. Against Intel’s slowest chips A9X is competitive, but as it stands A9X can’t keep up with the faster chips. However by the same metric there’s no question that Apple is closing the gap; A9X can compete with both Broadwell and Skylake Core M processors, and that’s something Apple couldn’t claim even a generation ago. That it’s only against the likes of Core m3 means that Apple still has a way to go, particularly as A9X still loses by more than it wins, but it’s significant progress in a short period of time. And I’ll wager that it’s closer than Intel would like to be, especially if Apple puts A9X into a cheaper iPad Air in the future.
