SoC Analysis: CPU Performance

Now that we’ve had a chance to take a look at A9X’s design and a bit on the difference between the x86 and ARM ISAs, let’s take a look at A9X’s performance at a lower level.

From a CPU perspective A9X is just a higher clocked implementation of the dual-core Twister CPU design we first saw on A9 last year. As a result the fundamentals of the CPU architecture have not changed relative to A9. However A9X relative to A8X drops down from three CPU cores to two, so among the factors we’ll want to look at is how Apple has been impacted by dropping down to two faster cores.

We’ll start things off with Geekbench, 3, which gives us a fairly low-level look at CPU performance.

Geekbench 3 - Integer Performance
  A9X A8X % Advantage
AES ST
1.17 GB/s
0.98 GB/s
19%
AES MT
2.85 GB/s
3.16 GB/s
-10%
Twofish ST
120.7 MB/s
64.0 MB/s
89%
Twofish MT
228.3 MB/s
182.7 MB/s
25%
SHA1 ST
1.03 GB/s
0.53 GB/s
94%
SHA1 MT
1.95 GB/s
1.48 GB/s
32%
SHA2 ST
205.8 MB/s
119.1 MB/s
73%
SHA2 MT
395.5 MB/s
330.6 MB/s
20%
BZip2Comp ST
8.95 MB/s
5.71 MB/s
57%
BZip2Comp MT
17.0 MB/s
16.6 MB/s
2%
Bzip2Decomp ST
14.7 MB/s
8.98 MB/s
64%
Bzip2Decomp MT
28.1 MB/s
25.2 MB/s
12%
JPG Comp ST
33.7 MP/s
20.6 MP/s
64%
JPG Comp MT
64.4 MP/s
60.8 MP/s
6%
JPG Decomp ST
89.2 MP/s
53.0 MP/s
68%
JPG Decomp MT
166.5 MP/s
153.9 MP/s
8%
PNG Comp ST
2.11 MP/s
1.35 MP/s
56%
PNG Comp MT
4.04 MP/s
3.82 MP/s
6%
PNG Decomp ST
31.5 MP/s
18.7 MP/s
68%
PNG Decomp MT
56.9 MP/s
56.3 MP/s
1%
Sobel ST
138.3 MP/s
82.5 MP/s
68%
Sobel MT
258.7 MP/s
225.6 MP/s
15%
Lua ST
3.25 MB/s
1.68 MB/s
93%
Lua MT
6.02 MB/s
4.60 MB/s
31%
Dijkstra ST
10.1 Mpairs/s
6.70 Mpairs/s
51%
Dijkstra MT
17.6 Mpairs/s
16.0 Mpairs/s
10%

The interesting thing about Geekbench is that as a result of being a lower-level test the bulk of its tests scale up well with CPU core counts, as the benchmark can just spawn more threads. Consequently I wasn’t entirely sure what to expect here, as this presents the tri-core A8X with a much better than average scaling opportunity, making it especially harsh on the A9X.

But what the results show us is that even by dropping back down to two CPU cores, A9X does very well overall. The single-threaded results are greatly improved, with A9X offering better than a 50% single-threaded perf gain in the majority of the sub-tests. Meanwhile even with the multi-threaded tests, A9X only loses once, on AES. Otherwise two higher clocked Twister cores are beating three lower clocked Typhoon cores by anywhere between a few percent up to 32%. In this sense Geekbench is something of a worst-case scenario, as real-world software rarely benefits from additional cores this well (this being part of the reason why A8 and A9 did so well relative to quad Cortex-A57 designs), so it’s promising to see that even in this worst-case scenario A9X can deliver meaningful performance gains over A8X.

Geekbench 3 - Floating Point Performance
  A9X A8X % Advantage
BlackScholes ST
14.9 Mnodes/s
8.52 Mnodes/s
75%
BlackScholes MT
28.2 Mnodes/s
24.9 Mnodes/s
13%
Mandelbrot ST
2.23 GFLOPS
1.27 GFLOPS
76%
Mandelbrot MT
4.27 GFLOPS
3.66 GFLOPS
17%
Sharpen Filter ST
2.10 GFLOPS
1.08 GFLOPS
94%
Sharpen Filter MT
4.01 GFLOPS
3.12 GFLOPS
29%
Blur Filter ST
2.68 GFLOPS
1.53 GFLOPS
75%
Blur Filter MT
5.08 GFLOPS
4.47 GFLOPS
14%
SGEMM ST
6.77 GFLOPS
4.12 GFLOPS
64%
SGEMM MT
12.7 GFLOPS
11.6 GFLOPS
9%
DGEMM ST
3.32 GFLOPS
2.02 GFLOPS
64%
DGEMM MT
6.21 GFLOPS
5.61 GFLOPS
11%
SFFT ST
3.52 GFLOPS
1.92 GFLOPS
83%
SFFT MT
6.67 GFLOPS
5.40 GFLOPS
24%
DFFT ST
3.21 GFLOPS
1.80 GFLOPS
78%
DFFT MT
6.02 GFLOPS
5.11 GFLOPS
18%
N-Body ST
1.41 Mpairs/s
0.78 Mpairs/s
81%
N-Body MT
2.69 Mpairs/s
2.34 Mpairs/s
15%
Ray Trace ST
4.99 MP/s
2.96 MP/s
69%
Ray Trace MT
9.56 MP/s
8.64 MP/s
11%

The story with Geekbench 3 floating point performance is much the same. Performance never regresses, even in multi-threaded workloads. In lightly threaded floating point workloads A9X is going to walk all over A8X, and in multi-threaded workloads we’re still looking at anywhere between a 9% and a 29% performance gain. This goes to show just how powerful Twister is relative to Typhoon, especially with A9X’s much higher clockspeeds factored in. And it lends a lot of support to Apple’s ongoing design philosophy of favoring a smaller number of high performance (and now higher-clocked) cores.

SPEC CPU 2006

Moving on, our other lower-level benchmark for this review is SPECint2006. Developed by the Standard Performance Evaluation Corporation, SPECint2006 is the integer component of their larger SPEC CPU 2006 benchmark. As was the case with SPEC CPU 2000 before it, SPEC CPU 2006 is designed by a committee of technology firms to offer a consistent and meaningful cross-platform benchmark that can compare systems of different performance levels and architectures. Among cross-platform benchmarks SPEC CPU is generally held in high regard, and while it is but one collection of benchmarks and like all benchmarks should not be taken as the be-all end-all of benchmarks on its own, it provides us with a very important look at CPU performance that we otherwise cannot get.

SPECint2006 is the successor to the SPECint2000 test we’ve been using periodically for the last couple of years now. Initially released in 2006, SPECint2006 is still SPEC’s current-generation CPU integer benchmark. We’ve wanted to switch to SPECint2006 for some time now, but have been held back by the overall low performance of tablet SoCs, which lacked the speed and memory to run SPECint2006 and to do so in a reasonable amount of time. However now thanks to the greater performance and greater memory of A9X, we’re finally able to run SPEC’s current-generation CPU benchmark on a tablet.

SPECint2006 is composed of 12 sub-benchmarks, testing a wide variety of scenarios from video compression to PERL execution to AI. This is a non-graphical benchmark and I believe it’s reasonable to argue that the benchmark set itself leans towards server high performance computing/workstation use cases, but with that said even if it’s not a perfect fit for tablet use cases it offers a lot of real-world tests that give us a good variety of different workloads to benchmark CPUs with. SPECint2006 scores are in turn reported as a ratio, measuring how many times faster a tested system is against the SPEC reference system, a 1997 Sun Ultrasparc Ultra Enterprise 2 server, which is based around a 296 MHz UltraSPARC II CPU.

CINT2006 (Integer Component of SPEC CPU2006):
Benchmark Language Application Area Description
400.perlbench
Programming Language  Derived from Perl V5.8.7. The workload includes SpamAssassin, MHonArc (an email indexer), and specdiff (SPEC's tool that checks benchmark outputs).
401.bzip2
Compression  Julian Seward's bzip2 version 1.0.3, modified to do most work in memory, rather than doing I/O.
403.gcc
C Compiler  Based on gcc Version 3.2, generates code for Opteron.
429.mcf
Combinatorial Optimization  Vehicle scheduling. Uses a network simplex algorithm (which is also used in commercial products) to schedule public transport.
445.gobmk
Artificial Intelligence: Go  Plays the game of Go, a simply described but deeply complex game.
456.hmmer
Search Gene Sequence  Protein sequence analysis using profile hidden Markov models (profile HMMs)
458.sjeng
Artificial Intelligence: chess  A highly-ranked chess program that also plays several chess variants.
462.libquantum
C
Physics / Quantum Computing Simulates a quantum computer, running Shor's polynomial-time factorization algorithm.
464.h264ref
Video Compression  A reference implementation of H.264/AVC, encodes a videostream using 2 parameter sets. The H.264/AVC standard is expected to replace MPEG2
471.omnetpp
C++ 
Discrete Event Simulation  Uses the OMNet++ discrete event simulator to model a large Ethernet campus network.
473.astar
C++ 
Path-finding Algorithms  Pathfinding library for 2D maps, including the well known A* algorithm.
483.xalancbmk
C++ 
XML Processing  A modified version of Xalan-C++, which transforms XML documents to other document types.

Although designed as a CPU-intensive benchmark, it’s important to note that SPECint2006 is officially labeled as “stressing a system's processor, memory subsystem and compiler.” The memory subsystem aspect is fairly self-explanatory – it’s difficult to test a CPU without testing the memory as well except in the cases of trivial workloads that can fit in a CPU’s caches – however the compiler aspect calls for special attention. As SPECint2006 is a cross-platform benchmark in the truest sense of the word, it’s impossible to offer a single binary for all platforms – especially platforms that had yet to be designed in 2006 such as ARMv8 – and, simply put, the moment you begin compiling benchmarks for different systems using different compilers, the performance of the compiler becomes a factor of benchmark performance as well.

As a result, and unlike many of the other benchmarks we run here, it’s important to note that compilers play a big part in SPECint2006 performance, and this is by design. Compiler authors can and do optimize for SPEC CPU, with the ultimate goal of giving the tested CPU the best chance to achieve the best possible performance in this benchmark; the compiler should not hold back the CPU. However in turn, all results must be validated, so overly aggressive compilers that generate bad code will be caught and failed. The end result is that in a cross-platform scenario with different binaries, SPECint2006 isn’t quite as apples-to-apples as our more traditional benchmarks, but it offers us a unique look at cross-platform CPU performance.

For our testing we’re using optimized binaries generated for Apple’s A8X/A9X SoCs and Intel’s Broadwell/Skylake processors respectively. The following compiler flags were used.

Apple ARMv8: XCode 7 (LLVM), -Ofast

Intel x86: Intel C++ Compiler 16, -xCORE-AVX2 -ipo -mdynamic-no-pic -O3 -no-prec-div -fp-model fast=2 -m32 -opt-prefetch -ansi-alias -stdlib=libstdc++

Finally, of SPECint2006’s 12 sub-benchmarks, our current harness is only able to run 10 of them on the iPad Pro at this time, as 473.astar and 483.xalancbmk are failing on the iPad. So the following is not a complete run of SPECint2006, and for the purposes of SPEC CPU are officially classified as performance estimates.

To start things off, let’s look at the Apple-to-Apple comparison, pitting A9X against A8X.

SPECint_base2006 - Estimated Scores - A9X vs. A8X
  A9X A8X A9X vs. A8X %
400.perlbench
25.0
14.1
78%
401.bzip2
17.6
11.5
54%
403.gcc
20.5
12.4
65%
429.mcf
18.7
N/A
N/A
445.gobmk
23.4
13.0
80%
456.hmmer
25.1
14.1
79%
458.sjeng
23.6
13.6
73%
462.libquantum
74.6
49.2
52%
464.h264ref
41.3
24.0
72%
471.omnetpp
10.3
8.0
29%

Unsurprisingly, A9X is leaps and bounds ahead here. The smallest gain is with 471.omnetpp, a discrete event simulator, where A9X holds a 29% lead. Otherwise A9X takes a significant lead, beating A8X by upwards of 80% in 445.gobmk, a Go (board game) AI benchmark.

Calling back to our iPhone 6s review for a moment, A9X has a much larger advantage vs. A8X with SPECint2006 as compared to A9 vs. A8 on SPECint2000. A good deal of this has to do with A9X’s significant clockspeed bump versus A8X, but at the same time this also illustrates how the newer SPECint2006 rates A9X and Twister even more highly than A8X/Typhoon. As we’ve seen time and time again, Twister is a much faster CPU core than the already fast Typhoon, and this is a big part of why Apple continues to top our ARM benchmarks.

Last but certainly not least however is our main event, A9X versus Intel’s Core M CPUs. As we’re finally able to run SPECint2006 on an Apple SoC, this is the first chance we’ve had to compare Apple and Intel CPUs using SPEC, so it’s exciting to finally be able to make this comparison.

At the same time this comparison not just for academic curiosity; as Apple has significantly improved their CPU design with every generation and has quickly moved to newer manufacturing processes, they have been closing the architecture and manufacturing gap with Intel. Twister and Skylake are fairly similar designs, both implementing a wide execution pipeline with a focus on achieving a high IPC, and in this latest generation of devices, coupling that with a fairly high 2GHz+ clockspeed. Over the years Apple and Intel have approached this problem from different angles – Apple built up from phones to tablets while Intel built down from desktops to tablets – but the end result is that the two have ended up in a similar place in terms of basic architecture design goals. Meanwhile from a manufacturing standpoint Intel is arguably still roughly a generation ahead with their 14nm FinFET process – naming aside, their transistors are smaller than TSMC’s 16nm FinFET – so Apple is the underdog from this point of view.

The burning question is of course is whether Apple’s CPU designs are catching up to the performance of Intel’s Core lineup, thanks to the continual iteration of architecture and manufacturing on the Apple side, versus the slower rate of growth we’ve seen over the last few generations with Intel’s Core lineup. The iPad Pro in turn finally gives us the opportunity to try to answer that question, as the faster SoC coupled with a form factor and TDP closer to regular Core M devices gives us the most apples-to-apples comparison yet.

To that end we have assembled a smorgasbord of Core M devices to compare to the iPad Pro and A9X SoC. Perhaps the most apple-to-apple comparison is the iPad Pro versus the 2015 MacBook; though approaching a year old, this is still Apple’s current generation MacBook, with our base model incorporating an older Broadwell-based Core M-5Y31. Also from the Broadwell generation we have an ASUS Transformer Book T300 Chi, which uses a high-end Core M-5Y71, to showcase the performance of Intel’s highest clocked Core M processors. Finally, from the latest Skylake generation we have the ASUS ZenBook UX305CA, which incorporates Intel’s base-tier Core m3-6Y30 CPU.

Finally, it should be noted that to keep testing as close as possible, all of these devices are passively cooled, and that as a result all of these devices are also TDP/heat throttling though much of the SPECint2006 benchmark. Ultimately what we’re measuring here is not the peak performance of each system, but rather its sustained performance under the TDP limitations of their respective designs. If unrestricted, undoubtedly all of these devices would score higher.

SPECint_base2006 - Estimated Scores - A9X vs. Intel Broadwell/Skylake
  A9X Core M-5Y31
(2015 MacBook)
Core M-5Y71
(Asus T300 Chi)
Core m3-6Y30
(Asus UX305CA)
A9X vs MacBook %
Base/Turbo Freq 2.26GHz 0.9/2.4GHz 1.2/2.9GHz 0.9/2.2GHz  
400.perlbench
25.0
21.7
28.5
24.4
15%
401.bzip2
17.6
14.6
19.6
15.3
21%
403.gcc
20.5
22.8
31.1
28.2
-10%
429.mcf
18.7
35.9
46.7
38
-48%
445.gobmk
23.4
16.9
23.7
18
38%
456.hmmer
25.1
43.9
61.9
48.1
-43%
458.sjeng
23.6
19.2
26.1
19.3
23%
462.libquantum
74.6
292
476
409
-74%
464.h264ref
41.3
38.4
49.7
37.3
8%
471.omnetpp
10.3
16.3
23.7
20.6
-37%

As this is a fairly dense lineup I’m not going to call out every figure, but let’s focus on a few key areas. First, on A9X versus the Core M-5Y31 (MacBook), the advantage flips between each device as each test hits upon different strengths and weaknesses of each CPU’s architecture. Overall each device wins half of the benchmarks, however the Core M powered MacBook wins by a larger average margin. In other words, the iPad Pro is competitive with the MacBook depending on the test, however on average it ends up trailing in performance.

Relative to the MacBook, the iPad Pro does best in 445.gobmk, the Go benchmark, while its largest deficit is with 462.libquantum. The latter is a particularly interesting case as the benchmark is very easy to vectorize, giving us perhaps our best look at the vector performance of Twister versus Broadwell, and how well their respective compilers can actually vectorize it. The end result has the Intel platforms solidly in the lead here, hinting that Intel still has better vector performance at this time.

Shifting gears to the Asus ZenBook UX305CA and its newer Skylake based Core m3-6Y30, to little surprise Skylake closes the gap with A9X in the benchmarks where Core M was losing, and pulls further ahead in the benchmarks where it was winning. Despite this the two systems split the number of wins at 5 each, but in the cases where the ZenBook is winning it’s very clearly winning. Overall Skylake sees some decent performance improvements relative to the Broadwell CPU in our MacBook – with the exact gains depending on the test – allowing it to widen the gap compared to the A9X. Overall A9X is still competitive in specific scenarios, but on average it definitely trails the Skylake Core m3.

Finally, going back to Broadwell we have the ASUS Transformer Book T300 Chi, which incorporates a high-end Core M-5Y71 processor. This is still officially a 4.5W TDP processor, and as a result this essentially measures Broadwell Core M’s best case performance. With a maximum CPU clockspeed of 2.9GHz as compared to the slower low-end Skylake and Broadwell CPUs, the T300 Chi unsurprisingly beats the iPad Pro in every single benchmark. At best the two are neck-and-neck with Apple’s best benchmark, 445.gobmk, but otherwise it’s a clear and very significant lead for Intel’s fastest Broadwell Core M processor.

In the end, what to take away from this depends on how you want to read the results and what you believe the most important CPU comparison is. As Apple doesn’t use multiple bins/clockspeeds of A9X processors, this muddles the comparison some since there’s a significant difference in performance between Intel’s fastest and slowest Core M processors, and at the same time Intel’s official list prices put every CPU except the top-bin Core m7-6Y75 at the same price of $281.

Ultimately I think it’s reasonable to say that Intel’s Core M processors hold a CPU performance edge over iPad Pro and the A9X SoC. Against Intel’s slowest chips A9X is competitive, but as it stands A9X can’t keep up with the faster chips. However by the same metric there’s no question that Apple is closing the gap; A9X can compete with both Broadwell and Skylake Core M processors, and that’s something Apple couldn’t claim even a generation ago. That it’s only against the likes of Core m3 means that Apple still has a way to go, particularly as A9X still loses by more than it wins, but it’s significant progress in a short period of time. And I’ll wager that it’s closer than Intel would like to be, especially if Apple puts A9X into a cheaper iPad Air in the future.

SoC Analysis: On x86 vs ARMv8 System Performance
POST A COMMENT

406 Comments

View All Comments

  • ddriver - Friday, January 22, 2016 - link

    Should have named in iPad XL or something, this device will barely suit the need of any professional. Performance is good, but without supporting professional software, the hardware is useless. Reply
  • Coztomba - Friday, January 22, 2016 - link

    And why would anyone bother to make professional software if the hardware wasn't capable? They needed a starting off point to say "Hey we can produce the hardware to run pro apps on a iPad. It's only going to get better. Start developing!" Reply
  • ddriver - Friday, January 22, 2016 - link

    If anyone could stimulate software companies to do that, I guess that would be apple with its mountains of money and strong sales. They do have enough resources to do the software themselves.

    Mobile device hardware has been capable of professional workloads for at least 2-3 years. Nobody bothered to do it. Big software companies did not port their applications to ARM, instead they made cheap, crippled lesser versions. This is IMO a stupid move, they probably did this to promote their professional software to common folk, but it would have been more lucrative to bring professional software to mobile platforms.

    There are 2 main issues with mobile platforms - memory and CPU performance. Modern software is very bloated memory consumption wise, especially software relying on managed languages, the latter are also significantly slower in terms of performance than languages like C or C++.

    There is one big issue with legacy professional software - it originates back from the days developers were locked in platform specific application development APIs, so it represents a significant effort to port them to mobile platforms - essentially, most of the stuff needs to be rewritten.

    But a rewrite in faster and more efficient language, taking advantage of contemporary technology such as OpenCL can easily bring professional software to mobile platforms at an experience as good as that of desktop workstations. Naturally, more efficiently written software will also run that much better on powerful desktop machines as well.
    Reply
  • mr_tawan - Friday, January 22, 2016 - link

    Not all pro are in the multimedia industry, you know :-).

    For most office workers, for instance, the only things they might need are notetaking (onenote), email (outlook), wordprocessor (word), and calendar (onenote). Most all tablet are capable to all of that, but it is a bit awkard to work with (due to the missing stylus, and not-so-comfy keyboard).

    With iPad Pro which, well, address this issue in the same way as the Surface Pro by adding keyboard and stylus to the tablet. It's much easier to use the table extensively (rather than just browsing web and watching video, which is hardly described as a profession job). So I personally think that adding these two options could takes the iPad into the 'professional' realm.

    I think that Apple would love to have iPad to complement MacBook (and Mac Pro), rather than to compete. If you need more power than just by Mac Pro :-).
    Reply
  • ddriver - Friday, January 22, 2016 - link

    Yeah, why use one device when you can buy and lug around two devices instead. Reply
  • melgross - Friday, January 22, 2016 - link

    There's actually quite a lot of professional software available on iOS, and has been for some time. I suppose if you do t use iOS, and so do t know what's a bailable, you can say that little is available, but it's simply not true. Microsoft itself had about two dozen professional apps on iOS. You really need to look through the App Store. Reply
  • ddriver - Friday, January 22, 2016 - link

    What would those 24 ("two dozen") professional microsoft apps be? Reply
  • Dave Bothell - Saturday, January 23, 2016 - link

    Word, Excel, PowerPoint, Outlook, OWA for iPad, Sunrise Calendar, OneDrive, OneDrive for Business, OneNote, SmartGlass, Skype, Bing for iPad, Remote Desktop, Lync, Office 365 Admin, Intune, Azure Authenticator, Sway, SharePoint Newsfeed, Dynamics CRM, Dynamics Business Analyzer, Dynamics Time Management, PowerApps, Global Startup Directory. There's more, but you asked for 24. Reply
  • xerandin - Saturday, January 23, 2016 - link

    Smartglass isn't a professional app--it controls Xbox 360s or Xbox Ones, depending on which version of the app you install. Reply
  • dsraa - Sunday, January 24, 2016 - link

    you forgot bing.....bing isnt a professional app either. Reply

Log in

Don't have an account? Sign up now