Apple iPhone XS Review Addendum: Small Core and NN Performance

by Andrei Frumusanu on October 12, 2018 6:10 AM EST

Last week we published our iPhone XS and XS Max review, in which we went into great depth on various aspects of the phones, especially the new A12’s CPU performance. However, I wanted to dig a bit deeper into CPU performance than I had time for in the initial review, which I'm finally able to get around to now. The A12’s small cores were something I especially wanted to have in the article, as Apple's small cores haven't been very well investigated to date. As it’s still an important topic, I’m posting that part here as a pipeline post as well as integrating it as an additional page in the review:

The A12 Tempest µarch: A Fierce Small Core

Apple first introduced a “small” CPU core alongside the Hurricane cores in the A10 SoC, powering the iPhone 7 generation. We’ve never really had the opportunity to dissect these cores, and over the years there has been a bit of mystery around what they’re capable of.

Apple’s introduction of a heterogeneous CPU topology was, in one sense, one of the biggest validations for Arm designs. Having separate low(er)-power CPUs on an SoC is a simple matter of physics: a bigger microarchitecture cannot scale down its power as efficiently as a separate, smaller block can. Even in a mythical perfectly clock-gated microarchitecture, you would not be able to combat the static leakage present in bigger CPU cores, and that leakage would become part of the device's everyday power consumption, even for small workloads. Power gating the big CPU cores and shifting to much smaller CPU cores instead helps alleviate static leakage, as well as (if designed as such) improving dynamic power efficiency.
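To make the leakage argument concrete, here is a minimal sketch of a first-order CMOS power model (dynamic power as activity × C × V² × f, static power as V × I_leak). The function name and every number in it are hypothetical illustration values, not measurements of any Apple core; the only point is that perfect clock gating zeroes the dynamic term while the leakage term stays.

```python
def core_power(c_eff_nf, volts, freq_ghz, leak_ma, activity=1.0):
    """Return (dynamic_w, static_w) for a first-order CMOS power model.

    dynamic = activity * C * V^2 * f
    static  = V * I_leak
    All inputs are hypothetical illustration values, not measurements.
    """
    dynamic_w = activity * (c_eff_nf * 1e-9) * volts ** 2 * (freq_ghz * 1e9)
    static_w = volts * (leak_ma * 1e-3)
    return dynamic_w, static_w


# Hypothetical big core, perfectly clock gated (activity = 0): the dynamic
# term vanishes, but leakage is still paid for as long as it is powered.
big_idle = core_power(c_eff_nf=1.0, volts=0.8, freq_ghz=2.5, leak_ma=100.0, activity=0.0)

# Hypothetical small core with a tenth of the leakage current, also idle.
small_idle = core_power(c_eff_nf=0.2, volts=0.8, freq_ghz=1.5, leak_ma=10.0, activity=0.0)

print("big core idle (dynamic W, static W):  ", big_idle)
print("small core idle (dynamic W, static W):", small_idle)
```

Under these assumed values the clock-gated big core still burns its full static power, which is why power gating it and handing light work to a smaller block is attractive.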

The Tempest cores in the A12 are now the third iteration of this “small” microarchitecture, and since the A11 they are now fully heterogeneous and work independently of the big cores. But the question is, is this actually the third iteration, or did Apple do something more interesting?

The Tempest core is a 3-wide out-of-order microarchitecture: Already out of the gate this means it has very little to do with Arm’s own “little” cores, such as the A53 and A55, as these are simpler in-order designs.

The Tempest core’s execution pipelines are also relatively few: There are just two main pipelines that are capable of simple ALU operations; meanwhile one of them also does integer and FP multiplications, and the other is able to do FP additions. Essentially we just have two primary execution ports to each of the more complex pipelines behind them. Meanwhile in addition to the two main pipelines, there’s also a dedicated combined load/store port.

Now what is very interesting here is that this essentially looks identical to Apple’s Swift microarchitecture from Apple's A6 SoC. It’s not very hard to imagine that Apple recycled this design, ported it to 64-bit, and now uses it as a lean out-of-order machine serving as the lower-power CPU core. If this is indeed Swift-derived, then on top of the three execution ports described above, we should also find a dedicated port for integer and FP divisions, so as not to block the main pipelines whenever such an instruction is fed.

The Tempest cores clock up to a maximum of 1587MHz and are served by 32KB instruction and data caches, as well as an increased shared 2MB L2 cache that uses power management to partially power down SRAM banks.

In terms of power efficiency, the Tempest cores were essentially my prime candidate to try to get to some sort of apples-to-apples comparison between the A11 and A12 for power efficiency. I haven’t seen major differences in the cores besides the bigger L2, and Apple has also kept the frequencies similar. Unfortunately, "similar" isn't identical in this case; because the small cores on the A11 can boost up to 1694MHz when there’s only one thread active on them, I had no really good way to also measure performance at iso-frequency.

I did run SPEC at an equal 1587MHz frequency by simply having a second dummy thread spinning on another core while the main workloads were benchmarking. I also tried to get some power figures through this method by regression testing the impact of the dummy thread. However, the power was near identical to the figures I measured at 1694MHz. As a result I dropped the idea, and we'll just have to keep in mind that the A11’s Mistral cores were clocked 6.7% higher in the following benchmarks:
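The 6.7% figure follows directly from the two clock speeds quoted above. A quick sketch of the arithmetic (the constant and function names are mine, chosen for illustration):

```python
A11_MISTRAL_MHZ = 1694  # single-thread boost clock quoted in the text
A12_TEMPEST_MHZ = 1587  # maximum clock quoted in the text


def clock_advantage_percent(fast_mhz, slow_mhz):
    """Percentage by which fast_mhz exceeds slow_mhz."""
    return (fast_mhz / slow_mhz - 1.0) * 100.0


advantage = clock_advantage_percent(A11_MISTRAL_MHZ, A12_TEMPEST_MHZ)
print(f"Mistral clock advantage: {advantage:.1f}%")
```

Keep in mind this is only a clock ratio; how much of it turns into benchmark performance depends on how execution-bound each workload is.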

Much like on the Vortex big cores, the biggest improvements for the new Tempest cores are found in the memory-sensitive benchmarks. The benchmarks in which Tempest loses to Mistral are mainly execution bound, and because of the frequency disadvantage, there’s no surprise that the A12 lost in this particular single-threaded small core scenario.

Overall, besides the memory improvements, the new Tempest cores look very similar in performance to last year’s Mistral cores. This is great as we can also investigate the power efficiency, and maybe learn something more concrete about the advantages of TSMC's 7nm manufacturing process.

Unfortunately, the energy efficiency improvements are somewhat inconclusive, and perhaps even disappointing. Looking at the SPECint2006 workloads overall, the Tempest-powered A12 was 35% more energy efficient than the Mistral-powered A11. Because the Mistral cores were running at a higher frequency in this test, the actual efficiency gains for the A12 would likely be even smaller at an iso-frequency level. Granted, we’re still looking at a general iso-performance comparison here, as the memory improvements in the A12 were able to push the Tempest cores to an integer suite score nearly identical to that of the higher-clocked Mistral cores.

In the overall FP benchmarks, Tempest was only 17% more efficient, even though it did perform better than the A11’s Mistral cores.

Putting the A11 and A12 small cores in comparison with their big brothers as well as the competition from Arm, there’s not much surprise in terms of the results. Compared to the big Apple cores, the small cores only offer a third to a fourth of the performance, but they also use less than half the energy.
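The performance and energy ratios above also imply an average power ratio. As a rough sketch, assuming the energy figures refer to completing the same fixed workload (so runtime scales inversely with performance), average power is energy divided by runtime. The function name is hypothetical, and the inputs are the rounded bounds from the text rather than exact measurements:

```python
def relative_average_power(perf_ratio, energy_ratio):
    """Average power of the small core relative to the big core.

    perf_ratio   : small-core performance / big-core performance
    energy_ratio : small-core energy / big-core energy for the same work
    Assumes a fixed workload, so runtime scales as 1 / performance.
    """
    runtime_ratio = 1.0 / perf_ratio
    return energy_ratio / runtime_ratio


# "A third to a fourth of the performance" at (at most) half the energy.
for perf in (1 / 3, 1 / 4):
    power = relative_average_power(perf, energy_ratio=0.5)
    print(f"perf ratio {perf:.2f} -> average power ratio of at most {power:.3f}")
```

Since the small cores use less than half the energy, these values are upper bounds: under this assumption the small cores draw somewhere below a sixth to an eighth of the big cores' average power.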

What did surprise me a lot was seeing just how well Apple’s small cores compare to Arm’s Cortex-A73 under SPECint. Here Apple’s small cores almost match the performance of Arm’s high-performance cores from just 2 years ago. In SPEC's integer workloads, the A12's Tempest is nearly equivalent to a 2.1GHz A73.

However in the SPECfp workloads, the small cores aren’t competitive. Not having dedicated floating-point execution resources puts the cores at a disadvantage, though they still offer great energy efficiency.

Apple’s small cores in general are a lot more performant than one would think. I’ve gathered some incomplete SPEC numbers on Arm’s A55 (it takes ages!) and in general the performance difference here is 2-3x depending on the benchmark. In recent years I’ve felt that Arm’s little core performance range has become insufficient in many workloads, and this may also be why we’re going to see a lot more three-tiered SoCs (such as the Kirin 980) in the near future. As it stands, the gap between the maximum performance of the little cores and the most efficient low-performance point of the big cores continues to grow. All of which makes me wonder whether it’s still worth it to stay with an in-order microarchitecture for Arm's efficiency cores.

Neural Network Inferencing Performance on the A12

Another big, mysterious aspect of the new A12 was the SoC's new neural engine, which Apple advertises as designed in-house. As you may have noticed in the die shot, it’s a quite big silicon block, very much equaling the two big Vortex CPU cores in size.

To my surprise, I found out that Master Lu’s AImark benchmark also supports iOS, and better still it uses Apple's CoreML framework to accelerate the same inference models as on Android. I ran the benchmark on the latest iPhone generations, as well as a few key Android devices.

[Charts: 鲁大师 / Master Lu - AImark results for Inception V3, ResNet34, and VGG16]

Overall, Apple’s 8x performance claims weren’t quite confirmed in this particular test suite, but we see solid improvements of 4-6.5x. There’s one catch here in regards to the older iPhones: as you can see in the results, the A11-based iPhone X performs quite similarly to previous generation phones. What’s happening here is that Apple’s executing CoreML on the GPU. It seems to me that the NPU in the A11 might have never been exposed publicly via APIs.

The Huawei P20 Pro’s Kirin 970 falls roughly 2.5x behind the new A12, which coincidentally matches the advertised 2TOPs vs 5TOPs throughput capabilities of the two SoCs' respective NPUs. Here the new Kirin 980 should be able to significantly close the gap.
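The match with the advertised figures is simple division. A tiny sketch using only the two TOPs numbers quoted above (the constant names are mine):

```python
KIRIN_970_NPU_TOPS = 2.0  # advertised throughput quoted in the text
A12_NPU_TOPS = 5.0        # advertised throughput quoted in the text

advertised_ratio = A12_NPU_TOPS / KIRIN_970_NPU_TOPS
print(f"Advertised A12 / Kirin 970 throughput ratio: {advertised_ratio}x")
```

Advertised peak throughput rarely maps this cleanly onto measured inference results, so the agreement is best read as the coincidence the text calls it.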

Qualcomm’s Snapdragon 845 also performs very well, trading blows with the Kirin 970. AImark uses the SNPE framework for inference acceleration, as it doesn’t support the NNAPI as of yet. The Pixel 2 and Note9 offered terrible results here as they both had to fall back to CPU accelerated libraries.

In terms of power, I’m not too comfortable publishing power figures for the A12 because of how visibly transactional the workload was: the actual inferencing bursts pushed power consumption up to 5.5W, with lower-power gaps in between. Without actually knowing what is happening between the bursts of activity, the average power figures for the whole test run can vary greatly. Nevertheless, the fact that Apple’s willing to go up to 5.5W means that they’re very much pushing the power envelope here and going for the highest burst performance. The GPU-accelerated iPhones' power peaked in the 2.3W to 5W range depending on the inference model.
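To illustrate why an average over a bursty run says little on its own, here is a minimal sketch of a two-level power model. Only the 5.5W burst figure comes from the measurements above; the 0.5W level between bursts and the duty cycles are hypothetical placeholders, since that is exactly the part that is unknown.

```python
def average_power(burst_w, gap_w, duty_cycle):
    """Time-weighted average power for a two-level (burst/gap) workload.

    duty_cycle is the fraction of the run spent in the burst state.
    """
    if not 0.0 <= duty_cycle <= 1.0:
        raise ValueError("duty_cycle must be between 0 and 1")
    return duty_cycle * burst_w + (1.0 - duty_cycle) * gap_w


BURST_W = 5.5  # peak inferencing power reported in the text
GAP_W = 0.5    # hypothetical power between bursts (assumption)

for duty in (0.25, 0.50, 0.75):  # hypothetical duty cycles
    avg = average_power(BURST_W, GAP_W, duty)
    print(f"duty cycle {duty:.2f} -> average {avg:.2f} W")
```

With the same peak, the average swings widely depending on how long the gaps are and what the SoC draws during them, which is why a single average figure would be misleading here.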

32 Comments

mkozakewich - Sunday, October 14, 2018 - link
'-ant' already exists, and is a real suffix. 'Perform' already exists, and is a real word. You can combine word-parts in many different ways, enough that a dictionary would require an order of magnitude more pages if we wanted to collect each possibility.
GC2:CS - Friday, October 12, 2018 - link
Well. The GPU can consume up to 6W, single thread CPU load can go up to 4W and NPU consumes up to 5,5 W.

And TDP is like 3-4W ? This cannot continue forever like that, transactional load optimized or not.
Then how much of all those different parts of the chip can be used at once without blowing up? How well would that fit into a laptop? What if a three-year-old, worn-out battery has to power this?

Considering A12 actually matches A10X quite easily, I think A12X is going to be truly desktop grade.
Constructor - Sunday, October 14, 2018 - link
I find the expected switch of the new iPad Pros from Lightning to USB-C somewhat peculiar, too.

That port change to the iPads is still just a rumour at this point, but it could fit perfectly into a strategy of announcing the Mac platform switch at next WWDC and offering a macOS developer beta for those iPad Pros until actual ARM Macs were available later in the year.

The iPads could then be used as interim developer Macs with a standard USB-C port so you could easily connect standard Mac peripherals (plus Bluetooth for keyboard and mouse).

It's not a strong piece of evidence, it would just fit rather effortlessly.
HStewart - Monday, October 15, 2018 - link
"That port change to the iPads is still just a rumour at this point, but it could fit perfectly into a strategy of announcing the Mac platform switch at next WWDC and offering a macOS developer beta for those iPad Pros until actual ARM Macs were available later in the year."

ARM based Mac has been rumoured for years, maybe even a decade. Even though the original Macs were closer to RISC machines with 68xxx and PowerPC series - but those days are long gone.

I have yet to see a real performance comparison of same benchmark of ARM vs x86 based CPU's. I am not talking about Web stuff

I just think the ARM cores are used for different purpose - IE Apps don't needs the power of desktop applications.

Plus something completely off the CPU's environments - it appears that MacOS line has not even done anything with touch screens and iOS line has not done anything with mouse and keyboard. It appears Apple is trying to keep them separated - maybe because they would love one day to kill the MacOS line.

Like any thing else you see on internet, this is just my opinion.
Constructor - Tuesday, October 16, 2018 - link

"ARM based Mac has been rumoured for years maybe even a decade."

It is almost certain that there are a few MacBooks at Apple with prototype Axx boards running experimental macOS builds, just for evaluation. Those rumours have been popping up consistently enough to make that plausible, but those will not be mass-produced like that. So the usual sequence would be an announcement at WWDC and some preliminary developer machines – back then it was PC motherboards in a PowerMac case, now it could be regular iPads conscripted for the task.

"I have yet to see a real performance comparison of same benchmark of ARM vs x86 based CPU's. I am not talking about Web stuff"

Geekbench with its suite of native tests puts the A12 cores at about desktop i5 level. It's mostly passive cooling and tiny pocket-sized batteries limiting multicore performance.

"I just think the ARM cores are used for different purpose - IE Apps don't needs the power of desktop applications."

That's long in the past. Today they do, especially on iOS.

"Plus something completely off the CPU's environments - it appears that MacOS line has not even done anything with touch screens and iOS line has not done anything with mouse and keyboard. It appears Apple is trying to keep them separated - maybe because they would love one day to kill the MacOS line."

That wouldn't change in my speculative scenario – iPads as developer Macs would not use their touch capability but would be used with keyboard and mouse/trackpad.

And especially looking at Windows I still think the separation keeps making sense.

"Like any thing else you see on internet, this is just my opinion."

Of course, and the same applies to my own posts! 🙂
HStewart - Monday, October 15, 2018 - link
" I think A12X is going to be truly desktop grade."

Maybe for some applications, but real desktop applications and not just apps - ARM basically does not have the power to replace it.

A good sign of this when iPad Pro can actually create applications for the iPad Pro and iPhone. I don't believe development tools are available that actual run on iPad Pro. I am talking about native development and not across a VPN or other remote connection
Constructor - Tuesday, October 16, 2018 - link

"Maybe for some applications, but real desktop applications and not just apps - ARM basically does not have the power to replace it."

The Axx CPUs don't use licensed ARM cores, they're just compatible with the ARMv8 instruction set. And that is not a limitation – quite the contrary!

"A good sign of this when iPad Pro can actually create applications for the iPad Pro and iPhone. I don't believe development tools are available that actual run on iPad Pro. I am talking about native development and not across a VPN or other remote connection"

That is not limited by CPU performance but by touch UI paradigms and the current state of code signing requirements under iOS.

Even the soon to be replaced current iPads are already more powerful than regular developer machines only a few years back.
dudedud - Friday, October 12, 2018 - link
"I’ve gathered some incomplete SPEC numbers on Arm’s A55 (it takes ages!) and in general the performance difference here is 2-3x depending on the benchmark"

Nice, but how much more power does the Tempest core consume vs those -very slow- A55 (or even A53) cores?
Andrei Frumusanu - Friday, October 12, 2018 - link
Tempest is more efficient.
name99 - Friday, October 12, 2018 - link
But surprisingly not THAT efficient. Half the energy for a quarter of the performance? That's a much worse tradeoff than I was expecting. I'm guessing this means one of
- either Apple tunes the OS so that the small core USUALLY runs at a lower (and much more efficient) frequency? OR
- a SPEC2006 type of workload exercises the uncore in a way that Apple does not expect for its uses of the small cores, and much of the energy is being burned in that uncore?

More generally, DAMN Andrei! I think we are all incredibly grateful that you joined Ars.
I hope you'll stay on the Apple beat for many years! And that you'll expand your domain to cover similar analyses of the new iPads (and, what the hell, the new Apple Watch; why not even the Apple TV!)
