A7 SoC Explained

I’m still surprised by the amount of confusion around Apple’s CPU cores, so that’s where I’ll start. I’ve already outlined how ARM’s business model works, but in short there are two basic types of licenses ARM will bestow upon its partners: processor and architecture. The former involves implementing an ARM designed CPU core, while the latter is the creation of an ARM ISA (Instruction Set Architecture) compatible CPU core.

NVIDIA and Samsung, up to this point, have gone the processor license route. They take ARM designed cores (e.g. Cortex A9, Cortex A15, Cortex A7) and integrate them into custom SoCs. In NVIDIA's case the CPU cores are paired with NVIDIA's own GPU, while Samsung licenses GPU designs from ARM and Imagination Technologies. Apple previously leveraged its ARM processor license as well. Until last year's A6 SoC, all Apple SoCs used CPU cores designed by and licensed from ARM.

With the A6 SoC however, Apple joined the ranks of Qualcomm by leveraging an ARM architecture license. At the heart of the A6 were a pair of Apple designed CPU cores that implemented the ARMv7-A ISA. I came to know these cores by their leaked codename: Swift.

At its introduction, Swift proved to be one of the best designs on the market. An excellent combination of performance and power consumption, the Swift based A6 SoC improved power efficiency over the previous Cortex A9 based design. Swift also proved competitive with the best from Qualcomm at the time. Since then however, Qualcomm has released two evolutions of its CPU core (Krait 300 and Krait 400) and pretty much regained performance leadership over Apple. With Apple on a yearly release cadence, the A7 is its only shot at taking back the crown for the next 12 months.

Following tradition, Apple replaces its A6 SoC with a new generation: A7.

With only a week to test battery life, performance, wireless and cameras on two phones, in addition to actually using them as intended, there wasn’t a ton of time to go ridiculously deep into the new SoC’s architecture. Here’s what I’ve been able to piece together thus far.

First off, based on conversations with as many people in the know as possible, as well as just making an educated guess, it's probably pretty safe to say that the A7 SoC is built on Samsung's 28nm HKMG (high-k + metal gate) process. It's too early for 20nm at reasonable yields, and Apple isn't ready to move some (not all) of its operations to TSMC.

The jump from 32nm to 28nm results in peak theoretical scaling of 76.5% (the same design on 28nm can be no smaller than 76.5% of the die area at 32nm). In reality, nothing ever scales perfectly so we’re probably talking about 80 - 85% tops. Either way that’s a good amount of room for new features.
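
For reference, the 76.5% figure falls out of simple geometry: die area scales with the square of the linear shrink, so

$$\left(\frac{28}{32}\right)^2 = 0.875^2 \approx 0.7656$$

of the original area in the ideal case.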

At its launch event Apple officially announced both the A7's die size (102mm²) and its transistor count (over 1 billion). Don't underestimate the magnitude of these disclosures. The technical folks at Cupertino are clearly winning some battle to talk more about their designs, not less. We're not yet at the point where I'm getting pretty diagrams and a deep dive, but it's clear that Apple is beginning to open up more (and it's awesome).

Apple has never previously disclosed transistor count. I also don’t know if this “over 1 billion” figure is based on a schematic or layout transistor count. The only additional detail I have is that Apple is claiming a near doubling of transistors compared to the A6. Looking at die sizes and taking into account scaling from the process node shift, there’s clearly a more fundamental change to the chip’s design. It is possible to optimize a design (and transistors) for area, which seems to be what has happened here.
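
A rough sanity check, assuming the widely reported ~97mm² measurement for the A6 die (a third-party figure, not one Apple disclosed): the process shrink plus the slightly larger die accounts for only about

$$\frac{1}{0.766} \times \frac{102}{97} \approx 1.31 \times 1.05 \approx 1.37\times$$

the A6's transistor budget, so a near doubling implies the remaining ~1.45x came from the design itself, whether through area-optimized transistors, denser layout, or a shift toward intrinsically denser blocks like SRAM.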

The CPU cores are, once again, a custom design by Apple. These aren’t Cortex A57 derivatives (still too early for that), but rather some evolution of Apple’s own Swift architecture. I’ll dive into specifics of what I’ve been able to find in a moment. To answer the first question on everyone’s mind, I believe there are two of these cores on the A7. Before I explain how I arrived at this conclusion, let’s first talk about cores and clock speeds.

The transition from 2 to 4 cores happened quicker in mobile than I had expected. Thankfully there are some well threaded apps that can take advantage of more than two cores, and power gating keeps the negative impact of the additional cores to a minimum. As we saw in our Moto X review however, two faster cores are still better for most uses than four cores running at lower frequencies. NVIDIA forced everyone's hand in moving to 4 cores earlier than they would've liked, and now you pretty much can't get away with shipping anything less than that in an Android handset. Even Motorola felt it necessary to obfuscate core count with its X8 mobile computing system. Markets like China also seem to demand more cores over better ones, which is why we see such a proliferation of quad-core Cortex A5/A7 designs.

Apple has traditionally been sensible in this regard, even dating back to core count decisions in its Macs. I remember reviewing an old iMac and pitting it against a Dell XPS One at the time. This was in the pre-power gating/turbo days. Dell went the route of more cores, while Apple opted for fewer, faster ones and put the CPU savings into a better GPU. You can guess which system ended up ahead.

In such a thermally constrained environment, going quad-core only makes sense if you can properly power gate/turbo up when some cores are idle. I have yet to see any mobile SoC vendor (with the exception of Intel with Bay Trail) do this properly, so until we hit that point the optimal target is likely two cores. You only need to look back at the evolution of the PC to come to the same conclusion. Before the arrival of Nehalem and Lynnfield, you always had to make a tradeoff between fewer faster cores and more of them. Gaming systems (and most users) tended to opt for the former, while those doing heavy multitasking went with the latter. Once we got architectures with good turbo, the 2 vs 4 discussion became one of cost and nothing more. I expect we’ll follow the same path in mobile.

Then there’s the frequency discussion. Brian and I have long been hinting at the sort of ridiculous frequency/voltage combinations mobile SoC vendors have been shipping at for nothing more than marketing purposes. I remember ARM telling me the ideal target for a Cortex A15 core in a smartphone was 1.2GHz. Samsung’s Exynos 5410 stuck four Cortex A15s in a phone with a max clock of 1.6GHz. The 5420 increases that to 1.7GHz. The problem with frequency scaling alone is that it typically comes at the price of higher voltage. There’s a quadratic relationship between voltage and power consumption, so it’s quite possibly one of the worst ways to get more performance. Brian even tweeted an image showing the frequency/voltage curve for a high-end mobile SoC. Note the huge increase in voltage required to deliver what amounts to another 100MHz in frequency.
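
For reference, the standard first-order model for dynamic (switching) power makes the penalty explicit:

$$P_{dyn} \approx \alpha \, C \, V^2 \, f$$

where $\alpha$ is the activity factor, $C$ the switched capacitance, $V$ the supply voltage and $f$ the clock frequency. Because the last few hundred MHz on the curve typically demand a disproportionate increase in $V$, power grows much faster than performance at the top of the range.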

The combination of these two factors gives us a basis for why Apple settled on two Swift cores running at 1.3GHz in the A6, and it's also why the A7 comes with two cores running at the same max frequency. Interestingly enough, this is the same max non-turbo frequency Intel settled on for Bay Trail. Given a faster process (and turbo) I would expect to see Apple push higher frequencies, but without those things it makes sense to remain conservative. I verified frequency through a combination of reporting tools and benchmarks. While it's possible that I'm wrong, everything I've run on the device (both public and not) points to a 1.3GHz max frequency.

Verifying core count is a bit easier. Many benchmarks report core count, and I have some internal tools that do the same; all agreed on the same 2 cores/2 threads conclusion. Geekbench 3 breaks out both single and multithreaded performance results. I checked with the developer to ensure that the number of threads isn't hard coded; the benchmark queries the max number of logical CPUs before spawning that number of threads. Looking at the ratio of single to multithreaded performance on the iPhone 5s, it's safe to say that we're dealing with a dual-core part:

Geekbench 3 Single vs. Multithreaded Performance - Apple A7

                                  Integer      FP
Single Threaded                      1471    1339
Multi Threaded                       2872    2659
A7 Advantage                        1.95x   1.99x
Peak Theoretical 2C Advantage       2.00x   2.00x
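
For the curious, this is roughly how such tools discover core count at runtime. On Darwin the logical CPU count is exposed through sysctl, which is presumably the mechanism behind Geekbench's query; a minimal sketch (note that hw.cpufrequency is a documented sysctl key on OS X but isn't reliably populated on iOS, which is part of why max frequency has to be inferred from benchmarks instead):

```c
#include <stdio.h>
#include <stdint.h>
#include <sys/types.h>
#include <sys/sysctl.h>

int main(void) {
    /* Logical CPU count: what a benchmark queries before deciding
       how many worker threads to spawn. */
    int ncpu = 0;
    size_t len = sizeof(ncpu);
    if (sysctlbyname("hw.logicalcpu", &ncpu, &len, NULL, 0) == 0)
        printf("logical CPUs: %d\n", ncpu);

    /* Max clock in Hz. Available on OS X; iOS does not reliably
       expose this key, hence the fallback message. */
    uint64_t freq = 0;
    len = sizeof(freq);
    if (sysctlbyname("hw.cpufrequency", &freq, &len, NULL, 0) == 0)
        printf("max frequency: %llu Hz\n", (unsigned long long)freq);
    else
        printf("hw.cpufrequency not exposed on this platform\n");

    return 0;
}
```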

Now the question is, what’s changed in these cores?

 

Comments

  • Wilco1 - Thursday, September 19, 2013 - link

    The Geekbench results are indeed skewed by AES encryption. The author claimed AES was the only benchmark where they use hardware acceleration when available. There has been debate about fixing the weighting or placing hardware-accelerated benchmarks in a separate category to avoid skewing the results. So I'm hoping a future version will fix this.

    As for cross-platform benchmarking, Geekbench currently uses the default platform compiler (LLVM on iOS, GCC on Android, VC++ on Windows). So there will be compiler differences that skew results slightly. However this is also what you'd get if you built the same application for iOS and Android.
  • smartypnt4 - Thursday, September 19, 2013 - link

    A lot of the other stuff in Geekbench seems to be fairly representative, though. Except a few of the FP ones like the blur and sharpen tests...

    It surely can't be hard to have Geekbench omit those results. I think if they did that, you'd see that the A7 is roughly 50-60% faster than the A6 instead of 100% faster, but I'm not sure there. I'd have to go and do work to figure that out. Which is annoying :-)
  • name99 - Wednesday, September 18, 2013 - link

    I'd agree with the tweaks you suggest: (improved memory controller and prefetcher, doubling of L2, larger branch predictor tables).

    There is also scope for a wider CPU. Obviously the most simple-minded widening of a CPU substantially increases power, but there are ways to limit the extra power without compromising performance too much, if you are willing to spend the transistors. I think Apple is not just willing to spend the transistors, but will have them available to spend once they ditch 32-bit compatibility. At that point they can add a fourth decoder, use POWER style blocking of instructions to reduce retirement costs, and add whatever extra pipes make sense.
    The most useful improvement (in my experience) would be to up the L1 from handling one load + one store per cycle to two loads + one store per cycle, but I don't know what the power cost of that is --- may be too high.

    On the topic of minor tweaks, do we know what the page size used by iOS is? If they go from 4K to 16K and/or add support for large pages, they could get a 10% or so speed boost just from better TLB coverage.
    (And what's Android's story on this front? Do they stick with standard 4K pages, or do they utilize 16 or 64K pages and/or large pages?)
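
    For what it's worth, the user-visible page size is easy to check at runtime. A minimal POSIX sketch (this reports what the kernel exposes to user space, which may not match how the hardware TLB is actually configured underneath):

    ```c
    #include <stdio.h>
    #include <unistd.h>

    int main(void) {
        /* Base page size the kernel advertises to user space. */
        long pagesize = sysconf(_SC_PAGESIZE);
        printf("page size: %ld bytes\n", pagesize);
        return 0;
    }
    ```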
  • extide - Wednesday, September 18, 2013 - link

    Those are some pretty generous numbers you pulled out of your hat there. It's not as easy as just do this and that and bam, you have something to compete with Intel Core series stuff. No. I mean yeah, Apple has done a great job here and I wish someone else were making CPUs like this for Android phones, but oh well.
  • name99 - Wednesday, September 18, 2013 - link

    "Now, I will agree that this does prove that if Apple really wanted to, they could build something to compete with Haswell in terms of raw throughput."

    I agree with your point, but I think we should consider what an astonishing statement this is.
    Two years ago Apple wasn't selling its own CPU. They burst onto the scene and with their SECOND device they're at an IPC and a performance/watt that equals Intel! Equals THE competitor in this space, the guys who are using the best process on earth.

    If you don't consider that astonishing, you don't understand what has happened here.

    (And once again I'd make my pitch that THIS shows what Intel's fatal flaw is. The problem with x86 is not that it adds area to a design, or that it slows it down --- though it does both. The problem is that it makes design so damn complex that you're constantly lagging; and you're terrified of making large changes because you might screw up.
    Apple, saddled with only the much smaller ARM overhead, has been vastly more nimble than Intel.
    And it's only going to get worse if, as I expect, Apple ditches 32-bit ARM as soon as they can, in two years or so, giving them an even easier design target...)

    What's next for Apple?
    At the circuit level, I expect them to work hard to make their CPU as good at turboing as Intel. (Anand talked about this.)
    At the ISA level, I expect their next major target to be some form of hardware transactional memory --- it just makes life so much easier, and, even though they're at two cores today, they know as well as anyone that the future is more cores. You don't have to do TM the way Intel has done it; the solution IBM used for POWER8 is probably a better fit for ARM. And of course if Apple do this (using their own extensions, because as far as I know ARM doesn't yet even have a TM spec) it's just one more way in which they differentiate their world from the commodity ARM world.
  • smartypnt4 - Wednesday, September 18, 2013 - link

    @extide: agreed.

    @name99: It is very astonishing indeed. Then again, a high profile company like Apple has no problem attracting some of the best talent via compensation and prestige.

    They've still got quite a long way to go to match Haswell, in any case. But the throughput is technically there to rival Intel if they wanted to. I would hope that Haswell contains a much more advanced branch predictor and prefetcher than what Apple has, but you never know. My computer architecture professor always said that everything in computer architecture has already been discovered; the question now is when it becomes advantageous to spend the transistors to implement the most complicated designs.

    The next year is going to be very interesting, indeed.
  • Bob Todd - Wednesday, September 18, 2013 - link

    How many crows did you stuff down after claiming BT would be slower than A15 and even A12? Remember posting this about integer performance?

    "Silverthorne < A7 < A9 < A9R4 < Silvermont < A12 < Bobcat < A15 < Jaguar"

    Apple's A7 looks great, but you've made so many utterly ridiculous Intel performance bashing posts that it's pretty much impossible to take anything you say seriously.
  • Wilco1 - Wednesday, September 18, 2013 - link

    BT has indeed far lower IPC than A15 just like I posted - pretty much all benchmark results confirm that. On Geekbench 3 A15 is 23-25% faster clock for clock on integer and FP.

    The jury is still out on A12 vs BT as we've seen no performance results for A12 so far. So claiming I was wrong is not only premature but also incorrect as the fact is that Bay Trail is slower.
  • Wilco1 - Wednesday, September 18, 2013 - link

    Also new version with A7 and A57 now looks like this:

    Silverthorne < A7 < A9 < A9R4 < Silvermont < A12 < Bobcat < A15 < Jaguar < A57 < Apple A7
  • Bob Todd - Wednesday, September 18, 2013 - link

    Cherry picking a single benchmark which is notoriously inaccurate at comparisons across platforms/architectures doesn't make you "right", it just makes you look like more of a troll. Bay Trail has better integer performance than Jaguar (at near identical base clocks), so by your own ranking above it *has* to be faster than A12 and A15.

    You show up in every ARM article spouting the same drivel over and over again, yet you were mysteriously absent in the Bay Trail performance preview. Here's the link if you want to try to find a way to spin more FUD.

    http://anandtech.com/show/7314/intel-baytrail-prev...

    Apple's A7 looks great, and Imagination Technologies is still the powerhouse of mobile graphics. The A7 version in the iPad should be a beast. None of that makes most of your comments any less loony.
