A7 SoC Explained

I’m still surprised by the amount of confusion around Apple’s CPU cores, so that’s where I’ll start. I’ve already outlined how ARM’s business model works, but in short there are two basic types of licenses ARM will bestow upon its partners: processor and architecture. The former involves implementing an ARM designed CPU core, while the latter is the creation of an ARM ISA (Instruction Set Architecture) compatible CPU core.

NVIDIA and Samsung, up to this point, have gone the processor license route. They take ARM designed cores (e.g. Cortex A9, Cortex A15, Cortex A7) and integrate them into custom SoCs. In NVIDIA’s case the CPU cores are paired with NVIDIA’s own GPU, while Samsung licenses GPU designs from ARM and Imagination Technologies. Apple previously leveraged its ARM processor license as well. Until last year’s A6 SoC, all Apple SoCs leveraged CPU cores designed by and licensed from ARM.

With the A6 SoC however, Apple joined the ranks of Qualcomm with leveraging an ARM architecture license. At the heart of the A6 were a pair of Apple designed CPU cores that implemented the ARMv7-A ISA. I came to know these cores by their leaked codename: Swift.

At its introduction, Swift proved to be one of the best designs on the market. An excellent combination of performance and power consumption, the Swift based A6 SoC improved power efficiency over the previous Cortex A9 based design. Swift also proved to be competitive with the best from Qualcomm at the time. Since then however, Qualcomm has released two evolutions of its CPU core (Krait 300 and Krait 400), and pretty much regained performance leadership over Apple. Being on a yearly release cadence, this is Apple’s only attempt to take back the crown for the next 12 months.

Following tradition, Apple replaces its A6 SoC with a new generation: A7.

With only a week to test battery life, performance, wireless and cameras on two phones, in addition to actually using them as intended, there wasn’t a ton of time to go ridiculously deep into the new SoC’s architecture. Here’s what I’ve been able to piece together thus far.

First off, based on conversations with as many people in the know as possible, as well as just making an educated guess, it’s probably pretty safe to say that the A7 SoC is built on Samsung’s 28nm HK+MG process. It’s too early for 20nm at reasonable yields, and Apple isn’t ready to move some (not all) of its operations to TSMC.

The jump from 32nm to 28nm results in peak theoretical scaling of 76.5% (the same design on 28nm can be no smaller than 76.5% of the die area at 32nm). In reality, nothing ever scales perfectly so we’re probably talking about 80 - 85% tops. Either way that’s a good amount of room for new features.

At its launch event Apple officially announced both die size for the A7 (102mm^2) as well as transistor count (over 1 billion). Don’t underestimate the magnitude of both of these disclosures. The technical folks at Cupertino are clearly winning some battle to talk more about their designs and not less. We’re not yet at the point where I’m getting pretty diagrams and a deep dive, but it’s clear that Apple is beginning to open up more (and it’s awesome).

Apple has never previously disclosed transistor count. I also don’t know if this “over 1 billion” figure is based on a schematic or layout transistor count. The only additional detail I have is that Apple is claiming a near doubling of transistors compared to the A6. Looking at die sizes and taking into account scaling from the process node shift, there’s clearly a more fundamental change to the chip’s design. It is possible to optimize a design (and transistors) for area, which seems to be what has happened here.

The CPU cores are, once again, a custom design by Apple. These aren’t Cortex A57 derivatives (still too early for that), but rather some evolution of Apple’s own Swift architecture. I’ll dive into specifics of what I’ve been able to find in a moment. To answer the first question on everyone’s mind, I believe there are two of these cores on the A7. Before I explain how I arrived at this conclusion, let’s first talk about cores and clock speeds.

I always thought the transition from 2 to 4 cores happened quicker in mobile than I had expected. Thankfully there are some well threaded apps that have been able to take advantage of more than two cores and power gating keeps the negative impact of the additional cores down to a minimum. As we saw in our Moto X review however, two faster cores are still better for most uses than four cores running at lower frequencies. NVIDIA forced everyone’s hand in moving to 4 cores earlier than they would’ve liked, and now you pretty much can’t get away with shipping anything less than that in an Android handset. Even Motorola felt necessary to obfuscate core count with its X8 mobile computing system. Markets like China seem to also demand more cores over better ones, which is why we see such a proliferation of quad-core Cortex A5/A7 designs. Apple has traditionally been sensible in this regard, even dating back to core count decisions in its Macs. I remembering reviewing an old iMac and pitting it against a Dell XPS One at the time. This was in the pre-power gating/turbo days. Dell went the route of more cores, while Apple chose for fewer, faster ones. It also put the CPU savings into a better GPU. You can guess which system ended out ahead.

In such a thermally constrained environment, going quad-core only makes sense if you can properly power gate/turbo up when some cores are idle. I have yet to see any mobile SoC vendor (with the exception of Intel with Bay Trail) do this properly, so until we hit that point the optimal target is likely two cores. You only need to look back at the evolution of the PC to come to the same conclusion. Before the arrival of Nehalem and Lynnfield, you always had to make a tradeoff between fewer faster cores and more of them. Gaming systems (and most users) tended to opt for the former, while those doing heavy multitasking went with the latter. Once we got architectures with good turbo, the 2 vs 4 discussion became one of cost and nothing more. I expect we’ll follow the same path in mobile.

Then there’s the frequency discussion. Brian and I have long been hinting at the sort of ridiculous frequency/voltage combinations mobile SoC vendors have been shipping at for nothing more than marketing purposes. I remember ARM telling me the ideal target for a Cortex A15 core in a smartphone was 1.2GHz. Samsung’s Exynos 5410 stuck four Cortex A15s in a phone with a max clock of 1.6GHz. The 5420 increases that to 1.7GHz. The problem with frequency scaling alone is that it typically comes at the price of higher voltage. There’s a quadratic relationship between voltage and power consumption, so it’s quite possibly one of the worst ways to get more performance. Brian even tweeted an image showing the frequency/voltage curve for a high-end mobile SoC. Note the huge increase in voltage required to deliver what amounts to another 100MHz in frequency.

The combination of both of these things gives us a basis for why Apple settled on two Swift cores running at 1.3GHz in the A6, and it’s also why the A7 comes with two cores running at the same max frequency. Interestingly enough, this is the same max non-turbo frequency Intel settled at for Bay Trail. Given a faster process (and turbo), I would expect to see Apple push higher frequencies but without those things, remaining conservative makes sense. I verified frequency through a combination of reporting tools and benchmarks. While it’s possible that I’m wrong, everything I’ve run on the device (both public and not) points to a 1.3GHz max frequency.

Verifying core count is a bit easier. Many benchmarks report core count, I also have some internal tools that do the same - all agreed on the same 2 cores/2 threads conclusion. Geekbench 3 breaks out both single and multithreaded performance results. I checked with the developer to ensure that the number of threads isn’t hard coded. The benchmark queries the max number of logical CPUs before spawning that number of threads. Looking at the ratio of single to multithreaded performance on the iPhone 5s, it’s safe to say that we’re dealing with a dual-core part:

Geekbench 3 Single vs. Multithreaded Performance - Apple A7
  Integer FP
Single Threaded 1471 1339
Multi Threaded 2872 2659
A7 Advantage 1.97x 1.99x
Peak Theoretical 2C Advantage 2.00x 2.00x

Now the question is, what’s changed in these cores?

 

Introduction, Hardware & Cases After Swift Comes Cyclone
POST A COMMENT

464 Comments

View All Comments

  • ddriver - Wednesday, September 18, 2013 - link

    My basis for this conclusion is how the article is structured, the carefully picking of benchmarks and selective comparisons. This is clearly visible and has nothing to do with the actual chip specifications. It has nothing to do with execution mode specific details. So no, I don't have problem with facts, unlike you.

    Furthermore, that 30% number you were focused on is hardly impressive and proportional to the claims this article is making. In a workload that would take an hour, 30% is a noticeable improvement, but for typical phone applications this is not the case.
    Reply
  • Dooderoo - Wednesday, September 18, 2013 - link

    The structure of the article and the benchmarks are mostly the same as they use in most reviews, excluding some Android specific benchmarks. Where exactly do you see "carefully picking of benchmarks and selective comparisons". Put differently: what benchmarks should they include to convince you there is no "cunning deceit" at work?

    What claims in the article are not proportional with the 30% (actually more) performance gain?

    I won't even comment on the "not a noticeable improvement" bit.
    Reply
  • andrewaggb - Wednesday, September 18, 2013 - link

    My issue with all the benchmarks is that they are mostly synthetic. The most meaningful benchmarks are the applications you plan to use and the usage patterns you are targeting. Synthetics are fascinating, but I think it's generally a mistake to buy anything based on them. Reply
  • notddriver - Thursday, September 19, 2013 - link

    Um, so if a 30% improvement is hardly impressive and irrelevant to phones, then isn't the entire concept of reviewing phones on the basis of hardware performance also irrelevant? Which would make your complaints about the biased-yet-insignificant-review as vital as a debate over whether Harry Potter or Spiderman would be better at defending Metropolis.

    Incidentally, my iPhone 5 is powerful enough that I never notice any issues—as I'm sure the last generation of Android phones would be. But if your going to go to town over a dozen or more comments about a topic, at least pretend that it matters a tiny bit. Just good form.
    Reply
  • oRdchaos - Wednesday, September 18, 2013 - link

    I've seen people all over the web get very worked up about people's phrasing with regard to 64-bit. Would you prefer the title of the section were "Performance gains from a 64-bit architecture and the new ARMv8 instruction set"? People keep arguing that 64-bit in a vacuum doesn't give much performance gain. But there is no vacuum.

    I think the article is very clear to point out where gains are from additional instructions, versus a doubling of the register bit width, versus improved memory subsystem/cache. I'm sure when they get chances to write more of their own tests, they'll be able to pinpoint things further.
    Reply
  • sfaerew - Wednesday, September 18, 2013 - link

    engadget is multi-thread geekbench performance. tegra4 4cores vs A7 2cores Reply
  • Spoony - Wednesday, September 18, 2013 - link

    - You are correct, there are no native cross-platform benches used. Which ones do you suggest Anandtech use? We all know Geekbench is essentially meaningless across platforms.

    - If you are talking about this engadget review: http://www.engadget.com/2013/09/17/iphone-5s-revie... It appears that Nvidia SHIELD (Tegra 4) led in only one benchmark out of six. This makes your statement incorrect. LG G2 is more competitive. Need we repeat how inaccurate Geekbench is cross-platform. It is as apples-to-oranges as the JS tests.

    - I believe what Anandtech was attempting to show with the encryption was the difference ARMv8 ISA makes. In fact the title of that somewhat sensational chart is "AArch64 vs. AArch32 Performance Comparison". So while you are right, the encryption tests are handled in a fundamentally different way, that way is part of the ISA and is an advantage of AArch64, and thus is valid in the chart.

    - It will be curious to see whether Qualcomm can deliver A7 like performance using ARMv7 with extended features. My position is no, which is the whole point of that entire page of the review. ARMv8 is actually enabling some additional performance due to ISA efficiency and more features.

    - I think noticeable memory footprint bloat of a 64-bit executable is completely ridiculous. But to see if I was right, I did some testing. It's getting a bit hard to find fat binaries to take apart these days, most things are x86_64 only. But I found a few. I computed for three separate applications, took the average, and it looks like about a 9% increase in executable size. Considering that executables themselves are a tiny part of any application's assets, I think it is completely insubstantial. If you calculate the increase in executable size versus the size of the whole application package, it averages to a less than 1% increase.

    I too am a bit sad that Apple didn't increase the RAM, and also equally sad that connectivity was left out this rev. I continue to be sad that there is not a more serious storage controller inside the phone. You make some valid points, but I think you also make some erroneous ones. The question with phone SoCs is: Is this a well balanced platform along the axes of performance (GPU and CPU), power consumption (thus heat), and features. I believe that the A7 is well calibrated. Obviously Qualcomm is also doing great work, and perhaps their SoCs are equally as well calibrated.
    Reply
  • ddriver - Wednesday, September 18, 2013 - link

    - This is entirely his decision, considering writing those reviews is his job, not mine. He can either use actual native benchmarks which reflect the performance of the actual hardware, or call it JS VM performance instead of CPU performance, because different JS implementations across platforms are entirely meaningless.

    - there is only one geekbench test at engadged. That is what I said "geekbench" - I did not imply it was faster in all tests in the engadged review, I don't know why you insinuate I did so. IIRC snapdragon 800 is actually a little slower in the CPU department than tegra 4, and only faster in the graphics department.

    - the boost in encryption is completely disproportional to other improvements and is due to hardware implementations, not 64bit execution mode. So, if anything, it should be a graph or chart on its own, instead of using it to bulk up the chart that is supposed to be indicative of integer performance improvements in 64bit mode.

    - maybe v7 chips with 128 bit SIMD units will not deliver quite the performance of A7, because there is more to the subject than the width of the registers (the number of registers doesn't really matter that much), like supported instructions. At any rate, v7 chips are still quad core, which means 4x128bit SIMD units compared to the 2 on A7, albeit with a few extra supported instructions. Until any native benchmark that guarantees saturation of SIMD units pops out, it will be a foolish thing to make a concrete statement on the subject. But boosting v7 SIMD units to 128bit width will at least make it competitive in number crunching scenarios, which use SIMD 99% of the time.

    - this is very relative, you can store the same data in three different containers and get a completely different footprint - a vector will only use a single pointer, since it is continuous in memory, a forward list will use a pointer for every data element, a linked list will use two. Depending on the requirements, you may need faster arbitrary inserts and deletions, which will require a linked list, and in the case of a single byte datatype, a 32bit single element will be 12 bytes because of padding and alignment, while in 64bit mode the size will grow to 24 bytes, which is exactly double. Granted, this is the other extremity of the "less than 1%" you came up with, truth is results will vary in between depending on the workload.

    As I said in the first post, the wise thing would be to reserve judgement until mass availability, mostly because I know corporate practices involving exclusive reviews prior to availability, which are a pronounced determining factor to the initial rate of sales. In short, apple is in the position to be greatly rewarded for imposing some cheating requirements on early exclusive reviewers. And at least in this aspect I think everyone will disagree, apple is not the kind of company to let such an opportunity go to waste.
    Reply
  • ddriver - Wednesday, September 18, 2013 - link

    *no one will disagree Reply
  • Dug - Wednesday, September 18, 2013 - link

    I will.
    "As I said in the first post, the wise thing would be to reserve judgement until mass availability, mostly because I know corporate practices involving exclusive reviews prior to availability, which are a pronounced determining factor to the initial rate of sales. In short, apple is in the position to be greatly rewarded for imposing some cheating requirements on early exclusive reviewers. And at least in this aspect I think everyone will disagree, apple is not the kind of company to let such an opportunity go to waste."

    Prove it and stop making assumptions.
    Reply

Log in

Don't have an account? Sign up now