Apple's Cyclone Microarchitecture Detailed

by Anand Lal Shimpi on March 31, 2014 2:10 AM EST


The most challenging part of last year's iPhone 5s review was piecing together details about Apple's A7 without any internal Apple assistance. I had less than a week to turn the review around and limited access to tools (much less time to develop them on my own) to figure out what Apple had done to double CPU performance without scaling frequency. The end result was an (incorrect) assumption that Apple had simply evolved its first ARMv7 architecture (codename: Swift). Based on the limited information I had at the time I assumed Apple simply addressed some low hanging fruit (e.g. memory access latency) in building Cyclone, its first 64-bit ARMv8 core. By the time the iPad Air review rolled around, I had more knowledge of what was underneath the hood:

As far as I can tell, peak issue width of Cyclone is 6 instructions. That’s at least 2x the width of Swift and Krait, and at best more than 3x the width depending on instruction mix. Limitations on co-issuing FP and integer math have also been lifted as you can run up to four integer adds and two FP adds in parallel. You can also perform up to two loads or stores per clock.

With Swift, I had the luxury of Apple committing LLVM changes that not only gave me the code name but also confirmed the size of the machine (3-wide OoO core, 2 ALUs, 1 load/store unit). With Cyclone however, Apple held off on any public commits. Figuring out the codename and its architecture required a lot of digging.

Last week, the same reader who pointed me at the Swift details let me know that Apple revealed Cyclone microarchitectural details in LLVM commits made a few days ago (thanks again R!). Although I empirically verified many of Cyclone's features in advance of the iPad Air review last year, today we have some more concrete information on what Apple's first 64-bit ARMv8 architecture looks like.

Note that everything below is based on Apple's LLVM commits (and confirmed by my own testing where possible).

Apple Custom CPU Core Comparison
	Apple A6	Apple A7
CPU Codename	Swift	Cyclone
ARM ISA	ARMv7-A (32-bit)	ARMv8-A (32/64-bit)
Issue Width	3 micro-ops	6 micro-ops
Reorder Buffer Size	45 micro-ops	192 micro-ops
Branch Mispredict Penalty	14 cycles	16 cycles (14 - 19)
Integer ALUs	2	4
Load/Store Units	1	2
Load Latency	3 cycles	4 cycles
Branch Units	1	2
Indirect Branch Units	0	1
FP/NEON ALUs	?	3
L1 Cache	32KB I$ + 32KB D$	64KB I$ + 64KB D$
L2 Cache	1MB	1MB
L3 Cache	-	4MB

As I mentioned in the iPad Air review, Cyclone is a wide machine. It can decode, issue, execute and retire up to 6 instructions/micro-ops per clock. I verified this during my iPad Air review by executing four integer adds and two FP adds in parallel. The same test on Swift actually yields fewer than 3 concurrent operations, likely because of an inability to issue to all integer and FP pipes in parallel. Similar limits exist with Krait.
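The co-issue test described above comes down to simple port arithmetic, which can be sketched as a toy model. This is not a simulator and not the test code I ran. The function name `ops_per_cycle` and the dictionary layout are made up for illustration, and the unit counts are taken from the comparison table.

```python
# Toy model of per-cycle co-issue: just the port arithmetic, not a simulator.
# Issue width and unit counts come from the comparison table; the names and
# data layout here are illustrative assumptions.

CYCLONE = {"issue_width": 6, "units": {"int": 4, "fp": 3}}


def ops_per_cycle(core, demand):
    """Upper bound on independent ops that can start in one cycle.

    `demand` maps a unit class ("int", "fp") to the number of independent
    ops of that class that are ready in this cycle.
    """
    started = sum(
        min(count, core["units"].get(kind, 0)) for kind, count in demand.items()
    )
    # The machine can never start more ops than its issue width allows.
    return min(started, core["issue_width"])


if __name__ == "__main__":
    # The iPad Air test: four integer adds plus two FP adds, all independent.
    print(ops_per_cycle(CYCLONE, {"int": 4, "fp": 2}))  # prints 6
```

With more integer work than ALUs (say eight ready integer adds and no FP), the same function returns 4, since the four integer ALUs become the limit rather than the issue width.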

I also noted an increase in overall machine size in my initial tinkering with Cyclone. Apple's LLVM commits indicate a massive 192 entry reorder buffer (coincidentally the same size as Haswell's ROB). Mispredict penalty goes up slightly compared to Swift, but Apple does present a range of values (14 - 19 cycles). This also happens to be the same range as Sandy Bridge and later Intel Core architectures (including Haswell). Given how much larger Cyclone is, a doubling of L1 cache sizes makes a lot of sense.
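One rough way to put the reorder buffer numbers in perspective is to express each ROB as cycles of peak-width issue. The sketch below does only that division. Treating "entries divided by issue width" as a measure of the out-of-order window is my simplifying assumption, not something stated in the LLVM commits.

```python
# Reorder buffer depth expressed as cycles of peak-width issue.
# ROB sizes and issue widths are from the comparison table; the metric
# itself (entries / width) is a simplifying assumption for illustration.

CORES = {
    "Swift": {"rob_entries": 45, "issue_width": 3},
    "Cyclone": {"rob_entries": 192, "issue_width": 6},
}


def window_cycles(core):
    """Cycles of full-width issue that fit in the reorder buffer."""
    return core["rob_entries"] / core["issue_width"]


if __name__ == "__main__":
    for name, core in CORES.items():
        print(f"{name}: {window_cycles(core):.0f} cycles at peak width")
    # Swift: 15 cycles at peak width
    # Cyclone: 32 cycles at peak width
```

By this measure Cyclone's window covers roughly twice as many cycles as Swift's, even though the machine is also twice as wide.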

On the execution side, Cyclone doubles the number of integer ALUs, load/store units and branch units. Cyclone also adds a unit for indirect branches and at least one more FP pipe. Cyclone can sustain three FP operations in parallel (including 3 FP/NEON adds). The third FP/NEON pipe is used for div and sqrt operations; the machine can only execute two FP/NEON muls in parallel.
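The difference between three add-capable pipes and two mul-capable pipes shows up as a throughput bound on batches of independent operations. Here is a minimal sketch of that bound. It assumes each pipe can accept one new add or mul per cycle, which the commits don't confirm, and the helper name `min_cycles` is hypothetical.

```python
import math

# Number of FP/NEON pipes able to execute each op type, per the text above.
# Assumption: every pipe accepts one new op of that type per cycle.
CYCLONE_FP_PIPES = {"add": 3, "mul": 2}


def min_cycles(op, count, pipes=CYCLONE_FP_PIPES):
    """Lower bound on cycles needed to start `count` independent ops."""
    return math.ceil(count / pipes[op])


if __name__ == "__main__":
    print(min_cycles("add", 12))  # prints 4
    print(min_cycles("mul", 12))  # prints 6
```

Div and sqrt are left out on purpose: I don't have throughput data for them, so any number there would be a guess.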

I also found references to buffer sizes for each unit, which I'm assuming are the number of micro-ops that feed each unit. I don't believe Cyclone has a unified scheduler ahead of all of its execution units and instead has statically partitioned buffers in front of each port. I've put all of this information into the crude diagram below:

Unfortunately I don't have enough data on Swift to really produce a decent comparison image. With six decoders and nine ports to execution units, Cyclone is big. As I mentioned before, it's bigger than anything else that goes in a phone. Apple didn't build a Krait/Silvermont competitor; it built something much closer to Intel's big cores. At the launch of the iPhone 5s, Apple referred to the A7 as being "desktop class" - it turns out that wasn't an exaggeration.

Cyclone is a bold move by Apple, but not one that is without its challenges. I still find that there are almost no applications on iOS that really take advantage of the CPU power underneath the hood. More than anything Apple needs first party software that really demonstrates what's possible. The challenge is that at full tilt a pair of Cyclone cores can consume quite a bit of power. So for now, Cyclone's performance is really used to exploit race to sleep and get the device into a low power state as quickly as possible. The other problem I see is that although Cyclone is incredibly forward looking, it launched in devices with only 1GB of RAM. It's very likely that you'll run into memory limits before you hit CPU performance limits if you plan on keeping your device for a long time.

It wasn't until I wrote this piece that Apple's codenames started to make sense. Swift was quick, but Cyclone really does stir everything up. The earlier than expected introduction of a consumer 64-bit ARMv8 SoC caught pretty much everyone off guard (e.g. Qualcomm's shift to vanilla ARM cores for more of its product stack).

The real question is where does Apple go from here? By now we know to expect an "A8" branded Apple SoC in the iPhone 6 and iPad Air successors later this year. There's little benefit in going substantially wider than Cyclone, but there's still a ton of room to improve performance. One obvious example would be through frequency scaling. Cyclone is clocked very conservatively (1.3GHz in the 5s/iPad mini with Retina Display and 1.4GHz in the iPad Air). Assuming Apple moves to a 20nm process later this year, it should be possible to gain some performance by increasing clock speed without a power penalty. I suspect Apple has more tricks up its sleeve than that however. Swift and Cyclone were two tocks in a row by Intel's definition; a third in 3 years would be unusual but not impossible (Intel sort of committed to doing the same with Saltwell/Silvermont/Airmont in 2012 - 2014).
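To get a feel for what frequency scaling alone could buy, the arithmetic can be sketched as below. The shipping clocks are the ones quoted above. The 2.0GHz figure is a purely hypothetical clock picked for illustration, not a prediction, and the model assumes performance scales linearly with frequency, which real workloads rarely manage once memory latency gets involved.

```python
# Peak issue rate and naive clock-scaling speedup for a 6-wide core.
# Shipping clocks are from the text; HYPOTHETICAL_GHZ is an invented value
# used only to illustrate the arithmetic.

ISSUE_WIDTH = 6
SHIPPING_CLOCKS_GHZ = {
    "iPhone 5s / iPad mini with Retina Display": 1.3,
    "iPad Air": 1.4,
}
HYPOTHETICAL_GHZ = 2.0  # assumption, not a known or rumored A8 clock


def peak_uops_per_second(clock_ghz, width=ISSUE_WIDTH):
    """Theoretical peak micro-ops issued per second at a given clock."""
    return width * clock_ghz * 1e9


def clock_speedup(new_ghz, base_ghz):
    """Speedup from clock alone, assuming perfectly linear scaling."""
    return new_ghz / base_ghz


if __name__ == "__main__":
    for device, ghz in SHIPPING_CLOCKS_GHZ.items():
        print(f"{device}: {peak_uops_per_second(ghz) / 1e9:.1f}G micro-ops/s peak")
    gain = clock_speedup(HYPOTHETICAL_GHZ, SHIPPING_CLOCKS_GHZ["iPad Air"])
    print(f"Hypothetical {HYPOTHETICAL_GHZ}GHz vs iPad Air: {gain:.2f}x")
```

These are peak figures only; sustained throughput depends on the instruction mix and on how often the core is waiting on memory.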

Looking at Cyclone makes one thing very clear: the rest of the players in the ultra mobile CPU space didn't aim high enough. I wonder what happens next round.

182 Comments

TylerGrunter - Monday, March 31, 2014 - link
I guess you would also need to make some architectural changes, but I doubt there is anything that can't be overcome. In fact the Cyclone cores have very similar performance to Haswell cores in Geekbench when clocked at similar frequencies.
E.g. with Celeron
http://browser.primatelabs.com/geekbench3/compare/...
And with i3-4010Y
http://browser.primatelabs.com/geekbench3/compare/...
Anders CT - Tuesday, April 1, 2014 - link
Sure, the A7 could reach the performance of desktop cpus in the few milliseconds before it melted away.
grubbymits - Monday, March 31, 2014 - link
with a pipe that wide, does this mean they'll try to get some SMT in there? they've stuck with dual-core, but if they want to be 'desktop' class, they're going to need more threads. A 6-wide pipe is a lot of silicon to just leave idle for most of the time.
MrSpadge - Monday, March 31, 2014 - link
Yeah.. SMT and put such chips into servers. Power efficiency seems to be brilliant. I wonder which packing density they could achieve if they left all that mobile SoC stuff out, which a server won't need.
name99 - Monday, March 31, 2014 - link
SMT strikes me as an unlikely path. These are not throughput cores, and the additional complexity of SMT is a distraction at this stage. 2x SMT gets you about 25% improvement, and that's not going to change (because it's basically tied to the fact that 2x threads means each thread gets half the cache, so memory traffic goes up).

What would not surprise me, whether it comes in the future or is already in place, is automatic throttling of the pipe to reduce power. The idea would be that, driven by SW and/or HW, the CPU could switch to a lower power mode which is, say, 3-wide, and an even lower 1-wide mode. In these lower modes a bunch of other features would also shrink to reduce power (e.g. the register file shuts down some of its ports). One could imagine, for example, that the HW tracks the average throughput it's seeing and, if that drops below a certain level and the SW has told it it's OK, assumes we're spending all our time waiting on memory, switches to 3-wide mode and stays there for the next million instructions.

There have been features like this in other CPUs. For example in power-saving mode some CPUs have throttled I-fetch which then throttles the rest of the pipeline. Even doing that (which is really easy) in a way that's informed by how many instructions are being retired per cycle would be useful in saving power.
jasonelmore - Monday, March 31, 2014 - link
My theory is Apple purposely put 1 GB of RAM in iPads and iPhones to make the devices get outdated faster. They knew they built a good CPU, maybe too good, so they put in 1GB of RAM, which really hurts the iPad Air's and iPad mini Retina's UI performance. The UI constantly dips below 30 FPS, and typing on my Air is very frustrating. Sometimes there is a 2 second lag when hitting a letter on the keyboard.
André - Monday, March 31, 2014 - link
iOS 7.1 has fixed most of the problems I had with the iPad Air, although yeah, 1 GB of RAM will be what limits these devices in the near future.
carpetbomberz - Monday, March 31, 2014 - link
Love the investigative tech journalism you and your sources are doing here. Keep up the good work, you are now achieving the level of the old Byte Magazine under Tom Halfhill (with much less cooperation I might add from the companies and products being covered). Kudos to you.
krumme - Monday, March 31, 2014 - link
Yeaa. Damn fine digging, and interesting information!
tipoo - Monday, March 31, 2014 - link
The rumors point to A8 seeing a move to quad core for Apple, I wonder if those are true. I think they could still improve single threaded performance, but the 2x gains will be harder and harder to come by, so maybe they will finally go the "throw more cores in" route.
