After Swift Comes Cyclone Oscar

I was fortunate enough to receive a tip last time that pointed me at some LLVM documentation calling out Apple’s Swift core by name. Scrubbing through those same docs, it seems like my leak has been plugged. Fortunately I came across a unique string looking at the iPhone 5s while it booted:

I can’t find any other references to Oscar online, in LLVM documentation or anywhere else of value. I also didn’t see Oscar references on prior iPhones, only on the 5s. I’d heard that this new core wasn’t called Swift, referencing just how different it was. Obviously Apple isn’t going to tell me what it’s called, so I’m going with Oscar unless someone tells me otherwise.

Oscar is a CPU core inside M7, Cyclone is the name of the Swift replacement.

Cyclone likely resembles a beefier Swift core (or at least Swift inspired) than a new design from the ground up. That means we’re likely talking about a 3-wide front end, and somewhere in the 5 - 7 range of execution ports. The design is likely also capable of out-of-order execution, given the performance levels we’ve been seeing.

Cyclone is a 64-bit ARMv8 core and not some Apple designed ISA. Cyclone manages to not only beat all other smartphone makers to ARMv8 but also key ARM server partners. I’ll talk about the whole 64-bit aspect of this next, but needless to say, this is a big deal.

The move to ARMv8 comes with some of its own performance enhancements. More registers, a cleaner ISA, improved SIMD extensions/performance as well as cryptographic acceleration are all on the menu for the new core.

Pipeline depth likely remains similar (maybe slightly longer) as frequencies haven’t gone up at all (1.3GHz). The A7 doesn’t feature support for any thermal driven CPU (or GPU) frequency boost.

The most visible change to Apple’s first ARMv8 core is a doubling of the L1 cache size: from 32KB/32KB (instruction/data) to 64KB/64KB. Along with this larger L1 cache comes an increase in access latency (from 2 clocks to 3 clocks from what I can tell), but the increase in hit rate likely makes up for the added latency. Such large L1 caches are quite common with AMD architectures, but unheard of in ultra mobile cores. A larger L1 cache will do a good job keeping the machine fed, implying a larger/more capable core.

The L2 cache remains unchanged in size at 1MB shared between both CPU cores. L2 access latency is improved tremendously with the new architecture. In some cases I measured L2 latency 1/2 that of what I saw with Swift.

The A7’s memory controller sees big improvements as well. I measured 20% lower main memory latency on the A7 compared to the A6. Branch prediction and memory prefetchers are both significantly better on the A7.

I noticed large increases in peak memory bandwidth on top of all of this. I used a combination of custom tools as well as publicly available benchmarks to confirm all of this. A quick look at Geekbench 3 (prior to the ARMv8 patch) gives a conservative estimate of memory bandwidth improvements:

Geekbench 3.0.0 Memory Bandwidth Comparison (1 thread)
  Stream Copy Stream Scale Stream Add Stream Triad
Apple A7 1.3GHz 5.24 GB/s 5.21 GB/s 5.74 GB/s 5.71 GB/s
Apple A6 1.3GHz 4.93 GB/s 3.77 GB/s 3.63 GB/s 3.62 GB/s
A7 Advantage 6% 38% 58% 57%

We see anywhere from a 6% improvement in memory bandwidth to nearly 60% running the same Stream code. I’m not entirely sure how Geekbench implemented Stream and whether or not we’re actually testing other execution paths in addition to (or instead of) memory bandwidth. One custom piece of code I used to measure memory bandwidth showed nearly a 2x increase in peak bandwidth. That may be overstating things a bit, but needless to say this new architecture has a vastly improved cache and memory interface.

Looking at low level Geekbench 3 results (again, prior to the ARMv8 patch), we get a good feel for just how much the CPU cores have improved.

Geekbench 3.0.0 Compute Performance
  Integer (ST) Integer (MT) FP (ST) FP (MT)
Apple A7 1.3GHz 1065 2095 983 1955
Apple A6 1.3GHz 750 1472 588 1165
A7 Advantage 42% 42% 67% 67%

Integer performance is up 44% on average, while floating point performance is up by 67%. Again this is without 64-bit or any other enhancements that go along with ARMv8. Memory bandwidth improves by 35% across all Geekbench tests. I confirmed with Apple that the A7 has a 64-bit wide memory interface, and we're likely talking about LPDDR3 memory this time around so there's probably some frequency uplift there as well.

The result is something Apple refers to as desktop-class CPU performance. I’ll get to evaluating those claims in a moment, but first, let’s talk about the other big part of the A7 story: the move to a 64-bit ISA.

A7 SoC Explained The Move to 64-bit
Comments Locked

464 Comments

View All Comments

  • Dug - Wednesday, September 18, 2013 - link

    "maybe you should hire a developer to write native cross platform benchmark tools"
    WHY? It is not going to make any difference. Developers aren't writing native cross platform programs. If they can take advantage of anything that's in the system, then show it off.
    That would be like telling car manufacturers to redesign a hybrid to gas only to compare with all the other gas only cars.
  • ddriver - Wednesday, September 18, 2013 - link

    "Developers aren't writing native cross platform programs"

    Maybe it is about time you crawl from under the rock you are living under... Any even remotely concerned with performance and efficiency application pretty much mandates it is a native application. It would be incredibly stupid to not do it, considering the "closest" to native language Java is like 2-3 times slower and users 10-20 times as much memory.
  • Dug - Wednesday, September 18, 2013 - link

    Exactly my point! "native cross platform" Each cross-platform solution can only support a subset of the functionality included in each native platform.

    It doesn't get you anywhere to produce a native cross platform benchmark tool.

    Again you have to mitigate to names and snide comments because you are wrong.
  • ddriver - Wednesday, September 18, 2013 - link

    What you talk about is I/O, events and stuff like that. When it comes to pure number crunching the same code can execute perfectly well for every platform it is complied against. Actually, some modern frameworks go even further than that and provide ample abstractions. For example, the same GUI application can run on Windows, Linux, MacOS, iOS and Android, apart from a few other minor platforms.
  • Anand Lal Shimpi - Wednesday, September 18, 2013 - link

    Ultimately the benchmarking problem is being fixed, just not on the time scale that we want it to. I figured we'd be better off by now, and in many ways we are (WebXPRT, Browsermark are both steps in the right direction, we have more native tools under Android now) but part of the problem is there was a long period of uncertainty around what OSes would prevail. Now that question is finally being answered and we're seeing some real investment in benchmarks. Trust me, I tried to do a lot behind the scenes over the past 4 years (some of which Brian and I did recently) but this stuff takes time. I remember going through this in the early days of the PC industry too though, I know how it all ends - it'll just take a little time to get there.

    Actually I think 128-bit registers might've been optional on v7.

    The only reason encryption results are in that table is because that's how Geekbench groups them. There's no nefarious purpose there (note that it's how we've always reported the Geekbench results, as they are reported in the test themselves).

    In my experience with the 5s I haven't noticed any performance regressions compared to the 5/5c. I'm not saying they don't exist and I'll continue to hunt, it's just that they aren't there now. I believe I established the reasoning for why you'd want to do this early, and again we're talking about at most 12 months before they should start the move to 64-bit anyways. Apple tends to like its ISA transitions to be as quick and painless as possible, and moving early to ARMv8 makes a lot of sense in that light. Sure they are benefiting from the marketing benefits of having a feature that no one else does, but what company doesn't do that?

    I don't believe the move to 64-bit with Cyclone was driven first and foremost by marketing. Keep in mind that this architecture was designed when a bunch of certain ex-AMDers were over there too...

    Take care,
    Anand
  • BrooksT - Wednesday, September 18, 2013 - link

    Why would Anand write cross-platform benchmarks that have no connection to real world usage? Especially when you then complain that the 64 bit coverage isn't real world enough?
  • ddriver - Wednesday, September 18, 2013 - link

    For starters, putting the encryption results in their own graph, like every other review before that, and side to side comparison between geekbench ST/MT scores for A7 and competing v7 chips would be a good start toward a more objective and less biased article.

    And I know I am asking a lot, but an edit feature in the comment section is long overdue...
  • TheBretz - Wednesday, September 18, 2013 - link

    For what it's worth this is NOT a case of LITERALLY comparing "Apples" and "Oranges" - it is a case of comparing "Apple" and many other manufacturers, but there was no fruit involved in the comparison, only smarthphones and tablets.
  • ddriver - Wednesday, September 18, 2013 - link

    Apples to oranges is a figure of speech, it has nothing to do with the company apple... It concerns comparing incomparable objects which is the case of completely different JS implementations on iOS and Android.
  • Arbee - Wednesday, September 18, 2013 - link

    Please name any case when AT's benchmarks and reviews have been proven to be biased or inaccurate. There's a reason the writers at other sites consider AT the gold standard for solid technical commentary (Engadget, Gizmodo, and the Verge all regularly credit AT on technical stories). As far as bias, have you *heard* Brian cooing about practically wanting to marry the Nexus 5? ;-)

    I think what actually happened here is that apparently Apple engineers listen to the AT podcast, because aside from 802.11ac and the screen size the 5S is designed almost perfectly to AT's well-known and often-stated specifications. It hits all of Anand's chip architecture geekery hot buttons in a way that Samsung's mashups of off-the-shelf parts never will, and they used Brian's exact line "Bigger pixels means better pictures" in the presentation. And naturally, if someone gives you what you want, you're likely to be happy with it. This is why people have Amazon gift lists ;-)

    Krait's 128 bit SIMD definitely helps, but it won't match true v8 architecture designs. I've written commercially shipping ARM assembly, and there's a *lot* of cruft in the older ISA that v8 cleans right up. And it lets compilers generate *much* more favorable code. I'll be surprised if the next Snapdragons aren't at least 32-bit v8. Qualcomm has been pretty forward-looking aside from their refusal to cooperate with the open-source community (Freedreno FTW).

    As far as 64 bit on less than 4 GB of RAM, it enables applications to more freely operate on files in NAND without taking up huge amounts of RAM (via mmap(), which the Linux kernel in Android of course also has). Apps like Loopy HD and MultiTrack DAW (not to mention Apple's own iMovie and GarageBand) will definitely be able to take advantage.

Log in

Don't have an account? Sign up now