After Swift Comes Cyclone Oscar

I was fortunate enough to receive a tip last time that pointed me at some LLVM documentation calling out Apple’s Swift core by name. Scrubbing through those same docs, it seems like my leak has been plugged. Fortunately I came across a unique string looking at the iPhone 5s while it booted:

I can’t find any other references to Oscar online, in LLVM documentation or anywhere else of value. I also didn’t see Oscar references on prior iPhones, only on the 5s. I’d heard that this new core wasn’t called Swift, referencing just how different it was. Obviously Apple isn’t going to tell me what it’s called, so I’m going with Oscar unless someone tells me otherwise.

Oscar is a CPU core inside M7, Cyclone is the name of the Swift replacement.

Cyclone likely resembles a beefier Swift core (or at least a Swift-inspired one) more than a new design from the ground up. That means we’re likely talking about a 3-wide front end, and somewhere in the 5 - 7 range of execution ports. The design is likely also capable of out-of-order execution, given the performance levels we’ve been seeing.

Cyclone is a 64-bit ARMv8 core and not some Apple designed ISA. Cyclone manages to not only beat all other smartphone makers to ARMv8 but also key ARM server partners. I’ll talk about the whole 64-bit aspect of this next, but needless to say, this is a big deal.

The move to ARMv8 comes with some of its own performance enhancements. More registers, a cleaner ISA, improved SIMD extensions/performance as well as cryptographic acceleration are all on the menu for the new core.

Pipeline depth likely remains similar (maybe slightly longer) as frequencies haven’t gone up at all (1.3GHz). The A7 doesn’t feature support for any thermal driven CPU (or GPU) frequency boost.

The most visible change to Apple’s first ARMv8 core is a doubling of the L1 cache size: from 32KB/32KB (instruction/data) to 64KB/64KB. Along with this larger L1 cache comes an increase in access latency (from 2 clocks to 3 clocks from what I can tell), but the increase in hit rate likely makes up for the added latency. Such large L1 caches are quite common with AMD architectures, but unheard of in ultra mobile cores. A larger L1 cache will do a good job keeping the machine fed, implying a larger/more capable core.
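The latency-versus-hit-rate trade-off can be made concrete with the textbook average access time formula (hit time plus miss rate times miss penalty). The sketch below is only an illustration: the 2 and 3 clock hit latencies are the estimates quoted above, while the miss rates and the 20 clock miss penalty are hypothetical placeholders, not measurements of Swift or Cyclone. It also ignores the latency hiding an out-of-order core can do.

```python
def average_access_clocks(hit_clocks, miss_rate, miss_penalty_clocks):
    """Average L1 access time in clocks: hit time plus expected miss cost."""
    return hit_clocks + miss_rate * miss_penalty_clocks


def break_even_miss_rate_drop(extra_hit_clocks, miss_penalty_clocks):
    """Miss-rate reduction (as a fraction) needed to pay for extra hit latency."""
    return extra_hit_clocks / miss_penalty_clocks


# Hypothetical values: only the 2 -> 3 clock hit latency comes from the article.
MISS_PENALTY_CLOCKS = 20   # assumed cost of going to L2
OLD_MISS_RATE = 0.10       # assumed miss rate for a 32KB L1
NEW_MISS_RATE = 0.04       # assumed miss rate for a 64KB L1

old_l1 = average_access_clocks(2, OLD_MISS_RATE, MISS_PENALTY_CLOCKS)
new_l1 = average_access_clocks(3, NEW_MISS_RATE, MISS_PENALTY_CLOCKS)
needed = break_even_miss_rate_drop(1, MISS_PENALTY_CLOCKS)

print(f"32KB L1 average access: {old_l1:.2f} clocks")
print(f"64KB L1 average access: {new_l1:.2f} clocks")
print(f"Miss rate must fall by at least {needed:.1%} to break even")
```

Under this simple model the extra clock of hit latency only pays off if the miss rate falls by more than one clock divided by the miss penalty, so the larger the cost of a miss, the easier it is for a bigger L1 to come out ahead.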

The L2 cache remains unchanged in size at 1MB shared between both CPU cores. L2 access latency is improved tremendously with the new architecture. In some cases I measured L2 latency at half of what I saw with Swift.

The A7’s memory controller sees big improvements as well. I measured 20% lower main memory latency on the A7 compared to the A6. Branch prediction and memory prefetchers are both significantly better on the A7.

I noticed large increases in peak memory bandwidth on top of all of this. I used a combination of custom tools as well as publicly available benchmarks to confirm all of this. A quick look at Geekbench 3 (prior to the ARMv8 patch) gives a conservative estimate of memory bandwidth improvements:

Geekbench 3.0.0 Memory Bandwidth Comparison (1 thread)
                 Stream Copy  Stream Scale  Stream Add  Stream Triad
Apple A7 1.3GHz  5.24 GB/s    5.21 GB/s     5.74 GB/s   5.71 GB/s
Apple A6 1.3GHz  4.93 GB/s    3.77 GB/s     3.63 GB/s   3.62 GB/s
A7 Advantage     6%           38%           58%         57%

We see anywhere from a 6% improvement in memory bandwidth to nearly 60% running the same Stream code. I’m not entirely sure how Geekbench implemented Stream and whether or not we’re actually testing other execution paths in addition to (or instead of) memory bandwidth. One custom piece of code I used to measure memory bandwidth showed nearly a 2x increase in peak bandwidth. That may be overstating things a bit, but needless to say this new architecture has a vastly improved cache and memory interface.
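For reference, the four Stream tests are conventionally defined as simple array kernels: Copy, Scale, Add and Triad. The sketch below follows those conventional definitions and the usual way of counting bytes moved (two arrays touched for Copy and Scale, three for Add and Triad). It is not Geekbench's implementation, which isn't documented here, and the function name `run_stream` and its parameters are made up for illustration. In pure Python the interpreter overhead dominates, so the absolute GB/s figures say nothing about the hardware; the point is only what each test computes and how a bandwidth number is derived from it.

```python
import time
from array import array


def run_stream(n=200_000, scalar=3.0):
    """Run the four conventional STREAM kernels once over n doubles.

    Returns (bandwidth, a, b, c), where bandwidth maps kernel name to GB/s.
    """
    a = array("d", [1.0]) * n
    b = array("d", [2.0]) * n
    c = array("d", [0.0]) * n
    word = a.itemsize  # bytes per double, normally 8

    def copy():
        for i in range(n):
            c[i] = a[i]

    def scale():
        for i in range(n):
            b[i] = scalar * c[i]

    def add():
        for i in range(n):
            c[i] = a[i] + b[i]

    def triad():
        for i in range(n):
            a[i] = b[i] + scalar * c[i]

    # (kernel, number of arrays read or written per element)
    kernels = [(copy, 2), (scale, 2), (add, 3), (triad, 3)]
    bandwidth = {}
    for kernel, arrays_touched in kernels:
        start = time.perf_counter()
        kernel()
        elapsed = max(time.perf_counter() - start, 1e-9)
        bandwidth[kernel.__name__] = arrays_touched * n * word / elapsed / 1e9
    return bandwidth, a, b, c


if __name__ == "__main__":
    for name, gb_s in run_stream()[0].items():
        print(f"{name:>5}: {gb_s:.3f} GB/s")
```

Note that Scale, Add and Triad each do floating point work on top of the loads and stores, which is one reason a Stream result can reflect more than raw memory bandwidth.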

Looking at low level Geekbench 3 results (again, prior to the ARMv8 patch), we get a good feel for just how much the CPU cores have improved.

Geekbench 3.0.0 Compute Performance
                 Integer (ST)  Integer (MT)  FP (ST)  FP (MT)
Apple A7 1.3GHz  1065          2095          983      1955
Apple A6 1.3GHz  750           1472          588      1165
A7 Advantage     42%           42%           67%      67%

Integer performance is up 42% on average, while floating point performance is up by 67%. Again this is without 64-bit or any other enhancements that go along with ARMv8. Memory bandwidth improves by 35% across all Geekbench tests. I confirmed with Apple that the A7 has a 64-bit wide memory interface, and we're likely talking about LPDDR3 memory this time around so there's probably some frequency uplift there as well.
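To put the interface width in context, theoretical peak bandwidth is just the bus width in bytes multiplied by the transfer rate. The sketch below shows the arithmetic for a 64-bit interface. The 1600 MT/s transfer rate is a hypothetical placeholder chosen for illustration, since the A7's actual memory data rate isn't stated here, and `peak_bandwidth_gb_s` is a made-up helper name.

```python
def peak_bandwidth_gb_s(bus_width_bits, transfers_per_second):
    """Theoretical peak bandwidth in GB/s for a given bus width and data rate."""
    bytes_per_transfer = bus_width_bits / 8
    return bytes_per_transfer * transfers_per_second / 1e9


# 64-bit width is from the article; the data rate is an assumed example value.
ASSUMED_TRANSFERS_PER_SECOND = 1600e6
peak = peak_bandwidth_gb_s(64, ASSUMED_TRANSFERS_PER_SECOND)
print(f"Theoretical peak at the assumed data rate: {peak:.1f} GB/s")
```

Measured single-threaded Stream results will always land well below a theoretical peak like this, so the two figures aren't directly comparable.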

The result is something Apple refers to as desktop-class CPU performance. I’ll get to evaluating those claims in a moment, but first, let’s talk about the other big part of the A7 story: the move to a 64-bit ISA.

464 Comments

  • ddriver - Wednesday, September 18, 2013 - link

    I mean, only a true apple fanboy is capable of disregarding all that technical argumentation because of the mention of the term "apple fanboys". A drowning man will hold onto a straw :)
  • akdj - Thursday, September 19, 2013 - link

    You consider your comment 'technical argumentation'? It's not....it's your 'opinion'. I think you can rest assured Anand's site is geared much more to those of us interested in technology and less interested in being a 'fanboy'. In fact....so far reading through the comments, you're the first to bring that silly cliché up, "Fan Boy".
    A drowning man will hold on to anything to help save himself :)
  • Wilco1 - Wednesday, September 18, 2013 - link

    Good comment - I'm equally unimpressed by the comparison of a real phone with a Bay Trail tablet development board which has significantly higher TDP. And then calling it a win for Bay Trail based on a few rubbish JS benchmarks is even more ridiculous. These are not real CPU benchmarks but all about software optimization and tuning for the benchmark.

    Single threaded Geekbench 3 results show the A7 outperforming the 2.4GHz Bay Trail by 45%. That's despite the A7 running at only 54% of the frequency of Bay Trail! In short, A7 is 2.7 times faster than BT per clock and on par with or better than Haswell IPC...
  • tech4real - Wednesday, September 18, 2013 - link

    not trying to dismiss A7's cpu core, it's an amazing silicon and significantly steps up against A6, but is there a possibility that the geekbench3 is unfit to gauge average cross-ISA cross-OS cpu performance... To me, the likelihood of this is pretty high.
  • Wilco1 - Wednesday, September 18, 2013 - link

    Comparing different ISAs does indeed introduce inaccuracies due to compilers not being equal. Cross OS is less problematic as long as the benchmark doesn't use the OS a lot.

    It's a good idea to keep this in mind, but unfortunately there is little one can do about it. And other CPU benchmarks are not any better either, if you used SPEC then performance differences across different compilers are far larger than Geekbench (even on the same CPU the difference between 2 compilers can be 50%)...
  • Dooderoo - Wednesday, September 18, 2013 - link

    "The AES and SHA1 gains are a direct result of the new cryptographic instructions that are a part of ARMv8. The AES test in particular shows nearly an order of magnitude performance improvement".

    Your comment: "in reality the encryption workloads are handled in a fundamentally different way in the two modes [...] a mixed bad into one falsely advertising performance gains attributed to 64bit execution and not to the hardware implementations as it should"

    Maybe actually read the article?

    "The FP chart also shows no miracles, wider SIMD units result in almost 2x the score in few tests, nothing much in the rest"
    Exclude those tests and you're still looking at 30% improvement. 30% increase in performance from a recompile counts as "nothing much" in what world?
  • ddriver - Wednesday, September 18, 2013 - link

    My point was encryption results should not have been included in the chart and presented as "benefits of 64bit execution mode" because they aren't.

    Also those 30% can easily be attributed to other incremental upgrades to the chip, like faster memory subsystem, better prefetchers and whatnot. Not necessarily 64bit execution, I've been using HPC software for years and despite the fact x64 came with double the registers, I did not experience any significant increase in the workloads I use daily - 3D rendering, audio and video processing and multiphysics simulations. The sole benefit of 64bit I've seen professionally is due to the extra ram I can put into the machine, making tasks which require a lot of ram WAY FASTER, sometimes 10s even 100s times faster because of the avoided swapping.

    Furthermore, I will no longer address technically unsubstantiated comments, in order to avoid spamming all over the comment space.
  • Dooderoo - Wednesday, September 18, 2013 - link

    "Furthermore, I will no longer address technically unsubstantiated comments, in order to avoid spamming all over the comment space."
    Man, you give up too easily.

    Encryption results are exactly that: "benefits of 64bit execution mode". Why? 32-bit A32 doesn't have the instructions, 64-bit A64 does. Clear and obvious benefit.

    "30% can easily be attributed to other incremental upgrades to the chip". Wouldn't the 32-bit version benefit from those as well?

    I'm beginning to think you don't understand that those results are both from the A7 SOC, once run with A32 and once with A64.
  • ddriver - Wednesday, September 18, 2013 - link

    ""30% can easily be attributed to other incremental upgrades to the chip". Wouldn't the 32-bit version benefit from those as well?"

    This may be correct. Unless I am overlooking execution mode details, of which I am not aware, and I expect neither are you, unless you are an engineer who has worked on the A7 chip. I don't think that data is available yet to comment on it in detail.

    But you are not correct about encryption results, because it is a matter of extra hardware implementation. It is like comparing software rendering to hardware rendering, a CPU with hardware implementation of graphics will be immensely faster at a graphics workload, even if it is the same speed as the one that runs graphics in software. If anything, the architecture upgrades of the A7 chip can at best result in 2x peak theoretical performance improvement, while the AES test shows 8+x improvement. This is because the performance boost is not due to 64 bit mode execution, but due to the extra hardware implementation that is exclusively available in that mode.
  • Dooderoo - Wednesday, September 18, 2013 - link

    "I don't think that data is available yet to comment on it in detail."
    Yet you're ok with calling the article "cunningly deceitful"? Weird.
