After Swift Comes Cyclone

I was fortunate enough to receive a tip last time that pointed me at some LLVM documentation calling out Apple’s Swift core by name. Scrubbing through those same docs, it seems like my leak has been plugged. Fortunately I came across a unique string looking at the iPhone 5s while it booted:

I can’t find any other references to Oscar online, in LLVM documentation or anywhere else of value. I also didn’t see Oscar references on prior iPhones, only on the 5s. I’d heard that this new core wasn’t called Swift, referencing just how different it was. Obviously Apple isn’t going to tell me what it’s called, so I’m going with Oscar unless someone tells me otherwise.

Oscar is a CPU core inside M7; Cyclone is the name of the Swift replacement.

Cyclone likely resembles a beefier Swift core (or at least a Swift-inspired one) more than a new design from the ground up. That means we’re likely talking about a 3-wide front end, and somewhere in the range of 5 to 7 execution ports. The design is likely also capable of out-of-order execution, given the performance levels we’ve been seeing.

Cyclone is a 64-bit ARMv8 core and not some Apple-designed ISA. With Cyclone, Apple beats not only every other smartphone maker to ARMv8 but also key ARM server partners. I’ll talk about the whole 64-bit aspect of this next, but needless to say, this is a big deal.

The move to ARMv8 comes with some of its own performance enhancements. More registers, a cleaner ISA, improved SIMD extensions/performance as well as cryptographic acceleration are all on the menu for the new core.

Pipeline depth likely remains similar (maybe slightly longer), as frequencies haven’t gone up at all (still 1.3GHz). The A7 doesn’t feature support for any thermally driven CPU (or GPU) frequency boost.

The most visible change to Apple’s first ARMv8 core is a doubling of the L1 cache size: from 32KB/32KB (instruction/data) to 64KB/64KB. Along with this larger L1 cache comes an increase in access latency (from 2 clocks to 3 clocks from what I can tell), but the increase in hit rate likely makes up for the added latency. Such large L1 caches are quite common with AMD architectures, but unheard of in ultra mobile cores. A larger L1 cache will do a good job keeping the machine fed, implying a larger/more capable core.
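
A back-of-the-envelope way to weigh the extra cycle against the better hit rate is the textbook average memory access time formula: hit time plus miss rate times miss penalty. The sketch below uses the 2 and 3 clock L1 latencies mentioned above; the 20-cycle miss penalty and the miss rates are made-up illustrative values, not measurements, and the model ignores how much latency an out-of-order core can hide.

```python
def amat(hit_cycles, miss_rate, miss_penalty_cycles):
    """Average memory access time in cycles (textbook model)."""
    return hit_cycles + miss_rate * miss_penalty_cycles


def break_even_miss_rate_drop(old_hit_cycles, new_hit_cycles, miss_penalty_cycles):
    """How far the miss rate must fall for the slower, larger cache to break even."""
    return (new_hit_cycles - old_hit_cycles) / miss_penalty_cycles


# ASSUMPTION: a 20-cycle penalty for an L1 miss that hits in L2 (hypothetical).
MISS_PENALTY = 20

# ASSUMPTION: hypothetical miss rates for the 32KB and 64KB caches.
old_amat = amat(hit_cycles=2, miss_rate=0.10, miss_penalty_cycles=MISS_PENALTY)
new_amat = amat(hit_cycles=3, miss_rate=0.05, miss_penalty_cycles=MISS_PENALTY)

needed_drop = break_even_miss_rate_drop(2, 3, MISS_PENALTY)
print(f"32KB L1: {old_amat:.2f} cycles, 64KB L1: {new_amat:.2f} cycles")
print(f"Break-even miss rate drop: {needed_drop:.1%}")
```

Under this simple model the larger cache only pays off if the miss rate drops by more than one divided by the miss penalty, so the more expensive a miss is, the easier the extra cycle is to justify.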

The L2 cache remains unchanged in size at 1MB shared between both CPU cores. L2 access latency is improved tremendously with the new architecture. In some cases I measured L2 latency that was half of what I saw with Swift.

The A7’s memory controller sees big improvements as well. I measured 20% lower main memory latency on the A7 compared to the A6. Branch prediction and memory prefetchers are both significantly better on the A7.

I noticed large increases in peak memory bandwidth on top of all of this. I used a combination of custom tools as well as publicly available benchmarks to confirm all of this. A quick look at Geekbench 3 (prior to the ARMv8 patch) gives a conservative estimate of memory bandwidth improvements:

Geekbench 3.0.0 Memory Bandwidth Comparison (1 thread)
                  Stream Copy   Stream Scale   Stream Add   Stream Triad
Apple A7 1.3GHz   5.24 GB/s     5.21 GB/s      5.74 GB/s    5.71 GB/s
Apple A6 1.3GHz   4.93 GB/s     3.77 GB/s      3.63 GB/s    3.62 GB/s
A7 Advantage      6%            38%            58%          57%

We see anywhere from a 6% improvement in memory bandwidth to nearly 60% running the same Stream code. I’m not entirely sure how Geekbench implemented Stream and whether or not we’re actually testing other execution paths in addition to (or instead of) memory bandwidth. One custom piece of code I used to measure memory bandwidth showed nearly a 2x increase in peak bandwidth. That may be overstating things a bit, but needless to say this new architecture has a vastly improved cache and memory interface.
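
For reference, the four kernels in the standard STREAM benchmark differ in how many arrays they touch and how much arithmetic they do, which is one reason the gains could vary from test to test. The sketch below writes them out in plain Python using the usual STREAM definitions and byte-counting convention; it is not Geekbench's implementation (which I have not seen), the function names are illustrative, and pure Python is far too slow to measure real memory bandwidth.

```python
BYTES_PER_DOUBLE = 8

# The four kernels as defined by the standard STREAM benchmark.
def stream_copy(a):
    return list(a)                               # c[i] = a[i]

def stream_scale(c, q):
    return [q * x for x in c]                    # b[i] = q * c[i]

def stream_add(a, b):
    return [x + y for x, y in zip(a, b)]         # c[i] = a[i] + b[i]

def stream_triad(b, c, q):
    return [x + q * y for x, y in zip(b, c)]     # a[i] = b[i] + q * c[i]

# (arrays read or written per element, floating point ops per element)
KERNEL_COST = {
    "Copy": (2, 0),
    "Scale": (2, 1),
    "Add": (3, 1),
    "Triad": (3, 2),
}

def bandwidth_gb_per_s(kernel, n_elements, seconds):
    """Bandwidth using STREAM's convention of counting every array touched."""
    arrays_touched, _flops = KERNEL_COST[kernel]
    bytes_moved = arrays_touched * BYTES_PER_DOUBLE * n_elements
    return bytes_moved / seconds / 1e9
```

Copy involves no arithmetic at all, while Scale, Add and Triad also exercise the floating point units, so a result that improves much more on the latter three may reflect more than the memory interface alone.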

Looking at low-level Geekbench 3 results (again, prior to the ARMv8 patch), we get a good feel for just how much the CPU cores have improved.

Geekbench 3.0.0 Compute Performance
                  Integer (ST)   Integer (MT)   FP (ST)   FP (MT)
Apple A7 1.3GHz   1065           2095           983       1955
Apple A6 1.3GHz   750            1472           588       1165
A7 Advantage      42%            42%            67%       67%
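
The advantage row can be reproduced from the raw scores. The sketch below assumes the percentages are rounded down rather than rounded to nearest (an assumption, chosen because it matches every figure in the table); the names are illustrative.

```python
# Geekbench 3.0.0 scores copied from the table above: (A7, A6).
SCORES = {
    "Integer (ST)": (1065, 750),
    "Integer (MT)": (2095, 1472),
    "FP (ST)": (983, 588),
    "FP (MT)": (1955, 1165),
}


def advantage_percent(new, old):
    """Percent gain of new over old, rounded down (assumed convention)."""
    return (new - old) * 100 // old


advantages = {
    name: advantage_percent(a7, a6) for name, (a7, a6) in SCORES.items()
}

for name, pct in advantages.items():
    print(f"{name}: {pct}%")
```

Integer arithmetic is used on purpose: 1065 / 750 is exactly 1.42, and a floating point version could land a hair under 42 and round down to 41.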

Integer performance is up 44% on average, while floating point performance is up by 67%. Again this is without 64-bit or any other enhancements that go along with ARMv8. Memory bandwidth improves by 35% across all Geekbench tests. I confirmed with Apple that the A7 has a 64-bit wide memory interface, and we're likely talking about LPDDR3 memory this time around so there's probably some frequency uplift there as well.
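
To put the measured numbers in context, the theoretical peak of a memory interface is just its width in bytes times its transfer rate. The sketch below uses the confirmed 64-bit width; the transfer rates are hypothetical examples only, since the actual memory speed is not confirmed here.

```python
def peak_bandwidth_gb_per_s(bus_width_bits, mega_transfers_per_s):
    """Theoretical peak bandwidth: bytes per transfer times transfers per second."""
    bytes_per_transfer = bus_width_bits / 8
    return bytes_per_transfer * mega_transfers_per_s * 1e6 / 1e9


BUS_WIDTH_BITS = 64  # confirmed width of the A7 memory interface

# ASSUMPTION: these data rates are illustrative, not confirmed A7 figures.
for rate in (1066, 1600):
    peak = peak_bandwidth_gb_per_s(BUS_WIDTH_BITS, rate)
    print(f"64-bit interface at {rate} MT/s: {peak:.1f} GB/s peak")
```

Single-threaded results in the 5 to 6 GB/s range sit well below either hypothetical peak, which is normal: one thread rarely saturates a memory interface.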

The result is something Apple refers to as desktop-class CPU performance. I’ll get to evaluating those claims in a moment, but first, let’s talk about the other big part of the A7 story: the move to a 64-bit ISA.


  • Wilco1 - Thursday, September 19, 2013 - link

    The Geekbench results are indeed skewed by AES encryption. The author claimed AES was the only benchmark where they use hardware acceleration when available. There has been a debate on fixing the weighting or to place hardware accelerated benchmarks in a separate category to avoid skewing the results. So I'm hoping a future version will fix this.

    As for cross-platform benchmarking, Geekbench currently uses the default platform compiler (LLVM on iOS, GCC on Android, VC++ on Windows). So there will be compiler differences that skew results slightly. However this is also what you'd get if you built the same application for iOS and Android.
  • smartypnt4 - Thursday, September 19, 2013 - link

    A lot of the other stuff in Geekbench seems to be fairly representative, though. Except a few of the FP ones like the blur and sharpen tests...

    It surely can't be hard to have Geekbench omit those results. I think if they did that, you'd see that the A7 is roughly 50-60% faster than the A6 instead of 100% faster, but I'm not sure there. I'd have to go and do work to figure that out. Which is annoying :-)
  • name99 - Wednesday, September 18, 2013 - link

    I'd agree with the tweaks you suggest: (improved memory controller and prefetcher, doubling of L2, larger branch predictor tables).

    There is also scope for a wider CPU. Obviously the most simple-minded widening of a CPU substantially increases power, but there are ways to limit the extra power without compromising performance too much, if you are willing to spend the transistors. I think Apple is not just willing to spend the transistors, but will have them available to spend once they ditch 32-bit compatibility. At that point they can add a fourth decoder, use POWER style blocking of instructions to reduce retirement costs, and add whatever extra pipes make sense.
    The most useful improvement (in my experience) would be to up the L1 from being able to handle one load+store per cycle to two loads + one store per cycle, but I don't know what the power cost of that is --- may be too high.

    On the topic of minor tweaks, do we know what the page size used by iOS is? If they go from 4K to 16K and/or add support for large pages, they could get a 10% or so speed boost just from better TLB coverage.
    (And what's Android's story on this front? Do they stick with standard 4K pages, or do they utilize 16 or 64K pages and/or large pages?)
  • extide - Wednesday, September 18, 2013 - link

    Those are some pretty generous numbers you pulled out of your hat there. It's not as easy as just do this and that and bam, you have something to compete with Intel Core series stuff. No. I mean yeah, Apple has done a great job here and I wish someone else was making CPUs like this for the Android phones but oh well.
  • name99 - Wednesday, September 18, 2013 - link

    "Now, I will agree that this does prove that if Apple really wanted to, they could build something to compete with Haswell in terms of raw throughput."

    I agree with your point, but I think we should consider what an astonishing statement this is.
    Two years ago Apple wasn't selling its own CPU. They burst onto the scene and with their SECOND device they're at an IPC and a performance/watt that equals Intel! Equals THE competitor in this space, the guys who are using the best process on earth.

    If you don't consider that astonishing, you don't understand what has happened here.

    (And once again I'd make my pitch that THIS shows what Intel's fatal flaw is. The problem with x86 is not that it adds area to a design, or that it slows it down --- though it does both. The problem is that it makes design so damn complex that you're constantly lagging; and you're terrified of making large changes because you might screw up.
    Apple, saddled with only the much smaller ARM overhead, has been vastly more nimble than Intel.
    And it's only going to get worse if, as I expect, Apple ditches 32-bit ARM as soon as they can, in two years or so, giving them an even easier design target...)

    What's next for Apple?
    At the circuit level, I expect them to work hard to make their CPU as good at turboing as Intel. (Anand talked about this.)
    At the ISA level, I expect their next major target to be some form of hardware transactional memory --- it just makes life so much easier, and, even though they're at two cores today, they know as well as anyone that the future is more cores. You don't have to do TM the way Intel has done it; the solution IBM used for POWER8 is probably a better fit for ARM. And of course if Apple do this (using their own extensions, because as far as I know ARM doesn't yet even have a TM spec) it's just one more way in which they differentiate their world from the commodity ARM world.
  • smartypnt4 - Wednesday, September 18, 2013 - link

    @extide: agreed.

    @name99: It is very astonishing indeed. Then again, a high profile company like Apple has no problem attracting some of the best talent via compensation and prestige.

    They've still got quite a long way to go to match Haswell, in any case. But the throughput is technically there to rival Intel if they wanted to. I would hope that Haswell contains a much more advanced branch predictor and prefetcher than what Apple has, but you never know. My computer architecture professor always said that everything in computer architecture has already been discovered. The question now is when it will be advantageous to spend the transistors to implement the most complicated designs.

    The next year is going to be very interesting, indeed.
  • Bob Todd - Wednesday, September 18, 2013 - link

    How many crows did you stuff down after claiming BT would be slower than A15 and even A12? Remember posting this about integer performance?

    "Silverthorne < A7 < A9 < A9R4 < Silvermont < A12 < Bobcat < A15 < Jaguar"

    Apple's A7 looks great, but you've made so many utterly ridiculous Intel performance bashing posts that it's pretty much impossible to take anything you say seriously.
  • Wilco1 - Wednesday, September 18, 2013 - link

    BT has indeed far lower IPC than A15 just like I posted - pretty much all benchmark results confirm that. On Geekbench 3 A15 is 23-25% faster clock for clock on integer and FP.

    The jury is still out on A12 vs BT as we've seen no performance results for A12 so far. So claiming I was wrong is not only premature but also incorrect as the fact is that Bay Trail is slower.
  • Wilco1 - Wednesday, September 18, 2013 - link

    Also new version with A7 and A57 now looks like this:

    Silverthorne < A7 < A9 < A9R4 < Silvermont < A12 < Bobcat < A15 < Jaguar < A57 < Apple A7
  • Bob Todd - Wednesday, September 18, 2013 - link

    Cherry picking a single benchmark which is notoriously inaccurate at comparisons across platforms/architectures doesn't make you "right", it just makes you look like more of a troll. Bay Trail has better integer performance than Jaguar (at near identical base clocks), so by your own ranking above it *has* to be faster than A12 and A15.

    You show up in every ARM article spouting the same drivel over and over again, yet you were mysteriously absent in the Bay Trail performance preview. Here's the link if you want to try to find a way to spin more FUD.

    Apple's A7 looks great, and IT is still the powerhouse of mobile graphics. The A7 version in the iPad should be a beast. None of that makes most of your comments any less loony.
