The A12 Vortex CPU µarch

Before we talk about the Vortex microarchitecture itself, we first need to look at exactly what kind of frequencies we’re seeing on Apple’s new SoC. Over the last few generations Apple has been steadily raising the frequencies of its big cores, all while also raising the microarchitecture’s IPC. I did a quick test of the frequency behaviour of the A12 versus the A11, and came up with the following table:

Maximum Frequency vs Loaded Threads (Per-Core Maximum MHz)

Loaded threads       1      2      3      4      5      6

Apple A11
Big 1             2380   2325   2083   2083   2083   2083
Big 2                -   2325   2083   2083   2083   2083
Little 1             -      -   1694   1587   1587   1587
Little 2             -      -      -   1587   1587   1587
Little 3             -      -      -      -   1587   1587
Little 4             -      -      -      -      -   1587

Apple A12
Big 1             2500   2380   2380   2380   2380   2380
Big 2                -   2380   2380   2380   2380   2380
Little 1             -      -   1587   1562   1562   1538
Little 2             -      -      -   1562   1562   1538
Little 3             -      -      -      -   1562   1538
Little 4             -      -      -      -      -   1538

Both the A11’s and the A12’s maximum frequencies are actually single-thread boost clocks – 2380MHz for the A11’s Monsoon cores and 2500MHz for the new Vortex cores in the A12. That amounts to just a 5% frequency boost in single-threaded applications. When a second big thread is added, both the A11 and A12 clock down, to 2325MHz and 2380MHz respectively. It’s when threads are also running concurrently on the small cores that the two SoCs diverge: while the A11 clocks down further to 2083MHz, the A12 retains the same 2380MHz until it hits thermal limits and eventually throttles down.

On the small core side of things, the new Tempest cores are actually clocked more conservatively than their Mistral predecessors. On the A11, a single active small core would boost up to 1694MHz; this behaviour is gone on the A12, where the maximum clock is 1587MHz. The frequency drops slightly further, down to 1538MHz, when all four small cores are fully loaded.
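As an aside, per-core frequency figures like these can be approximated without vendor tooling by timing a long chain of dependent single-cycle ALU operations on each loaded core. Below is a minimal sketch of that idea (it is not the methodology used for the table above; the AArch64 inline assembly and the iteration count are purely illustrative):

```c
#include <stdint.h>
#include <stdio.h>
#include <time.h>

/* Estimate the effective core clock by timing a chain of dependent adds.
 * Each add depends on the previous result, so the chain retires at roughly
 * one instruction per cycle regardless of how wide the core is (AArch64-only). */
int main(void)
{
    const uint64_t iters = 400000000ULL;  /* arbitrary; long enough to hide timer overhead */
    uint64_t x = 0;

    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (uint64_t i = 0; i < iters; i++)
        __asm__ volatile("add %0, %0, #1" : "+r"(x));  /* serially dependent add */
    clock_gettime(CLOCK_MONOTONIC, &t1);

    double secs = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
    /* One dependent add per cycle, so iterations per second approximates the clock. */
    printf("x=%llu, estimated clock: %.0f MHz\n",
           (unsigned long long)x, iters / secs / 1e6);
    return 0;
}
```

Running one such loop per loaded thread and recording each core's result is one way of building up the kind of per-core maximum table shown above.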

Much improved memory latency

As mentioned on the previous page, it’s evident that Apple has put a significant amount of work into the cache hierarchy as well as the memory subsystem of the A12. Going back to a linear latency graph, we see the following behaviour for full random latencies, for both the big and small cores:

The Vortex cores have only a 5% frequency boost over the Monsoon cores, yet the absolute L2 memory latency has improved by 29%, from ~11.5ns down to ~8.8ns, meaning the new Vortex cores’ L2 cache now completes its operations in significantly fewer cycles. On the Tempest side, the L2 cycle latency seems to have remained the same, but again there’s been a large change in terms of the L2 partitioning and power management, allowing access to a larger chunk of the physical L2.
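For reference, converting those figures into cycle counts at each core’s peak clock (taking the rounded numbers above at face value) works out to roughly 11.5ns × 2.38GHz ≈ 27 cycles for the Monsoon L2, versus 8.8ns × 2.5GHz = 22 cycles for the Vortex L2.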

My test only went to a depth of 64MB, and it’s evident that the latency curves don’t flatten out yet in this data set, but it is visible that latency to DRAM has seen some improvements. The larger improvement in DRAM access latency on the Tempest cores could be explained by a raising of the maximum memory controller DVFS frequency when just the small cores are active – their performance will also look better when there’s a big thread running on the big cores.
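For readers unfamiliar with the methodology: “full random” latency is typically measured by chasing pointers through a buffer in a randomised order, so that every load depends on the previous one and the access pattern defeats the prefetchers. The following is a minimal sketch of such a test (an illustration of the general technique, not the actual tool used for these measurements; the 64MB default buffer and the access count are arbitrary):

```c
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define LINE 64  /* one pointer per cache line */

/* Minimal full-random pointer-chase latency test: link cache-line-sized
 * nodes into one randomly-ordered cycle, then walk it. Every load depends
 * on the previous one, so elapsed time / accesses ~= load-to-use latency. */
int main(int argc, char **argv)
{
    size_t bytes = (argc > 1) ? strtoull(argv[1], NULL, 0) : (64u << 20);
    size_t n = bytes / LINE;

    char *buf = aligned_alloc(LINE, n * LINE);
    size_t *order = malloc(n * sizeof *order);
    if (!buf || !order) return 1;

    for (size_t i = 0; i < n; i++) order[i] = i;
    srand(1);
    for (size_t i = n - 1; i > 0; i--) {          /* Fisher-Yates shuffle */
        size_t j = (size_t)rand() % (i + 1);      /* rand() is crude, but fine for a sketch */
        size_t t = order[i]; order[i] = order[j]; order[j] = t;
    }
    for (size_t i = 0; i < n; i++)                /* link the shuffled nodes into a cycle */
        *(void **)(buf + order[i] * LINE) = buf + order[(i + 1) % n] * LINE;

    const size_t accesses = 20000000;             /* arbitrary */
    void *p = buf + order[0] * LINE;

    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (size_t i = 0; i < accesses; i++)
        p = *(void **)p;                          /* serially dependent loads */
    clock_gettime(CLOCK_MONOTONIC, &t1);

    double ns = (t1.tv_sec - t0.tv_sec) * 1e9 + (double)(t1.tv_nsec - t0.tv_nsec);
    printf("%zu MB buffer: %.2f ns per load (end %p)\n", bytes >> 20, ns / accesses, p);
    return 0;
}
```

Sweeping the buffer size from a few KB up to the 64MB test depth mentioned above is what produces latency-versus-depth curves of the kind discussed here.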

The system cache of the A12 has seen some dramatic changes in its behaviour. While bandwidth in this part of the cache hierarchy has seen a reduction compared to the A11, latency has been much improved. One significant effect here can be attributed either to the L2 prefetchers or, which I also see as a possibility, to prefetchers on the system cache side: both the latency performance and the number of streaming prefetchers have gone up.

Instruction throughput and latency

Backend Execution Throughput and Latency

                                 Cortex-A75        Cortex-A76        Exynos-M3           Monsoon | Vortex
                                 Exec      Lat     Exec      Lat     Exec        Lat     Exec       Lat
Integer Arithmetic (ADD)         2         1       3         1       4           1       6          1
Integer Multiply 32b (MUL)       1         3       1         2       2           3       2          4
Integer Multiply 64b (MUL)       1         3       1         2       1 (2x 0.5)  4       2          4
Integer Division 32b (SDIV)      0.25      12      0.2       <12     1/12 - 1    <12     0.2        10 | 8
Integer Division 64b (SDIV)      0.25      12      0.2       <12     1/21 - 1    <21     0.2        10 | 8
Move (MOV)                       2         1       3         1       3           1       3          1
Shift ops (LSL)                  2         1       3         1       3           1       6          1
Load instructions                2         4       2         4       2           4       2          -
Store instructions               2         1       2         1       1           1       2          -
FP Arithmetic (FADD)             2         3       2         2       3           2       3          3
FP Multiply (FMUL)               2         3       2         3       3           4       3          4
Multiply Accumulate (MLA)        2         5       2         4       3           4       3          4
FP Division (S-form)             0.2-0.33  6-10    0.66      7       >0.16       12      0.5 | 1    10 | 8
FP Load                          2         5       2         5       2           5       -          -
FP Store                         2         1-N     2         2       2           1       -          -
Vector Arithmetic                2         3       2         2       3           1       3          2
Vector Multiply                  1         4       1         4       1           3       3          3
Vector Multiply Accumulate       1         4       1         4       1           3       3          3
Vector FP Arithmetic             2         3       2         2       3           2       3          3
Vector FP Multiply               2         3       2         3       1           3       3          4
Vector Chained MAC (VMLA)        2         6       2         5       3           5       3          3
Vector FP Fused MAC (VFMA)       2         5       2         4       3           4       3          3

(Exec = sustained throughput in instructions per cycle; Lat = execution latency in cycles; "-" = not measured. In the Monsoon | Vortex column, values separated by "|" list the A11 and A12 figures where they differ.)

To compare the backend characteristics of Vortex, we’ve tested instruction throughput and latency. Throughput is determined by the number of execution units available for a given operation, while latency is dictated by the quality of their design.
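For context on how numbers like these are generally obtained: throughput is measured by timing a stream of independent instances of an instruction, latency by timing a serially dependent chain of them, and the nanoseconds per instruction are then converted into cycles using the core clock. Below is a minimal AArch64 sketch of that idea for the 64-bit integer multiply case (purely illustrative; this is not the harness used to generate the table above, and the iteration count is arbitrary):

```c
#include <stdint.h>
#include <stdio.h>
#include <time.h>

/* Time a measurement kernel and return nanoseconds per loop iteration. */
static double ns_per_iter(void (*kernel)(uint64_t), uint64_t iters)
{
    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    kernel(iters);
    clock_gettime(CLOCK_MONOTONIC, &t1);
    return ((t1.tv_sec - t0.tv_sec) * 1e9 + (t1.tv_nsec - t0.tv_nsec)) / (double)iters;
}

/* Latency kernel: every MUL depends on the previous result, so the chain
 * advances at one multiply per 'latency' cycles. Four MULs per iteration. */
static void mul_latency(uint64_t iters)
{
    uint64_t x = 3;
    for (uint64_t i = 0; i < iters; i++)
        __asm__ volatile(
            "mul %0, %0, %0\n\t"
            "mul %0, %0, %0\n\t"
            "mul %0, %0, %0\n\t"
            "mul %0, %0, %0\n\t"
            : "+r"(x));
}

/* Throughput kernel: the four MULs are independent of each other, so a wide
 * core can overlap them, limited only by its multiply-capable ports. */
static void mul_throughput(uint64_t iters)
{
    uint64_t a = 3, b = 5, c = 7, d = 9;
    for (uint64_t i = 0; i < iters; i++)
        __asm__ volatile(
            "mul %0, %0, %0\n\t"
            "mul %1, %1, %1\n\t"
            "mul %2, %2, %2\n\t"
            "mul %3, %3, %3\n\t"
            : "+r"(a), "+r"(b), "+r"(c), "+r"(d));
}

int main(void)
{
    const uint64_t iters = 100000000ULL;  /* arbitrary */
    double lat  = ns_per_iter(mul_latency,    iters) / 4.0;  /* ns per dependent MUL */
    double tput = ns_per_iter(mul_throughput, iters) / 4.0;  /* ns per independent MUL */
    printf("64b MUL: %.3f ns dependent (latency), %.3f ns independent (1/throughput)\n",
           lat, tput);
    return 0;
}
```

At the A12’s 2.5GHz peak clock, multiplying the reported nanoseconds per instruction by 2.5 gives cycles per instruction; the reciprocal of the throughput figure then gives the instructions-per-cycle numbers of the kind listed in the table.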

The Vortex core looks pretty much the same as its predecessor Monsoon (A11) – with the exception that we’re seemingly looking at new division units, as the execution latency has been shaved by 2 cycles on both the integer and FP side. On the FP side, division throughput has also doubled.

Monsoon (A11) was a major microarchitectural update in terms of the mid-core and backend. It’s there that Apple shifted the microarchitecture from the 6-wide decode of Hurricane (A10) to a 7-wide decode. The most significant change in the backend was the addition of two integer ALU units, upping the count from 4 to 6.

Monsoon (A11) and Vortex (A12) are extremely wide machines: with six integer execution pipelines (two of which are complex units), two load/store units, two branch ports, and three FP/vector pipelines, this gives an estimated 13 execution ports – far wider than Arm’s upcoming Cortex-A76, and also wider than Samsung’s M3. In fact, assuming we’re not looking at an atypical shared-port situation, Apple’s microarchitecture seems to far surpass anything else in terms of width, including desktop CPUs.

Comments

  • name99 - Saturday, October 6, 2018

    If you're going to count like that, you need to throw in at least 7 Chinook cores (small 64-bit Apple-designed cores that act as controllers for various large blocks like the GPU or NPU).
    [A Chinook is a type of non-Vortex wind, just like a Zephyr, Tempest, or Mistral...]

    There's nothing that screams their existence on the die shots, but what little we know about them has been established by looking at the OS binaries for the new iPhones. Presumably if they really are minimal and require little to no L2 and smaller L1s (ie regular memory structures that are more easily visible), they could look like pretty much any of that vast sea of unexplained territory on the die.

    It's unclear what these do today apart from the obvious tasks like initialization and power tracking. (On the GPU it handles thread scheduling.)
    Even more interesting is what Apple's long term game here is? To move as much of the OS as possible off the main cores onto these auxiliary cores (and so the wheel of reincarnation spins back to System/360 Channel Controllers?) For example (in principle...) you could run the file system stack on the flash controller, or the network on a controller living near the radio basebands, and have the OS communicate with them at a much more abstract level.

    Does this make sense for power, performance, or security reasons? Not a clue! But in a world where transistors are cheap, I'm glad to see Apple slowly rethinking OS and system design decisions that were basically made in the early 1970s, and that we've stuck with ever since then regardless of tech changes.
  • ex2bot - Sunday, October 7, 2018

    Much appreciate the review!
  • s.yu - Monday, October 8, 2018

    Great job as always Andrei!
    I would only have hoped for a more thorough exploration of the limits of the portrait mode, to see if Apple really makes proper use of the depth map – for example, taking a photo in portrait mode in a tunnel to see if the amount of blur is applied according to distance.
  • Mic_whos_right - Tuesday, October 9, 2018

    Thanks for this comment – now I know why there was nothing last year. Great AnandTech standard of a review! Always above my intellect of understanding, with info overload that teaches me a lot about the product.
  • Moh Qadee - Thursday, November 1, 2018

    Thank you for this great detailed review. I have been coming back to this review before making a purchase. Please make a comparison article of iPhone XS gaming versus other smartphones on the market. How much of a difference do thermals make over longer periods before it starts to throttle or heat up? Would you be able to give an approximate time before you noticed heat while gaming on the XS? I don't mind investing in an expensive phone as long as thermals don't limit the performance. There are phones like the Razer Phone 2 or the ROG Phone out there, and people compare them with the iPhone as it doesn't require much cooling. I wonder if gaming for 20+ minutes is enough to make the iPhone heat up to a point where you should be worried?
  • Ahadjisavvas - Monday, November 19, 2018

    And the exynos m3 had 12 execution ports right? Can you elaborate on the major differences between the design of the vortex core in the a12 and the meerkat core in the m3? I would deeply appreciate it if you could.
  • Ahadjisavvas - Monday, November 19, 2018

    And the exynos m3 has 12 execution ports right? Can you elaborate on the main differences between the design of the exynos m3 and the vortex core,that'll be really helpful and informative as well. Also,are you planning on writing a piece about the a12x soc, it'll be really interesting to hear how far apple has come with the soc on the 2018 ipad pro.
  • alysdexia - Monday, May 13, 2019

    "Now what is very interesting here is that this essentially looks identical to Apple’s Swift microarchitecture from Apple's A6 SoC."
    This comparison doesn't make sense and it seems like you took the same execution ports to determine whether the chips are identical, when the ports could be arbitrary for each release. Rather I took the specifications (feature size, revision) and dates of each from these pages: https://en.wikipedia.org/wiki/Comparison_of_ARMv7-... and https://en.wikipedia.org/wiki/Comparison_of_ARMv8-... to come up with these matches: Cortex-A15-A9 A6, Cortex-A15-A9 A6X, Cortex-A57-A53 A7, Cortex-A57-A53 A8, Cortex-A57-A53 A8X, Cortex-A57-A53 A9, Cortex-A57-A53 A9X, Cortex-A73 A10, Cortex-A73 A10X, Cortex-A75 A11, Cortex-A76 A12, Cortex-A76 A12X. For exemplum 5 execution ports could be gotten (I'm no computer engineer so this is a SWAG.) from the 3 in Cortex-A9 subtracted from the 8 in Cortex-A15 but the later big.LITTLEs with 9 and 5 ports could be split from 7 or 8 as (7+2)+(7−2) or (8+1)+(8−3). You need to correct the Anandtech and Wikipedia pages.

    faster -> swifter, swiftlier
    ISO -> iso -> idem
    's !-> they; 1 != 2
    great:small::big:lite::mickel:littel
  • RSAUser - Friday, October 5, 2018

    I still don't like iOS's tendency towards warmer photos than the scene is IRL.
  • DERSS - Saturday, October 6, 2018

    It is weird because it is warmer in bright light and bleaker in dim light.
    Why can't they just even it out, making the photos less yellowish in bright light and less bleak in dim light?
