The iPhone XS & XS Max Review: Unveiling the Silicon Secrets

Author: Andrei Frumusanu

by Andrei Frumusanu on October 5, 2018 8:00 AM EST

The A12 Vortex CPU µarch

When talking about the Vortex microarchitecture, we first need to talk about exactly what kind of frequencies we’re seeing on Apple’s new SoC. Over the last few generations Apple has been steadily raising frequencies of its big cores, all while also raising the microarchitecture’s IPC. I did a quick test of the frequency behaviour of the A12 versus the A11, and came up with the following table:

Maximum Frequency vs Loaded Threads (Per-Core Maximum, MHz)
Apple A11	1	2	3	4	5	6
Big 1	2380	2325	2083	2083	2083	2083
Big 2		2325	2083	2083	2083	2083
Little 1			1694	1587	1587	1587
Little 2				1587	1587	1587
Little 3					1587	1587
Little 4						1587
Apple A12	1	2	3	4	5	6
Big 1	2500	2380	2380	2380	2380	2380
Big 2		2380	2380	2380	2380	2380
Little 1			1587	1562	1562	1538
Little 2				1562	1562	1538
Little 3					1562	1538
Little 4						1538

Both the A11’s and the A12’s maximum frequency is actually a single-thread boost clock – 2380MHz for the A11’s Monsoon cores and 2500MHz for the new Vortex cores in the A12. This is just a 5% boost in frequency in ST applications. When adding a second big thread, the A11 and A12 clock down to 2325MHz and 2380MHz, respectively. It’s when we are also concurrently running threads on the small cores that things between the two SoCs diverge: while the A11 further clocks down to 2083MHz, the A12 retains the same 2380MHz until it hits thermal limits and eventually throttles down.
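
The clock differences described above can be checked directly from the table. The following is a minimal sketch that only uses the big-core figures listed there; the dictionary and function names are illustrative and not part of any test tool used for the review.

```python
# Big-core clocks in MHz, copied from the table above.
# Keys are the number of loaded threads.
A11_BIG_MHZ = {1: 2380, 2: 2325, 3: 2083, 4: 2083, 5: 2083, 6: 2083}
A12_BIG_MHZ = {1: 2500, 2: 2380, 3: 2380, 4: 2380, 5: 2380, 6: 2380}


def uplift_percent(new_mhz: float, old_mhz: float) -> float:
    """Relative clock increase of new_mhz over old_mhz, in percent."""
    return (new_mhz / old_mhz - 1.0) * 100.0


for threads in sorted(A12_BIG_MHZ):
    gain = uplift_percent(A12_BIG_MHZ[threads], A11_BIG_MHZ[threads])
    print(f"{threads} loaded thread(s): A12 big cores clock {gain:.1f}% higher")
```

With these inputs the single-thread uplift works out to about 5.0%, the two-thread uplift to about 2.4%, and the uplift with three or more loaded threads to about 14.3%, which is where the A12 pulls ahead in sustained clocks.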

On the small core side of things, the new Tempest cores are actually clocked more conservatively compared to their Mistral predecessors. When the system had just one small core running on the A11, it would boost up to 1694MHz. This behaviour is now gone on the A12, and the maximum clock is 1587MHz. The frequency further reduces slightly to 1538MHz when there are four small cores fully loaded.

Much improved memory latency

As mentioned on the previous page, it’s evident that Apple has put a significant amount of work into the cache hierarchy as well as the memory subsystem of the A12. Going back to a linear latency graph, we see the following behaviours for full random latencies, for both big and small cores:

The Vortex cores have only a 5% boost in frequency over the Monsoon cores, yet the absolute L2 memory latency has improved by 29% from ~11.5ns down to ~8.8ns. This means the new Vortex cores’ L2 cache now completes its operations in significantly fewer cycles. On the Tempest side, the L2 cycle latency seems to have remained the same, but again there’s been a large change in terms of the L2 partitioning and power management, allowing access to a larger chunk of the physical L2.
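
To see why the absolute improvement implies fewer cycles, the nanosecond figures can be converted into core clock cycles. This is a rough sketch that assumes the latency test ran at the single-thread boost clocks from the frequency table (2380MHz for Monsoon, 2500MHz for Vortex); the article does not state the clock during the test, so treat the results as estimates.

```python
def latency_cycles(latency_ns: float, clock_mhz: float) -> float:
    """Convert an absolute latency in nanoseconds to core clock cycles."""
    # 1 MHz is 1e6 cycles per second, 1 ns is 1e-9 seconds,
    # so cycles = ns * MHz / 1000.
    return latency_ns * clock_mhz / 1000.0


# Approximate L2 latencies quoted in the text, at the assumed boost clocks.
monsoon_cycles = latency_cycles(11.5, 2380)
vortex_cycles = latency_cycles(8.8, 2500)

print(f"Monsoon L2: ~{monsoon_cycles:.1f} cycles")
print(f"Vortex L2:  ~{vortex_cycles:.1f} cycles")
```

Under that assumption the L2 access drops from roughly 27 cycles to roughly 22 cycles, so the gain comes from the cache design rather than from the small clock increase.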

I only ran the test up to a depth of 64MB, and it’s evident that the latency curves don’t flatten out yet in this data set, but it’s visible that latency to DRAM has seen some improvements. The larger difference in DRAM access latency of the Tempest cores could be explained by a raising of the maximum memory controller DVFS frequency when just small cores are active – their performance will look better when there’s also a big thread running on the big cores.

The system cache of the A12 has seen some dramatic changes in its behaviour. While bandwidth in this part of the cache hierarchy has seen a reduction compared to the A11, the latency has been much improved. One significant effect here, which can be attributed either to the L2 prefetcher or, as I also see as a possibility, to prefetchers on the system cache side, is that the latency performance as well as the number of streaming prefetchers has gone up.

Instruction throughput and latency

Backend Execution Throughput and Latency
	Cortex-A75		Cortex-A76		Exynos-M3		Monsoon | Vortex
	Exec	Lat	Exec	Lat	Exec	Lat	Exec	Lat
Integer Arithmetic ADD	2	1	3	1	4	1	6	1
Integer Multiply 32b MUL	1	3	1	2	2	3	2	4
Integer Multiply 64b MUL	1	3	1	2	1 (2x 0.5)	4	2	4
Integer Division 32b SDIV	0.25	12	0.2	< 12	1/12 - 1	< 12	0.2	10 | 8
Integer Division 64b SDIV	0.25	12	0.2	< 12	1/21 - 1	< 21	0.2	10 | 8
Move MOV	2	1	3	1	3	1	3	1
Shift ops LSL	2	1	3	1	3	1	6	1
Load instructions	2	4	2	4	2	4	2
Store instructions	2	1	2	1	1	1	2
FP Arithmetic FADD	2	3	2	2	3	2	3	3
FP Multiply FMUL	2	3	2	3	3	4	3	4
Multiply Accumulate MLA	2	5	2	4	3	4	3	4
FP Division (S-form)	0.2-0.33	6-10	0.66	7	>0.16	12	0.5 | 1	10 | 8
FP Load	2	5	2	5	2	5
FP Store	2	1-N	2	2	2	1
Vector Arithmetic	2	3	2	2	3	1	3	2
Vector Multiply	1	4	1	4	1	3	3	3
Vector Multiply Accumulate	1	4	1	4	1	3	3	3
Vector FP Arithmetic	2	3	2	2	3	2	3	3
Vector FP Multiply	2	3	2	3	1	3	3	4
Vector Chained MAC (VMLA)	2	6	2	5	3	5	3	3
Vector FP Fused MAC (VFMA)	2	5	2	4	3	4	3	3

To compare the backend characteristics of Vortex, we’ve tested the instruction throughput and latency. The backend throughput is determined by the number of execution units, while the latency is dictated by the quality of their design.
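
As a rough illustration of how the "Exec" (instructions per cycle) and "Lat" (cycles) columns in the table above translate into execution time, here is a deliberately simplified sketch. It assumes an idealised backend with no frontend, memory, or scheduling limits, which no real core achieves; the function names are hypothetical and the inputs are the ADD and 64-bit MUL rows for Vortex and the Cortex-A76.

```python
import math


def independent_cycles(n_ops: int, per_cycle: float) -> int:
    """Cycles to execute n_ops independent ops at a given throughput."""
    return math.ceil(n_ops / per_cycle)


def dependent_cycles(n_ops: int, latency: int) -> int:
    """Cycles for a chain where each op needs the previous op's result."""
    return n_ops * latency


N = 1200

# Independent integer ADDs: limited by throughput (Vortex 6/cycle, A76 3/cycle).
print("ADD, independent:", independent_cycles(N, 6), "vs", independent_cycles(N, 3))

# Dependent 64-bit MULs: limited by latency (Vortex 4 cycles, A76 2 cycles).
print("MUL, dependent:  ", dependent_cycles(N, 4), "vs", dependent_cycles(N, 2))
```

In this model the wider core wins whenever the code exposes enough independent work, while a long dependency chain is bound by latency regardless of how many units are available.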

The Vortex core looks pretty much the same as its predecessor Monsoon (A11), with the exception that we’re seemingly looking at new division units, as the execution latency has been cut by 2 cycles on both the integer and FP side. On the FP side the division throughput has also doubled.

Monsoon (A11) was a major microarchitectural update in terms of the mid-core and backend. It’s there that Apple shifted the microarchitecture from the 6-wide decode of Hurricane (A10) to a 7-wide decode. The most significant change in the backend here was the addition of two integer ALU units, upping them from 4 to 6 units.

Monsoon (A11) and Vortex (A12) are extremely wide machines: with 6 integer execution pipelines (two of which are complex units), two load/store units, two branch ports, and three FP/vector pipelines, this gives an estimated 13 execution ports, far wider than Arm’s upcoming Cortex-A76 and also wider than Samsung’s M3. In fact, assuming we're not looking at an atypical shared port situation, Apple’s microarchitecture seems to far surpass anything else in terms of width, including desktop CPUs.
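
The port estimate is simply the sum of the pipeline counts listed above. A small sketch of that tally follows; the grouping labels are mine, and the count assumes no ports are shared between unit types, as the paragraph itself cautions.

```python
# Estimated execution pipelines per type, as listed in the text.
VORTEX_PORTS = {
    "integer ALU (two of them complex)": 6,
    "load/store": 2,
    "branch": 2,
    "FP/vector": 3,
}

total_ports = sum(VORTEX_PORTS.values())
print(f"Estimated execution ports: {total_ports}")
```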

253 Comments

eastcoast_pete - Sunday, October 7, 2018 - link
Apple's strength (supremacy) in the performance of their SoCs really lies in the fine-tuned match of apps and especially low-level software that make good use of excellent hardware. What happens when that doesn't happen was outlined in detail by Andrei in his reviews of Samsung's Mongoose M3 SoC: to borrow a famous movie line, it "could've been a contender", but really isn't. Apple's tight integration is the key factor that a more open ecosystem (Android) has a hard time matching; however, Google and (especially) Qualcomm leave a lot of possible performance improvements on the table through really poor collaboration; for example, GPU-assisted computing is AWOL on Android, which is not a smart move when you try to compete against Apple.
varase - Tuesday, October 23, 2018 - link
I have serious doubts that Android would even run on an A12 SoC - I thought Apple trashed ARMv7 when it went to A11.
Strafeb - Saturday, October 6, 2018 - link
It would be interesting to see comparison of screen efficiency of iPhone XR's low res LCD screen, and also some of LG's pOLED screens like in V40.
Alistair - Saturday, October 6, 2018 - link
The Xeon Platinum 8176 is a 28 core, $9000 Intel server CPU, based on Skylake. In single threaded performance, the iPhone XS outperforms it by 12 percent for integers, despite its lower clock speed. If the iPhone were to run at 3.8GHz, the Apple A12 would outperform Intel's CPU by 64 percent on average for integer tests.

iPhone XS and A12 numbers from: https://www.anandtech.com/show/13392/the-iphone-xs...

Xeon numbers from: https://www.anandtech.com/show/12694/assessing-cav...

spreadsheet: https://docs.google.com/spreadsheets/d/1ipKIh4i56o...

image of chart: https://i.imgur.com/IAupi9p.jpg

Think about that, the iPhone's CPU IPC (performance per clock) is already higher in integer performance now. Those tests include: spam filter, compression, compiling, vehicle scheduling, game ai, protein seq. analyses, chess, quantum simulation, video encoding, network sim, pathfinding, and xml processing. Test takes hours to run.
SanX - Saturday, October 6, 2018 - link
Yes, and while Apple and all other mobile processor manufacturers charge $5 per core, Intel charges $300.
yeeeeman - Saturday, October 6, 2018 - link
It might be faster in single thread, but in MT it gets toasted by the Xeon. The Xeon is 9000$ for a few reasons:
- it is an enterprise chip;
- it supports ecc;
- it supports up to 8 cpus on a board;
- it supports tons of ram, a LOT of memory channels;
- it has almost 40MB of L3 cache, compared to 8MB in the A12;
- it has a ring bus architecture meaning all those cores have very low latency between them and to memory;
- it has CISC instructions, meaning that when you get out of basic phone apps and you start doing scientific/database/HPC stuff, you will see a lot of benefits and performance improvements from executing a single instruction for a specific operation, compared to the RISC nature of A12;
- it supports AVX512, needed for high performance computing. In this, the A12 would get smashed;
- and many more;
So the Xeon 8180 is still a mighty impressive chip and Intel has invested some real thought and experience into making it. Things that Apple doesn't have.
I get it, it is nice to see Apple having a chip with this much compute power in such a low TDP and it is due to the fact that x86 chips have a lot of extra stuff added in for legacy. But don't get carried away with this, what Apple is doing now from a uArch point of view is not new. Desktop chips had this stuff 15 years ago. The difference is that Apple works on the latest fabrication process and doesn't care about x86 legacy.
Alistair - Saturday, October 6, 2018 - link
"It might be faster in single thread, but in MT it gets toasted by the Xeon"

That is totally irrelevant. Obviously Apple could easily make a chip with more cores. Just like Cavium's Thunder. 8 x A12 Vortex cores would beat an 8 core Xeon in integer calculations easily enough.
eastcoast_pete - Sunday, October 7, 2018 - link
Agree on your points re. the XEON. However, I'd still like to see Apple launch CPUs/iGPUs based on their design especially in the laptop space, where Intel still rules and charges premium prices. If nothing else, Apple getting into that game would fan the flames under Intel's chair that AMD is trying to kindle (started to work for desktop CPUs). In the end, we all benefit if Chipzilla either gets off its enormous bottom(line) and innovates more, or gets pushed to the side by superior tech. So, even as a non-Apple user: go Apple, go!
Constructor - Sunday, October 7, 2018 - link

"- it has CISC instructions, meaning that when you get out of basic phone apps and you start doing scientific/database/HPC stuff, you will see a lot of benefits and performance improvements from executing a single instruction for a specific operation, compared to the RISC nature of A12;"

CISC instructions generally don't really do much more than RISC ones do – they just have more addressing modes while RISC is almost always register-to-register with separate Load & Store.

That just doesn't make any difference any more because the bottleneck is not instruction fetching (as it once was in the old times) but actually execution unit pipeline congestion, including of the Load & Store units.

"- it supports AVX512, needed for high performance computing. In this, the A12 would get smashed;"

There's already a scalable vector extension for ARM which Apple could adopt if that was actually a bottleneck. And even the existing vector units aren't anything to scoff at – the issue is more that Intel CPUs are forced to drop down to half their nominal clock once you actually use AVX512; it could actually be more efficient to optimize the regular vector units for full speed operation to make up for it.

"So the Xeon 8180 is still a mighty impressive chip and Intel has invested some real thought and experience into making it. Things that Apple doesn't have."

We actually have no clue what Apple is investing in behind closed doors until they slam it on the table as a finished product ready for sale!
tipoo - Thursday, October 18, 2018 - link
I'm hoping Apple takes the ARM switch as an opportunity to bring an ARM AVX-512 equivalent down to more products, like the iMac.
