The Apple A12 - First Commercial 7nm Silicon

Over the last few years Apple’s silicon design teams have been at the forefront of both architecture design and adopting bleeding-edge manufacturing processes. The Apple A12 is yet another generational jump for the company, as it’s able to claim to be the first commercially available piece of 7nm silicon.

When talking about process nodes, generally speaking the smaller the figure, the smaller the transistor features are. While the actual name of recent nodes have long lost any meaning in correlation to actual physical sizes, they still represent a jump in density, and thus, the ability for vendors to pack in more transistors in the same die space.

We thank TechInsights for publicly sharing their picture of the Apple A12, and we’ve followed up with posting a quick first analysis reaction to the die shot:


AnandTech modified TechInsights Apple A12 Die Shot

Going over it again here for this article, I’ve put down my own custom labelling and interpretation of the die shot. The new A12 largely follows Apple’s SoC layout structure (90° rotated to most past die shots).

On the right side we see the GPU complex with the four GPU cores and shared logic in the middle. The CPU complex is found at the bottom, with the two Vortex big CPU cores on the centre-left, divided by the big L2 cache, right next to the four small Tempest CPU cores and their own L2 cache.

The four big chunks of SRAM in the middle blocks are part of the system cache – this is a SoC-wide cache layer in between the memory controllers and the internal system interconnect & block memory subsystems. Here Apple uses this block as an energy saving feature: Because memory transactions to DRAM are quite expensive in terms of energy usage, being able to cache things on-chip will save a great amount of power, with the added benefit of possible performance increases due to the locality of the data.

The Apple A12’s system cache has, to date, seen the biggest change since its introduction in the Apple A7. The big change in layout also points out to a large change in the functionality of the block, as now we clearly see a separation of the block into what’s apparent to be four slices. On previous Apple SoCs such as in the A11 or A10, the system cache looked more like a single block of logic, with what appeared to be two slices. A doubling of the slices could possibly point out to a very large change in the memory performance of this block – something I very much believe to be the case following more analysis in the coming pages.

Finally, the last big introduction in the A12 is a major revamp of the neural accelerator IP. Here Apple claims to have moved from a dual-core design found in the A11, to a new 8-core design. Last year’s design was rumoured to be a CEVA IP – although we’ve never have full confirmation on that as Apple doesn’t want it to be known. 

During the keynote presentation it’s important to note that Apple never mentioned that this is an in-house design, something the marketing materials was always eager to point out for other IP blocks of the SoCs.

Edit: I've been notified that Apple does have the "Apple-designed" mention on the A12 webpage, meaning this does indeed mean it's an in-house IP.

The A12 being an 8-core design would point out to a 4x increase in performance – but the actual increase is near 8x, increasing from 600GigaOPs in the A11 to 5TeraOPs in the A12. In the die shot we see the 8 MAC engines surrounding a big central cache, with possible shared logic at the top that would be responsible for fixed function and fully connected layer processing.

Die Block Comparison (mm²)
SoC

Process Node
Apple A12

TSMC N7
Apple A11

TSMC 10FF
Total Die 83.27 87.66
Big Core 2.07 2.68
Small Core 0.43 0.53
CPU Complex (incl. cores) 11.90 14.48
GPU Total 14.88 15.28
GPU Core 3.23 4.43
NPU 5.79 1.83

Looking over the different block size changes from the A11 to the A12, we see the benefits of TSMC’s newer 7nm manufacturing node. It’s to be noted that nearly all IP blocks have undergone changes, so it’s not really a valid apples-to-apples comparison to determine just how much density has improved with the new manufacturing node. Still, taking a single GPU core as a possible candidate (as we largely see the same structure), we see a 37% decrease in size compared to the A11. It’s quite obvious that the new node has enabled Apple to add an additional GPU core even though in absolute terms, the GPU is still smaller in the A12.

A bigger CPU and a massive cache hierarchy


Credit: TechInsights Apple A12 Die Shot, ChipRebel Apple A11 Die Shot

Moving on to the CPU complex, and in particular the new big CPU core, we now see what is possibly Apple’s biggest change in its CPU layout in several generations. In particular, we see a doubling of the L1 data caches in the new Vortex CPU, increasing from 64KB to 128KB in the new core. In the front-end, we also saw a doubling of the SRAM blocks that I attribute to the L1 instruction caches – which I now believe to have also seen doubling to 128KB. It’s funny that even after several years, we still haven’t really figured out what the A10 had introduced into the front-end block: here we saw a new very large block of cache whose function largely remains unclear.

A big question over the years has been what Apple’s cache hierarchy actually looks like. Looking at the memory latency behaviour at different test depths, we see the different jumps at different test depths. I didn’t annotate the latency values as we’ll revisit them later in a non-logarithmic version of this graph.

On the big core side, we clearly see the L1$ jump from 64KB to 128KB, and I think there’s no doubt about the increases here. Moving into the L2 cache however, we’re seeing some odd behaviour in terms of the latency. It’s clear that around the 3MB range that there’s an increasing jump in latency, up until around 6MB. It’s to be noted that this behaviour of a slower second 3MB only happens when accessing in a fully random patterns, in smaller access windows the latency is flat up until 6MB.

Not dwelling more into this for now, we move into the area beyond 6MB that is served by the system cache. It’s hard to make it out at first because there’s an offset caused by overall lower latency, but in general the latency curve goes around 4MB further before we reach mostly DRAM latencies. This corresponds to what we actually see on the die: the new system cache not only has seen a doubling of its slices, but has also outright doubled in capacity from 4 to 8MB.

Moving onto an analysis of the little cores, things become a bit more complex. At first glance you would believe the A11’s small cores L2 was limited to 512KB and that the A12 goes up to 1.5MB, however what I think is going on is that we’re being tricked by the power management policy of the cache. Looking at the A11 Mistral core latency, we see some obvious jumps at 768KB and 1MB. A similar jump can be seen in the A12 cores at 2MB.

It’s at this point, where it’s best to go back to the die shots and do some pixel counting, and we come up with the following table:

Measured and Estimated Cache Sizes
SoC Apple A12 Apple A11
Big L1$ 128KB 64KB
Big L2$ 128 instances
6MB per core/thread
8MB total at 64KB/inst
128 instances
6MB per core/thread
8MB total at 64KB/inst
Small L1$ 32KB 32KB
Small L2$ 32 instances
1.5MB per core/thread
2MB total at 64KB/inst
16 + 2(?) instances
512KB per core/thread
1MB total at 64KB/inst
System Cache 4x 16 instances
(double size)

8MB at 128KB/inst
2x 32 instances


4MB at 64KB/inst

The big core L2’s haven’t seen any structural changes between the A11 and A12 – both have 128 instances of SRAM macros, separated into two chunks. The question here remains that if the L2 is indeed just 6MB, then that would mean 48KB per SRAM block.

When looking at the small cores, we see that they use identical SRAM macros. The A12’s small core L2 has doubled from 16 to 32 instances, meaning there must have been a doubling. However as we see that the measured latency depth of the L2 has at least tripled, something else must be going on. What we’re measuring by no means has to represent what’s in the hardware, and indeed we can confirm this by running the latency test in a more special manner that makes the power management think it’s just some minor workload. On the A12, the Tempest cores then appear to have only 512KB available to them.

The conclusion is here is that Apple is employing partial cache power-down in what seems to be per-bank granularity. On the A12 each small core L2 bank is 512KB, and on the A11 it’s 256KB. Also, this more strongly leads me to believe there’s 2MB on the A12 and 1MB on the A11, it’s just that the test probably doesn’t fulfill the policy requirements to access the full cache.

In turn, because this would confirm 64KB per SRAM instance, we can go back and make some hypotheses about the big core L2’s. Again, looking at it, one would believe it stops at 6MB, but looking closer, especially on the A12, there is a change of behaviour at 8MB. Again it’s likely that the cores have 8MB of physical cache, and there’s an obvious change in access behaviour once we’re nearing a full cache.

The take-away here is that Apple’s caches are just immense, and the A12 further expands in that regard by doubling the system cache size. In practice, we have around 16MB of useable cache hierarchy on the part of the big CPU cores – a massive amount that just puts the memory and cache subsystems of competing SoCs to shame.

An evolutionary GPU

On the GPU side of things, we had some big expectations from the A12, not merely in terms of performance, but also in terms of architecture. Last year there had been a press release from Imagination stating that Apple had informed them that the company planned to no longer use its IP in new products in 15 to 24 months’ time. This would eventually lead to and crash of the stock price and subsequent sale of the company to an equity firm.  

So while Apple did declare the A11 GPU as an in-house design, it still very much looked like an Imagination derived design, as its block design was very similar to prior Rogue generations – with the big difference being that what was now called a core is the larger structure of what was previously two cores. The fact that it’s still a TBDR (Tile-Based Deferred Renderer), which IMG holds many patents on, but not least the fact that Apple still very much exposes and supports PVRTC (PowerVR Texture Compression), a proprietary format, means that the GPU still likely linked to IMG’s IP. Here it’s likely that we’re still looking at an architectural license design rather than what we would usually call a “clean sheet” design.


Credit: TechInsights Apple A12 Die Shot, ChipRebel Apple A11 Die Shot

Moving onto the A12 GPU – model named as the G11P, we see some very obvious similarities to last year’s A11 GPU. The individual function blocks seem to be largely located the same and constructed in a similar fashion.

What I think Apple’s biggest advancements in the A12 GPU is the fact that it now supports memory compression. I was very surprised to hear this during the announcement because at the same time it means two things: Prior Apple SoCs and their GPUs apparently didn’t have memory compression, and that now this alone should amount for a very significant boost in performance of the new GPU.

By memory compression, in particular we mean transparent framebuffer compression from the GPU to main memory. In the desktop space, vendors like Nvidia and AMD have had this for many years now, and it enabled the advancement of GPU performance even in the face of non-advancing memory-bandwidth increases. Smartphone GPUs also require memory compression, not only because of limited bandwidth on mobile SoCs, but most importantly because of the reduced power consumption that is associated with high bandwidth requirements. Arm’s AFBC has been the most publicly talked about mechanism in the mobile space, but other vendors such as Qualcomm or even Imagination have their own implementations.

Apple seems to be very late to the party in only introducing this with the A12 – however it also means the A12 will benefit from an unusually large generational boost in efficiency and performance, which makes a lot of sense given Apple’s proclaimed large increases for the new GPU.

Introduction & Design The A12 Vortex CPU µarch: Massive Memory Improvements
Comments Locked

253 Comments

View All Comments

  • zepi - Saturday, October 6, 2018 - link

    Otherwise a nice idea, but Datacenter CPU-market is too little to be interesting for Apple, as crazy as it is.

    Intel makes about $5b/quarter selling Xeons and other Datacenter stuff.

    Apple makes some $50B. I don't think they can waste chip-development resources to design something for such a little "niche".
  • tipoo - Thursday, October 18, 2018 - link


    Well, it would be largely reusing the R&D they already do for iOS chips, making the high performance cores is the hardest part, scaling them up to more cores would be a fraction the work.
  • varase - Tuesday, October 23, 2018 - link

    The Enterprise server business is already a crowded field, and it's not really something Apple has any expertise with.

    In Apple terms, it's not like there's a huge profit potential there, even if they were successful.

    Why put all that effort into learning, when most of their income comes from a portable consumer device they first released in 2007?
  • iwod - Saturday, October 6, 2018 - link

    What are the other die area used for? The labels only has ~half of the die. I could add image signal processing, video encode and decode if that is not included in GPU. Some FPGA we know Apple had included in their SoC. But all that accounted that is likely less than 25% of that due space. What about the other 25%?
  • Glaurung - Sunday, October 7, 2018 - link

    Hardware accelerators for anything and everything that can be hardware accelerated.

    Plus the "secure enclave" is also on there somewhere - a fenced off, cut down SOC within the SOC for handling logins/unlocking and other security stuff.
  • Antony Newman - Sunday, October 7, 2018 - link

    Andrei - This is an awesome review. Do you think Apple could roll out a low end laptop with 6 Vortex cores - or are there still SoC design areas that Apple still needs to address?

    AJ
  • Constructor - Sunday, October 7, 2018 - link

    I'm not Andrei, but my speculation on this would be:

    • It would make no sense to start with the weakest Macs because that would put the transition to Apple's own CPUs in a bad light from the start. As in the Intel transition 12 years ago they would need to start with the middle of their lineup (with iMacs and MacBook Pros) in order to demonstrate the strength of the new CPU platform and to motivate software developers to jump on board, including actually working on the new machines full time if possible.

    • They would need to have an emulation infrastructure for Intel legacy code in place like they did with Rosetta back then (also for Windows/Linux VMs!). And even in emulation that legacy code cannot be much slower than natively on then-current Intel machines, so their own CPUs already need to be a good bit faster than the corresponding Intel ones at the time in order to compensate for most of the emulation cost.

    • As in 2006, this would have a significant impact on macOS so at announcement they would need to push at least developer versions of the new macOS to developers. Back in 2006 they had Intel-based developer systems ready before the actual Intel Macs came out – this time they could actually provide a macOS developer version for the then top-of-the-line iPads until the first ARM-based Macs were available (which already support Blutooth keyboards now and could then just support Bluetooth mice and trackpads as well). But this also means that as back then, they would need to announce the transition at WWDC to explain it all and to get the developers into the boat.

    • Of course Apple would need to build desktop/notebook capable versions of their CPUs with all the necessary infrastructure (PCIe, multiple USB, Thunderbolt) but on the other hand they'd have more power and active cooling to work with, so they could go to more big cores and to higher clock speeds.

    Again: This is sheer speculation, but the signs are accumulating that something this that may indeed be in the cards with Intel stagnating and Apple still plowing ahead.

    I just don't think that it would be practical to put the current level of Apple CPUs into a Mac just like that even though from sheer CPU performance it looks feasible. These transitions have always been a massive undertaking and can't just be shot from the hip, even though the nominal performance seems almost there right now.
  • Constructor - Sunday, October 7, 2018 - link

    Oops – this forum insists on putting italics into separate lines. Oh well.
  • ex2bot - Sunday, October 7, 2018 - link

    Not to mention they’d have to maintain two processor architectures for an extended period. By that, I mean, I doubt they’d transition high-end Macs for a long, long time to avoid angering pros... again.
  • serendip - Monday, October 8, 2018 - link

    A real left field move would be for Apple to release a MacOS tablet running ARM, like a Qualcomm Windows tablet. I wouldn't rule it out considering how Apple went from a single product for the iPhone and iPad to making multiple sizes.

Log in

Don't have an account? Sign up now