CPU MT Performance: A Real Monster

More interesting than ST performance is MT performance. With 8 performance cores and 2 efficiency cores, this is now the largest iteration of Apple Silicon we’ve seen.

As a prelude to the scores, I wanted to remark on a few things about the previous, smaller M1 chip. The 4+4 setup on the M1 meant that a significant chunk of its MT performance was enabled by the E-cores, with the SPECint score in particular seeing a +33% performance boost versus just the 4 P-cores of the system. Because the new M1 Pro and Max have 2 fewer E-cores, assuming linear scaling, the theoretical peak of the M1 Pro/Max should be about +62% over the M1. Of course, the new chips should behave better than linear, due to the better memory subsystem.
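
The linear-scaling estimate above can be reproduced with a few lines of arithmetic. This is a rough sketch rather than a measurement: it assumes every E-core contributes an equal share of the M1's +33% uplift and that throughput adds up perfectly across cores. The function and variable names are made up for illustration.

```python
# Back-of-the-envelope linear scaling model (illustrative only).
# Throughput is expressed in units of "one P-core".

M1_P_CORES = 4
M1_E_CORES = 4
M1_E_UPLIFT = 0.33  # SPECint gain from the 4 E-cores over 4 P-cores alone (from the text)

# Assumption: every E-core contributes equally to that uplift.
e_core_value = M1_P_CORES * M1_E_UPLIFT / M1_E_CORES  # P-core equivalents per E-core


def linear_throughput(p_cores, e_cores):
    """Estimated MT throughput in P-core equivalents, assuming perfect scaling."""
    return p_cores + e_cores * e_core_value


m1 = linear_throughput(4, 4)          # 4P + 4E
m1_pro_max = linear_throughput(8, 2)  # 8P + 2E
gain = m1_pro_max / m1 - 1

print(f"M1: {m1:.2f}, M1 Pro/Max: {m1_pro_max:.2f}, gain: {gain:+.1%}")
```

With a +33% input the model lands a little under +63%, in line with the roughly +62% quoted above; the small gap is presumably down to rounding of the +33% figure.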

In the detailed scores I’m showcasing the full 8+2 scores of the new chips, and later we’ll talk about the 8 P scores in context. I hadn’t run the MT scores of the new Fortran compiler set on the M1, so some numbers will be missing from the charts.

SPECint2017 Rate-N Estimated Scores

Looking at the data, there are very evident changes to Apple’s performance positioning with the new 10-core CPU. Although, yes, Apple does have 2 additional cores versus the 8-core 11980HK or the 5980HS, Apple’s silicon is far ahead of either competitor in most workloads. Again, to reiterate, we’re comparing the M1 Max against Intel’s best of the best, and also nearly AMD’s best (the 5980HX has a 45W TDP).

The one workload standing out to me the most was 502.gcc_r, where the M1 Max nearly doubles the M1 score and lands 69% ahead of the 11980HK. We’re seeing similar mind-boggling performance deltas in other workloads: memory-bound tests such as mcf and omnetpp are evidently Apple’s forte. A few of the workloads, mostly the more core-bound or L2-resident ones, show smaller advantages, or sometimes even fall behind AMD’s CPUs.

SPECfp2017 Rate-N Estimated Scores

The fp2017 suite has more memory-bound workloads, and it’s here where the M1 Max is absolutely absurd. The workloads that put the most pressure on memory and stress the DRAM the most, such as 503.bwaves, 519.lbm, 549.fotonik3d and 554.roms, all show performance advantages of multiple factors over the best Intel and AMD have to offer.

The performance differences here are just insane, and really showcase just how far ahead Apple’s memory subsystem is in its ability to allow the CPUs to scale to such a degree in memory-bound workloads.

Even workloads which are more execution-bound, such as 511.povray or 538.imagick, are still very much in favour of the M1 Max, albeit not as dramatically, achieving significantly better performance at drastically lower power.

We noted how the M1 Max CPUs are not able to fully take advantage of the DRAM bandwidth of the chip. As of writing we haven’t measured the M1 Pro, but we imagine that design would not score much lower than the M1 Max here. We can’t help but ask ourselves how much better the CPUs would score if the cluster and fabric allowed them to fully utilise the memory.

SPEC2017 Rate-N Estimated Total

In the aggregate scores, there are two sides. On the SPECint suite, the M1 Max lies +37% ahead of the best competition. It’s a very clear win, and given the power levels and TDPs, the performance-per-watt advantage is clear. The M1 Max is also able to outperform desktop chips such as the 11900K, or AMD’s 5800X.

In the SPECfp suite, the M1 Max is in its own category of silicon with no comparison in the market. It completely demolishes any laptop contender, showcasing 2.2x the performance of the second-best laptop chip. The M1 Max even manages to outperform the 16-core 5950X, a chip whose package power is at 142W, with the rest of the system drawing quite a bit more on top of that. It’s an absolutely absurd comparison and a situation we haven’t seen the likes of.

We also ran the chip with just the 8 performance cores active. As expected, the scores are a little lower, by 7-9%: the 2 E-cores here represent a much smaller percentage of the total MT performance than on the M1.
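
To put the E-core contribution in perspective, the figures quoted in this section can be turned into shares of the total MT score. This is a minimal sketch that takes the 7-9% deficit and the M1's +33% uplift as its only inputs; the helper name is hypothetical.

```python
def e_core_share_from_uplift(uplift):
    """Share of the total MT score that comes from the E-cores,
    given the relative gain they add over the P-cores alone."""
    return uplift / (1 + uplift)


# M1: the 4 E-cores add +33% over the 4 P-cores (figure quoted earlier in the text).
m1_share = e_core_share_from_uplift(0.33)

# M1 Max: running the 8 P-cores alone scores 7-9% below the full 8+2 configuration,
# so the 2 E-cores account for that same 7-9% of the full score.
m1_max_share_low, m1_max_share_high = 0.07, 0.09

print(f"M1 E-core share of MT score:     {m1_share:.1%}")
print(f"M1 Max E-core share of MT score: {m1_max_share_low:.0%} to {m1_max_share_high:.0%}")
```

On these numbers the E-cores supply roughly a quarter of the M1's MT score, against less than a tenth on the M1 Max.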

Apple’s stark advantage in specific workloads does make us ask how this translates into applications and use cases. We’ve never seen such a design before, so it’s not exactly clear where things would land, but I think Apple has been rather clear that their focus with these designs is catering to the content creation crowd, the power users who use the large productivity applications, be it in video editing, audio mastering, or code compiling. These are all areas where the microarchitectural characteristics of the M1 Pro/Max would shine, and where they are likely to vastly outperform any other system out there.


493 Comments


  • celeste_P - Tuesday, October 26, 2021 - link

    Does anyone know where I can find the policy on translating/reprinting the article? Does AnandTech allow this? What are the policies one needs to follow?
    This article is quite interesting and I want to translate/publish it on a Chinese website to share with a broader range of people
  • colinstalter - Wednesday, October 27, 2021 - link

    Why not just share the URL on the Chinese page? Do people in China not have translator functions built into their web browsers like Chrome does?
  • celeste_P - Wednesday, October 27, 2021 - link

    Of course they do XD
    But as you can imagine, the quality of machine translation won't be that great, especially considering all these domain specific terms within this article.
  • ABR - Tuesday, October 26, 2021 - link

    An excellent review.
  • ajmas - Tuesday, October 26, 2021 - link

    Given the number of games already available and running on iOS, I wonder how much work would be involved in making them available on macOS?

    As for effective performance, I am eagerly waiting to see what the real world tests reveal, since specs only say so much.
  • mandirabl - Wednesday, October 27, 2021 - link

    As a developer, technically you don't have to do much, just re-compile the game and check another box (for Mac), basically.

    The problem is: iOS games are mostly touch-focused, whereas macOS is mouse-first. So they have to check if that translates without changing anything. If it does, it's a matter of a couple of minutes. If it doesn't translate well ... they have a choice to release it anyway or block access on macOS. Yes, developers have to actually decide against releasing their app/game for macOS - if they don't do anything in that regard, the app/game simply shows up in an App Store search on a Mac.
  • Kevin45 - Tuesday, October 26, 2021 - link

    Apple's goal is very simple: If you are going to provide SW tools for Pro users of the MacOS platform, you write to Metal - period.

    It IS the most superior way to take advantage of what Apple has laid out to developers and Apple's Pro users absolutely want the HW tools they buy to be max'd out by the developers.

    Apple has taken an approach Intel and AMD cannot. Unified memory design aside, Apple has looked at its creative markets and developed sub-cores for this creative-focused segment, which Apple markets as its "Media Engine", with hardware h.264 and hardware ProRes compute that just crush these formats and codecs.

    The argument "Yah, but the CPU and GPU cores aren't the most powerful that one can buy." is still. They don't need to be because they have dedicated cores to where the power needs to be. Sure, in a Wintel world, or Linux space, more powerful GPU and CPU cores is all they've got. So when talking those worlds indeed that's the correct argument. Not when talking Apple HW with Apple silicon.

    Intel has fought nVIDIA to have their beefier and beefier cores do heavy lifting, while nVIDIA wants the GPU to be the most important play in the mix. Apple has broken out their SoC into many sub-sets to meet the high compute needs of its user base.

    Now more than ever, developers that have dragged their feet need to get onboard. As companies continue to show off - such as Apple with FCP, Motion and Compressor optimized apps for the hardware, even DaVinci (niche player but powerful) - they put pressure on other players such as sloth-boy Adobe to get going and truly write for Apple's tools that take advantage of such a well thought out HW + SW combo.
  • richardnpaul - Tuesday, October 26, 2021 - link

    The article comes across a bit fanboy. (yes, yes I know that this is usually the case here but I just wanted to say it out loud again). See below for why.

    You have covered in depth things like how the increased L3 design between Zen2 and 3 can cause big jumps in performance, and what was missing here was discussion of how the 24/48MB cache at the memory interface impacts performance, especially when using the GPU (we saw AMD's designs do exactly this last year, improving performance by reducing the impact of calling out to the slow GDDR6 RAM).

    The GPU is nothing special. 10Tflops at 1.3GHz puts it around the same class as a Vega64, a 14nm design, which similarly used RAM packaged on an interposer with the GPU (being 14nm it was big, 5nm makes it much more reasonable). With the buffer cache I'd expect it might perform better, also the CPUs will bump up performance (just look at how much more FPS you get with Zen3 over Zen2 and with Zen3 with vcache it'll be another 15% more on top from exactly the same GPU hardware and that's with the CPU and GPU having to talk over PCI-E).

    Also, Apple have made themselves second class gaming citizens with their decision to build Metal and enforce it as the only API (I may be mistaken here but as far as I'm aware the whole reason for MoltenVK is because you have to use Metal on MacOS and developers have introduced this Vulkan to Metal shim to ease porting). Also, as I understand it, you can't connect external dGPUs via Thunderbolt to provide comparisons. Apple's vendor lock-in at its worst (have I mentioned that Apple are their own worst enemy a lot of the time?)

    As such the gaming performance doesn't surprise me, this is a technically much slower and inferior GPU to AMD and nVIDIA's current designs on an older process (7nm and 8nm respectively). The cost is that whilst these are faster, they're larger and more power hungry, though a die shrink would bring something like an AMD 6600 based chip into the same ballpark.

    Also on the 512bit memory interface I'd probably look at it more like 384bit plus 128bit, which is the GPU plus the usual CPU interfaces. The CPU is always going to contend for some of that 512bit interface, so you're never going to see 512bit for the GPU; on the other hand, you get whatever the CPU doesn't use for free, which is a great bonus of this design, and if the CPU needs more than a 128bit interface can manage it has access to that too if the GPU isn't heavily loaded on the memory interface.

    I kind of expect you guys to cover all this though in the article, not have me railing at the lack of it in the comments section.
  • richardnpaul - Tuesday, October 26, 2021 - link

    Oh and you failed to ever mention that the trade-off of the design is that you need to buy all the RAM you'll ever need up front because it's soldered to the SoC package. The reason that we don't normally see such designs is that the trade-off is potentially expensive unsaleable parts. The cost of these laptops is way above the usual and whilst they have some really nice tech this is one of the other downsides of this design (and the 5nm node and the amount of silicon).
  • OreoCookie - Tuesday, October 26, 2021 - link

    Or perhaps Anandtech gave it a glowing review simply because the M1 Max is fast and energy efficient at the same time? In memory intensive benchmarks it was 2-5 x faster than the x86 competition while being more energy efficient. What more do you want?

    And the article *was* including a Zen 3 mobile part in its comparison and the M1 Max was faster while consuming less energy. Since the V-Cache version of Zen 3 hasn't been released yet, there are no benchmarks for Anandtech to release as they either haven't been run yet or are under embargo.

    Lastly, this article is about some of the low-level capabilities of the hardware, not vendor lock-in or whether Metal is better or worse than Vulkan. They did not even test the ML accelerator or hardware codec bits (which is completely fair).
