Apple's M1 Pro, M1 Max SoCs Investigated: New Performance and Efficiency Heights

Author: Andrei Frumusanu

by Andrei Frumusanu on October 25, 2021 9:00 AM EST


Power Behaviour: No Real TDP, but Wide Range

Last year when we reviewed the M1 inside the Mac mini, we did some rough power measurements based on the wall-power of the machine. Since then, we learned how to read out Apple’s individual CPU, GPU, NPU and memory controller power figures, as well as total advertised package power. We repeat the exercise here for the 16” MacBook Pro, focusing on chip package power, as well as AC active wall power, meaning device load power, minus idle power.

Apple doesn’t advertise any TDP for the chips of the devices – it’s our understanding that one simply doesn’t exist, and the only limitation on the power draw of the chips and laptops is thermals. As long as temperature is kept in check, the silicon will not throttle or limit itself in terms of power draw. Of course, there’s still an actual average power draw figure under different scenarios, which is what we test here:

Apple MacBook Pro 16 M1 Max Power Behaviour

Starting off with device idle, the chip reports a package power of around 200mW when doing nothing but idling on a static screen. This is extremely low compared to competitor designs, and is likely a reason Apple is able to achieve such fantastic battery life. The AC wall power under idle was 7.2W; this was on Apple’s included 140W charger, and while the laptop was on minimum display brightness – it’s likely the actual DC battery power under this scenario is much lower, but lacking the ability to measure this, it’s the next-best thing we have. One should probably assume a 90% efficiency figure in the AC-to-DC conversion chain from 230V wall to 28V USB-C MagSafe to whatever the internal PMIC usage voltage of the device is.
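As a rough illustration of that conversion-chain assumption, the sketch below scales the measured idle wall figure by an assumed 90% efficiency. The 0.90 value is the assumption stated above and not a measurement, and the function name is purely illustrative.

```python
def estimate_dc_power(wall_watts: float, efficiency: float = 0.90) -> float:
    """Estimate DC-side power from AC wall power.

    `efficiency` is an assumed AC-to-DC conversion efficiency, not a measured one.
    """
    if not 0.0 < efficiency <= 1.0:
        raise ValueError("efficiency must be in the range (0, 1]")
    return wall_watts * efficiency


idle_wall_w = 7.2  # measured AC wall power at idle, minimum display brightness
idle_dc_w = estimate_dc_power(idle_wall_w)
print(f"Estimated idle DC power: {idle_dc_w:.2f} W")
```

This only bounds the conversion loss; it says nothing about how the remaining power is split between the display, the SoC and the rest of the platform.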

In single-threaded workloads, such as CineBench r23 and SPEC 502.gcc_r, both of which are more mixed in terms of pure computation versus memory demands, we see the chip report 11W package power; however, we’re only measuring an 8.5-8.7W difference at the wall when under use. It’s possible the software is over-reporting things here. The actual CPU cluster is only using around 4-5W under this scenario, and we don’t seem to see much of a difference to the M1 in that regard. The package and active power are higher than what we’ve seen on the M1, which could be explained by the much larger memory resources of the M1 Max. 511.povray is mostly core-bound with little memory traffic, so the reported package power is lower, although at the wall the difference is again minor.

In multi-threaded scenarios, the package and wall power vary from 34-43W on package, and wall active power from 40 to 62W. 503.bwaves stands out as having a larger difference between wall power and reported package power – although Apple’s powermetrics showcases a “DRAM” power figure, I think this is just the memory controllers, and that the actual DRAM is not accounted for in the package power figure – the extra wattage that we’re measuring here, because it’s a massive DRAM workload, would be the memory of the M1 Max package.
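To make that gap concrete, here is a small sketch that subtracts reported package power from active wall power for the multi-threaded rows of the results table in this section. The interpretation is an assumption: the wall figure also includes conversion losses and everything else outside the package, so the difference is only an upper bound on what the DRAM could be drawing.

```python
# (reported package power in W, active wall power in W), multi-threaded rows
# for the M1 Max, taken from the results table in this section.
mt_results = {
    "CB23 MT": (34.0, 39.7),
    "502 MT": (36.9, 44.8),
    "511 MT": (40.9, 50.8),
    "503 MT": (43.9, 62.3),
}

# Power measured at the wall that the package figure does not account for.
gaps = {name: round(wall - pkg, 1) for name, (pkg, wall) in mt_results.items()}
largest = max(gaps, key=gaps.get)

for name, gap in gaps.items():
    print(f"{name}: {gap:.1f} W unaccounted for by package power")
print(f"Largest gap: {largest}")
```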

On the GPU side, we lack notable workloads, but GFXBench Aztec High Offscreen ends up with a 56.8W package figure and 69.80W wall active figure. The GPU block itself is reported to be running at 43W.

Finally, stressing both CPU and GPU at the same time, the SoC goes up to 92W package power and 120W wall active power. That’s quite high, and we haven’t tested how long the machine is able to sustain such loads (it’s highly environment dependent), but it very much appears that the chip and platform don’t have any practical power limit, and just use whatever they need as long as temperatures are in check.

	M1 Max MacBook Pro 16"			Intel i9-11980HK MSI GE76 Raider
	Score	Package Power (W)	Wall Power Total - Idle (W)	Score	Package Power (W)	Wall Power Total - Idle (W)
Idle		0.2	7.2 (Total)		1.08	13.5 (Total)
CB23 ST	1529	11.0	8.7	1604	30.0	43.5
CB23 MT	12375	34.0	39.7	12830	82.6	106.5
502 ST	11.9	11.0	9.5	10.7	25.5	24.5
502 MT	74.6	36.9	44.8	46.2	72.6	109.5
511 ST	10.3	5.5	8.0	10.7	17.6	28.5
511 MT	82.7	40.9	50.8	60.1	79.5	106.5
503 ST	57.3	14.5	16.8	44.2	19.5	31.5
503 MT	295.7	43.9	62.3	60.4	58.3	80.5
Aztec High Off	307fps	56.8	69.8	266fps	35 + 144	200.5
Aztec+511MT		92.0	119.8		78 + 142	256.5

Comparing the M1 Max against the competition, we resorted to Intel’s 11980HK in the MSI GE76 Raider. We also wanted to include a comparison against AMD’s 5980HS, but unfortunately our test machine is dead.

In single-threaded workloads, Apple showcases massive performance and power advantages against Intel’s best CPU. CineBench is one of the rare workloads where Apple’s cores lose out in performance for some reason, but the gap in power usage is even wider there: the M1 Max only uses 8.7W, while the comparable figure on the 11980HK is 43.5W.

In other ST workloads, the M1 Max is further ahead in performance, or at least in a similar range. The performance/W difference here is around 2.5x to 3x in favour of Apple’s silicon.

In multi-threaded tests, the 11980HK is clearly allowed to go to much higher power levels than the M1 Max, reaching package power levels of 80W, for 105-110W active wall power, significantly more than what the MacBook Pro here is drawing. The performance levels of the M1 Max are significantly higher than the Intel chip here, due to the much better scalability of the cores. The perf/W differences here are 4-6x in favour of the M1 Max, all whilst posting significantly better performance, meaning the perf/W at ISO-perf would be even higher than this.
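For readers who want to derive per-workload efficiency figures themselves, the sketch below divides each score by active wall power for both machines and takes the ratio, using the CPU rows of the results table in this section. Choosing wall power rather than package power is an assumption of this sketch, so the ratios it prints will not necessarily line up with the rounded ranges quoted in the text.

```python
# (workload, M1 Max score, M1 Max wall W, 11980HK score, 11980HK wall W)
RESULTS = [
    ("CB23 ST", 1529, 8.7, 1604, 43.5),
    ("CB23 MT", 12375, 39.7, 12830, 106.5),
    ("502 ST", 11.9, 9.5, 10.7, 24.5),
    ("502 MT", 74.6, 44.8, 46.2, 109.5),
    ("511 ST", 10.3, 8.0, 10.7, 28.5),
    ("511 MT", 82.7, 50.8, 60.1, 106.5),
    ("503 ST", 57.3, 16.8, 44.2, 31.5),
    ("503 MT", 295.7, 62.3, 60.4, 80.5),
]


def perf_per_watt_ratio(score_a, watts_a, score_b, watts_b):
    """Return how many times more score-per-watt system A delivers than system B."""
    return (score_a / watts_a) / (score_b / watts_b)


ratios = {
    name: perf_per_watt_ratio(m1_score, m1_w, intel_score, intel_w)
    for name, m1_score, m1_w, intel_score, intel_w in RESULTS
}

for name, ratio in ratios.items():
    print(f"{name:8s} M1 Max perf/W advantage: {ratio:.2f}x")
```

Swapping the wall columns for the reported package power columns gives a second view of the same data, with the caveat noted earlier that the M1 Max package figure may not include the DRAM itself.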

On the GPU side, the GE76 Raider comes with an RTX 3080 mobile. On Aztec High, this uses a total of 200W of power for 266fps, while the M1 Max beats it at 307fps with just 70W wall active power. The package powers for the MSI system are reported at 35+144W.

Finally, the Intel CPU and GeForce GPU go up to 256W power draw when used together, more than double that of the MacBook Pro and its M1 Max SoC.
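The GPU and combined-load figures can be put on the same footing with a few lines of arithmetic. This is a sketch using the Aztec High Offscreen and Aztec+511MT rows of the results table; the dictionary keys are illustrative names, and frames per wall watt is just one possible efficiency metric.

```python
# Frame rate and active wall power (W), from the results table in this section.
m1_max = {"fps": 307, "aztec_wall_w": 69.8, "combined_wall_w": 119.8}
ge76 = {"fps": 266, "aztec_wall_w": 200.5, "combined_wall_w": 256.5}


def fps_per_watt(system):
    """Frames per second delivered per watt of active wall power in Aztec High."""
    return system["fps"] / system["aztec_wall_w"]


# How much more GPU work per watt the MacBook Pro delivers in this one test.
efficiency_ratio = fps_per_watt(m1_max) / fps_per_watt(ge76)

# How much more power the MSI system draws under the combined CPU+GPU load.
combined_power_ratio = ge76["combined_wall_w"] / m1_max["combined_wall_w"]

print(f"Aztec High fps/W advantage for the M1 Max: {efficiency_ratio:.2f}x")
print(f"Combined-load power, GE76 vs MacBook Pro: {combined_power_ratio:.2f}x")
```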

The 11980HK isn’t a very efficient chip, as we noted back in our May review, and AMD’s chips should fare quite a bit better in a comparison; however, Apple Silicon is likely still ahead by extremely comfortable margins.

493 Comments

Ppietra - Thursday, October 28, 2021 - link
Gosh, no! not 2.35m people. You are so obsessed with a GPU having everything on silicon for itself that you fail to see how much more cache resources the SoC has when compared with other processors. Even if there was 1GB of cache you would still be complaining because the CPU can use it. Get some common sense.
richardnpaul - Friday, October 29, 2021 - link
You're wrong. I've shown that your fringe is larger than some countries' populations, and you've dismissed it and pivoted back to another talking point, a point that is a misrepresentation of what I was saying.
I was wondering what the effect was on performance of the CPU and GPU when both are being used and both are using the shared cache simultaneously given that we know that in isolation with just their own cache it improves efficiency. I'm not and haven't been saying it's an actual issue, it's something that could be tested, and also, we have no clue as to whether it's a real world problem or not.

The article was the one that was talking about the GPU and it having access to all the 512bit memory interface, I was challenging that saying that actually the CPU is going to use some of that bandwidth, but the benefit of the design is that when the GPU needs more and the CPU isn't using it it has access to it and vice-versa.

And if you knew anything about common sense you wouldn't say to get some of it. You're rude and dismissive of anyone else who doesn't fit into your world view, you might want to do something about fixing that about yourself; but probably you won't.
Ppietra - Friday, October 29, 2021 - link
No, you haven’t shown anything, because for whatever reason you continue to ignore how big the cache is when compared with anything else out there, and how big the L2 cache is, also when compared with anything else out there - something that they don’t share. Thirdly, if you even tried to pay attention to what was said, you would see that the M1 Max has double the system cache size, and yet not much different CPU performance.
You also continue to ignore that in a game (which is the thing you are obsessing about), CPU and GPU work together. Not having to send instructions to an external GPU, and CPU and GPU being able to work on the same data stored in cache, gives a big performance improvement, it removes bottlenecks. So you obsessing because the CPU can use system cache during a game makes no sense, because the sharing can actually give a boost in game performance.
Fringe cases would never be equivalent to every gamer.
richardnpaul - Friday, October 29, 2021 - link
"continue to ignore how big the cache is when compared with anything else out there"
Like the previously mentioned RX 6800 which has 256MB? I've not mentioned the RX6800 (infinity) cache at all?

The L2 cache is large, but then it doesn't have an L3 cache. This is a balancing act that chip architects engage with all the time. It seems that zen3 and the M1 Max graphs are very similar for latency with full random being a little higher but most everything else looking close enough that I'm not going to stick my neck out and declare either a winner.

"and CPU and GPU being able to work on the same data stored in cache, gives a big performance improvement, it removes bottlenecks"
This is not represented in the benchmarking, which might be because there needs to be some specific optimisation done, or it could be due to something else. I expect the situation to improve though, probably with more focus on the M1 Pro which will carry over to the Max.
Ppietra - Friday, October 29, 2021 - link
You are not going to see something in benchmark that is inherent to how the system works, how it manages memory, there is no off switch. You need to have the knowledge of how things work.
"The L2 cache is large, but then it doesn't have an L3 cache."????????????????
System cache behaves as if it was a L3 cache for the CPU. How can you say that zen 3 and M1 are similar when the M1 Max has 3.5 times the cache size of a laptop Ryzen??? Just the L2 cache is larger than all the cache available in a laptop Ryzen.
"RX 6800 which has 256MB?" A RX6800 isn’t a laptop chip. [" laptop processors " - - it’s there in one of the first comments]
richardnpaul - Saturday, October 30, 2021 - link
This is where you need to look at the latency graphs for M1 Pro/Max and then go and find the Zen 3 article and compare the graphs for yourself. And I haven't been comparing the M1 Max to a laptop Ryzen, I have repeatedly compared it to a single zen3 core complex where they are much closer in terms of total cache. Compare the 5nm M1 Max to the 7nm Zen3 all you like, with its much higher transistor count. You're not talking about the same thing as I was all along.

I have repeatedly compared whatever is the closest comparison, regardless of where it's used, to get a helpful idea of what benefits it could bring. That Apple have managed to do this in a laptop's power budget is, and I'll quote myself here, "a technological marvel". The M1 Pro/Max are combined GPUs and CPUs, which means you can compare them to standalone GPUs and to CPUs. You're the one who can't seem to understand that they both need to stand on their own merits.
Ppietra - Saturday, October 30, 2021 - link
Really!??? You want to compare a laptop processor with desktop chips that can consume 3-4 times more than the whole laptop, and you think that is close? No common sense whatsoever!
But guess what, even then an M1 Max has more cache available than a consumer desktop Ryzen!
The latency graphs are for the CPU (where, by the way, you can actually see differences because of the size of the level 1 and 2 caches, even with desktop Ryzen). They don’t tell you anything when you want to compare the response latency between CPU and GPU, nor about the performance boost from the CPU and GPU being able to process the same data in cache without having to access RAM.
Who said you cannot compare with dedicated GPUs?
richardnpaul - Sunday, October 31, 2021 - link
I'm comparing architectures, not products, that's why it seems to you like this is an "unfair" comparison. I also bear in mind what node the architecture is at, as that makes quite a marked difference due to transistor budget constraints.

Yes M1 Max has more cache, and where you're not using the GPU (a bit difficult as you'll be running an OS which has a GUI, but let's say that that is basically negligible) it should have a reasonable impact on usages which are heavy on memory bandwidth. In fact you can see that in the benchmarks, there are a number of which heavily reward the M1 Max over anything else, not that many in total but certain use cases will see great uplifts, just the same as Milan-X and the equivalent chiplets in Ryzen CPUs which we'll get to see in the next few days will have benefits in certain use cases.

What I was saying way back was: what's the contention there when running a game, how much benefit is the GPU getting, and how much, if any, is the CPU losing when contention starts to happen on the SLC? Caches usually work on some kind of LRU basis, so if two separate things are trying to use the same cache (which can have benefits where they are both using the cache for the same data), both suffer as their older cache data is evicted by the other processor. That should be measurable. Workloads that share the same data, if it's small enough to fit into the 48MB on the Max, should see huge benefits, and yes, one application that has been highlighted has taken advantage of this. But we are yet to see others take this up. AMD, having tried this before, will tell you that if you can't get broad software support it's a dead duck; however, Apple have often made long-term bets and stuck with them over a number of years, which could make the difference.

Apple have approached this in two different ways. They have created a monster APU; AMD's effort was... safe. I think they thought that they could iterate over time to larger, better designs; however, no-one wanted to put that much time and effort into a bet that AMD would deliver in the future when Intel wasn't making similar noises.
They're on a cutting edge node, with a cutting edge design, and there's no other choice for Apple users, sure you can get the original M1 or M1 Pro, but there's no Intel to get in the way and the only downside of the other chips are that they will be slower due to having fewer resources but it's all much the same design.
OreoCookie - Wednesday, October 27, 2021 - link
No, the 24 MB = 2 x 12 MB are the shared L2 caches amongst the performance core clusters, the two efficiency cores share another 4 MB (so the M1 Pro and M1 Max have close to Zen 3 desktop-level L2 caches if you ignore the system level cache). These caches are not shared between CPUs and GPUs at all. Only the system-level cache of yet *another* 48 MB is shared amongst all logic that has access to main memory. Given that the total memory bandwidth is larger than what CPU and GPU need in a worst-case scenario, I fail to see how this is somehow an edge case.

It seems the memory bandwidth is so large that it can accommodate all CPU cores running a memory-intensive workload at full tilt *and* the GPU running a memory-intensive workload with room to spare. Even if you could saturate the memory bandwidth by also using the NPU (ML accelerator) and/or the hardware en/decoder, I think you are really reaching. This would be far beyond the capabilities of any comparable machine. Even much more powerful machines would struggle with such a workload.
richardnpaul - Thursday, October 28, 2021 - link
Yes sorry, I do know that, the 24 in 24/48MB was a reference to the M1 Pro which has half the shared buffer. That shared buffer, I'd need to go back and look at the access times (and compare it to Zen3 desktop) because it's almost on the other side of the chip from the cores.

I do see that they tested a game at 4K, and I know that some games lean more heavily on the onboard RAM on dGPUs, and not all games have specific high resolution 4K textures, so some use more RAM than others. It is also mentioned on the second page that they didn't see anything that pushed the GPU over using 90GB/s of bandwidth, and I don't know if they were measuring during that testing run (I would expect that they were but you know what they say about assumptions :D).

I think that you're right and that the architecture team probably went overboard on the bandwidth anticipating certain edge case scenarios where the system has multiple tasks loading multiple parts of the CPU and we'll see some rebalancing in future designs. I would like to see a game run with or without mods that does stress the GPU memory subsystem (games aren't usually hammering the CPU bandwidth so more should be available to the GPU, which may very well never be able to saturate it by design, but the cache may be saturated). This will also tell us something about longevity of the SoC too.

I don't think that I'm reaching, more that I see systems lasting for 7+ years. As newer generations of hardware move on, usage that was unusual when the hardware was new suddenly becomes commonplace, because hardware is an evolving target over time and sometimes software does actually utilise it. (Sometimes CPU bugs rob you of performance and make your hardware feel slow, other times it's just that software is a bit more demanding now than it was years before when you got it)
