Hot Chips 2020: Marvell Details ThunderX3 CPUs - Up to 60 Cores Per Die, 96 Dual-Die in 2021

Author: Andrei Frumusanu

by Andrei Frumusanu on August 17, 2020 4:30 PM EST

Today, as part of Hot Chips 2020, Marvell finally revealed some details on its new ThunderX3 server CPUs and their core microarchitecture. The company had announced the existence of the new server and infrastructure processor back in March, and is now able to share more concrete specifications on how its in-house CPU design team aims to distinguish itself in the quickly growing Arm server market.

We reviewed the ThunderX2 back in 2018, when it was still a Cavium product, before the designs and teams were acquired by Marvell only a few months later that year. Since then, the Arm server ecosystem has been jump-started by Arm’s Neoverse N1 CPU core and partner designs such as Amazon’s Graviton2 and Ampere’s Altra. Together with AMD’s successful return to the server market, this makes for a very different landscape.

Marvell started off the Hot Chips presentation with a roadmap of its products, detailing that the ThunderX3 generation isn’t just a single design but a flexible approach using multiple dies: the first-generation 60-core CN110xx SKUs use a single die as a monolithic design in 2020, and next year will see the release of a 96-core dual-die variant aiming for higher performance.

The use of a dual-die approach like this is very interesting, as it represents a mid-point between a completely monolithic design and a chiplet approach from vendors such as AMD. Each die here is identical, in the sense that it can also be used independently as a standalone product.

From an SoC perspective, the ThunderX3 die scales up to 60 cores, with the 2-die variant scaling up to 96. The first question that comes to mind when seeing these figures is why the 2-die variant doesn’t scale up to the full 120 cores. Marvell didn’t cover this during the talk, but there were a few clues in the presentation.

Marvell claimed a performance improvement of 2-3x over the ThunderX2 at equal power levels. The latter had a TDP of 180W; if the TX3 maintains this thermal envelope, a full 120-core dual-die design would have had to grow its TDP to up to 360W, which is far beyond what one can air-cool in a typical server form factor and rack in terms of power density. Assuming just a linear cut-down to 96 cores as advertised, we’d end up at around 288W, which is more in line with current high-end server CPU deployments without water cooling. Of course, this is all our own analysis and take on the matter.
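The estimate above is simple enough to reproduce in a few lines. This is a sketch of our own arithmetic only: it assumes a 180W single-die TDP (unconfirmed by Marvell) and strictly linear scaling of power with core count, and the function name is purely illustrative.

```python
def linear_tdp_estimate(single_die_tdp_w: float, single_die_cores: int,
                        target_cores: int) -> float:
    """Scale TDP linearly with core count (an assumption, not a Marvell figure)."""
    watts_per_core = single_die_tdp_w / single_die_cores
    return watts_per_core * target_cores


ASSUMED_SINGLE_DIE_TDP_W = 180.0  # carried over from the ThunderX2, unconfirmed for the TX3
CORES_PER_DIE = 60

full_dual_die_w = linear_tdp_estimate(ASSUMED_SINGLE_DIE_TDP_W, CORES_PER_DIE, 120)
cut_down_dual_die_w = linear_tdp_estimate(ASSUMED_SINGLE_DIE_TDP_W, CORES_PER_DIE, 96)

print(f"120 cores: {full_dual_die_w:.0f} W")
print(f"96 cores:  {cut_down_dual_die_w:.0f} W")
```

Real silicon rarely scales this cleanly, since uncore power, binning and clock targets all shift the result, so the output is best read as a rough bound rather than a prediction.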

A single die supports 8 channels of DDR4-3200, which is standard for this generation of server products and essentially in line with everybody else in the market. I/O-wise, we see a disclosure of 64 lanes of PCIe 4.0, which is again in line with some competitors but half of what higher-end alternatives from Ampere or AMD can achieve.

One big unknown right now is how the dual-die product will segment the I/O and memory controllers: whether this is going to be a 50-50 split of resources between the two dies, whether we’ll see an imbalanced setup, or whether the platform can actually handle the full resources from each die and transform itself into a 16-channel, 128-lane beast.
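To put these figures in perspective, the peak theoretical bandwidth of each interface can be worked out from the standard DDR4 and PCIe 4.0 signalling rates. The sketch below is our own arithmetic rather than anything Marvell disclosed: it assumes a 64-bit data bus per DDR4 channel and 16 GT/s per PCIe 4.0 lane with 128b/130b encoding, and the function names are illustrative. The 16-channel, 128-lane case is the hypothetical full dual-die configuration, not a confirmed product.

```python
# Peak theoretical bandwidth from standard signalling rates (not measured figures).
DDR4_3200_TRANSFERS_PER_SEC = 3200e6   # 3200 MT/s
DDR4_BYTES_PER_TRANSFER = 8            # 64-bit data bus per channel
PCIE4_RAW_BITS_PER_SEC = 16e9          # 16 GT/s per lane, per direction
PCIE4_ENCODING_EFFICIENCY = 128 / 130  # 128b/130b line encoding


def dram_bandwidth_gbs(channels: int) -> float:
    """Peak DRAM bandwidth in GB/s for a given number of DDR4-3200 channels."""
    return channels * DDR4_3200_TRANSFERS_PER_SEC * DDR4_BYTES_PER_TRANSFER / 1e9


def pcie4_bandwidth_gbs(lanes: int) -> float:
    """Peak PCIe 4.0 bandwidth in GB/s, per direction, after encoding overhead."""
    bits_per_sec = lanes * PCIE4_RAW_BITS_PER_SEC * PCIE4_ENCODING_EFFICIENCY
    return bits_per_sec / 8 / 1e9


for label, channels, lanes in [("1 die", 8, 64), ("2 dies (hypothetical)", 16, 128)]:
    print(f"{label}: DRAM {dram_bandwidth_gbs(channels):.1f} GB/s, "
          f"PCIe {pcie4_bandwidth_gbs(lanes):.1f} GB/s per direction")
```

For a single die this works out to roughly 205GB/s of DRAM bandwidth and about 126GB/s of PCIe bandwidth in each direction.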

Comparison of Major Arm Server CPUs
	Marvell ThunderX3 110xx	Cavium ThunderX2 9980-2200	Ampere Altra Q80-33	Amazon Graviton2
Process Technology	TSMC 7 nm	TSMC 16 nm	TSMC 7 nm	TSMC 7 nm
Die Type	Monolithic or Dual-Die MCM	Monolithic	Monolithic	Monolithic
Micro-architecture	Triton	Vulcan	Neoverse N1 (Ares)	Neoverse N1 (Ares)
Cores	60 (1 Die) / 96 (2 Die), Switched 3x Ring	32, Ring bus	80, Mesh	64, Mesh
Threads	240 (1 Die) / 384 (2 Die)	128	80	64
Max. number of sockets	2	2	2	1
Base Frequency	?	2.2 GHz	-	-
Turbo Frequency	3.1 GHz	2.5 GHz	3.3 GHz	2.5 GHz
L3 Cache	90 MB	32 MB	32 MB	32 MB
DRAM	8-Channel DDR4-3200	8-Channel DDR4-2667	8-Channel DDR4-3200	8-Channel DDR4-3200
PCIe lanes	4.0 x 64 (1 Die)	3.0 x 56	4.0 x 128	4.0 x 64
TDP	~180 W (1 Die) (unconfirmed)	180 W	250 W	~110-130 W (unconfirmed)

On paper at least, the ThunderX3 seems quite similar to Amazon’s Graviton2, as they both share a similar number of CPU cores and similar memory and I/O configurations. The bigger difference that one can immediately point out is that the ThunderX3 employs SMT4 in its CPU cores and thus supports up to 240 threads per die. There’s also a TDP difference, but I attribute this to the Graviton2 being conservative with its clock frequencies, whilst Ampere’s SKUs are more in line with the ThunderX3, with the 64-core 3.0GHz 180W Q64-30 being the closest match in specifications.

Another thing that stands out for the ThunderX3 is the 90MB of L3 cache that dwarfs the 32MB of the previous generation as well as the 32MB configurations of Ampere and Amazon.
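A quick way to see how large that gap is: divide each chip’s L3 capacity by its core count, using the figures from the comparison table. This is a plain capacity ratio and nothing more. All four designs share their L3 between cores, so it says nothing about how much cache a given core can actually use.

```python
# (cores, L3 capacity in MB), taken from the comparison table above
chips = {
    "ThunderX3 (1 die)": (60, 90),
    "ThunderX2": (32, 32),
    "Altra Q80-33": (80, 32),
    "Graviton2": (64, 32),
}

# Simple capacity ratio: total L3 divided by core count
l3_per_core_mb = {name: l3_mb / cores for name, (cores, l3_mb) in chips.items()}

for name, mb in l3_per_core_mb.items():
    print(f"{name}: {mb:.2f} MB of L3 per core")
```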

Marvell here opted to evolve its own interconnect microarchitecture, which has moved from a simple ring design to a switched ring with three sub-rings, or columns. Ring stops consist of CPU tiles with 4 cores and two L3 slices of 3MB each. This gives a full die 15 ring stops (3 columns of 5), and with them the full 60 cores and 90MB of total L3 cache, which is quite a respectable amount.
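The die totals follow directly from the tile layout. The sketch below just multiplies out the figures Marvell gave; the class and field names are our own and purely illustrative.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DieLayout:
    """Tile layout of one ThunderX3 die, using the figures from the presentation."""
    rings: int = 3               # sub-rings, or columns
    stops_per_ring: int = 5      # ring stops (tiles) per sub-ring
    cores_per_tile: int = 4
    l3_slices_per_tile: int = 2
    l3_slice_mb: int = 3
    threads_per_core: int = 4    # SMT4

    @property
    def tiles(self) -> int:
        return self.rings * self.stops_per_ring

    @property
    def cores(self) -> int:
        return self.tiles * self.cores_per_tile

    @property
    def threads(self) -> int:
        return self.cores * self.threads_per_core

    @property
    def l3_mb(self) -> int:
        return self.tiles * self.l3_slices_per_tile * self.l3_slice_mb


die = DieLayout()
print(f"{die.tiles} tiles, {die.cores} cores, {die.threads} threads, {die.l3_mb} MB L3")
```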

In the Q&A session, Marvell disclosed that its rationale for a switched ring topology over a single ring or a mesh design was that a single ring wouldn’t have been able to scale up in performance and bandwidth at higher core counts. A mesh design would have been a big change, and it would have required a reduction in core count. A switched ring represented a good trade-off between the two architectures. Indeed, if this is what enabled Marvell to include up to 3x the cache versus its nearest competitors, it seems to have been a good choice.

One odd thing I noted is that the system still uses a snoop-based coherency algorithm, in contrast to the directory-based systems used elsewhere in the industry. This might reduce implementation complexity and area, but it might lag behind in terms of power efficiency and coherency traffic for the chip.
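The traffic concern can be illustrated with a deliberately simplified toy model. This is not Marvell’s protocol, whose details were not disclosed. It assumes a pure broadcast snoop with no snoop filter, an idealised directory that tracks exact sharers, and one caching agent per ring stop; all names are illustrative.

```python
def snoop_probes(num_agents: int) -> int:
    """Broadcast snooping: a miss probes every other caching agent."""
    return num_agents - 1


def directory_probes(num_sharers: int) -> int:
    """Directory: one lookup at the home node, then probes only to actual sharers."""
    return 1 + num_sharers


AGENTS = 15  # assumption: treat each ring stop (tile) as one caching agent

for sharers in (0, 1, 4):
    print(f"sharers={sharers}: snoop={snoop_probes(AGENTS)} messages, "
          f"directory={directory_probes(sharers)} messages")
```

In this model the snoop cost grows with the number of agents regardless of how widely a line is shared, while the directory cost tracks only the actual sharers. Real designs narrow the gap with snoop filters and similar techniques.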

The memory controllers tap into the rings, and Marvell’s inter-socket/die CCPI3 interface here serves up to 84GB/s of bandwidth.

27 Comments

Tomatotech - Monday, August 17, 2020 - link
SMT4 is interesting, 60% speedup from 5% area. Makes me think SMT8 might just be worth exploring in the next generation, even if it is only 30% speedup from 10% area.

It’s been a while since I studied this, but saying it has 90MB L3 cache might be regarded as misleading. If you try stuffing your 50MB of code and data into cache, it’s going to fall over quite spectacularly. Each 4-core tile has only 6MB cache, so you have to keep your code and data to only 6MB or less if you want it to fit into cache. (Or fiddle around with splitting it between tiles). Better to say it has 15x6MB L3 cache (Still damn good) and leave it at that?

saratoga4 - Monday, August 17, 2020 - link
>Each 4-core tile has only 6MB cache

According to the slides linked above the L3 cache lines have no affinity for any specific core, so it really is a single 90MB L3, similar to how Intel's ring bus parts work.

Tomatotech - Monday, August 17, 2020 - link
You have to be careful about the marketing wording. What you said is what they want you to think - that any core can use any cache equally well - but a close reading indicates it could mean just 'any of the 4 cores in a single tile can use any of the 2x3MB caches associated with that tile'. Hence my use of 6MB.

Even worse - a single core might not be able to use all of both on-tile caches at the same time, so the real max cache for a core might be as low as 3MB. This chip is designed to be a multi-thread monster with up to 240 threads per die, each thread accessing its own part of cache. Not to have a tiny number of threads using up all the cache. That's a different layout.

saratoga4 - Monday, August 17, 2020 - link
>but a close reading indicates it could mean just 'any of the 4 cores in a single tile can use any of the 2x3MB caches associated with that tile'.

The slides say that individual tiles are striped, so I don't think your reading is correct.

Krysto - Tuesday, August 18, 2020 - link
Power10 is already doing SMT8.

GreenReaper - Wednesday, August 19, 2020 - link
It talks about the increase in area, but not power usage. Moreover you risk effectively reducing single-threaded speed unless you upgrade every other bit (and probably even then). If you limit them further it starts to look more and more like a GPU.

I think we need more evidence that people can use SMT4 effectively, as they have done eventually with Quad-Core and beyond, before it is increased further.

senttoschool - Tuesday, August 18, 2020 - link
Both AMD and Intel are in trouble. The x86 market is shrinking fast.

AWS is deadset on transitioning to its own ARM server chips so they can reduce cost, build chips for their own needs, and have unique features. Google Cloud and Microsoft Azure will probably follow shortly in order to compete.

With Apple's transition to ARM, the x86 market shrank by 10% overnight. In addition, MacOS ARM will spur a renewed push for developers to optimize for ARM (Windows or Mac) on the laptop/desktop. This means Windows ARM will probably be an extremely viable option in the near future.

The x86 market isn't dead. But it's shrinking. AMD and Intel aren't just competing against each other, they're also competing with Apple, Qualcomm, Marvell, ARM, Nuvia, Ampere, Samsung, Huawei, Alibaba, etc.

Quantumz0d - Tuesday, August 18, 2020 - link
LOL

ARM is always stupid custom, Apple trash is the least bothered part for Apple themselves as their Mac share of revenue is under 10%, that's why Apple did the move because instead of paying Intel and getting caught in the cheap VRM trash news they would benefit from the new hybrid OS of Mac which is an abomination to desktop UX. Apple transition shranked x86 marketshare overnight ? hahah, like Apple said themselves it's going to be a 2 year period so the products are still there. By that time you think AMD and Intel both DC centric companies sit and lay eggs ? ARM is dogshit, it's always fucking custom the massive support of x86 in Linux space is not there for ARM so transitioning to that is not possible at this point of time, x86 already is a RISC underneath it, so your yapping of Intel and AMD in trouble ain't true. They sure have to make sure they are innovating, Intel is going Big Little for the mobile space, AMD patented the same and they will go the same, and with massive Windows ecosystem this ARM bullshit failed on RT and x86 to ARM translation is not for free, you get performance penalty. Apple pays top companies like Adobe to make their Software like first party IP, it's not a major market.

AT spec scores out of that graphs are meaningless. When the work done by Snapdragon processors is equivalent to what A series do, just because the OS feels snappy due to tight integration and closed binaries is not a measure but real life work performance.

All those companies, none of them make Datacenter processors except Marvell and with Apple being only consumer centric company what the fuck are you blabbing here ? Datacenter market is owned by x86 not ARM, and Qualcomm left after pouring billions in their Centriq prized custom ARM IP where Cloudflare was heralding just like how again CF is now doing it again with Altera. AWS is going Graviton because it's always Amazon's one of their own type in house stuff to save money and it's not going to beat AMD which is the king of the DC with EPYC 7742. Icelake SP is also coming and Intel already probably locked out vendors from moving to AMD by now forget ARM.

ARM is again, full custom. Qualcomm is the only one which properly respects the OSS by CAF and Binaries, Blobs etc on Android space. Samsung is a failure, Huawei is utter trash in that dept. Nuvia is vaporware until real product and that's only for DC market, not Smartphones or ultra portables, which is where lot of money is there to start up quick. And on Windows it's utter trash and same for Linux, only low level workloads. Name one ARM processor based machine which can run anything like x86 Wintel machines ?

HVAC - Tuesday, August 18, 2020 - link
Utter ... trash ...

Greys - Wednesday, August 26, 2020 - link
I am agree with you!
