Ponte Vecchio: The Old Bridge in the Land of Gelato

With the disclosures about Xe-HPC and the new compute unit and memory fabric in mind, Intel went on to the big money announcement: the first Xe-HPC GPU will be a design called Ponte Vecchio (PVC).

If that looks like a weird GPU to you, then I agree. What we have are two segments, each with eight compute chiplets, a number of XeMF chips, and HBM memory. I do want to step back a second to talk about some of the technologies around PVC before going into what we see here.

Earlier in 2019, I was offered a chance to speak to Intel’s Ramune Nagisetty. She’s in charge of Intel’s plans regarding chiplet and connectivity technologies: she’s the person to speak to when it comes to connecting chiplets with technologies like the Embedded Multi-Die Interconnect Bridge (EMIB), but also when it comes to new stacking technologies like Foveros. We’ve already seen EMIB in a number of products, such as Intel’s FPGA family and consumer hardware like Kaby Lake-G. Foveros, meanwhile, is getting its first limelight in the new Lakefield chip for small form factor client notebooks, which stacks a 14nm IO die with a 10nm compute die, all with a DRAM PoP on top.

That discussion confirmed what Raja had said at the Architecture Day a few months prior: both EMIB and Foveros will be coming to Intel’s new family of GPUs. Of course, EMIB and Foveros don’t come cheap: trying to align vias between silicon is best done when sober and with a clear head, and even then there are always optimizations to be made. In that regard, we had assumed EMIB and Foveros would be going on the high-end discrete graphics solutions, what we now know to be Xe-HP. But it would appear that these technologies are coming to Xe-HPC as a priority, and Ponte Vecchio seems to be the first chip that will have both.

It is worth noting that technologies like EMIB and Foveros are going to define Intel’s future. Not only in GPUs, but also CPUs, FPGAs, AI, and everything else.

This slide shows that both technologies will be on Ponte Vecchio, but it actually says a lot more than it first appears to.

One of the questions with EMIB and Foveros is what form they will take. There are many possibilities for both, and this slide might have exposed a few. Note that this slide only shows half of a Ponte Vecchio, so I want to do a custom diagram for you to show what I’m seeing here.

Here I’ve expanded out to a full PVC, with 16 Xe-HPC chiplets, 8 XeMF chips, and each half of the PVC has access to four HBM stacks. Each of the HBM stacks is connected via EMIB, along with what I think is an IO chip (perhaps built on 14nm).

From the official slide above, you will notice that the group of eight Xe-HPC chiplets and four XeMF chips seems to be on a separate package altogether, with a unified BGA connection down onto another tier (the mid-tier) of packaging below. That slide also appears to show some chips on that mid-tier package; on it we also get the EMIB connections and what I believe is an XeMF scaling chip for off-GPU connections. The HBM2 and what I think is the IO chip seem to be on a bottom tier of packaging, which means we have three tiers of packaging for this individual GPU: four if you include the Foveros stacking on each of the Xe-HPC chiplets!

Now normally when we think of tiered packaging like this, we expect it to be BGA on BGA, although part of this design could use an interposer. If Intel were making this at TSMC, I might use the word ‘CoWoS’ to describe the stacking technology in play. At the minute it’s all a lot of guesswork, assuming the diagram is accurate in the first place.
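Pulling that guesswork together, here is the layout I think the diagram is showing, written out as a toy structure. Every count here, and the IO die, is my own reading of the slide and my expanded drawing rather than anything Intel has confirmed:

```
// A toy summary of the speculated Ponte Vecchio layout. All counts are my
// interpretation of Intel's slide, not confirmed specifications.
#include <cstdio>

struct Slice {
    int xe_hpc_chiplets = 8;   // each one a Foveros stack of two dies
    int xemf_chips      = 4;   // memory fabric dies sitting between chiplet pairs
    int hbm_stacks      = 4;   // attached over EMIB
    int scaling_xemf    = 1;   // the extra fabric chip I suspect handles off-GPU links
    int io_chips        = 1;   // the chip I'm guessing is a 14nm-class IO die (per half, in my reading)
};

int main() {
    Slice slices[2];           // two slices per Ponte Vecchio in the diagram
    int chiplets = 0, xemf = 0, hbm = 0;
    for (const Slice& s : slices) {
        chiplets += s.xe_hpc_chiplets;
        xemf     += s.xemf_chips;
        hbm      += s.hbm_stacks;
    }
    printf("Xe-HPC chiplets: %d, XeMF chips: %d, HBM stacks: %d\n", chiplets, xemf, hbm);
    return 0;
}
```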

I’ve labelled each of the Xe-HPC chiplets as having Foveros, as we believe these to be two dies stacked. A lot of commentary around this stacking asks what exactly is on each of the two stacked dies – is one IO and the other compute, like in Lakefield, or is it something else? A semi-reasonable conclusion is that one die might be cache and the other compute. However, GPUs tend to have local cache right next to the compute, so in actual fact I expect both of the dies in the Foveros stack to be compute dies. It may be that one is for SIMD and the other is for SIMT, but if the two were identical and modular, that would make them much easier to manufacture.
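To make the SIMD/SIMT distinction concrete, here is a minimal sketch of the same vector add written both ways, using generic CUDA for the SIMT version and AVX intrinsics for the SIMD version. This is purely illustrative and says nothing about how Xe’s actual SIMD and SIMT units are built:

```
// Generic SIMD vs SIMT illustration only - not Intel's Xe ISA.
// Build (example): nvcc -O2 -Xcompiler -mavx vadd.cu
#include <cstdio>
#include <immintrin.h>   // host-side AVX intrinsics for the SIMD version

// SIMT: one instruction stream, many hardware threads; each thread owns one element.
__global__ void add_simt(const float* a, const float* b, float* c, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) c[i] = a[i] + b[i];                       // same instruction, per-thread data
}

// SIMD: a single thread issues one instruction that covers a whole vector register
// (eight packed floats per AVX instruction here).
void add_simd(const float* a, const float* b, float* c, int n) {
    int i = 0;
    for (; i + 8 <= n; i += 8) {
        __m256 va = _mm256_loadu_ps(a + i);
        __m256 vb = _mm256_loadu_ps(b + i);
        _mm256_storeu_ps(c + i, _mm256_add_ps(va, vb));  // 8 adds in one instruction
    }
    for (; i < n; ++i) c[i] = a[i] + b[i];               // scalar tail
}

int main() {
    const int n = 1 << 20;
    float *a, *b, *c;
    cudaMallocManaged(&a, n * sizeof(float));
    cudaMallocManaged(&b, n * sizeof(float));
    cudaMallocManaged(&c, n * sizeof(float));
    for (int i = 0; i < n; ++i) { a[i] = 1.0f; b[i] = 2.0f; }

    add_simt<<<(n + 255) / 256, 256>>>(a, b, c, n);      // thousands of threads in flight
    cudaDeviceSynchronize();
    add_simd(a, b, c, n);                                // one thread, wide registers

    printf("c[0] = %.1f\n", c[0]);
    cudaFree(a); cudaFree(b); cudaFree(c);
    return 0;
}
```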

For those of you wondering what those metal brackets are around each of the Xe-HPC dies, they are likely to be shims to help correct for different chip heights. I should point out that the die inside the shim, the stacked Xe-HPC die, is actually smaller in the diagram than the XeMF die between each pair of chiplets. This could be down to all the cache on the XeMF dies, or it could be because they are built on a different process node.

Also, it is worth noting that with Ponte Vecchio we have two ‘slices’. From the diagram, it looks like each slice has additional XeMF connection points, which could be used to go off-GPU, or so that more slices (up to four) could be connected together. Some of that we might address when we look at the Aurora system and Intel’s proposed layout of one of its servers.

Intel promises a couple of other PVC features. Not only will the hardware support ECC and parity checks across all memory and caches, but there will be Xeon-class RAS features and something called ‘in-field repair’. More details to come.

Also, a final word on the name: Ponte Vecchio. Intel has plenty of Lakes, Coves, Rapids, and other things. Apparently having a codename based on a physical place helps sidestep potential legal issues without having to do the due diligence required for a proper marketing name. Raja Koduri said that he chose it because the bridge is in Florence, which has good gelato. He likes gelato, and should Intel decide to do a launch event with Ponte Vecchio in mind, Raja gets his gelato. I went to Florence a couple of years ago – the gelato was good.

Comments

  • peevee - Monday, December 30, 2019 - link

    "Xe contains two fundamental units: SIMT and SIMD. In essence, SIMD (single instruction, multiple data) is CPU like and can be performed on single elements with multiple data sources, while SIMT (single instruction, multiple threads) involves using the same instructions on blocks of data"

    That phrase makes absolutely no sense. "CPU-like" SIMD executes the same instruction on multiple data elements, not on "single elements".
  • peevee - Monday, December 30, 2019 - link

    What the H Lenovo, a Chinese company, is doing developing a critical tool for top-secret projects within DoE?
  • henryiv - Thursday, January 2, 2020 - link

    Thanks for the great article. DPC++ stands for data-parallel c++ btw (which is basically SYCL implementation of Intel).
  • Deicidium369 - Wednesday, January 27, 2021 - link

    Xe HP was shown with 4 tiles and 42 TFLOPS so each tile = 10.5 TFLOPS at FP32 or half of that for FP64. Assuming FP64 is the most likely

    Xe HPC has 16 Tiles x 5.25 TFLOPS per tile = 84 TFLOPS per Xe HPC. There are 6 Xe HPC per sled = 504 TFLOPS per sled or roughly 0.5 PFLOPS - so ~2000 sleds needed for 1 ExaFLOP FP64.

    2000 sleds - 20 sleds per rack = 100 racks at FP64

    230 Petabytes of storage at the densest config 1U = 1PB so 230 1U 1PB - 230 U = less than 6 racks...

    Even if using 2.5" would not need more than 20 racks for storage

    So 100 rack cabinets of Compute + 20 rack cabinets to reach 1 ExaFLOP and 230PB - Networking could be 1-2 racks - not sure the water cooling components are in standalone racks or not. So 122 Cabinets + ??? for cooling.
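Given its own assumptions, the commenter’s arithmetic above holds up. Here is the same back-of-the-envelope estimate as a short host-side program, where every input (TFLOPS per tile, tiles per GPU, GPUs per sled, sleds per rack, storage density) is the commenter’s assumption rather than a confirmed Aurora figure:

```
// Reproduces the commenter's back-of-the-envelope estimate. All inputs are
// the commenter's assumptions, not confirmed Aurora specifications.
#include <cstdio>
#include <cmath>

int main() {
    const double tflops_per_tile_fp64 = 5.25;   // assumed: half of 10.5 TFLOPS FP32 per tile
    const double tiles_per_gpu        = 16;
    const double gpus_per_sled        = 6;
    const double sleds_per_rack       = 20;

    const double tflops_per_gpu  = tflops_per_tile_fp64 * tiles_per_gpu;   // 84
    const double tflops_per_sled = tflops_per_gpu * gpus_per_sled;         // 504
    const double sleds_needed    = 1.0e6 / tflops_per_sled;                // 1 ExaFLOP = 1e6 TFLOPS
    const double compute_racks   = std::ceil(sleds_needed / sleds_per_rack);

    const double storage_pb    = 230;
    const double storage_u     = storage_pb / 1.0;                         // assumed 1 PB per 1U
    const double storage_racks = std::ceil(storage_u / 42.0);              // assumed 42U racks

    printf("Sleds: %.0f, compute racks: %.0f, storage racks: %.0f\n",
           std::ceil(sleds_needed), compute_racks, storage_racks);         // ~1985, 100, 6
    return 0;
}
```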
