CCX Size

Moving down in node size brings up a number of challenges in the core and beyond. Even disregarding power and frequency, the ability to put structures into silicon and then integrate that silicon into the package, as well as providing power to the right parts of the silicon through the right connections becomes an exercise in itself. AMD gave us some insight into how 7nm changed some of its designs, as well as the packaging challenges therein.

A key metric given up by AMD relates to the core complex: four cores, the associated core structures, and then L2 and L3 caches. With 12nm and the Zen+ core, AMD stated that a single core complex was ~60 square millimeters, which separates into 44mm2 for the cores and 16mm2 for the 8MB of L3 per CCX. Add two of these 60mm2 complexes with a memory controller, PCIe lanes, four IF links, and other IO, and a Zen+ zeppelin die was 213 mm2 in total.

For Zen 2, a single chiplet is 74mm2, of which 31.3 mm2 is a core complex with 16 MB of L3. AMD did not breakdown this 31.3 number into cores and L3, but one might imagine that the L3 might be approaching 50% of that number. The reason the chiplet is so much smaller is that it doesn’t need memory controllers, it only has one IF link, and has no IO, because all of the platform requirements are on the IO die. This allows AMD to make the chiplets extremely compact. However if AMD intends to keep increasing the L3 cache, we might end up with most of the chip as L3.

Overall however, AMD has stated that the CCX (cores plus L3) has decreased in size by 47%. That is showing great scaling, especially if the +15% raw instruction throughput and increased frequency comes into play. Performance per mm2 is going to be a very exciting metric.

Packaging

With Matisse staying in the AM4 socket, and Rome in the EPYC socket, AMD stated that they had to make some bets on its packaging technology in order to maintain compatibility. Invariably some of these bets end up being tradeoffs for continual support, however AMD believes that the extra effort has been worth the continued compatibility.

One of the key points AMD spoke about with relation to packaging is how each of the silicon dies are attached to the package. In order to enable a pin-grid array desktop processor, the silicon has to be affixed to the processor in a BGA fashion. AMD stated that due to the 7nm process, the bump pitch (the distance between the solder balls on the silicon die and package) reduced from 150 microns on 12nm to 130 microns on 7nm. This doesn’t sound like much, however AMD stated that there are only two vendors in the world with technology sufficient to do this. The only alternative would be to have a bigger bit of silicon to support a larger bump pitch, ultimately leading to a lot of empty silicon (or a different design paradigm).

One of the ways in order to enable the tighter bump pitch is to adjust how the bumps are processed on the underside of the die. Normally a solder bump on a package is a blob/ball of lead-free solder, relying on the physics of surface tension and reflow to ensure it is consistent and regular. In order to enable the tighter bump pitches however, AMD had to move to a copper pillar solder bump topology.

In order to enable this feature, copper is epitaxially deposited within a mask in order to create a ‘stand’ on which the reflow solder sits. Due to the diameter of the pillar, less solder mask is needed and it creates a smaller solder radius. AMD also came across another issue, due to its dual die design inside Matisse: if the IO die uses standard solder bump masks, and the chiplets use copper pillars, there needs to be a level of height consistency for integrated heat spreaders. For the smaller copper pillars, this means managing the level of copper pillar growth.

AMD explained that it was actually easier to manage this connection implementation than it would be to build different height heatspreaders, as the stamping process used for heatspreaders would not enable such a low tolerance. AMD expects all of its 7nm designs in the future to use the copper pillar implementation.

Routing

Beyond just putting the silicon onto the organic substrate, that substrate has to manage connections between the die and externally to the die. AMD had to increase the number of substrate layers in the package to 12 for Matisse in order to handle the extra routing (no word on how many layers are required in Rome, perhaps 14). This also becomes somewhat complicated for single core chiplet and dual core chiplet processors, especially when testing the silicon before placing it onto the package.

From the diagram we can clearly see the IF links from the two chiplets going to the IO die, with the IO die also handling the memory controllers and what looks like power plane duties as well. There are no in-package links between the chiplets, in case anyone was still wondering: the chiplets have no way of direct communication – all communication between chiplets is handled through the IO die.

AMD stated that with this layout they also had to be mindful of how the processor was placed in the system, as well as cooling and memory layout. Also, when it comes to faster memory support, or the tighter tolerances of PCIe 4.0, all of this also needs to be taken into consideration as provide the optimal path for signaling without interference from other traces and other routing.

New Instructions: Cache and Memory Bandwidth QoS Control AMD Zen 2 Microarchitecture Overview: The Quick Analysis
Comments Locked

216 Comments

View All Comments

  • eek2121 - Wednesday, June 19, 2019 - link

    I think what people are getting at is having an L4 Cache. Such a cache would be slower than L3, but would be much faster than DRAM (for now, DDR 5133 was recently demonstrated, that is 2566 MHz double data rate). HBM2 is a prime candidate for that because you can stick 8 Gb on a CPU for $60 and with some engineering work, it would help performance massively. 8 gb could hold practically everything needed in cache. That being said, there are engineering challenges to overcome and I doubt this will ever be a thing.

    Once JEDEC approves RAM running at DDR 5600 at reasonable timings it won’t matter anyway. AMD can simply bump up the IF speed to 1:1 and with shortened RAM traces, performance penalties can be minimized.
  • jamescox - Saturday, June 22, 2019 - link

    For an interposer based Epyc package for the next generation, I would expect perhaps they do an active interposer with all of the external interface transistors in the interposer. They could do similar things with a passive interposer also though. The passive interposer could be an intermediate between Zen 3 and Zen 4. Then they could place a large number of 7 nm+ chiplets on the interposer. As I said, it is hard to speculate, but an option that I thought of based on AdoredTV 15 chiplet rumor would be to have 4 memory controller chips, each one running 2 channels (128-bit) DDR5. Those chips would just be the memory controller logic if on an active interposer and the interfaces to the interposer connections. That isn’t much so at 7 nm and below, they could place massive L4 SRAM caches on the memory controller chips. Current ~75 square mm Zen 2 chiplets have 16 MB plus 8 cpu cores, so it could be a large amount of cache; perhaps something like 64 or 128 MB per chip. It wouldn’t be a cheap device, but AMD’s goal is to get into the high end market eventually.

    The other chiplets could be 1 or two die to manage connections out to the cpu chiplets. This would just be the logic with an active interposer. With a regular interposer, it would need to have the IO transistors also, but the interfaces are quite small. A single infinity fabric switch chip handling all cpu chiplets could provide very low latency. They may have another chip with a switch to tie everything together or they could actually place a couple cpu chiplets on the interposer. Two extra cpu chiplets or one 16 core chiplet could be where the 80 core rumor came from. A possible reason to do that is to allow an HBM based gpu to be mounted on either side. That would make an exceptional HPC product with 16 cores (possible 64 threads if they go to 4 way SMT) and 2 HBM gpus. Another way to get 80 core would be to just make a 3 CCX chiplet with 12 cores. It looks like the Epyc package will not fit all 12 core die though. A mixture of 4 12-core and 4 8-core looks like it would fit, but it wouldn’t be symmetric though. That would allow a quick Zen 2+ style upgrade. Desktop might be able to go to 24 cores and Epyc to 80. The confusion could be mixing up a Zen 2+ rumor and a Zen 3 rumor or something like that. The interposer makes a lot of sense for the giant IO die that cannot be easily implemented at 7 nm. The yields probably don’t support that large of die, so you use an interposer and make a bunch of 100 square mm sized die instead.

    I can’t rule out placing HBM on an IO interposer, but due to the latency not really being that much better than off package DRAM, especially at DDR5 speeds, it just doesn’t seem like they would do it.
  • nandnandnand - Sunday, July 7, 2019 - link

    "That being said, there are engineering challenges to overcome and I doubt this will ever be a thing."

    Putting large amounts of DRAM ever closer to the CPU will definitely be a thing:

    https://www.darpa.mil/attachments/3DSoCProposersDa...

    Intel is already moving in this direction with Foveros, and AMD is also working on it:

    https://www.tomshardware.com/news/amd-3d-memory-st...

    It doesn't matter how fast DDR5 is. The industry must move in this direction to grab performance and power efficiency gains.
  • AdrianMel - Sunday, June 16, 2019 - link

    I would like these AMD chips to be used on laptops. It would be a breakthrough in computing power, low consumption. I think that if a HBM2 memory or a larger memory is integrated into the processor, I think it will double the computing power. It would be a study and implementation of 2 super ports, the old expresscard 54 in which we can insert 2 video cards in laptops
  • nandnandnand - Sunday, July 7, 2019 - link

    AMD needs to put out some 6-8 core Zen 2 laptop chips.
  • peevee - Monday, June 17, 2019 - link

    Does it mean that AVX2 performance doubles compared to Zen+? At least on workloads where data for the inner loop fits into L1D$ (hierarchical dense matrix multiplication etc)?
  • peevee - Monday, June 17, 2019 - link

    "AMD manages its L3 by sharing a 16MB block per CCX, rather than enabling access to any L3 from any core."

    Does it mean that for code and shared data caches, 64MB L3 on Ryzen 9 behaves essentially like 16MB cache (say, all 12/16 cores run the same code as it usually is in performance-critical client code and not 4+ different processes/VMs in parallel)? What a waste it is/would be...
  • jamescox - Saturday, June 22, 2019 - link

    The caches on different CCXs can communicate with each other. In Zen 2, those one the same die probably communicate at core clock rather than at memory clock; there is no memory clock on the cpu chiplet. The speeds between chiplets have essentially more than doubled the clocks vs. Zen 1 and there is a possibility that they doubled the widths also. There just about isn’t any way to scale to such core counts otherwise.

    An intel monolithic high core count device will have trouble competing. The latency of their mesh network will go up with more cores and it will burn a lot of power. The latency of the L3 with a mesh network will be higher than the latency within a 4-core CCX. Problems with the CCX architecture are mostly due to OS scheduler issues and badly written multithreaded code. Many applications performed significantly better on Linux compared to windows due to this.

    The mesh network is also not workable across multiple chiplets. A 16-core (or even a 10 core) monolithic device would be quite large for 10 nm. They would be wasting a bunch of expensive 10 nm capacity on IO. With the large die size and questionable yields, it will be a much more expensive chip than AMD’s MCM. Also, current Intel chips top out at 38.5 MB of L3 cache on 14 nm. Those are mostly expensive Xeon processors. AMD will have a 32 MB part for $200 and a 64 MB part for $500. Even when Intel actually gets a 10 nm part on the desktop, it will likely be much more expensive. They are also going to have serious problems getting their 10 nm parts up to competitive clock speeds with the 14 nm parts. They have been tweaking 14 nm for something like 5+ years now. Pushing the clock on their problematic 10 nm process doesn’t sound promising.
  • peevee - Monday, June 17, 2019 - link

    "One of the features of IF2 is that the clock has been decoupled from the main DRAM clock....

    For Zen 2, AMD has introduced ratios to the IF2, enabling a 1:1 normal ratio or a 2:1 ratio that reduces the IF2 clock in half."

    I have news for you - 2:1 is still COUPLED. False advertisement in the slides.

    And besides, who in their right mind would want to halve IF clock to go from DDR3200 to even DDR4000 (with requisite higher timings)?
  • BMNify - Saturday, June 22, 2019 - link

    the only real world test that matters in the UHD2/8K Rec. 2020/BT.2020 LIVE NHK/bbc broadast of the 2020 Summer Olympics will begin on Friday, 24 July and related video streams is can AMD Zen 2 do it can any pc core do realtime x264/x265/ffmpeg software encoding and x264/x265 compliant decoding (notice how many hw assisted encoders today dont decode to spec as seen when you re-enode them with the latest ffmpeg), how many 8k encodes and what overheads are remaining if any can even do one...

Log in

Don't have an account? Sign up now