Integer Units, Load and Store

The integer unit schedulers can accept up to six micro-ops per cycle, which feed into the 224-entry reorder buffer (up from 192). The integer unit technically has seven execution ports, comprising four ALUs (arithmetic logic units) and three AGUs (address generation units).

The schedulers comprise four 16-entry ALU queues and one 28-entry AGU queue, although the single AGU queue can issue three micro-ops per cycle into the register file. The AGU queue has increased in size based on AMD’s simulations of instruction distributions in common software. These queues feed into the 180-entry general purpose register file (up from 168), but also keep track of specific ALU operations to prevent potential stalls.

The three AGUs feed into a load/store unit that can support two 256-bit reads and one 256-bit write per cycle. Not all three AGUs are equal, judging by the diagram above: AGU2 can only manage stores, whereas AGU0 and AGU1 can do both loads and stores.

The store queue has increased from 44 to 48 entries, and the TLBs for the data cache have also grown. The key metric here, though, is load/store bandwidth: each load/store data path is now 32 bytes (256 bits) per clock, up from 16 bytes, so the core can sustain two 32-byte loads and one 32-byte store every cycle.
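To put those numbers in context, a simple streaming kernel needs exactly this mix of accesses. Below is a minimal sketch (assuming AVX2 and a suitable compiler flag such as -mavx2) whose inner loop issues two 256-bit loads and one 256-bit store per iteration, which Zen 2's load/store unit can now sustain in a single cycle when the data is resident in L1:

```c
#include <immintrin.h>
#include <stddef.h>

/* c[i] = a[i] + b[i] over n floats (n assumed to be a multiple of 8).
 * Each iteration performs two 256-bit loads and one 256-bit store,
 * matching the two-read/one-write per-cycle capability described above. */
void vec_add(float *c, const float *a, const float *b, size_t n) {
    for (size_t i = 0; i < n; i += 8) {
        __m256 va = _mm256_loadu_ps(a + i);
        __m256 vb = _mm256_loadu_ps(b + i);
        _mm256_storeu_ps(c + i, _mm256_add_ps(va, vb));
    }
}
```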

Comments

  • jamescox - Sunday, June 16, 2019 - link

    Everyone keeps bringing up HBM for cpus as if it is magical in some manner. HBM can provide high bandwidth, but it is still DRAM. The latency isn’t that great, so it isn’t really that useful as a cpu cache. If you are trying to run AVX-512 code across a bunch of CPU cores, then maybe you could use the bandwidth. If you have code that can use that level of parallelism, then it will almost certainly run much more efficiently on an actual gpu. I didn’t think that expanding AVX to 512 bits was a good idea. There isn’t too much difference from a cpu perspective between 1 512-bit instruction and 2 256-bit instructions. The registers are wider, but they could instead expose many more, smaller registers in the ISA by using existing register renaming techniques. At 14 nm, the 512-bit units seem to take too much space and consume too much power. They may be more easily doable at 7 nm or below eventually, but they may still have issues running at cpu core clocks. If you have to run them at half clock (which is about where gpus are vs. cpus), then you have lost the advantage of going double the width anyway. IMO, the AVX-512 instructions were Intel’s failed attempt (Xeon Phi seems to have been a disappointment) at making a cpu act like a gpu. They have basically given that up and are now designing an actual gpu.
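    To make that 1×512-bit vs. 2×256-bit equivalence concrete, a minimal sketch (assuming a compiler with AVX-512F and AVX2 enabled, e.g. -mavx512f and -mavx2):

    ```c
    #include <immintrin.h>

    /* The same 16-float add expressed as one 512-bit operation (AVX-512F)... */
    void add_1x512(float *dst, const float *a, const float *b) {
        __m512 va = _mm512_loadu_ps(a);
        __m512 vb = _mm512_loadu_ps(b);
        _mm512_storeu_ps(dst, _mm512_add_ps(va, vb));
    }

    /* ...or as two 256-bit operations (AVX2) covering the same 16 floats. */
    void add_2x256(float *dst, const float *a, const float *b) {
        __m256 lo = _mm256_add_ps(_mm256_loadu_ps(a),     _mm256_loadu_ps(b));
        __m256 hi = _mm256_add_ps(_mm256_loadu_ps(a + 8), _mm256_loadu_ps(b + 8));
        _mm256_storeu_ps(dst,     lo);
        _mm256_storeu_ps(dst + 8, hi);
    }
    ```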

    I went off on a bit of a tangent there, but HBM really isn’t that useful for a cpu cache. It isn’t going to be that low latency, so it would not increase single-thread performance much compared to stuff actually designed to be a low-latency cache. The next generations from AMD may start using active silicon interposers, but I would doubt that they would use HBM. The interposer is most likely to be used in place of the IO die. They could place all of the large transistors needed for driving off-die interfaces (the reason why IO doesn’t scale well) in the active interposer. They could then stack 7 nm chips on top of the active interposer for the actual logic. Cache scales very well, which is why AMD can do a $200 chip with 32 MB of L3 cache and a $500 chip with 64 MB of L3. Intel 14 nm chips top out at 38.5 MB, mostly for high-priced Xeon chips. With an active interposer they could, for example, make something like 4 or 8 memory controller chips with large SRAM caches on 7 nm while using the active interposer for the IO drivers. Many different configurations are possible with an active interposer, so it is hard to speculate. Placing HBM on the IO interposer, as the AdoredTV guy has speculated, doesn’t sound like a great idea. Two stacks of HBM deliver 512 GB/s, which would take around 10 IF links to transfer to the CPU chiplets. That would be a massive waste of power. If they do use HBM for cpu chiplets, you would want to connect it directly to the cpu chiplet; you would place a cpu chiplet and its HBM stack on the same interposer. That would have some latency advantage, but mostly for large systems like Epyc.
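    As a back-of-the-envelope check of that link count (a rough sketch, assuming ~256 GB/s per HBM2 stack and roughly 50 GB/s of read bandwidth per IF link, i.e. 32 bytes per fabric clock at about 1.6 GHz):

    ```c
    #include <stdio.h>

    /* Back-of-the-envelope check of the "around 10 IF links" figure above.
     * Both the per-stack and per-link bandwidths are assumptions for illustration. */
    int main(void) {
        double hbm_bw  = 2 * 256.0;   /* GB/s for two HBM2 stacks (assumed)               */
        double link_bw = 32 * 1.6;    /* GB/s per IF link: 32 B/clk at ~1.6 GHz (assumed) */
        printf("IF links needed: %.1f\n", hbm_bw / link_bw);   /* ~10 */
        return 0;
    }
    ```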
  • eek2121 - Wednesday, June 19, 2019 - link

    I think what people are getting at is having an L4 cache. Such a cache would be slower than L3, but would be much faster than DRAM (for now; DDR 5133 was recently demonstrated, which is about 2566 MHz double data rate). HBM2 is a prime candidate for that because you can stick 8 GB on a CPU for around $60, and with some engineering work it would help performance massively. 8 GB could hold practically everything needed in cache. That being said, there are engineering challenges to overcome and I doubt this will ever be a thing.

    Once JEDEC approves RAM running at DDR5-5600 with reasonable timings, it won’t matter anyway. AMD can simply bump the IF speed back up to a 1:1 ratio, and with shortened RAM traces, performance penalties can be minimized.
  • jamescox - Saturday, June 22, 2019 - link

    For a next-generation interposer-based Epyc package, I would expect they might do an active interposer with all of the external interface transistors in the interposer. They could do similar things with a passive interposer too, though; the passive interposer could be an intermediate step between Zen 3 and Zen 4. Then they could place a large number of 7 nm+ chiplets on the interposer. As I said, it is hard to speculate, but an option that I thought of based on AdoredTV’s 15-chiplet rumor would be to have 4 memory controller chips, each one running two channels (128-bit) of DDR5. On an active interposer, those chips would just be the memory controller logic and the interfaces to the interposer connections. That isn’t much, so at 7 nm and below they could place massive L4 SRAM caches on the memory controller chips. Current ~75 square mm Zen 2 chiplets have 16 MB plus 8 cpu cores, so it could be a large amount of cache; perhaps something like 64 or 128 MB per chip. It wouldn’t be a cheap device, but AMD’s goal is to get into the high-end market eventually.

    The other chiplets could be one or two dies to manage connections out to the cpu chiplets. With an active interposer, this would just be the logic. With a regular interposer, it would need to have the IO transistors also, but the interfaces are quite small. A single Infinity Fabric switch chip handling all cpu chiplets could provide very low latency. They may have another chip with a switch to tie everything together, or they could actually place a couple of cpu chiplets on the interposer. Two extra cpu chiplets or one 16-core chiplet could be where the 80-core rumor came from. A possible reason to do that is to allow an HBM-based gpu to be mounted on either side. That would make an exceptional HPC product with 16 cores (possibly 64 threads if they go to 4-way SMT) and two HBM gpus. Another way to get to 80 cores would be to just make a 3-CCX chiplet with 12 cores. It looks like the Epyc package will not fit all 12-core dies, though. A mixture of four 12-core and four 8-core dies looks like it would fit, but it wouldn’t be symmetric. That would allow a quick Zen 2+ style upgrade. Desktop might be able to go to 24 cores and Epyc to 80. The confusion could be from mixing up a Zen 2+ rumor and a Zen 3 rumor or something like that. The interposer makes a lot of sense for the giant IO die that cannot be easily implemented at 7 nm. The yields probably don’t support that large a die, so you use an interposer and make a bunch of ~100 square mm dies instead.

    I can’t rule out placing HBM on an IO interposer, but since the latency would not really be that much better than off-package DRAM, especially at DDR5 speeds, it just doesn’t seem like something they would do.
  • nandnandnand - Sunday, July 7, 2019 - link

    "That being said, there are engineering challenges to overcome and I doubt this will ever be a thing."

    Putting large amounts of DRAM ever closer to the CPU will definitely be a thing:

    https://www.darpa.mil/attachments/3DSoCProposersDa...

    Intel is already moving in this direction with Foveros, and AMD is also working on it:

    https://www.tomshardware.com/news/amd-3d-memory-st...

    It doesn't matter how fast DDR5 is. The industry must move in this direction to grab performance and power efficiency gains.
  • AdrianMel - Sunday, June 16, 2019 - link

    I would like to see these AMD chips used in laptops. It would be a breakthrough in computing power and low power consumption. I think that if HBM2 or a larger memory were integrated into the processor, it would double the computing power. It would also be worth studying and implementing two "super ports", like the old ExpressCard/54, into which we could insert two video cards in laptops.
  • nandnandnand - Sunday, July 7, 2019 - link

    AMD needs to put out some 6-8 core Zen 2 laptop chips.
  • peevee - Monday, June 17, 2019 - link

    Does it mean that AVX2 performance doubles compared to Zen+? At least on workloads where the data for the inner loop fits into L1D$ (hierarchical dense matrix multiplication, etc.)?
  • peevee - Monday, June 17, 2019 - link

    "AMD manages its L3 by sharing a 16MB block per CCX, rather than enabling access to any L3 from any core."

    Does it mean that for code and shared data caches, the 64MB L3 on Ryzen 9 behaves essentially like a 16MB cache (say, when all 12/16 cores run the same code, as is usual in performance-critical client code, rather than 4+ different processes/VMs in parallel)? What a waste it is/would be...
  • jamescox - Saturday, June 22, 2019 - link

    The caches on different CCXs can communicate with each other. In Zen 2, those on the same die probably communicate at core clock rather than at memory clock; there is no memory clock on the cpu chiplet. The link clocks between chiplets have essentially more than doubled vs. Zen 1, and there is a possibility that they doubled the widths also. There just about isn’t any other way to scale to such core counts.

    An Intel monolithic high-core-count device will have trouble competing. The latency of their mesh network will go up with more cores, and it will burn a lot of power. The latency of the L3 with a mesh network will be higher than the latency within a 4-core CCX. Problems with the CCX architecture are mostly due to OS scheduler issues and badly written multithreaded code. Many applications performed significantly better on Linux compared to Windows because of this.
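    On the scheduling point, a minimal sketch of pinning threads to one CCX on Linux (assuming, hypothetically, that logical CPUs 0-3 map to a single CCX; the real mapping varies by SKU and SMT enumeration and should be checked with something like lstopo):

    ```c
    #define _GNU_SOURCE
    #include <sched.h>
    #include <stdio.h>

    /* Pin the calling thread to CPUs 0-3 so that threads sharing data stay
     * within one CCX and its 16 MB L3 slice. The CPU-to-CCX mapping used
     * here is an assumption, not a guarantee. */
    static int pin_to_first_ccx(void) {
        cpu_set_t set;
        CPU_ZERO(&set);
        for (int cpu = 0; cpu < 4; cpu++)
            CPU_SET(cpu, &set);
        return sched_setaffinity(0, sizeof(set), &set);  /* 0 = calling thread */
    }

    int main(void) {
        if (pin_to_first_ccx() != 0)
            perror("sched_setaffinity");
        /* ... run the latency-sensitive threads here ... */
        return 0;
    }
    ```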

    The mesh network is also not workable across multiple chiplets. A 16-core (or even a 10-core) monolithic device would be quite large for 10 nm, and they would be wasting a bunch of expensive 10 nm capacity on IO. With the large die size and questionable yields, it will be a much more expensive chip than AMD’s MCM. Also, current Intel chips top out at 38.5 MB of L3 cache on 14 nm, and those are mostly expensive Xeon processors. AMD will have a 32 MB part for $200 and a 64 MB part for $500. Even when Intel actually gets a 10 nm part onto the desktop, it will likely be much more expensive. They are also going to have serious problems getting their 10 nm parts up to clock speeds competitive with the 14 nm parts. They have been tweaking 14 nm for something like 5+ years now; pushing the clocks on their problematic 10 nm process doesn’t sound promising.
  • peevee - Monday, June 17, 2019 - link

    "One of the features of IF2 is that the clock has been decoupled from the main DRAM clock....

    For Zen 2, AMD has introduced ratios to the IF2, enabling a 1:1 normal ratio or a 2:1 ratio that reduces the IF2 clock in half."

    I have news for you - 2:1 is still COUPLED. False advertising in the slides.

    And besides, who in their right mind would want to halve the IF clock to go from DDR4-3200 to even DDR4-4000 (with the requisite higher timings)?
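    For a rough sense of why (a sketch; the exact FCLK ceiling and the point where the 2:1 divider engages are assumptions here and vary by board and silicon):

    ```c
    #include <stdio.h>

    /* Rough sketch of the MEMCLK/FCLK relationship described in the article.
     * The ~1800 MHz FCLK ceiling / 2:1 switch point is an assumption. */
    int main(void) {
        int rates[] = {3200, 3600, 4000, 4400};     /* DDR4 transfer rates, MT/s */
        for (int i = 0; i < 4; i++) {
            int memclk = rates[i] / 2;              /* DRAM clock in MHz */
            int ratio  = (memclk > 1800) ? 2 : 1;   /* assumed auto 2:1 threshold */
            printf("DDR4-%d: MEMCLK %d MHz, FCLK %d MHz (%d:1)\n",
                   rates[i], memclk, memclk / ratio, ratio);
        }
        return 0;
    }
    ```

    In 2:1 mode, DDR4-4000 leaves the fabric at 1000 MHz, well below the 1600 MHz it runs at with DDR4-3200 in 1:1 mode, so the higher DRAM rate largely pays for the fabric speed it gives up.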
