HBM: Wide & Slow Makes It Fast

Given the challenges faced in pushing GDDR5 and similar memory technologies even further, development of high bandwidth memory has over the last decade shifted back towards wider, slower interfaces. As serial interfaces reach their limits, parallel interfaces become an increasingly viable alternative. And although they bring with them their own challenges – there’s a reason serial interfaces have been dominant most recently, after all – the ramp-up of challenges in further improving serial interfaces has coincided with the development of technologies that make parallel interfaces easier to implement. As a result, the pendulum has swung back to parallel interfaces and HBM.

HBM in a nutshell takes the wide & slow paradigm to its fullest. Rather than building an array of high speed chips around an ASIC to deliver 7Gbps+ per pin over a 256/384/512-bit memory bus, HBM at its most basic level involves turning memory clockspeeds way down – to just 1Gbps per pin – but in exchange making the memory bus much wider. How wide? That depends on the implementation and generation of the specification, but the examples AMD has been showcasing so far have involved 4 HBM devices (stacks), each featuring a 1024-bit wide memory bus, combining for a massive 4096-bit memory bus. It may not be clocked high, but when it’s that wide, it doesn’t need to be.
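
To put rough numbers on that tradeoff, a quick back-of-the-envelope comparison (a minimal sketch in Python, assuming the 512-bit, 7Gbps-per-pin GDDR5 setup and the 4096-bit, 1Gbps-per-pin HBM setup described above; real products will of course vary) looks like this:

    # Peak bandwidth = bus width (bits) x per-pin data rate (Gbps) / 8 bits per byte
    def peak_bandwidth_gb_per_sec(bus_width_bits, rate_gbps_per_pin):
        return bus_width_bits * rate_gbps_per_pin / 8

    gddr5 = peak_bandwidth_gb_per_sec(512, 7.0)   # narrow & fast: 512-bit bus at 7Gbps/pin
    hbm = peak_bandwidth_gb_per_sec(4096, 1.0)    # wide & slow: 4 x 1024-bit stacks at 1Gbps/pin

    print(f"GDDR5: {gddr5:.0f} GB/sec")  # 448 GB/sec
    print(f"HBM:   {hbm:.0f} GB/sec")    # 512 GB/sec

Even at a seventh of the per-pin speed, the sheer width of the HBM bus more than makes up the difference.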

Of course while extra memory bandwidth is nice, the far more interesting part is how HBM delivers this. Although the idea of a wide, slow parallel bus is easy enough to comprehend on paper, implementing it is a whole other matter. A 4096-bit memory bus involves thousands of traces, far more than GDDR5, all of which must be carefully constructed in order to make HBM work. As a result there are a couple of fundamental technologies that are seeing their big (though not necessarily initial) introduction with HBM.

The first and most fundamental question is how to efficiently route a 4096-bit memory bus. Even the best surface mount BGA technologies have their limits, and as it is Hawaii was already pushing things with its 512-bit GDDR5 memory bus. An even wider bus only makes that harder, inviting issues both in routing such a wide bus on a PCB or chip substrate, and in using BGA to connect a chip to those traces.

The first part of the solution was to develop something capable of greater density routing, and that something was the silicon interposer. The interposer, in its broadest terms, is a partially fabbed silicon chip that instead of being developed into a full ASIC packed with logic, is only developed as far as its metal layers, in order to route signals and power among devices. The interposer in turn works because it exploits some of the core advantages of modern photolithographic processes, allowing for very fine paths to be created that would otherwise not be possible or practical on traditional PCBs and substrates.

Using a silicon interposer solves some of the fundamental problems with HBM, but it provides some ancillary benefits as well. Along with solving the obvious routing issue, the interposer allows DRAM to be placed very close to an ASIC without being placed on top of it (a la package-on-package, or PoP), which is impractical for high TDP devices like GPUs. Placing the DRAM so close to the ASIC avoids the drawbacks of long memory paths; the shorter paths are both simpler to construct and require less power. It also benefits integration, as, similar to PoP technology, more of the device’s functionality can be located on the same package as the ASIC, reducing the number of devices that need to be placed off-package and routed to the ASIC.

Of course the interposer does come with a drawback as well, and that’s cost. While AMD is not talking about costs in great detail – this is a technology deep dive, not an analyst meeting – the fact that the interposer is essentially a very large, partially developed silicon chip means that it’s relatively expensive to produce, especially compared to the very low costs of PCBs and traditional substrates. Mitigating this is the fact that interposers don’t need to go through the most complex and expensive phases of photolithography – the actual front-end lithography – so the cost is only the silicon wafer itself, along with the work required to create the metal layers, with the final interposer being only some 100 microns thick. Furthermore, this doesn’t require cutting-edge fabs – old, fully amortized 65nm equipment works quite well – which further keeps the costs down. The end result is that the interposer is still a significant cost, but it is not as bad as it initially seems. This ultimately is why HBM will first be introduced on high margin products like high-end video cards before potentially making its way down to cheaper devices like APUs.

Meanwhile, over the long run AMD and their vendors will also benefit from volume production. The first interposers are being produced on retooled 65nm lithographic lines; however, once volume production scales up, it will become economical to develop interposer-only lines that are cheaper to operate, since they don’t need the ability to offer full lithography as well. Where that cut-off will be is not quite clear at this time, though it sounds like it will happen sooner rather than later.

In the grand scheme of things, the interposer becomes a new layer in the complete chip package, sitting between the traditional substrate and any DRAM/ASICs mounted on top of it. Microbumps will connect the DRAM and ASICs to the interposer, the interposer will in turn be connected to the substrate, and finally the substrate will be connected to its partner PCB. PCB mounting itself becomes a bit easier in the process, as there’s no longer a need to route memory traces through the substrate, which means the only remaining connections are data (PCIe bus, etc.) and power for the ASIC and DRAM. All of the complex routing is essentially localized to the interposer layer.
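
To make that layering a bit more concrete, here is a minimal sketch (again in Python) that simply walks the completed package from top to bottom; the layer names and attachment descriptions restate the stack-up above and are not official designations:

    # Illustrative stack-up of an HBM-equipped package, from top to bottom
    stackup = [
        ("GPU ASIC + HBM stacks", "microbumps down to the silicon interposer"),
        ("silicon interposer", "connected down to the package substrate"),
        ("package substrate", "BGA connection down to the PCB"),
        ("PCB", "carries only power and external I/O (e.g. PCIe); no memory traces"),
    ]
    for layer, attachment in stackup:
        print(f"{layer:24} -> {attachment}")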

Moving on, the other major technological breakthrough here is the creation of through-silicon vias (TSVs). With the interposer enabling the routing of a dense memory bus, the other issue to solve was the creation of dense memory. The solution was to stack multiple memory dies together into a single device/stack in order to create a single 1024-bit stack, and TSVs are in turn what make this possible.

The reason for stacking DRAM is pretty straightforward: it makes production easier by reducing the DRAM to fewer discrete devices, not to mention it saves space. The challenge here is that you can’t have traditional surface mount connections since the DRAM is stacked, and traditional edge connections (as used in PoP) are neither dense enough nor do they scale well to the kind of stacks HBM would require.


TSVs. Image Courtesy The International Center for Materials Nanoarchitectonics

As a result a means was needed to route DRAM connections through the lower layers of the stack, and this problem was solved with TSVs. Whereas regular vias offer the ability to connect two layers together, TSVs extend this principle by running vias straight through silicon devices in order to connect layers farther out. The end result is something vaguely akin to DRAM dies surface mounted on top of each other via microbumps, but with the ability to communicate through the layers. From a manufacturing standpoint, between the silicon interposer and TSVs, TSVs are the more difficult technology to master, as they essentially combine all the challenges of DRAM fabbing with the challenges of stacking those DRAM dies on top of each other.

Having developed the means to stack DRAM, the final component of an HBM stack is a logic die that sits at the bottom of the stack. Similarly outfitted with TSVs, the logic die is responsible for actually operating the DRAM dies above it, and for handling the operation of the HBM bus between the stack and the ASIC. This ends up being a rather interesting development, since the net result is more logic added despite the simplicity of the HBM bus; but at the same time, thanks to TSVs and the interposer, it’s easier than ever to add that logic.
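
To tie the stack organization back to the 4096-bit figure from earlier, here is one last small sketch; note that the 8-channel, 128-bit-per-channel split and the 4-high DRAM stack are drawn from the public JEDEC HBM specification and first-generation stacks rather than from anything stated above, so treat those particular numbers as assumptions:

    # One first-generation HBM stack: a base logic die plus stacked DRAM dies,
    # presenting a 1024-bit interface to the ASIC across the interposer.
    # Channel and die counts are assumptions based on the public JEDEC HBM spec.
    DRAM_DIES_PER_STACK = 4
    CHANNELS_PER_STACK = 8
    BITS_PER_CHANNEL = 128
    STACKS = 4                                            # the configuration AMD has been showcasing

    stack_width = CHANNELS_PER_STACK * BITS_PER_CHANNEL   # 1024 bits per stack
    total_width = STACKS * stack_width                    # 4096-bit aggregate bus
    print(f"{STACKS} stacks x {DRAM_DIES_PER_STACK} DRAM dies: "
          f"{stack_width}-bit per stack, {total_width}-bit total")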

Comments

  • akamateau - Thursday, May 28, 2015

    You miss the whole point.

    DX12 solves Witcher.

    HBM was designed to manage the high volume of drawcalls that DX12 enables.

    ALL GPUs were crippled with DX11. DX11 is DEAD.

    You can't render an object until you draw it. DX11 does not support multi-core or multi-threaded CPU processing of graphics; DX12 does.

    With DX12, ALL CPU cores feed the GPU.
  • akamateau - Monday, June 8, 2015

    If AMD can put 4-8 gigs of HBM on a GPU then they can do the same with CPUs as well as APUs. All of the patents that I am showing below reference 3D stacked memory.

    In fact one interesting quote from the patents listed below is this:

    "Packaging chips in closer proximity not only improves performance, but can also reduce the energy expended when communicating between the processor and memory. It would be desirable to utilize the large amount of "empty" silicon that is available in an interposer."

    AMD has plans to fill that empty silicon with much more memory.

    The point: REPLACE SYSTEM DYNAMIC RAM WITH ON-DIE HBM 2 OR 3!

    Reducing the electrical path distance from 4-8 centimeters to a few millimeters would be worth a couple of clocks of latency. If AMD is building HBM and HBM 2 then they are also building HBM 3 or more!

    Imagine what 64GB of HBM could do for a massive server die such as Zen. The energy savings alone would be worth it, never mind the hugely reduced motherboard size, eliminating sockets and RAM packaging. The increased number of CPUs per blade or motherboard also reduces costs, as servers can become much more dense.

    Most folks now only run 4-8 gigs in their laptops or desktops. Eliminating DRAM and replacing it with HBM is a huge energy and mechanical savings as well as a staggering performance jump, and it destroys DDR5. That process will be very mature in a year and costs will drop. Right now the retail cost of DRAM per GB is about $10. Subtract packaging and channel costs and that drops to $5 or less. Adding 4-8 GB of HBM has a very cheap material cost; likely the main expense is the process, testing, and yields. Balance that against the energy savings and motherboard real estate savings, and HBM replacing system DRAM becomes even more likely, even before counting the massive leap in performance as an added benefit.

    The physical cost savings are quite likely equivalent to the added process cost, since Fiji will likely be released at a very competitive price point.

    AMD is planning on replacing system DRAM memory with stacked HBM. Here are the patents. They were all published last year and this year with the same inventor, Gabriel H. Loh, and the assignee is of course AMD.

    Stacked memory device with metadata management
    WO 2014025676 A1
    "Memory bandwidth and latency are significant performance bottlenecks in many processing systems. These performance factors may be improved to a degree through the use of stacked, or three-dimensional (3D), memory, which provides increased bandwidth and reduced intra-device latency through the use of through-silicon vias (TSVs) to interconnect multiple stacked layers of memory. However, system memory and other large-scale memory typically are implemented as separate from the other components of the system. A system implementing 3D stacked memory therefore can continue to be bandwidth-limited due to the bandwidth of the interconnect connecting the 3D stacked memory to the other components and latency-limited due to the propagation delay of the signaling traversing the relatively-long interconnect and the handshaking process needed to conduct such signaling. The inter-device bandwidth and inter-device latency have a particular impact on processing efficiency and power consumption of the system when a performed task requires multiple accesses to the 3D stacked memory as each access requires a back-and-forth communication between the 3D stacked memory and thus the inter-device bandwidth and latency penalties are incurred twice for each access."

    Interposer having embedded memory controller circuitry
    US 20140089609 A1
    " For high-performance computing systems, it is desirable for the processor and memory modules to be located within close proximity for faster communication (high bandwidth). Packaging chips in closer proximity not only improves performance, but can also reduce the energy expended when communicating between the processor and memory. It would be desirable to utilize the large amount of "empty" silicon that is available in an interposer. "

    Die-stacked memory device with reconfigurable logic
    US 8922243 B2
    "Memory system performance enhancements conventionally are implemented in hard-coded silicon in system components separate from the memory, such as in processor dies and chipset dies. This hard-coded approach limits system flexibility as the implementation of additional or different memory performance features requires redesigning the logic, which design costs and production costs, as well as limits the broad mass-market appeal of the resulting component. Some system designers attempt to introduce flexibility into processing systems by incorporating a separate reconfigurable chip (e.g., a commercially-available FPGA) in the system design. However, this approach increases the cost, complexity, and size of the system as the system-level design must accommodate for the additional chip. Moreover, this approach relies on the board-level or system-level links to the memory, and thus the separate reconfigurable chip's access to the memory may be limited by the bandwidth available on these links."

    Hybrid cache
    US 20140181387 A1
    "Die-stacking technology enables multiple layers of Dynamic Random Access Memory (DRAM) to be integrated with single or multicore processors. Die-stacking technologies provide a way to tightly integrate multiple disparate silicon die with high-bandwidth, low-latency interconnects. The implementation could involve vertical stacking as illustrated in FIG. 1A, in which a plurality of DRAM layers 100 are stacked above a multicore processor 102. Alternately, as illustrated in FIG. 1B, a horizontal stacking of the DRAM 100 and the processor 102 can be achieved on an interposer 104. In either case the processor 102 (or each core thereof) is provided with a high bandwidth, low-latency path to the stacked memory 100.
    Computer systems typically include a processing unit, a main memory and one or more cache memories. A cache memory is a high-speed memory that acts as a buffer between the processor and the main memory. Although smaller than the main memory, the cache memory typically has appreciably faster access time than the main memory. Memory subsystem performance can be increased by storing the most commonly used data in smaller but faster cache memories."

    Partitionable data bus
    US 20150026511 A1
    "Die-stacked memory devices can be combined with one or more processing units (e.g., Central Processing Units (CPUs), Graphics Processing Units (GPUs), and Accelerated Processing Units (APUs)) in the same electronics package. A characteristic of this type of package is that it can include, for example, over 1000 data connections (e.g., pins) between the one or more processing units and the die-stacked memory device. This high number of data connections is significantly greater than data connections associated with off-chip memory devices, which typically have 32 or 64 data connections."

    Non-uniform memory-aware cache management
    US 20120311269 A1
    "Computer systems may include different instances and/or kinds of main memory storage with different performance characteristics. For example, a given microprocessor may be able to access memory that is integrated directly on top of the processor (e.g., 3D stacked memory integration), interposer-based integrated memory, multi-chip module (MCM) memory, conventional main memory on a motherboard, and/or other types of memory. In different systems, such system memories may be connected directly to a processing chip, associated with other chips in a multi-socket system, and/or coupled to the processor in other configurations.
    Because different memories may be implemented with different technologies and/or in different places in the system, a given processor may experience different performance characteristics (e.g., latency, bandwidth, power consumption, etc.) when accessing different memories. For example, a processor may be able to access a portion of memory that is integrated onto that processor using stacked dynamic random access memory (DRAM) technology with less latency and/or more bandwidth than it may a different portion of memory that is located off-chip (e.g., on the motherboard). As used herein, a performance characteristic refers to any observable performance measure of executing a memory access operation."

    “NoC Architectures for Silicon Interposer Systems: Why pay for more wires when you can get them (from your interposer) for free?” Natalie Enright Jerger, Ajaykumar Kannan, Zimo Li (Edward S. Rogers Department of Electrical and Computer Engineering, University of Toronto) and Gabriel H. Loh (AMD Research, Advanced Micro Devices, Inc.)
    http://www.eecg.toronto.edu/~enright/micro14-inter...

    “3D-Stacked Memory Architectures for Multi-Core Processors” Gabriel H. Loh (Georgia Institute of Technology, College of Computing)
    http://ag-rs-www.informatik.uni-kl.de/publications...

    “Efficiently Enabling Conventional Block Sizes for Very Large Die-stacked DRAM Caches” Gabriel H. Loh (AMD Research, Advanced Micro Devices, Inc.) and Mark D. Hill (Department of Computer Sciences, University of Wisconsin – Madison)
    http://research.cs.wisc.edu/multifacet/papers/micr...

    All of this adds up to HBM being placed on-die as a replacement for, or maybe a supplement to, system memory. But why have system DRAM if you can build much wider bandwidth memory closer to the CPU on-die? Unless of course you build socketed HBM DRAM and a completely new system memory bus to feed it.

    Replacing system DRAM with on-die HBM has the same benefits for the performance and energy demand of the system as it has for GPUs. It also makes for smaller motherboards, no memory sockets, and no memory packaging.

    Of course this is all speculation. But it also makes sense.
  • amilayajr - Tuesday, June 16, 2015

    With HBM in mind, does AMD hold the patent for this? Is Nvidia just going to use HBM for free? Anyone care to elaborate? Because if Nvidia gets to use it for free, then that's really funny for AMD's side, considering they are the one who researched and developed it. Am I making sense?
