The Net Benefits of HBM

Now that we’ve had a chance to talk about how HBM is constructed and the technical hurdles in building it, we can finally get to the subject of the performance and design benefits of HBM. HBM is of course first and foremost about further increasing memory bandwidth, but the combination of stacked DRAM and lower power consumption also opens up some additional possibilities that could not be pursued with GDDR5.

We’ll start with the bandwidth capabilities of HBM. The amount of bandwidth ultimately depends on the number of stacks in use along with the clockspeed of those stacks. HBM uses a DDR signaling interface, and while AMD is not disclosing final product specifications at this time, they have given us enough information to begin to build a complete picture.

GPU Memory Math
                                   AMD Radeon     NVIDIA GeForce    Theoretical
                                   R9 290X        GTX Titan X       4-Stack HBM1
Total Capacity                     4GB            12GB              4GB
Bandwidth Per Pin                  5Gbps          7Gbps             1Gbps
Number of Chips/Stacks             16             24                4
Bandwidth Per Chip/Stack           20GB/sec       14GB/sec          128GB/sec
Effective Bus Width                512-bit        384-bit           4096-bit
Total Bandwidth                    320GB/sec      336GB/sec         512GB/sec
Estimated DRAM Power Consumption   30W            31.5W             14.6W

The first generation of HBM AMD is using allows for each stack to be clocked up to 500MHz, which with DDR signaling works out to 1Gbps per pin. For a 1024-bit stack this means a single stack can deliver up to 128GB/sec (1024b * 1Gbps / 8b) of memory bandwidth. HBM in turn allows from 2 to 8 stacks to be used, with each stack carrying 1GB of DRAM. AMD’s example diagrams so far (along with NVIDIA’s Pascal test vehicle) have all been drawn with 4 stacks, in which case we’d be looking at 512GB/sec of memory bandwidth. This of course is quite a bit more than the 320GB/sec of memory bandwidth for the R9 290X or 336GB/sec for NVIDIA’s GTX Titan X, working out to a 52-60% increase in memory bandwidth.
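
To make that arithmetic easy to follow, here is the same math as a quick back-of-the-envelope sketch in Python; the 4-stack count and 1Gbps pin speed are the figures discussed above rather than confirmed product specifications.

```python
# HBM1 bandwidth math, using the figures discussed above (not final product specs)
STACK_WIDTH_BITS = 1024   # I/O width of a single HBM stack
PIN_SPEED_GBPS = 1.0      # 500MHz with DDR signaling -> 1Gbps per pin
STACKS = 4                # AMD's example diagrams show 4 stacks

per_stack_gbs = STACK_WIDTH_BITS * PIN_SPEED_GBPS / 8   # bits -> bytes
total_gbs = per_stack_gbs * STACKS

print(f"Per stack: {per_stack_gbs:.0f} GB/sec")                      # 128 GB/sec
print(f"4 stacks:  {total_gbs:.0f} GB/sec")                          # 512 GB/sec
print(f"vs. R9 290X (320GB/sec):     +{total_gbs / 320 - 1:.0%}")    # +60%
print(f"vs. GTX Titan X (336GB/sec): +{total_gbs / 336 - 1:.0%}")    # +52%
```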

At the same time this also calls into question memory capacity – 4 1GB stacks is only 4GB of VRAM – though AMD seems to be saving that matter for the final product introduction later this quarter. Launching a new, high-end GPU with 4GB could be a big problem for AMD, but we'll see just what they have up their sleeves in due time.

What’s perhaps more interesting is what happens to DRAM energy consumption with HBM. As we mentioned before, R9 290X spends 15-20% of its 250W power budget on DRAM, or roughly 38-50W of power on an absolute basis. Meanwhile by AMD’s own reckoning, GDDR5 is good for 10.66GB/sec of bandwidth per watt of power, which for the 290X’s 320GB/sec of bandwidth works out to 30W+. HBM on the other hand delivers better than 35GB/sec of bandwidth per watt, an immediate gain of over 3x in energy efficiency.

Of course AMD is then investing some of those gains back into coming up with more memory bandwidth, so it’s not as simple as saying that memory power consumption has been cut by 70%. Rather, given our earlier bandwidth estimate of 512GB/sec of memory bandwidth for a 4-stack configuration, we would be looking at about 15W of power consumption for a 512GB/sec HBM solution, versus 30W+ for a 320GB/sec GDDR5 solution. The end result then points to DRAM power consumption being closer to halved, with AMD saving 15-20W of power.
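
Those power estimates are simple enough to reproduce. The sketch below uses AMD's quoted bandwidth-per-watt figures; note that 35GB/sec per watt is a "better than" figure, so the real HBM number should be slightly better still.

```python
# DRAM power estimates derived from AMD's bandwidth-per-watt figures (rough estimates only)
GDDR5_GBS_PER_WATT = 10.66   # AMD's figure for GDDR5
HBM_GBS_PER_WATT = 35.0      # AMD quotes "better than" 35GB/sec per watt

gddr5_power_w = 320 / GDDR5_GBS_PER_WATT   # R9 290X at 320GB/sec
hbm_power_w = 512 / HBM_GBS_PER_WATT       # theoretical 4-stack HBM1 at 512GB/sec

print(f"GDDR5 @ 320GB/sec: ~{gddr5_power_w:.1f}W")                 # ~30.0W
print(f"HBM   @ 512GB/sec: ~{hbm_power_w:.1f}W")                   # ~14.6W
print(f"Estimated savings: ~{gddr5_power_w - hbm_power_w:.0f}W")   # ~15W
```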

What’s the real-world advantage of a 15-20W reduction in DRAM power consumption? Besides being able to invest that in reducing overall video card power consumption, the other option is to invest it in increasing clockspeeds. With PowerTune putting a hard limit on power consumption, a larger GPU power budget would allow AMD to increase clockspeeds and/or run at the maximum GPU clockspeed more often, improving performance by a currently indeterminable amount. Now as fair warning here, higher GPU clockspeeds typically require higher voltages, which in turn leads to a rapid increase in GPU power consumption. So although having additional power headroom does help the GPU, it may not be good for quite as much of a clockspeed increase as one might hope.

Meanwhile the performance increase from the additional memory bandwidth is equally nebulous until AMD’s new product is announced and benchmarked. As a rule of thumb GPUs are virtually always memory bandwidth bottlenecked – they are after all high-throughput processors capable of trillions of calculations per second working with only hundreds of billions of bytes of bandwidth – so there is no doubt that the higher memory bandwidths of HBM will improve performance. However memory bandwidth increases currently don’t lead to 1:1 performance increases even on AMD’s current cards, and it’s unlikely to be any different on future products.

Throwing an extra wrinkle into matters, any new AMD product would be based on GCN 1.2 or newer, which introduced AMD’s latest generation of color compression technology. The net result is that on identical workloads, memory bandwidth pressure is going down exactly at the same time as memory bandwidth availability is going up. AMD will end up gaining a ton of effective memory bandwidth – something that will be very handy for high resolutions – but it also makes it impossible to predict what the final performance impact might be. Still, it will be interesting to see what AMD can do with a 2x+ increase in effective memory bandwidth for graphics workloads.

The final major benefit AMD is looking to take advantage of with HBM – and at this point they’re not even being subtle about it – is the new form factors enabled by HBM’s denser designs. With the large GDDR5 memory chips replaced by much narrower HBM stacks, AMD is telling us that the resulting ASIC + RAM setups can be much smaller.

How much smaller? Well 1GB of GDDR5, composed of 2Gbit modules (the standard module size for R9 290X), would take up 672mm2, versus just 35mm2 for the same 1GB of DRAM as an HBM stack. Even if we refactor this calculation for 4Gbit modules – the largest modules used in currently shipping video cards – then we still end up with 336mm2 versus 35mm2, which is still a savings of 89% for 1GB of DRAM. Ultimately the HBM stack itself is composed of multiple DRAM dies, so there’s still quite a bit of silicon in play; however, its 2D footprint is reduced significantly thanks to stacking.
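
For reference, the footprint comparison works out as follows; the 4Gbit case assumes the same per-module package size as AMD's 2Gbit figure, which is an approximation on our part.

```python
# Approximate 2D footprint of 1GB of DRAM, based on AMD's figures
HBM_STACK_MM2 = 35            # one 1GB HBM1 stack
GDDR5_1GB_2GBIT_MM2 = 672     # 4 x 2Gbit GDDR5 modules (R9 290X-style)
GDDR5_1GB_4GBIT_MM2 = 336     # 2 x 4Gbit modules, assuming the same per-module size

for label, area_mm2 in [("GDDR5, 2Gbit modules", GDDR5_1GB_2GBIT_MM2),
                        ("GDDR5, 4Gbit modules", GDDR5_1GB_4GBIT_MM2)]:
    reduction = 1 - HBM_STACK_MM2 / area_mm2
    print(f"{label}: {area_mm2}mm2 vs. {HBM_STACK_MM2}mm2 for HBM "
          f"({reduction:.1%} smaller)")   # ~94.8% and ~89.6% respectively
```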

By AMD’s own estimate, a single HBM-equipped GPU package would be less than 70mm X 70mm (4900mm2), versus 110mm X 90mm (9900mm2) for R9 290X. Throw in additional space savings from the fact that HBM stacks don’t require quite as complex power delivery circuitry, and the card space savings could be significant. By our reckoning the total card size will still be fairly big – all of those VRMs and connectors need to go somewhere – but there is potential for significant savings. What AMD intends to do with those savings remains to be seen, but with apologies to AMD on this one, NVIDIA has already shown off their Pascal test vehicle for their mezzanine connector design, and it goes without saying that such a form factor opens up some very interesting possibilities.


With apologies to AMD: NVIDIA’s Pascal Test Vehicle, An Example Of A Smaller, Non-Traditional Video Card Design

Finally, aftermarket enthusiasts may or may not enjoy one final benefit from the use of HBM. Because the DRAM and GPU are now on the same package, AMD is going to be capping the package with an integrated heat spreader (IHS) to compensate for any differences in height between the HBM stacks and GPU die, to protect the HBM stacks, and to supply the HBM stacks with sufficient cooling. High-end GPU dies have been bare for some time now, so an IHS brings with it the same kind of protection for the die that IHSs brought to CPUs. At the same time however this means it’s no longer possible to make direct contact with the GPU, so extreme overclockers may come away disappointed. We’ll have to see what the shipping products are like and whether in those cases it’s viable to remove the IHS.

Closing Thoughts

Bringing this deep dive to a close, as the first GPU manufacturer to ship an HBM solution – in fact AMD expects to be the only vendor to ship an HBM1 solution – AMD has set into motion some very aggressive product goals thanks to the gains from HBM. Until we know more about AMD’s forthcoming video card I find it prudent to keep expectations in check here, as HBM is just one piece of the complete puzzle that is a GPU. But at the same time let’s be clear here: HBM is the future memory technology of GPUs, there is potential for significant performance increases thanks to the massive increase in memory bandwidth it offers, and for roughly the next year AMD is going to be the only GPU vendor offering this technology.

AMD for their part is looking to take as much advantage of their lead as they can, both at the technical level and the consumer level. At the technical level AMD has said very little about performance so far, so we’ll have to wait and see just what their new product brings. But AMD is being far more open about their plans to exploit the size advantage of HBM, so we should expect to see some non-traditional designs for high-end GPUs. Meanwhile at the consumer level, expect to see HBM enter the technology lexicon as the latest buzzword for high-performance products – almost certainly to be stamped on video card boxes just as GDDR5 has been for years – as AMD looks to let everyone know about their advantage.

Meanwhile shifting gears towards the long term, high-end GPUs are just the first of what AMD expects to be a wider rollout for HBM. Though AMD is not committing to any other products at this time, as production ramps up and costs come down, HBM is expected to become financially viable in a wider range of products, including lower-end GPUs, HPC products (e.g. FirePro S and AMD’s forthcoming HPC APU), high-end communications gear, and of course AMD’s mainstream consumer APUs. As lower-margin products, consumer APUs will likely be among the farthest off; however, in the long run they may very well be the most interesting use case for HBM, as APUs are among the most bandwidth-starved graphics products out there. But before we get too far ahead of ourselves, let’s see what AMD is able to do with HBM on their high-end video cards later this quarter.

Comments

  • akamateau - Thursday, May 28, 2015 - link

    You miss the whole point.

    DX12 solves Witcher.

    HBM was designed to manage the high volume of drawcalls that DX12 enables.

    ALL GPUs were crippled with DX11. DX11 is DEAD.

    You can't render an object until you draw it. DX11 does not support multicore or multithreaded CPU processing of graphics; DX12 does.

    With DX12 ALL CPU cores feed the GPU.
  • akamateau - Monday, June 8, 2015 - link

    If AMD can put 4-8 gigs of HBM on a GPU then they can do the same with CPUs as well as APUs. All of the patents that I am showing below reference 3D stacked memory.

    In fact one interesting quote from the patents listed below is this:

    "Packaging chips in closer proximity not only improves performance, but can also reduce the energy expended when communicating between the processor and memory. It would be desirable to utilize the large amount of "empty" silicon that is available in an interposer."

    AMD has plans to fill that empty silicon with much more memory.

    The point: REPLACE SYSTEM DYNAMIC RAM WITH ON-DIE HBM 2 OR 3!

    Reducing the electrical path distance from 4-8 centimeters to a few millimeters would be worth a couple of clocks of latency. If AMD is building HBM and HBM 2 then they are also building HBM 3 or more!

    Imagine what 64GB of HBM could do for a massive server die such as Zen. The energy savings alone would be worth it, never mind the hugely reduced motherboard size, eliminating sockets and RAM packaging. The increased number of CPUs per blade or motherboard also reduces costs as servers can become much more dense.

    Most folks now only run 4-8 gigs in their laptops or desktops. Eliminating DRAM and replacing it with HBM is a huge energy and mechanical savings as well as a staggering performance jump, and it destroys DDR5. That process will be very mature in a year and costs will drop. Right now the retail cost of DRAM per GB is about $10. Subtract packaging and channel costs and that drops to $5 or less. Adding 4-8 GB of HBM has a very cheap material cost; the main expense is likely the process, testing and yields. Balance that against the energy savings and motherboard real estate savings, and HBM replacing system DRAM becomes even more likely, even without the massive leap in performance as an added benefit.

    The physical cost savings is quite likely equivalent to the added process cost, since Fiji will likely be released at a very competitive price point.

    AMD is planning on replacing system DRAM memory with stacked HBM. Here are the patents. They were all published last year and this year with the same inventor, Gabriel H. Loh, and the assignee is of course AMD.

    Stacked memory device with metadata management
    WO 2014025676 A1
    "Memory bandwidth and latency are significant performance bottlenecks in many processing systems. These performance factors may be improved to a degree through the use of stacked, or three-dimensional (3D), memory, which provides increased bandwidth and reduced intra-device latency through the use of through-silicon vias (TSVs) to interconnect multiple stacked layers of memory. However, system memory and other large-scale memory typically are implemented as separate from the other components of the system. A system implementing 3D stacked memory therefore can continue to be bandwidth-limited due to the bandwidth of the interconnect connecting the 3D stacked memory to the other components and latency-limited due to the propagation delay of the signaling traversing the relatively-long interconnect and the handshaking process needed to conduct such signaling. The inter-device bandwidth and inter-device latency have a particular impact on processing efficiency and power consumption of the system when a performed task requires multiple accesses to the 3D stacked memory as each access requires a back-and-forth communication between the 3D stacked memory and thus the inter-device bandwidth and latency penalties are incurred twice for each access."

    Interposer having embedded memory controller circuitry
    US 20140089609 A1
    " For high-performance computing systems, it is desirable for the processor and memory modules to be located within close proximity for faster communication (high bandwidth). Packaging chips in closer proximity not only improves performance, but can also reduce the energy expended when communicating between the processor and memory. It would be desirable to utilize the large amount of "empty" silicon that is available in an interposer. "

    Die-stacked memory device with reconfigurable logic
    US 8922243 B2
    "Memory system performance enhancements conventionally are implemented in hard-coded silicon in system components separate from the memory, such as in processor dies and chipset dies. This hard-coded approach limits system flexibility as the implementation of additional or different memory performance features requires redesigning the logic, which design costs and production costs, as well as limits the broad mass-market appeal of the resulting component. Some system designers attempt to introduce flexibility into processing systems by incorporating a separate reconfigurable chip (e.g., a commercially-available FPGA) in the system design. However, this approach increases the cost, complexity, and size of the system as the system-level design must accommodate for the additional chip. Moreover, this approach relies on the board-level or system-level links to the memory, and thus the separate reconfigurable chip's access to the memory may be limited by the bandwidth available on these links."

    Hybrid cache
    US 20140181387 A1
    "Die-stacking technology enables multiple layers of Dynamic Random Access Memory (DRAM) to be integrated with single or multicore processors. Die-stacking technologies provide a way to tightly integrate multiple disparate silicon die with high-bandwidth, low-latency interconnects. The implementation could involve vertical stacking as illustrated in FIG. 1A, in which a plurality of DRAM layers 100 are stacked above a multicore processor 102. Alternately, as illustrated in FIG. 1B, a horizontal stacking of the DRAM 100 and the processor 102 can be achieved on an interposer 104. In either case the processor 102 (or each core thereof) is provided with a high bandwidth, low-latency path to the stacked memory 100.
    Computer systems typically include a processing unit, a main memory and one or more cache memories. A cache memory is a high-speed memory that acts as a buffer between the processor and the main memory. Although smaller than the main memory, the cache memory typically has appreciably faster access time than the main memory. Memory subsystem performance can be increased by storing the most commonly used data in smaller but faster cache memories."

    Partitionable data bus
    US 20150026511 A1
    "Die-stacked memory devices can be combined with one or more processing units (e.g., Central Processing Units (CPUs), Graphics Processing Units (GPUs), and Accelerated Processing Units (APUs)) in the same electronics package. A characteristic of this type of package is that it can include, for example, over 1000 data connections (e.g., pins) between the one or more processing units and the die-stacked memory device. This high number of data connections is significantly greater than data connections associated with off-chip memory devices, which typically have 32 or 64 data connections."

    Non-uniform memory-aware cache management
    US 20120311269 A1
    "Computer systems may include different instances and/or kinds of main memory storage with different performance characteristics. For example, a given microprocessor may be able to access memory that is integrated directly on top of the processor (e.g., 3D stacked memory integration), interposer-based integrated memory, multi-chip module (MCM) memory, conventional main memory on a motherboard, and/or other types of memory. In different systems, such system memories may be connected directly to a processing chip, associated with other chips in a multi-socket system, and/or coupled to the processor in other configurations.
    Because different memories may be implemented with different technologies and/or in different places in the system, a given processor may experience different performance characteristics (e.g., latency, bandwidth, power consumption, etc.) when accessing different memories. For example, a processor may be able to access a portion of memory that is integrated onto that processor using stacked dynamic random access memory (DRAM) technology with less latency and/or more bandwidth than it may a different portion of memory that is located off-chip (e.g., on the motherboard). As used herein, a performance characteristic refers to any observable performance measure of executing a memory access operation."

    “NoC Architectures for Silicon Interposer Systems Why pay for more wires when you can get them (from your interposer) for free?” Natalie Enright Jerger, Ajaykumar Kannan, Zimo Li Edward S. Rogers Department of Electrical and Computer Engineering University of Toronto Gabriel H. Loh AMD Research Advanced Micro Devices, Inc”
    http://www.eecg.toronto.edu/~enright/micro14-inter...

    “3D-Stacked Memory Architectures for Multi-Core Processors” Gabriel H. Loh Georgia Institute of Technology College of Computing”
    http://ag-rs-www.informatik.uni-kl.de/publications...

    “Efficiently Enabling Conventional Block Sizes for Very Large Die-stacked DRAM Caches” Gabriel H. Loh⋆ Mark D. Hill†⋆ ⋆ AMD Research † Department of Computer Sciences Advanced Micro Devices, Inc. University of Wisconsin – Madison”
    http://research.cs.wisc.edu/multifacet/papers/micr...

    All of this adds up to HBM being placed on-die as a replacement of or maybe supplement to system memory. But why have system DRAM if you can build much wider bandwidth memory closer to the CPU on-die? Unless of course you build socketed HBM DRAM and a completely new system memory bus to feed it.

    Replacing system DRAM memory with on-die HBM has the same benefits for the performance and energy demand of the system as it has for GPU's. Also it makes for smaller motherboards, no memory sockets and no memory packaging.

    Of course this is all speculation. But it also makes sense.
  • amilayajr - Tuesday, June 16, 2015 - link

    With HBM in mind, does AMD hold the patent for this? Is Nvidia just going to use HBM for free? Anyone care to elaborate? Because if Nvidia gets to use it for free then that's really funny for AMD's side, considering they are the ones who researched and developed it. Am I making sense?
