HBM2 & “The World’s Most Scalable GPU Memory Architecture”

With the launch of the Fiji GPU and the resulting Radeon R9 Fury products in 2015, AMD became the first GPU manufacturer to ship first-generation High Bandwidth Memory (HBM). The ultra-wide memory standard essentially turned the usual rules of GPU memory on their head, replacing narrow, high-clocked memory (GDDR5) with wide, low-clocked memory. By taking advantage of Through-Silicon Vias (TSVs) and silicon interposers, HBM could offer far more bandwidth than GDDR5 while consuming less power and taking up less space.

Now for Vega, AMD is back again with support for the next generation of HBM technology, HBM2. In fact this is the very first thing we ever learned about Vega, going back to AMD’s roadmap from last year where it was the sole detail listed for the architecture.

HBM2 builds on first-generation HBM, offering welcome improvements in both bandwidth and capacity. In terms of bandwidth, HBM2 can clock at up to 2Gbps per pin, twice the rate of HBM1. This means that at those clockspeeds (and I’ll note that at least so far we haven’t seen any 2Gbps HBM2), AMD can either double their memory bandwidth or cut the number of HBM stacks they need in half to get the same amount of bandwidth. The latter point is of particular interest, as we’ll get to in a bit.
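The bandwidth math is easy to sanity-check. Each HBM stack presents a 1024-bit interface, so per-stack bandwidth is simply the per-pin rate times the interface width. A quick sketch (the stack counts here mirror Fiji's four-stack configuration; the two-stack HBM2 arrangement is our own arithmetic, not a confirmed Vega spec):

```python
# Per-stack bandwidth for a 1024-bit HBM interface:
#   bandwidth (GB/s) = pin rate (Gbps) * interface width (bits) / 8
WIDTH_BITS = 1024  # bits per HBM stack

def stack_bandwidth_gbps(pin_rate_gbps: float) -> float:
    """GB/s delivered by one HBM stack at the given per-pin rate."""
    return pin_rate_gbps * WIDTH_BITS / 8

hbm1 = stack_bandwidth_gbps(1.0)   # Fiji-era HBM1 at 1 Gbps/pin
hbm2 = stack_bandwidth_gbps(2.0)   # HBM2 at its 2 Gbps/pin ceiling

print(hbm1, hbm2)          # 128.0 256.0 GB/s per stack
print(4 * hbm1, 2 * hbm2)  # 512.0 512.0 -> 2 HBM2 stacks match Fiji's 4
```

This is exactly the trade-off described above: at 2Gbps, half the stacks deliver the same 512GB/s that Fiji needed four stacks to reach.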

But more important still are the capacity increases. HBM1 stacks topped out at 1GB each, which means Fiji could have no more than 4GB of VRAM. HBM2 stacks go much higher – up to 8GB per stack – which means AMD’s memory capacity problems when using HBM have for all practical purposes gone away. AMD could in time offer 8GB, 16GB, or even 32GB of HBM2 memory, which is more than competitive with current GDDR5 memory types.

Meanwhile it’s very interesting to note that with Vega, AMD is calling their on-package HBM stacks “high-bandwidth cache” rather than “VRAM” or similar terms as was the case with Fiji products.

This is a term that can easily be misread – and it’s the one area where perhaps it’s too much of a tease – but despite the name, there are no signals from AMD right now that it’s going to be used as a cache in the pure, traditional sense. Rather, because AMD has already announced that they’re looking into other ideas such as on-card NAND storage (the Radeon Pro SSG), they are looking at memory more broadly.

And this brings us to what AMD is calling “The World’s Most Scalable GPU Memory Architecture”. Along with supporting HBM, AMD has undertaken a lot of under-the-hood work to better support large dataset management between the high bandwidth cache (HBM2), on-card NAND, and even farther out sources like system RAM and network storage.

The basic idea here is that, especially in the professional space, data sets are vastly larger than local storage. So there needs to be a sensible system in place to move that data across the various tiers of storage. This may sound like a simple concept, but in fact GPUs altogether do a pretty bad job of handling situations in which a memory request has to go off-package. AMD wants to do a better job here, both in deciding what data actually needs to be on-package and in breaking up those requests so that “data management” isn’t just moving around a few very large chunks of data. The latter makes for an especially interesting point, as it could potentially lead to a far more CPU-like process for managing memory, with a focus on pages instead of datasets.
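To make the page-versus-dataset distinction concrete, here is a toy model of page-granular residency in a fast tier backed by a larger, slower one. Everything here – the class, the LRU policy, the page sizes – is illustrative, not AMD's actual mechanism; the point is only that page-level movement faults on first touch rather than copying whole datasets:

```python
from collections import OrderedDict

class PagedCache:
    """Toy model of page-granular residency in a fast tier (e.g. HBM2)
    backed by a larger, slower tier. Names and policy are illustrative."""
    def __init__(self, capacity_pages: int):
        self.capacity = capacity_pages
        self.resident = OrderedDict()   # page id -> True, kept in LRU order
        self.faults = 0

    def access(self, page: int) -> None:
        if page in self.resident:
            self.resident.move_to_end(page)    # hit: refresh LRU position
            return
        self.faults += 1                       # miss: fetch from slower tier
        if len(self.resident) >= self.capacity:
            self.resident.popitem(last=False)  # evict least recently used page
        self.resident[page] = True

# A working set of 5 pages streamed through a 4-page fast tier:
# only first touches (and one eviction refill) fault; reuse stays resident.
cache = PagedCache(capacity_pages=4)
for page in [0, 1, 2, 3, 0, 1, 4, 0]:
    cache.access(page)
print(cache.faults)  # 5
```

Contrast this with the "few very large chunks" approach the article describes: moving the whole 5-page dataset in and out would cost far more traffic than the 5 page faults above.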

Interestingly, to drive this point home, AMD actually turned to games rather than professional applications.  Plotting out the memory allocation and usage patterns of The Witcher III and Fallout 4, AMD finds that both games allocate far more memory than they actually use, by nearly a factor of 2x. Part of this is undoubtedly due to the memory management model of the DirectX 11 API used by both games, but a large factor is also simply due to the fact that this is traditionally what games have always done. Memory stalls are expensive and games tend to be monolithic use cases, so why not allocate everything you can, just to be sure you don’t run out?

The end result here is that AMD is painting a very different picture for how they want to handle memory allocations and caching on Vega and beyond. In the short term it’s professional workloads that stand to gain the most, but in the long run this is something that could impact games as well. And not to be ignored is virtualization; AMD’s foray into GPU virtualization is still in its early days, but this likely will have a big impact on virtualization as well. In fact I imagine it’s a big reason why AMD is giving Vega the ability to support a relatively absurd 512TB of virtual address space, many times the size of local VRAM. Multi-user time-sharing workloads are a prime example of where large address spaces can be useful.
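That address-space figure is easy to sanity-check: 512TB is 2^49 bytes, i.e. 49-bit virtual addresses. Comparing it against a hypothetical 8GB of local HBM2 (an assumed capacity, not a confirmed Vega SKU) shows just how much headroom that leaves:

```python
import math

va_bytes = 512 * 2**40           # 512 TB of virtual address space
print(int(math.log2(va_bytes)))  # 49 -> 49-bit virtual addresses

local_vram = 8 * 2**30           # hypothetical 8 GB of local HBM2
print(va_bytes // local_vram)    # 65536x the size of local VRAM
```

At four orders of magnitude beyond local VRAM, the space is clearly sized for mapping remote tiers (NAND, system RAM, network storage) and for many virtualized users at once, not for local memory alone.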

ROPs & Rasterizers: Binning for the Win(ning)

We’ll round out our overview of AMD’s Vega teaser with a look at the front- and back-ends of the GPU architecture. While AMD has clearly put quite a bit of effort into the shader core, shader engines, and memory, they have not ignored the rasterizers at the front-end or the ROPs at the back-end. In fact this could be one of the most important changes to the architecture from an efficiency standpoint.

Back in August, our pal David Kanter discovered one of the important ingredients of the secret sauce behind NVIDIA’s efficiency optimizations. As it turns out, NVIDIA has been doing tile-based rasterization and binning since Maxwell, and this was likely one of the big reasons Maxwell’s efficiency increased by so much. Though NVIDIA still refuses to comment on the matter, from what we can ascertain, breaking up a scene into tiles has allowed NVIDIA to keep a lot more traffic on-chip, which both saves memory bandwidth and cuts down on very expensive accesses to VRAM.

For Vega, AMD will be doing something similar. The architecture will add support for what AMD calls the Draw Stream Binning Rasterizer, which true to its name, will give Vega the ability to bin polygons by tile. By doing so, AMD will cut down on the number of memory accesses by working with smaller tiles that can stay on-chip. This will also allow AMD to do a better job of culling hidden pixels, keeping them from making it to the pixel shaders and consuming resources there.
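The core of binning is simple to sketch. The snippet below sorts triangles into the screen tiles their bounding boxes overlap – a deliberately simplified stand-in for what a binning rasterizer does (real hardware also tests tile corners against triangle edges, and we don't know AMD's actual tile size; 32 pixels is an assumption):

```python
def bin_triangles(triangles, screen_w, screen_h, tile=32):
    """Assign each triangle to the screen tiles its bounding box overlaps.
    A simplified sketch of binning; real rasterizers refine this with
    edge tests to reject tiles the triangle doesn't actually cover."""
    bins = {}
    for tri_id, verts in enumerate(triangles):   # verts: [(x, y), ...]
        xs = [x for x, _ in verts]
        ys = [y for _, y in verts]
        x0, x1 = max(min(xs), 0), min(max(xs), screen_w - 1)
        y0, y1 = max(min(ys), 0), min(max(ys), screen_h - 1)
        for ty in range(int(y0) // tile, int(y1) // tile + 1):
            for tx in range(int(x0) // tile, int(x1) // tile + 1):
                bins.setdefault((tx, ty), []).append(tri_id)
    return bins

# One small triangle lands in a single 32x32 tile; a larger one spans nine.
bins = bin_triangles([[(2, 2), (10, 2), (2, 10)],
                      [(0, 0), (70, 0), (0, 70)]], 128, 128)
print(bins[(0, 0)])  # [0, 1] -> both triangles touch the top-left tile
print(len(bins))     # 9 tiles touched in total
```

Once triangles are binned, each tile's color and depth data can live on-chip while every triangle touching that tile is rasterized, which is where the bandwidth savings and early hidden-pixel culling come from.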

As we have almost no detail on how AMD or NVIDIA are doing tiling and binning, it’s impossible to say with any degree of certainty just how close their implementations are, so I’ll refrain from any speculation on which might be better. But I’m not going to be too surprised if in the future we find out both implementations are quite similar. The important thing to take away from this right now is that AMD is following a very similar path to where we think NVIDIA captured some of their greatest efficiency gains on Maxwell, and that in turn bodes well for Vega.

Meanwhile, on the ROP side of matters, besides baking in the necessary support for the aforementioned binning technology, AMD is also making one other change to cut down on the amount of data that has to go off-chip to VRAM. AMD has significantly reworked how the ROPs (or as they like to call them, the Render Back-Ends) interact with their L2 cache. Starting with Vega, the ROPs are now clients of the L2 cache rather than the memory controller, allowing them to better and more directly use the relatively spacious L2 cache.

This is especially significant for one specific graphics task: rendering to a texture (as opposed to rendering to a frame buffer to be immediately displayed). Render to texture is an especially common operation for deferred shading, and while deferred shading itself isn’t new, its usage is increasing. With this change to Vega, the ROPs can now send a scene rendered to a texture to the L2 cache, which can in turn be fetched by the texture units for reuse in the next stage of the rendering process. Any potential performance improvements from this change are going to be especially game-specific since not every game uses deferred shading, but it’s one of those corner cases that is important for AMD to address in order to ensure more consistent performance.
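A toy traffic model shows why making the ROPs L2 clients matters for this write-then-read pattern. The function and its 4MB L2 figure are our own illustrative assumptions, not disclosed Vega specifications:

```python
def off_chip_traffic(pixels, bytes_per_pixel, l2_bytes, rops_are_l2_clients):
    """Toy count of off-chip bytes for a render-to-texture pass that is
    immediately re-read by the texture units. Illustrative only."""
    rt_bytes = pixels * bytes_per_pixel
    if rops_are_l2_clients and rt_bytes <= l2_bytes:
        return 0                    # ROP writes land in L2, texture reads hit L2
    return 2 * rt_bytes             # write out to VRAM, then read it back

# A 512x512 RGBA8 render target against a hypothetical 4 MB L2:
rt = off_chip_traffic(512 * 512, 4, 4 * 2**20, rops_are_l2_clients=False)
print(rt)  # 2097152 -> 1 MB written to VRAM, 1 MB read back
rt = off_chip_traffic(512 * 512, 4, 4 * 2**20, rops_are_l2_clients=True)
print(rt)  # 0 -> the round trip stays on-chip
```

Real render targets are larger than any L2, so in practice the savings come from partial hits and write combining rather than the all-or-nothing case modeled here; but the direction of the win is the same.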

154 Comments

View All Comments

  • jjj - Thursday, January 05, 2017 - link

    $599 for Titan X Pascal (or better) perf?
  • nathanddrews - Thursday, January 05, 2017 - link

    If they have TitanXP performance, it's going to be more expensive than the 1080. You know NVIDIA is just waiting for a chance to release a $699 or $799 1080Ti, so whatever AMD brings out, you can bet there's gonna be a clash of titans. Er, titans and stars, that is.
  • jjj - Thursday, January 05, 2017 - link

    $599 might be too aggressive, but it depends on where the die size lands and what Nvidia does.
    Vega has almost 15% more FLOPS than the Titan XP, so it remains to be seen how well they utilize that computing power and how much silicon it takes.
    The likely 8GB of HBM2 does help on the cost side.
    It also depends on what other SKUs they got and when. A high price limits volumes but if they also have lesser SKUs at launch, they can afford to price the best SKU higher.
  • haukionkannel - Thursday, January 05, 2017 - link

    True! Nvidia has guaranteed that the fastest Vegas can be sold for over $1000...
    Let's hope that the cheaper options are near $500-600.

    The interesting part is when AMD will use these next-generation GPU units in their mid- and low-range products. Maybe next summer or next autumn? Then we will get interesting devices in the $150-350 slots! Most probably with GDDR5 in the low end and maybe GDDR5+ in the upper mid-range GPUs.
  • jjj - Thursday, January 05, 2017 - link

    Why not get a $350 card out at launch? They have nothing above Polaris 10, and the $350 price band is important.
  • eachus - Saturday, January 14, 2017 - link

    There is a new Polaris chip (Polaris 12) in the works. It may be intended only for APUs, where it would be mounted on an interposer with a Ryzen chip. It is not clear what AMD is going to do in the gap between the RX 480 and Vega 10. Vega 11 is expected to replace the RX 480.*

    * Understand what "replace" means here. It doesn't mean that AMD will stop selling Polaris GPUs. It means that AMD expects Vega 11 to have better price/performance than the RX 480, and that the price range where the RX 480 currently sells will be starved of oxygen. I do expect a dual Polaris 10 card to ship, and there is also an RX 490 design floating around. (It may be a Polaris 10 chip with a higher clock speed, more power of course, and 8GB of GDDR5X memory.)

    Always remember that marketing gets the last say, not engineering. So only one or none of these products may show up. It is also not clear when Vega 11 will arrive. If it is late in this year, or early in next year, there will be enough time to market the additional Polaris parts.
  • Jad77 - Thursday, January 05, 2017 - link

    So, by the time this hits the streets, Nvidia will already have another hardware iteration out? It's likely too late, but if you're still holding on, sell your AMD stock.
  • Darkknight512 - Thursday, January 05, 2017 - link

    Likely Nvidia will have an answer; that has always been the case and is barely even worth mentioning anymore, at least until AMD gets the upper hand one of these days. They did it to Intel during the Athlon days. It is very much possible: they have smart engineers, they just don't have enough of them, but that often does not matter if they can work more efficiently. They have one thing going for them, and that is that a team larger by 2x often results in <2x the work done.

    One of my former bosses from when I worked in the silicon industry said, "AMD has good technology, they just have terrible luck being the underdog in both industries they compete in at the same time." I wholly agree; with some luck they can actually come out on top. Nvidia is spending a lot of money diversifying.
  • MLSCrow - Thursday, January 05, 2017 - link

    Honestly, I, for once, don't think nVidia will have an answer. I feel that their expectations of what AMD could do were so incredibly low that they felt GP100 and all of its derivatives would be enough to lay the smack down on AMD for good. Even with Volta, which seems like it's going to be a slight tweak to Pascal, it seems that Vega might just come out on top, which would make more sense of AMD's "Poor Volta" slide – a rather idiotic move unless AMD truly had something to be that cocky about.
  • Yojimbo - Thursday, January 05, 2017 - link

    NVIDIA isn't trying to lay the smackdown on AMD for good. NVIDIA has been evolving in a different direction from AMD. AMD, probably because they have been cash-strapped, has not been able to invest the money necessary to become a platform-based company the way NVIDIA has.

    Also, Volta will not be a small tweak to Pascal. Pascal was a die shrink and small tweak to Maxwell (from the point of view of the underlying architecture, not the features that it enables). Volta is supposed to have ~1.7 times the performance and efficiency of Pascal on the same process technology. It won't be out until 6 to 9 months after Vega, however. But I'm very leery about taking AMD's promises at face value. Even if Vega is as high performance and efficient as AMD claims, it still uses HBM2 which adds significant cost to the manufacture of the chip. That means they will only be able to put a limited amount of pricing pressure on NVIDIA.
