As part of this evening’s AMD Capsaicin event (more on that later), AMD’s Chief Architect and SVP of the Radeon Technologies Group has announced a new Radeon Pro card unlike anything else. Dubbed the Radeon Pro Solid State Graphics (SSG), this card includes M.2 slots for adding NAND SSDs, with the goal of vastly increasing the amount of local storage available to the video card.

Details are a bit thin and I’ll update this later this evening, but in short the card utilizes a Fiji GPU and includes 2 PCIe 3.0 M.2 slots for adding flash drives to the card. These slots are then attached to the GPU (it’s unclear if there’s a PCIe switch involved or if it’s wired directly), which the GPU can then use as an additional tier of storage. I’m told that the card can fit at least 1TB of NAND – likely limited by M.2 MLC SSD capacities – which massively increases the amount of local storage available on the card.

As AMD explains it, the purpose of going this route is to offer another solution to the working set size limitations of current professional graphics cards. Even AMD’s largest card currently tops out at 32GB, and while this is a fair amount, there are workloads that can use more. This is particularly the case for workloads with massive datasets (oil & gas), or as AMD demonstrated, scrubbing through an 8K video file.

Current cards can spill over to system memory, and while the PCIe bus is fast, it’s still much slower than local memory, and it is subject to the latency of the relatively long trip and of waiting on the CPU to address requests. Local NAND storage, by comparison, offers much faster round trips, though on paper the bandwidth isn’t as good, so I’m curious to see just how it compares on real-world datasets that spill over to system memory. Meanwhile actual memory management/usage/tiering is handled by a combination of the drivers and developer software, so developers will need to code specifically for it as things stand.

For the moment, AMD is treating the Radeon Pro SSG as a beta product, and will be selling developer kits for it directly, with full availability set for 2017. For now developers need to apply for a kit from AMD, and I’m told the first kits are available immediately. Interested developers will need to have saved up their pennies though: a dev kit will set you back $9,999.

Update:

Now that AMD’s presentation is over, we have a bit more information on the Radeon Pro SSG and how it works.

In terms of hardware, the Fiji-based card is outfitted with a PCIe bridge chip – the same PEX8747 bridge chip used on the Radeon Pro Duo, I’m told – with the bridge connecting the two PCIe x4 M.2 slots to the GPU, and allowing the SSDs and the GPU to share the PCIe system connection. Architecturally the prototype card is essentially a PCIe SSD adapter and a video card on a single board, with no special connectivity in use beyond what the PCIe bridge chip provides.

The SSDs themselves are a pair of 512GB Samsung 950 Pros, which are about the fastest thing available on the market today. These SSDs are operating in RAID-0 (striped) mode to provide the maximum amount of bandwidth. Meanwhile it turns out that due to how the card is configured, the OS actually sees the SSD RAID-0 array as well, at least for the prototype design.
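As a rough illustration of what striping means here, the sketch below maps a byte offset in a RAID-0 array onto one of its member drives. The function name and the 128 KiB stripe size are assumptions made for the example; AMD has not said what stripe size the prototype uses.

```python
from typing import NamedTuple


class StripeLocation(NamedTuple):
    drive: int         # index of the member drive (0 or 1 in a two-drive array)
    drive_offset: int  # byte offset within that drive


def raid0_locate(logical_offset: int, num_drives: int = 2,
                 stripe_size: int = 128 * 1024) -> StripeLocation:
    """Map a logical byte offset in a RAID-0 array to (drive, offset).

    stripe_size is a hypothetical value chosen for illustration only.
    """
    if logical_offset < 0:
        raise ValueError("logical_offset must be non-negative")
    # Which stripe the byte falls in, and how far into that stripe.
    stripe_index, within_stripe = divmod(logical_offset, stripe_size)
    # Stripes are dealt out to the drives round-robin.
    row, drive = divmod(stripe_index, num_drives)
    return StripeLocation(drive, row * stripe_size + within_stripe)
```

Because consecutive stripes land on alternating drives, a large sequential read keeps both SSDs busy at once, which is where the extra bandwidth comes from.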

To use the SSDs, applications need to be programmed using AMD’s APIs to recognize the existence of the local storage and that it is “special,” being on the same board as the GPU itself. Ultimately the trick for application developers is directly streaming resources from the SSDs, treating them as a level of cache between the DRAM and system storage. The use of NAND in this manner does not fit into the traditional memory hierarchy very well, as while the SSDs are fast, on paper accessing system memory is faster still. But it should be faster than accessing system storage, even if it’s PCIe SSD storage elsewhere on the system. Similarly, don’t expect to see frame buffers spilling over to NAND any time soon. This is about getting large, mostly static resources closer to the GPU for more efficient resource streaming.
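To make the "level of cache" idea concrete, here is a toy model of a lookup that tries VRAM first, then the on-card NAND, and only then falls back to system storage. This is not AMD’s API, which has not been published in any detail; the class, its methods, and the LRU eviction policy are all hypothetical and exist purely for illustration.

```python
from collections import OrderedDict


class TieredResourceCache:
    """Toy model: a small fast tier (VRAM) in front of a large slow tier (on-card NAND).

    Hypothetical sketch only; it does not reflect any real driver interface.
    """

    def __init__(self, vram_slots, load_from_system_storage):
        self.vram = OrderedDict()  # LRU order, least recently used first
        self.nand = {}             # on-card SSD contents, kept after VRAM eviction
        self.vram_slots = vram_slots
        self.load_from_system_storage = load_from_system_storage

    def get(self, key):
        """Return (resource, tier_it_was_served_from)."""
        if key in self.vram:
            self.vram.move_to_end(key)  # mark as most recently used
            return self.vram[key], "vram"
        if key in self.nand:
            value, tier = self.nand[key], "nand"
        else:
            value, tier = self.load_from_system_storage(key), "system"
            self.nand[key] = value      # written once, then only read
        self.vram[key] = value
        if len(self.vram) > self.vram_slots:
            self.vram.popitem(last=False)  # evict LRU entry; NAND copy remains
        return value, tier
```

Each resource is written to the NAND tier only once and is read from there afterwards, which suits large, mostly static data and avoids needlessly spending program/erase cycles.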

To showcase the potential benefits of this solution, AMD had an 8K video scrubbing demonstration going, comparing performance between using a source file on the SSG’s local SSDs, and using a source file on the system SSD (also a 950 Pro).

The performance differential was actually more than I expected; reading a file from the SSG SSD array was over 4GB/sec, while reading that same file from the system SSD was only averaging under 900MB/sec, which is lower than what we know the 950 Pro can do in sequential reads. After putting some thought into it, I think AMD has hit upon the fact that most M.2 slots on motherboards are routed through the system chipset rather than being directly attached to the CPU. This not only adds another hop of latency, but it means crossing the relatively narrow DMI 3.0 (~PCIe 3.0 x4) link that is shared with everything else attached to the chipset.
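For a sense of scale, the back-of-the-envelope calculation below estimates the one-direction bandwidth of a PCIe 3.0 link from its 8 GT/s per-lane signaling rate and 128b/130b encoding. The function name is made up for this example, and the figure ignores packet and protocol overhead, so real-world throughput will be somewhat lower.

```python
def pcie3_bandwidth_gbps(lanes: int) -> float:
    """Approximate one-direction PCIe 3.0 bandwidth in GB/s (1e9 bytes/s).

    Ignores packet and protocol overhead, so this is an upper bound.
    """
    transfers_per_second = 8e9  # 8 GT/s per lane
    encoding = 128 / 130        # 128b/130b line coding
    bits_per_second = lanes * transfers_per_second * encoding
    return bits_per_second / 8 / 1e9


if __name__ == "__main__":
    for lanes in (4, 8, 16):
        print(f"PCIe 3.0 x{lanes}: {pcie3_bandwidth_gbps(lanes):.2f} GB/s")
```

By this estimate a single x4 link tops out a little under 4GB/sec, so a read rate above 4GB/sec is only possible with both striped drives contributing over their own x4 links, while a chipset-attached SSD has to share a roughly x4-sized DMI link with everything else on the chipset.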

Though by and large this is all at the proof of concept stage. The prototype, though impressive in some ways in its own right, is really just a means to get developers thinking about the idea and writing their applications to be aware of the local storage. And this includes not just what content to put on the SSG's SSDs, but also how to best exploit the non-volatile nature of its storage, and how to avoid unnecessary thrashing of the SSDs and burning valuable program/erase cycles. The SSG serves an interesting niche, albeit a limited one: scenarios where you have a large dataset and you are somewhat sensitive to latency and want to stay off of the PCIe bus, but don't need more than 4-5GB/sec of read bandwidth. So it'll be worth keeping an eye on this to see what developers can do with it.

In any case, while AMD is selling dev kits now, expect some significant changes by the time we see the retail hardware in 2017. Given the timeframe I expect we’ll be looking at much more powerful Vega cards, where the overall GPU performance will be much greater, and the difference in performance between memory/storage tiers is even more pronounced.

Source: AMD

120 Comments

  • BMNify - Tuesday, July 26, 2016 - link

    There are pictures floating around from the back of the card, exposing the PCB. We can clearly see that there are no GDDR5 memory chips soldered onto the PCB. The huge GPU package has a huge load of capacitors and there are one 8-pin and one 6-pin power connector, suggesting a TBP of >225-300 Watt.

    Raja did not mention Polaris for this product, neither does AMD in any info text. Other sites are already reporting that the devkit uses Fiji.

    Where does your info with Polaris 10 come from?
  • BMNify - Tuesday, July 26, 2016 - link

    confirmation by Robert Hallock:
    https://twitter.com/Thracks/status/757992332583067...
  • MLSCrow - Tuesday, July 26, 2016 - link

    I have to admit, although this product may only target an extremely niche market, the fact that AMD continues to innovate and come out with new technologies, even if those technologies are eventually mimicked and improved upon by other companies, is still a breath of fresh air every time. I like this product, even if I don't have a particular use for it, I still like that they are trying new things, having new visions, boldly going....ok sorry.
  • xenol - Tuesday, July 26, 2016 - link

    This might be something that actually uses M.2's bandwidth, as nothing else I've seen outside of benchmarks does.
  • pogostick - Tuesday, July 26, 2016 - link

    This comment board is much worse than usual. What's with all the vitriol?
  • AnnonymousCoward - Tuesday, July 26, 2016 - link

    Raw 8K 24-bit 60Hz should require ~6GB/s (7680*4320*60*24/1e9/8), and 92Hz is ~9.2GB/s. How does 4.6GB/s get the job done?
  • Mugur - Wednesday, July 27, 2016 - link

    Going back to Earth, I think that it would be a great idea to bundle a mainstream card like a Radeon RX 480 with an M.2 SSD for a good price.
  • Chaser - Thursday, July 28, 2016 - link

    Unexpected and impressive. I buy Nvidia (for now), but WTG AMD!
  • msroadkill612 - Saturday, April 29, 2017 - link

    A belated post.

    Its an awesome notion. News to me. Well done Ryan.

    Recent such raid 0 benches i have seen indicate such an array could easily exceed the bandwidth of 4 lanes, or 4GBps.

    they got
    "ATTO Read 2491 3314 +33
    ATTO Write 1568 3034 +93"

    with and w/o raid on an intel mobo w/ 3x native m.2 slots.

    Whats v interesting & important is raid0 brings read and write speeds into ~parity. Many operations require both, so the slower dictates the pace.

    If so, the effective pace really has doubled by striping 2 drives.

    AMD could well do much better numbers on their cards.

    A terabyte of space seems excessive, given 8GB is generous vram now (a big jump for coders to get their heads around and use), but i think the samsung ssds are fastest in the 512GB size.

    Yes, the prospects of it being used as a resource by the cpu via the 16 lane gpu bus, seem good. It would be the fastest storage on the system.

    Now u have me wondering Ryan. Could it be a killer feature on future top end vega gpu cards?

    Its pretty consistent w/ their stated vega philosophy/features (as of apr 2017)- memory sharing/layers of memory like l1 l2 l3 cache on a cpu.

    A massive l3 gpu cache perhaps?

    We know they will add hbm2, why not a ssd controller?

    If its real for vega, curiously they have gone both ways. vram is much faster, and have added a ~memory layer that is much slower, but huge.
  • msroadkill612 - Saturday, April 29, 2017 - link

    clarification: the intel mobo above, had 3x m.2 sockets native, but only 2 were used for a raid0 array.
