MIOpen: The Radeon Instinct Software Stack

While solid hardware is the necessary starting point for building a deep learning product platform, as AMD has learned the hard way over the years, it takes more than good hardware to break into the HPC market. Software is just as important as hardware (if not more so), as software developers want to get to as close to plug-and-play as possible. For this reason, frameworks and libraries that do a lot of the heavy lifting for developers are critical. Case in point, along with the Tesla hardware, the other secret ingredient in NVIDIA’s deep learning stable has been cuDNN and their other libraries, which have moved most of the effort of implementing deep learning systems off of software developers and on to NVIDIA. This is the kind of ecosystem AMD needs to be able to build for Radeon Instinct to crack the market.

The good news for AMD is that they’re already partially here with ROCm, which lays the groundwork for their software stack. They now have the low-level tools such as stable programming languages and good compilers to build further libraries and frameworks off of that. The Radeon Instinct software stack, then, is all about building on top of ROCm.

The cornerstone of AMD’s efforts here (and their answer to cuDNN) is MIOpen, a high performance deep learning library for Radeon Instinct. AMD’s performance slides should be taken with a suitably large grain of salt when it comes to competitive comparisons, but they none the less hammer the point home that the company has been focused on putting together a powerful library to support their cards. The library will be responsible for providing optimized routines for basic neural network functions such as convolution, normalization, and activation functions.

Meanwhile built on top of MIOpen will be updated versions of the major deep learning frameworks, including Caffe, Torch 7, and TensorFlow. It’s these common frameworks that deep learning applications are actually built against, and as a result AMD has been lending their support to the developers of these frameworks to get MIOpen/AMD optimized paths added to them. All of this can sound a bit mundane to outsiders, but its importance cannot be overstated; it’s the low-level work that is necessary for AMD to turn the Instinct hardware into a complete ecosystem.

Alongside their direct library and framework support, when it comes to the Instinct software stack, expect to see AMD once again hammer the benefits of being open source. AMD has staked the entire ROCm platform on this philosophy, so it’s to be expected. None the less it’s an interesting point of contrast to NVIDA’s largely closed ecosystem. AMD believes that deep learning developers are looking for a more open software stack than what NVIDIA has provided – that being closed has limited developers’ ability to make full use of NVIDIA’s platform – so this will be AMD’s opportunity to put that to the test.

Instinct Servers

Finally, along with creating a full hardware and software ecosystem for the Instinct product family, AMD also has their eye on the bigger picture, going beyond individual cards and out to servers and whole racks. As a manufacturer of both GPUs and CPUs, AMD is in a rare position to be able to offer a complete hardware platform, and as a result the company is keen to take advantage of the synergy between CPU and GPU to offer something that their rivals cannot.

The basis for this effort is AMD’s upcoming Naples platform, the server platform based on Zen. Besides offering a potentially massive performance increase over AMD’s now well-outdated Bulldozer server platform, Naples lets AMD flex their muscle in heterogeneous applications, tapping into their previous experience with HSA. This is a bit more forward looking – Naples doesn’t have an official launch date yet – but AMD is optimistic about their ability to own large-scale deployments, providing both the CPU and the GPUs in large deep learning installations.

Going a bit off the beaten path here, perhaps the most interesting aspect of Naples as it intersects with Radeon Instinct comes down to PCIe lanes. All signs point to Naples offering at least 64 PCIe lanes per CPU; this is an important metric because it means there are enough lanes to give up to 4 Instinct cards a full, dedicated PCIe x16 link to the host CPU and the other cards. Intel’s Xeon platform only offers 40 PCIe lanes, which means a quad-card configuration has to either sacrifice on bandwidth or latency, trading off between a mix of x8 and x16 slots, using two Xeon CPUs, or building in a high-end PCIe switch to route together 4 x16 slots. Ultimately for installations focusing on GPU-heavy workloads, this gives AMD a distinct advantage since it means they can drive 4 Instinct cards off of a single CPU, making it a cheaper option than the aforementioned Xeon configurations.

In any case, AMD has already lined up partners to show off Naples server configurations for the Radeon Instinct. SuperMicro and Inventec are both showcasing server/rack designs for anywhere between 3 and 120 Instinct MI25 cards. The largest systems will of course involve off-system networking and more complex networking fabrics, and while AMD isn’t saying too much on the subject at this time, it’s clear that they’re being mindful of what they need to support truly massive clusters of cards.

Closing Thoughts

Wrapping things up with today’s Radeon Instinct announcement, while today’s revelations are closer to a teaser than a fully fleshed out product announcement, it’s none the less clear that AMD is about to embark on their most aggressive move in the GPU server space since their first FireStream cards almost a decade ago. Making gains in the server space has long been one of the keys to AMD’s success, both for CPUs and GPUs, and with the Radeon Instinct hardware and overarching initiative, AMD has laid out a critical plan for how to get there.

Not that any of this will come easy for AMD. Breaking back into the server market is a recurring theme for them, and their struggles there are why it’s recurring. NVIDIA is already several steps ahead of AMD in the deep learning GPU market, so AMD needs to be quick to catch up. The good news for AMD here is that unlike the broader GPU server market, the deep learning market is still young, so AMD has the opportunity to act before anyone gets too entrenched. It’s still early enough in the game that AMD believes that if they can flip just a few large customers – the Googles and Facebooks of the world – that they can make up for lost time and capture a significant chunk of the deep learning market.

With that said, as the Radeon Instinct products are not set to ship until H1 of 2017, a lot can change, both inside and outside of AMD. The company has laid down what looks to be a solid plan, but now they need to show that they can follow-through on it, executing on both hardware and software on schedule, and handling the inevitable curveball. If they can do that, then the deep learning market may very well be that server GPU success that the company has spent much of the past decade looking for.

Radeon Instinct Hardware: Polaris, Fiji, Vega
POST A COMMENT

39 Comments

View All Comments

  • CoD511 - Saturday, December 31, 2016 - link

    Well, what I find shocking is the P4 with a 5.5TFLOP rating at 50w/75w versions as the rated maximum power, not even using TDP to obfuscate the numbers. It's right near the output of the 1070 but the power numbers are just, what? If that's true and it may well be considering they're available products, I wonder how they've got that set up to draw so little power yet output so much or process at such speed. Reply
  • jjj - Monday, December 12, 2016 - link

    The math doesn't work like that at all.
    Additionally, we don't know the die size and the GPU die is not the only thing using power on a GPU AiB.
    What we do know is that it gets more at 300W rated TDP than Nvidia's P100.
    Reply
  • ddriver - Monday, December 12, 2016 - link

    "The math doesn't work like that at all." - neither do flamboyant statements devoid of substantiation. The numbers are exactly where I'd expect them to be based on rough estimates on the cost of implementing a more fine-grained execution engine. Reply
  • RussianSensation - Monday, December 12, 2016 - link

    Almost everyone has gotten this wrong this generation. It's not 2 node jumps because the 14nm GloFo and 16nm TSMC are really "20nm equivalent nodes." The 14nm/16nm at GloFo and TSMC are more marketing than a true representation.

    "Bottom line, lithographically, both 16nm and 14nm FinFET processes are still effectively offering a 20nm technology with double-patterning of lower-level metals and no triple or quad patterning."

    https://www.semiwiki.com/forum/content/1789-16nm-f...

    Intel's 14nm is far superior to the 14nm/16nm FinFET nodes offered by GloFo and TSMC at the moment.
    Reply
  • abufrejoval - Monday, December 19, 2016 - link

    Which is *exactly* why I find the rumor that AMD is licensing its GPUs to Chipzilla to replace Intel iGPUs so scary: With that Intel would be able to produce true HSA APUs with HBM2 and/or EDRAM which nobody else can match.

    Intel has given away more than 50% of silicon real-estate for years for free to starve off Nvidia and AMD (isn't that illegal silicon dumping?) and now they could be ripping the crown jewels off a starving AMD to crash NVidia where Knights Lansing failed.

    AMD having on-par CPU technology now is only going to pull some punch, when it's accompanied with a powerful GPU part in their APUs that Intel can't match and NVidia can't deliver.

    They license that to Intel, they are left with nothing to compete with.

    Perhaps Intel lured AMD by offering their foundries for dGPU, which would allow ATI to make a temporary return. I can't see Intel feeding snakes at their fab-bosom (or producing "Zenselves").

    At this point in the silicon end game, technology becomes a side show to politics and it's horribly fascinating to watch.
    Reply
  • cheshirster - Sunday, January 8, 2017 - link

    If Apple is a customer everything is possible. Reply
  • lobz - Tuesday, December 13, 2016 - link

    ddriver......

    do you have any idea what else is going on under the hood of that surprisingly big card? =}

    there could be a lot of things accumulating that add up to <300W, which is still lower then the P100's 10,6 TF @ 300W =}
    Reply
  • hoohoo - Wednesday, December 14, 2016 - link

    You're being hyperbolic.

    24.00 W/TF for Vega.
    21.34 W/TF for Fiji.

    12% higher power use for Vega. That's not really gutted.
    Reply
  • Haawser - Monday, December 12, 2016 - link

    PCIe P100 = 18.7TF of 16bit in 250W

    PCie Vega = ~24TF of 16bit in 300W

    In terms of perf/W Vega might get ~22% more perf for ~20% more power. So essentially they should be near as darnit the same. Except AMD will probably be cheaper, and because each card is more powerful, you'll be able to pack more compute into a given amount of rack space. Which is what the people who run multi-million $ HPC research machines will *really* be interested in, because that's kind of their job.
    Reply
  • Ktracho - Monday, December 12, 2016 - link

    Not all servers can handle providing 300 W to add in cards, so even if they are announced as 300 W cards, they may be limited to something closer to 250 W in actual deployments. Reply

Log in

Don't have an account? Sign up now