AMD’s Manju Hegde is one of the rare folks I get to interact with who has an extensive background working at both AMD and NVIDIA. He was one of the co-founders and CEO of Ageia, a company that originally tried to bring higher quality physics simulation to desktop PCs in the mid-2000s. In 2008, NVIDIA acquired Ageia and Manju went along, becoming NVIDIA’s VP of CUDA Technical Marketing. The CUDA fit was a natural one for Manju as he spent the previous three years working on non-graphics workloads for highly parallel processors. Two years later, Manju made his way to AMD to continue his vision for heterogeneous compute work on GPUs. His current role is as the Corporate VP of Heterogeneous Applications and Developer Solutions at AMD.

Given what we know about the new AMD and its goal of building a Heterogeneous System Architecture (HSA), Manju’s position is quite important. For those of you who don’t remember back to AMD’s 2012 Financial Analyst Day, the formalized AMD strategy is to exploit its GPU advantages on the APU front in as many markets as possible. AMD has a significant GPU performance advantage compared to Intel, but in order to capitalize on that it needs developer support for heterogeneous compute. A major struggle everyone in the GPGPU space faced was enabling applications that took advantage of the incredible horsepower these processors offered. With AMD’s strategy closely married to doing more (but not all, hence the heterogeneous prefix) compute on the GPU, it needs to succeed where others have failed.

The hardware strategy is clear: don’t just build discrete CPUs and GPUs, but instead transition to APUs. This is nothing new, as both AMD and Intel have been headed in this direction for years. Where AMD sets itself apart is that it is willing to dedicate more transistors to the GPU than Intel is. The CPU and GPU are treated almost as equal class citizens on AMD APUs, at least when it comes to die area.

The software strategy is what AMD is working on now. AMD’s Fusion12 Developer Summit (AFDS), in its second year, is where developers can go to learn more about AMD’s heterogeneous compute platform and strategy. Why would a developer attend? AMD argues that the speedups offered by heterogeneous compute can be substantial enough that they could enable new features, usage models or experiences that wouldn’t otherwise be possible. In other words, taking advantage of heterogeneous compute can enable differentiation for a developer.

That brings us to today. In advance of this year’s AFDS, Manju has agreed to directly answer your questions about heterogeneous compute, where the industry is headed and anything else AMD will be covering at AFDS. Manju has a BS in Electrical Engineering (IIT, Bombay) and a PhD in Computer Information and Control Engineering (UMich, Ann Arbor) so make the questions as tough as you can. He'll be answering them on May 21st so keep the submissions coming.

Comments

  • Ashkal - Monday, May 14, 2012 - link

    I am just a layman, but I suggest that instead of adding more and more cores and transistors, AMD should reduce the transistor count and optimise for compute performance per watt. To use a car analogy: rather than a Rolls-Royce with a huge engine and a power-reserve meter, or a Honda with a V8 that shuts off cylinders, build something more like a Suzuki 660cc engine that makes 60 BHP and does 26 km/l.
  • sfooo - Tuesday, May 15, 2012 - link

    What are your thoughts on OpenCL and its adoption rate, or lack thereof? How about DirectCompute?
  • caecrank - Tuesday, May 15, 2012 - link

    When is ANSYS going to start using AMD hardware again? When can we expect to see an APU that can beat a Tesla on memory size and match it in performance?
    Can we also please have a four- or eight-memory-channel Bulldozer? FP performance on it is quite good; the only thing stopping us from adopting it (over Sandy Bridge-E) is the maximum memory capacity.

    J
  • Loki726 - Tuesday, May 15, 2012 - link

    AMD Fellow Mike Mantor has a nice statement that I believe captures the core difference between GPU and CPU design.

    "CPUs are fast because they include hardware that automatically discovers and exploits parallelism (ILP) in sequential programs, and this works well as long as the degree of parallelism is modest. When you start replicating cores to exploit highly parallel programs, this hardware becomes redundant and inefficient; it burns power and area rediscovering parallelism that the programmer explicitly exposed. GPUs are fast because they spend the least possible area and energy on executing instructions, and run thousands of instructions in parallel."

    Notice that nothing in here prevents a high degree of interoperability between GPU and CPU cores.

    1) When will we see software stacks catch up with heterogeneous hardware? When can we target GPU cores with standard languages (C/C++/Objective-C/Java), compilers(LLVM, GCC, MSVS), and operating systems (Linux/Windows)? The fact that ATI picked a different ISA for their GPUs than x86 is not an excuse; take a page out of ARM's book and start porting compiler backends.

    2) Why do we need new languages for programming GPUs that inherit the limitations of graphics shading languages? Why not toss OpenCL and DirectX compute, compile C/C++ programs, and launch kernels with a library call (see the sketch at the end of this comment)? You are crippling high level languages like C++ AMP, Python, and Matlab (not to mention applications) with a laundry list of pointless limitations.

    3) Where's separable compilation? Why do you have multiple address spaces? Where is memory mapped IO? Why not support arbitrary control flow? Why are scratchpads not virtualized? Why can't SW change memory mappings? Why are thread schedulers not fair? Why can't SW interrupt running threads? The industry solved these problems in the 80s. Read about how they did it, you might be surprised that the exact same solutions apply.

    Please fix these software problems so we can move onto the real hard problems of writing scalable parallel applications.
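
    To make point 2 concrete, here is a minimal sketch of what "launch a kernel with a library call" could look like. The kernel is ordinary C++, and parallel_launch is a hypothetical stand-in for a runtime entry point, not an existing AMD or standard API; a real runtime would dispatch the iterations across GPU cores instead of looping serially.

        #include <cstddef>

        // Ordinary C++ "kernel": one logical thread per element.
        void saxpy(std::size_t i, float a, const float* x, float* y) {
            y[i] = a * x[i] + y[i];
        }

        // Hypothetical launch call: conceptually runs n instances of the kernel
        // in parallel. A serial loop stands in for the GPU dispatch here.
        template <typename Kernel, typename... Args>
        void parallel_launch(std::size_t n, Kernel kernel, Args... args) {
            for (std::size_t i = 0; i < n; ++i) {
                kernel(i, args...);
            }
        }

        int main() {
            const std::size_t n = 1 << 20;
            float* x = new float[n]();
            float* y = new float[n]();
            parallel_launch(n, saxpy, 2.0f, x, y);  // the "launch" is one library call
            delete[] x;
            delete[] y;
            return 0;
        }

    Nothing in the kernel body needs a shading-language dialect or a separate toolchain; an ordinary compiler plus a runtime library would do.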
  • BenchPress - Tuesday, May 15, 2012 - link

    AMD Fellow Mike Mantor is wrong, and he'd better know it. GPUs don't run thousands of instructions in parallel. They have wide vector units which execute THE SAME instruction across each element. So we're looking at only 24 instructions executing in parallel in the case of Trinity.

    CPUs have wide vector units too now. A modest quad-core Haswell CPU will be capable of 128 floating-point operations per cycle (4 cores × two 256-bit FMA units × 8 single-precision lanes × 2 operations per FMA), at roughly three times the clock frequency of most GPUs!

    AVX2 is bringing a massive amount of GPU technology into the CPU. So the only things still setting them apart are the ILP technology and versatile caches of the CPU. Those are good features to have around for generic computing.

    Last but not least, the high power consumption of out-of-order execution will be solved by AVX-1024. Executing 1024-bit instructions on 256-bit vector units over four cycles allows the CPU's front-end to be clock gated, and the schedulers would see far less switching activity. Hence you'd get GPU-like execution behavior within the CPU, without sacrificing any of the CPU's other advantages!

    GPGPU is a dead end and the sooner AMD realizes this the sooner we can create new experiences for consumers.
  • Loki726 - Tuesday, May 15, 2012 - link

    So your main argument is that AMD should just start bolting very wide vector units onto existing CPU cores?

    If you want to summarize it at a very high level like that, then sure, that is the type of design that everyone is evolving towards, and I'd say that you are mostly right.

    Doing this would go a long way towards solving the problems of supporting standard SW stacks. If GPGPU means heterogeneous ISAs and programming models, then I agree with you, it should be dead. If it means efficient SIMD and multi-core hardware, then I think that we need software that can use it yesterday, and I agree that tightly integrating it with existing CPU designs is important. AVX is a good start.

    Like so many other things, though, the details are not so straightforward.
    Intel already built a multi-core CPU with wide vector units. They called it Knights Ferry/Corner/Landing. However, even they used lightweight, Atom-like cores with wide vector units, not complex OOO cores. Why do you think they did that? Putting the vector units in the CPU reduces some overhead, but even with that, do you think they could have fit 50 Haswell cores into the same area? I'll stand by Mike's point about factoring out redundant HW. A balanced number of OOO CPU cores is closer to 2-4, not 50.

    Also, I disagree with your point about a GPU only executing one instruction for each vector instruction. GPUs procedurally factor out control overhead when threads are executing the same instruction at the same time. Although it is not true for all applications, for a class of programs (the ones that GPUs are good at), broadcasting a single instruction to multiple data paths with a single vector operation really is equivalent to running different instances of that same instruction in multiple threads on different cores. You can get the same effect with vector units plus some additional HW and compiler support. Wide AVX isn't enough, but it is almost enough.
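
    To illustrate that, here is a rough sketch in plain C++ of the compiler support involved; vfloat and vmask are illustrative stand-ins for vector and predicate registers, not real intrinsics. A per-thread branch becomes a compare that produces a mask, both sides execute as vector operations, and the mask selects each lane's result.

        #include <array>
        #include <cstddef>

        constexpr std::size_t LANES = 8;           // e.g. one 256-bit single-precision vector
        using vfloat = std::array<float, LANES>;   // stand-in for a vector register
        using vmask  = std::array<bool,  LANES>;   // stand-in for a predicate/mask register

        // SPMD source, per logical thread:  if (x > 0) y = 2 * x; else y = -x;
        // Vectorized form: run both sides, then blend per lane under the mask.
        vfloat spmd_body(const vfloat& x) {
            vmask active;
            vfloat then_val, else_val, y;
            for (std::size_t i = 0; i < LANES; ++i) active[i]   = (x[i] > 0.0f); // vector compare
            for (std::size_t i = 0; i < LANES; ++i) then_val[i] = 2.0f * x[i];   // "then" side
            for (std::size_t i = 0; i < LANES; ++i) else_val[i] = -x[i];         // "else" side
            for (std::size_t i = 0; i < LANES; ++i)                              // masked blend
                y[i] = active[i] ? then_val[i] : else_val[i];
            return y;
        }

        int main() {
            vfloat x = { -1.0f, 2.0f, -3.0f, 4.0f, -5.0f, 6.0f, -7.0f, 8.0f };
            vfloat y = spmd_body(x);  // eight "threads" handled in one vector-width pass
            return y[0] > 0.0f ? 0 : 1;
        }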

    Finally, you mention that Haswell will run at three times the frequency of most GPU designs. That is intentional. The high frequency isn't an advantage. Research and designs have shown over and over again that the VLSI techniques required to hit 3GHz+ degrade efficiency compared to more modestly clocked designs. Maybe someone will figure out how to do it efficiently in the future, but AFAIK my statement is true for all standard-cell-library-based flows, and for the semi-custom layout that Intel and others sometimes use.
  • BenchPress - Tuesday, May 15, 2012 - link

    I'm not suggesting to just "bolt on" wide vector units. One of the problems is you still need scalar units for things like address pointers and loop counters. So you want them to be as close as possible to the rest of the execution core, not tacked on to the side. Fortunately all AMD has to do in the short term is double the width of its Flex FP unit, and implement gather. This shouldn't be too much of a problem with the next process shrink.
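
    As a small illustration of the gather piece: SPMD code is full of per-lane indexed loads, which plain AVX has to emulate with scalar loads and inserts, while the AVX2 intrinsic below does it in one instruction (compile with -mavx2). The gather8 wrapper is just an illustrative helper.

        #include <immintrin.h>

        // Load table[idx[0..7]] into one 256-bit register (illustrative helper).
        __m256 gather8(const float* table, const int* idx) {
            __m256i vidx = _mm256_loadu_si256(reinterpret_cast<const __m256i*>(idx));
            return _mm256_i32gather_ps(table, vidx, 4);  // scale = sizeof(float)
        }

        int main() {
            float table[16] = {0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15};
            int idx[8] = {0, 2, 4, 6, 8, 10, 12, 14};
            __m256 v = gather8(table, idx);  // v = {0, 2, 4, 6, 8, 10, 12, 14}
            (void)v;
            return 0;
        }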

    Indeed Knight's Corner uses very wide vectors, but first and foremost let's not forget that the consumer product got cancelled. Intel realized that no matter how much effort you put into making it suitable for generic computing, there are still severe bottlenecks inherent to heterogeneous computing. LRBni will reappear in the consumer market as AVX2.

    And yes, I do believe Haswell could in theory compete with MICs. You can't fit the same number of cores on it, but clock speed makes up for it in large part, and you don't actually want too many cores. Most importantly, GPGPU has consistently failed to get anywhere near peak performance on real-world workloads. Haswell can do more, with less, thanks to the cache hierarchy and out-of-order execution.

    Would you care to explain why you think AVX2 isn't enough for SPMD processing on SIMD?
  • Tanclearas - Tuesday, May 15, 2012 - link

    Although I do agree that there are many opportunities for HSA, I am concerned that AMD's own efforts in using heterogeneous computing have been half-baked. The AMD Video Converter has a smattering of conversion profiles, lacks any user-customizable options (besides a generic "quality" slider), and hasn't seen any update to the profiles in a ridiculously long time (unless there were changes/additions within the last few months).

    It is no secret that Intel has put considerable effort into compiler optimizations that require very little effort on the part of developers to take advantage of. AMD's approach to heterogeneous computing appears to be simply waiting for developers to do all the heavy lifting.

    The question therefore is: when is AMD going to show real initiative with development and truly enable developers to easily take advantage of HSA? If this is already happening, please provide concrete examples. (Note that a three-day conference that also invites investors is hardly a long-term, ongoing commitment to improvement in this area.)
  • jeff_rigby - Tuesday, May 15, 2012 - link

    I'm sure you can't answer direct questions, so I'll try a workaround:

    1) Would next-generation game consoles benefit from HSA efficiencies?
    2) Will AMD's 2014 SoC designs be available earlier to a large third party with game-console volume?
    3) Can third-party CPUs like the IBM Power line be substituted for x86 processors in AMD SoCs? Is this a better choice than a Steamroller or Jaguar core, for instance? Assume that the CPU will prefetch for the GPGPU.
    4) Is my assumption correct that the Cell processor, because it already contains an older attempt at HSA internally, is unsuitable for inclusion in an AMD HSA SoC, but that individual PPC or SPU elements could be included? This is a simplistic question and does not take into account that a GPGPU is more efficient at many of the tasks SPUs would be used for.
    5) If a SoC's internal GPU isn't enough, is there going to be a line of external GPUs that use system memory and a common address space, rather than sitting on a PCIe bus with their own dedicated RAM like PC cards?
    6) In the future, will 3D-stacked memory replace GDDR memory on AMD GPU cards for PCs? I understand that it will be faster, more energy efficient, and eventually cheaper.

    Foundry questions that apply to AMD's SoC "process optimized" building blocks, made to consortium standards to reduce cost and time to market for SoCs:
    1) Are we going to see very large SoCs in the near future?
    2) What about design tools to custom-create a large substrate with the bumps and traces that the AMD building blocks attach to in a 2.5D package? GlobalFoundries was mentioned in PDFs as needing 2.5 years of lead time to design custom chips. How much lead time will be needed in the future if AMD building blocks are used to build SoCs?
    3) Will 3D-stacked memory be inside the SoC, outside the SoC, or a combination of the two, given that DRAM is temperature sensitive?
    4) Is a line of FPGAs part of the AMD building blocks?

    Thanks for any of these you can answer.

  • SleepyFE - Tuesday, May 15, 2012 - link

    Where is AMD going with their APUs? Are you trying to make them more like a CPU, or more like a GPU that can stand on its own? Since you used SIMDs for GCN, it seems to me that you are going for a more CPU-like design with a large number of SIMDs that will be recognized as a graphics card for backward-compatibility purposes (possibly implemented in the driver). But since you promote GPGPU and your APUs have weaker CPU capabilities, do you eventually intend to make it a GPU that can handle x86 instructions (for backwards compatibility)?
