
AMD’s Manju Hegde is one of the rare folks I get to interact with who has an extensive background working at both AMD and NVIDIA. He was one of the co-founders and CEO of Ageia, a company that originally tried to bring higher quality physics simulation to desktop PCs in the mid-2000s. In 2008, NVIDIA acquired Ageia and Manju went along, becoming NVIDIA’s VP of CUDA Technical Marketing. The CUDA fit was a natural one for Manju as he had spent the previous three years working on non-graphics workloads for highly parallel processors. Two years later, Manju made his way to AMD to continue his vision for heterogeneous compute work on GPUs. He is currently Corporate VP of Heterogeneous Applications and Developer Solutions at AMD.

Given what we know about the new AMD and its goal of building a Heterogeneous Systems Architecture (HSA), Manju’s position is quite important. For those of you who don’t remember back to AMD’s 2012 Financial Analyst Day, the formalized AMD strategy is to exploit its GPU advantages on the APU front in as many markets as possible. AMD has a significant GPU performance advantage compared to Intel, but in order to capitalize on that it needs developer support for heterogeneous compute. A major struggle everyone in the GPGPU space faced was enabling applications that took advantage of the incredible horsepower these processors offered. With AMD’s strategy closely married to doing more (but not all, hence the heterogeneous prefix) compute on the GPU, it needs to succeed where others have failed.

The hardware strategy is clear: don’t just build discrete CPUs and GPUs, but instead transition to APUs. This is nothing new, as both AMD and Intel have been headed in this direction for years. Where AMD sets itself apart is that it is willing to dedicate more transistors to the GPU than Intel is. The CPU and GPU are treated almost as equal-class citizens on AMD APUs, at least when it comes to die area.

The software strategy is what AMD is working on now. AMD’s Fusion12 Developer Summit (AFDS), in its second year, is where developers can go to learn more about AMD’s heterogeneous compute platform and strategy. Why would a developer attend? AMD argues that the speedups offered by heterogeneous compute can be substantial enough that they could enable new features, usage models or experiences that wouldn’t otherwise be possible. In other words, taking advantage of heterogeneous compute can enable differentiation for a developer.

That brings us to today. In advance of this year’s AFDS, Manju has agreed to directly answer your questions about heterogeneous compute, where the industry is headed and anything else AMD will be covering at AFDS. Manju has a BS in Electrical Engineering (IIT, Bombay) and a PhD in Computer Information and Control Engineering (UMich, Ann Arbor) so make the questions as tough as you can. He'll be answering them on May 21st so keep the submissions coming.

100 Comments

  • Fergy - Wednesday, May 16, 2012 - link

    So why not put a CPU in the GPU, if you are worried about round trips and caches? Reply
  • BenchPress - Wednesday, May 16, 2012 - link

    Because to make a GPU run sequential workloads efficiently it would need lots of CPU technology like out-of-order execution and a versatile cache hierarchy, which sacrifices more graphics performance than people are willing to part with. The CPU itself however is a lot closer to becoming the ideal general purpose high throughput computing device. All it needs is wide vectors with FMA and gather: AVX2. It doesn't have to make any sacrifices for other workloads.

    AVX2 is also way easier for software developers (including compiler developers like me) to adopt. And even if AMD puts hundreds of millions of dollars into HSA's software ecosystem (which I doubt) to make it a seamless experience for application developers (i.e. just switching a compiler flag), it's still going to suffer from fundamental heterogeneous communication overhead that makes things run slower than the theoretical peak. Figuring out why that happens takes highly experienced engineers, again costing companies lots of money. And some of that overhead just can't be avoided.

    Last but not least, AVX2 will be ubiquitous a few years from now, while dedicated HSA will only be available in a minority of systems. The HSA roadmap even shows that the hardware won't be complete before 2014, and then they still have to roll out all of the complex software to support it. AVX2 compiler support, on the other hand, is in beta today, for all major platforms and programming languages/frameworks.
    Reply
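    (For readers unfamiliar with the FMA-plus-gather pattern BenchPress refers to, here is a minimal scalar sketch; the loop body is exactly the operation that AVX2/FMA intrinsics such as _mm256_i32gather_ps and _mm256_fmadd_ps perform eight floats at a time. Function and variable names are made up for illustration.)

    ```c
    #include <stdio.h>

    /* acc[i] += w[idx[i]] * x[i]: an indexed ("gathered") load feeding a
       fused multiply-add. With AVX2, a vectorizing compiler or explicit
       intrinsics can process eight of these iterations per instruction. */
    void gather_fma(float *acc, const float *w, const int *idx,
                    const float *x, int n) {
        for (int i = 0; i < n; i++)
            acc[i] += w[idx[i]] * x[i];   /* one FMA per gathered element */
    }

    int main(void) {
        float acc[4] = {0, 0, 0, 0};
        float w[4]   = {10, 20, 30, 40};
        int   idx[4] = {3, 2, 1, 0};      /* non-contiguous access -> gather */
        float x[4]   = {1, 2, 3, 4};
        gather_fma(acc, w, idx, x, 4);
        for (int i = 0; i < 4; i++)
            printf("%.0f ", acc[i]);      /* prints: 40 60 60 40 */
        printf("\n");
        return 0;
    }
    ```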
  • hwhacker - Monday, May 14, 2012 - link

    Love this open dialogue, thanks Manju/Anand.

    What balance of Radeon cores do you see as a pertinent mix to execute fp128 and 256-bit instructions? Is one 64-SP unit realistic, or does the unit need to be comparably larger (or a multiple) to justify its allocation, given the multipurpose nature not only of the APU but of discrete GPU product lines that may also use the same DNA?

    What are the obstacles in the transition from the current FPU unit(s) within bulldozer CPUs to such a design? Clockspeed/unit pairings per transistor budget that may mesh better on future process nodes, for example?
    Reply
  • mrdude - Monday, May 14, 2012 - link

    1 - The recent Kepler design has shown that there might be a chasm developing between how AMD and nVidia treat desktop GPUs. While GCN showed that it can deliver both fantastic compute performance (particularly on supported openCL tasks), it also weighs in heavier than Kepler and lags behind in terms of gaming performance. The added vRAM, bus width and die space for the 7970 allow for greater compute performance but at a higher cost; is this the road ahead and will this divide only further broaden as AMD pushes ahead? I guess what I'm asking is: Can AMD provide both great gaming performance as well as compute without having to sacrifice by increasing the overall price and complexity of the GPU?

    2 - It seems to me that HSA is going to require a complete turnaround for AMD as far as how they approach developers. Personally speaking, I've always thought of AMD as the engineers in the background who did very little to reach out and work with developers, but now in order to leverage the GPU as a compute tool in tasks other than gaming it's going to require a lot of cooperation with developers who are willing to put in the extra work. How is AMD going about this, and what apps will we see transitioned to GPGPU in the near future?

    3 - Offloading FP-related tasks to the GPU seems like a natural transition for a type of hardware that already excels in such tasks, but was HSA partly the reason for the single FPU shared between the two integer cores in a Bulldozer module?

    4 - Is AMD planning to transition into an 'All APU' lineup for the future, from embedded to mobile to desktop and server?
    Reply
  • ToTTenTranz - Tuesday, May 15, 2012 - link

    This I'm also really interested in knowing.
    Especially the 3rd question.

    It seems Bulldozer/Piledriver sacrificed quite a bit of parallel FP performance.
    Does this mean that HSA's purpose is to have only a couple of powerful FP units for some (rare?) FP128 workloads while leaving the rest of the FP calculations (FP64 and below) to the GPU? Will that eventually be completely transparent to the developer?

    And please, will someone just kick the spamming avx dude?
    Reply
  • A5 - Monday, May 14, 2012 - link

    What is AMD doing to make OpenCL more pleasant to work with?

    The obvious deficiency at the moment is the toolchain, and (IMO) the language itself is more difficult to work with for people who are not experienced with OpenGL. As someone with a C background, I was able to get a basic CUDA program running in under 1/3rd of the time it took me to get the same program implemented and functional in OpenCL.
    Reply
  • ltcommanderdata - Monday, May 14, 2012 - link

    1. WinZip and AMD have been promoting their joint venture in implementing OpenCL hardware-accelerated compression and decompression. Owning an AMD GPU, I appreciate it. However, it's been reported that WinZip's OpenCL acceleration only works on AMD GPUs. What is the reasoning behind this? Isn't it hypocritical, given AMD's previous stance against proprietary APIs, namely CUDA, that AMD would then support development of a vendor-specific OpenCL program?

    2. This may be related to the above situation. Even with standardized, cross-platform, cross-vendor APIs like OpenCL, developers need to make vendor-specific, and even device-generation-specific, optimizations to get the best performance. Is there anything that can be done, whether at the API level, the driver level or the hardware level, to achieve the write-once, run-well-anywhere ideal?

    3. Comparing the current implementations of on-die GPUs, namely AMD Llano and Intel Sandy Bridge/Ivy Bridge, it appears that Intel's GPU is more tightly integrated, with the CPU and GPU sharing the last-level cache, for example. Admittedly, I don't believe CPU/GPU data sharing is exposed to developers yet, and it is only available to Intel's driver team for multimedia operations. Still, what are the advantages and disadvantages of allowing CPUs and GPUs to share/mix data? I believe memory coherency is a concern. Is data sharing the direction that things are eventually going to be headed?

    4. Related to the above, how much is CPU<>GPU communications a limitation for current GPGPU tasks? If this is a significant bottleneck, then tightly integrated on-die CPU/GPUs definitely show their worth. However, the amount of die space that can be devoted to an IGP is obviously more limited than what can be done with a discrete GPU. What can then be done to make sure the larger computational capacity of discrete GPUs isn't wasted doing data transfers? Is PCIe 3.0 sufficient? I don't remember if memory coherency was adopted for the final PCIe 3.0 spec, but would a new higher speed bus, dedicated to coherent memory transfers between the CPU and discrete GPU be needed?

    5. In terms of gaming, when GPGPU began entering consumer consciousness with the R500 series, GPGPU physics seemed to be the next big thing. Now that highly programmable GPUs are commonplace and the APIs have caught up, mainstream GPGPU physics is nowhere to be found. The most common current use cases for GPGPU in games seem to be decompressing textures and computing ambient occlusion. What happened to GPGPU physics? Did developers determine that since multi-core CPUs are generally underutilized in games, there is plenty of room to expand physics on the CPU without having to bother with the GPU? Is GPGPU physics coming eventually? I could see concerns about contention between running physics and graphics on the same GPU, but given most CPUs are coming integrated with a GPGPU IGP anyway, the ideal configuration would be a multi-core CPU for game logic, an IGP as a physics accelerator, and a discrete GPU for graphics.
    Reply
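    (To put question 4's bandwidth concern in rough numbers, here is a back-of-envelope sketch. The figures are approximate, not from AMD: ~16 GB/s usable for PCIe 3.0 x16, ~264 GB/s for the Radeon HD 7970's on-board memory.)

    ```python
    # Rough, illustrative figures; real-world throughput varies with overhead.
    PCIE3_X16_GBPS = 16.0    # approx. usable PCIe 3.0 x16 bandwidth, one way
    HD7970_GBPS = 264.0      # approx. HD 7970 on-board memory bandwidth
    DATA_GB = 1.0            # shipping 1 GB of input to a discrete GPU

    transfer_ms = DATA_GB / PCIE3_X16_GBPS * 1000   # time spent on the bus
    local_read_ms = DATA_GB / HD7970_GBPS * 1000    # time to read it once locally

    print(f"PCIe transfer:  {transfer_ms:.1f} ms")   # ~62.5 ms
    print(f"local GPU read: {local_read_ms:.1f} ms") # ~3.8 ms
    # The bus is roughly 16x slower than local memory, so unless a kernel
    # reuses each byte many times, the transfer, not the ALUs, sets the
    # speed limit -- the case for tighter CPU/GPU integration.
    ```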
  • ltcommanderdata - Tuesday, May 15, 2012 - link

    As a follow up, it looks like the just released Trinity brings improved CPU/GPU data sharing as per question 3 above. Maybe you could compare and contrast Trinity and Ivy Bridge's approach to data sharing and give an idea of future directions in this area? Reply
  • GullLars - Monday, May 14, 2012 - link

    My question:
    Will GPGPU acceleration mainly improve embarrassingly parallel and compute-bandwidth-constrained applications, or will it also be able to accelerate smaller pieces of work that are parallel to a significant degree?
    And what is the latency associated with branching off and running a piece of code on the parallel part of the APU? (e.g. a method called by a program to work on a large set of independent data in parallel)
    Reply
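    (GullLars's dispatch-latency question can be felt even without GPU hardware. The sketch below times a function call routed through a worker process versus run in-process, with the inter-process round trip standing in, very loosely, for a kernel launch; the analogy and numbers are illustrative only, and actual APU dispatch latency is exactly what HSA aims to reduce.)

    ```python
    import time
    from multiprocessing import Pool

    def square_all(xs):
        """The 'kernel': trivially parallel work on independent data."""
        return [x * x for x in xs]

    if __name__ == "__main__":
        data = list(range(100))          # a small piece of work

        with Pool(processes=1) as pool:  # worker stands in for the 'far' device
            pool.apply(square_all, ([0],))   # warm up the worker first
            t0 = time.perf_counter()
            offloaded = pool.apply(square_all, (data,))
            offload_s = time.perf_counter() - t0

        t0 = time.perf_counter()
        local = square_all(data)
        local_s = time.perf_counter() - t0

        assert offloaded == local
        # For tiny workloads the round trip dominates; only larger or more
        # arithmetic-intense work amortizes the dispatch cost.
        print(f"offloaded: {offload_s * 1e6:.0f} us, local: {local_s * 1e6:.0f} us")
    ```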
  • j1o2h3n4 - Monday, May 14, 2012 - link

    I'm under the impression that OpenCL for 3D rendering is finally as fast as CUDA. Really? If so, what rendering systems can we use?

    Nvidia has exclusivity on big-name GPU renderers like iray, V-Ray, Octane and Arion; these companies have taken years developing and optimizing for CUDA, and their requirements mention only Nvidia. By going AMD we are giving all that up. What foreseeable effort will AMD make to boost this market?
    Reply
