Answered by the Experts: Heterogeneous and GPU Compute with AMD’s Manju Hegde

Author: Anand Lal Shimpi

by Anand Lal Shimpi on May 21, 2012 12:58 PM EST


Latency and overhead by GullLars

Will the GPGPU acceleration mainly improve embarrassingly parallel and compute bandwidth constrained applications, or will it also be able to accelerate smaller pieces of work that are parallel to a significant degree?

Hitherto, only workloads with a significant data-parallel component could benefit from heterogeneous compute. With HSA APUs, however, communication between GPU and CPU is no longer subject to unnecessary copies, no cache flushes are automatically invoked, and the optimization of the runtime and driver stacks greatly reduces the dispatch latency. As a result, the type and number of workloads that benefit from heterogeneous compute are greatly increased.

And what is the latency associated with branching off and running a piece of code on the parallel part of the APU? (e.g., as a method called by a program to work on a large set of independent data in parallel)

It differs from product to product.
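To make the trade-off in this exchange concrete, here is a rough sketch of how fixed dispatch latency and per-element copy cost decide how small a piece of work can be before offloading stops paying off. It is a simplified linear model, not AMD's method, and every number in it is a made-up placeholder rather than a measurement of any product; the function names are hypothetical.

```python
# Illustrative model only. All timings are hypothetical placeholders in
# nanoseconds, not measurements of any AMD (or other) hardware.

def offload_time_ns(n, dispatch_ns, copy_ns_per_elem, gpu_ns_per_elem):
    """Time to process n elements on the GPU: a fixed dispatch latency
    plus per-element copy and compute costs."""
    return dispatch_ns + n * (copy_ns_per_elem + gpu_ns_per_elem)


def cpu_time_ns(n, cpu_ns_per_elem):
    """Time to process n elements on the CPU."""
    return n * cpu_ns_per_elem


def break_even_elements(dispatch_ns, copy_ns_per_elem, gpu_ns_per_elem,
                        cpu_ns_per_elem):
    """Smallest n for which offloading is strictly faster than the CPU,
    or None if offloading never wins under this model."""
    saving = cpu_ns_per_elem - (copy_ns_per_elem + gpu_ns_per_elem)
    if saving <= 0:
        return None
    # Offloading wins once n * saving exceeds the fixed dispatch latency.
    return dispatch_ns // saving + 1


# Placeholder scenario A: high dispatch latency and explicit copies.
with_copies = break_even_elements(50_000, 2, 1, 10)
# Placeholder scenario B: lower dispatch latency and no copies.
no_copies = break_even_elements(5_000, 0, 1, 10)
print(with_copies, no_copies)
```

Under these assumed numbers the break-even size drops by more than an order of magnitude when the copies go away and dispatch gets cheaper, which is the sense in which smaller pieces of parallel work become worth offloading.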

Change starts with you by Tanclearas

Although I do agree that there are many opportunities for HSA, I am concerned that AMD's own efforts in using heterogeneous computing have been half-baked. The AMD Video Converter has a smattering of conversion profiles, lacks any user-customizable options (besides a generic "quality" slider), and hasn't seen any update to the profiles in a ridiculously long time (unless there were changes/additions within the last few months).

AMD recognizes that heterogeneous compute requires specific and new measures to ease developer adoption. To this end, AMD is adopting the strategy of delivering domain-specific SDKs and providing optimized sample applications. These serve as reference code that eases the developer's job of extracting performance, especially for targeted and common use cases. The APP SDK is one example; stay tuned for more.

It is no secret that Intel has put considerable effort into compiler optimizations that required very little effort on the part of developers to take advantage of. AMD's approach to heterogeneous computing appears simply to wait for developers to do all the heavy lifting.

The question therefore is, when is AMD going to show real initiative with development, and truly enable developers to easily take advantage of HSA? If this is already happening, please provide concrete examples of such. (Note that a 3-day conference that also invites investors is hardly a long-term, on-going commitment to improvement in this area.)

Just to clarify, HSA is not available today. We outlined our roadmap for the future of APUs last year at AFDS, which included the evolution of HSA. Most of the HSA features will be available on our 2013 and 2014 platforms. We are going to announce the schedule for availability of our HSA software stack, our tools and the library plan at AFDS. AFDS is a continuing forum where we will bring together software developers to interact with us and our partners to let them know the direction of our platforms in the future. The fact that investors attend does not detract from the fact that it is targeted primarily at software developers. The overwhelming majority of presentations and talks are directed at software developers. Several key partners will be delivering keynotes at AFDS expressing their aligned view of heterogeneous computing, including technical leaders from Adobe, Cloudera, Penguin Computing, Gaikai and SRS.

We have just announced the growing range of software vendors that support OpenCL on our platforms today. These include companies such as SONY, Adobe, Arcsoft, Winzip, Cyberlink, Corel, Roxio, and many, many others. We are confident all of them will be enthusiastic about supporting HSA.

In addition, see the answer to the question above and what we are doing to make OpenCL easier to use.

Two questions by markstock

Mr. Hegde, I have two questions which I hope you will answer.

To your knowledge, what are the major impediments preventing developers from thinking about this new hierarchy of computation and beginning to program for heterogeneous architectures?

See my answer to the first question where I list the hardware features of HSA and the issues they solve. Those are all issues with today's heterogeneous compute models.

AMD clearly aims to fill a void for silicon with tightly-coupled CPU-like and GPU-like computational elements, but are they only targeting the consumer market, or will future hardware be designed to also appeal to HPC users?

Absolutely. We will be bringing HSA-based APUs to the market in the near future, and all the aspects of ease of programming and much greater performance per joule that HSA brings will greatly benefit the HPC space. In fact, Penguin Computing is already implementing APUs in HPC server designs and will be sharing details on HPC heterogeneous compute at AFDS during their keynote.

When will the software catch up? by Loki726

AMD Fellow Mike Mantor has a nice statement that I believe captures the core difference between GPU and CPU design.

"CPUs are fast because they include hardware that automatically discovers and exploits parallelism (ILP) in sequential programs, and this works well as long as the degree of parallelism is modest. When you start replicating cores to exploit highly parallel programs, this hardware becomes redundant and inefficient; it burns power and area rediscovering parallelism that the programmer explicitly exposed. GPUs are fast because they spend the least possible area and energy on executing instructions, and run thousands of instructions in parallel."

Notice that nothing in here prevents a high degree of interoperability between GPU and CPU cores.

When will we see software stacks catch up with heterogeneous hardware? When can we target GPU cores with standard languages (C/C++/Objective-C/Java), compilers (LLVM, GCC, MSVS), and operating systems (Linux/Windows)? The fact that ATI picked a different ISA for their GPUs than x86 is not an excuse; take a page out of ARM's book and start porting compiler backends.

AMD is addressing this via HSA. HSA addresses these fundamental points by introducing an intermediate layer (HSAIL) that insulates software stacks from the individual ISAs. This is a fundamental enabler of the convergence of software stacks on top of heterogeneous compute.

Unless the install base is large enough, the investment to port *all* standard languages across to an ISA is prohibitively large. Individual companies like AMD are motivated but can only target a few languages at a time, and the software community is not motivated if the install base is fragmented. HSA breaks this deadlock by providing a "virtual ISA" in the form of HSAIL that unifies the view of HW platforms for SW developers. It is important to note that this is not just about functionality: HSAIL also preserves performance sufficiently to make the SW stack truly portable across HSA platforms.
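The porting argument can be illustrated with a little counting. The sketch below is an illustration, not something from AMD: the language and ISA lists are hypothetical placeholders, and it simply compares how many compiler backends are needed when every language targets every hardware ISA directly versus when each side targets a shared virtual ISA once.

```python
# Hypothetical lists for illustration only; these are not real product ISAs.
languages = ["C++", "Java", "Python", "Fortran"]
isas = ["gpu_isa_a", "gpu_isa_b", "gpu_isa_c"]


def direct_backends(languages, isas):
    """One compiler backend per (language, hardware ISA) pair."""
    return [(lang, isa) for lang in languages for isa in isas]


def via_virtual_isa(languages, isas, virtual_isa="HSAIL"):
    """Each language targets the virtual ISA once, and each hardware
    vendor lowers the virtual ISA to its own ISA once."""
    front_ends = [(lang, virtual_isa) for lang in languages]
    finalizers = [(virtual_isa, isa) for isa in isas]
    return front_ends + finalizers


direct = direct_backends(languages, isas)
layered = via_virtual_isa(languages, isas)
print(len(direct), len(layered))
```

With four languages and three ISAs this is 12 pieces of work versus 7, and the gap widens as either list grows, since the direct approach scales with the product of the two counts and the layered approach with their sum.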

Why do we need new languages for programming GPUs that inherit the limitations of graphics shading languages? Why not toss OpenCL and DirectX compute, compile C/C++ programs, and launch kernels with a library call? You are crippling high-level languages like C++ AMP, Python, and Matlab (not to mention applications) with a laundry list of pointless limitations.

AMD sees OpenCL as a critical and necessary step in the evolution of programming. Single-core programming evolved from assembly to C++ and Java: it started with very few expert programmers doing close-to-metal coding, moved to a larger number of trained professionals driving products, and finally became easy enough for the minimally trained programming masses to target CPUs. Symmetric multi-core programming went through a similar trend, from pthreads to models like OpenMP and TBB.

Today, pioneered by experts who managed to write compute code within shaders, heterogeneous compute now has its first standard programming model in OpenCL. AMD introduced Aparapi, which gives Java developers an easy way to access GPU compute. C++ AMP is the first instance of the natural next step in this evolution, i.e., extending existing programming models to target GPU compute and thus bringing in adoption by their (large) communities. AMD will strongly support this expansion into languages like Fortran, Python, Ruby, R, Matlab…
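The programming pattern these models share is a per-element "kernel" launched over an index range. The following is a minimal Python sketch of that pattern only: it runs on CPU threads standing in for GPU work-items, and `launch` and `saxpy_kernel` are hypothetical names, not part of OpenCL, Aparapi, or C++ AMP.

```python
from concurrent.futures import ThreadPoolExecutor


def launch(kernel, global_size, *args, workers=4):
    """Call kernel(i, *args) once for every index i in range(global_size).

    Hypothetical helper: CPU threads stand in for GPU work-items here.
    """
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # list() waits for every call and re-raises any kernel exception.
        list(pool.map(lambda i: kernel(i, *args), range(global_size)))


def saxpy_kernel(i, a, x, y, out):
    """Per-element body. Each index writes only its own output slot,
    so the invocations are independent of one another."""
    out[i] = a * x[i] + y[i]


x = [1.0, 2.0, 3.0, 4.0]
y = [10.0, 20.0, 30.0, 40.0]
out = [0.0] * len(x)
launch(saxpy_kernel, len(x), 2.0, x, y, out)
print(out)
```

The point of the sketch is that the programmer states the parallelism explicitly by writing the body for a single index; how a real runtime maps those indices onto hardware is outside what this example shows.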

In addition, domain-specific libraries are also being targeted, e.g. OpenCV, x264, crypto++, to allow the programmer to focus on the job at hand, instead of the mechanics of obtaining performance. This is the fastest way to enable existing application code bases to leverage heterogeneous compute.

And of course, HSA is a key enabler of this next step since it expands the install base for SW developers to target via the portable performance it enables across various ISAs.

However, much as assembly optimizations still coexist with high-level languages, AMD does see OpenCL continuing to coexist with high-level programming, so that performance-critical developers can extract the most out of a particular platform.

Where's separable compilation? Why do you have multiple address spaces? Where is memory mapped IO? Why not support arbitrary control flow? Why are scratchpads not virtualized? Why can't SW change memory mappings? Why are thread schedulers not fair? Why can't SW interrupt running threads? The industry solved these problems in the 80s. Read about how they did it, you might be surprised that the exact same solutions apply.

- OpenCL 1.2 (supported by the upcoming AMD APP SDK 2.7) supports clCompileProgram and clLinkProgram.
- HSA MMU enables a shared address space between CPU and GPU.
- HSAIL supports more flexible control flow.
- SI-based GPUs include high-performance read/write caches which effectively can be virtualized.
- Future AMD APUs will support HW context switching, including the ability for SW to interrupt running threads.

15 Comments

spaceyyeti - Tuesday, May 22, 2012 - link
You, sir, make no sense to me. But that just as well could be me.
;)
Anyways; this article was a very interesting read.

Loved it.
tipoo - Tuesday, May 22, 2012 - link
Kepler has more stream processors but at half the clock speed of Fermi (hot clocks), so they aren't directly comparable.
oldguybt - Tuesday, May 22, 2012 - link
OK, it's not just stream processors, but what I'm saying is that it sounds like AMD found out that their cards work much better at programming, counting and so on.
I'm a noob here, but with Linux I could do that before, and they're saying I could do that in the future.
OK. Good job, AMD, anyway. First + on your side in 10 years.
ullix - Wednesday, May 23, 2012 - link
What remains unclear to me: in order to take advantage of e.g. OpenCL must I use an APU with CPU and GPU on the same die, or can I also use a discrete CPU and a GPU on a plugin card? Apparently the latter would always offer more power to begin with, which then could (or could not?) be enhanced with OpenCL and the like?

Or is it that an APU has the advantage, as it is already designed to use memory common to its CPU/GPU?

I am thinking primarily of x264 encoding - will I be better off with a "slow" A10-xxxx than with a FX8150+Radeon card 7xxx?
TC2 - Thursday, May 24, 2012 - link
big words, big plans aaand big red\green power point presentations, but 1 or 2 generations behind Intel & Nvidia :)))
about sw support, they just have no idea what it means!!!
