Answered by the Experts: Heterogeneous and GPU Compute with AMD’s Manju Hegde
by Anand Lal Shimpi on May 21, 2012 12:58 PM EST
I haz questions by B3an
Could an OS use GPU compute in the future to speed up everyday tasks, apart from the usual stuff like the UI? What possible tasks would this be? And is it possible we'll see this happen within the next few years?
Yes, definitely. OSes are moving towards providing some base functionality in terms of security, voice recognition, face detection, biometrics, gesture recognition, authentication, some core database functionality. All these benefit significantly from the optimizations in HSA described above. With the industry support we are building this should happen in the next few years.
Are you excited about Microsoft's C++ Accelerated Massive Parallelism (AMP)? Do you think we'll see a lot more software using GPU compute now that Visual Studio 11 will include C++ AMP support?
We see C++ AMP as a great alternative to OpenCL. Both OpenCL and C++ AMP provide a method for utilizing the underlying GPU compute infrastructure, each with its own benefits. AMD realizes that different classes of programmers may have different language preferences, so we will support both languages with the same level of quality in order to serve our developer community better. C++ AMP is a small delta to C++, and as such will appeal to many mainstream programmers and, with Microsoft's support, be able to reach a vast audience. So yes, we expect to see a lot more software using heterogeneous compute through this direction.
Do you expect the next gen consoles to make far more use of GPU compute?
Cannot comment further on this since these are products being brought forward by other companies.
Question by mrdude
The recent Kepler design has shown that there might be a chasm developing between how AMD and NVIDIA treat desktop GPUs. While GCN showed that it can deliver fantastic compute performance (particularly on supported OpenCL tasks), it also weighs in heavier than Kepler and lags behind in terms of gaming performance. The added vRAM, bus width and die space for the 7970 allow for greater compute performance but at a higher cost; is this the road ahead, and will this divide only broaden further as AMD pushes ahead? I guess what I'm asking is: Can AMD provide both great gaming performance and compute without having to sacrifice by increasing the overall price and complexity of the GPU?
Yes. You will see our future products continue to balance great gaming and compute performance.
It seems to me that HSA is going to require a complete turnaround for AMD as far as how they approach developers. Personally speaking, I've always thought of AMD as the engineers in the background who did very little to reach out and work with developers, but now, in order to leverage the GPU as a compute tool in tasks other than gaming, it's going to require a lot of cooperation with developers who are willing to put in the extra work. How is AMD going about this, and what apps will we see transitioned to GPGPU in the near future?
Things have changed. :) AMD has a team focused on developer solutions and outreach. This team drives the definition and deployment of tools, libraries and SDKs to the developer ecosystem, including enablement content such as blogs, white papers and webinars. In addition, AMD works with key developers and also contributes to prominent open source code bases to promote GPU compute. The launch of the 2nd Generation AMD A-Series "Trinity" APU includes numerous applications that use the GPU for compute acceleration – Photoshop CS6, WinZip, x264/Handbrake and GIMP, to name a prominent few. There are more plans in the works to reach out to developers and make it easy for them to extract the most from HSA platforms.
Offloading FP-related tasks to the GPU seems like a natural transition for a type of hardware that already excels in such tasks, but was HSA partly the reason for the single FPU in a Bulldozer module compared to the 2 ALUs?
We constantly evaluate the tradeoff of where to add compute execution resources. It is more expensive to add computation resources in the CPU core, since CPU vector execution resources are typically clocked higher, have multi-ported register files, support out-of-order execution for latency hiding, etc. That said, the Bulldozer FPU does include support for new FMAC instructions and a higher clock rate. So we really are investing in both the CPU and the GPU.
Is AMD planning to transition into an 'All APU' lineup for the future, from embedded to mobile to desktop and server?
AMD is all about meeting customer needs. We already have APUs for embedded, mobile (tablet and notebook) and desktop, and will address APUs for server as we continue to monitor what the market and customer needs are. In fact, we already have some partners incorporating APUs into server designs, one of which is Penguin Computing who is keynoting AFDS…
OpenCL by A5
What is AMD doing to make OpenCL more pleasant to work with?
Some of the initiatives that AMD has already driven are:
- Improved Debugger and Profiler (Visual Studio plugin, Standalone Eclipse, Linux)
- Static C++ interface (APP SDK 2.7)
- Extended tools through MCW (PPA, GMAC, TM)
- OpenCL Book, Programming Guide (US, India, China editions)
- University course kit
- Webinars (Various topics)
- Online self-training material
- Hands-on tutorials and content at AFDS
- Moderated OpenCL forum
- OpenCL Training and Services Partners
- OpenCL acceleration of major Open Source codebases
- Aparapi to enable and ease Java users in using OpenCL
Questions by ltcommanderdata
WinZip and AMD have been promoting their joint venture in implementing OpenCL hardware accelerated compression and decompression. Owning an AMD GPU, I appreciate it. However, it's been reported that WinZip's OpenCL acceleration only works on AMD GPUs. What is the reasoning behind this? Isn't it hypocritical, given AMD's previous stance against proprietary APIs, namely CUDA, that AMD would then support development of a vendor-specific OpenCL program?
The OpenCL version of WinZip has been optimized on AMD GPUs and achieves significant performance gains. While OpenCL itself is not vendor specific, optimizations to any application are essentially vendor specific, since they depend on each vendor's microarchitecture. We worked closely with WinZip to get these optimizations in. We have the most mature OpenCL implementation currently, and even then we just managed to get the QA done before WinZip's launch date. I am sure they will be coming out with OpenCL-optimized versions on Intel and NVIDIA soon (you should ask them). That is in fact the beauty of OpenCL – one code base gives functional portability across vendor platforms, and optimizations are the only components that need to be scheduled. So yes, this is in line with our embrace of open and industry standards.
This may be related to the above situation. Even with standardized, cross-platform, cross-vendor APIs like OpenCL, to get the best performance developers would need to do vendor-specific, even device-generation-specific, optimizations. Is there anything that can be done, whether at the API level, the driver level or the hardware level, to achieve the write-once, run-well-anywhere ideal?
Device-specific optimizations will always have a beneficial impact on performance. This is true even with CPUs. While differences between GPUs are more dramatic, this is because today's GPUs are designed to excel at graphics, with compute a secondary consideration. Reluctance to spend more chip area on compute results in many device-specific performance "cliffs" – for example, VLIW instructions, 64-thread wavefronts, and the need for coalesced accesses to memory. As GPUs are increasingly used for compute, and as it becomes possible to add yet more transistors, these "cliffs" will continue to disappear. Advances in compilers will also help.
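To make the coalescing "cliff" concrete, here is a hypothetical back-of-envelope model (not an AMD tool, and the parameter values are illustrative assumptions): it counts how many memory transactions one 64-thread wavefront generates when each thread reads a 4-byte element, assuming accesses falling in the same 64-byte segment coalesce into a single transaction.

```python
# Illustrative coalescing model. Assumed parameters, not hardware-verified:
SEGMENT_BYTES = 64    # memory transaction granularity
ELEMENT_BYTES = 4     # each thread reads one 32-bit value
WAVEFRONT_SIZE = 64   # threads per wavefront

def transactions(stride_elements):
    """Count unique 64-byte segments touched by one wavefront."""
    segments = {
        (tid * stride_elements * ELEMENT_BYTES) // SEGMENT_BYTES
        for tid in range(WAVEFRONT_SIZE)
    }
    return len(segments)

print(transactions(1))   # contiguous: 64 threads * 4 B = 256 B -> 4 transactions
print(transactions(16))  # stride of 16 elements: each thread in its own segment -> 64
```

Under these assumptions, a stride-16 access pattern needs 16x the memory transactions of a contiguous one for the same amount of useful data, which is the kind of cliff the answer above describes.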
Comparing the current implementations of on-die GPUs, namely AMD Llano and Intel Sandy Bridge/Ivy Bridge, it appears that Intel's GPU is more tightly integrated with CPU and GPU sharing the last level cache for example. Admittedly, I don't believe CPU/GPU data sharing is exposed to developers yet and only available to Intel's driver team for multimedia operations. Still, what are the advantages and disadvantages of allowing CPUs and GPUs to share/mix data? I believe memory coherency is a concern. Is data sharing the direction that things are eventually going to be headed?
For some workloads we expect data sharing between the CPU and GPU. In many cases the data being shared is quite large – for example, a single frame of HD video at 4 bytes/pixel is 8MB of data, and many algorithms deal with multiple frames of video, so even seemingly large shared caches are not effective at capturing real-world working sets. We see clear advantages from CPU/GPU shared address spaces (same page tables) and high-bandwidth memory access from both devices.
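The 8MB figure quoted above is easy to verify: a 1920x1080 frame at 4 bytes per pixel works out as follows.

```python
# One 1080p frame at 4 bytes/pixel, as cited in the answer above.
frame_bytes = 1920 * 1080 * 4
print(frame_bytes)            # 8294400 bytes
print(frame_bytes / 2**20)    # ~7.9 MiB, i.e. roughly 8MB

# A few frames in flight quickly exceed typical last-level cache sizes,
# which is the point being made about shared caches.
working_set = 4 * frame_bytes
print(working_set / 2**20)    # ~31.6 MiB for just four frames
```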
As a follow up, it looks like the just released Trinity brings improved CPU/GPU data sharing as per question 3 above. Maybe you could compare and contrast Trinity and Ivy Bridge's approach to data sharing and give an idea of future directions in this area?
Related to the above, how much is CPU<>GPU communications a limitation for current GPGPU tasks? If this is a significant bottleneck, then tightly integrated on-die CPU/GPUs definitely show their worth. However, the amount of die space that can be devoted to an IGP is obviously more limited than what can be done with a discrete GPU. What can then be done to make sure the larger computational capacity of discrete GPUs isn't wasted doing data transfers? Is PCIe 3.0 sufficient? I don't remember if memory coherency was adopted for the final PCIe 3.0 spec, but would a new higher speed bus, dedicated to coherent memory transfers between the CPU and discrete GPU be needed?
For some applications, CPU/GPU communication is so severe a limitation that it eliminates the gains from using the GPU. (For other algorithms, the communication is small or can be overlapped, and the GPU can be used quite effectively.) PCIe 3.0 helps for large-block data transfers. Inherently, though, discrete GPUs will continue to provide higher peak computation capabilities (since the entire die is dedicated to the GPU) but will remain less tightly integrated than what can be achieved with an APU.
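A rough model shows when transfer cost erases the GPU's advantage. This is a sketch with hypothetical throughput numbers (the GFLOPS and bus figures below are illustrative assumptions, not measurements of any real part): offloading pays off only when compute time saved exceeds the round-trip transfer time.

```python
# Hypothetical break-even model for GPU offload over a bus such as PCIe.
# All performance numbers are assumed for illustration.

def offload_speedup(data_bytes, cpu_gflops, gpu_gflops,
                    flops_per_byte, bus_gb_per_s):
    """Speedup of offloading = t_cpu / (t_transfer + t_gpu)."""
    flops = data_bytes * flops_per_byte
    t_cpu = flops / (cpu_gflops * 1e9)
    t_gpu = flops / (gpu_gflops * 1e9)
    t_xfer = 2 * data_bytes / (bus_gb_per_s * 1e9)  # copy to GPU and back
    return t_cpu / (t_xfer + t_gpu)

# Low arithmetic intensity (1 FLOP/byte): transfers dominate, offload loses.
print(offload_speedup(8e6, 100, 1000, 1, 16))    # well below 1x
# High arithmetic intensity (100 FLOPs/byte): the GPU's advantage shows through.
print(offload_speedup(8e6, 100, 1000, 100, 16))  # several times faster
```

The same model suggests why an APU changes the picture: with a shared address space the transfer term shrinks toward zero, so even low-intensity workloads can benefit.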
In terms of gaming, when GPGPU began entering consumer consciousness with the R500 series, GPGPU physics seemed to be the next big thing. Now that highly programmable GPUs are commonplace and the APIs have caught up, mainstream GPGPU physics is nowhere to be found. It seems the common current use cases for GPGPU in games are decompressing textures and ambient occlusion. What happened to GPGPU physics? Did developers determine that since multi-core CPUs are generally underutilized in games, there is plenty of room to expand physics on the CPU without having to bother with the GPU? Is GPGPU physics coming eventually? I could see concerns about contention between running physics and graphics on the same GPU, but given that most CPUs now come integrated with a GPGPU-capable IGP anyway, the ideal configuration would be a multi-core CPU for game logic, an IGP as a physics accelerator, and a discrete GPU for graphics.
GPUs can run physics just fine. The problem with GPU physics is scaling. Unlike graphics, which easily scales across a wide range of hardware capabilities (e.g., by changing resolution, using antialiasing, or changing texture resolution), it is very difficult to scale simulation compute requirements. Game programmers and artists must do a lot of extra work to take advantage of increased simulation capability, which they are reluctant to do, since they are usually happy to target the lowest common denominator (consoles). This will continue to be the case until tools and runtimes are available that allow artists to create scalable physics content with little to no additional effort. HSA is an ideal architecture for running physics.