Original Link: http://www.anandtech.com/show/2302

It's not often these days that we see AMD's name attached to new x86 instructions. While AMD was instrumental in creating the AMD64/x86-64 standard for 64-bit processors and the No-eXecute bit for buffer overflow protection, we haven't otherwise heard much out of the company lately. It's been Intel driving new instruction sets, with standards such as SSE3, SSE4, and VT for virtualization (with AMD's own implementation, AMD-V, following).

With AMD's impressive track record in recent years, we're all ears when the company announces something new about x86 instructions. That said, what we're looking at today is not a new set of high-performance instructions or a new safety measure. In fact, AMD's proposal doesn't even directly apply to most computer users, who will likely never use these instructions. What AMD is proposing is, to our knowledge, a first for the x86 instruction set: a set of instructions solely for developers.

As part of the newly launched Extensions for Software Parallelism initiative, AMD is making its first move by announcing the Lightweight Profiling Proposal (LWP), a proposed standard for adding hardware and instructions that help developers fine-tune their code and improve application performance by profiling their applications. With this addition to the x86 instruction set, AMD is specifically targeting managed code environments and developers producing multithreaded applications, two of the biggest areas of growth in the software industry. AMD believes that these groups stand to benefit the most from LWP given the unique difficulties faced by those two fields.

Although the hardware to come from this proposal is still some time off, we're in a position today to talk about some of the benefits that can be extracted from such hardware, and some of the hurdles in bringing about such a change. Profilers can be extremely powerful tools, with special-purpose platforms such as embedded computers and video game consoles having used such hardware and software to squeeze an amazing amount of performance out of what can be very limited hardware. In some ways what AMD is proposing simply brings the PC up to par with other systems, and the proposal itself is fundamentally simple. Nevertheless, the potential performance gains can be huge.

So what is profiling, why does it have such a potential to improve performance, and how does AMD intend to improve profiling? Let's take a look.

The Importance of Profiling

To understand what exactly AMD is attempting to accomplish, we first need to backtrack somewhat and talk about performance analysis, the task AMD intends to improve with LWP. Because of the sheer complexity of modern high-performance processors and high-performance applications, searching for performance bottlenecks solely through an understanding of code is virtually impossible. Due to this limitation, developers have over the years written toolsets that attempt to straddle the hardware/software boundary by recording and interpreting the actions of the hardware, to identify what code is being run and what the hardware is doing that most affects the performance of an application. This is performance analysis, and the tools used to do this task are called profilers.

Traditional profilers attempt to measure performance via timing, hardware interrupts, and other low-level tricks to coerce the hardware into giving additional information on its state and the instructions currently being run. This kind of performance analysis can be extremely effective, but it also has an inherent downside: these profilers slow down the system and interrupt the program being profiled. The hardware may behave differently because of the profiler, which undermines the goal of measuring how the program normally runs.
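To make that intrusion concrete, here is a minimal sketch of a purely software profiler, written in Python for brevity. It is our own illustration and not part of AMD's proposal: the `count_calls` helper and the `fib` workload are hypothetical names, while `sys.setprofile` is the standard-library hook that asks the interpreter to call back on every function call and return.

```python
import sys
from collections import Counter


def count_calls(func, *args):
    """Run func(*args) and count every Python-level function call it makes."""
    counts = Counter()

    def hook(frame, event, arg):
        # The interpreter invokes this hook on every call and return,
        # so the profiled program is interrupted constantly.
        if event == "call":
            counts[frame.f_code.co_name] += 1

    sys.setprofile(hook)
    try:
        result = func(*args)
    finally:
        # Always remove the hook, even if func raises.
        sys.setprofile(None)
    return result, counts


def fib(n):
    """Deliberately naive recursive Fibonacci, used only as a workload."""
    return n if n < 2 else fib(n - 1) + fib(n - 2)


result, counts = count_calls(fib, 10)
print(result, counts["fib"])
```

Computing `fib(10)` this way takes 177 calls, and the hook runs on each one of them (and again on each return). That extra work is exactly the kind of overhead that changes how the profiled program behaves.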

Unfortunately, profiling is becoming increasingly necessary as developers continue to move to higher-level languages and parallel processing technologies. C and C++, the traditional high-level languages for high-performance applications, are being usurped by managed environments such as Java and the Microsoft .NET Framework. Managed code can offer better security, improved threading, and write-once, run-anywhere functionality through virtual machines, which in turn avoids many issues with porting a program to other platforms.

Meanwhile, the entire imperative programming model behind the design of C, Java, Visual Basic, C#, and the other major programming languages is poorly suited for multithreading. Some of the most pessimistic predictions for game development, for example, put the development time of a multithreaded engine at three times that of a single-threaded engine. Profilers help in this regard by letting developers catch stalls and other problems that result from managing multiple threads.

It's all of these problems that AMD wants to resolve with their Lightweight Profiling Proposal. What AMD proposes is a section of silicon on a CPU dedicated to assisting with profiling, and a new set of instructions to work with the hardware. The profiling hardware would be able to properly monitor the rest of the CPU, as opposed to the guessing done by software profilers, and return this more precise information to the developer.

It's a fundamentally simple concept: the proposal calls for all of this to be done with only two instructions, LLWPCB and SLWPCB, to enable/disable profiling and retrieve the data respectively. Yet the potential results could be extremely useful, allowing developers to identify the precise latency of certain operations, count cache hits, or retrieve the exact instruction being processed. Furthermore, all of this occurs while triggering a fraction of the interrupts (the reasoning behind the "lightweight" name) and without causing the processor to act differently, as software profiling tools can. All of this leads to better profiling, which should translate into more finely optimized applications.
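The "fraction of the interrupts" point can be illustrated with a toy model. This is a sketch under our own assumptions, not the mechanism defined in AMD's proposal: we assume the profiling hardware accumulates event records in a buffer and only interrupts software once a threshold is reached. The `BufferedProfiler` class, its `threshold` parameter, and the `"cache_miss"` event label are all hypothetical.

```python
class BufferedProfiler:
    """Toy model: events accumulate silently; software is interrupted
    only when the buffer reaches a threshold (an assumed design)."""

    def __init__(self, threshold, on_interrupt):
        self.threshold = threshold
        self.on_interrupt = on_interrupt  # callback standing in for the interrupt handler
        self.pending = []
        self.interrupts = 0

    def record(self, event):
        # Recording an event does not disturb the running program...
        self.pending.append(event)
        # ...until enough events have piled up to justify one interrupt.
        if len(self.pending) >= self.threshold:
            self.interrupts += 1
            batch, self.pending = self.pending, []
            self.on_interrupt(batch)


collected = []
profiler = BufferedProfiler(threshold=64, on_interrupt=collected.extend)
for i in range(1000):
    profiler.record(("cache_miss", i))

print(profiler.interrupts, len(collected), len(profiler.pending))
```

In this model, 1,000 events cause 15 interrupts instead of 1,000, with 40 events still waiting in the buffer. A profiler that interrupts on every event would pay that cost roughly 64 times as often.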

Even wilder ideas for using these instructions exist in the realm of managed code. Because LWP is lightweight and works in real time, the possibility is left open that the Just-In-Time (JIT) compilers used by managed environments could profile the code they generate and change how they compile code and handle data to improve performance on the fly. As we'll see, there are some outstanding issues with AMD's proposal that would specifically affect this use, but the potential is there.
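As a rough sketch of what profile-guided adaptation means, consider a runtime that counts how often a routine runs and swaps in a faster version once it becomes "hot". This is our own simplified illustration, not how any particular JIT works, and it uses a plain software counter where a real runtime might use data from LWP. The `AdaptiveFunction` class and both `sum_squares` variants are hypothetical.

```python
class AdaptiveFunction:
    """Dispatches to a baseline implementation until the call count
    marks the function as hot, then switches to an optimized one."""

    def __init__(self, baseline, optimized, hot_threshold=100):
        self.baseline = baseline
        self.optimized = optimized
        self.hot_threshold = hot_threshold
        self.calls = 0
        self.active = baseline

    def __call__(self, *args):
        self.calls += 1  # the "profile" here is just a call counter
        if self.active is self.baseline and self.calls >= self.hot_threshold:
            self.active = self.optimized  # recompile/swap step, simulated
        return self.active(*args)


def sum_squares_loop(n):
    """Baseline: straightforward loop over 0..n-1."""
    total = 0
    for i in range(n):
        total += i * i
    return total


def sum_squares_closed_form(n):
    """Optimized: closed form for the same sum."""
    return (n - 1) * n * (2 * n - 1) // 6


sum_squares = AdaptiveFunction(sum_squares_loop, sum_squares_closed_form, hot_threshold=3)
results = [sum_squares(10) for _ in range(5)]
print(results, sum_squares.active.__name__)
```

Every call returns the same answer, but from the third call onward the dispatcher uses the closed-form version. The cheaper and more precise the profile data is, the more often a runtime can afford to make decisions like this.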

First Thoughts

In spite of the various performance opportunities offered by LWP, it's important to note that this is by no means a silver bullet. Profilers in general are fine-grained tools used to extract the last, smallest gains in performance from an application. Improved tools aren't a replacement for code that's better designed, or compilers that are better at identifying parallelizable code or more in tune with how a specific processor design performs. A poor implementation in any of these areas will hurt performance far more than a profiler can ever help, and LWP won't solve that problem.

More fundamentally, however, this is just a proposal, which is unusual in an industry where new technology announcements usually come with silicon and roadmaps. AMD has not announced which of its chips will be the first to implement the technology, and with Barcelona shipping this month we can safely assume that this technology won't be in that chip. The earliest we would see this technology would be in late 2008 with the Shanghai core, if not in 2009 with Bulldozer. Given the average two-year development cycle for most programs, this would push programs that make significant use of the technology out to 2010 and beyond, a long way away in the fast-changing computer industry.

There are also some miscellaneous issues that bear mentioning. Operating system support of the new instructions is required, and meanwhile Windows Vista does away with the ability to use the traditional interrupt controls; in a slightly humorous tone given the open-community nature of this proposal, AMD states "So let's figure one [method] out, or create a new one, or share an existing interrupt." There's also the question of whether Intel will follow AMD's lead here, as it has on AMD's past few instruction set extensions; such a feature would be far more useful if both major vendors supported it.

Beyond LWP, AMD has indicated that there will be further new specifications released as part of the Extensions for Software Parallelism initiative. At this point AMD isn't offering any concrete details on what those will be, but our best guess would be general-use instructions for enhancing multithreading performance to go along with new hardware, and possibly more tools for developers. The former in particular has precedent in the x86 instruction set, as Intel released a pair of such instructions as part of SSE3 for improving the performance of its now-abandoned Hyper-Threading technology.

With all of that said, we're left in an odd situation: we're looking at a very interesting AMD proposal, but without a whole lot of hard numbers to back things up. For developers this proposal and the new instructions could be of significant help, but we're left without a way to measure or even predict the value of "significant." As we stated earlier, similar tools have offered an exceptional benefit to developers on embedded/closed platforms using standardized hardware, but we have to keep in mind that this isn't a perfect predictor for a general-purpose computer.

Until AMD hammers out the finer details of its proposal and releases some hardware to go with it, we are left waiting with bated breath for AMD to move this proposal to the next stage. We are looking forward to seeing what developers can do with such a dramatic leap in the ability of their profiling tools in the coming years.
