Folding@Home

For several years now, Dr. Vijay Pande of Stanford has been leading the Folding@Home project to research protein folding. Without diving unnecessarily into the biology of his research: as proteins are produced from their basic building blocks - amino acids - they must go through a folding process to achieve the right shape to perform their intended function. However, for numerous reasons protein folding can go wrong, and when it does it can cause various diseases as malformed proteins wreak havoc in the body.

Although Folding@Home's research involves multiple diseases, the primary disease they are focusing on at this point is Alzheimer's Disease, a brain-wasting condition that primarily affects older people, in which sufferers slowly lose the ability to remember things and think clearly, eventually leading to death. As Alzheimer's is caused by malformed proteins impairing normal brain functions, understanding how exactly Alzheimer's occurs - and more importantly how to prevent and cure it - requires a better understanding of how proteins fold, why they fold incorrectly, and why malformed proteins cause even more proteins to fold incorrectly.

The biggest hurdle in this line of research is that it is extremely computationally intensive: a single calculation can take 1 million days (that's over 2,700 years) on a fast CPU. Coupled with this is the need to run multiple calculations in order to simulate the entire folding process, which can span upwards of several seconds of real folding time. Even split among the processors of a supercomputer, the work is still too demanding to complete in any reasonable amount of time; a single processor can simulate only about 1 nanosecond of folding per day, and even if all of the grant money given out by the United States government were put towards buying supercomputers, it wouldn't come close to being enough.
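
A quick back-of-the-envelope check (our own, using the 1 nanosecond-per-day figure above and assuming roughly one second of real folding time) shows why a single processor is hopeless here:

# Rough estimate: how long would one CPU need to simulate a full folding event,
# given roughly 1 nanosecond of simulated folding per CPU per day?
ns_per_cpu_day = 1.0                     # simulated nanoseconds per CPU-day
folding_time_ns = 1.0 * 1e9              # ~1 second of folding, in nanoseconds

cpu_days = folding_time_ns / ns_per_cpu_day
print(f"{cpu_days:,.0f} CPU-days, or about {cpu_days / 365:,.0f} years")
# -> 1,000,000,000 CPU-days, roughly 2.7 million years on a single processor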

This is where the "@Home" portion of Folding@Home comes in. Needing far more computing power than they could ever hope to buy, the Folding@Home research team decided to spread the processing across computers throughout the world, in an approach called distributed computing. Their hope was that average computer users would be willing to donate spare/unused processor cycles to the project by running the Folding@Home client, which grabs small pieces of work from the central servers, processes them, and returns the results upon completion.
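
Conceptually, each client just repeats a simple fetch-compute-return loop. The sketch below is our own illustration of that idea, not the actual Folding@Home client; the server address and function names are made up for the example.

# Hypothetical sketch of a distributed-computing client loop (illustration only).
import time
import urllib.request

SERVER = "http://example-research-server.org"   # placeholder address

def fetch_work_unit():
    # Ask the central server to assign a small piece of the simulation
    with urllib.request.urlopen(SERVER + "/assign") as response:
        return response.read()

def simulate(work_unit):
    # Crunch the work unit using spare CPU cycles (stubbed out here)
    time.sleep(1)
    return b"result:" + work_unit[:16]

def return_result(result):
    # Send the finished piece back so the server can hand out the next one
    urllib.request.urlopen(SERVER + "/return", data=result)

while True:
    work_unit = fetch_work_unit()
    return_result(simulate(work_unit))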

The call for help was successful: computer owners were more than willing to donate computer cycles to help with this research, and hopefully to help find a way to cure diseases like Alzheimer's. Entire teams formed in a race to see who could get the most processing done, including our own Team AnandTech, and the combined power of over two hundred thousand CPUs netted the Folding@Home project over 200 teraflops (one teraflop being one trillion floating-point operations per second) of sustained performance.

While this was a good enough start for research, it still ultimately fell short of the kind of power the Folding@Home research group needed for the long simulation runs they wanted to do alongside the short-run research that the Folding@Home community could already handle. Additionally, as processors have recently hit a ceiling in terms of raw clock speed, AMD and Intel have been moving to multi-core designs, which introduce scaling problems for the Folding@Home design and are not as effective as simply increasing clock speeds.

Since CPU performance was not growing at a rate satisfactory to the Folding@Home research group, and the project was still well short of its goal in processing power, the focus has since turned to stream processors, and in turn GPUs. As we mentioned previously, the massive floating-point power of a GPU is well suited to this kind of research work, and in the case of Folding@Home, GPUs excel at exactly the kind of processing the project requires. To get more computing power, Folding@Home has now turned to harnessing the GPU.

Comments

  • photoguy99 - Sunday, October 1, 2006 - link

    The folding team just hasn't designed their architecture efficiently for parallelism within a system.

    No doubt they are brilliant computational biologists, but it's simply an oxymoron to claim a system can scale well using thousands of systems but not with the cores within those systems - Nonsense.

    In fact I challenge anyone from their coding team to explain this contradiction.

    Now if they say look, we're busy, we just haven't had time to optimize the architecture for multi-core yet, then that makes perfect sense. But to say inherently the problem doesn't lend itself to that is not right.

  • JarredWalton - Sunday, October 1, 2006 - link

    Not at all true! See above comments, but data dependency is the key. They know the starting point, but beyond that they don't know anything. So they might generate 100,000 (or more) starting points - there are 100K WUs out there. They can't even start the second sequence of any of those points until the first point is complete.

    Think of it within a core: they can split up a task into several (or hundreds of) pieces only if each piece is fully independent. It's not like searching for primes, where scanning from 2^100000 to 2^100001 is totally unrelated to what happened from 2^99999 to 2^100000. Here, what happens at stage x of Project 2126 (Run 51, Clone 9, Gen 7) absolutely determines where stage x+1 of Project 2126 (Run 51, Clone 9, Gen 7) begins. A separate task from Project 2126 (Run 51, Clone 9, Gen 6) or whatever can be running, but the results there have nothing to do with Project 2126 (Run 51, Clone 9, Gen 7).
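
To illustrate the dependency JarredWalton describes, here is a toy sketch (our own example, not actual Folding@Home code): each generation of a clone starts from the previous generation's result, so one clone's generations must run in order, while separate clones can run side by side - more clones in flight at once, but no single clone finishes any sooner.

# Toy illustration of the work-unit dependency (not real FAH code).
from concurrent.futures import ProcessPoolExecutor

def advance_one_generation(state):
    # Stand-in for simulating one work unit's worth of folding
    return state + 1

def run_clone(start, generations=10):
    state = start
    for _ in range(generations):        # gen N feeds gen N+1: strictly sequential
        state = advance_one_generation(state)
    return state

if __name__ == "__main__":
    clones = range(8)                       # independent starting points
    with ProcessPoolExecutor() as pool:     # separate clones parallelize trivially,
        results = list(pool.map(run_clone, clones))   # but none finishes any sooner
    print(results)
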
  • photoguy99 - Monday, October 2, 2006 - link

    Jared, I respectfully submit that you are not correct.

    Think of it this way - what is the algorithmic difference between submitting jobs to distributed PCs vs. distributed processes within a PC?

    Multiple processes within a PC could operate independently and easily take advantage of the multi-core parallelism. A master UI process could manage the sub-processes on the machine so that the user would not even need to do any special setup.

    I'm telling you the problem with leveraging multi-core is not inherent to the folding problem, it's just a limitation of how they've designed their architecture.

    Again not to take away credit from all the goodness they have achieved, but if you think about it this is really indisputable. I'm sure their developers would agree.
  • JarredWalton - Monday, October 2, 2006 - link

    Are we talking about *can* they get some advantage from multiple cores with different code, or are we talking about gaining a nearly 2X performance boost? I would agree that there is room for them to use more than one core, but I would guess the benefit will be more like a 50% speedup.

    Right now, running two instances of FAH nearly doubles throughput, but no individual piece is completed faster. They could build in support for using multiple cores without user intervention, but that's not a big deal since you can already do that on your own. Their UI could definitely be improved. The difficulty is that they aren't able to crank out individual pieces faster; they can get more pieces done, but if there's a time-sensitive project they can't explore it faster. For example, what if they come across a particular folding sequence that seems promising, and they'd like to investigate it further with 100K slices covering several seconds (or whatever). If piece 1 determines piece 2, and 2 determines 3... well, they're stuck with a total time to calculate 100K segments that would be in the range of thousands of years (assuming a day or two per piece).

    Anyway, there are tasks which are extremely difficult to thread, though I wouldn't expect this to be one. Threading and threading really well aren't the same, though. Four years from now, if they get octal-core CPUs, that increases the total number of WUs people can process, but they wouldn't be able to look at any longer sequences than today if CPUs are still at the same clock speed. (GPUs being 40X faster means they could look at 40X more length/complexity.)

    Anyway, without low level access to their code and an understanding of the algorithms they're using, the simple truth is that neither of us can say for sure what they can or can't get from multithreading. Then there's the whole manpower problem - is it more beneficial to work on multithread, or to work on something else? Obviously, so far they have done "something else". :)
  • smitty3268 - Monday, October 2, 2006 - link

    Looking at their website, they are working on a multithreaded core which would take advantage of SMP systems. Regardless of how well that turns out, a 40x increase is not going to happen until we get > 40 cores in a CPU, so this GPU client is still a very big deal.

    I understand what you mean about data dependence and not being able to move on to more involved simulations due to time factors of individual work units, but it seems like this would be fairly easy to solve by simply splitting the work units in half or in quarters, etc. This could definitely be difficult to do, though, depending on how their software has been designed. Perhaps they would have to completely rewrite their software and it isn't worth the trouble.
  • JarredWalton - Tuesday, October 3, 2006 - link

    I don't think they can split a WU in half, though, or whatever. The best they could do would be to split off the computation so that, e.g., atoms 1-5000 are solved at each stage on core 1 and 5001-10000 on core 2. You still come back to the determination of the "trajectory". If you start at A and you want to know where you end up, the only way to know is to compute each point on the path. You can't just break that calculation into A-->C and then C-->B with C being halfway.

    I know the Pande people are working on a lot of stuff right now, so GPUs, PS3, SMP, etc. are all being explored to varying extents.
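
The kind of split described above - updating different ranges of atoms on separate cores within a timestep, while the timesteps themselves stay strictly sequential - might look something like this toy sketch (again our own illustration, not Pande Group code):

# Toy sketch: parallelism within a timestep, sequential across timesteps.
from concurrent.futures import ThreadPoolExecutor

def update_atom(position):
    # Stand-in for the per-atom force/position update of one timestep
    return position + 0.001

def run_timestep(positions, pool, chunks=4):
    size = len(positions) // chunks
    slices = [positions[i * size:(i + 1) * size] for i in range(chunks)]
    # The atoms within a single step can be updated in parallel...
    updated = pool.map(lambda s: [update_atom(p) for p in s], slices)
    return [p for chunk in updated for p in chunk]

positions = [float(i) for i in range(10000)]    # 10,000 atoms split across 4 workers
with ThreadPoolExecutor(max_workers=4) as pool:
    for step in range(100):     # ...but step N+1 still needs step N's result first
        positions = run_timestep(positions, pool)
print(positions[0], positions[-1])

In pure Python the threads wouldn't actually run the arithmetic concurrently; the point is the structure - the time axis stays serial no matter how many cores work on each step.
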
  • icarus4586 - Wednesday, October 4, 2006 - link

    The reason that modern GPUs are so powerful is that they have many parallel processing pipelines, which is only a little different from saying that they have many processing cores. Even the diagram given in this article is titled "Modern GPU: 16-48 Multi-threaded cores." If the F@H algorithm can be optimized to use the parallelism that exists within modern GPUs, it should also be optimizable for the parallelism of multi-core CPUs.
  • smitty3268 - Sunday, October 1, 2006 - link

    quote: "As for the core stuff, this is something the Folding team explicitly brought up with us."

    I still don't really see what the actual problem is, but I'll certainly take their word for it. Maybe if I ever get a degree in biochemistry I'll try and figure out what's going on :)

    Thanks for the info. I think I'll go ahead and install F@H. It's something I've occasionally meant to do but I keep forgetting about it.
  • Furen - Sunday, October 1, 2006 - link

    I think it's about data dependency. Let's say you start 2000 processes on different PCs and run them for 1 unit of time. The result is 2000 processes at 1 unit of time, not 1 process at 2000 units of time, which is probably what you'd prefer. Having a massive speedup on a single node means that that node can push a single "calculation" farther along. I'd guess that the client itself is not multithreaded because of the threading overhead; it may not be worth the effort to optimize heavily for a dual-core speedup, since the overhead would take a chunk out of that, but a 40x speedup is another thing altogether.
  • JarredWalton - Sunday, October 1, 2006 - link

    The way FAH currently works is that pieces of a simulation are distributed; some will "fail" (i.e. fold improperly or hit a dead end) early, others will go for a long time. So they're trying to simulate the whole folding sequence under a large set of variables (temperatures, environment, acid/base, whatever), and some will end earlier than others. Eventually, they reach the stage where most of the sequences are in progress, and new work units are generated as old WUs are returned. That's where the problem comes in.

    If we were still scaling to higher clock speed, they could increase the size/complexity of simulations and still get WUs back in 1-5 days on average. If you add multiple cores at the same clock speed as earlier CPUs (i.e. X2 3800+ is the same as two Athlon 64 3200+ CPUs), you can do twice as many WUs at a time, but you're still waiting the same amount of time for results that may be important for future WU creation.

    Basically, Pande Group/Stanford has simulations that they'd like to run that might take months on current high-end CPUs, and then they don't know how fast each person is really crunching away - that's why some WUs have a higher priority. Now they can do those on an X1900 and get the results in a couple days, which makes the work a lot more feasible to conduct.

    That's one scenario, at least.
