Reducing GPU Power: A Quick Lesson in Clock Gating

To prepare a desktop GPU for use in a mobile environment, one of the fundamentals of chip design is deliberately violated by introducing something called clock gating. All GPUs are composed of tens of millions of logic gates that make up everything from functional units to on-chip memory. Each of these elements receives the GPU-wide clock signal to ensure that all parts of the GPU work at the same frequency. But as GPUs grow larger and larger, it becomes increasingly difficult to ensure that all parts of the chip receive the clock signal at the same time. To compensate, elaborate clock trees are built, carrying a network of the same clock signal to all parts of the chip, so that when the clock goes high (instructing all of the logic in the chip to perform its individual task, sort of like a green light to begin work), all of the gates get the signal at the same time.

The one principle that is always taught in chip design is to never allow the clock signal to pass through any sort of logic gates; it should always travel from its source to its target without passing through anything else. The reason is that if you start putting logic between the clock source and its target, you make the clock tree extremely complicated: now you not only have to worry about getting the same clock signal to all parts of the GPU at the same time, but you must also account for the delays introduced by feeding the clock through logic. Clock gating has its benefits, however, and they are primarily power savings.

It turns out that the easiest way to turn off a particular part of a chip is to stop feeding it a clock signal; if the clock never goes high, that part of the chip never knows to start working, so it remains in its initial, non-operating state. But you obviously don't want the clock disabled all of the time, so you need to implement logic that determines whether or not the clock should be fed to a particular part of the chip.

When you're typing in Word, all your GPU is doing is 2D and memory operations. The floating point units of the GPU are not needed, nor is anything related to the 3D pipeline. So let's say we have a part of the chip that detects whether you are typing in Word instead of playing a game, and that part of the chip drives a signal called 2D_Power_Save. When you are just in Word and don't need any sort of 3D acceleration, 2D_Power_Save goes high (the signal carries a value of '1', or electrically, whatever the high voltage is on the core); otherwise, the signal stays low ('0', or 0V).

Using this 2D_Power_Save signal, we can construct some simple gating logic on the clock that is fed to all parts of the 3D engine on the GPU.

The logic is very simple: a logical AND gate with two input signals and one output. The 2D_Power_Save signal is inverted before it reaches the AND gate, so when it is high the value fed to the gate is low, and vice versa. If 2D_Power_Save is high, the inverted signal holds one AND input low, meaning the Clock_Out signal can never go high, and thus anything connected to it remains idle. If 2D_Power_Save is low, the clock passes through to the rest of the GPU. That's how clock gating works.
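As a rough software analogy (a hypothetical Python sketch, purely illustrative; real clock gating is done in hardware, and the function and signal names here are just the ones used above), the gate's behavior amounts to a small truth table:

```python
def gated_clock(clock: int, power_save_2d: int) -> int:
    """AND gate with the 2D_Power_Save input inverted: the clock only
    passes through when power saving is NOT requested."""
    return clock & (power_save_2d ^ 1)

# 2D_Power_Save high: Clock_Out is held low no matter what the clock does.
assert gated_clock(1, 1) == 0 and gated_clock(0, 1) == 0
# 2D_Power_Save low: the clock passes through unchanged.
assert gated_clock(1, 0) == 1 and gated_clock(0, 0) == 0
```

With the enable held high, the downstream logic never sees a rising edge, which is exactly why it stays in its non-operating state.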

We mentioned earlier that modern GPUs are composed of tens of millions of gates (each gate is made up of multiple transistors), and while it would be nice, it's virtually impossible to put this sort of logic in place for every single one of those gates. For starters, you'd end up with an incredibly large chip, thanks to spending even more transistors on clock-gating logic, and it would also make your clock tree incredibly difficult to construct. So instead of gating the clocks to individual gates, the clock fed to large groups of gates, known as blocks, is gated. The trick here is that the smaller the blocks you gate (i.e. the more granular your clock gating), the greater your power savings.
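The granularity trade-off can be sketched numerically (a hypothetical Python model; the block names and power figures are invented for illustration and do not come from any real GPU):

```python
# Invented per-block dynamic power figures, in watts.
BLOCK_POWER = {"vertex_shaders": 8.0, "pixel_shaders": 12.0,
               "texture_units": 6.0, "rops": 4.0}

def power_draw(active_blocks: set) -> float:
    """Only blocks whose clock is enabled draw dynamic power."""
    return sum(p for name, p in BLOCK_POWER.items() if name in active_blocks)

# Coarse gating: one enable for the whole 3D engine, so it's all or nothing.
coarse_3d = power_draw(set(BLOCK_POWER))                    # every block on
# Fine gating: a 2D desktop workload leaves every 3D block gated off.
fine_2d = power_draw(set())
# A shader-light game might need only part of the pipeline powered.
fine_game = power_draw({"pixel_shaders", "texture_units"})
```

The finer the blocks, the closer `power_draw` tracks what the workload actually needs, which is the whole point of granular clock gating.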

Let's say we've taken our little gated clock from above and fed it to the entire 3D rendering pipeline. So when 3D acceleration is not required (e.g. we're just typing away in MS Word), the entire 3D pipeline and all of its associated functional units are shut off, thus saving us lots of power. But now, when we fire up a game of Doom 3, all of our power savings are lost as the entire 3D engine is turned back on.

What if we could turn off parts of the GPU not only depending on what type of application we're running (2D or 3D), but also based on the specific requirements of that application? For example, Doom 3's shaders perform certain operations that will stress some parts of the GPU, while a game like Grand Theft Auto will stress other parts of the GPU. A more granular implementation of clock gating would allow the GPU to differentiate between the requirements of the two applications and thus offer more power savings.

While we're not quite at the level of the latter example, today's mobile GPUs do offer more granular clock gating than the previous generation. This will continue to be true for future mobile GPUs, as smaller manufacturing processes and improvements in GPU architecture allow for more and more granular clock gating.

So where are we today with the GeForce 6800 Go?

With the NV3x series of GPUs, as soon as a request hit the 3D pipeline, the entire 3D pipeline powered up and it didn't power down until the last bits of data left the pipeline. With the GeForce 6800 Go, different stages of the 3D pipeline will only power up if they are being used, otherwise they remain disabled thanks to clock gating. What this means is that power consumption in 3D applications and games is much more optimized now than it ever was before and it will continue to improve with future mobile GPUs.

Since ATI's M28 has not officially been launched yet, we don't have any information on its power consumption. However, given that the X800 consumes less power than the 6800 on the desktop, we wouldn't be too surprised to see a similar situation emerge on the mobile side of things as well.

24 Comments

  • bollwerk - Monday, November 08, 2004 - link

    bah, #13 beat me to it. I was also going to point out that the mobility 9800 was based on the X800, not the 9800. I think it was confusing of ATI to do this, but what can ya do... *shrug*
  • MAValpha - Monday, November 08, 2004 - link

    For accuracy's sake, the Mobility 9800 was based on the R420 core, not the desktop R350/R360 (cite: http://www.trustedreviews.com/article.aspx?art=611... ). Granted, it was an AGP chip, but it bore more technological resemblance to an X800 than to a 9800. Even so, I think that the name "Mobility X800" does make sense, in keeping with ATI's naming convention; then again, remember the Mobility 9700.
  • DeathByDuke - Monday, November 08, 2004 - link

    not a good comparison really. 8 pipeline chip vs 12 pipeline chip. ATi no doubt plans a Mobility X800, which is no doubt the '9800' with 12-16 pipes. It'd be fun.
  • ActuaryTm - Monday, November 08, 2004 - link

    Anand:

    Thank you for the review, and for the clarification regarding the available testing time for each machine. Especially enjoyed the clear, concise portion regarding clock gating.

    It should be noted to those with negative comments that this was not a review of either machine, but rather a simple comparison of the two GPUs.

    Look forward to the coming reviews, Anand. Well done.

    Regards,
    Michael
  • Anand Lal Shimpi - Monday, November 08, 2004 - link

    We wanted to run more tests but we only had the M28 laptop for a matter of a few hours and the Geforce 6800 Go laptop for less than a day before we had to send it back. Given more time with the solutions we would have gladly performed more tests. I'm hoping to have a shipping version of M28 by the end of this month for more thorough tests.

    As far as a comparison to other notebooks, the best comparison point is the Dell XPS equipped with the Mobility Radeon 9800, however Dell isn't very eager to send out review samples unless the review will benefit Dell - in this case, it definitely wouldn't, thus we could not secure a review sample in time.

    The request for desktop reference scores is a good one, while we didn't have time to include them in this review I'll make sure they get in the review of the shipping M28.

    Take care,
    Anand
  • skunkbuster - Monday, November 08, 2004 - link

    anyone know why ati can't make better OpenGL drivers? they really need to work on those more.
    that's the only thing i see lacking with their offerings.

    also, #7 and #8? i think a person who buys this sort of laptop isn't really concerned about battery life. it's more of a 'desktop replacement' than a 'portable'.

    i agree on the point of reference thing though. it would have been nice to have something to compare them to other than each other.

  • LoneWolf15 - Monday, November 08, 2004 - link

    I am greatly disappointed by the lack of battery life tests. Unlike desktops, where fastest with good image quality is what matters, if the numbers are relatively close (say, within 5-10%) performance-wise, a laptop buyer will almost always go for the setup with better battery life. I understand, like others, that the notebooks aren't identical, but there has got to be a way to test this. Also, there were no tests regarding CPU usage during DVD playback, something I consider a big deal. Numbers are nice, but this review is like a cake without the icing -- it's kind of bland.
  • Guspaz - Monday, November 08, 2004 - link

    I'm very disappointed with this article for a few reasons:

    1) There is no point of reference. Where are the benchmarks for a Radeon 9700 Mobility or Radeon 9800 Mobility? We have no idea how much faster these things are than existing mobility parts.

    2) There are no desktop points of reference either. Users want to know how these compare to desktop processors.

    3) The configs were not identical. These are desktop CPUs in the laptops, why didn't you take out the 3.4 and put in a 3.2?

    4) Why was the lower clocked 6800 Go used to test? Was the 450/600 not available?

    5) Why are there no battery runtime comparisons? I understand they are different notebooks that can't be directly compared, but if they have similar hardware with similar rated batteries, the results would be ballpark at least. Even so, there could have been runtime benchmarks comparing having the power saving features on and off.

    I'm sure there's some other missing things I just haven't noticed. Because Anandtech has grown into such a well respected site, there is an expectation of quality and quantity that we readers have come to expect. I feel this article just isn't up to the Anandtech snuff.
  • gordon151 - Monday, November 08, 2004 - link

    Damn, now my 9800xt is getting whooped by laptop graphics cards *sigh*. Wonder if there is gonna be an M28 XT or 6800 Go Ultra?
  • dextrous - Monday, November 08, 2004 - link

    Where's the battery life numbers Anand?
