Reducing GPU Power: A Quick Lesson in Clock Gating

Preparing a desktop GPU for use in a mobile environment means violating one of the fundamentals of chip design by introducing something called clock gating. All GPUs are composed of tens of millions of logic gates that make up everything from functional units to memory storage on the GPU itself. Each gate receives an input from the GPU-wide clock to ensure that all parts of the GPU work at the same frequency. But with GPUs getting larger and larger, it becomes increasingly difficult to ensure that all parts of the chip receive the clock signal at the same time. To compensate, elaborate clock trees are created, carrying a network of the same clock signal to all parts of the chip, so that when the clock goes high (instructing all of the logic in the chip to perform its individual task, sort of like a green light to begin work), all of the gates get the signal at the same time.

One principle that is always taught in chip design is to never allow the clock signal to pass through any sort of logic gates; it should always go from the source of the signal to its target without passing through anything else. The reason is that if you start putting logic between the clock source and its target, you make the clock tree extremely complicated - now you not only have to worry about getting the same clock signal to all parts of the GPU at the same time, but you must also worry about the delays introduced by feeding the clock through logic. There are benefits to clock gating, however, and they are primarily power savings.

It turns out that the easiest way to turn off a particular part of a chip is to stop feeding a clock signal to it; if the clock never goes high, that part of the chip never knows to start working, so it remains in its initial, non-operational state. But you obviously don't want the clock disabled all of the time, so you need to implement logic that determines whether or not the clock should be fed to a particular part of the chip.

When you're typing in Word, all your GPU is doing is 2D and memory operations. The floating point units of the GPU are not needed, nor is anything related to the 3D pipeline. So let's say we have a part of the chip that detects whether you are typing in Word instead of playing a game, and that part of the chip sends out a signal called 2D_Power_Save. When you are just in Word and don't need any sort of 3D acceleration, 2D_Power_Save goes high (the signal carries a value of '1', or electrically, whatever the high voltage is on the core); otherwise the signal stays low ('0', or 0V).

Using this 2D_Power_Save signal, we could construct some simple logic around the clock that is fed to all parts of the 3D engine on the GPU.

That logic is nothing more than a logical AND gate with two inputs and one output: the clock itself and an inverted copy of 2D_Power_Save. When 2D_Power_Save is high, the inverted value fed to the AND gate is low, so the Clock_Out signal can never go high and anything connected to it stays off. When 2D_Power_Save is low, the clock gets passed through to the rest of the GPU. That's how clock gating works.
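In software form, the same cell looks like this - a minimal Python sketch of the gate described above (the function name and the truth-table printout are our own illustration, not anything from a real netlist):

```python
# Minimal model of the gated-clock cell: AND the clock with the
# inverted power-save signal.
def gated_clock(clock_in: bool, power_save_2d: bool) -> bool:
    return clock_in and not power_save_2d

# Truth table: the clock only reaches the 3D engine while
# 2D_Power_Save is low.
for clk in (False, True):
    for save in (False, True):
        out = gated_clock(clk, save)
        print(f"clock_in={int(clk)} 2D_Power_Save={int(save)} "
              f"-> Clock_Out={int(out)}")
```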

We mentioned earlier that modern-day GPUs are composed of tens of millions of gates (each gate is made up of multiple transistors), and while it would be nice to, it's virtually impossible to put this sort of logic in place for every single one of those gates. For starters, you'd have an incredibly huge chip thanks to spending even more transistors on clock gating logic, and it would also make your clock tree incredibly difficult to construct. So instead of gating the clocks to individual gates, the clock fed to large groups of gates, known as blocks, is gated. The trick here is that the smaller the blocks you gate (or the more granular your clock gating), the greater your power savings will be.
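To put rough numbers on the granularity point, here is a toy model; the block names and milliwatt figures below are invented for illustration, not measurements of any real GPU:

```python
# Hypothetical per-block dynamic power draw (invented figures).
BLOCK_POWER_MW = {
    "2d_engine": 500,
    "vertex_shaders": 1500,
    "pixel_shaders": 3000,
    "rops": 1000,
}

def dynamic_power(enabled_blocks: set) -> int:
    """Only blocks whose clocks are running burn dynamic power."""
    return sum(mw for name, mw in BLOCK_POWER_MW.items()
               if name in enabled_blocks)

# Coarse gating: one enable for the whole chip means any work at all
# powers every block.
print(dynamic_power(set(BLOCK_POWER_MW)))   # 6000 mW

# Finer gating: a 2D desktop workload enables only the 2D engine.
print(dynamic_power({"2d_engine"}))         # 500 mW
```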

Let's say we've taken our little gated clock from above and fed it to the entire 3D rendering pipeline. So when 3D acceleration is not required (e.g. we're just typing away in MS Word), the entire 3D pipeline and all of its associated functional units are shut off, thus saving us lots of power. But now, when we fire up a game of Doom 3, all of our power savings are lost as the entire 3D engine is turned back on.

What if we could turn off parts of the GPU not only depending on what type of application we're running (2D or 3D), but also based on the specific requirements of that application? For example, Doom 3's shaders perform certain operations that will stress some parts of the GPU, while a game like Grand Theft Auto will stress other parts. A more granular implementation of clock gating would allow the GPU to differentiate between the requirements of the two applications and thus offer more power savings.
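As a sketch of what that could look like, extend the toy model so that each unit's clock enable comes from its own demand signal rather than one global 2D/3D switch (unit names and demand patterns are again hypothetical):

```python
# Each unit's clock runs only while something actually demands it.
def enabled_units(demand: dict) -> set:
    return {unit for unit, busy in demand.items() if busy}

# Two hypothetical games stressing different mixes of units:
# a shader-heavy title versus one leaning on fixed-function hardware.
shader_heavy = {"vertex_shaders": True, "pixel_shaders": True,
                "video_decode": False}
fixed_function = {"vertex_shaders": True, "pixel_shaders": False,
                  "video_decode": False}

print(sorted(enabled_units(shader_heavy)))    # pixel shaders clocked
print(sorted(enabled_units(fixed_function)))  # pixel shaders gated off
```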

While we're not quite at the level of the latter example, today's mobile GPUs do offer more granular clock gating than the previous generation. This will continue to be true for future mobile GPUs, as smaller manufacturing processes and improvements in GPU architecture allow for more and more granular clock gating.

So where are we today with the GeForce 6800 Go?

With the NV3x series of GPUs, as soon as a request hit the 3D pipeline, the entire 3D pipeline powered up, and it didn't power down until the last bits of data left the pipeline. With the GeForce 6800 Go, individual stages of the 3D pipeline power up only if they are being used; otherwise, they remain disabled thanks to clock gating. What this means is that power consumption in 3D applications and games is much better optimized now than it ever was before, and it will continue to improve with future mobile GPUs.
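A rough way to picture the difference - our own toy model, not NVIDIA's actual control logic - where occupancy[i] is True while stage i of the pipeline holds data on a given cycle:

```python
def whole_pipeline_gating(occupancy: list) -> list:
    """NV3x-style: every stage clocks up if any stage holds data."""
    any_busy = any(occupancy)
    return [any_busy] * len(occupancy)

def per_stage_gating(occupancy: list) -> list:
    """GeForce 6800 Go-style: a stage clocks up only while it holds data."""
    return list(occupancy)

occupancy = [False, True, True, False, False]  # data mid-pipeline
print(sum(whole_pipeline_gating(occupancy)))   # 5 stages powered
print(sum(per_stage_gating(occupancy)))        # 2 stages powered
```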

Since ATI's M28 has not officially been launched yet, we don't have any information on its power consumption. However, given that the X800 consumes less power than the 6800 on the desktop, we wouldn't be too surprised to see a similar situation emerge on the mobile side of things as well.


24 Comments


  • gibhunter - Monday, November 8, 2004 - link

    Yeah, nice performance, but it's not a laptop. First of all, you need to keep it on a flat surface so that the fans under the system are not blocked. Second, at 12 pounds it's ridiculously heavy. An 8 pound notebook is already pushing the limits. At 12 pounds, you can't call it a notebook with a straight face.

    The only exciting part of this product launch is what Anand said: the possibility of the 9800 being the new midrange and being found in smaller notebooks.
  • HardwareD00d - Monday, November 8, 2004 - link

    These new graphics solutions are going to give Intel's upcoming integrated graphics solutions a beating. Hooray for ATI & Nvidia!
  • allnighter - Monday, November 8, 2004 - link

    Nice comeback for nVidia. And a very good review as well. Not exactly apples to apples, but under the given circumstances, more than enough. If the M28 is not available at launch, it seems ATi may see another piece of the mobile market chipped away. Nice indeed. I'm actually happy for nVidia. Can't believe I said that lol, but I really like to see some real competition in the mobile market. It's been a one player game for a long time.
  • Aquila76 - Monday, November 8, 2004 - link

    Let me be the first to say, Holy sh1t! Doom3 at 1280x1024 High Quality 4xAA on a LAPTOP at 40+ FPS? I'd love to see these vid cards on an A64 proc, though. I remember not too long ago you had to settle for graphics capability from a couple of generations ago. Now they're on par with current desktop offerings. We've come a long way, baby.
