An Unexpected Loss of Performance at Higher Speeds

Our testing provided us with many opportunities to explore the limits of the QX9650. Along the way, we noticed some unexpected performance problems when overclocking above ~4.25GHz: the processor seemed unable to maintain full load during periods of sustained 100% CPU usage. Our first indication of a problem came when we observed rapidly fluctuating on-die core temperatures while running an instance of Prime95 on all four cores. Manually assigning a core affinity to each thread and then observing the system showed that throttling was occurring on Core 1.
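One way to repeat this kind of per-core check without Prime95 is to pin one busy worker to each core, load them all at once, and compare how much work each completes in the same interval. The sketch below is a hypothetical illustration, not the method used in our testing. It assumes Linux (`os.sched_setaffinity` is not available on Windows), uses a plain Python integer loop as a stand-in for a real stress test, and treats each logical CPU as one physical core. That holds for the QX9650, which lacks Hyper-Threading, but the OS's CPU numbering may not match the "Core 1" label used above. All function names are our own.

```python
import multiprocessing as mp
import os
import time


def busy_iterations(duration_s: float) -> int:
    """Spin on integer arithmetic and count loop iterations completed in duration_s."""
    count = 0
    x = 1
    deadline = time.perf_counter() + duration_s
    while time.perf_counter() < deadline:
        x = (x * 1103515245 + 12345) & 0x7FFFFFFF  # cheap LCG step keeps the core busy
        count += 1
    return count


def _pinned_worker(core: int, duration_s: float, queue) -> None:
    """Restrict this worker process to one logical CPU, then report its iteration count."""
    os.sched_setaffinity(0, {core})  # Linux-only API
    queue.put((core, busy_iterations(duration_s)))


def measure_all_cores(duration_s: float = 30.0) -> dict[int, int]:
    """Load every allowed CPU at once (one pinned process each) and collect throughput."""
    cores = sorted(os.sched_getaffinity(0))
    queue = mp.Queue()
    workers = [mp.Process(target=_pinned_worker, args=(c, duration_s, queue)) for c in cores]
    for w in workers:
        w.start()
    results = dict(queue.get() for _ in workers)  # drain the queue before joining
    for w in workers:
        w.join()
    return results


def flag_slow_cores(throughput: dict[int, int], tolerance: float = 0.10) -> list[int]:
    """Return cores whose throughput falls more than `tolerance` below the fastest core."""
    best = max(throughput.values())
    return sorted(c for c, n in throughput.items() if n < best * (1.0 - tolerance))
```

On a Linux machine, calling `flag_slow_cores(measure_all_cores())` from under an `if __name__ == "__main__":` guard would list any core that completed noticeably less work than its siblings. A core whose clock is being duty-cycle modulated should show up there even if its reported frequency never changes.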

Because traditional CPU frequency detection tools showed no real-time change in operating frequency of any kind, we concluded that the core clock signal was somehow being subjected to a modulated duty cycle. This confused us, though, as what we were seeing did not fit with what we already knew about Core 2 thermal protection mechanisms. Although Intel processors do make use of a feature intended to lower power consumption should they get too hot, all the documentation we could get our hands on suggests that all cores are affected, not just one. Besides, core temperatures were well under control and nowhere near the QX9650's maximum allowable T-junction limit of 105°C.
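The arithmetic behind that conclusion is simple. A frequency readout reports the nominal clock (bus speed times multiplier), while clock modulation gates the core clock off for part of each period, so the work a core completes tracks the nominal clock multiplied by the duty cycle. The sketch below assumes, for illustration only, that the duty cycle moves in 12.5% steps; the 425MHz x 10 configuration is a hypothetical way to reach 4.25GHz, not our actual test settings, and the function names are our own.

```python
def reported_clock_ghz(bus_mhz: float, multiplier: float) -> float:
    """What a frequency-detection tool shows: bus clock times multiplier."""
    return bus_mhz * multiplier / 1000.0


def effective_clock_ghz(nominal_ghz: float, duty_cycle: float) -> float:
    """Average rate at which a clock-modulated core does work."""
    if not 0.0 < duty_cycle <= 1.0:
        raise ValueError("duty cycle must be in (0, 1]")
    return nominal_ghz * duty_cycle


# Assumed modulation granularity: eighths of a clock period (12.5% steps).
DUTY_STEPS = [step / 8 for step in range(1, 9)]

nominal = reported_clock_ghz(425, 10)  # hypothetical 4.25GHz configuration
for duty in DUTY_STEPS:
    eff = effective_clock_ghz(nominal, duty)
    print(f"duty {duty:6.1%}: reported {nominal:.2f} GHz, effective {eff:.3f} GHz")
```

At a 50% duty cycle, for example, the tool would still report 4.25GHz while the core delivers roughly the throughput of a 2.125GHz part.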


It looks as though Core 1 is having trouble keeping up with the other three

We first suspected that our motherboard's VRM circuitry might have been overheating while supplying the high load current. If this were the case, the PWM IC would signal the processor using the PROCHOT pad, and the CPU would respond by modulating an internal clock signal to each core, thereby artificially lowering the load and allowing the VRM to cool: a failsafe meant to save the VRM should things start to get too hot. Eventually, our frustration in the matter led us to modify our board by disconnecting the control signal altogether. Unfortunately, there was no change.
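The failsafe described above amounts to a two-threshold handshake between the VRM controller and the CPU. The sketch below models it with hysteresis so the signal does not chatter around a single trip point. The 110°C assert and 100°C release thresholds and the 50% throttled duty cycle are placeholder assumptions, not values from the ADP3228 datasheet or any Intel document, and the class and function names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class VrmThermalMonitor:
    """Hypothetical VRM-side monitor that drives a PROCHOT-style request line."""
    assert_at_c: float = 110.0   # assumed trip point, not from any datasheet
    release_at_c: float = 100.0  # assumed release point; the gap provides hysteresis
    prochot: bool = False

    def update(self, vrm_temp_c: float) -> bool:
        """Assert above the trip point, release only once cooled below the release point."""
        if not self.prochot and vrm_temp_c >= self.assert_at_c:
            self.prochot = True
        elif self.prochot and vrm_temp_c <= self.release_at_c:
            self.prochot = False
        return self.prochot


def cpu_duty_cycle(prochot: bool, throttled_duty: float = 0.5) -> float:
    """CPU side: run every core at full duty cycle unless the request line is asserted."""
    return throttled_duty if prochot else 1.0


monitor = VrmThermalMonitor()
for temp in [95, 108, 111, 105, 101, 99]:
    line = monitor.update(temp)
    print(f"VRM {temp}C -> PROCHOT {line}, core duty cycle {cpu_duty_cycle(line):.0%}")
```

Note that in this model every core is throttled together, which is one reason the single-core behavior we observed did not fit the VRM-overheating theory.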

We discussed the possibility of having discovered an undocumented erratum, thinking that perhaps some internal control logic was at fault. The Analog Devices ADP3228 PWM controller used on the ASUS P5E3 motherboard, designed in compliance with Intel's new VRM 11.1 specification, includes a new power management feature intended to improve power circuit efficiency during periods of light loading. When directed by the CPU, the VRM essentially disables four of its eight power delivery phases until they are later commanded back on. (This is not unlike the approach used in the automobile industry, wherein half of an internal combustion engine's cylinders shut down while cruising in order to improve fuel economy.) However, we are unable to completely rule out an incompatibility, as no one seems to know how to disable this feature.
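The practical concern with shedding phases is per-phase current. The sketch below assumes ideal current sharing between active phases and uses a hypothetical 100A load; it simply shows that a controller left in its reduced-phase state under heavy load would ask each remaining phase to carry twice its normal share. The function names are hypothetical and do not correspond to any ADP3228 register or pin.

```python
def active_phases(low_power_request: bool, total_phases: int = 8) -> int:
    """Phases left switching: half of them when the CPU requests the light-load state."""
    return total_phases // 2 if low_power_request else total_phases


def per_phase_current_a(load_current_a: float, phases: int) -> float:
    """Current each active phase carries, assuming perfectly balanced sharing."""
    return load_current_a / phases


load_a = 100.0  # hypothetical full-load CPU current
for light_load in (False, True):
    n = active_phases(light_load)
    print(f"{n} phases active: {per_phase_current_a(load_a, n):.1f} A per phase")
```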

In fact, we believe what we are seeing may be nothing more than a CPU protection mechanism in action. The Core 2 family of processors is extremely resilient to abuse; reports of failures due to overvoltage or overcurrent incidents are exceedingly rare. Features such as these work by clamping processor input voltage (and current) to tolerable levels in order to prevent permanent damage. Further testing revealed that we have some level of control over the "throttling": by slightly lowering the VID, and in turn the CPU supply voltage, we were able to complete testing at some of the same frequencies with no noticeable performance degradation. Is it possible that we found a processor protection limit with nothing more than common water-cooling? Normally, such discoveries are the domain of those who freeze their CPUs with one or more rotary compressors or copious amounts of liquid nitrogen. Given the enormous power increases observed at these higher speeds, due to what might be a processor capacitance effect, we cannot help but wonder whether these new limitations are an unintended consequence of Intel's 45nm process.
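To first order, dynamic CPU power follows P = C·V²·f, so supply voltage enters squared while frequency enters linearly; that is why a small VID reduction can matter as much as it did here. The sketch below uses this textbook model to compare two operating points. It ignores leakage, which also rises with voltage and temperature, and the voltages in the example are hypothetical rather than our measured values. The optional `c_ratio` argument is our own addition for exploring a change in effective switched capacitance.

```python
def relative_dynamic_power(v_new: float, f_new: float, v_ref: float, f_ref: float,
                           c_ratio: float = 1.0) -> float:
    """Return P_new / P_ref under the first-order model P = C * V**2 * f (leakage ignored)."""
    return c_ratio * (v_new / v_ref) ** 2 * (f_new / f_ref)


# Hypothetical example: same 4.25GHz clock, VID lowered from 1.45V to 1.40V.
ratio = relative_dynamic_power(v_new=1.40, f_new=4.25, v_ref=1.45, f_ref=4.25)
print(f"dynamic power falls to {ratio:.1%} of its previous value")
```

In this made-up case a 0.05V reduction trims dynamic power by about 7% at an unchanged clock.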

If what we believe is true, the implications could be enormous. Much of the case for the industry built on delivering high-performance cooling solutions to the overclocking community would be invalidated. What's the point in spending more money on a more effective heatsink if there's nothing to gain? With that said, we honestly believe a new direction in CPU overclocking may soon be upon us. While there will always be those who continue to push processors to their absolute limits, the majority of us will find our new "performance" benchmark in efficiency. This makes sense, though: the market has been heading this way for years now, and overclockers may simply have chosen to ignore the obvious. The multi-core era we now live in places a heavy emphasis on performance-per-watt figures and other measurable efficiencies. Does anyone else find it odd that Intel's flagship product, the QX9650, comes in exactly the same speed bin as the previous 65nm offering? All this talk of improved performance and efficiency and not even a measly frequency bump; perhaps Intel is trying to tell us something.


56 Comments


  • Lifted - Wednesday, December 19, 2007 - link

    Very impressive. Seems more like a thesis paper than a typical tech site article. While the content on AT is of a higher quality than the rest of the sites out there, I think the other authors, founder included, could learn a thing or two from an article like this. Less commentary/controversy and more quality is the way to go.
  • AssBall - Wednesday, December 19, 2007 - link

    Shouldn't page 3's title be "Exploring the limits of 45nm Hafnium"? :D

    http://www.webelements.com/webelements/elements/te...
  • lifeguard1999 - Wednesday, December 19, 2007 - link

    "Do they worry more about the $5000-$10000 per month (or more) spent on the employee using a workstation, or the $10-$30 spent on the power for the workstation? The greater concern is often whether or not a given location has the capacity to power the workstations, not how much the power will cost."

    For High Performance Computers (HPC, a.k.a. supercomputers), every little bit helps. We are concerned not only about the power drawn by the CPU, but also about the power drawn by the little 5 W Ethernet port that goes unused. HPC systems now scale into the tens of thousands of CPUs, so that 5 W Ethernet port becomes a 50 kW problem (10,000 × 5 W) just from the additional power required. That problem now has to be cooled as well, and more cooling requires more power. Can your infrastructure handle the power and cooling load, or does it need to be upgraded?
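The arithmetic in the comment above can be written out directly. The sketch below follows the comment in counting one idle 5 W port per CPU, and adds an assumed cooling overhead expressed as watts of cooling power per watt of heat removed; the 0.5 value in the example is an illustrative assumption, not a figure from the comment. The function name is hypothetical.

```python
def idle_port_overhead_kw(cpus: int, port_watts: float = 5.0,
                          cooling_overhead: float = 0.0) -> float:
    """Total kW for unused ports across a system, including the power to cool them.

    cooling_overhead is an assumed ratio: watts of cooling power per watt of heat.
    """
    heat_kw = cpus * port_watts / 1000.0
    return heat_kw * (1.0 + cooling_overhead)


print(idle_port_overhead_kw(10_000))                        # ports alone
print(idle_port_overhead_kw(10_000, cooling_overhead=0.5))  # plus assumed cooling
```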

    This is somewhat of a straw-man argument, since most (but not all) HPC vendors know about the problem. Most HPC vendors do not include items on their systems that are not used. They know that to stay in the race with their competitors, they have to meet or exceed performance benchmarks, and those benchmarks cover not only how fast a system can execute software, but also how much power, cooling, and (can you guess it?) noise it requires.

    In 2005, we started looking at what it would take to house our 2009 HPC system. In 2007, we started upgrades to be able to handle the power and cooling needed. The local power company loves us, even though they have to upgrade their power substation.

    Thought for the day:
    How many car batteries does it take to make a UPS for a HPC system with tens-of-thousands of CPUs?
  • CobraT1 - Wednesday, December 19, 2007 - link

    "Thought for the day:
    How many car batteries does it take to make a UPS for a HPC system with tens-of-thousands of CPUs?"

    0.

    Car batteries are used in neither static nor rotary UPSs.
  • tronicson - Wednesday, December 19, 2007 - link

    This is a great article: very technical. I'll have to read it step by step to get it all ;-)

    But one question remains for me: what about electromigration with the very fine 45nm structures? We have new materials here, like the hafnium-based high-k dielectric, and I guess this may improve the resistance against EM... but how far can we really push this CPU before we risk a very short life and destruction? Intel specifies headroom up to a maximum of 1.3625V. What can I risk with good water cooling? How far can I go?

    I mean, feeding a 45nm core 1.5V, for example, is the same as giving a 65nm core 1.6375V! Would you do that to your Q6600?
  • eilersr - Wednesday, December 19, 2007 - link

    Electromigration is an effect usually seen in the interconnect, not in the gate stack. It occurs when a wire (or material) has a high enough current density that the atoms actually move, leading to an open circuit, or in some cases, a short.

    To address your questions:
    1. The high-k dielectric in the gate stack has no effect on the resistance of the interconnect
    2. The finer wires on a 45nm process do have a lower threshold for electromigration effects, i.e., smaller wires can tolerate a lower current density before breaking.
    3. The effects of electromigration are fairly well understood at this point. All kinds of automated checks are built into the design tools before tapeout, and very robust reliability tests are performed on the chips prior to volume production to catch these types of reliability issues.
    4. The voltage a chip can tolerate is limited by a number of factors. Ignoring breakdown voltages and other effects limited by the physics of transistor operation, heat is where most OC'ers are concerned. As dynamic power dissipation is most crudely thought of in terms of CV^2f (capacitance times voltage-squared times frequency), the reduced capacitance in the gate due to the high-k dielectric does dramatically lower power dissipation, and is well cited. The other main component of power in modern CPUs is leakage, which again is helped by the high-k dielectric. So you should expect to be able to hit a somewhat higher voltage before running into a thermal envelope limitation. However, the actual voltage a chip can tolerate will depend on the individual CPU and which corner of the process it came from. In all, there's no general guideline for what is "safe". Of course, anything over the recommended voltage isn't "safe", but the only way you'll find out, unfortunately, is trial and error.
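Point 2 can be made concrete with the definition of current density, J = I / A. The sketch below uses hypothetical rectangular wire cross-sections (not Intel's actual 45nm or 65nm interconnect dimensions) to show that halving both the width and the thickness of a wire carrying the same current quadruples its current density. The function name is our own.

```python
def current_density_ma_per_um2(current_ma: float, width_um: float, thickness_um: float) -> float:
    """Current density J = I / (width * thickness) for a rectangular wire cross-section."""
    return current_ma / (width_um * thickness_um)


# Hypothetical wires carrying the same 1 mA.
wide = current_density_ma_per_um2(1.0, width_um=0.2, thickness_um=0.4)
narrow = current_density_ma_per_um2(1.0, width_um=0.1, thickness_um=0.2)
print(f"J rises by a factor of {narrow / wide:.1f}")
```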
  • eilersr - Wednesday, December 19, 2007 - link

    Doh! Just noticed my own mistake:
    High-k dielectric does not reduce capacitance! Quite the contrary: a high-k dielectric will have higher capacitance if the thickness is kept constant. Don't know what I was thinking.

    Regardless, the capacitance of the gate stack is a factor, as the article mentioned. I don't know how the cap of Intel's 45nm gate compares with that of their 65nm gate, but I would venture it is lower:

    1. The area of the FETs is smaller, so there is less W*L parallel-plate cap.
    2. The thickness of the dielectric was increased. Usually this decreases cap, but the addition of high-k counteracts that. Hard to say what balance was actually achieved.

    This is just a guess; only the process engineers know for sure :)
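The trade-off in the two points above follows the parallel-plate approximation C = κ·ε₀·W·L / t. The sketch below computes that estimate and a capacitance ratio between two gate designs. The scale factors in the example (half the area, five times the dielectric constant, three times the thickness) are purely hypothetical and say nothing about Intel's actual 45nm or 65nm gate stacks; the function names are our own.

```python
EPSILON_0 = 8.8541878128e-12  # vacuum permittivity, F/m


def gate_capacitance_f(kappa: float, width_m: float, length_m: float, thickness_m: float) -> float:
    """Parallel-plate estimate C = kappa * eps0 * W * L / t (fringing fields ignored)."""
    return kappa * EPSILON_0 * width_m * length_m / thickness_m


def capacitance_ratio(kappa_ratio: float, area_ratio: float, thickness_ratio: float) -> float:
    """C_new / C_old when kappa, W*L area, and dielectric thickness each scale by a factor."""
    return kappa_ratio * area_ratio / thickness_ratio


# Hypothetical scaling: half the area, 5x the dielectric constant, 3x the thickness.
print(f"C_new / C_old = {capacitance_ratio(5.0, 0.5, 3.0):.2f}")
```

In this made-up case the higher κ and thicker dielectric nearly cancel, and the smaller area leaves the new gate slightly lower, which illustrates why the real balance is hard to call without process data.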
  • kjboughton - Wednesday, December 19, 2007 - link

    Asking how much voltage can be safely applied to a (45nm) CPU is a lot like asking which story of a building you can jump from without the risk of breaking both legs on the landing. There's inherent risk in exceeding the manufacturer's specification at all, and if you asked Intel what they thought, I know exactly what they would say: 1.3625V (or whatever the maximum rated VID value is). The fact of the matter is that choices like these can only be made by you. Personally, I feel exceeding about 1.4V with a quad-core 45nm CPU is a lot like beating your head against a wall, especially if your main concern is stability. My recommendation is that you stay below this value, assuming you have adequate cooling and can keep your core temperatures in check.
  • renard01 - Wednesday, December 19, 2007 - link

    I just wanted to tell you that I am impressed by your article! Deep and practical at the same time.

    Go on like this.

    This is an impressive CPU!!

    regards,
    Alexander
  • defter - Wednesday, December 19, 2007 - link

    People, stop posting silly comments like: "Intel's TDP is below real power consumption, it isn't comparable to AMD's TDP".

    Here we have a 130W TDP CPU consuming 54W under load.
