Original Link: http://www.anandtech.com/show/4481/details-on-amd-bulldozer-opterons-to-feature-configurable-tdp
Details on AMD Bulldozer: Opterons to Feature Configurable TDP
by Johan De Gelas & Kristian Vättö on July 15, 2011 12:00 AM EST
AMD’s new Bulldozer-based CPUs are just around the corner. AMD has said the release of Zambezi CPUs will happen in Q3, which means any time from now. The latest word on the street suggests an October release, though. We know quite a lot about these CPUs already, but there is at least one thing we didn't know until now, and it may end up being a big thing in the server market. AMD’s John Fruehe has published an interesting blog post where he reveals that AMD’s upcoming server CPUs, Opterons, will feature a user-configurable TDP. Read on for our overview of Bulldozer, thoughts on configurable TDP, and a summary of AMD's future plans.
Overview of Bulldozer Lineup
|AMD Bulldozer lineup|
|Market||High-end consumers||Low-end servers||High-end servers|
|Core count||8, 6 or 4||8 or 6||16, 12 or 8|
|Supported CPU configurations||Single CPU||Up to dual CPU||Up to quad CPU|
Let's start with a brief overview of Bulldozer. It’s AMD’s first new micro-architecture since K10 (if we ignore Bobcat), which was released in late 2007, and frankly it’s long overdue. It will be manufactured using GlobalFoundries’ 32nm SOI process, just like Llano. Some of the architectural changes are covered here, so let's not get into that.
The regular desktop CPUs are codenamed Zambezi and will feature up to eight cores. They will use the AM3+ socket and some AM3 boards will also support the new Zambezi CPUs after a BIOS update. These CPUs will not feature an integrated GPU (unlike Llano and Ontario/Zacate) and will support up to 1866MHz DDR3 in dual-channel configuration.
Bulldozer actually gets more interesting when talking about the server parts, Opterons. For low-end and power-efficient servers, AMD will offer CPUs codenamed Valencia. Specification-wise these CPUs are pretty similar to Zambezi, with 8-core and 6-core variants. The memory support is also dual-channel, just like in Zambezi, but will be limited to 1600MHz. Valencia will be released under the Opteron 4200 Series brand and will support single- and dual-CPU configurations. It will also be compatible with AMD's current San Marino and Adelaide platforms (Opteron 4000 Series) for socket C32.
For high-end servers, AMD’s answer is Interlagos. It will feature up to 16 cores, which is achieved by combining two 8-core dies into one package, similar to AMD’s current 12-core Magny-Cours. There will also be 12-core and 8-core variants. Interlagos has up to four HyperTransport 3.0 links, meaning that quad-CPU configurations are supported. Apparently, there will also be CPUs with only two links, aimed at dual-CPU configurations. Memory support will be quad-channel 1600MHz DDR3. Intel’s Sandy Bridge-E is quad-channel as well, although we don’t know the speed of DDR3 that SB-E supports. Interlagos will be branded as the Opteron 6200 Series and will retain support for the Maranello platform (Opteron 6000 Series), which utilizes socket G34.
TDP Power Cap
What makes these new Opterons truly intriguing is the fact that they will offer user-configurable TDP, which AMD calls TDP Power Cap. This means you can buy pretty much any CPU and then downscale the TDP to fit within your server’s power requirements. In the server market, the performance isn’t necessarily the number one concern like it is when building a gaming rig. As all the readers of our data center section are aware, what really counts is the performance per watt ratio. Servers need to be as energy efficient as possible while still providing excellent performance.
John Fruehe (AMD) states, "With the new TDP Power Cap for AMD Opteron processors based on the upcoming 'Bulldozer' core, customers will be able to set TDP power limits in 1 watt increments." It gets even better: "Best of all, if your workload does not exceed the new modulated power limit, you can still get top speed because you aren’t locking out the top P-state just to reach a power level."
That sounds too good to be true: we can still get the best performance from our server while we limit the TDP of the CPU. Let's delve a little deeper.
Power capping is nothing new. The idea is not to save energy (kWh), but to limit the amount of power (watts) that a server or a cluster of servers can draw at any moment. That may sound contradictory, but it is not. If your CPU processes a task at maximum speed, it can return to idle very quickly and save power. If you cap your CPU, the task will take longer and your server will have used about the same amount of energy, because the CPU spends less time in idle, where it can save power in a lower P-state or even go to sleep (C-states). So power capping does not make any sense in a gaming rig: it would reduce your fps and not save you any energy at all. Buying CPUs with a lower maximum TDP is similar: our own measurements have shown that low power CPUs do not necessarily save energy compared to their siblings with higher TDP specs.
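To make the power versus energy distinction concrete, here is a minimal back-of-the-envelope sketch. All wattages, times, and the slowdown factor are made-up illustrative numbers, not measurements of any AMD or Intel CPU, and the model assumes just two power levels (busy and idle).

```python
def energy_joules(active_watts, idle_watts, work_seconds_at_full_speed,
                  speed_fraction, window_seconds):
    """Energy used over a fixed time window.

    The CPU is busy for (work / speed_fraction) seconds at active_watts,
    then sits idle at idle_watts for the rest of the window.
    All inputs are hypothetical illustration values.
    """
    busy_seconds = work_seconds_at_full_speed / speed_fraction
    if busy_seconds > window_seconds:
        raise ValueError("task does not finish inside the window")
    idle_seconds = window_seconds - busy_seconds
    return active_watts * busy_seconds + idle_watts * idle_seconds


# Uncapped: 100 W while busy, finishes a 10 s task, idles at 20 W.
uncapped = energy_joules(100, 20, 10, 1.0, 60)

# Capped (assumed): peak limited to 70 W, CPU runs at 62.5% speed.
capped = energy_joules(70, 20, 10, 0.625, 60)

print(f"uncapped: peak 100 W, {uncapped:.0f} J over 60 s")
print(f"capped:   peak  70 W, {capped:.0f} J over 60 s")
```

With these assumed numbers, both cases use 2000 J over the minute. The cap lowers the peak draw from 100 W to 70 W, but the energy bill is unchanged and the task takes 16 seconds instead of 10.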
In a data center, you have lots of servers connected to the same power lines, which can only deliver a certain amount of current (amps) at a given voltage (48, 115, 230 V...). You are also limited by the heat density of your servers. So the administrator wants to make sure that the cluster of servers never exceeds the cooling capacity and the current limits of the power lines. Power capping makes sure that the power usage and the cooling requirements of your servers become predictable.
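The arithmetic behind that predictability is simple: amps times volts gives the power budget of a circuit, and the per-server cap determines how many servers fit on it. The sketch below uses hypothetical circuit and server numbers, and the `derate` parameter is our own assumption for leaving headroom. It also ignores power factor, power supply losses, and cooling limits.

```python
def max_servers(circuit_amps, volts, per_server_cap_watts, derate=1.0):
    """How many servers fit on one circuit if each is capped at a given wattage.

    derate is an assumed safety factor (1.0 means use the full rating).
    """
    budget_watts = circuit_amps * volts * derate
    return int(budget_watts // per_server_cap_watts)


# Hypothetical 16 A circuit at 230 V: 3680 W of budget.
print(max_servers(16, 230, 400))  # planning for a 400 W worst case per server
print(max_servers(16, 230, 300))  # every server capped at 300 W
```

In this example, planning around a 400 W worst case allows 9 servers on the circuit, while a guaranteed 300 W cap allows 12.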
The current power capping techniques limit the processor P-states. Even under heavy utilization, the CPU never reaches the top frequency. This is a rather crude and pretty poor way of keeping the maximum power under control, especially from a performance point of view. The thing to remember here is that high frequencies always improve processing performance, while extra cores only improve performance in ideal circumstances (no lock contention, enough threads, etc.). Limiting frequency in order to reduce power often results in a server running far below where it could in terms of performance and power use, just to be "safe".
Bulldozer's Power Management
AMD confirmed that the power management of the Bulldozer core is an improved version of the power management found in the “Llano” CPU. Just like Llano, Bulldozer has a Digital APM Module. The APM module samples a number of performance counter signals, and these samples are used to estimate dynamic power with 98% accuracy. Now combine this power estimate with Bulldozer's power gating at the module level and vastly improved clock gating, and you can start to understand what is possible.
Bulldozer reduces the number of active and power consuming circuits by vastly improved clock gating
If your application runs only one or a few threads on your 8-module, 16-core Interlagos CPU, several of those modules might be power gated. Or if you run integer-only threads, the fact that quite a few unused parts of the module (e.g. the FPU) will be clock gated might be enough to stay under the configured TDP. So in those cases, it won't be necessary to limit the clock speed. And that is really great, especially in the real world.
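A toy model shows why an estimate-driven cap behaves differently from a fixed P-state limit. This is only a sketch of the idea, not AMD's algorithm: the P-state table, the per-module wattages, and the function names are all hypothetical, and the real APM module works from hardware performance counters rather than a lookup like this.

```python
# Hypothetical P-state table: (name, clock in GHz, relative dynamic power).
P_STATES = [("P0", 2.6, 1.00), ("P1", 2.2, 0.75), ("P2", 1.8, 0.55), ("P3", 1.4, 0.40)]

UNCORE_WATTS = 25.0      # assumed: always-on logic (memory controller, links)
MODULE_INT_WATTS = 8.0   # assumed: one active module's integer cores at P0
MODULE_FPU_WATTS = 4.0   # assumed: extra draw when that module's FPU is busy


def estimated_power(active_modules, fpu_modules, power_factor):
    """Estimate package power; power-gated modules are assumed to draw nothing."""
    module_watts = active_modules * MODULE_INT_WATTS + fpu_modules * MODULE_FPU_WATTS
    return UNCORE_WATTS + module_watts * power_factor


def highest_pstate_under_cap(cap_watts, active_modules, fpu_modules):
    """Return the fastest P-state whose estimated power fits under the cap."""
    for name, _ghz, factor in P_STATES:
        if estimated_power(active_modules, fpu_modules, factor) <= cap_watts:
            return name
    return P_STATES[-1][0]  # nothing fits: fall back to the slowest state


CAP = 95  # hypothetical configured TDP in watts
print(highest_pstate_under_cap(CAP, 2, 0))  # two busy modules, six power gated
print(highest_pstate_under_cap(CAP, 8, 0))  # all modules busy, FPUs clock gated
print(highest_pstate_under_cap(CAP, 8, 8))  # all modules busy, FPUs in use
```

Under these assumptions the first two workloads stay at P0 and only the third drops to P2. A static P-state cap sized for that worst case would have pinned all three workloads at P2.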
In the real world, only a few HPC applications behave like the SPEC CPU rate benchmarks, which spawn threads across all cores. Most server applications do not fully utilize all available cores all the time. Sometimes, only one thread will be really critical and the perceived application performance will depend on it. A little bit later several threads might demand CPU power (but not all cores will be busy). Only a certain percentage of the time are all the cores used. That is exactly the reason why the cheaper Magny-Cours makes so much sense for HPC applications, yet struggles to keep up with the higher clocked, higher IPC Xeon Westmere cores when running OLTP and ERP applications. Putting a power cap on a Magny-Cours means even lower frequencies, and as a result even higher response times (as we have measured here).
By adding power consumption measurements to the CPU, Bulldozer will run most server applications at full speed unless you lower the TDP too far. (Obviously, if the TDP is lowered enough, the CPU will not be able to operate at higher frequencies, thus degrading response times too.) The maximum throughput will be a little bit lower, but most server applications almost never run at maximum throughput. In fact, maximum throughput only matters for HPC applications and benchmarks. For real human users, response times are the only thing that matters.
The beauty of this new power cap system is that in normal circumstances (e.g. the server is running at 40-70% load), the response times will hardly be any longer. At the same time, the administrator can make sure that the server cluster does not exceed the capacity of the cooling equipment and the power lines.
This TDP Power Cap technology could be very interesting to small and medium businesses too, and not only to owners of large server clusters. TDP Power Cap could be a way to make sure that your colocated servers never exceed the maximum number of amps allocated to you, and as a result you will not have to pay unexpectedly high electricity bills. However, whether or not this ideal world of low response times and low electricity bills will become a reality for Bulldozer server owners will also depend on the availability of a good and decently priced management software tool that allows the administrator to configure the TDP on all servers simultaneously.
On a standard server, you will get a section in the BIOS that allows you to tweak the TDP in 1W increments (or a maximum of 64 power settings), a good step forward compared to the current P-state setting. But to control a server cluster in an efficient way, good management software is needed. Currently, you have to buy all your servers from the same vendor (HP for example) and then pay for management software such as HP's Insight Control. To really unlock this technology, AMD or one of their partners needs to make sure this kind of software is widely available. Some open source code, perhaps?
In the last few years, AMD hasn't really been able to fight against Intel in the high-end CPU market. Pretty much since the release of the Nehalem microarchitecture in late 2008, Intel has held the crown of fastest CPUs and AMD has only been the best option for budget builds. Bulldozer has suffered from delays, and recently AMD delayed it even more because the performance didn't meet their expectations. However, Bulldozer could have the potential to shake Intel's position outside the budget CPU market as well.
According to leaked product positioning slides, Zambezi is aimed to fight against Intel's Core i5 and i7 lineups. Zambezi will feature up to eight cores, which is twice as many as i7-2600(K)'s four cores. AMD said that they won't join the Hyper-Threading club and they will deliver as many physical cores as Intel delivers physical and virtual cores combined. It looks like AMD is keeping their word, though they're only delivering half as many "FP/SSE cores". Intel will probably still provide the best single-threaded performance, but AMD's aggressive approach with many physical cores may bring them the trophy of best multi-threaded performance. We shall hopefully see this very soon.
In the server market, AMD's role is a lot more complex. For some HPC applications, AMD offers the best performance at a much lower price. In the midrange, AMD-based servers offer more cores (quad-socket) and (in most cases) higher performance for a relatively small price premium over the typical dual-socket Xeon servers. At the same time, if your applications cannot make good use of all those cores, dual-socket Xeon servers can offer a better performance/watt ratio and lower response times. In the high end, Intel's Xeon E7 completely dominates, and AMD has left this market for now. In the low power market, Intel's low power Xeons offer better performance/watt and AMD can only compete when every dollar counts. In most cases, the price of the server CPU is less important in the grand TCO scheme.
In other words, AMD really needs a server CPU with a much higher performance per core and a better performance/watt ratio. TDP Power Cap or configurable TDP helps AMD's server CPUs keep the electricity bill down by avoiding "bursty" power usage. At the same time, with their implementation, TDP Power Cap should have little effect on the real world (not pure throughput benchmarking) performance if you do not lower the TDP too much. We won't be sure until we have measured it, but it looks like a big step in the right direction: lower TCO and more predictable power usage without a (large) performance penalty.
AMD's Future Plans
|Second Generation AMD Fusion lineup|
|Codename||Krishna and Wichita||Trinity||Komodo||Sepang||Terramar|
|Architecture||Enhanced Bobcat||NG Bulldozer||NG Bulldozer||NG Bulldozer||NG Bulldozer|
|Core count||1-4||2-4||6-10||Up to 10||Up to 20|
Bulldozer will make its way to mainstream CPUs in 2012. Llano's successor, Trinity, will feature up to four next-generation Bulldozer cores. Next-generation (NG) in this context appears to mean that AMD will tweak the architecture because the CPUs will still be manufactured using 32nm SOI. Zambezi's successor, Komodo, will again increase the core count and make it up to 10.
As for the server market, AMD's approach will be a bit more aggressive. AMD will again increase the number of cores, to up to 20 NG Bulldozer cores. Valencia's successor will be the 10-core Sepang and Interlagos' successor will be the 20-core Terramar. The server CPUs will also feature PCIe 3.0 support.
Krishna and Wichita will also replace AMD's current Ontario and Zacate APUs. There will be a die shrink from 40nm to 28nm, so at this point Krishna and Wichita look the most interesting from the 2nd gen Fusion lineup. Doubling the cores should yield a nice performance boost in heavily threaded scenarios, though single-threaded performance is still a sore spot for Bobcat compared to other architectures.