Power Optimizations

It is well known that the Haswell has been optimized for low idle power. Servers run at idle a lot less than mobile devices and thus more power management capabilities were needed. 

In Haswell EP, Intel introduces per core p-states (PCPS). PCPS is not necessarily a blessing, as a wrongly chosen p-state can result in higher response times. However, Intel is convinced PCPS will save power or will at least shift the power to where it is needed: to other cores or to the uncore (rings). 

Floating Point intensive code is known to cause power peaks. AVX has doubled the theoretical FLOPS, and the FMA (Fused Multiply Add) of AVX 2.0 promises to double it again. 

To cope with the huge difference between the power consumption of Integer and AVX code, Intel is introducing new base and Turbo Boost frequencies for all their SKUs; these are called AVX base/Turbo. For example, the E5-2693 v3 will start from a base frequency of 2.3GHz and turbo up to 3.3GHz when running non-AVX code. When it encounters AVX code however, it will not able to boost its clock to more than 3GHz during a 1 ms window of time. If the CPU comes close to thermal and TDP limits, clock speed will drop down to 1.9GHz, the "AVX base clock".

Here, Intel illustrates how this will work for their top SKU, the Xeon E5-2699 v3. Notice the lower base clock for AVX code (1.9 vs 2.3).

The Magic Inside the Uncore DDR4
POST A COMMENT

84 Comments

View All Comments

  • Kevin G - Monday, September 08, 2014 - link

    I'm actually surprised they released the 18 core chip for the EP line. In the Ivy Bridge generation, it was the 15 core EX die that was harvested for the 12 core models. I was expecting the same thing here with the 14 core models, though more to do with power binning than raw yields.

    I guess with the recent TSX errata, Intel is just dumping all of the existing EX dies into the EP socket. That is a good means of clearing inventory of a notably buggy chip. When Haswell-EX formally launches, it'll be of a stepping with the TSX bug resolved.
    Reply
  • SanX - Monday, September 08, 2014 - link

    You have teased us with the claim that added FMA instructions have double floating point performance. Wow! Is this still possible to do that with FP which are already close to the limit approaching just one clock cycle? This was good review of integer related performance but please combine with Ian to continue with the FP one. Reply
  • JohanAnandtech - Monday, September 08, 2014 - link

    Ian is working on his workstation oriented review of the latest Xeon Reply
  • Kevin G - Monday, September 08, 2014 - link

    FMA is common place in many RISC architectures. The reason why we're just seeing it now on x86 is that until recently, the ISA only permitted two registers per operand.

    Improvements in this area maybe coming down the line even for legacy code. Intel's micro-op fusion has the potential to take an ordinary multiply and add and fuse them into one FMA operation internally. This type of optimization is something I'd like to see in a future architecture (Sky Lake?).
    Reply
  • valarauca - Monday, September 08, 2014 - link

    The Intel compiler suite I believe already converts

    x *= y;
    x += z;

    into an FMA operation when confronted with them.
    Reply
  • Kevin G - Monday, September 08, 2014 - link

    That's with source that is going to be compiled. (And don't get me wrong, that's what a compiler should do!)

    Micro-op fusion works on existing binaries years old so there is no recompile necessary. However, micro-op fusion may not work in all situations depending on the actual instruction stream. (Hypothetically the fusion of a multiply and an add in an instruction stream may have to be adjacent to work but an ancient compiler could have slipped in some other instructions in between them to hide execution latencies as an optimization so it'd never work in that binary.)
    Reply
  • DIYEyal - Monday, September 08, 2014 - link

    Very interesting read.
    And I think I found a typo: page 5 (power optimization). It is well known that THE (not needed) Haswell HAS (is/ has been) optimized for low idle power.
    Reply
  • vLsL2VnDmWjoTByaVLxb - Monday, September 08, 2014 - link

    Colors or labeling for your HPC Power Consumption graph don't seem right. Reply
  • JohanAnandtech - Monday, September 08, 2014 - link

    Fixed, thanks for pointing it out. Reply
  • cmikeh2 - Monday, September 08, 2014 - link

    In the SKU comparison table you have the E5-2690V2 listed as a 12/24 part when it is in fact a 10/20 part. Just a tiny quibble. Overall a fantastic read. Reply

Log in

Don't have an account? Sign up now