Using a Mobile Architecture Inside a 145W Server Chip

About 15 months after the appearance of the Haswell core in desktop products (June 2013), the "optimized-for-mobile" Haswell architecture is now being adopted into Intel server products.

Left to right: LGA1366 (Xeon 5600), LGA2011 (Xeon E5-2600v1/v2) and LGA2011v3 (E5-2600v3) socket. 

Haswell is Intel's fourth tock, a new architecture on the same successful 22nm process technology (the famous P1270 process) that was used for Ivy Bridge EP, also known as the Xeon E5-2600 v2. Anand discussed the new Haswell architecture in great detail back in 2012, but as a refresher, let's quickly go over the improvements that the Haswell core brings.

Very little has changed in the front-end of the core compared to Ivy Bridge, with the exception of the usual branch prediction improvements and enlarged TLBs. As you might recall, it is the back-end, the execution part, that is largely improved in the Haswell architecture:

  • Larger OoO Window (192 vs 168 entries)
  • Deeper Load and Store buffers (load: 72 vs 64 entries, store: 42 vs 36 entries)
  • Larger scheduler (60 vs 54)
  • The big splash: 8 instead of 6 execution ports: more execution resources for store address calculation, branches and integer processing.

All in all, Intel calculated that integer processing at the same clock speed should be about 10% better than on Ivy Bridge (Xeon E5-2600 v2, launched September 2013), 15-16% better than on Sandy Bridge (Xeon E5-2600, March 2012), and 27% better than on Nehalem (Xeon 5500, March 2009).
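
As a rough sanity check on those figures, the per-generation step implied by Intel's cumulative numbers can be backed out with simple division. This is only a sketch: it takes 15.5% as the midpoint of the quoted 15-16% range (an assumption) and treats the gains as multiplicative. The names `HASWELL_GAIN_OVER` and `implied_step` are made up for illustration.

```python
# Intel's quoted same-clock integer gains of Haswell over older cores.
# 0.155 is an assumed midpoint of the "15-16%" range quoted in the text.
HASWELL_GAIN_OVER = {
    "Ivy Bridge": 0.10,
    "Sandy Bridge": 0.155,
    "Nehalem": 0.27,
}


def implied_step(newer: str, older: str) -> float:
    """Gain of `newer` over `older` implied by the Haswell-relative figures."""
    return (1 + HASWELL_GAIN_OVER[older]) / (1 + HASWELL_GAIN_OVER[newer]) - 1


if __name__ == "__main__":
    for newer, older in [("Ivy Bridge", "Sandy Bridge"), ("Sandy Bridge", "Nehalem")]:
        print(f"{newer} over {older}: {implied_step(newer, older):.1%}")
```

Under these assumptions, Ivy Bridge comes out roughly 5% ahead of Sandy Bridge and Sandy Bridge roughly 10% ahead of Nehalem at the same clock, which makes Haswell's 10% step look like one of the larger ones.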

Even better performance improvements can be achieved by recompiling software and using the AVX2 SIMD instructions. The original AVX ISA extension was mostly about speeding up floating point intensive workloads, but AVX2 makes the SIMD integer instructions capable of working with 256-bit registers.
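
To put the wider integer registers in perspective, the sketch below counts how many integers of each width fit in a 128-bit register (the widest that SIMD integer instructions could use before AVX2) versus a 256-bit one. The `lanes` helper is a hypothetical name for illustration, and the count says nothing about real-world speedups, which depend on the workload.

```python
def lanes(register_bits: int, element_bits: int) -> int:
    """Number of elements of a given width that fit in one SIMD register."""
    if register_bits % element_bits:
        raise ValueError("element width must divide register width")
    return register_bits // element_bits


if __name__ == "__main__":
    for element_bits in (8, 16, 32, 64):
        narrow = lanes(128, element_bits)  # 128-bit XMM register
        wide = lanes(256, element_bits)    # 256-bit YMM register (AVX2 integer ops)
        print(f"{element_bits:>2}-bit integers: {narrow:>2} per XMM, {wide:>2} per YMM")
```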

Unfortunately, in a virtualized environment, these ISA extensions are sometimes more curse than blessing. Running code that uses AVX, SSE, or other ISA extensions can prevent the use of the best virtualization features such as high availability, load balancing, and live migration (vMotion). Therefore, administrators will typically force CPUs to "keep quiet" about their newest ISA extensions (VMware EVC). So if you want to integrate a Haswell EP server into an existing Sandy Bridge EP server cluster, none of the new features that were absent from Sandy Bridge EP, including AVX2, will be available. The result is that in virtualized clusters, the newest ISA extensions are rarely used.
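
Software usually decides whether to take an AVX2 code path by checking the feature flags the CPU advertises, which is why a masked guest simply falls back to older code. The sketch below shows that check against Linux `/proc/cpuinfo`-style text. The flag names `avx2`, `fma`, `bmi1` and `bmi2` are the ones Linux reports; the two sample flag lines are made up and abbreviated for illustration, and `missing_haswell_flags` is a hypothetical helper.

```python
HASWELL_FLAGS = {"avx2", "fma", "bmi1", "bmi2"}


def missing_haswell_flags(cpuinfo_text: str) -> set:
    """Return the Haswell-era flags NOT advertised in /proc/cpuinfo-style text."""
    for line in cpuinfo_text.splitlines():
        key, _, value = line.partition(":")
        if key.strip() == "flags":
            return HASWELL_FLAGS - set(value.split())
    # No "flags" line found: treat every flag as unavailable.
    return set(HASWELL_FLAGS)


# Made-up, abbreviated flag lines for illustration only.
NATIVE = "flags : fpu sse2 avx avx2 fma bmi1 bmi2 aes"
MASKED = "flags : fpu sse2 avx aes"  # e.g. a guest held to a Sandy Bridge baseline

if __name__ == "__main__":
    print("native:", sorted(missing_haswell_flags(NATIVE)))
    print("masked:", sorted(missing_haswell_flags(MASKED)))
```

On a real Linux host or guest, the same function could be fed the contents of `/proc/cpuinfo` to see which of these extensions the (virtual) CPU actually exposes.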

Instead, AVX2 code will typically run on a "native" OS. The best known use of AVX2 code is inside video encoders. However, the technology might still prove to be more useful to enterprises that don't work with pixels but with business data. Intel has demonstrated that the AVX2 instructions can also be used for accelerating the compression of data inside in-memory databases (SAP HANA, Microsoft Hekaton), so the integer flavor of AVX2 might become important for fast and massive data mining applications.

Last but not least, the new bit field manipulation instructions and the use of 256-bit registers can speed up quite a few cryptographic algorithms. Large websites will probably be the datacenter application that benefits most quickly from AVX2. Simply using the right libraries might speed up RSA-2048 (opening a secure connection), SHA-256 (hashing), and AES-GCM. We will discuss this in more detail in our performance review.

Floating point

Floating point code should benefit too, as Intel has finally included Fused Multiply Add (FMA) instructions. Peak FLOP performance is doubled once again. This should benefit a whole range of HPC applications, which also tend to be recompiled much more quickly than traditional server applications. The L1 and L2 cache bandwidth has also been doubled to better cope with the needs of AVX2 instructions.
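
The "doubled once again" claim can be illustrated by counting lanes and operations per cycle. The per-core figures in this sketch are assumptions, not taken from the article: one 128-bit add plus one 128-bit multiply per cycle for Nehalem, the same pair of units at 256 bits for Sandy Bridge and Ivy Bridge, and two 256-bit FMA units (each counting as a multiply plus an add) for Haswell. Real code rarely sustains these peaks.

```python
# (vector width in bits, FP units issuing per cycle, FLOPs per lane per unit)
# These per-core figures are assumptions for illustration, not from the article.
CORES = {
    "Nehalem (SSE)": (128, 2, 1),           # one add unit + one multiply unit
    "Sandy/Ivy Bridge (AVX)": (256, 2, 1),  # same pair of units, twice as wide
    "Haswell (AVX2 + FMA)": (256, 2, 2),    # two FMA units, multiply + add each
}


def peak_flops_per_cycle(core: str, element_bits: int) -> int:
    """Theoretical peak FLOPs per cycle per core for a given element width."""
    vector_bits, units, flops_per_lane = CORES[core]
    lanes = vector_bits // element_bits
    return lanes * units * flops_per_lane


if __name__ == "__main__":
    for core in CORES:
        sp = peak_flops_per_cycle(core, 32)  # single precision
        dp = peak_flops_per_cycle(core, 64)  # double precision
        print(f"{core}: {sp} SP / {dp} DP FLOPs per cycle per core")
```

With these assumptions, each step (SSE to AVX, then AVX to FMA) doubles the theoretical per-core peak, which is also why the cache bandwidth had to grow to keep the wider units fed.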


85 Comments

  • MorinMoss - Friday, August 9, 2019 - link

    Hello from 2019.
    AMD has a LOT of ground to make up but it's a new world and a new race
    https://www.anandtech.com/show/14605/the-and-ryzen...
  • Kevin G - Monday, September 8, 2014 - link

    As an owner of a dual Opteron 6376 system, I shudder at how far behind that platform is. Then I look down and see that I have both of my kidneys as I didn't need to sell one for a pair of Xeons so I don't feel so bad. For the price of one E5-2660v3 I was able to pick up two Opteron 6376's.
  • wallysb01 - Monday, September 8, 2014 - link

    But the rest of the system cost is about the same. So you get 1/2 the performance for a 10% discount. YEPPY!
  • Kevin G - Monday, September 8, 2014 - link

    Nope. Build price after all the upgrades over the course of two years is somewhere around $3600 USD. The two Opterons accounted for a bit more than a third of that price. Not bad for 32 cores and 128 GB of memory. Even with Haswell-E being twice as fast, I'd have to spend nearly twice as much (CPUs cost twice as much, as does DDR4 compared to when I bought my DDR3 memory). To put it into perspective, a single Xeon E5 2699v3 might be faster than my build but I was able to build an entire system for less than the price of Intel's flagship server CPU.

    I will say something odd - component prices have increased since I purchased parts. RAM prices have gone up by 50% and the motherboard I use has seemingly increased in price by $100 due to scarcity. Enthusiast video card prices have also gotten crazy over the past couple of years so a high end video card is $100 more for top of the line in the consumer space.
  • wallysb01 - Tuesday, September 9, 2014 - link

    Going to the E5 2699 isn’t needed. A pair of 2660 v3s is probably going to be nearly 2x as fast as the 6376, especially for floating point where your 32 cores are more like 16 cores or for jobs that can’t use very many threads. True, a pair of 2660s will be twice as expensive. On a total system it would add about $1.5K. We’ll have to wait for the workstation slanted view, but for an extra $1.5K, you’d probably have a workstation that’s much better at most tasks.
  • Kevin G - Friday, September 12, 2014 - link

    Actually if you're aiming to double the performance of a dual Opteron 6376, two E5-2695v3's look to be a good pick for that target according to this review. A pair of those will set you back $4848 which is more than what my complete system build cost.

    Processors are only one component. So while a dual Xeon E5-2695v3 system would be twice as fast, total system cost is also approaching double due to memory and motherboard pricing differences.
  • Kahenraz - Monday, September 8, 2014 - link

    I'm running a 6376 server as well and, although I too yearn for improved single-threaded performance, I could actually afford to own this one. As delicious as these Intel processors are, they are not priced for us mere mortals.

    From a price/performance standpoint, I would still build another Opteron server unless I knew that single-threaded performance was critical.
  • JDG1980 - Tuesday, September 9, 2014 - link

    The E5-2630 v3 is cheaper than the Opteron 6376 and I would be very surprised if it didn't offer better performance.
  • Kahenraz - Tuesday, September 9, 2014 - link

    6376s can be had very cheaply on the second-hand market, especially bundled with a motherboard. Additionally, the E5-2630 v3 requires both a premium on the board and DDR4 memory.

    I'd wager you could still build an Opteron 6376 system for half or less.
  • Kevin G - Tuesday, September 9, 2014 - link

    It'd only be fair to go with the second hand market for the E5-2630v3's but being new means they don't exist. :)

    Still going by new prices, an Opteron 6376 will be cheaper by roughly 33% from what I can tell. You're correct that the new Xeons carry premium pricing on motherboards and DDR4 memory.
