The super-high-end of Intel’s Xeon CPU range, based on servers with as many cores and as much memory as you can throw at them, represents a good part of Intel’s business with the potential to offer large margins: some customers want the most, the best, the fastest, and are willing to pay for it. For a number of generations this has come via the Intel E7 line, consisting of two families of products designed for quad-socket servers (the E7-4800 v4) and eight-socket servers (the E7-8800 v4). The new element to this launch is the use of ‘v4’, meaning that following the launch of Broadwell-EP for 1S/2S systems a couple of months ago and Broadwell-E (high-end desktop, HEDT) two weeks ago, Intel has now filled out the v4 product line as we would typically expect. The new Xeons fall under the Broadwell-EX nomenclature (following Haswell-EX, Ivy Bridge-EX and so on), and use the Brickland platform aimed at mission-critical environments.

Intel currently runs several processor lines in the Xeon/enterprise space: from E3-1200 v5 processors offering consumer-level performance in a Xeon package, through the recently released E3-1500 v5 processors with embedded DRAM to help accelerate visual/video workflows, all the way up to the large EX platforms.

Intel Xeon Families (June 2016)

| | E3-1200 v5 | E3-1500 v5 / E3-1500M v5 | E5-1600 v4 / E5-2600 v4 | E7-4800 v4 | E7-8800 v4 |
|---|---|---|---|---|---|
| Core Family | Skylake | Skylake | Broadwell | Broadwell | Broadwell |
| Core Count | 2 to 4 | 2 to 4 | 4 to 22 | 8 to 16 | 4 to 24 |
| Integrated Graphics | Few, HD 520 | Yes, Iris Pro | No | No | No |
| DRAM Channels | 2 | 2 | 4 | 4 | 4 |
| Max DRAM Support (per CPU) | 64 GB | 64 GB | 1536 GB | 3072 GB | 3072 GB |
| DMI/QPI | DMI 3.0 | DMI 3.0 | 2600: 1x QPI | 3 QPI | 3 QPI |
| Multi-Socket Support | No | No | 2600: 1S or 2S | 1S, 2S or 4S | Up to 8S |
| PCIe Lanes | 16 | 16 | 40 | 32 | 32 |
| Cost | $213 to $612 | $396 to $1207 | $294 to $4115 | $1223 to $3003 | $4061 to $7174 |
| Suited For | Entry Workstations | QuickSync, Memory Compute | High-End Workstation | Many-Core Server | World Domination |

As referred to in Johan’s very detailed review of the dual-socket E5-2600 v4 platform, Broadwell Xeon processors come in three die sizes: a low core count (LCC) die featuring ten physical cores at 246.24 mm² for ~3.2 billion transistors, a medium core count (MCC) die with fifteen physical cores at 306.18 mm² for ~4.7 billion transistors, and a high core count (HCC) die with 24 physical cores at 456.12 mm² for ~7.2 billion transistors. The MCC and HCC arrangements use dual memory controllers to address four memory channels, whereas the LCC die uses a single memory controller, which results in a slight performance hit compared to the other two. Most of the new E7 v4 processors, however, will use the HCC die.
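As a back-of-the-envelope check, the core counts, die areas and transistor counts quoted above work out to a fairly consistent transistor density across the three dies. A small sketch, using only the figures from the paragraph:

```python
# Rough transistor-density comparison of the three Broadwell-EX die options,
# using the core counts, die areas and transistor counts quoted in the text.
dies = {
    #       cores  area (mm^2)  transistors
    "LCC": (10,    246.24,      3.2e9),
    "MCC": (15,    306.18,      4.7e9),
    "HCC": (24,    456.12,      7.2e9),
}

for name, (cores, area, xtors) in dies.items():
    density = xtors / area / 1e6  # million transistors per mm^2
    print(f"{name}: {cores} cores, {area} mm^2, ~{density:.1f} MTr/mm^2")
```

The LCC die comes out a little less dense (~13.0 MTr/mm² versus ~15.4–15.8 for MCC/HCC), which is expected given that uncore and I/O take up a larger share of the smaller die.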

Intel has formally announced eleven processors between the 4S and 8S families, varying in core count, frequency, power consumption and L3 cache. The design of the HCC die is such that a processor can have certain cores fused off while the remaining cores retain access to the full L3 cache, providing some SKUs with more ‘total cache per core’, such as the E7-8893 v4, which will be a four-core design with 60 MB of L3 cache between them. These are classified by Intel as 'segment optimized', for applications that require faster cache rather than more cores. This is arguably a stone's throw away from an eDRAM SKU with 64 MB of eDRAM, but in this case Intel is still going with a large (and faster than eDRAM) L3 cache.
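The ‘total cache per core’ effect is easiest to see with a quick calculation over a few of the announced SKUs (core and cache figures from the spec table in this article):

```python
# 'Total L3 cache per core' for three of the announced SKUs, illustrating
# Intel's segment-optimized parts: same 60 MB of L3, fewer active cores.
skus = {
    # name: (cores, L3 cache in MB)
    "E7-8890 v4": (24, 60),
    "E7-8891 v4": (10, 60),
    "E7-8893 v4": (4, 60),
}

for name, (cores, l3_mb) in skus.items():
    print(f"{name}: {l3_mb / cores:.1f} MB of L3 per core")
```

The four-core E7-8893 v4 ends up with 15 MB of L3 per core, six times the 2.5 MB per core of the flagship E7-8890 v4.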

Intel E7-8800 v4 Xeon Family

| | E7-8860 v4 | E7-8867 v4 | E7-8870 v4 | E7-8880 v4 | E7-8890 v4 | E7-8891 v4 | E7-8893 v4 |
|---|---|---|---|---|---|---|---|
| TDP | 140 W | 165 W | 140 W | 150 W | 165 W | 165 W | 140 W |
| Cores / Threads | 18 / 36 | 18 / 36 | 20 / 40 | 22 / 44 | 24 / 48 | 10 / 20 | 4 / 8 |
| Base (MHz) | 2200 | 2400 | 2100 | 2200 | 2200 | 2800 | 3200 |
| Turbo (MHz) | 3200 | 3300 | 3000 | 3300 | 3400 | 3500 | 3500 |
| L3 Cache | 45 MB | 45 MB | 50 MB | 55 MB | 60 MB | 60 MB | 60 MB |
| QPI (GT/s) | 3 x 9.6 | 3 x 9.6 | 3 x 9.6 | 3 x 9.6 | 3 x 9.6 | 3 x 9.6 | 3 x 9.6 |
| DRAM Support | DDR4-1866 / DDR3-1600 (all) | | | | | | |
| PCIe Support | 3.0 x32 | 3.0 x32 | 3.0 x32 | 3.0 x32 | 3.0 x32 | 3.0 x32 | 3.0 x32 |
| Price | $4061 | $4672 | $4762 | $5895 | $7174 | $6841 | $6841 |

The flagship model is the E7-8890 v4, a 165 W processor supporting the full 24 cores in the HCC die with Hyper-Threading, offering 48 threads per CPU. At a base frequency of 2.2 GHz, this processor can be used in an eight-socket glueless configuration (an 8S implementation means 192 cores/384 threads) or in up to 128 sockets using third-party node controllers. In the eight-socket configuration, a system can support up to 24 TB of DDR4 LRDIMMs (three modules per channel, 12 modules per socket, 256 GB per module). All the CPUs listed support both DDR4 and DDR3 via the dual memory controller configuration.
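The 24 TB headline figure follows directly from the numbers in the paragraph above:

```python
# Maximum memory for an eight-socket E7 v4 system: 12 LRDIMM slots per
# socket (four channels, three modules per channel via the memory buffers)
# at 256 GB per module.
modules_per_socket = 12
module_gb = 256
sockets = 8

per_socket_gb = modules_per_socket * module_gb  # 3072 GB per CPU
total_tb = sockets * per_socket_gb / 1024       # system-wide capacity in TB
print(f"{per_socket_gb} GB per socket -> {total_tb:.0f} TB in an 8S system")
```

The 3072 GB per-socket result also matches the ‘Max DRAM Support (per CPU)’ figure in the family comparison table earlier in the article.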

Intel E7-4800 v4 Xeon Family

| | E7-4809 v4 | E7-4820 v4 | E7-4830 v4 | E7-4850 v4 |
|---|---|---|---|---|
| TDP | 115 W | 115 W | 115 W | 115 W |
| Cores / Threads | 8 / 16 | 10 / 20 | 14 / 28 | 16 / 32 |
| Base (MHz) | 2100 | 2000 | 2000 | 2100 |
| Turbo (MHz) | - | - | 2800 | 2800 |
| L3 Cache | 20 MB | 25 MB | 35 MB | 40 MB |
| QPI (GT/s) | 3 x 6.4 | 3 x 6.4 | 3 x 8.0 | 3 x 8.0 |
| DRAM Support | DDR4-1866 / DDR3-1600 (all) | | | |
| PCIe Support | 3.0 x32 | 3.0 x32 | 3.0 x32 | 3.0 x32 |
| Price | $1223 | $1502 | $2170 | $3003 |

The E7-4800 v4 line, by comparison, uses reduced QPI speeds (6.4 or 8.0 gigatransfers per second, compared to 9.6 gigatransfers per second on the E7-8800 v4), and some members of the family have no Turbo frequencies. These non-Turbo processors run at their listed frequency regardless of loading.
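To put those QPI transfer rates into bandwidth terms: a QPI link carries 16 data bits (2 bytes) per transfer in each direction, so the GT/s figures convert to per-link bandwidth as follows (a simple sketch, assuming the standard full-width QPI link):

```python
# Convert QPI transfer rates (GT/s) into per-link, per-direction bandwidth,
# assuming a full-width QPI link moving 2 bytes of data per transfer.
BYTES_PER_TRANSFER = 2

for gts in (6.4, 8.0, 9.6):
    gbps = gts * BYTES_PER_TRANSFER  # GB/s per direction per link
    print(f"{gts} GT/s -> {gbps:.1f} GB/s per direction")
```

So the slowest E7-4800 v4 links move 12.8 GB/s per direction per link against 19.2 GB/s on the E7-8800 v4, a difference that matters most for cross-socket memory traffic in larger topologies.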

The new E7 v4 carries over all of the new features that Johan covered in our E5 v4 review, including:

  • VM cache allocation (the ability for a supported hypervisor to mark a VM as high priority or partition cache as needed for QoS),
  • New memory bandwidth monitoring tools,
  • New frequency/power management tools to reduce frequency adjustment latency (see slide 29),
  • Transactional extension support (TSX, was a feature in Haswell but disabled due to a fundamental hardware bug),
  • A new non-deterministic random bit generator instruction for seed generation,
  • Haswell to Broadwell generational improvements (decreased divider latency, 40% faster vector floating point multiplier, hardware assist for vector gather, cryptography focused instructions),
  • AVX Turbo modes affect single cores rather than the whole processor,
  • Entry/Exit latency for virtualization environments reduced to ~400 cycles from ~500 cycles.

There are a couple of features of the HCC-based processors that may be more relevant for 4S systems, such as an upgraded version of Cluster on Die. Due to the configuration of the die and the dual-ring design, if a core needs data in an L3 cache slice on the other side of the die, the latency is higher than if the data were in a slice closer to the requesting core. To alleviate this, Haswell E5/E7 Xeons separated each die into two clusters such that each part would be seen by the BIOS as its own non-uniform memory domain. This allows the home agent/system agent to increase the likelihood that memory requests are served by data closer to the core that needs it. In Broadwell, this feature is brought up from dual-processor systems to four-processor systems, and should reduce last-level cache latency and improve performance for larger systems.
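As a toy model of what Cluster on Die exposes to software (this is an illustrative sketch with invented names, not Intel’s actual home-agent routing logic): the 24-core die is split into two clusters, each advertised to the OS as its own NUMA domain, so the OS can keep a core’s allocations on the memory controller nearest to it.

```python
# Toy model of Cluster on Die: the 24-core HCC die is split into two
# clusters, each exposed to the OS as a separate NUMA domain, so memory
# can be allocated on the home agent closest to the requesting core.
# Illustrative only; names and the simple split are assumptions.
CORES_PER_DIE = 24
CLUSTERS = 2

def home_cluster(core_id: int) -> int:
    """Map a core to the cluster (NUMA domain) whose home agent is nearest."""
    return core_id // (CORES_PER_DIE // CLUSTERS)

# Cores 0-11 prefer cluster 0's memory controller, cores 12-23 cluster 1's.
print(home_cluster(3), home_cluster(20))
```

On a real system the effect is visible as extra NUMA nodes per socket (e.g. in `numactl --hardware` on Linux) when COD is enabled in the BIOS.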

The new E7 v4 processors use the same socket as the previous-generation E7 v3 processors, and with a BIOS update the new parts are a drop-in upgrade for the older platform. The usual Intel partners (Supermicro, HP Enterprise, Dell, Cray) are expected to offer systems based on the new processors, and we expect pricing in line with the previous generation with a typical generational increase. I believe Johan is currently in the process of testing a few parts, and I’m looking forward to the review.

26 Comments

  • ZeDestructor - Monday, June 13, 2016 - link

    I think that's a typo, since it's a more or less drop-in upgrade to the Brickland platform that only has 4-channel memory. Besides, ark also lists it as having 4 channels.
  • Ian Cutress - Monday, June 13, 2016 - link

    It's technically four, but by using memory expanders it effectively splits each memory channel into two, allowing for 3DPC.

    http://www.anandtech.com/show/9193/the-xeon-e78800...
  • ZeDestructor - Tuesday, June 14, 2016 - link

    Ahh, very nice.

    How I wish I had the cash and power to have a Brickland machine for my homeserver... would do wonders for a silly ZFS host...
  • Eden-K121D - Monday, June 13, 2016 - link

    Well, I heard a rumour about a Zen "Naples" server processor having 32 cores, 8-channel DDR4 memory and 128 PCIe gen 3 lanes.
  • Meteor2 - Monday, June 13, 2016 - link

    So... What are 8S servers used for? VM farms? When is it effective to buy one of these rather than use several smaller, cheaper servers?
  • FunBunny2 - Monday, June 13, 2016 - link

    -- When is it effective to buy one of these rather than use several smaller, cheaper servers?

    any embarrassingly parallel problem. OLTP systems are the archetype.
  • mdw9604 - Sunday, June 19, 2016 - link

    I am not embarrassed that my problems are parallel. It's the perpendicular ones that I tend to cover up.
  • Kevin G - Monday, June 13, 2016 - link

    Several smaller, cheaper servers introduce networking overhead and, in most cases, centralized storage. A single system image with equivalent processing power ends up being faster due to the removal of this overhead, sometimes by a surprising amount.

    The other thing is that these systems support a lot of memory per socket: 24 TB using the largest DIMMs available today in an eight-socket configuration. Many production datasets can fit into that amount. Intel is offering quad-core chips with full support for this capacity, which is interesting from a licensing-cost standpoint.
  • Meteor2 - Tuesday, June 14, 2016 - link

    Big in-memory databases are interesting, though I understand it takes something like 15 minutes to load them into memory. Plus NVMe-oF is blurring local and remote memory.

    I guess the problem for these machines though is there aren't many embarrassingly parallel problems out there. We run HPC workloads where I am, and they're best suited to just so many E5s on a very fast network. The jobs are many times too big to fit on one of these.
  • mapesdhs - Tuesday, June 14, 2016 - link

    On the contrary, there are many relevant workloads, from GIS to medical and defense imaging. Just look into the history of the customer base at SGI, those who bought their Origin systems, etc. Hence the existence of the modern UV series (256 sockets atm). Customers were already dealing with multi-GB datasets 20 years ago, and SGI was the first to design something that could load such files in mere seconds (Group Station for Defense Imaging). I'm not sure about the modern UV systems, but the bisection bandwidth of the last-gen Origin was 512 GB/sec (or it might be 1 TB/sec if they made the usual 2X larger system for selected customers), and the tech has moved on a lot since then, with new features such as hw MPI offload, etc.

    But yes, other loads don't scale well; it all depends on the task. Hence the existence of cluster products as well, and of course the ability to partition a UV into multiple subunits, each of which can be optimised to match the task scalability, while also allowing fast intercommunication between them, as well as shared access to data, etc. Meanwhile, work goes on to improve scalability methods, e.g. an admin at the Cosmos centre told me they're working hard to improve the scaling of various cosmological simulation codes to exploit up to 512 CPUs. In other fields, however, GPU acceleration has taken over, but often that needs big data access as well. It's a mixed bag as usual.

    Speaking of which, Ian, re the usual Intel partners, you forgot to mention SGI. There's no doubt they'll be using these new CPUs in their UV range.

    One thing I don't get concerning the E7-8893 v4: if it only has 4 cores, why aren't the max Turbo levels much higher? Indeed, the base clock could surely be a lot higher as well.
