Intel Xeon E5 Version 3: Up to 18 Haswell EP Cores
by Johan De Gelas on September 8, 2014 12:30 PM EST
Website Performance: Drupal 7.21
While there are few web servers that actually need such processing behemoths, we decided to go ahead and test in this area, just for the sake of satisfying our curiosity. Most websites are based on the LAMP stack: Linux, Apache, MySQL, and PHP. Few people write HTML/PHP code from scratch these days, so we turned to running a Drupal 7.21-based site. The web server is Apache 2.4.7 and the database is MySQL 5.5.38 on top of Ubuntu 14.04 LTS.
Drupal powers massive sites like The Economist and MTV Europe and has a reputation for being a hardware resource hog. That is a price more and more developers happily pay for a shorter time to market. We tested the Drupal website with our vApus stress testing framework and increased the number of connections from 5 to 1500.
First we report the maximum throughput achievable with 95% of requests being handled faster than 100 ms. It is important to note that an individual user may still experience a response much slower than 100 ms on a given request. Also, as each page view consists of many requests, there's an increased chance that one of the "slow responses" is among them. So the average response time is a very bad indicator of user experience, and ensuring that the 95th percentile is still fast enough is a lot safer.
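To make the percentile argument concrete, here is a minimal Python sketch. It is not how vApus computes its results. The response times are made-up numbers, the helper names `percentile` and `chance_of_slow_page` are hypothetical, and the page-view estimate assumes that requests are independent, which real traffic only approximates.

```python
import statistics


def percentile(samples, pct):
    """Nearest-rank percentile for an integer pct between 1 and 100."""
    ordered = sorted(samples)
    # Integer ceiling of pct * n / 100, avoiding floating-point rounding.
    rank = max(1, (pct * len(ordered) + 99) // 100)
    return ordered[rank - 1]


def chance_of_slow_page(requests_per_page, fast_fraction=0.95):
    """Chance that at least one request in a page view is slow.

    Assumes every request independently meets the target with
    probability fast_fraction (an illustrative simplification).
    """
    return 1 - fast_fraction ** requests_per_page


# Hypothetical sample: 90 requests at 30 ms and 10 requests at 600 ms.
response_times_ms = [30] * 90 + [600] * 10

mean_ms = statistics.mean(response_times_ms)    # 87 ms, looks fine
p95_ms = percentile(response_times_ms, 95)      # 600 ms, misses 100 ms
slow_page = chance_of_slow_page(20)             # 20 requests per page view

print(f"mean: {mean_ms} ms, 95th percentile: {p95_ms} ms")
print(f"chance of a slow request in a 20-request page: {slow_page:.0%}")
```

In this invented sample the average stays under the 100 ms target while one request in ten takes 600 ms, which is exactly the situation a percentile-based limit is meant to catch.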
In the case of our Drupal testing, the new Haswell EP Xeons definitely take the lead, but at the top of the stack we don't see a lot of scaling with additional cores: the E5-2699 v3 and the E5-2695 v3 deliver nearly the same result. There are several reasons for this. The first is that the database of our current test website is too small. The second is that we still need to fine-tune the configuration of our website to scale better with such high core counts.
We'll remedy this in the future as we adapt our tuning. Right now, it seems that we get good scaling up to 24 physical cores, but beyond that our tuning probably needs more work. Nevertheless, we felt we should share this result as most website owners do not have a specialized "make it scale" engineering team like Google and Facebook. And yes, it is probably better to load balance your website over several smaller nodes.
Still, the results are quite interesting. It looks like the new Xeon v3 scales better. The Xeon E5-2690 has no trouble keeping up – thanks to its higher clock speed – with the Ivy Bridge EP Xeon, which features a higher core count. The Xeon E5-2650L v3 has a lower clock speed but is able to use its higher core count to perform better. One of the reasons might be the fact that synchronization latency has been significantly improved.
85 Comments
martinpw - Monday, September 8, 2014 - link
There is a nice tool called i7z (can google it). You need to run it as root to get the live CPU clock display.
kepstin - Monday, September 8, 2014 - link
Most Linux distributions provide a tool called "turbostat" which prints statistical summaries of real clock speeds and c state usage on Intel cpus.
kepstin - Monday, September 8, 2014 - link
Note that if turbostat is missing or too old (doesn't support your cpu), you can build it yourself pretty quick - grab the latest linux kernel source, cd to tools/power/x86/turbostat, and type 'make'. It'll build the tool in the current directory.
julianb - Monday, September 8, 2014 - link
Finally the e5-xxx v3s have arrived. I too can't wait for the Cinebench and 3DS Max benchmark results.
Any idea if now that they are out the e5-xxxx v2s will drop down in price?
Or Intel doesn't do that...
MrSpadge - Tuesday, September 9, 2014 - link
Correct, Intel does not really lower prices of older CPUs. They just gradually phase out.
tromp - Monday, September 8, 2014 - link
As an additional test of the latency of the DRAM subsystem, could you please run the "make speedup" scaling benchmark of my Cuckoo Cycle proof-of-work system at https://github.com/tromp/cuckoo ?
That will show if 72 threads (2 cpus with 18 hyperthreaded cores) suffice to saturate the DRAM subsystem with random accesses.
-John
Hulk - Monday, September 8, 2014 - link
I know this is not the workload these parts are designed for, but just for kicks I'd love to see some media encoding/video editing apps tested. Just to see what this thing can do with a well coded mainstream application. Or to see where the apps fades out core-wise.
Assimilator87 - Monday, September 8, 2014 - link
Someone benchmark F@H bigadv on these, stat!
iwod - Tuesday, September 9, 2014 - link
I am looking forward to 16 Core Native Die, 14nm Broadwell Next year, and DDR4 is matured with much better pricing.
Brutalizer - Tuesday, September 9, 2014 - link
Yawn, the new upcoming SPARC M7 cpu has 32 cores. SPARC has had 16 cores for ages. Since some generations back, the SPARC cores are able to dedicate all resources to one thread if need be. This way the SPARC core can have one very strong thread, or massive throughput (many threads). The SPARC M7 cpu is 10 billion transistors:
http://www.enterprisetech.com/2014/08/13/oracle-cr...
and it will be 3-4x faster than the current SPARC M6 (12 cores, 96 threads) which holds several world records today. The largest SPARC M7 server will have 32-sockets, 1024 cores, 64TB RAM and 8.192 threads. One SPARC M7 cpu will be as fast as an entire Sunfire 25K. :)
The largest Xeon E5 server will top out at 4-sockets probably. I think the Xeon E7 cpus top out at 8-socket servers. So, if you need massive RAM (more than 10TB) and massive performance, you need to venture into Unix server territory, such as SPARC or POWER. Only they have 32-socket servers capable of reaching the highest performance.
Of course, the SGI Altix/UV2000 servers have 10.000s of cores and 100TBs of RAM, but they are clusters, like a tiny supercomputer. Only doing HPC number crunching workloads. You will never find these large Linux clusters run SAP Enterprise workloads, there are no such SAP benchmarks, because clusters suck at non HPC workloads.
-Clusters are typically serving one user who picks which workload to run for the next days. All SGI benchmarks are HPC, not a single Enterprise benchmark exist for instance SAP or other Enterprise systems. They serve one user.
-Large SMP servers with as many as 32 sockets (or even 64-sockets!!!) are typically serving thousands of users, running Enterprise business workloads, such as SAP. They serve thousands of users.