Put me in front of a dual processor motherboard and a pair of eight core Xeons with HyperThreading and I will squeal with delight.  Then I will take it to the cleaners with multithreaded testing to actually see how good it is.  Watching a score go up or the time taken to do a test going down is part of the parcel as a product reviewer, so watching the score go higher or the time taken going down is almost as good as product innovation.

Back in research, two things can drive the system: publication of results and future relevance for those results.  Understanding the system to get results is priority number one, and then being able to obtain results could be priority number two.  In theoretical fields, where a set of simulations can take from seconds to months and even years, having the hardware to deal with many simulations (or threads within a simulation) and the single threaded speed means more results per unit time.  Extremely useful when you get a weeks worth of results back and you missed a negative sign in the code (happens more often than you think).  Some research groups, with well-developed code, take it to clusters.  Modern takes on the code point towards GPUs, if the algorithm allows, but that is not always the case.

So when it comes to my perspective on the GA-7PESH1, I unfortunately do have not much of a comparison to point at.  As an overclocking enthusiast, I would have loved to see some overclock, but the only thing a Sandy Bridge-E processor with an overclock will do is increase single threaded speed – the overall multithreaded performance on most benchmarks is still below an i7-3960X at 5 GHz (from personal testing).  For simulation performance, it really depends on the simulation itself if it will blaze though the code while using ~410 watts.

Having an onboard 2D chip negates needing a dedicated display GPU, and the network interfaces allow users to remotely check up on system temperatures and fan speeds to reduce overheating or lockups due to thermals.  There are plenty of connections on board for mini-SAS cabling and devices, combined with an LSI SAS chip if RAID is a priority.  The big plus point over consumer oriented double processor boards is the DIMM slot count, with the GA-7PESH supporting up to 512 GB.

Compared to the consumer oriented dual processor motherboards available, one can criticize the GA-7PESH1 for not being forthcoming in terms of functionality.  I would have assumed that being a B2B product that it would be highly optimized for efficiency and a well-developed platform, but the lack of discussion and communication between the server team and the mainstream motherboard team is a missed opportunity when it comes to components and user experience.

This motherboard has been reviewed in a few other places around the internet with different foci with respect to the reviewer experience.  One of the main criticisms was the lack of availability – there is no Newegg listing and good luck finding it on eBay or elsewhere.  I send Gigabyte an email, to which I got the following response:

  • Regarding the availability in the US, so far all our server products are available through our local branch, located at:

    17358 Railroad St.
    City of Industry
    CA 91748
    +1-626-854-9338

As a result of being a B2B product, pricing for the GA-7PESH1 (or the GA-7PESH2, its brother with a 10GbE port) is dependent on individual requirements and bulk purchasing.  In contrast, the ASUS Z9PE-D8 WS is $580, and the EVGA SR-X is $650.

Review References for Simulations:

[1] Stripping Voltammetry at Microdisk Electrode Arrays: Theory, IJ Cutress, RG Compton, Electroanalysis, 21, (2009), 2617-2625.
[2] Theory of square, rectangular, and microband electrodes through explicit GPU simulation, IJ Cutress, RG Compton, Journal of Electroanalytical Chemistry, 645, (2010), 159-166.
[3] Using graphics processors to facilitate explicit digital electrochemical simulation: Theory of elliptical disc electrodes, IJ Cutress, RG Compton, Journal of Electroanalytical Chemistry, 643, (2010), 102-109.
[4] Electrochemical random-walk theory Probing voltammetry with small numbers of molecules: Stochastic versus statistical (Fickian) diffusion, IJ Cutress, EJF Dickinson, RG Compton, Journal of Electroanlytical Chemistry, 655, (2011), 1-8.
[5] How many molecules are required to measure a cyclic voltammogram? IJ Cutress, RG Compton, Chemical Physics Letters, 508, (2011), 306-313.
[6] Nanoparticle-electrode collision processes: Investigating the contact time required for the diffusion-controlled monolayer underpotential deposition on impacting nanoparticles, IJ Cutress, NV Rees, YG Zhou, RG Compton, Chemical Physics Letters, 514, (2011), 58-61.
[7] D. Britz, Digital Simulation Electrochemistry, in: D. Britz (Ed.), Springer, New York, 2005, p. 187.
[8] W.H. Press, S.A. Teukolsky, W.T. Vetterling, B.P. Flannery, Numerical Recipes: The Art of Scientific Computing, Cambridge University Press, 2007.
[9] D.E. Knuth, in: D.E. Knuth (Ed.), Seminumerical Algorithms, Addison-Wesley, 1981, pp. 130–131
[10] K.Gregory, A.Miller, C++ AMP: Accelerated Massive Parallelism with Microsoft Visual C++, 2012, link.
+ others contained within the references above.

System Benchmarks
Comments Locked

64 Comments

View All Comments

  • Hakon - Saturday, January 5, 2013 - link

    Thank you for the detailed answer. I very much appreciate your article and hope to see more stuff like this on Anandtech.

    What I meant regarding to NUMA is the following. When you have a dual socket Xeon you have two memory controllers. The first time you 'touch' a memory location it is assigned to the memory controller of the CPU that runs the current thread. This assignment is in general permanent and all further memory read/writes to that location will be served by that memory controller.

    If you first-touch (e.g. initialize the array to zero) using one thread, then the whole array is assigned to one of the two memory controllers. When you then run the multi-threaded code on that array one memory controller is idle while the other is oversubscribed since it has to serve both CPUs.

    In contrast, if you first-touch your array in an OpenMP loop and use the same access pattern as in the algorithm, you will benefit from both memory controllers later on. In this case your large array is correctly 'distributed' over both memory controllers.

    This kind of memory layout optimization becomes extremely important when you deal with quad socket Opterons. You then have eight memory controllers. A NUMA aware code is therefore up to eight times as fast since it utilizes all memory controllers.
  • toyotabedzrock - Saturday, January 5, 2013 - link

    You should go ask the people on the assembly boards for help with making your code faster.

    They are very friendly compared to a Linux kernel devs, I think they just enjoy the acknowledgement that they still exist and are useful.
  • snajpa - Saturday, January 5, 2013 - link

    Blame the scheduler. Neither Windows nor linux can effectively handle larger NUMA systems. It randomly moves the process across the physical hardware.
  • psyq321 - Sunday, January 6, 2013 - link

    Hmm, this is definitely not true at least for Windows Server 2008 R2 / Windows 7, and I am sure it holds true for some versions of Linux (I am not a Linux expert).

    Windows Server 2008 R2 / Windows 7 scheduler will try to match the memory allocations (even if they are not tagged for a specific NUMA node) with the NUMA node the process/thread resides on, and they will not move a thread to a foreign NUMA node unless if that has been explicitly requested by the application (by setting the thread affinity)

    Of course, without explicit NUMA node tagging when doing allocations, application code is the main culprit for not respecting the NUMA layout (e.g. creating bunch of threads, allocating memory from one of them - and then pinning the threads to different CPUs - you will have lots of LLC requests from remote DRAM because memory was a-priori allocated on one node).

    For this - some sane coding helps a lot, here:

    http://www.dimkovic.com/node/15

    I describe how I extracted more than double performance by careful memory allocation (NUMA-aware) - please note that neither Windows nor Linux scheduler is able to cope with code which is not written to be NUMA aware and it is using large number of threads that are supposed to run on all CPUs.. Simply put, application writer will have to manage memory allocation and usage in the way so that there are as little remote DRAM requests as possible.
  • snajpa - Sunday, January 6, 2013 - link

    About Windows scheduler - I only worked with Windows XP, now I don't have any reason to work with Win anymore, so what you say probably is really true. As for the linux versions - well, long story short, CFS sucks and everyone knows it - this is particularly noticeable if you have fully virtualized VMs which appear as one single process at the host system - the process is randomly swapped between CPU cores and even CPU dies.... sad story. That's why people have to pin their CPUs to their tasks manually.
  • psyq321 - Sunday, January 6, 2013 - link

    Ah, XP - that explains it. True, XP did not care about NUMA at all.

    Windows Server 2008 / Vista introduced NUMA-aware memory allocations, and changed their CPU scheduler so it does not move the thread across NUMA nodes. They will also try to allocate the memory from the thread's own NUMA node when legacy VirtualAlloc etc. APIs are used.

    Windows Server 2008 R2 / Windows 7 introduced the concept of CPU groups - allowing more than 64 CPUs. This does require some adaptation of the application, as old threading APIs only work with 64-bit affinity bitmask which only allowed recognizing 64 CPUs. Now, there is a new set of APIs that work with GROUP_AFFINITY structure, allowing control of CPU groups, too. However, this needs explicit change of the legacy process/threading APIs to the new ones.

    Furthermore, none of the above can replace some manual intervention*- while Windows scheduler will, indeed, respect NUMA node boundaries and not try to mess around with moving threads across them - it still does not know what the underlying algorithm wants to do.

    * There is no need to set the thread affinity to one specific CPU anymore - this prevents running the thread on any other CPU completely. Instead, there is an API called SetThreadIdealProcessor(Ex) which signals Windows scheduler that thread >should< run on that particular CPU - but, under certain circumstances the scheduler can move the thread somewhere else - if the CPU is completely taken away by some other thread/process. Scheduler will try to move the thread as close as possible - to the next core in the socket, for example - or to the next core in the group (group is always contained within a NUMA node).

    You can, however, absolutely forbid Windows scheduler from passing the thread to another NUMA node under any circumstances by simply getting the said NUMA node affinity mask (GetNumaNodeProcessorMask(Ex)) and setting this affinity as a thread affinity. This + setting the "ideal" processor still gives Windows scheduler some headroom to move the thread to another core if it is found to be better in a given moment, but it will not even attempt to cross the NUMA boundary in any case whatsoever.
  • lmcd - Monday, January 7, 2013 - link

    While I haven't personally researched them, there are tons of other schedulers that have been written for Linux and I'm certain *at least* one of them is more fitting to this line of work. I've heard of alternatives like BFS and the Linux kernel is so widely used I'm sure there's a gem out there for this application.
  • toyotabedzrock - Saturday, January 5, 2013 - link

    Have you ever tried the Intel Math Kernel Library? It might speed up some of the equations. It also hands off work to the Intel MIC card if it thinks it will speed it up.

    http://software.intel.com/en-us/intel-mkl/
  • KAlmquist - Saturday, January 5, 2013 - link

    The GA-7PESH1 motherboard is $855, and the CPU's are $2020 each, which adds up to $4895. On tasks which don't parallelize well, you can get similar performance from the i7-3770K, which costs an order of magnitude less. (Prices: i7-3770K $320, ASRock Z77 Extreme6 motherboard $152, total from motherboard and CPU $472.) On tasks which parallelize well enough that they can be run on a GPU, the system with the GA-7PESH1 will beat the i7-3770K, but will be crushed by a midrange GPU. So the price/performance of this system is pretty bad unless you throw just the right workload at it.

    The motherboard price from super-laptop-parts dot com, and the other prices are from a major online retailer that I won't name in order to get around the spam filter.
  • Death666Angel - Saturday, January 5, 2013 - link

    So, your 3770K has ECC memory or VT-D, TXT etc.?

Log in

Don't have an account? Sign up now