Put me in front of a dual processor motherboard and a pair of eight core Xeons with HyperThreading and I will squeal with delight.  Then I will take it to the cleaners with multithreaded testing to actually see how good it is.  Watching a score go up or the time taken to do a test going down is part of the parcel as a product reviewer, so watching the score go higher or the time taken going down is almost as good as product innovation.

Back in research, two things can drive the system: publication of results and future relevance for those results.  Understanding the system to get results is priority number one, and then being able to obtain results could be priority number two.  In theoretical fields, where a set of simulations can take from seconds to months and even years, having the hardware to deal with many simulations (or threads within a simulation) and the single threaded speed means more results per unit time.  Extremely useful when you get a weeks worth of results back and you missed a negative sign in the code (happens more often than you think).  Some research groups, with well-developed code, take it to clusters.  Modern takes on the code point towards GPUs, if the algorithm allows, but that is not always the case.

So when it comes to my perspective on the GA-7PESH1, I unfortunately do have not much of a comparison to point at.  As an overclocking enthusiast, I would have loved to see some overclock, but the only thing a Sandy Bridge-E processor with an overclock will do is increase single threaded speed – the overall multithreaded performance on most benchmarks is still below an i7-3960X at 5 GHz (from personal testing).  For simulation performance, it really depends on the simulation itself if it will blaze though the code while using ~410 watts.

Having an onboard 2D chip negates needing a dedicated display GPU, and the network interfaces allow users to remotely check up on system temperatures and fan speeds to reduce overheating or lockups due to thermals.  There are plenty of connections on board for mini-SAS cabling and devices, combined with an LSI SAS chip if RAID is a priority.  The big plus point over consumer oriented double processor boards is the DIMM slot count, with the GA-7PESH supporting up to 512 GB.

Compared to the consumer oriented dual processor motherboards available, one can criticize the GA-7PESH1 for not being forthcoming in terms of functionality.  I would have assumed that being a B2B product that it would be highly optimized for efficiency and a well-developed platform, but the lack of discussion and communication between the server team and the mainstream motherboard team is a missed opportunity when it comes to components and user experience.

This motherboard has been reviewed in a few other places around the internet with different foci with respect to the reviewer experience.  One of the main criticisms was the lack of availability – there is no Newegg listing and good luck finding it on eBay or elsewhere.  I send Gigabyte an email, to which I got the following response:

  • Regarding the availability in the US, so far all our server products are available through our local branch, located at:

    17358 Railroad St.
    City of Industry
    CA 91748

As a result of being a B2B product, pricing for the GA-7PESH1 (or the GA-7PESH2, its brother with a 10GbE port) is dependent on individual requirements and bulk purchasing.  In contrast, the ASUS Z9PE-D8 WS is $580, and the EVGA SR-X is $650.

Review References for Simulations:

[1] Stripping Voltammetry at Microdisk Electrode Arrays: Theory, IJ Cutress, RG Compton, Electroanalysis, 21, (2009), 2617-2625.
[2] Theory of square, rectangular, and microband electrodes through explicit GPU simulation, IJ Cutress, RG Compton, Journal of Electroanalytical Chemistry, 645, (2010), 159-166.
[3] Using graphics processors to facilitate explicit digital electrochemical simulation: Theory of elliptical disc electrodes, IJ Cutress, RG Compton, Journal of Electroanalytical Chemistry, 643, (2010), 102-109.
[4] Electrochemical random-walk theory Probing voltammetry with small numbers of molecules: Stochastic versus statistical (Fickian) diffusion, IJ Cutress, EJF Dickinson, RG Compton, Journal of Electroanlytical Chemistry, 655, (2011), 1-8.
[5] How many molecules are required to measure a cyclic voltammogram? IJ Cutress, RG Compton, Chemical Physics Letters, 508, (2011), 306-313.
[6] Nanoparticle-electrode collision processes: Investigating the contact time required for the diffusion-controlled monolayer underpotential deposition on impacting nanoparticles, IJ Cutress, NV Rees, YG Zhou, RG Compton, Chemical Physics Letters, 514, (2011), 58-61.
[7] D. Britz, Digital Simulation Electrochemistry, in: D. Britz (Ed.), Springer, New York, 2005, p. 187.
[8] W.H. Press, S.A. Teukolsky, W.T. Vetterling, B.P. Flannery, Numerical Recipes: The Art of Scientific Computing, Cambridge University Press, 2007.
[9] D.E. Knuth, in: D.E. Knuth (Ed.), Seminumerical Algorithms, Addison-Wesley, 1981, pp. 130–131
[10] K.Gregory, A.Miller, C++ AMP: Accelerated Massive Parallelism with Microsoft Visual C++, 2012, link.
+ others contained within the references above.

System Benchmarks


View All Comments

  • toyotabedzrock - Saturday, January 05, 2013 - link

    There is a large number of very smart people on Google+. You really should come join us. Reply
  • JlHADJOE - Tuesday, January 08, 2013 - link

    Of course there are lots of smart people on G+! You're all google employees right? =P Reply
  • Activate: AMD - Saturday, January 05, 2013 - link

    As a fellow chemist, I must say that you have to be some kind of nut to want to do computational/physical chemistry. If you need me, I'll be at the bench!

    Good article too!
  • engrpiman - Saturday, January 05, 2013 - link

    I didn't read the article in full but what I did read was top notch. I found your simulations and mathematics very interesting. I took a Physics class which was focused in writing code to run mathematical simulations . Using the given java lib. I wrote my own code to calculate PI. When I returned from the gym the program had calculated 3.1 . I then re-wrote the program from scratch and ditched the built in libs. and reran. I had 20 decimals in 30 sec it was an epic improvement.

    All in all I think your article could be very useful to me.

    Thanks for writing.
  • SodaAnt - Saturday, January 05, 2013 - link

    THIS is why I read Anandtech. I'll admit that I wasn't quite in the mood to read all the equations (I'll have to do that later), but really, these kind of reviews make my day. Reply
  • Cardio - Saturday, January 05, 2013 - link

    Wonderful review...as always. Thanks Reply
  • Hakon - Saturday, January 05, 2013 - link

    Hi Ian. Thanks for the nice article. I have one suggestion regarding the explicit finite difference code:

    You could try to reorder the loops such that the memory access is more cache friendly. Right now 'pos' is incremented by NX (or even NX*NY in 3D) which will generate a lot cache misses for large grids. If you switch the x and the y loop (in the 2D case) this can be avoided.
  • IanCutress - Saturday, January 05, 2013 - link

    Either way I order the loops, each point has to read one up, one down, one left and one right. My current code tries to keep three as consecutive reads and jump once, keeping the old jump in local memory. If I adjusted the loops, I could keep the one dimension in local memory, but I'd have to jump outside twice (both likely cache misses) to get the other data. I couldn't cache those two values as I never use them again in the loop iteration.

    When I did this code on the GPU, one method was to load an XY block into memory and iterate in the Z-dimension, meaning that each thread per loop iteration only loaded one element, with a few of them loading another for the halo, but all cache aligned.

    I hope that makes sense :)
  • Hakon - Saturday, January 05, 2013 - link

    Yes, but when you access the array 'cA' at 'pos' the CPU will fetch the entire cache line (64 byte in case of your machine, i.e. 16 floats) of the corresponding memory address into the CPU cache. That means that subsequent accesses to say 'pos + 1', 'pos + 2' and so on will be served by the cache. Accessing an array in such a sequential manner is therefore fast.

    However, when you access an array in a nasty way, e.g. 'NX + x' -> '2*NX + x', -> '3*NX + x', then each such access implies a trip to main memory if NX happens to be large.

    That you need to move up / down and sideways in memory does not matter. When you write down the accesses of the code with the reordered loops you will notice that they just access three "lines" in memory in a cache friendly way.

    Not reusing the old values of the last iteration should not affect performance in a measurable way. Even if the compiler fails to see this optimization, the accesses will be served by the L1 cache.

    Btw, did you allocate the array having NUMA in mind, i.e. did you initialize your memory in an OpenMP loop with the same access pattern as used in the algorithms? I am a bit surprised by the bad performance of your dual Xeon system.
  • IanCutress - Saturday, January 05, 2013 - link

    Memory was allocated via the new command as it is 1D. When using a 2D array the program was much slower. I was unaware you could allocate memory in an OpenMP way, which thinking about it could make the 2D array quicker. I also tried writing the code using the PPL and lambdas, but that was also slower than a simple OpenMP loop.

    I'm coming at these algorithms from the point of view of a non-CompSci interested in hardware, and the others in the research group were chemists content to write single threaded code on multi-core machines. Transferring the OpenMP variations of that code from a 1P to a 2P, as the results show, give variable results depending on the algorithm.

    There are always ways to improve the efficiency of the code (and many ways to make it unreadable), but for a large part moving to the 2P system all depends on how your code performs. Please understand that my examples being within my limits of knowledge and representative of the research I did :) I know that SSE2/SSE4/AVX would probably help, but I have never looked into those. More often than not, these environments are all about research throughput, so rather than spend a few week to improve efficiency by 10% (or less), they'd rather spend that money getting a faster system which theoretically increases the same code throughput 100%.

    I'll have a look at switching the loops if I write an article similar to this in the future :)


Log in

Don't have an account? Sign up now