Put me in front of a dual processor motherboard and a pair of eight core Xeons with HyperThreading and I will squeal with delight.  Then I will take it to the cleaners with multithreaded testing to actually see how good it is.  Watching a score go up or the time taken to do a test going down is part of the parcel as a product reviewer, so watching the score go higher or the time taken going down is almost as good as product innovation.

Back in research, two things can drive the system: publication of results and future relevance for those results.  Understanding the system to get results is priority number one, and then being able to obtain results could be priority number two.  In theoretical fields, where a set of simulations can take from seconds to months and even years, having the hardware to deal with many simulations (or threads within a simulation) and the single threaded speed means more results per unit time.  Extremely useful when you get a weeks worth of results back and you missed a negative sign in the code (happens more often than you think).  Some research groups, with well-developed code, take it to clusters.  Modern takes on the code point towards GPUs, if the algorithm allows, but that is not always the case.

So when it comes to my perspective on the GA-7PESH1, I unfortunately do have not much of a comparison to point at.  As an overclocking enthusiast, I would have loved to see some overclock, but the only thing a Sandy Bridge-E processor with an overclock will do is increase single threaded speed – the overall multithreaded performance on most benchmarks is still below an i7-3960X at 5 GHz (from personal testing).  For simulation performance, it really depends on the simulation itself if it will blaze though the code while using ~410 watts.

Having an onboard 2D chip negates needing a dedicated display GPU, and the network interfaces allow users to remotely check up on system temperatures and fan speeds to reduce overheating or lockups due to thermals.  There are plenty of connections on board for mini-SAS cabling and devices, combined with an LSI SAS chip if RAID is a priority.  The big plus point over consumer oriented double processor boards is the DIMM slot count, with the GA-7PESH supporting up to 512 GB.

Compared to the consumer oriented dual processor motherboards available, one can criticize the GA-7PESH1 for not being forthcoming in terms of functionality.  I would have assumed that being a B2B product that it would be highly optimized for efficiency and a well-developed platform, but the lack of discussion and communication between the server team and the mainstream motherboard team is a missed opportunity when it comes to components and user experience.

This motherboard has been reviewed in a few other places around the internet with different foci with respect to the reviewer experience.  One of the main criticisms was the lack of availability – there is no Newegg listing and good luck finding it on eBay or elsewhere.  I send Gigabyte an email, to which I got the following response:

  • Regarding the availability in the US, so far all our server products are available through our local branch, located at:

    17358 Railroad St.
    City of Industry
    CA 91748
    +1-626-854-9338

As a result of being a B2B product, pricing for the GA-7PESH1 (or the GA-7PESH2, its brother with a 10GbE port) is dependent on individual requirements and bulk purchasing.  In contrast, the ASUS Z9PE-D8 WS is $580, and the EVGA SR-X is $650.

Review References for Simulations:

[1] Stripping Voltammetry at Microdisk Electrode Arrays: Theory, IJ Cutress, RG Compton, Electroanalysis, 21, (2009), 2617-2625.
[2] Theory of square, rectangular, and microband electrodes through explicit GPU simulation, IJ Cutress, RG Compton, Journal of Electroanalytical Chemistry, 645, (2010), 159-166.
[3] Using graphics processors to facilitate explicit digital electrochemical simulation: Theory of elliptical disc electrodes, IJ Cutress, RG Compton, Journal of Electroanalytical Chemistry, 643, (2010), 102-109.
[4] Electrochemical random-walk theory Probing voltammetry with small numbers of molecules: Stochastic versus statistical (Fickian) diffusion, IJ Cutress, EJF Dickinson, RG Compton, Journal of Electroanlytical Chemistry, 655, (2011), 1-8.
[5] How many molecules are required to measure a cyclic voltammogram? IJ Cutress, RG Compton, Chemical Physics Letters, 508, (2011), 306-313.
[6] Nanoparticle-electrode collision processes: Investigating the contact time required for the diffusion-controlled monolayer underpotential deposition on impacting nanoparticles, IJ Cutress, NV Rees, YG Zhou, RG Compton, Chemical Physics Letters, 514, (2011), 58-61.
[7] D. Britz, Digital Simulation Electrochemistry, in: D. Britz (Ed.), Springer, New York, 2005, p. 187.
[8] W.H. Press, S.A. Teukolsky, W.T. Vetterling, B.P. Flannery, Numerical Recipes: The Art of Scientific Computing, Cambridge University Press, 2007.
[9] D.E. Knuth, in: D.E. Knuth (Ed.), Seminumerical Algorithms, Addison-Wesley, 1981, pp. 130–131
[10] K.Gregory, A.Miller, C++ AMP: Accelerated Massive Parallelism with Microsoft Visual C++, 2012, link.
+ others contained within the references above.

System Benchmarks
POST A COMMENT

64 Comments

View All Comments

  • dj christian - Monday, January 14, 2013 - link

    No please!

    This article should be a one time only or once every 2 years at most.
    Reply
  • nadana23 - Sunday, January 06, 2013 - link

    From the looks of results some of the benchmarks are HIGHLY sensitive to effective bandwidth per thread (ie, GDDR5 feeding a GPU stream processor >> DDR3 feeding a Xeon HT core).

    However - it must be noted that 8x DIMMS is insufficient to achieve full memory bandwidth on Xeon E5 2S!

    I'd suggest throwing a pure memory bandwidth test into the mix to make sure you're actually getting the rated number (51.2GB/s)...

    http://ark.intel.com/products/64596/Intel-Xeon-Pro...

    ... as I strongly suspect your memory config is crippling results.

    Dell's 12G config guidelines are as good a place as any to start on this :-

    http://en.community.dell.com/cfs-file.ashx/__key/c...

    Simply removing one E5-2590 and moving to 1-Package, 8 DIMM config may (counter-intuitively) bench(market) faster... for you.
    Reply
  • dapple - Sunday, January 06, 2013 - link

    Great article, thanks! This is the sort of benchmark I've been wanting to see for quite some time now - simple, brute-force numerics where the code is visible and straightforward. Too many benchmarks are black boxes with processor- and compiler-specific tunes to make manufacturer "X" appear superior to "Y". That said, it would be most illustrative to perform a similar 'mark using vanilla gcc on both MS and *nix OS. Reply
  • daosis - Sunday, January 06, 2013 - link

    It is long known issue, when windows does not start after changing hardware, especially GPU (not always so). There is as long known trick so. Just before last "power off" one should replace GPU's own driver with basic microsoft's one. In case of GPU it is "standart Vga adapter" (device manager - update driver - browse my computer - let me pick up). In fact one can replace all specific drivers on OS with similiar basic from MS and then to put this hard drive virtually to any system without any need for fresh install. Mind you, that after first boot it takes some time for OS to find and install specific drivers. Reply
  • jamesf991 - Sunday, January 06, 2013 - link

    In the early '70s I was doing very similar simulations using a PDP 11/40 minicomputer. (I can send citations to my publications if anyone is interested.) At Texas Tech and later at Caltech, I simulated systems involving heterogeneous electron transfer kinetics, various chemical reactions in solution, coulostatics, galvanostatics, voltammetry, chronocoulometry, AC voltammetry, migration, double layer effects, solution hydrodynamics (laminar only), etc. Much of this was done on a PDP 11/40, originally with 8K words (= 16K bytes) of core memory. Later the machine was upgraded to 24 K words (!), we got a floating point board, and a hard disk drive (5 M words, IIRC). My research director probably paid in excess of $50K for the hardware. One cute project was to put a simulation "inside" a nonlinear regression routine to solve for electrode kinetic parameters such as k and alpha. Each iteration of the nonlinear solver required a new simulation -- hand-coding the innermost loops using floating point assembly instructions was a big speedup!

    I wonder how the old PDP would stack up against the 3770?
    Reply
  • flynace - Monday, January 07, 2013 - link

    Do you guys think that once Haswell moves the VRM on package that someone might do a 2 socket mATX board?

    Even if it means giving up 2 of the 4 PCIe slots and/or 2 DIMMs per socket it would be nice to have a high core count standard SFF board for those that need just that.
    Reply
  • samsp99 - Monday, January 07, 2013 - link

    I found this review interesting, but I don't think this board is really targeted at the HPC market. It seems like it would be good as part of a 2U / 12 + 2 drive system, similar to the Dell C2100. It would make a good virtual host, SQL, active web server etc. Having the 3 mSAS connectors would enable 4 drive each without the need for a SAS expander.

    Servers are designed for 99.999% uptime, remote management, and hands-off operation. To achieve that you need redundent power, UPS, Networking, storage etc. They also require high airflow, which is noisy and not something you want sitting under your desk. Based on that, it makes sense that the MB is intended for sale to system builders not your general build your own enthusiast.

    HW manufactuerers are faced with a similar problem to airlines - consumers gravitate to the cheapest price, and so the only real money to be made is selling higher profit margin products to businesses. Servers are where intel etc makes their profits.

    For the computational problems the author is trying to solve, to me it would seem to be better to consider:
    a) At one point, I think google was using commodity hardware, with custom shelving etc. Assuming the algorithms can be paralleled on different hosts, you shouldn't need the reliability of traditional servers, so why not use a number of commodity systems together, choosing the components that have the best perf/$.

    b) There are machines designed for HPC scenarios, such as HPC Systems E5816 that supports 8x Xeon E7-8000 (10 core) processors, or the E4002G8 - that will take 8 nVidia Tesla cards.

    c) What about developing and testing the software on cheap worstations, and then when you are sure its ready, buying compute time from Amazon cloud services etc.
    Reply
  • babysam - Monday, January 07, 2013 - link

    It is quite delighting to look at your review on Anandtech (especially when I am using software and computer configurations of similar nature for my studies), as it is quite difficult for me to evaluate the performance gain of "real-life" software (i.e. science oriented in my case) on new hardware before buying.

    From what I have seen in your code segments provided (especially for the n-body simulation part) , there are large amount of floating-point divisions. Is there any possibility that the code is not only limited by the cache size(and thrashing), but by the limited throughput of the floating-point divider? (i.e. The performance degradations when HT is enabled may also be caused by the competition of the two running threads on the only floating-point divider in the core)
    Reply
  • SanX - Tuesday, January 08, 2013 - link

    if you post zipped sources and exes for anyone to follow, learn, play, argue and eventually improve.

    I'd also preferred to see Fortran sources and benchmarks when possible.

    Intel/AMD should start promote 2/4/8 socket monster mobos for enthusiasts and then general public since this is the beginning of the infinite in time era for multiprocessing.

    Also where are games benchmarks like for example GTA4 which benefits a lot from multicores as well as from GPUs?
    Reply
  • IanCutress - Wednesday, January 09, 2013 - link

    The n-body simulations are part of the C++ AMP example page, free for everyone to use. The rest of the code is part of a benchmark package I'm creating, hence I only give the loops in the code. Unfortunately I know no Fortran for benchmarks.

    Most mainstream users (i.e. gamers) still debate whether 4 or 6 cores are even necessary, so moving to 2P/4P/8P is a big leap in that regard. Enthusiasts can still get the large machines (a few folders use quad AMD setups) if they're willing to buy from ebay which may not always be wholly legal. You may see 2P/4P/8P becoming more mainstream when we start to hit process node limits.

    Ian
    Reply

Log in

Don't have an account? Sign up now