Visual Inspection

I knew server boards were large, but coming from the ATX and E-ATX standards, this thing is huge.  It measures 330mm x 305mm (13” x 12”) which correlates to the SSI EEB specification for server motherboards.  This is the size exact size of an official E-ATX motherboard (despite a small amount of loose definition), but to put the icing on the cake, the mounting holes for the motherboard are different to the normal ATX standards.  If we took a large case, like the Rosewill Blackhawk-Ultra, it supports ATX, SSI CEB, XL-ATX, E-ATX and HPTX, up to 13.6” x 15”, but not SSI EEB.  Thus drilling extra holes for standoffs may be required.

Unlike the SR-X or Z9PE-D8 WS, the GA-7PESH1 supports two memory modules per channel for all channels on board.  In terms of specifications this means support for up to 128 GB UDIMM (i.e. regular DDR3), 128 GB UDIMM ECC, and 512 GB RDIMM ECC.  Due to the nature of the design, only 1066-1600 MHz is supported, but the GA-7PESH1 supports 1600 MHz when all slots are populated.  For our testing, Kingston has kindly supplied us with 8x4GB of their 1600 C11 ECC memory.

As with the majority of server boards, stability and longevity is a top priority.  This means no overclocking, and Gigabyte can safely place a six phase power delivery on each CPU – it also helps that all SB-E Xeons are multiplier locked and there is no word of unlocked CPUs being released any time soon.  As we look at the board, standards dictate that the CPU on the right is designated as the first CPU.  Each CPU has access to a single fan header, and specifications for coolers are fairly loose in both the x and the y directions, limited only by memory population and the max z-height of the case or chassis the board is being placed into.  As with all dual CPU motherboards, each CPU needs its own Power Connector, and we find them at the top of the board behind the memory slots and at opposite ends.  The placement of these power connectors is actually quite far away for a normal motherboard, but it seems that the priority of the placement is at the edge of the board.  In between the two CPU power connectors is a standard 24-pin ATX power connector.

One of the main differences I note coming from a consumer motherboard orientation is the sheer number of available connectors and headers on such a server motherboard.  For example, the SATA ports have to be enabled by moving the jumpers the other side of the chipset.  The chipset heatsink is small and basic – there is no need for a large heatsink as the general placement for such a board would be in a server environment where noise is not particularly an issue if there are plenty of Delta fans to help airflow.

On the bottom right of the board we get a pair of SATA ports and three mini-SAS connections.  These are all perpendicular to the board, but are actually in the way of a second GPU being installed in a ‘normal’ motherboard way.  Users wishing to use the second PCIe x8 slot on board may look into PCIe risers to avoid this situation.  The heatsink on the right of this image covers up an LSI RAID chip, allowing the mSAS drives to be hardware RAIDed.

As per normal operation on a C602 DP board, the PCIe slots are taken from the PEG of one CPU.  On some other boards, it is possible to interweave all the PCIe lanes from both CPUs, but it becomes difficult when organizing communication between the GPUs on different CPUs.  From top to bottom we get an x8 (@x4), x16, x8 (@x4), x16 (@x8), x4(@x1).  It seems odd to offer these longer slots at lower speed ratings, but all of the slots are Gen 3.0 capable except the x4(@x1).  The lanes may have been held back to maintain data coherency.

To those unfamiliar with server boards, of note is the connector just to the right of center of the picture above.  This is the equivalent of the front panel connection on an ATX motherboard.  At almost double the width it has a lot more options, and where to put your cables is not printed on the PCB – like in the old days we get the manual out to see what is what.

On the far left we have an ASPEED AST2300 chip, which has multiple functions.  On one hand it is an onboard 2D graphics chip which powers the VGA port via its ARM926EJ (ARM9) core at 400 MHz.  For the other, it as an advanced PCIe graphics and remote management processor, supporting dual NICs, two COM ports, monitoring functions and embedded memory.  Further round this section gives us a removable BIOS chip, a COM header, diagnostic headers for internal functions, and a USB 2.0 header.

The rear IO is very bare compared to what we are normally used to.  From left to right is a serial port, the VGA port, two gigabit Ethernet NICs (Intel I350), four USB 2.0 ports, the KVM server management port, and an ID Switch button for unit identification.  There is no audio here, no power/reset buttons, and no two-digit debug LED.  It made for some rather entertaining/hair removing scenarios when things did not go smoothly during testing.

Board Features

Gigabyte GA-7PESH1
Price Contact:
17358 Railroad St.
City of Industry
CA 91748
CPU Interface LGA 2011
Chipset Intel C602
Memory Slots Sixteen DDR3 DIMM slots supporting:
128GB (UDIMM) @ 1.5V
512GB (RDIMM) @ 1.5V
128GB DDR3L @ 1.35 V
Quad Channel Arcitecture
ECC RDIMM for 800-1600 MHz
Non-ECC UDIMM for 800-1600 MHz
Video Outputs VGA via ASPEED 2300
Onboard LAN 2 x Intel I350 supporting uo to 1000 Mbps
Onboard Audio None
Expansion Slots 1 x PCIe 3.0 x16
1 x PCIe 3.0 x16 (@ x8)
2 x PCIe 3.0 x8 (@ x4)
1 x PCIe 2.0 x4 (@ x1)
Onboard SATA/RAID 2 x SATA 6 Gbps, Supporting RAID 0,1
2 x mini-SAS 6 Gbps, Supporting RAID 0,1
1 x mini-SAS 3 Gbps, Supporting RAID 0,1
USB 6 x USB 2.0 (Chipset) [4 back panel, 2 onboard]
Onboard 2 x SATA 6 Gbps
2 x mSAS 6 Gbps
1 x mSAS 3 Gbps
1 x USB 2.0 Header
4 x Fan Headers
1 x PSMI header
1 x TPM header
1 x SKU KEY header
Power Connectors 1 x 24-pin ATX Power Connector
2 x 8-pin CPU Power Connector
Fan Headers 2 x CPU (4-pin)
2 x SYS (4-pin, 3-pin)
IO Panel 1 x Serial Port
1 x VGA
2 x Intel I350 NIC
4 x USB 2.0
1 x ID Switch
Warranty Period Refer to Sales
Product Page Link

Without having a direct competitor to this board on hand there is little we can compare such a motherboard to.  In this level having server grade Intel NICs should be standard, and this board can take 8GB non-ECC memory sticks or 32GB ECC memory sticks, for a maximum of 512 GB.  If your matrix solvers are yearning for memory, then this motherboard can support it.

The Perspective Gigabyte GA-7PESH1 BIOS
Comments Locked


View All Comments

  • Hakon - Saturday, January 5, 2013 - link

    Thank you for the detailed answer. I very much appreciate your article and hope to see more stuff like this on Anandtech.

    What I meant regarding to NUMA is the following. When you have a dual socket Xeon you have two memory controllers. The first time you 'touch' a memory location it is assigned to the memory controller of the CPU that runs the current thread. This assignment is in general permanent and all further memory read/writes to that location will be served by that memory controller.

    If you first-touch (e.g. initialize the array to zero) using one thread, then the whole array is assigned to one of the two memory controllers. When you then run the multi-threaded code on that array one memory controller is idle while the other is oversubscribed since it has to serve both CPUs.

    In contrast, if you first-touch your array in an OpenMP loop and use the same access pattern as in the algorithm, you will benefit from both memory controllers later on. In this case your large array is correctly 'distributed' over both memory controllers.

    This kind of memory layout optimization becomes extremely important when you deal with quad socket Opterons. You then have eight memory controllers. A NUMA aware code is therefore up to eight times as fast since it utilizes all memory controllers.
  • toyotabedzrock - Saturday, January 5, 2013 - link

    You should go ask the people on the assembly boards for help with making your code faster.

    They are very friendly compared to a Linux kernel devs, I think they just enjoy the acknowledgement that they still exist and are useful.
  • snajpa - Saturday, January 5, 2013 - link

    Blame the scheduler. Neither Windows nor linux can effectively handle larger NUMA systems. It randomly moves the process across the physical hardware.
  • psyq321 - Sunday, January 6, 2013 - link

    Hmm, this is definitely not true at least for Windows Server 2008 R2 / Windows 7, and I am sure it holds true for some versions of Linux (I am not a Linux expert).

    Windows Server 2008 R2 / Windows 7 scheduler will try to match the memory allocations (even if they are not tagged for a specific NUMA node) with the NUMA node the process/thread resides on, and they will not move a thread to a foreign NUMA node unless if that has been explicitly requested by the application (by setting the thread affinity)

    Of course, without explicit NUMA node tagging when doing allocations, application code is the main culprit for not respecting the NUMA layout (e.g. creating bunch of threads, allocating memory from one of them - and then pinning the threads to different CPUs - you will have lots of LLC requests from remote DRAM because memory was a-priori allocated on one node).

    For this - some sane coding helps a lot, here:

    I describe how I extracted more than double performance by careful memory allocation (NUMA-aware) - please note that neither Windows nor Linux scheduler is able to cope with code which is not written to be NUMA aware and it is using large number of threads that are supposed to run on all CPUs.. Simply put, application writer will have to manage memory allocation and usage in the way so that there are as little remote DRAM requests as possible.
  • snajpa - Sunday, January 6, 2013 - link

    About Windows scheduler - I only worked with Windows XP, now I don't have any reason to work with Win anymore, so what you say probably is really true. As for the linux versions - well, long story short, CFS sucks and everyone knows it - this is particularly noticeable if you have fully virtualized VMs which appear as one single process at the host system - the process is randomly swapped between CPU cores and even CPU dies.... sad story. That's why people have to pin their CPUs to their tasks manually.
  • psyq321 - Sunday, January 6, 2013 - link

    Ah, XP - that explains it. True, XP did not care about NUMA at all.

    Windows Server 2008 / Vista introduced NUMA-aware memory allocations, and changed their CPU scheduler so it does not move the thread across NUMA nodes. They will also try to allocate the memory from the thread's own NUMA node when legacy VirtualAlloc etc. APIs are used.

    Windows Server 2008 R2 / Windows 7 introduced the concept of CPU groups - allowing more than 64 CPUs. This does require some adaptation of the application, as old threading APIs only work with 64-bit affinity bitmask which only allowed recognizing 64 CPUs. Now, there is a new set of APIs that work with GROUP_AFFINITY structure, allowing control of CPU groups, too. However, this needs explicit change of the legacy process/threading APIs to the new ones.

    Furthermore, none of the above can replace some manual intervention*- while Windows scheduler will, indeed, respect NUMA node boundaries and not try to mess around with moving threads across them - it still does not know what the underlying algorithm wants to do.

    * There is no need to set the thread affinity to one specific CPU anymore - this prevents running the thread on any other CPU completely. Instead, there is an API called SetThreadIdealProcessor(Ex) which signals Windows scheduler that thread >should< run on that particular CPU - but, under certain circumstances the scheduler can move the thread somewhere else - if the CPU is completely taken away by some other thread/process. Scheduler will try to move the thread as close as possible - to the next core in the socket, for example - or to the next core in the group (group is always contained within a NUMA node).

    You can, however, absolutely forbid Windows scheduler from passing the thread to another NUMA node under any circumstances by simply getting the said NUMA node affinity mask (GetNumaNodeProcessorMask(Ex)) and setting this affinity as a thread affinity. This + setting the "ideal" processor still gives Windows scheduler some headroom to move the thread to another core if it is found to be better in a given moment, but it will not even attempt to cross the NUMA boundary in any case whatsoever.
  • lmcd - Monday, January 7, 2013 - link

    While I haven't personally researched them, there are tons of other schedulers that have been written for Linux and I'm certain *at least* one of them is more fitting to this line of work. I've heard of alternatives like BFS and the Linux kernel is so widely used I'm sure there's a gem out there for this application.
  • toyotabedzrock - Saturday, January 5, 2013 - link

    Have you ever tried the Intel Math Kernel Library? It might speed up some of the equations. It also hands off work to the Intel MIC card if it thinks it will speed it up.
  • KAlmquist - Saturday, January 5, 2013 - link

    The GA-7PESH1 motherboard is $855, and the CPU's are $2020 each, which adds up to $4895. On tasks which don't parallelize well, you can get similar performance from the i7-3770K, which costs an order of magnitude less. (Prices: i7-3770K $320, ASRock Z77 Extreme6 motherboard $152, total from motherboard and CPU $472.) On tasks which parallelize well enough that they can be run on a GPU, the system with the GA-7PESH1 will beat the i7-3770K, but will be crushed by a midrange GPU. So the price/performance of this system is pretty bad unless you throw just the right workload at it.

    The motherboard price from super-laptop-parts dot com, and the other prices are from a major online retailer that I won't name in order to get around the spam filter.
  • Death666Angel - Saturday, January 5, 2013 - link

    So, your 3770K has ECC memory or VT-D, TXT etc.?

Log in

Don't have an account? Sign up now