Trials of an Intel Quad Processor System: 4x E5-4650L from SuperMicro

In recent months at AnandTech we have tackled a few issues of dual processor systems for regular use, including whether a dual processor system helps or hinders a theoretical scientist across various benchmark scenarios.  For the problems that I encountered as a theoretical physical chemist, using a dual processor system without any formal training in memory allocation (NUMA) resulted in a severe performance hit for anything that required a significant level of memory accesses, especially grid solvers that pull information from large arrays held in memory.  Part of the issue was the latency of accessing data held in the memory of the other CPU, so formal training in writing NUMA-aware code would be valuable for multi-processor systems.  Nevertheless, in my AnandTech testing we did see significant speedup in various ‘pre-built’ software scenarios such as video conversion using Xilisoft Video Converter, rendering using PovRay and our 3D Particle Movement Benchmark.

To take this testing one stage further, SuperMicro kindly agreed to loan me remote desktop access to one of their internal quad processor (4P) systems.  The movement from 2P to 4P is almost strictly in the realms of business investment, except for a few Folding@home enthusiasts that have seen large gains moving to a quad processor AMD system using obscure buyers for motherboards and eBay for processors.  But with 4P in the business realm, the software has to match that usage scenario and scale appropriately.

Our testing scenario will cover our server motherboard CPU tests only – as I only had remote desktop access I was not fortunate enough to do any ‘gaming’ tests, although our gaming CPU article may have shown that unless you are doing a massive multi-screen multi-GPU setup then anything more than a single Sandy Bridge-E system may be overkill.

Test Setup:

Supermicro X9QR7-TF+
4x Intel Xeon E5-4650L @ 2.6 GHz (3.1 GHz Turbo), 8 cores (16 threads) each
Kingston 128GB ECC DDR3-1600 C11
Windows Server Edition 2012 Standard

Issues Encountered

As you might imagine, moving from 1P to 2P and then to 4P without much experience in the field of multi-processor calculations was initially very daunting.  The main issue moving to 4P was having an operating system that actually detected all the available threads and then communicated that number to software through the Windows APIs.  In both Windows Server 2008 R2 Standard and 2012 Standard, the system would show all 64 threads in Task Manager, but only report 32 threads to software.  This causes problems for software that automatically detects the number of threads on a system and spawns only that many.  In this scenario the user would need to set the number of threads manually, but whether that is possible depends on the way the program was written.  For example, our Xilisoft and 3DPM tests perform automatic thread detection and use exactly what is detected, whereas PovRay spawns a large number of threads regardless of what is detected.  Cinebench also detected only half the threads automatically, but at least has an option to spawn a custom number of threads.
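To illustrate why a manual override matters when detection comes up short, here is a minimal sketch in Python.  It is not taken from any of the benchmarks above; the function name `choose_thread_count` and its `requested` and `detected` parameters are hypothetical, and the only real API used is `os.cpu_count()`, which returns whatever logical CPU count the operating system reports (or `None` if it cannot tell).

```python
import os

def choose_thread_count(requested=None, detected=None):
    """Return the number of worker threads to spawn.

    requested: explicit user override (hypothetical parameter), or None.
    detected:  the count reported by the OS; defaults to os.cpu_count().
    """
    if detected is None:
        detected = os.cpu_count() or 1  # cpu_count() can return None
    if requested is not None:
        if requested < 1:
            raise ValueError("requested thread count must be at least 1")
        return requested  # trust the user over automatic detection
    return detected

# Mimicking the 4P system: 32 threads reported, 64 actually present.
print(choose_thread_count(detected=32))
print(choose_thread_count(requested=64, detected=32))
```

Software that only ever takes the detected path behaves like the Xilisoft and 3DPM tests here, while an override like `requested` corresponds to the custom thread option in Cinebench.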

Point Calculations - 3D Movement Algorithm Test

The algorithms in 3DPM employ either uniform random number generation or normal distribution random number generation, and vary in the amount of trigonometric operations, conditional statements, generation and rejection, fused operations, etc.  The benchmark runs through six algorithms for a specified number of particles and steps, and calculates the speed of each algorithm, then sums them all for a final score.  This is an example of a real world situation that a computational scientist may find themselves in, rather than a pure synthetic benchmark.  The benchmark is also parallel across the particles simulated, and we test the single thread performance as well as the multi-threaded performance.
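As a rough illustration of the kind of work involved, the sketch below moves a particle by a series of unit steps in uniformly random 3D directions using trigonometric operations.  This is not the 3DPM source code; the function names, the seeding scheme and the choice of algorithm are assumptions made purely for illustration.

```python
import math
import random

def random_unit_step(rng):
    """Pick a uniformly random direction on the unit sphere."""
    z = rng.uniform(-1.0, 1.0)             # cosine of the polar angle
    phi = rng.uniform(0.0, 2.0 * math.pi)  # azimuthal angle
    r = math.sqrt(1.0 - z * z)
    return r * math.cos(phi), r * math.sin(phi), z

def move_particle(steps, seed):
    """Move one particle from the origin; each particle has its own RNG."""
    rng = random.Random(seed)
    x = y = z = 0.0
    for _ in range(steps):
        dx, dy, dz = random_unit_step(rng)
        x += dx
        y += dy
        z += dz
    return x, y, z

# Particles never interact, so each call could run on its own thread.
final_positions = [move_particle(1000, seed) for seed in range(4)]
print(len(final_positions))
```

Because no particle depends on any other, this style of workload scales with thread count as long as the software actually spawns enough threads.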

3D Particle Movement Single Threaded
3D Particle Movement MultiThreaded

The 3DPM test falls under the half-thread detection issue, and as a result of the high thread count but lower single core speed we only just get an improvement over a 2P Westmere-EP system.  For single thread performance, the E5-4650L (3.1 GHz turbo) is too slow to compete with other Sandy Bridge and above processors.

Compression - WinRAR 4.2

With 64-bit WinRAR, we compress the set of files used in the USB speed tests.  WinRAR x64 3.93 attempts to use multithreading when possible, and provides a good test for when a system has variable threaded load.  WinRAR 4.2 does this a lot better!  If a system has multiple speeds to invoke at different loading, the switching between those speeds will determine how well the system will do.

WinRAR 3.93
WinRAR 4.2

As WinRAR is ultimately dependent on memory speed, the DDR3-1600 C11 memory runs into the same issues that other low memory speed setups face.  The 2P Westmere-EP system still beats the 4P, but to really take advantage you need a system with good single core speed and high bandwidth memory.

Image Manipulation - FastStone Image Viewer 4.2

FastStone Image Viewer is a free piece of software I have been using for quite a few years now.  It allows quick viewing of flat images, as well as resizing, changing color depth, adding simple text or simple filters.  It also has a bulk image conversion tool, which we use here.  The software currently operates only in single-thread mode, which should change in later versions of the software.  For this test, we convert a series of 170 files, of various resolutions, dimensions and types (of a total size of 163MB), all to the .gif format of 640x480 dimensions.

FastStone Image Viewer 4.2

MHz and IPC win for FastStone, and the E5-4650L does not have the single thread speed to compete.

Video Conversion - Xilisoft Video Converter 7

With XVC, users can convert any type of normal video to any compatible format for smartphones, tablets and other devices.  By default, it uses all available threads on the system, and in the presence of appropriate graphics cards, can utilize CUDA for NVIDIA GPUs as well as AMD WinAPP for AMD GPUs.  For this test, we use a set of 33 HD videos, each lasting 30 seconds, and convert them from 1080p to an iPod H.264 video format using just the CPU.  The time taken to convert these videos gives us our result.

Xilisoft Video Converter 7

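Due to the nature of XVC we do not see any speed up against Westmere-EP.  With the half-thread detection issue the software works through 32 videos at once, leaving the 33rd video to be converted afterwards on a single thread, which essentially doubles the time of the conversion.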
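The arithmetic behind that doubling can be sketched as follows, under the simplifying assumptions that each video occupies one thread for its whole conversion and that every video takes about the same time.  The function name `conversion_rounds` is hypothetical, and the 64-thread case is a what-if rather than a measured result.

```python
import math

def conversion_rounds(videos, threads):
    """Rounds needed when each video ties up one thread until it finishes."""
    return math.ceil(videos / threads)

print(conversion_rounds(33, 32))  # 2: the 33rd video needs a second round
print(conversion_rounds(33, 64))  # 1: every video would fit in one round
```

With 33 videos the second round contains a single video, so 63 of the 64 hardware threads sit idle for roughly half of the run.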

Rendering – PovRay 3.7

The Persistence of Vision RayTracer, or PovRay, is a freeware package for, as the name suggests, ray tracing.  It is a pure renderer, rather than modeling software, but the latest beta version contains a handy benchmark for stressing all processing threads on a platform.  We have been using this test in motherboard reviews to test memory stability at various CPU speeds to good effect – if it passes the test, the IMC in the CPU is stable for a given CPU speed.  As a CPU test, it runs for approximately 2-3 minutes on high end platforms.

PovRay 3.7 Multithreaded Benchmark

PovRay is the first benchmark that shows the full strength of 64 Intel threads, scoring almost double that of the 24 thread Westmere-EP system (which was at higher frequency).

Video Conversion - x264 HD Benchmark

The x264 HD Benchmark uses a common HD encoding tool to process an HD MPEG2 source at 1280x720 at 3963 Kbps.  This test represents a standardized result which can be compared across other reviews, and is dependent on both CPU power and memory speed.  The benchmark performs a 2-pass encode, and the results shown are the average of each pass performed four times.

x264 HD Benchmark Pass 1
x264 HD Benchmark Pass 2

The issue with memory management and NUMA comes into effect with x264, and the complex memory accesses required over the QPI links put a dent in performance.

Grid Solvers - Explicit Finite Difference

For any grid of regular nodes, the simplest way to calculate the next time step is to use the values of those around it.  This makes for easy mathematics and parallel simulation, as each node calculated is only dependent on the previous time step, not the nodes around it on the current calculated time step.  By choosing a regular grid, we reduce the levels of memory access required for irregular grids.  We test both 2D and 3D explicit finite difference simulations with 2^n nodes in each dimension, using OpenMP as the threading operator in single precision.  The grid is isotropic and the boundary conditions are sinks.  Values are floating point, with memory cache sizes and speeds playing a part in the overall score.
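A single explicit time step might look like the sketch below.  It assumes a diffusion-type equation on a square 2D grid, which the description above does not state outright, and it is written in plain Python rather than the OpenMP code used for the benchmark.  The name `explicit_step` and the parameter `alpha` (standing for D·dt/dx², which must not exceed 0.25 for stability in 2D) are hypothetical.

```python
def explicit_step(u, alpha):
    """Advance a square 2D grid by one explicit finite difference step.

    u:     list of lists holding the previous time step.
    alpha: D*dt/dx^2 (assumed name); keep it <= 0.25 for stability.
    Boundary nodes act as sinks and are held at zero.
    """
    n = len(u)
    new = [[0.0] * n for _ in range(n)]
    for i in range(1, n - 1):
        for j in range(1, n - 1):
            # Every read is from the previous time step, so each node
            # can be updated independently of the others.
            new[i][j] = u[i][j] + alpha * (
                u[i - 1][j] + u[i + 1][j] + u[i][j - 1] + u[i][j + 1]
                - 4.0 * u[i][j]
            )
    return new

grid = [[0.0] * 5 for _ in range(5)]
grid[2][2] = 1.0  # a single spike in the centre
grid = explicit_step(grid, 0.2)
print(grid[2])
```

Each node reads its four neighbours from a large shared array, which is exactly the access pattern that suffers when half of those neighbours live in memory attached to another socket.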

Explicit Finite Difference Grid Solver (2D)
Explicit Finite Difference Grid Solver (3D)

It seems odd to consider that a 4P system might be detrimental to a computationally intensive benchmark, but it all boils down to learning how to code for the system you are simulating.  Porting code written for a single CPU system onto a multiprocessor workstation is not a simple matter of copy-paste-done.

Grid Solvers - Implicit Finite Difference + Alternating Direction Implicit Method

The implicit method takes a different approach to the explicit method – instead of considering one unknown in the new time step to be calculated from known elements in the previous time step, we consider that an old point can influence several new points by way of simultaneous equations.  This adds to the complexity of the simulation – the grid of nodes is solved as a series of rows and columns rather than points, reducing the parallel nature of the simulation by a dimension and drastically increasing the memory requirements of each thread.  The upside, as noted above, is the less stringent stability rules related to time steps and grid spacing.  For this we simulate a 2D grid of 2^n nodes in each dimension, using OpenMP in single precision.  Again our grid is isotropic with the boundaries acting as sinks.  Values are floating point, with memory cache sizes and speeds playing a part in the overall score.
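To show what solving a row as a set of simultaneous equations can involve, the sketch below applies the Thomas algorithm (a standard tridiagonal solver) to the interior nodes of one row.  Whether the benchmark uses this exact solver is not stated, so treat it as an assumption; the function names and the parameter `alpha` (D·dt/dx²) are hypothetical, and the sink boundaries are represented by treating the nodes just outside the row as zero.

```python
def solve_tridiagonal(a, b, c, d):
    """Thomas algorithm for a tridiagonal system.

    a: sub-diagonal, b: main diagonal, c: super-diagonal, d: right-hand side.
    a[0] and c[-1] fall outside the matrix and do not affect the result.
    """
    n = len(d)
    cp = [0.0] * n
    dp = [0.0] * n
    cp[0] = c[0] / b[0]
    dp[0] = d[0] / b[0]
    for i in range(1, n):  # forward elimination
        denom = b[i] - a[i] * cp[i - 1]
        cp[i] = c[i] / denom
        dp[i] = (d[i] - a[i] * dp[i - 1]) / denom
    x = [0.0] * n
    x[-1] = dp[-1]
    for i in range(n - 2, -1, -1):  # back substitution
        x[i] = dp[i] - cp[i] * x[i + 1]
    return x

def implicit_row(row, alpha):
    """Implicit update of one row's interior nodes with zero (sink) ends."""
    n = len(row)
    return solve_tridiagonal([-alpha] * n, [1.0 + 2.0 * alpha] * n,
                             [-alpha] * n, row)

print(implicit_row([0.0, 1.0, 0.0], 1.0))
```

The forward and backward sweeps are sequential along the row, which is why parallelism is only available across rows (or columns) and why each thread needs working arrays as long as the row itself.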

Implicit Finite Difference Grid Solver (2D)

Conclusions – Learn How To Code!

For users considering multiprocessor systems, consider your usage scenario.  If your simulation contains highly independent elements and lightweight threads, then the obvious suggestion is to look at GPUs for your needs.  For all other purposes a single CPU system is a lot easier to work with, although multiprocessor scaling may be achievable if the code pays attention to memory management.

This makes sense when compiling your own code – the issue gets a lot tougher when dealing with third-party software.  Before spending on a large multiprocessor system, get details from the company that makes your software (for which you or your institution may be paying a large amount in yearly licensing fees) about whether it is suitable for multiprocessor systems, and do not be satisfied with answers such as ‘I don’t see why not’.

With Crystalwell in the picture in the consumer space, it becomes a lot more complex when dealing with a large eDRAM/L4 cache in a multiprocessor system.  The system will then need to manage the snooping protocols for larger amounts of memory, making the whole procedure a nightmare for the unfortunate team that might have to deal with it.  Crystalwell makes sense in the server space for single processor systems, perhaps dealing with MPI in clusters, but it might take a while to see it in the multiprocessor world at least.  Fingers crossed…!

53 Comments

  • lmcd - Wednesday, July 03, 2013

    And the rest is probably some combination of BSD or proprietary builds of BSD with "secret sauce."
  • lmcd - Wednesday, July 03, 2013

    Sadly, it seems like it might. Which is ironic given the technical nature of this site -- seems like exploring BSD versus Linux or some complex breakdown would be within the scope of the site.

    Given the coverage of Android, again, it seems relevant exploring the typical architecture of a Linux distribution, the future architecture, the current Android setup, and the current Chrome OS setup.

    I'm hoping for an article, which kinda sucks as I watch pipelines pop by for Microsoft's late C++ efforts, among other failures.
  • lmcd - Wednesday, July 03, 2013

    *at least one article -- didn't make that clear
  • Kevin G - Wednesday, July 03, 2013

    This is actually an old Windows API issue. While a piece of software can scale to a near infinite number of threads per process (only limited by address space), the Windows scheduler will only run a maximum of 32 per process concurrently. Even MS SQL Server only supports a maximum of 32 threads per DB on a single system (MS SQL Server will spawn another process per DB to scale higher as necessary).

    Though with 32 real cores, it may pay off to simply disable HyperThreading for better scaling.
  • mike8675309 - Monday, July 08, 2013

    To clarify, this seems to be an issue with 32bit software running on 64bit hardware and making windows API calls while running under WOW64. A good example is noted in the remarks from the API documentation for the GetLogicalProcessorInformationEx function which describes issues with passing a 64bit KAFFINITY structure to a 32bit client and the side effects that can cause.
    http://msdn.microsoft.com/en-us/library/windows/de...

    As noted in the article by the author, creating software that benefits from NUMA rather than being hamstrung by NUMA requires another layer of knowledge on top of single cpu software development. I'm sure Microsoft has figured out NUMA with MS Sql Server considering the prevalence of multi-cpu solutions for that software product essentially since multi-cpu hardware for windows became common. Note TPC result id 112032702 for NEC running Windows Server R2 Enterprise, and SQL Server 2012 Enterprise on 8 processors, 80 cores, and 160 threads.
  • psyq321 - Friday, July 05, 2013

    Windows has no problems with more than 64 logical CPUs since kernel version 6.1

    The problem is that the >application< itself has to use updated Win32 APIs which allow an extended processor mask to be set.

    If the application is using old Win32 APIs (pre NT Kernel 6.1) then it will only "see" up to 64 logical CPUs.
  • psyq321 - Friday, July 05, 2013

    By the way, the number 32 as the limit of the number of CPUs seen by the app comes from the 32-bit processes.

    With Windows:

    - 32-bit process has 32-bit processor mask for each thread (DWORD)
    - 64-bit process had (pre Win NT 6.1 API) 64-bit mask for each thread (DWORD_PTR)

    If an application needs to access more than 64 CPUs, it has to use the new Win32 APIs that were introduced in Windows 7 / Server 2008 R2

    See here: http://msdn.microsoft.com/en-us/library/windows/ha...

    The keyword is "processor groups", and APIs that deal with group affinity.

    So, I would suggest to the reviewer to get acquainted with this if he intends to keep using Windows Server 2012 (or later) as the test vehicle.

    In the Xeon E5 case based on Sandy Bridge-EP this should still not be a problem, as long as the reviewer uses 64-bit processes, because Xeon EP 4600 does not support more than 64 logical CPUs.

    However, Ivy Bridge EP already can have more than 64 logical processors with the E5 4600 v2 line. Having more than 64 logical CPUs was already possible with Xeon E7 platform based on Boxboro generation, and it will get even more scalable with Ivy Bridge EX.
  • Jaybus - Monday, July 08, 2013

    That processors are grouped is more important than the number of processors. For NUMA architectures, all logical processors belonging to a physical CPU (with or without hyperthreading) will belong to the same group. The SetProcessAffinityMask() Windows function can be used to prevent the scheduler from assigning the process's threads a logical processor that doesn't belong to the same group. This way all threads in that process always run on cores that have the same fast memory access.

    The process affinity mask essentially allows using a subset of the NUMA hardware as if it were a SMP system. If you have, say, 4 processor groups, then you have to manually divide the data up into 4 sections handled by 4 processes so that each group of threads operates on its own section with SMP memory access. MPI is then used to tie the 4 processes together just like using a cluster. The difference is that the message passing on the NUMA system is faster than on a cluster of separate physical servers, but basically it maps the NUMA system as a cluster of independent SMP systems.

    Data dependent algorithms will greatly benefit from using the process affinity mask. Since a system like this doesn't make sense for data independent algorithms, ( where GPU hardware would be faster and cheaper), only software designed for NUMA systems should be compared.
  • aicom - Wednesday, July 03, 2013

    This is exactly why we all aren't running 16+ cores in our desktops. It doesn't make sense for the majority of today's workloads.
  • lmcd - Wednesday, July 03, 2013

    This is more a statement of why unified memory and cache are important to performance computing. I'd like to note that the 6-core 3930X beat the 4770K on all but the few single threaded benchmarks, and the Xeon 8-core (I think it's 8-core?) beat the 3930X.

    There are plenty of applications that scale up with core count. They just don't scale up with multiple sockets and slow interconnects between those cores.