Simulations, Memory Requirements and Dual Processors

Throughout my simulation career, it would have been easy enough to just write code, compile it and watch it run.  But the enthusiast and speed freak within me wanted the code to go a little faster, then a little faster still, until learning about types of memory and how to prioritize code became part of my standard workflow.  Several issues interact here: how memory is cached, how threads are scheduled, and how many processors those threads are spread across.

First of all, let us discuss at a high level the concept of memory caching on a pseudo-processor. 

The picture above is a loose representation of a dual core processor, with each core labeled ‘P’ and the registers labeled ‘R’; the thickness of each line represents the bandwidth of that link.

The processor has access to some registers, which are a high-bandwidth, low-capacity memory store.  These registers hold intermediate calculation data, as well as the per-thread state that HyperThreading switches between.  The processor is also directly linked to an L1 (‘level 1’) cache, which is the first place the processor looks if it needs data from memory.  If the data is not in the L1 then it looks in the L2 (‘level 2’), and so on until the data is found.  The closer the data is to the processor, the quicker it can be accessed and the quicker the calculation should be, so a larger cache that keeps more of the working data close to the core pays off.
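The lookup order described above can be sketched as a toy model. This is only an illustration: the cycle costs below are made-up placeholder values, not measurements of any real processor, and the names `ASSUMED_COST_CYCLES` and `lookup` are hypothetical.

```python
# Toy model of the lookup order: L1, then L2, then L3, then main memory.
# ASSUMPTION: the cycle costs are illustrative placeholders, not real latencies.
ASSUMED_COST_CYCLES = {"L1": 4, "L2": 12, "L3": 40, "RAM": 200}
LOOKUP_ORDER = ["L1", "L2", "L3", "RAM"]


def lookup(address, contents):
    """Return (level_found, total_cycles) for a single read.

    `contents` maps a cache level name to the set of addresses it holds.
    RAM is treated as holding every address.
    """
    total = 0
    for level in LOOKUP_ORDER:
        total += ASSUMED_COST_CYCLES[level]  # pay for probing this level
        if level == "RAM" or address in contents.get(level, set()):
            return level, total
    raise AssertionError("unreachable: RAM always hits")


if __name__ == "__main__":
    caches = {"L1": {0x10}, "L2": {0x10, 0x20}, "L3": {0x10, 0x20, 0x30}}
    for addr in (0x10, 0x20, 0x30, 0x40):
        print(hex(addr), lookup(addr, caches))
```

Each miss adds the cost of the level that was probed, which is why data found further from the core is more expensive in this model.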

In the diagram above, each processor core has its own L1 cache and L2 cache, but a shared L3 cache.  This allows each core to probe the data in L3.  What is not shown is that there are some snoop protocols designed to let each core know what is going on in another core’s L2 cache.  With data flying around it is most important to maintain cache coherency.

Take an example where we have a simulation running two threads on our imaginary processor, one per core, and each thread requires 200 kB of data.  If our L2 cache is 256 kB then each thread can run entirely inside its own core's L2, keeping data rates high.  If a core needs data belonging to the thread on the other core, the values are copied through L3 at the expense of time.

Now imagine that our processor supports HyperThreading.  This allows us to run two threads on each processor core.  We still have the same amount of hardware, but when one thread performs a memory read or write, it stalls until the operation is confirmed.  During that stall, the processor core can save the state of the first thread and move on with the second.

The downside to our new HyperThreading scenario appears when we launch a program with four threads, each using 200 kB of memory.  If our L2 cache is only 256 kB, then the combined 400 kB of data on each core spills over into our L3 cache.  This can slow down simulations if the read and write operations to L3 are slow.  (In modern processors, a lot of the logic built into the processor is designed to move data around so that these memory operations are as quick as possible; it goes ahead and predicts which data is needed next.)
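The arithmetic behind these two scenarios can be written out as a short sketch. It is a deliberate simplification: it assumes the L2 holds nothing but thread data and ignores code, associativity and prefetching. The function name `l2_spill_kb` is hypothetical.

```python
def l2_spill_kb(threads_per_core, kb_per_thread, l2_kb):
    """Return how many kB of one core's working set do not fit in its L2.

    ASSUMPTION: the whole L2 is available for thread data.
    """
    working_set_kb = threads_per_core * kb_per_thread
    return max(0, working_set_kb - l2_kb)


if __name__ == "__main__":
    # One 200 kB thread per core fits inside a 256 kB L2.
    print(l2_spill_kb(1, 200, 256))
    # Two 200 kB threads per core (HyperThreading) overflow the same L2.
    print(l2_spill_kb(2, 200, 256))
```

With the numbers from the text, one thread per core leaves 56 kB of the L2 free, while two threads per core push 144 kB of data out towards the L3.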

This is the simple case of a dual core processor with HyperThreading.  It gets even more complicated if you add in the concept of dual processors.

If we have two dual core processors (four cores total) with HyperThreading (eight threads), the only memory shared between the processors is the main random access memory.  When a standard program launches multiple threads, the programmer has no say in where those threads end up; they may be run in any order on whatever processor core is available.  Thus if one thread needs data from another, several things may occur:

(1) The thread may be delayed until the other thread is processed
(2) The data may already be on the same processor
(3) The data may be on the other processor, which causes delays
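Cases (2) and (3) can be illustrated with a small sketch that names the closest memory level two threads have in common, following the simplified picture in this article (private L1 and L2 per core, one shared L3 per processor, main memory shared between processors). The `Placement` record and the `shared_level` function are hypothetical names used only for illustration; real placement is decided by the operating system scheduler.

```python
from collections import namedtuple

# Hypothetical record of where a thread happened to land.
Placement = namedtuple("Placement", ["socket", "core"])


def shared_level(a, b):
    """Name the closest memory level shared by two thread placements.

    ASSUMPTION: the simplified topology from the article, nothing more.
    """
    if a.socket != b.socket:
        return "RAM"  # separate processors only share main memory
    if a.core != b.core:
        return "L3"   # cores on one processor share the L3
    return "L1"       # HyperThreading siblings share the core's own caches


if __name__ == "__main__":
    producer = Placement(socket=0, core=0)
    for consumer in (Placement(0, 0), Placement(0, 1), Placement(1, 0)):
        print(consumer, "->", shared_level(producer, consumer))
```

The further down the list the shared level is, the longer the consuming thread waits for its data.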

There are many different types of simulation that can be performed, each with its own way of requesting memory or dealing with threads.  As mentioned on the first page of this review, even in the research group I was in, if two people wrote code to perform the same simulation, the memory requirements of each could be vastly different.  This complicates matters further, because when moving to a multithreaded scenario the initially slower simulation might be the one that speeds up the most.

Talking About Simulations

The next few pages will each cover a different type of simulation in turn, based on my own experiences and what I have coded up.  Several are based on finite-difference grid solvers (both explicit and implicit); we also have a Brownian motion test based on six movement algorithms, an n-body simulation, and our usual compression / video editing tests.  The ones we have written for this review will be explained briefly both mathematically and in code.


64 Comments


  • mayankleoboy1 - Saturday, January 5, 2013

    Ian:

    How much difference do you think Xeon Phi will make in these very different types of computations?
    Will buying a Xeon Phi "pay itself out" as you said in the above comments? (Or is Xeon Phi Linux only?)
  • IanCutress - Saturday, January 5, 2013

    As far as we know, Xeon Phi will be released for Linux only to begin with. I have friends who have been able to play with them so far, and they are getting 700+ GFlops in double precision DGEMM.

    It always comes down to the algorithm with these codes. It seems that if you have single precision code that doesn't mind being in a 2P system, then the GPU route may be preferable. If not, then Phi is an option. I'm hoping to get my hands on one inside H1 this year. I just have to get my hands dirty with Linux as well.

    In terms of the codes used here, if I were to guess, the Implicit Finite Difference would probably benefit a lot from Xeon Phi if it works the way I hope it does.

    Ian
  • mayankleoboy1 - Saturday, January 5, 2013

    Rather stupid question, but have you tried using PGO builds?
    Also, do you build the code with the default optimizations, or use the MSVC equivalent switch of -O2?
  • IanCutress - Saturday, January 5, 2013

    Using Visual Studio 2012, all the speed optimisations were enabled including /GL, /O2, /Ot and /fp:fast. For each part I analysed the sections which took the most time using the Performance Analysis tools, and tried to avoid the long memory reads. Hence the Ex-FD uses an iterative loading, which actually boosts speed by a good 20-30% compared to running without it.

    Ian
  • Klimax - Sunday, January 6, 2013

    Interesting. Why not /Ox (all optimisations on)?

    BTW: Do you have access to VTune?
  • IanCutress - Wednesday, January 9, 2013

    I avoided /Ox in case it favours memory over speed in an attempt to balance optimisations. As speed is priority #1, it made more sense to me to optimise for that only. If VS2012 gave more options, I'd adjust accordingly.

    Never heard of VTune, but I did use the Performance Analysis tools in VS2012 to optimise certain parts of the code.

    Ian
  • Beenthere - Saturday, January 5, 2013

    Business and mobo makers do not use 2P mobos to get high benches or performance bragging rights per se. These systems are built for bullet-proof reliability and uptime. It does no good for a mobo/system to be 3% faster if it crashes while running a month-long analysis. These 2P mobos are about 100% reliability, something rarely found in an enthusiast's mobo.

    Enterprise mobos are rarely sold by enthusiast marketeers. Newegg has a few enterprise mobos listed primarily because they have started a Newegg Biz website to expand their revenue streams. They don't have much in the line of true enterprise hardware however. It's a token offering because manufacturers are not likely to support whoring of the enterprise market lest they lose all of their quality vendors who provide customer technical product support.
  • psyq321 - Sunday, January 6, 2013

    Actually, ASUS Z9PE-D8 WS allows for some overclocking capabilities.

    CPU overclocking with 2P/4P Xeon E5 (2600/4600 sequence) is a no-go because Intel explicitly did not store proper ICC data so it is impossible to manipulate BCLK meaningfully (set the different ratios). Oh, and the multipliers are locked :)

    However, Z9PE-D8 WS allows memory overclocking - I managed to run 100% 24/7 stable with the Samsung ECC 1600 DDR3 "low voltage" RAM (16 GB sticks) - just switching memory voltage from 1.35 V to 1.55 V allows overclocking memory from 1600 MHz to 2133 MHz.

    Why would anyone want to do that in a scientific or b2b environment? The only usage I can see is in applications where memory I/O is the biggest bottleneck. Large-scale neural simulations are one such application, and getting 10 GB/s more of memory I/O can help a lot - especially if stable.

    Also, low-latency trading applications are known to benefit from overclocked hardware and it is, in fact, used in production environments.

    Modern hardware does tend to have larger headroom between the manufacturer's operating point and the limits - if the benefit from an overclock outweighs the work invested to find the point where the results become unstable - and, of course, the shorter life span of the hardware - then it can be used. And it is used, for example in some trading scenarios.
  • Drazick - Saturday, January 5, 2013

    Will You, Please, Update Your Google+ Page?

    It would be much easier to follow you there.
  • Ryan Smith - Saturday, January 5, 2013

    Our Google+ page is just a token page. If you wish to follow us then your best option is to follow our RSS feeds.
