Simulations, Memory Requirements and Dual Processors

Throughout my simulation career, it would have been easy enough to just write code, compile it and watch it run.  But the enthusiast and speed freak in me wanted the code to go a little faster, then a little faster still, until learning about the types of memory available and how to prioritize code for them became part of my standard workflow.  There are multiple issues at play, and they come together from all sides of the equation.

First of all, let us discuss at a high level the concept of memory caching on a pseudo-processor. 

The picture above is a loose representation of a dual-core processor: each core is labeled 'P', the registers are labeled 'R', and the thickness of the lines represents relative bandwidth.

The processor has access to a set of registers, which are a high-bandwidth, very small memory store.  These registers hold intermediate calculation data, and their contents are what gets saved and restored when HyperThreading switches contexts.  The processor is also directly linked to an L1 ('level 1') cache, which is the first place it looks when it needs data from memory.  If the data is not in the L1, it looks in the L2 ('level 2'), and so on until the data is found.  The closer the data is to the processor, the quicker it can be accessed and the quicker the calculation completes, so there are large benefits to larger caches.
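
As a rough, hedged illustration (my own sketch, not the benchmark code used in this review), the C++ snippet below walks buffers of increasing size and times the traversal.  Once the working set no longer fits in a given cache level, the time per element typically jumps; the sizes and pass count are arbitrary choices for demonstration.

    // cache_walk.cpp - minimal sketch: time a simple walk over buffers of
    // increasing size to watch the working set fall out of L1, L2 and L3
    // and into main memory.  Sizes and pass counts are illustrative only.
    #include <chrono>
    #include <cstdio>
    #include <vector>

    int main() {
        const std::size_t kSizesKB[] = {16, 64, 256, 1024, 8192, 65536};
        for (std::size_t kb : kSizesKB) {
            std::vector<int> data(kb * 1024 / sizeof(int), 1);
            volatile long long sum = 0;      // volatile: stop the loop being optimised away
            auto start = std::chrono::steady_clock::now();
            for (int pass = 0; pass < 100; ++pass)
                for (std::size_t i = 0; i < data.size(); ++i)
                    sum += data[i];
            auto ns = std::chrono::duration_cast<std::chrono::nanoseconds>(
                          std::chrono::steady_clock::now() - start).count();
            std::printf("%6zu kB working set: %.2f ns per element\n",
                        kb, double(ns) / (100.0 * data.size()));
        }
        return 0;
    }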

In the diagram above, each processor core has its own L1 and L2 caches, but the L3 cache is shared between the cores, allowing each core to probe data held in L3.  What is not shown is that there are also snoop protocols designed to let each core know what is going on in the other core's L2 cache.  With data flying around, maintaining cache coherency is vital.
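
Cache coherency has a visible cost at the code level.  A hedged illustration of this (my own example rather than anything from the review's benchmarks) is false sharing: two threads updating counters that happen to live in the same cache line force the coherency protocol to bounce that line between cores, while padding the counters onto separate lines avoids the traffic.

    // false_sharing.cpp - illustrative sketch of cache coherency traffic.
    // Two threads increment separate counters; when the counters share a
    // cache line the line ping-pongs between cores, when padded apart it
    // does not.  The 64-byte line size is a typical value, not a guarantee.
    #include <atomic>
    #include <chrono>
    #include <cstdio>
    #include <functional>
    #include <thread>

    struct Shared {                          // a and b likely sit in one cache line
        std::atomic<long> a{0}, b{0};
    };
    struct Padded {                          // alignas(64) keeps each counter on its own line
        alignas(64) std::atomic<long> a{0};
        alignas(64) std::atomic<long> b{0};
    };

    template <typename T>
    double run() {
        T counters;
        auto work = [](std::atomic<long>& c) {
            for (long i = 0; i < 10000000; ++i)
                c.fetch_add(1, std::memory_order_relaxed);
        };
        auto start = std::chrono::steady_clock::now();
        std::thread t1(work, std::ref(counters.a));
        std::thread t2(work, std::ref(counters.b));
        t1.join();
        t2.join();
        return std::chrono::duration<double>(std::chrono::steady_clock::now() - start).count();
    }

    int main() {
        std::printf("counters in the same cache line: %.3f s\n", run<Shared>());
        std::printf("counters padded apart          : %.3f s\n", run<Padded>());
        return 0;
    }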

Take an example where we have a simulation running two threads on our imaginary processor, and each thread requires 200 kB of data.  If each L2 cache is 256 kB, then each thread can happily run inside its core's L2, keeping data rates high.  If one core needs data belonging to the other thread, the values are copied into the shared L3 at the expense of time.

Now imagine that our processor supports HyperThreading, which allows us to run two threads on each processor core.  We still have the same amount of hardware, but when one thread performs a memory read or write there is a delay until that operation completes.  While the delay is occurring, the core can save the state of the first thread and get on with the second.
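
A quick way to check how many hardware threads the operating system exposes is std::thread::hardware_concurrency(); note that it reports logical processors (i.e. HyperThreading threads), not physical cores.  A minimal sketch:

    // hw_threads.cpp - query the number of logical processors the OS exposes.
    // On a dual-core processor with HyperThreading this typically reports 4.
    #include <cstdio>
    #include <thread>

    int main() {
        unsigned n = std::thread::hardware_concurrency();   // 0 means "could not be determined"
        std::printf("Logical processors available: %u\n", n);
        return 0;
    }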

The downside to our new HyperThreading scenario comes when we launch a program with four threads and each thread uses 200 kB of memory.  If each L2 cache is only 256 kB, the two threads sharing a core now need a combined 400 kB, which spills over into the L3 cache.  This has the potential to slow down simulations if read and write operations are slow.  (In modern processors, a lot of the on-die logic is dedicated to moving data around so that these memory operations are as quick as possible, prefetching the data it predicts will be needed next.)
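
To make the working-set arithmetic concrete, here is a hedged sketch with sizes chosen to match the hypothetical figures above: each thread repeatedly sums its own ~200 kB buffer.  Run with two threads, each buffer sits comfortably in a private 256 kB L2; run with four threads on our two-core HyperThreaded part, each core's pair of threads wants 400 kB and spills towards L3.

    // working_set.cpp - sketch of per-thread working sets, matching the
    // 200 kB-per-thread example in the text.  The thread count is a
    // command-line argument, so the same code covers the two-thread
    // (fits in L2) and four-thread (spills to L3) cases.
    #include <cstdio>
    #include <cstdlib>
    #include <thread>
    #include <vector>

    int main(int argc, char** argv) {
        const int threads = (argc > 1) ? std::atoi(argv[1]) : 2;
        const std::size_t elems = 200 * 1024 / sizeof(double);   // ~200 kB of doubles per thread

        std::vector<std::thread> pool;
        std::vector<double> results(threads, 0.0);
        for (int t = 0; t < threads; ++t) {
            pool.emplace_back([&results, t, elems]() {
                std::vector<double> local(elems, 1.0);     // this thread's private working set
                double sum = 0.0;
                for (int pass = 0; pass < 1000; ++pass)    // repeated passes keep the data hot
                    for (std::size_t i = 0; i < elems; ++i)
                        sum += local[i];
                results[t] = sum;
            });
        }
        for (auto& th : pool) th.join();
        for (int t = 0; t < threads; ++t)
            std::printf("thread %d sum = %.0f\n", t, results[t]);
        return 0;
    }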

This is the simple case of a dual core processor with HyperThreading.  It gets even more complicated if you add in the concept of dual processors.

If we have two dual-core processors (four cores in total) with HyperThreading (eight threads), the only memory shared between the two processors is the main random access memory.  When a standard program launches multiple threads, the programmer has no say in where those threads end up; the operating system schedules them on whichever processor core is available.  So if one thread needs data from another, several things may occur (a minimal thread-pinning sketch follows the list below):

(1) The thread may be delayed until the other thread is processed
(2) The data may already be on the same processor
(3) The data may be on the other processor, which causes delays
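
If thread placement matters, the threads can be pinned to specific logical processors so that related threads stay on the same physical CPU and share its caches.  The following is a hedged Windows sketch (Windows being the platform used for these benchmarks) using SetThreadAffinityMask; it is illustrative only, and on Linux pthread_setaffinity_np serves the same purpose.

    // affinity.cpp - sketch of pinning worker threads to specific logical
    // processors on Windows so that related threads share one physical CPU's
    // caches.  Illustrative only: error handling and topology detection omitted.
    #include <windows.h>
    #include <cstdio>
    #include <thread>
    #include <vector>

    void worker(int id) {
        // Pin this thread to logical processor 'id' (bit 'id' of the affinity mask).
        SetThreadAffinityMask(GetCurrentThread(), DWORD_PTR(1) << id);
        // ... the simulation work for this thread would go here ...
        std::printf("thread %d is running on logical processor %u\n",
                    id, (unsigned)GetCurrentProcessorNumber());
    }

    int main() {
        const int kThreads = 4;                    // e.g. one thread per logical processor
        std::vector<std::thread> pool;
        for (int t = 0; t < kThreads; ++t)
            pool.emplace_back(worker, t);
        for (auto& th : pool) th.join();
        return 0;
    }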

There are many different types of simulation that can be performed, each with its own way of requesting memory and dealing with threads.  As mentioned on the first page of this review, even within the research group I was in, if two people wrote code to perform the same simulation, the memory requirements of each could be vastly different.  This complicates matters further: when moving to a multithreaded scenario, the simulation that was initially slower may be the one that speeds up the most.

Talking About Simulations

The next few pages each cover a different type of simulation, based on my own experience and code I have written.  Several are finite-difference grid solvers (both explicit and implicit); there is also a Brownian motion test based on six movement algorithms, an n-body simulation, and our usual compression / video editing tests.  The benchmarks written for this review will each be explained briefly, both mathematically and in code.

Comments

  • Hulk - Saturday, January 5, 2013 - link

    I had no idea you were so adept with mathematics. "Consider a point in space..." Reading this brought me back to Finite Element Analysis in college! I am very impressed. Being a ME I would have preferred some flow models using the Navier-Stokes equations, but hey I like chemistry as well.
  • IanCutress - Saturday, January 5, 2013 - link

    I never did any FEM so wouldn't know where to start. The next angle of testing would have been using a C++ AMP Fluid Dynamics Simulation and adjusting the code from the SDK example like with the n-Body testing. If there is enough interest, I could spend a few days organising it for the normal motherboard reviews :)

    Ian
  • mayankleoboy1 - Saturday, January 5, 2013 - link

    How the frick did you get the i7-3770K to *5.4GHZ* ? :shock:
    How the frick did you get the i7-3770K to *5.0GHZ* ? :shock:
  • IanCutress - Saturday, January 5, 2013 - link

    A few members of the Overclock.net HWBot team helped testing by running my benchmark while they were using DICE/LN2/Phase Change for overclocking contests (i.e. not 24/7 runs). The i7-3770K will go over 7 GHz if (a) you get a good chip, (b) cool it down enough, and (c) know what you are doing. If you're interested in competitive overclocking, head over to HWBot, Xtreme Systems or Overclock.net - there are plenty of people with info to help you get started.

    Ian
  • JlHADJOE - Tuesday, January 8, 2013 - link

    The incredible performance of those overclocked Ivy Bridge systems here really hammers home the importance of raw IPC. You can spend a lot of time optimizing code, but IPC is free speed when it's available.
  • jd_tiger - Saturday, January 5, 2013 - link

    http://www.youtube.com/watch?v=Ccoj5lhLmSQ
  • smonsees - Saturday, January 5, 2013 - link

    You might try modifying your algorithm to pin the data to a specific core (therefore cache) to keep the thrashing as low as possible. Google "processor affinity c++". I will admit this adds complexity to your straightforward algorithm. In C#, I would use a parallel loop with a range partition to do it as a starting point: http://msdn.microsoft.com/en-us/library/dd560853.a...
  • nickgully - Saturday, January 5, 2013 - link

    Mr. Cutress,
    Do you think that, with all the virtualized CPUs available, researchers will still build their own systems as something concrete to put into a grant application, versus the power-by-the-hour of cloud computing?

    Thanks.
  • IanCutress - Saturday, January 5, 2013 - link

    We examined both scenarios. Our university had cluster time to buy, and there is always the Amazon cloud. In our calculation, getting a 16 thread machine from Dell paid for itself in under six months of continuous running, and would not require a large adjustment in the way people were currently coding (i.e. staying in Windows rather than moving to Linux), and could also be passed down the research group when newer hardware is released.

    If you are using production level code and manipulating it each time to get results, and you can guarantee the results will be good each time, then power-by-the-hour could work. As we were constantly writing and testing new code for different scenarios, the build/buy your own workstation won out. Having your own system also helps in building GPU codes, if you want to buy a better GPU card it is easier to swap out rather than relying on a cloud computing upgrade.

    Ian
  • jtv - Sunday, January 6, 2013 - link

    One big consideration is who the researchers are. I work in x-ray spectroscopy (as a computational theorist). Experimentalists in this field use some of our codes without wanting to bother with having big computational resources. We have looked at trying to provide some of our codes through some cloud-based service so that it can be used on demand.

    Otherwise I would agree with Ian's reply. When I'm improving code, debugging code, or trying to implement new theoretical approaches I absolutely want my own hardware to do it on.
