Brownian Motion

When chemicals move around in the air, or dissolved in a liquid, or even travelling through a solid, they diffuse from an area of high concentration to low concentration, or follow Le Chatelier’s principle if an external force is applied.  Typically this movement is described via an overall statistical shift, expressed as a partial differential equation.  However, if we were to pick out an individual chemical and observe its movements, we would see that it moves randomly according to thermal energy – at every step it moves in a random direction until an external force (such as another chemical gets in the way) is encountered.  If we sum up the movements of millions or billions of such events, we get the gradual motion explained by statistics.

The purpose of Brownian motion simulation is to calculate what happens to individual chemicals in circumstances where the finite motion is what dictates the activity occurring.  For example, experiments dealing with single molecule detection rely on how a single molecule moves, or during deposition chemistry, the movement of individual chemicals will dictate how a surface achieves deposition in light of other factors.  The movement of the single particle is often referred to a Random Walk (of which there are several types).

There are two main differences between a random-walk simulation and the finite difference methods used over the previous pages.  The first difference is that in a random walk simulation each particle in the simulation is modelled as a point particle, completely independent in its movement with respect to other particles in the simulation (the diffusion coefficient in this context is indirectly indicative of the medium through which the particle is travelling through).  With point particles, the interactions at boundaries can be quantized with respect to the time in simulation.  From the perspective of independent particles, we can utilise various simulation techniques designed for parallel problems, such as multi-CPU analysis or using GPUs.  To an extent, this negates various commentaries which criticise the random-walk method for being exceedingly slow.

The second difference is the applicability to three dimensional simulations.  Previous methods for tackling three dimensional diffusion (in the field I was in) have relied on explicit finite difference calculation of uncomplicated geometries, which result in large simulation times due to the unstable nature of the discretisation method. The alternating direction implicit finite difference method in three dimensions has also been used, but can suffer from oscillation and stability issues depending on the algorithm method used.  Boundary element simulation can be mathematically complex for this work, and finite element simulation can suffer from poor accuracy.  The random-walk scenario is adaptable to a wide range of geometries and by simple redefinition of boundaries and/or boundary conditions, with little change to the simulation code.

In a random walk simulation, each particle present in the simulation moves in a random direction at a given speed for a time step (determined by the diffusion coefficient).  Two factors computationally come into play – the generation of random numbers, and the algorithm for determining the direction in which the particle moves.

Random number generation is literally a minefield when first jumped into.  Anyone with coding experience should have come across a rand() function that is used to quickly (and messily) generate a random number.  This number as a function of the language library repeats itself fairly regularly, often within 2^15 (32768) steps.  If a simulation is dealing with particles doing 100,000 steps, or billions of particles, this random number generator is not sufficient.  For both CPU and GPU there are a number of free random number generators available, and ones like the Mersenne Twister are very popular.  We use the Ranq1 random number generator found in Numerical Recipes 3rd Edition: The Art of Scientific Computing by Press et al., which features a quick generation which has a repeated stepping of ~1.8 x 10^19.  Depending on the type of simulation contraints, in order to optimize the speed of the simulation, the random numbers can be generated in an array before the threads are issued and the timers started then sent through the function call.  This uses additional memory, but is observed to be quicker than generating random numbers on the fly in the movement function itself.

Random Movement Algorithms

Trying to generate random movement is a little trickier than random numbers.  To move a certain distance from an original point, the particle could end up anywhere on the surface of a sphere.  The algorithm to generate such movement is not as trivial as a random axial angle followed by a random azimuth angle, as this generates movement biased at the poles of a sphere.  By doing some research, I published six methods I had found for generating random points on the surface of a sphere.

For completeness, the notation x ~ U(0,1] means that x is a uniformly generated single-ended random number between 0 and 1.

(i) Cosine Method

In spherical polar coordinates, the solid angle of a sphere, Ω, is defined as the area A covered by that angle divided by the square of the radius, re, as shown in equation 14.  The rate of change of this angle with respect to the area at constant radius gives the following set of equations:

 [14]

        [15]

   [16]

Thus Θ has a cosine weighting.  This algorithm can then be used by following these steps:

Generate     [17]

Generate           [18]

     [19]

    [20]

  [21]

Computationally, 2π can be predefined as a constant, and thus does not require a multiplication every time it is used.

(ii) Normal-Deviate Method

As explained by Knuth, each x, y and z coordinate is randomly generated from a normal distribution with a mean of 0 and variance of 1.  The result is normalised to the surface of the sphere to obtain a uniform distribution on the surface. The following equations describe this method, where λis the normalisation factor:

Let u, v, w ~ N(0,1),  λ = Rm / sqrt(u2+v2+w2)      [22]

x = uλ, y = vλ, z = wλ         [23]

It should be noted that the generation of normally distributed random numbers requires an element of rejection, and thus generating normal random numbers takes significantly longer than generating uniform random numbers.

(iii) Hypercube rejection method

Similar to the normal-deviate method, but each ordinate is uniformly randomly distributed on [-1, 1] (calculated as 2 x U(0,1]-1).  Compute the square root of the sum of squares, and if this value is greater than 1, the triplet is rejected.  If this value is below 1, the vector is normalised and scaled to Rm.

Generate u, v, w ~ 2 * U(0,1] - 1      [24]

Let s2 = u2 + v2 + w2; if s>1, reject, else λ = Rm/s [25]

x = uλ, y = vλ, z = wλ       [26]

As this method requires comparison and rejection, the chance of a triplet being inside the sphere as required is governed by the areas of a cube and sphere, such that 6/3.141 ≈ 1.910 triplets are required to achieve one which passes comparison.

(iv) Trigonometric method

This method is based on the fact that particles will be uniformly distributed on the sphere in one axis.  The proof of this has been published in peer reviewed journals.  The uniform direction, in this case Z, is randomly generated on [-Rm, Rm] to find the circle cross section at that point.  The particle is then distributed at a random angle on the XY plane due to the constraint in Z. 

Generate z ~ 2 x Rm x U(0,1] - Rm  [27]

Generate α ~ 2π x U(0,1]    [28]

Let r = sqrt(Rm2 – z2)      [29]

x = r cos(α)     [30]

y = r sin(α)     [31]

(v) Two-Dimensional rejection method

By combining methods (iii) and (iv), a form of the trigonometric method can be devised without the use of trigonometric functions.  Two random numbers are uniformly distributed along [-1, 1].  If the sum of squares of the two numbers is greater than 1, reject the doublet.  If accepted, the ordinates are calculated based on the proof that points in a single axis are uniformly distributed:

Generate u, v ~ 2 x U(0,1] - 1       [32]

Let s = u2 + v2; if s > 1, reject.     [33]

Let a = 2 x sqrt(1-s); k = a x Rm   [34]

x = ku, y = kv, z = (2s-1)Rm         [35]

Use of this method over the trigonometric method described in (iii) is dependent on whether the trigonometric functions of the system are slower than the combined speed of rejection and regeneration of the random numbers.

(vi) Bipyramidal Method

For completeness, the bipyramidal diffusion modelis also included.  This method generates a random whole number between 0 and 5 inclusive, which dictates a movement in an axis, either positive or negative relative to the starting position.  As a result, the one-dimensional representation of the random-walk, as shown in Figure 3, is played out in each of the three dimensions, and any particles that would have ended up outside the bipyramid from methods (i-v) would now be inside the pyramid.

The method for bipyramidal diffusion is shown below:

Generate  where  is the floor function

      [36]

This method is technically the computationally least expensive in terms of mathematical functions, but due to equation [36], requires a lot of ‘if’ type comparison statements and the speed of these will determine how fast the algorithm is.

Results

The Brownian motion tests are something I have been doing with motherboard reviews for over twelve months in the form of my 3DPM (Three-Dimensional Particle Movement) test in both single threaded and multithreaded form.  The simulation generates a number of particles, iterates them through a number of steps, and then checks the time to see if 10 seconds have passed.  If the 10 seconds are not up, it goes through another loop.  Results are then expressed in the form of million particle movements per second.

3D Particle Movement Single Threaded

3D Particle Movement MultiThreaded

Due to the nature of the simulation, absolutely no memory access are needed during computation - there is a small element at the end for addition.  This makes the DP system excellent for this type of work.  Perhaps unsurprisingly we see no change in our single threaded version with HyperThreading on or off.  However with the multithreaded version, HyperThreading gets a massive 63.3% boost. 

Abusing GPUs

Back when I was researching these methods and implementing them on GPUs, for method (iv), the fastest method, the following results were achieved:

Athlon X2 5050e, 2.6 GHz (2C/2T): 32.38
Dual Xeon E5520, 2.23 GHz (8C/16T): 230.63
NVIDIA Quadro 580, 32 CUDA Cores: 1041.80
NVIDIA GTX 460, 336 CUDA Cores: 13602.94

When the simulation is embarrassingly parallel like this one, GPUs make a lot of difference.  I recently rewrote method (iv) in C++ AMP and ran it on a i7-3770K at stock with a HD7970 also at stock, paired with 2400 C9 memory.  It gave a result of 73654.32.

Two Dimensional Implicit Finite Difference n-Body Simulations
Comments Locked

64 Comments

View All Comments

  • dj christian - Monday, January 14, 2013 - link

    No please!

    This article should be a one time only or once every 2 years at most.
  • nadana23 - Sunday, January 6, 2013 - link

    From the looks of results some of the benchmarks are HIGHLY sensitive to effective bandwidth per thread (ie, GDDR5 feeding a GPU stream processor >> DDR3 feeding a Xeon HT core).

    However - it must be noted that 8x DIMMS is insufficient to achieve full memory bandwidth on Xeon E5 2S!

    I'd suggest throwing a pure memory bandwidth test into the mix to make sure you're actually getting the rated number (51.2GB/s)...

    http://ark.intel.com/products/64596/Intel-Xeon-Pro...

    ... as I strongly suspect your memory config is crippling results.

    Dell's 12G config guidelines are as good a place as any to start on this :-

    http://en.community.dell.com/cfs-file.ashx/__key/c...

    Simply removing one E5-2590 and moving to 1-Package, 8 DIMM config may (counter-intuitively) bench(market) faster... for you.
  • dapple - Sunday, January 6, 2013 - link

    Great article, thanks! This is the sort of benchmark I've been wanting to see for quite some time now - simple, brute-force numerics where the code is visible and straightforward. Too many benchmarks are black boxes with processor- and compiler-specific tunes to make manufacturer "X" appear superior to "Y". That said, it would be most illustrative to perform a similar 'mark using vanilla gcc on both MS and *nix OS.
  • daosis - Sunday, January 6, 2013 - link

    It is long known issue, when windows does not start after changing hardware, especially GPU (not always so). There is as long known trick so. Just before last "power off" one should replace GPU's own driver with basic microsoft's one. In case of GPU it is "standart Vga adapter" (device manager - update driver - browse my computer - let me pick up). In fact one can replace all specific drivers on OS with similiar basic from MS and then to put this hard drive virtually to any system without any need for fresh install. Mind you, that after first boot it takes some time for OS to find and install specific drivers.
  • jamesf991 - Sunday, January 6, 2013 - link

    In the early '70s I was doing very similar simulations using a PDP 11/40 minicomputer. (I can send citations to my publications if anyone is interested.) At Texas Tech and later at Caltech, I simulated systems involving heterogeneous electron transfer kinetics, various chemical reactions in solution, coulostatics, galvanostatics, voltammetry, chronocoulometry, AC voltammetry, migration, double layer effects, solution hydrodynamics (laminar only), etc. Much of this was done on a PDP 11/40, originally with 8K words (= 16K bytes) of core memory. Later the machine was upgraded to 24 K words (!), we got a floating point board, and a hard disk drive (5 M words, IIRC). My research director probably paid in excess of $50K for the hardware. One cute project was to put a simulation "inside" a nonlinear regression routine to solve for electrode kinetic parameters such as k and alpha. Each iteration of the nonlinear solver required a new simulation -- hand-coding the innermost loops using floating point assembly instructions was a big speedup!

    I wonder how the old PDP would stack up against the 3770?
  • flynace - Monday, January 7, 2013 - link

    Do you guys think that once Haswell moves the VRM on package that someone might do a 2 socket mATX board?

    Even if it means giving up 2 of the 4 PCIe slots and/or 2 DIMMs per socket it would be nice to have a high core count standard SFF board for those that need just that.
  • samsp99 - Monday, January 7, 2013 - link

    I found this review interesting, but I don't think this board is really targeted at the HPC market. It seems like it would be good as part of a 2U / 12 + 2 drive system, similar to the Dell C2100. It would make a good virtual host, SQL, active web server etc. Having the 3 mSAS connectors would enable 4 drive each without the need for a SAS expander.

    Servers are designed for 99.999% uptime, remote management, and hands-off operation. To achieve that you need redundent power, UPS, Networking, storage etc. They also require high airflow, which is noisy and not something you want sitting under your desk. Based on that, it makes sense that the MB is intended for sale to system builders not your general build your own enthusiast.

    HW manufactuerers are faced with a similar problem to airlines - consumers gravitate to the cheapest price, and so the only real money to be made is selling higher profit margin products to businesses. Servers are where intel etc makes their profits.

    For the computational problems the author is trying to solve, to me it would seem to be better to consider:
    a) At one point, I think google was using commodity hardware, with custom shelving etc. Assuming the algorithms can be paralleled on different hosts, you shouldn't need the reliability of traditional servers, so why not use a number of commodity systems together, choosing the components that have the best perf/$.

    b) There are machines designed for HPC scenarios, such as HPC Systems E5816 that supports 8x Xeon E7-8000 (10 core) processors, or the E4002G8 - that will take 8 nVidia Tesla cards.

    c) What about developing and testing the software on cheap worstations, and then when you are sure its ready, buying compute time from Amazon cloud services etc.
  • babysam - Monday, January 7, 2013 - link

    It is quite delighting to look at your review on Anandtech (especially when I am using software and computer configurations of similar nature for my studies), as it is quite difficult for me to evaluate the performance gain of "real-life" software (i.e. science oriented in my case) on new hardware before buying.

    From what I have seen in your code segments provided (especially for the n-body simulation part) , there are large amount of floating-point divisions. Is there any possibility that the code is not only limited by the cache size(and thrashing), but by the limited throughput of the floating-point divider? (i.e. The performance degradations when HT is enabled may also be caused by the competition of the two running threads on the only floating-point divider in the core)
  • SanX - Tuesday, January 8, 2013 - link

    if you post zipped sources and exes for anyone to follow, learn, play, argue and eventually improve.

    I'd also preferred to see Fortran sources and benchmarks when possible.

    Intel/AMD should start promote 2/4/8 socket monster mobos for enthusiasts and then general public since this is the beginning of the infinite in time era for multiprocessing.

    Also where are games benchmarks like for example GTA4 which benefits a lot from multicores as well as from GPUs?
  • IanCutress - Wednesday, January 9, 2013 - link

    The n-body simulations are part of the C++ AMP example page, free for everyone to use. The rest of the code is part of a benchmark package I'm creating, hence I only give the loops in the code. Unfortunately I know no Fortran for benchmarks.

    Most mainstream users (i.e. gamers) still debate whether 4 or 6 cores are even necessary, so moving to 2P/4P/8P is a big leap in that regard. Enthusiasts can still get the large machines (a few folders use quad AMD setups) if they're willing to buy from ebay which may not always be wholly legal. You may see 2P/4P/8P becoming more mainstream when we start to hit process node limits.

    Ian

Log in

Don't have an account? Sign up now