Brownian Motion

When chemicals move around in the air, dissolved in a liquid, or even travelling through a solid, they diffuse from an area of high concentration to an area of low concentration, or follow Le Chatelier’s principle if an external force is applied.  Typically this movement is described via an overall statistical shift, expressed as a partial differential equation.  However, if we were to pick out an individual chemical and observe its movements, we would see that it moves randomly according to its thermal energy – at every step it moves in a random direction until it encounters an external influence (such as another chemical getting in the way).  If we sum up the movements of millions or billions of such events, we get the gradual motion described by the statistics.

The purpose of a Brownian motion simulation is to calculate what happens to individual chemicals in circumstances where this discrete, particle-by-particle motion is what dictates the activity occurring.  For example, experiments dealing with single molecule detection rely on how a single molecule moves, and in deposition chemistry the movement of individual chemicals dictates how deposition onto a surface proceeds in light of other factors.  The movement of a single particle in this way is often referred to as a random walk (of which there are several types).

There are two main differences between a random-walk simulation and the finite difference methods used over the previous pages.  The first difference is that in a random-walk simulation each particle is modelled as a point particle, completely independent in its movement with respect to the other particles in the simulation (the diffusion coefficient in this context is indirectly indicative of the medium through which the particle is travelling).  With point particles, the interactions at boundaries can be quantized with respect to the simulation time step.  Because the particles are independent, we can utilise simulation techniques designed for parallel problems, such as multi-CPU analysis or GPUs.  To an extent, this counters the common criticism that the random-walk method is exceedingly slow.

The second difference is the applicability to three-dimensional simulations.  Previous methods for tackling three-dimensional diffusion (in the field I was in) have relied on explicit finite difference calculation of uncomplicated geometries, which results in long simulation times due to the conditional stability of the discretisation method.  The alternating direction implicit finite difference method in three dimensions has also been used, but can suffer from oscillation and stability issues depending on the algorithm used.  Boundary element simulation can be mathematically complex for this work, and finite element simulation can suffer from poor accuracy.  The random-walk scenario, by contrast, is adaptable to a wide range of geometries by simple redefinition of the boundaries and/or boundary conditions, with little change to the simulation code.

In a random walk simulation, each particle present in the simulation moves in a random direction at a given speed for each time step (the step length being determined by the diffusion coefficient).  Two factors come into play computationally – the generation of random numbers, and the algorithm for determining the direction in which the particle moves.

Random number generation is a minefield when first jumped into.  Anyone with coding experience will have come across a rand() function that is used to quickly (and messily) generate a random number.  Depending on the language library, this number repeats itself fairly regularly, often within 2^15 (32768) steps.  If a simulation is dealing with particles doing 100,000 steps each, or billions of particles, this random number generator is not sufficient.  For both CPU and GPU there are a number of free random number generators available, and ones like the Mersenne Twister are very popular.  We use the Ranq1 random number generator found in Numerical Recipes 3rd Edition: The Art of Scientific Computing by Press et al., which offers quick generation with a period of ~1.8 x 10^19.  Depending on the simulation constraints, in order to optimize the speed of the simulation, the random numbers can be generated in an array before the threads are issued and the timers started, and then passed through the function call.  This uses additional memory, but is observed to be quicker than generating random numbers on the fly in the movement function itself.
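As a rough illustration of what such a generator looks like, the sketch below follows the shape of the Ranq1 recipe – a 64-bit xorshift followed by a multiply.  The constants and structure are quoted from memory of the Numerical Recipes text, so treat this as a sketch rather than a verified copy of the code used here:

```cpp
#include <cstdint>

// Sketch of a Ranq1-style generator: three 64-bit xorshift steps followed by a multiply.
// Constants follow the Numerical Recipes recipe as best I recall; period is ~1.8 x 10^19.
struct Ranq1 {
    std::uint64_t v;
    explicit Ranq1(std::uint64_t seed) : v(4101842887655102017ULL) {
        v ^= seed;   // mix the user seed into the recommended starting state
        v = next();
    }
    std::uint64_t next() {
        v ^= v >> 21;
        v ^= v << 35;
        v ^= v >> 4;
        return v * 2685821657736338717ULL;
    }
    double doub() {  // uniform double in the unit interval (the scale factor is 1/2^64)
        return 5.42101086242752217e-20 * next();
    }
};
```

In the array-based approach mentioned above, a large buffer of doub() values would be filled before the worker threads are launched, and the movement function would simply read from that buffer.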

Random Movement Algorithms

Generating random movement is a little trickier than generating random numbers.  To move a certain distance from an original point, the particle could end up anywhere on the surface of a sphere.  The algorithm to generate such movement is not as trivial as picking a random polar angle followed by a random azimuthal angle, as this generates movement biased towards the poles of the sphere.  After doing some research, I published the six methods I had found for generating random points on the surface of a sphere.

For completeness, the notation x ~ U(0,1] means that x is a uniformly generated single-ended random number between 0 and 1, i.e. greater than 0 and up to and including 1.  The distance moved in a single time step (the radius of the sphere of possible end points) is denoted Rm.

(i) Cosine Method

In spherical polar coordinates, the solid angle of a sphere, Ω, is defined as the area A covered by that angle divided by the square of the radius, r, as shown in equation [14].  The rate of change of this angle with respect to the area at constant radius gives the following set of equations:

Ω = A / r²      [14]

dA = r² sin(Θ) dΘ dα      [15]

dΩ = dA / r² = sin(Θ) dΘ dα      [16]

Integrating over the azimuthal angle α shows that the polar angle is weighted by sin(Θ), so cos(Θ) is uniformly distributed – thus Θ has a cosine weighting.  This algorithm can then be used by following these steps:

Generate Θ = arccos(2 × U(0,1] - 1)      [17]

Generate α ~ 2π × U(0,1]      [18]

x = Rm sin(Θ) cos(α)      [19]

y = Rm sin(Θ) sin(α)      [20]

z = Rm cos(Θ)      [21]

Computationally, 2π can be predefined as a constant, so the multiplication of 2 by π is not required every time it is used.
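Below is a minimal sketch in C++ of the cosine method as laid out in the steps above.  The Point struct, the function name and the use of the standard library's Mersenne Twister are illustrative choices of mine rather than the original code:

```cpp
#include <cmath>
#include <random>

struct Point { double x, y, z; };
const double TWO_PI = 6.283185307179586;   // predefined constant, as noted above

// One random step of length Rm using the cosine method.
Point cosine_step(std::mt19937_64& rng, double Rm) {
    std::uniform_real_distribution<double> U(0.0, 1.0);
    double theta = std::acos(2.0 * U(rng) - 1.0);   // polar angle with cosine weighting
    double alpha = TWO_PI * U(rng);                 // uniform azimuthal angle
    double s = std::sin(theta);
    return { Rm * s * std::cos(alpha), Rm * s * std::sin(alpha), Rm * std::cos(theta) };
}
```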

(ii) Normal-Deviate Method

As explained by Knuth, each of the x, y and z coordinates is randomly generated from a normal distribution with a mean of 0 and variance of 1.  The result is then normalised to the surface of the sphere to obtain a uniform distribution on that surface.  The following equations describe this method, where λ is the normalisation factor:

Let u, v, w ~ N(0,1),  λ = Rm / sqrt(u² + v² + w²)      [22]

x = uλ, y = vλ, z = wλ         [23]

It should be noted that the generation of normally distributed random numbers requires an element of rejection, and thus generating normal random numbers takes significantly longer than generating uniform random numbers.
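A corresponding sketch of the normal-deviate method is shown below; std::normal_distribution stands in for whichever Gaussian generator is actually employed, and the names are again illustrative:

```cpp
#include <cmath>
#include <random>

struct Point { double x, y, z; };

// One random step of length Rm: three N(0,1) deviates normalised onto the sphere.
Point normal_deviate_step(std::mt19937_64& rng, double Rm) {
    std::normal_distribution<double> N(0.0, 1.0);
    double u = N(rng), v = N(rng), w = N(rng);
    double lambda = Rm / std::sqrt(u * u + v * v + w * w);   // normalisation factor
    return { u * lambda, v * lambda, w * lambda };
}
```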

(iii) Hypercube Rejection Method

Similar to the normal-deviate method, but each ordinate is uniformly distributed on [-1, 1] (calculated as 2 × U(0,1] - 1).  Compute the square root of the sum of squares; if this value is greater than 1, the triplet is rejected.  If this value is 1 or below, the vector is normalised and scaled to Rm.

Generate u, v, w ~ 2 × U(0,1] - 1      [24]

Let s = sqrt(u² + v² + w²); if s > 1, reject, else λ = Rm/s      [25]

x = uλ, y = vλ, z = wλ       [26]

As this method requires comparison and rejection, the chance of a triplet being inside the sphere as required is governed by the ratio of the volumes of the cube and the sphere, such that on average 6/π ≈ 1.910 triplets are required to achieve one which passes the comparison.
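A sketch of the hypercube rejection method follows; on average the loop body runs about 1.91 times per accepted step, and the names are illustrative:

```cpp
#include <cmath>
#include <random>

struct Point { double x, y, z; };

// One random step of length Rm using hypercube rejection.
Point hypercube_step(std::mt19937_64& rng, double Rm) {
    std::uniform_real_distribution<double> U(-1.0, 1.0);
    for (;;) {
        double u = U(rng), v = U(rng), w = U(rng);
        double s = std::sqrt(u * u + v * v + w * w);
        if (s > 1.0 || s == 0.0) continue;   // reject triplets outside the unit sphere
        double lambda = Rm / s;              // scale the accepted vector onto the sphere
        return { u * lambda, v * lambda, w * lambda };
    }
}
```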

(iv) Trigonometric Method

This method is based on the fact that points uniformly distributed on the surface of a sphere are also uniformly distributed when projected onto any single axis.  The proof of this has been published in peer-reviewed journals.  The uniform direction, in this case Z, is randomly generated on [-Rm, Rm] to find the circular cross-section of the sphere at that height.  The particle is then placed at a random angle in the XY plane, subject to the constraint in Z.

Generate z ~ 2 × Rm × U(0,1] - Rm      [27]

Generate α ~ 2π × U(0,1]      [28]

Let r = sqrt(Rm² - z²)      [29]

x = r cos(α)     [30]

y = r sin(α)     [31]
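A sketch of the trigonometric method in the same style; z is drawn uniformly on [-Rm, Rm] and the random azimuth places the point on the circular cross-section at that height:

```cpp
#include <cmath>
#include <random>

struct Point { double x, y, z; };
const double TWO_PI = 6.283185307179586;

// One random step of length Rm using the trigonometric method.
Point trigonometric_step(std::mt19937_64& rng, double Rm) {
    std::uniform_real_distribution<double> U(0.0, 1.0);
    double z = 2.0 * Rm * U(rng) - Rm;        // uniform along the chosen axis
    double alpha = TWO_PI * U(rng);           // uniform azimuthal angle
    double r = std::sqrt(Rm * Rm - z * z);    // radius of the cross-section at height z
    return { r * std::cos(alpha), r * std::sin(alpha), z };
}
```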

(v) Two-Dimensional Rejection Method

By combining methods (iii) and (iv), a form of the trigonometric method can be devised without the use of trigonometric functions.  Two random numbers are uniformly distributed on [-1, 1].  If the sum of the squares of the two numbers is greater than 1, the doublet is rejected.  If accepted, the ordinates are calculated based on the proof that points projected onto a single axis are uniformly distributed:

Generate u, v ~ 2 × U(0,1] - 1      [32]

Let s = u² + v²; if s > 1, reject.      [33]

Let a = 2 × sqrt(1 - s); k = a × Rm      [34]

x = ku, y = kv, z = (2s - 1)Rm      [35]

Use of this method over the trigonometric method described in (iv) depends on whether the trigonometric functions of the system are slower than the combined cost of rejection and regeneration of the random numbers.
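A sketch of the two-dimensional rejection method; by an analogous rejection argument to (iii), roughly 4/π ≈ 1.27 doublets are needed per accepted step:

```cpp
#include <cmath>
#include <random>

struct Point { double x, y, z; };

// One random step of length Rm using the two-dimensional rejection method.
Point two_dim_rejection_step(std::mt19937_64& rng, double Rm) {
    std::uniform_real_distribution<double> U(-1.0, 1.0);
    for (;;) {
        double u = U(rng), v = U(rng);
        double s = u * u + v * v;
        if (s > 1.0) continue;                    // reject doublets outside the unit circle
        double k = 2.0 * std::sqrt(1.0 - s) * Rm;
        return { k * u, k * v, (2.0 * s - 1.0) * Rm };
    }
}
```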

(vi) Bipyramidal Method

For completeness, the bipyramidal diffusion model is also included.  This method generates a random whole number between 0 and 5 inclusive, which dictates a movement along one axis, either positive or negative relative to the starting position.  As a result, the one-dimensional representation of the random walk, as shown in Figure 3, is played out in each of the three dimensions, and any particles that would have ended up outside the bipyramid using methods (i-v) would now end up inside the bipyramid.

The method for bipyramidal diffusion is shown below:

Generate q = ⌊6 × U(0,1]⌋, where ⌊ ⌋ is the floor function, giving a whole number q between 0 and 5

q = 0: x = x + Rm;   q = 1: x = x - Rm
q = 2: y = y + Rm;   q = 3: y = y - Rm
q = 4: z = z + Rm;   q = 5: z = z - Rm      [36]

This method is technically the least computationally expensive in terms of mathematical functions, but due to equation [36] it requires a number of ‘if’-type comparison statements, and the speed of these will determine how fast the algorithm is.
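A sketch of the bipyramidal method is shown below; the switch statement stands in for the comparison chain, and the particular mapping of q to axis and direction is an illustrative choice:

```cpp
#include <random>

struct Point { double x, y, z; };

// One step of length Rm along a randomly chosen axis, in a random direction.
Point bipyramid_step(std::mt19937_64& rng, double Rm) {
    std::uniform_int_distribution<int> D(0, 5);   // whole number between 0 and 5 inclusive
    Point p{0.0, 0.0, 0.0};
    switch (D(rng)) {
        case 0: p.x = +Rm; break;
        case 1: p.x = -Rm; break;
        case 2: p.y = +Rm; break;
        case 3: p.y = -Rm; break;
        case 4: p.z = +Rm; break;
        case 5: p.z = -Rm; break;
    }
    return p;
}
```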

Results

The Brownian motion tests are something I have been running with motherboard reviews for over twelve months in the form of my 3DPM (Three-Dimensional Particle Movement) benchmark, in both single threaded and multithreaded form.  The simulation generates a number of particles, iterates them through a number of steps, and then checks the time to see if 10 seconds have passed.  If the 10 seconds are not up, it goes through another loop.  Results are then expressed as millions of particle movements per second.
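As a rough illustration of that timing structure (the function and variable names here are hypothetical, not the actual 3DPM code):

```cpp
#include <chrono>
#include <cstdint>

// Repeat batches of particle movements until ~10 seconds have elapsed, then
// report the score as millions of particle movements per second.
// do_batch stands in for one loop that moves every particle through 'steps' random steps.
template <typename Batch>
double run_benchmark(Batch do_batch, std::uint64_t particles, std::uint64_t steps) {
    using clock = std::chrono::steady_clock;
    const auto start = clock::now();
    std::uint64_t movements = 0;
    double elapsed = 0.0;
    do {
        do_batch(particles, steps);
        movements += particles * steps;
        elapsed = std::chrono::duration<double>(clock::now() - start).count();
    } while (elapsed < 10.0);
    return movements / elapsed / 1.0e6;   // million particle movements per second
}
```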

[Graph: 3D Particle Movement – Single Threaded]

[Graph: 3D Particle Movement – MultiThreaded]

Due to the nature of the simulation, absolutely no memory accesses are needed during computation – there is only a small element of addition at the end.  This makes the DP system excellent for this type of work.  Perhaps unsurprisingly, we see no change in our single threaded version with HyperThreading on or off.  However, with the multithreaded version, HyperThreading gives a massive 63.3% boost.

Abusing GPUs

Back when I was researching these methods and implementing them on GPUs, the following results were achieved for method (iv), the fastest method:

Athlon X2 5050e, 2.6 GHz (2C/2T): 32.38
Dual Xeon E5520, 2.23 GHz (8C/16T): 230.63
NVIDIA Quadro 580, 32 CUDA Cores: 1041.80
NVIDIA GTX 460, 336 CUDA Cores: 13602.94

When the simulation is embarrassingly parallel like this one, GPUs make a lot of difference.  I recently rewrote method (iv) in C++ AMP and ran it on an i7-3770K at stock with an HD 7970, also at stock, paired with 2400 C9 memory.  It gave a result of 73654.32.
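For flavour, here is a rough sketch of how a method (iv) kernel might be expressed in C++ AMP.  The per-thread inline xorshift, the seeding from the particle index and all of the names are illustrative assumptions rather than the code that produced the number above (a production version would want a proper parallel random number generator), but it shows the one-thread-per-particle pattern that makes the problem embarrassingly parallel:

```cpp
#include <amp.h>
#include <amp_math.h>
#include <vector>
using namespace concurrency;

// Hypothetical sketch: one GPU thread per particle, 'steps' moves of method (iv) each.
void move_particles_amp(std::vector<float>& x, std::vector<float>& y,
                        std::vector<float>& z, int steps, float Rm) {
    const int n = static_cast<int>(x.size());
    array_view<float, 1> ax(n, x), ay(n, y), az(n, z);
    parallel_for_each(ax.extent, [=](index<1> i) restrict(amp) {
        // Crude per-thread xorshift32 state, seeded from the particle index (illustrative only).
        unsigned int s = 0x9E3779B9u * (static_cast<unsigned int>(i[0]) + 1u);
        float px = ax[i], py = ay[i], pz = az[i];
        for (int k = 0; k < steps; ++k) {
            s ^= s << 13; s ^= s >> 17; s ^= s << 5;
            float u1 = s * 2.3283064e-10f;          // roughly uniform in [0,1)
            s ^= s << 13; s ^= s >> 17; s ^= s << 5;
            float u2 = s * 2.3283064e-10f;
            float dz = 2.0f * Rm * u1 - Rm;         // uniform along z
            float rr = Rm * Rm - dz * dz;
            if (rr < 0.0f) rr = 0.0f;               // guard against float rounding
            float r  = fast_math::sqrtf(rr);        // cross-section radius
            float a  = 6.2831853f * u2;             // uniform azimuth
            px += r * fast_math::cosf(a);
            py += r * fast_math::sinf(a);
            pz += dz;
        }
        ax[i] = px; ay[i] = py; az[i] = pz;
    });
    ax.synchronize(); ay.synchronize(); az.synchronize();
}
```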

Comments

  • nevertell - Sunday, January 6, 2013 - link

    The K version may not, but the standard i7-3770 does in fact support VT-d, TXT and ECC memory from the get go. VT-d also has to be supported by the motherboard, which may be problematic on consumer motherboards. I have an i5-2400 myself, and VT-d is a pain to set up; to this day I still haven't found out whether it is that I am unable to set up Xen properly, or that my cheap motherboard won't support VT-d, to properly assign a video card to a virtual machine.
  • KAlmquist - Sunday, January 6, 2013 - link

    The 3770K lacks those features, but that doesn't invalidate my point.

    Using ECC memory improves system availability, and likely decreases the probability of undetected errors resulting in incorrect computations. If these are important to you, then you should be thinking about full double or triple redundancy. Why not buy three 3770K based systems and run the same simulation on all of them? Most of the time you will get identical results on all three systems, but on rare occasions one of the systems will die during the run. No problem; you have the simulation results from the other two systems. On even rarer occasions, one of the systems will produce an incorrect result due to an undetected bit error. Again no problem; you take the results from the two simulations that agree.

    With full redundancy it doesn't matter where in the system the error occurs because full redundancy addresses faults anywhere in the system. This makes it superior to ECC memory, which only addresses faults in the memory subsystem. So the only reason to go with ECC memory instead of full triple redundancy is if the ECC memory approach costs less. Based on the numbers I posted, you aren't going to get a lower cost based on hardware costs alone. Possibly you could get there by including administrative costs and the like.

    I'm not saying that the system Ian tested wouldn't make sense under *any* circumstances. My point is that the system has a poor price performance ratio, so it only makes sense when a lot of things are working in its favor.

    The second feature you mention is VT-D, which makes it more efficient to emulate device hardware in virtual machines. I don't have any benchmarks, but my guess is that the performance improvement from VT-D is fairly small. In any case, if you want VT-D you can buy the 3770 rather than the 3770K. You can't overclock the 3770, but my comments about the 3770K offering "similar performance" were based primarily on the performance of the 3770K at stock frequency. If you assume that everyone is going to take the time to find an optimal overclock for their CPU, then the E5-2690 (which cannot be overclocked) looks even worse.

    I suppose it's off topic to debate the merits of "trusted execution technology" here, so I will simply note that if for whatever reason you want a processor that supports it, the solution is the same as for VT-D: get the 3770 instead of the 3770K.
  • Kevin G - Saturday, January 5, 2013 - link

    A very well written article that sticks to its purpose: scientific computing. Really pleased to see articles like this on the site, even if I have a few minor quibbles.

    On page 2 "To those unfamiliar with server boards, of note is the connector just to the right of center of the picture above." is either oddly worded to describe the front panel connector at the bottom of the board (which is indeed right of center but not in the center of the picture) or describing a connector that isn't even documented in the manual. For clarification I'm looking at the connector just right of the top PCI-E 16x slot (above and to the left of the battery). Actually, what is that connector labeled as? I've seen it on other Xeon boards but have never seen it used.

    The last paragraph on page 2 omits the possibility of unbuffered ECC memory and implies the usage of unbuffered non-ECC memory. I haven't found confirmation that this board can accept unbuffered, non-ECC memory (as opposed to the possibility of an ECC requirement, as some server vendors enforce).

    A couple of notes on the little processor talk on page 6. Dealing with cache thrashing between L3 and L2 is possible, but when dealing with a high number of threads general coherency becomes a bigger factor. The overhead is beginning to exceed the benefit of having the additional hardware to run them. If you're lucky enough to be dealing with an algorithm that doesn't need such coherency overhead, then chances are it is very ideal for GPU compute (and memory capacity isn't a factor). A minor nicety would have been to see some more testing without Hyperthreading on the i7-3770k, i7-3930k, and i7-3960X to better indicate scaling with/without Hyperthreading. I suspect that those single socket processors would have been able to show some small gains with Hyperthreading where the dual socket system did not.

    An extension to the L2/L3 cache talk on page 6 is the move to dual sockets and NUMA. There is a performance penalty due to latency for having one thread access memory that is found on a remote socket. Memory mirroring between sockets can eliminate that remote penalty while increasing RAS but at the cost of halving effective memory capacity. The manual isn't clear if mirroring mode or the lockstep mode is across different sockets (it can be done across memory channels as well).

    I'd also have loved to hear some comparisons with the Gigabyte GA-X79S-UP5. While the name implies an X79 chipset, it uses the C606 chipset. It'll support ECC memory with socket 2011 Xeons and plenty of overclocking features (for the daring). Comparing the GA-7PESH1 to the GA-X79S-UP5 would have helped answer whether the move to dual sockets is worth the extra cost.
  • Hakon - Saturday, January 5, 2013 - link

    Somehow this does read like an anonymous peer review :-)
  • Kevin G - Sunday, January 6, 2013 - link

    A little bit. :)

    Part of my criticism isn't about the article itself but rather the general state of massively multithreaded hardware and software. The hardware portion is quickly running into software limitations that were never expected to be reached in the professional space. A decade ago, who would have thought that a scientist could purchase a 240 simultaneous thread processor that would fit on a mere expansion card? In some cases we don't reach Amdahl's Law before hitting an artificial barrier due to scheduling or coherency overhead.

    I just noticed that the system was using Win 7 Professional, which has a limit of 64 concurrent threads per process. A quad socket LGA 2011 config would actually be at the very limit of what Windows 7 (or rather 2008R2, since Professional only scales to two sockets) can handle. OpenMP can handle more than 64 concurrent threads, but on Windows it has to submit to this limitation.
  • psyq321 - Sunday, January 6, 2013 - link

    As for the GA-X79S-UP5, clocking features only work for 1P Xeons, which are basically similar to the HEDT i7 (36xx) line. With those, the customer has the advantage of ECC RAM support and still some overclocking headroom.

    Clocking 2P/4P Xeons E5 (sadly, these are the only 8-core parts so far) is next to impossible due to the lack of ICC configuration data allowing changing of BCLK ratios. These Xeons can only be bumped by direct BCLK increase, which is dangerous above a few MHz. At most, 5-6 MHz is feasible, as tested on ASUS Z9PE-D8-WS and EVGA SR-X boards.

    Memory overclocking is another matter completely. I have excellent results with Samsung's 1.35v ("low voltage") ECC RAM. It is not just the cheapest 16 GB ECC option (~$160 for the 16 GB ECC stick last time I checked; I got mine for 140 EUR in Germany 7 months ago), but it is also the fastest while still keeping the low voltage. This RAM can be overclocked to 2133 MHz by a simple voltage bump to 1.55v, which is still within the Xeon's VSA limits.
  • Kevin G - Sunday, January 6, 2013 - link

    Weird that Intel doesn't provide the ICC configuration data. The 'gear ratio' change is something I'd still expect to be able to change on true X79 boards regardless of processor (I can see Intel crippling this on the C600 series). Then again, I've heard of some weird situations with LGA 2011 Xeons in desktop boards. There are some scattered reports of unlocked chips, but as the internet goes there is lots of speculation and rumour and little real confirmation.

    Those Samsung 16 GB ECC sticks are registered? I thought that the GA-X79S-UP5 didn't support registered DIMMs.

    As for the ability to overclock those low voltage DIMMs, I'm not really surprised, as they've historically been impressive in that regard. I have some older 4 GB 1.35v DDR3-1333 rated sticks that can go to 1866 MHz at 1.5v. :) The timings had to be changed but still impressive.
  • PEPCK - Saturday, January 5, 2013 - link

    Worth noting that the three miniSAS connectors yield 8 SAS and 4 SATA connectors in the specification table.
  • krumme - Sunday, January 6, 2013 - link

    For this article Ian gets the über nerds Gold Award, only given once in a century
  • lowenz - Sunday, January 6, 2013 - link

    A brilliant article.

    More of these, please.
