Measuring the Dual Core

Michael S. started this extremely interesting thread at the Ace’s Hardware Technical forum. The result was a small program, coded by Michael S. himself, that measures the latency of cache-to-cache data transfers between two cores or CPUs. In his own words: “it is a tool for comparison of the relative merits of different dual-cores.”

Cache2Cache measures the propagation time from a store by one processor to a load by the other processor. The results that we publish are approximately twice the propagation time. For those interested, the source code is available here.
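
The original source is not reproduced here, but the idea behind such a tool is simple enough to sketch. Below is a minimal, hypothetical C “ping-pong” test in the same spirit; it is not Michael S.’s code, and the core numbers, iteration count and Linux/gcc build line are our own assumptions. One thread, pinned to one core, stores a value to a shared, cache-line-aligned variable; a second thread, pinned to another core, spins until the store becomes visible and stores a reply. The averaged round trip corresponds roughly to the “twice the propagation time” figure reported in the table below.

```c
/* Minimal sketch of a cache-to-cache "ping-pong" latency test, in the
 * spirit of Michael S.'s Cache2Cache tool (this is NOT his source code).
 * Thread A writes an increasing value to a shared, cache-line-aligned
 * variable; thread B spins until it sees it, then writes the reply.
 * The average round trip approximates twice the one-way store-to-load
 * propagation time.
 * Build (Linux/gcc assumed): gcc -O2 -pthread c2c.c -o c2c
 */
#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>
#include <stdatomic.h>
#include <stdio.h>
#include <time.h>

#define ITERS 1000000L

/* Keep the flag on its own cache line to avoid false sharing. */
static _Alignas(64) atomic_long flag = 0;

static void pin_to_cpu(int cpu)         /* pin the calling thread to one core */
{
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(cpu, &set);
    pthread_setaffinity_np(pthread_self(), sizeof(set), &set);
}

static void *responder(void *arg)
{
    pin_to_cpu(1);                      /* assumption: core 1 is another physical core */
    for (long i = 1; i <= ITERS; i++) {
        while (atomic_load_explicit(&flag, memory_order_acquire) != 2 * i - 1)
            ;                           /* wait for the "ping" */
        atomic_store_explicit(&flag, 2 * i, memory_order_release);  /* "pong" */
    }
    return NULL;
}

int main(void)
{
    pthread_t t;
    struct timespec t0, t1;

    pin_to_cpu(0);
    pthread_create(&t, NULL, responder, NULL);

    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (long i = 1; i <= ITERS; i++) {
        atomic_store_explicit(&flag, 2 * i - 1, memory_order_release);  /* "ping" */
        while (atomic_load_explicit(&flag, memory_order_acquire) != 2 * i)
            ;                           /* wait for the "pong" */
    }
    clock_gettime(CLOCK_MONOTONIC, &t1);
    pthread_join(t, NULL);

    double ns = (t1.tv_sec - t0.tv_sec) * 1e9 + (t1.tv_nsec - t0.tv_nsec);
    printf("round trip (~2x propagation time): %.1f ns\n", ns / ITERS);
    return 0;
}
```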

We disabled Hyper-Threading on all Pentium 4 and Xeon processors (except for the first Pentium 4 entry in the table), to make sure that we measure from one physical core to another.

CPU configuration                              | HyperTransport speed | Bus speed | Cache2Cache
Pentium 4 3.6 GHz with Hyper-Threading         | N/A                  | 800 MHz   | 21 ns
Dual Opteron 2.4 GHz                           | 800 MHz              | N/A       | 159 ns
Dual Opteron 2.4 GHz (Iwill)                   | 1000 MHz             | N/A       | 150 ns
Dual core Opteron 2.2 GHz to other dual core   | 1000 MHz             | N/A       | 164 ns
One dual core Opteron 2.2 GHz (875)            | 2200 MHz*            | N/A       | 107 ns
Quad Opteron 848 2.2 GHz (CPU0 to CPU2)        | 800 MHz              | N/A       | 240 ns
Xeon DP 2.8 GHz (Prestonia)                    | N/A                  | 400 MHz   | 297 ns
Dual Xeon 3.06 GHz (Gallatin)                  | N/A                  | 533 MHz   | 219 ns
Dual Xeon 3.2 GHz (Nocona)                     | N/A                  | 800 MHz   | 242 ns
Pentium-D 3.2 GHz (dual core)                  | N/A                  | 800 MHz   | 240 ns
Dual Xeon 3.6 GHz (Nocona)                     | N/A                  | 800 MHz   | 244 ns

* Via the System Request Queue (SRQ), at the clock speed of the CPU

The Pentium-D (3.2 GHz) has no cache-to-cache latency advantage whatsoever over a similar dual Xeon (Nocona 3.2 GHz) configuration. The dual Opteron exchanges information no less than 60% faster than a similar dual Xeon or Pentium-D. In the case of the dual Opteron, cache coherency transfers are done via HyperTransport, and the 1 GHz HyperTransport connection delivers 6% lower latency than the 800 MHz one.

Dual core Opterons perform cache-to-cache transfers via the System Request Queue (SRQ), and it shows. The latency is roughly 30% lower than that of the dual Opteron with the fastest HyperTransport connection (107 ns versus 150 ns), and less than half that of the Pentium-D!

From the data available, we can also calculate the latency of the HyperTransport channel between two (single core) Opterons. In the case of an 800 MHz link, you get (159 – 107 ns)/2, or about 26 ns. A 1000 MHz link takes about (150 – 107 ns)/2, or about 21.5 ns. Typically, a local memory access from the Opteron to its closest (local) memory banks takes about 50-60 ns. Therefore, a remote memory access (CPU 1 going to the local memory of CPU 2) should take roughly 80 ns (20 ns + 60 ns).

We have measured between 100 and 120 ns for a Xeon system, so it seems that the Opteron in a dual configuration is capable of accessing remote memory more quickly than the Xeon can access memory via the shared bus.

Back to our main focus: the dual core Opteron's cache-to-cache latency is vastly superior to that of any dual core NetBurst CPU. Can this huge advantage, measured with a microbenchmark, translate into a performance boost in a real world application?

And in the Real world?

Before I wrote this article, Anand told me that he was already testing a lot of applications to see if the superior dual core architecture of the Opteron/Athlon 64 X2 could make performance scale better from a single core to a dual core CPU in real world applications. So far, he has found that multithreaded 3D rendering and video encoding applications don't show any scaling advantage for the Opteron. I couldn't find any either when testing read performance on database servers (DB2, MySQL MyISAM). But as you will see further on, that is not a surprise, and it doesn't mean that there is no performance benefit whatsoever.

The problem is that most current multi-threaded software, especially on the desktop, is developed with the objective of minimizing messaging between threads and synchronization between caches. You could say that only the “easy to thread” parts of most current programs are divided into multiple threads.

Although we have a lot of testing to do, we can be pretty sure that there are applications out there that do benefit from very fast cache-to-cache transfers.

OLTP (On-Line Transaction Processing) applications might be one type of software that benefits significantly. A good example of an OLTP application is a bank account database. Imagine two clients sending two different updates of a bank account: one transaction wants to add your salary to the current balance of your account, while the other wants to subtract the purchase that you just made.

The machine on which the OLTP application runs is a dual CPU system. Both CPUs have the current balance of your account in their caches, and each CPU gets to perform one of the described transactions. It is pretty clear that the first transaction has to finish before the second one starts; otherwise, the second result (current balance – purchase) would simply overwrite the first calculation (current balance + salary). So, the row that contains the current balance of your bank account must be locked by the first CPU and read out. The second CPU must now be told that it cannot use that value anymore (CPU 1 tells CPU 2 to flush that cache line or mark it as invalid) because it is about to change. The calculation is performed and written back to memory, the new value is communicated to CPU 2, and the row is unlocked if everything is OK. The second CPU then performs the second calculation, based on the updated balance. This example is simplified, but notice that the CPUs must talk to each other quite a bit.
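
To make the access pattern concrete, here is a stripped-down, hypothetical sketch in C: two threads stand in for the two transactions, each potentially running on its own core, and a mutex stands in for the row lock. This illustrates the communication pattern only, not how any real database engine implements locking. Every locked update forces the cache lines holding the lock and the balance to migrate from one CPU's cache to the other, which is exactly where cache-to-cache latency lands on the critical path.

```c
/* Two "transactions" on two cores updating one account balance.
 * The mutex plays the role of the row lock; the balance is the row.
 * Every locked update pulls the cache line holding the lock and the
 * balance over to the writing core before it can proceed.
 * Build: gcc -O2 -pthread balance.c -o balance
 */
#include <pthread.h>
#include <stdio.h>

static pthread_mutex_t row_lock = PTHREAD_MUTEX_INITIALIZER;
static long balance = 1000;            /* current balance of the account */

static void *add_salary(void *arg)
{
    pthread_mutex_lock(&row_lock);     /* "lock the row" */
    balance += 2500;                   /* current balance + salary */
    pthread_mutex_unlock(&row_lock);   /* "unlock": new value visible to the other CPU */
    return NULL;
}

static void *subtract_purchase(void *arg)
{
    pthread_mutex_lock(&row_lock);     /* must wait until the other transaction is done */
    balance -= 300;                    /* current balance - purchase */
    pthread_mutex_unlock(&row_lock);
    return NULL;
}

int main(void)
{
    pthread_t t1, t2;
    pthread_create(&t1, NULL, add_salary, NULL);
    pthread_create(&t2, NULL, subtract_purchase, NULL);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    printf("final balance: %ld\n", balance);   /* 3200 either way, thanks to the lock */
    return 0;
}
```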

A database application where this frequent locking, reading, writing, and unlocking occurs will most likely benefit from the superior cache-to-cache transfer latency of the Opteron… if it is not bottlenecked by the speed of the memory and/or storage system.

Basically, we could say that databases that lock at the table level (MySQL MyISAM) will not show any performance advantage. Table-level locking is fast and produces little overhead, as long as you do not update (write) the values in your database often. If you write a lot to your database, the database will be terribly slow. These kinds of databases will not care about the dual core architecture.

However, database engines (DB2, Oracle, MySQL InnoDB) with a much finer locking grain (row level) produce a lot more overhead, but perform better when many writes are mixed with reads. These types of database engines will be much more sensitive to the dual core architecture. Again, that is only the case if the storage or memory system is not the bottleneck and the CPU is.
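
As a rough illustration of the difference in locking granularity (and explicitly not of how MyISAM or InnoDB are implemented internally), the sketch below contrasts one lock for a whole “table” with one lock per “row”. The fine-grained variant takes and releases many more locks, and those lock words and row cache lines end up bouncing between the CPUs, but updates to different rows can proceed in parallel.

```c
/* Coarse ("table-level") vs. fine ("row-level") locking, reduced to its
 * essence. Purely illustrative.
 * Build: gcc -O2 -pthread locks.c -o locks
 */
#include <pthread.h>
#include <stdio.h>

#define ROWS 4
static long rows[ROWS];

static pthread_mutex_t table_lock = PTHREAD_MUTEX_INITIALIZER;  /* one lock for the whole table */
static pthread_mutex_t row_locks[ROWS];                         /* one lock per row */

/* Table-level: cheap to take, but every writer serializes the whole table. */
static void update_table_locked(int row, long delta)
{
    pthread_mutex_lock(&table_lock);
    rows[row] += delta;
    pthread_mutex_unlock(&table_lock);
}

/* Row-level: more locks and more cache lines moving between the CPUs,
 * but writes to different rows proceed in parallel - the pattern where
 * fast cache-to-cache transfers start to pay off. */
static void update_row_locked(int row, long delta)
{
    pthread_mutex_lock(&row_locks[row]);
    rows[row] += delta;
    pthread_mutex_unlock(&row_locks[row]);
}

int main(void)
{
    for (int i = 0; i < ROWS; i++)
        pthread_mutex_init(&row_locks[i], NULL);

    update_table_locked(0, 100);   /* would block a concurrent write to row 3 */
    update_row_locked(3, -50);     /* would not block a concurrent write to row 0 */

    printf("row 0 = %ld, row 3 = %ld\n", rows[0], rows[3]);
    return 0;
}
```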

Another example might be a scanline renderer (each line of pixels to render is sent to a different CPU) where the render time per line is very short. This would mean that a lot of time is spent keeping track of which CPU has to be given which line and so on, and this requires a lot of synchronization between the two caches.
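
A hypothetical sketch of that pattern: the scanlines are handed out through a single shared counter, and each worker thread atomically grabs the next line to render. When rendering one line takes only a few microseconds, the fetch-and-add on that counter, whose cache line has to hop from core to core on every request, becomes a noticeable share of the total time; this is exactly the kind of fine-grained synchronization where cache-to-cache latency matters.

```c
/* Two render threads pulling scanlines from one shared counter.
 * The counter's cache line ping-pongs between the cores on every
 * fetch-and-add; the shorter render_line() is, the larger the share
 * of time spent on this synchronization.
 * Build: gcc -O2 -pthread scanline.c -o scanline
 */
#include <pthread.h>
#include <stdatomic.h>
#include <stdio.h>

#define HEIGHT 1080

static atomic_int next_line = 0;
static float image[HEIGHT];            /* one value per line, as a stand-in for pixel data */

static void render_line(int y)         /* placeholder for the real per-line work */
{
    image[y] = (float)y * 0.5f;
}

static void *worker(void *arg)
{
    for (;;) {
        /* grab the next line; this is the cache-to-cache hot spot */
        int y = atomic_fetch_add(&next_line, 1);
        if (y >= HEIGHT)
            break;
        render_line(y);
    }
    return NULL;
}

int main(void)
{
    pthread_t a, b;
    pthread_create(&a, NULL, worker, NULL);
    pthread_create(&b, NULL, worker, NULL);
    pthread_join(a, NULL);
    pthread_join(b, NULL);
    printf("rendered %d lines\n", HEIGHT);
    return 0;
}
```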

Some HPC applications where threads perform calculations on shared data might also show significantly better scaling on the dual core Opteron.

In general, the more time a program spends in synchronization and message passing, and the shorter the computation time on each CPU, the worse the multithreaded program scales on SMP configurations. However, those applications are exactly the ones where the dual core Opteron is going to show scaling advantages compared to the Pentium-D and Xeons.

We’ll report back with some real world benchmarks.

Comments

  • Viditor - Friday, May 20, 2005 - link

    fitten - Thanks very much for the explanation!
  • fitten - Friday, May 20, 2005 - link

    "When a thread is blocked it got swapped out of the processor all together. It is the OS's job to check if some conditions are met to re-waken a thread. So a waiting thread will not be actively checking that data at any time.

    Only in single-write/multi-read situation (server/consumer model) those consumer threads are not blocked but actively checking for new data."

    Only if you are using synchronization primitives (mutex, critical section, semaphore, etc.) which are kernel objects or you call sleep() or something in the midst of reading/writing values. If you are just reading/writing a memory location, the OS doesn't know anything about it. Plus, if you have multiple CPUs/cores, more than one thread can be running simultaneously, which is where the MOESI protocols really come into play.
  • cz - Friday, May 20, 2005 - link

    When a thread is blocked it got swapped out of the processor all together. It is the OS's job to check if some conditions are met to re-waken a thread. So a waiting thread will not be actively checking that data at any time.

    Only in single-write/multi-read situation (server/consumer model) those consumer threads are not blocked but actively checking for new data.
  • fitten - Thursday, May 19, 2005 - link

    "When you write a program where the threads are effectively fighting over the ownership of data, particularly in the current designs of multiprocessor (this includes multi-core) cache systems, performance will tank because of all the overhead of taking ownership and such"

    "But doesn't AMDs MOESI protocol help avoid this by allowing one cache to copy data from another?"

    No, MOESI doesn't help avoid the problem - it is the mechanism by which the problem is arbitrated and resolved.

    Simplified example: CPU1 wants some data. The cache subsystem uses MOESI to determine that CPU0 currently owns that data. MOESI protocols are then used to transfer the ownership of that data to CPU1 (including copying the data to a different cache if necessary). Meanwhile, one (definitely the writing core) or both cores must wait while the MOESI stuff is done and then CPU1 is allowed to proceed with its write.

    So, you can write a two thread program where each thread does nothing but writes a value into a memory location (both threads write to the same memory location). That cannot be avoided by anything. On every write, MOESI will be invoked to resolve the ownership of the data and make sure the processor currently wanting to write to that memory location owns it. So, these two threads will generate massive amounts of MOESI traffic between the two caches (on a multi-core or multi-processor machine) because both cores want to effectively always own that memory. While MOESI is fast, it still takes time to resolve, longer than not having to do the transfer of ownership and any copying required in any case. So, you have two cores fighting over the data and generating a lot of MOESI overhead which saps performance from both cores (both cores spend a bit of time waiting until the cache tells it that it can do its writing).

    "I agree fully that most multi threaded applications are coarse grained. But there are HPC applications where you can not avoid to work on shared data. I believe fluid dynamics, and OLTP applications that mix writes with reads (and use row locking) are examples."

    Absolutely. There are times when it simply cannot be avoided and must be done. But, if you can avoid it, then you probably want to avoid it :)
  • JohanAnandtech - Thursday, May 19, 2005 - link

    Ahkorishaan:

    Good summary, that is most likely what is happening at Intel.


    bob661:

    "The Quest for More Processing Power, Part Three: ", that doesn't sound like a buyers guide hey? :-)

    nserra:

    Very astute! Ok, ok: "AMD's current dual core architecture is pretty good, let's wait until Intel gets it right" :-).

    Fitten:

    I agree fully that most multi threaded applications are coarse grained. But there are HPC applications where you can not avoid to work on shared data. I believe fluid dynamics, and OLTP applications that mix writes with reads (and use row locking) are examples.
  • Viditor - Thursday, May 19, 2005 - link

    "When you write a program where the threads are effectively fighting over the ownership of data, particularly in the current designs of multiprocessor (this includes multi-core) cache systems, performance will tank because of all the overhead of taking ownership and such"

    But doesn't AMDs MOESI protocol help avoid this by allowing one cache to copy data from another?
  • fitten - Thursday, May 19, 2005 - link

    Processes that will benefit from fast cache-to-cache transfers are ones that are multithreaded and whose threads manipulate the same data. There are applications that do this, but usually when you design multi-threaded applications, you try to avoid these types of situations. When you write a program where the threads are effectively fighting over the ownership of data, particularly in the current designs of multiprocessor (this includes multi-core) cache systems, performance will tank because of all the overhead of taking ownership and such. Shared (L2) caches tend to help this out because the data doesn't actually have to be transferred to the other core's cache as part of taking ownership; the cache line(s) can stay right where they are with only the ownership modified.

    Anyway, HPC code usually goes through pains to avoid the situation where ownership of data must switch between processes/threads often. That's why data partitioning is one of the most important steps of application design in parallel applications.
  • blackbrrd - Thursday, May 19, 2005 - link

    Uhm.. #19 - that is exactly the point, to check if a row is locked you most likely have to query the other caches to see if it is locked or not...
  • JNo - Thursday, May 19, 2005 - link

    "In Part 2, Tim Sweeney, the leading developer behind the Unreal 3 engine, explained the challenges of multi-threaded development of the next generation of games."

    ...before showing off a beautiful working demo of the Unreal 3 engine on the 7-core PS3 cell processor that was put together in only 2 months and that was relatively easy to develop according to the Unreal guys themselves... Ha! (cos Sweeney did downplay the use of multithreading in games if you read his original comments)
  • cz - Thursday, May 19, 2005 - link

    It is an interesting read, I would say. But I would like to point out that OLTP programs will not benefit from cache2cache performance very much. That is because the very principle of multi-threaded programming requires the user account to be locked before updating. So only one thread can update a user account at any given time and other threads are blocked. Only programs that use data in single-write and multi-read form will benefit from cache2cache performance. And most likely these applications will be some sort of scientific simulations.
