Measuring the Dual Core

Michael S. started this extremely interesting thread at the Ace's Hardware Technical Forum. The result was a little program, coded by Michael S. himself, that measures the latency of cache-to-cache data transfers between two cores or CPUs. In his own words: “it is a tool for comparison of the relative merits of different dual-cores.”

Cache2Cache measures the propagation time from a store by one processor to a load by the other processor. The results that we publish are approximately twice the propagation time. For those interested, the source code is available here.
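To make the methodology concrete, here is a minimal sketch of how such a ping-pong measurement can be put together. This is our own illustration, not Michael S.'s actual source; the core numbers and round count are assumptions, and it assumes CPUs 0 and 1 are different physical cores:

```c
/*
 * Minimal cache-to-cache "ping-pong" sketch in the spirit of
 * Cache2Cache (NOT Michael S.'s actual code). Two threads, pinned
 * to (assumed) different physical cores, bounce a shared flag back
 * and forth; every half round trip forces the flag's cache line to
 * migrate from one core's cache to the other's.
 *
 * Build on Linux with GCC: gcc -O2 -pthread pingpong.c
 */
#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>
#include <stdio.h>
#include <time.h>

#define ROUNDS 1000000

/* Keep the flag on its own cache line to avoid false sharing. */
static volatile int flag __attribute__((aligned(64)));

static void pin_to_cpu(int cpu)
{
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(cpu, &set);
    pthread_setaffinity_np(pthread_self(), sizeof(set), &set);
}

static void *ponger(void *arg)
{
    (void)arg;
    pin_to_cpu(1);                  /* assumed: a different physical core */
    for (int i = 0; i < ROUNDS; i++) {
        while (flag != 1)           /* spin until the ping arrives...     */
            ;
        flag = 2;                   /* ...then store the pong back        */
    }
    return NULL;
}

int main(void)
{
    pthread_t t;
    struct timespec start, end;

    pin_to_cpu(0);
    pthread_create(&t, NULL, ponger, NULL);

    clock_gettime(CLOCK_MONOTONIC, &start);
    for (int i = 0; i < ROUNDS; i++) {
        flag = 1;                   /* store by this core...              */
        while (flag != 2)           /* ...load of the other core's store  */
            ;
    }
    clock_gettime(CLOCK_MONOTONIC, &end);
    pthread_join(t, NULL);

    double ns = (end.tv_sec - start.tv_sec) * 1e9
              + (end.tv_nsec - start.tv_nsec);
    /* One iteration covers two store-to-load propagations, matching the
       "approximately twice the propagation time" note above. */
    printf("%.1f ns per round trip\n", ns / ROUNDS);
    return 0;
}
```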

We disabled Hyper-Threading on all Pentium 4s (except the first one) and Xeons to make sure that we measure from one physical core to another.

| CPU configuration | HyperTransport speed | Bus speed | Cache2Cache |
|-------------------|----------------------|-----------|-------------|
| Pentium 4 3.6 GHz with Hyper-Threading | N/A | 800 MHz | 21 ns |
| Dual Opteron 2.4 GHz | 800 MHz | N/A | 159 ns |
| Dual Opteron 2.4 GHz (Iwill) | 1000 MHz | N/A | 150 ns |
| Dual core Opteron 2.2 GHz to other dual core | 1000 MHz | N/A | 164 ns |
| One dual core Opteron 2.2 GHz (875) | 2200 MHz* | N/A | 107 ns |
| Quad Opteron 848 2.2 GHz (CPU0 to CPU2) | 800 MHz | N/A | 240 ns |
| Xeon DP 2.8 GHz (Prestonia) | N/A | 400 MHz | 297 ns |
| Dual Xeon 3.06 GHz (Gallatin) | N/A | 533 MHz | 219 ns |
| Dual Xeon 3.2 GHz (Nocona) | N/A | 800 MHz | 242 ns |
| Pentium-D (dual core) | N/A | 800 MHz | 240 ns |
| Dual Xeon 3.6 GHz (Nocona) | N/A | 800 MHz | 244 ns |

* Via the SRQ, at the clock speed of the CPU

The Pentium-D (3.2 GHz) has no cache-to-cache latency advantage whatsoever over a similar dual Xeon (Nocona 3.2 GHz) configuration: 240 ns versus 242 ns. The dual Opteron exchanges information roughly 60% faster than a similar dual Xeon or Pentium-D (150-159 ns versus 240-242 ns). In the case of the dual Opteron, cache coherency transfers are done via HyperTransport, and the 1 GHz HyperTransport connection delivers about 6% lower latency than the 800 MHz one (150 ns versus 159 ns).

Dual core Opterons perform cache-to-cache transfers via the SRQ, and it shows. At 107 ns, the latency is about 30% lower than that of the dual Opteron with the fastest HyperTransport connection (150 ns), and less than half that of the Pentium-D (240 ns)!

From the data available, we can also estimate the latency of the HyperTransport link between two (single core) Opterons. For an 800 MHz link, you get (159 - 107 ns)/2, or about 26 ns. A 1000 MHz link takes about (150 - 107 ns)/2, or roughly 21.5 ns. Typically, a local memory access from the Opteron to its closest (local) memory banks takes about 50-60 ns. Therefore, a remote memory access (CPU 1 goes to the local memory of CPU 2) should take roughly 80 ns (20 ns + 60 ns).
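The same arithmetic in executable form, as a quick sanity check (our own little program; the values are simply the table entries above):

```c
/* Back-of-the-envelope HyperTransport latency estimates, using the
   Cache2Cache figures from the table above (all values in ns). The
   published results are ~2x the propagation time, so one HT hop is
   half of the dual-socket figure minus the single-die (SRQ) figure. */
#include <stdio.h>

int main(void)
{
    double srq    = 107.0;  /* one dual core die, via the SRQ   */
    double ht800  = 159.0;  /* two sockets over 800 MHz HT      */
    double ht1000 = 150.0;  /* two sockets over 1000 MHz HT     */
    double local  = 60.0;   /* typical local memory access      */

    printf("800 MHz HT hop:  %.1f ns\n", (ht800  - srq) / 2);          /* ~26.0 */
    printf("1 GHz HT hop:    %.1f ns\n", (ht1000 - srq) / 2);          /* ~21.5 */
    printf("remote memory:  ~%.0f ns\n", (ht1000 - srq) / 2 + local);  /* ~80   */
    return 0;
}
```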

We have measured between 100 and 120 ns for a Xeon system, so it seems that the Opteron in a dual configuration can access remote memory faster than the Xeon can access memory over its shared bus.

Back to our main focus: the dual core Opteron's cache-to-cache latency is vastly superior to that of any dual core NetBurst CPU. Can this huge advantage, measured with a micro-benchmark, translate into a performance boost in a real world application?

And in the Real World?

Before I wrote this article, Anand told me that he was already testing a lot of applications to see whether the superior dual core architecture of the Opteron/Athlon 64 X2 could make performance scale better from a single core to a dual core CPU in real world applications. So far, he has found that multithreaded 3D rendering and video encoding applications don't show any scaling advantage for the Opteron. I couldn't find any either when testing read performance on database servers (DB2, MySQL MyISAM). But as you will see further on, that is not a surprise, and it doesn't mean that there is no performance benefit whatsoever.

The problem is that most current multi-threaded software, especially on the desktop, is developed with the objective of minimizing messaging between threads and synchronization between caches. You could say that only the “easy to thread” parts of most current programs are divided into many threads.

Although we have a lot of testing to do, we can be pretty sure that there are applications out there that do benefit from very fast cache-to-cache transfers.

OLTP (On-Line Transaction Processing) applications might be one type of software to benefit significantly. A good example of an OLTP application is a bank account database. Imagine two clients sending two different updates to the same bank account: one transaction wants to add your salary to the current balance, while the other wants to decrease the balance by the purchase that you just made.

The machine on which the OLTP application runs is a dual CPU system. Both CPUs have the current balance of your account in their caches, and each CPU gets to perform one of the described transactions. It is pretty clear that the first transaction has to finish before the second one starts; otherwise, the second result (current balance - purchase) would simply overwrite the first calculation (current balance + salary). So, the row that contains the current balance of your bank account must be locked by the first CPU and read out. The second CPU must now be told that it cannot use that variable anymore (CPU 1 tells CPU 2 to flush that cache line or mark it as invalid) because it is about to change. The calculation is performed and written back to memory, the new value is communicated to CPU 2, and the row is unlocked if everything is OK. The second CPU then performs the second calculation, based on the updated balance. This example is simplified, but notice that the CPUs must talk to each other quite a bit.
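In code, the heart of that scenario looks something like the sketch below. This is our own heavily simplified illustration, with a pthread mutex standing in for the database's row lock; it is not code from any actual database engine. Every lock, update, and unlock moves cache lines between the two cores, which is exactly the traffic Cache2Cache measures:

```c
/* Two threads updating one "bank balance row" under a shared lock
   (our illustration). The mutex word and the balance both live on
   cache lines that must migrate between the cores on every
   transaction, so cache-to-cache latency is paid repeatedly. */
#include <pthread.h>
#include <stdio.h>

static pthread_mutex_t row_lock = PTHREAD_MUTEX_INITIALIZER;
static long balance = 1000;             /* the "current balance" row */

static void *deposit_salary(void *arg)
{
    (void)arg;
    pthread_mutex_lock(&row_lock);      /* CPU 1 takes the row lock...   */
    balance += 2500;                    /* ...invalidating CPU 2's copy  */
    pthread_mutex_unlock(&row_lock);    /* unlock: another line transfer */
    return NULL;
}

static void *subtract_purchase(void *arg)
{
    (void)arg;
    pthread_mutex_lock(&row_lock);      /* serialized behind the deposit */
    balance -= 100;                     /* works on the updated balance  */
    pthread_mutex_unlock(&row_lock);
    return NULL;
}

int main(void)
{
    pthread_t t1, t2;
    pthread_create(&t1, NULL, deposit_salary, NULL);
    pthread_create(&t2, NULL, subtract_purchase, NULL);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    printf("final balance: %ld\n", balance);   /* 3400 either way */
    return 0;
}
```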

A database application with this kind of frequent locking, reading, writing, and unlocking will most likely show off the superior cache-to-cache transfer latency of the Opteron… if it is not bottlenecked by the speed of the memory and/or the storage system.

Basically, we could say that databases that lock at the table level (MySQL MyISAM) will not show any performance advantage. Table-level locking is fast and produces little overhead, as long as you do not update (write) the values in your database often. If you write a lot to your database, the database will be terribly slow. These kinds of databases will not care about the dual core architecture.

However, database engines (DB2, Oracle, MySQL InnoDB) with a much finer locking grain (row level) produce a lot more overhead, but perform better when many writes are mixed with reads. These types of database engines will be much more sensitive to the dual core architecture; see the sketch below. Again, that is only true if the storage or memory system is not the bottleneck and the CPU is.
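The difference in locking grain can be sketched as follows (our own toy illustration, not engine code; the row count and helper names are made up). With one table-wide lock, every writer serializes; with per-row locks, writers on different rows proceed in parallel, at the price of much more lock traffic bouncing between the caches:

```c
/* Toy contrast between table-level and row-level locking
   (our illustration, not code from MyISAM or InnoDB). */
#include <pthread.h>

#define ROWS 1024

static long rows[ROWS];

/* MyISAM-style: one lock covers the whole table. */
static pthread_mutex_t table_lock = PTHREAD_MUTEX_INITIALIZER;

void update_table_locked(int row, long delta)
{
    pthread_mutex_lock(&table_lock);      /* every writer contends here   */
    rows[row] += delta;
    pthread_mutex_unlock(&table_lock);
}

/* InnoDB-style: a lock per row (finer grain, more lock traffic). */
static pthread_mutex_t row_locks[ROWS];

void init_row_locks(void)
{
    for (int i = 0; i < ROWS; i++)
        pthread_mutex_init(&row_locks[i], NULL);
}

void update_row_locked(int row, long delta)
{
    pthread_mutex_lock(&row_locks[row]);  /* only same-row writers contend */
    rows[row] += delta;
    pthread_mutex_unlock(&row_locks[row]);
}

int main(void)
{
    init_row_locks();
    update_table_locked(7, +2500);  /* serializes against all writers */
    update_row_locked(7, -100);     /* serializes only on row 7       */
    return 0;
}
```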

Another example might be a scanline renderer (each line of pixels to render is sent to a different CPU) where the render time per line is very short. That would mean that a lot of time is spent keeping track of which CPU has to be given which line, and this bookkeeping requires a lot of synchronization between the two caches.
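A sketch of that hand-out loop (again our own illustration, with made-up names): each worker grabs the next scanline from a shared counter, and when lines render quickly, the counter's cache line ping-pongs between the cores almost as often as real work gets done:

```c
/* Scanline hand-out between two workers (our illustration). The shared
   next_line counter must migrate between the cores' caches on every
   grab, so short render times make cache-to-cache latency the pacer. */
#include <pthread.h>
#include <stdio.h>

#define HEIGHT 1080

static pthread_mutex_t work_lock = PTHREAD_MUTEX_INITIALIZER;
static int next_line;                    /* next scanline to hand out */

static int grab_line(void)
{
    int line;
    pthread_mutex_lock(&work_lock);      /* counter's line migrates here */
    line = (next_line < HEIGHT) ? next_line++ : -1;
    pthread_mutex_unlock(&work_lock);
    return line;
}

static void render_line(int line)
{
    (void)line;                          /* stand-in for the real work */
}

static void *worker(void *arg)
{
    (void)arg;
    int line, done = 0;
    while ((line = grab_line()) != -1) {
        render_line(line);
        done++;
    }
    printf("worker rendered %d lines\n", done);
    return NULL;
}

int main(void)
{
    pthread_t t1, t2;
    pthread_create(&t1, NULL, worker, NULL);
    pthread_create(&t2, NULL, worker, NULL);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    return 0;
}
```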

Some HPC applications where threads perform calculations on shared data might also show significantly better scaling on the dual core Opteron.

In general, the more time that a program spends in synchronization and message passing, and the shorter the computation time on each CPU, the worse the multi-threaded program scales on SMP configurations. However, those applications are exactly the ones where the dual core Opteron is going to show scaling advantages over the Pentium-D and Xeons.
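As a rough back-of-the-envelope model (our own formulation, not something measured in this article): if each thread alternates $T_{comp}$ of computation with $T_{sync}$ of synchronization and message passing, the best-case speedup on $n$ CPUs is

$$ S(n) \approx \frac{T_{comp} + T_{sync}}{T_{comp}/n + T_{sync}} $$

The shorter $T_{comp}$ and the larger $T_{sync}$, the closer $S(n)$ falls toward 1, and since $T_{sync}$ is dominated by cache-to-cache traffic, cutting that latency lifts the scaling directly.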

We’ll report back with some real world benchmarks.

Comments

  • nserra - Thursday, May 19, 2005 - link

    The previous post was for the biased person who wrote this article. Johan De Gelas

    ^
    |Just kidding ;)
  • nserra - Thursday, May 19, 2005 - link

    "AMDs current dual core architecture is vastly superior to Intels"

    This is wrong!!! You said yourself that Intel's “new” processor is more of a “special” packaging job than a true dual core processor, so what you should have said is:
    “AMD's current dual core architecture is amazing; let's wait and see what Intel will do at a later time.”

    TDP relates to power consumption the way a 500W PSU's rating relates to its actual draw: just because you have a 500W PSU doesn't mean it draws 500W of power.

    The new Venice core has more transistors than the previous core, and not just because of SSE3; there are new power stages that can be enabled to further lower power consumption. I doubt that putting a Turion on a regular board will enable those new power stages.
  • Viditor - Thursday, May 19, 2005 - link

    G'day Jarred!

    "the Pentium M 2.0 GHz chips manage to run at 22W"

    To be specific, they have a TDP of 22 watts, which isn't really the same thing...

    "as I understand it even under maximum load the Pentium M stays under 22W, right?"

    Not at all...in fact it can be significantly higher than that. Intel's TDP measures an average usage under load rather than peak, while AMD's measures absolute theoretical peak under the worst conditions. This is why the TDP is quite meaningless...

    I guess my point is that I am of the opinion that the Turion might actually run at significantly lower power usage. As absolutely nobody (that I am aware of) has tested beyond the system level (i.e. the chip itself), I can't be sure... but judging by the actual specs of the chips themselves (not the TDP, but the electrical specifications), it appears that the PM may indeed be higher.

    I know I've asked before, but with power usage and heat becoming more and more important, couldn't you guys develop a test of the actual real-world usage of the chips themselves?
    I think it might be quite illuminating...

    Cheers!
  • 4lpha0ne - Thursday, May 19, 2005 - link

    @Questar:
    Criticizing Intel and saying good things about AMD and IBM means that Johan is an AMD fanboy? I think not. You'll see that the opinions about Whitefield, Merom, Yonah & Co. after they hit the public will be better than those about Smithfield now. That's simply the result of the amount of effort put into the designs. A dedicated dual core design is not the same as an on-die dual Xeon system.

    @photoguy99:
    I'd say Johan can make this conclusion because he has the knowledge to do so. I'd come to the same conclusion, since the Windows scheduler (at least in XP) is not very core-aware. It just sees the logical or physical CPUs, and if one becomes free, it sends the next thread to it. This causes thread-hopping (as can be seen in the Tech Report dual core reviews, thanks to task manager screenshots). In such cases, it matters somewhat whether the last used data is in the other L2 cache and can be quickly transferred to the current L2 cache. And it matters for multithreaded applications which work on the same set of data.

    @mazuz:
    I'd suggest looking at benchmarks of a 275 vs. dual 248s with one dual channel memory bank, and at benchmarks of a dual Xeon with FSB800 vs. a similarly configured (cache, FSB, memory, HT) Smithfield. That's the difference caused by the SRQ connection.

    @Ahkorishaan:
    The mentioned upcoming Intel cores will indeed be nice. But some people here and on many other forums sound as if the dual core K8 were AMD's last CPU and the K8 their last core ever. :) However, have a look at AMD's patent portfolio and you'll see that this is not the case. As Fred Weber said, AMD is also still looking at power consumption. This is maybe the reason why we might see a future CPU with more cores, but less FPU power per core (due to shared FPUs).

    AMD is also working on using things like clock gating and throttling (used by the P-M) to further reduce power consumption. Currently, they have only implemented some standard features to keep power consumption down, like different transistor designs (especially slower transistors in less critical places), microarchitectural changes (a better HALT mode), the C3 state, and PowerNow!/C'n'Q.

    Matthias
  • JarredWalton - Wednesday, May 18, 2005 - link

    Viditor, I think the point is that the Pentium M 2.0 GHz chips manage to run at 22W - still less than 1/3 of what the Winchester and Venice cores put out, I think. What exactly did they do to get that low? Well, there's gating technology for sure - i.e. power down unused portions of the chip - but as I understand it even under maximum load the Pentium M stays under 22W, right?

    Maybe Johan has more specifics, but I don't. I just know the price for power use on the design is very impressive, and I was surprised some of the same tech wasn't used in Prescott.
  • Viditor - Wednesday, May 18, 2005 - link

    Your usual excellent work Johan, thanks.
    A couple of nits to pick...

    "Intel will use its P-m “know-how” to keep the power dissipation so low"

    If you could qualify exactly what “know-how” you mean, that would be appreciated. IMHO, a major reason that the P-M is able to stay so much cooler than the NetBurst chips (and on par with the Athlons) is that it doesn't have nearly as many features... Is there a reason you see the P-M translating well into full-blown server and desktop chips?

    "Intel can leverage their experience with the power saving features of the P-m to design quad core CPUs with remarkably low TDP"

    Arrrrrggghhh! This is a pet peeve of mine. TDP IS NOT POWER USAGE!!! Sorry, I know you know this, but most don't, and it's been quite frustrating.
    For those who don't know, TDP is an arbitrary design spec for OEMs to use with the CPU...
    AMD's TDP is so much higher than Intel's relative to actual power usage because AMD is much more cautious in its design spec, not because the chip uses that much power.


    As to Questar's comments, IMHO the fact that the worst thing he can say is a short unsubstantiated rant speaks volumes to the credibility of the article.
    Thanks again Johan!
  • phaxmohdem - Wednesday, May 18, 2005 - link

    You are all fools. IDT's WinChip X2 dual core solution will blow all of this crapola out of the water.
  • mazuz - Wednesday, May 18, 2005 - link

    "AMDs current dual core architecture is vastly superior to Intels"

    This seems like a pretty strong statement considering there doesn't seem to be any known real world advantage to this architecture.
  • photoguy99 - Wednesday, May 18, 2005 - link

    Johan, isn't this statement a little unfounded:

    "we can be pretty sure that there are applications out there that do benefit from very fast cache-to-cache transfers"

    How can you be pretty sure when you've cited none? I know you said you'll do more testing - but *after* that testing is done seems like the time to be "pretty sure" it's a real world benefit.

    You've written a good article, it was informative. Just prefer conservative research conclusions.

  • bob661 - Wednesday, May 18, 2005 - link

    #7
    Who cares which company is ahead or behind? I sure as hell don't. Give me good bang for the buck. That's all I want.
