...and why SMT can be impressive!

If you want to know what is going to happen in the future, it is always a good idea to look at the big iron. After all, many of the techniques that are now popular in low budget x86 CPUs originated from there: SIMD (Cray-1, ILLIAC IV), 64 bit (MIPS R4000) and CMP (IBM Power 4) are just a few examples.

The IBM Power 5 is a very good example of a CPU that is really made for SMT instead of just having it glued on. Up to 8 instructions can be executed in parallel on one of the two cores, while 5 instructions per thread can be fetched and retired. That means that with one thread, you can have up to 5 instructions in parallel, and with two threads running, up to 8 instructions in parallel. Combine this with massive buffers, a decently large L1 (32 KB instructions, 64 KB data) and huge amounts of memory bandwidth, and the SMT capability can really show its potential. IBM reports a performance boost of 40%, while SMT increased the die size by 24%.

This SMT makes much more effective use of processor resources than multi-core. If only one thread is running, and there is a lot of instruction level parallelism, it has all the execution resources to its disposal and the CPU acts as a massive parallel superscalar CPU. If two or more threads are running, they can make optimum use of the available execution slots. For each percentage that the die size increases, SMT gives you more than one percentage of performance back. In contrast, a second core doubles the die size, but rarely improves performance with more than 70%. SMT can be a superb feature to boost the performance of a multi-core CPU without increasing the die size too much.

Bringing it all together...

Intel and AMD are playing different trump cards while getting their next generation of quad core designs ready for the server market. It is clear, however, that clock speed will only increase slowly, and will no longer be the most important performance indicator.

Intel can leverage their experience with the power saving features of the P-m to design quad core CPUs with remarkably low TDP. SMT might well be one of Intel’s most important weapons to enable relatively high IPC per core. The fact that the current implementation called Hyperthreading offers only mediocre performance improvements is not a reason to believe that SMT will not have a bright future. SMT added to a high IPC core might even give Intel the edge in the server market. The shared L2-cache in the next generation multi-core CPUs (Merom, Conroe, Woodcrest, Whitefield) should also eliminate Intel’s high cache to cache latency.

AMD’s current dual core architecture is vastly superior to Intel’s. The more than twice as fast cache-to-cache communication does not pay off in all multithreaded applications, but it should give AMD a scaling advantage in OLTP and some rendering and HPC applications. It will be very easy for AMD to make communications between the cores even faster, by attaching a shared L2-cache to the SRQ. AMD can also leverage their knowledge and experiences with the on die northbridge to lower the latency and increase the bandwidth of the memory subsystem.

I like to express my thanks to the following people who helped to make this article possible:

References

[1] Hyper-Threading Technology Architecture and Microarchitecture
http://www.intel.com/technology/itj/2002/volume06issue01/art01_hyper/p01_abstract.htm

SMT Dead?
Comments Locked

28 Comments

View All Comments

  • Viditor - Friday, May 20, 2005 - link

    fitten - Thanks very much for the explanation!
  • fitten - Friday, May 20, 2005 - link

    "When a thread is blocked it got swapped out of the processor all together. It is the OS's job to check if some conditions are met to re-waken a thread. So a waiting thread will not be actively checking that data at any time.

    Only in single-write/multi-read situation (server/consumer model) those consumer threads are not blocked but actively checking for new data."

    Only if you are using synchronization primitives (mutex, critical section, semaphore, etc.) which are kernel objects or you call sleep() or something in the midst of reading/writing values. If you are just reading/writing a memory location, the OS doesn't know anything about it. Plus, if you have multiple CPUs/cores, more than one thread can be running simultaneously, which is where the MOESI protocols really come into play.
  • cz - Friday, May 20, 2005 - link

    When a thread is blocked it got swapped out of the processor all together. It is the OS's job to check if some conditions are met to re-waken a thread. So a waiting thread will not be actively checking that data at any time.

    Only in single-write/multi-read situation (server/consumer model) those consumer threads are not blocked but actively checking for new data.
  • fitten - Thursday, May 19, 2005 - link

    "When you write a program where the threads are effectively fighting over the ownership of data, particularly in the current designs of multiprocessor (this includes multi-core) cache systems, performance will tank because of all the overhead of taking ownership and such"

    But doesn't AMDs MOESI protocol help avoid this by allowing one cache to copy data from another?"

    No, MOESI doesn't help avoid the problem - It is the mechanism of how the problem is arbitrated and resolved.

    Simplified example: CPU1 wants some data. The cache subsystem uses MOESI to determine that CPU0 currently owns that data. MOESI protocols are then used to transfer the ownership of that data to CPU1 (including copying the data to a different cache if necessary). Meanwhile, one (definitely the writing core) or both cores must wait while the MOESI stuff is done and then CPU1 is allowed to proceed with its write.

    So, you can write a two thread program where each thread does nothing but writes a value into a memory location (both threads write to the same memory location). That cannot be avoided by anything. On every write, MOESI will be invoked to resolve the ownership of the data and make sure the processor currently wanting to write to that memory location owns it. So, these two threads will generate massive amounts of MOESI traffic between the two caches (on a multi-core or multi-processor machine) because both cores want to effectively always own that memory. While MOESI is fast, it still takes time to resolve, longer than not having to do the transfer of ownership and any copying required in any case. So, you have two cores fighting over the data and generating a lot of MOESI overhead which saps performance from both cores (both cores spend a bit of time waiting until the cache tells it that it can do its writing).

    "I agree fully that most multi threaded applications are coarse grained. But there are HPC applications where you can not avoid to work on shared data. I believe fluid dynamics, and OLTP applications that mix writes with reads (and use row locking) are examples."

    Absolutely. There are times when it simply cannot be avoided and must be done. But, if you can avoid it, then you probably want to avoid it :)
  • JohanAnandtech - Thursday, May 19, 2005 - link

    Ahkorishaan:

    Good summary, that is most likely what is happening at Intel.


    bob661:

    "The Quest for More Processing Power, Part Three: ", that doesn't sound like a buyers guide hey? :-)

    nserra:

    Very astute! Ok, ok, "AMDs current dual core architecture is pretty good, let’s wait Until Intel gets it right :-).

    Fitten:

    I agree fully that most multi threaded applications are coarse grained. But there are HPC applications where you can not avoid to work on shared data. I believe fluid dynamics, and OLTP applications that mix writes with reads (and use row locking) are examples.
  • Viditor - Thursday, May 19, 2005 - link

    "When you write a program where the threads are effectively fighting over the ownership of data, particularly in the current designs of multiprocessor (this includes multi-core) cache systems, performance will tank because of all the overhead of taking ownership and such"

    But doesn't AMDs MOESI protocol help avoid this by allowing one cache to copy data from another?
  • fitten - Thursday, May 19, 2005 - link

    Processes that will benefit from fast cache-cache transfers are ones that are multithreaded and the threads are manipulating the same data. There are applications that do this, but usually when you design multi-threaded applications you try to avoid these type situations. When you write a program where the threads are effectively fighting over the ownership of data, particularly in the current designs of multiprocessor (this includes multi-core) cache systems, performance will tank because of all the overhead of taking ownership and such. Shared (L2) caches tend to help this out because the data doesn't actually have to be transfered to the other core's cache as a part of the taking of ownership, the cache line(s) can stay right where they are with only the ownership modified.

    Anyway, HPC code usually goes through pains to avoid the situation where ownership of data must switch between processes/threads often. That's why data partitioning is one of the most important steps of application design in parallel applications.
  • blackbrrd - Thursday, May 19, 2005 - link

    Uhm.. #19 - that is exactly the point, to check if a row is locked you most likely have to query the other caches to see if it is locked or not...
  • JNo - Thursday, May 19, 2005 - link

    "In Part 2, Tim Sweeney, the leading developer behind the Unreal 3 engine, explained the challenges of multi-threaded development of the next generation of games."

    ...before showing off a beautiful working demo of the Unreal 3 engine on the 7-core PS3 cell processor that was put together in only 2 months and that was relatively easy to develop according to the Unreal guys themselves... Ha! (cos Sweeney did downplay the use of multithreading in games if you read his original comments)
  • cz - Thursday, May 19, 2005 - link

    It is an interesting read I would say. But I would like to point out that OLTP programs will not benefit from cache2cache performance very much. That is because the very principle of multi-threaded programming requires the user account to be locked before updating. So only one thread can update an user account at any given time and other threads are blocked. Only programs that use data in single-write and multi-read form will benefit from cache2cache performance. And most likely these applications will be some sort of scientific simulations.

Log in

Don't have an account? Sign up now