Thread Machine Gun

Besides hard-to-predict branches and high memory latency, server applications on MP systems also get slowed down by high latency network communication and cache coherency (keeping all the data coherent across the different caches; read more here).

To summarize, the challenges and problems that server CPUs face are:
  1. Memory latency, load to load dependencies
  2. Branch misprediction
  3. Cache Coherency overhead
  4. Keeping Power consumption low
  5. Latency of the Network subsystem
So, how did Les Kohn, Dr. Marc Trembley, Poonacha Kongetira, Kathirgamar Aingaran and other engineers at SUN attack these problems? Let us take deeper look at Niagara or the UltraSparc T1.

Memory latency is by far the worst problem, causing a typical server CPU to be idle for 75% of the time. So, this is the first problem that the SUN/Afara engineers attacked.

The 8 cores of the 64 bit T1 can process 8 instructions per cycle, each of a different thread, so you might think that it is a just a massive multi-core CPU. However, the register file of each core keeps track of 4 different active threads contexts. This means that 4 threads are "kept alive" all the time by storing the contents of the General Purpose Registers (GPR), the different status registers and the instruction pointer register (which points to the instruction that should be executed next). Each core has a register file of no less than 640 64-bit registers, 5.7 KB big. That is pretty big for a register file, but it can be accessed in 1 cycle.

Each core has only one pipeline. During every cycle, each core switches between the 4 different active threads contexts that share a pipeline. So at each clock cycle, a different thread is scheduled on the pipeline in a round robin order; to put it more violently: it is a machine gun with threads instead of bullets.

In a conventional CPU, such a switch between two threads would cause a context switch where the contents of the different registers are copied to the L1-cache, and this would result in many wasted CPU cycles when switching from one thread to another thread. However, thanks to the large register file which keeps all information in the registers and Special Thread Select Logic, a context switch doesn't require any wasted CPU cycles. The CPU can switch between the 4 active threads without any penalty, without losing a cycle. This is called Fine Grained Multi-threading or FMT.


Fig 3: The SUN T1 Pipeline. Source:SUN [1].

This Gatling gun of 4 shooting threads per core solves both the branch and the memory latency problem. If a thread issues a load, it takes 3 cycles to get the data from the 8 KB L1-cache. If the 4 threads are active, the thread that issued the load will not be active for 3 cycles anyway, as the other 3 threads get their one cycle. By the time that the first thread is ready to make use of the load of data, the load is finished. If the load takes longer, because it has to access the L2-cache or the memory, the thread is simply skipped, and one of the other 3 threads gets its timeslot until the load is finished.

If a branch is encountered, no branch prediction is performed: it would only waste power and transistors. No, the condition on which the branch is based is simply resolved. The CPU doesn't have to guess anymore. The pipeline is not stalled because other threads are switched in while the branch is resolved. So, instead of accelerating the little bit of compute time (10-15%) that there is, the long wait periods (memory latencies, branches) of each thread is overlapped with the compute time of 3 other threads.


Fig 4: Fine Grained Chip Multi-threading in action. Source:SUN.

The result is a win-win situation: the overall throughput increases, as 4 threads get done in a little bit more time than what it would take to perform one thread on a conventional CPU. In other words, the throughput increases significantly. Secondly, the single issue pipeline is used much more efficiently: SUN claims that the pipeline is used 70 to 85% of the time.

So, one core gets an IPC of about 0.7, which is very roughly twice as good as a big 3-way superscalar CPU with branch prediction and big OOO buffers would do, and it takes less chip logic to accomplish.
Index The 8 little cores that could
Comments Locked

49 Comments

View All Comments

  • thesix - Friday, December 30, 2005 - link

    If you're talking about POWER5's SMT, currently it provides two HW threads per core:
    http://publib.boulder.ibm.com/infocenter/pseries/i...">http://publib.boulder.ibm.com/infocente...x.doc/ai...

    If you look closer at T1, the best one has 8 cores, each core supports four HW threads.
    http://www.sun.com/processors/UltraSPARC-T1/">http://www.sun.com/processors/UltraSPARC-T1/

    SMT and CMT appear to be the same type of technology (at least conceptual wise) with different names from two vendors.

    > The very very poor FP performance of T1 is the truth.
    > We have to remind ourselves that it is only a integer CPU. It's FP performance is too terrible.

    OK. Since you have repeated so many times, I am sure everyone who's reading this will remember, and I do not disagree :-).

    Thanks.
  • Betwon - Friday, December 30, 2005 - link

    We think that it is diffirent between CMT and SMT.

    For exapmle:
    P4 630 is a kind of SMT CPU, but not a CMT CPU.
    AthlonX2 is a kind of CMT CPU, but not a SMT CPU.

    From anandtech:
    T1 has no branch prediction,and it has only one-instruction-issue/core, 8KB L1D/core(too few for 4 threads to use).

    POWER5 has 32KB L1D/core, which is used by two threads.

    We think that the SMT of T1 may be OK, unless 4 threads only use very few L1D cache(It is impossible for most cases)
  • Betwon - Friday, December 30, 2005 - link

    edit:
    The only explain about how to improve the efficiency(very poor) is to use SMT to hide the stall's latency(by branch miss/cache miss ect.)

    But a core has only 8KB L1(which will be used by 4 threads), the cache miss will increase. It is possible to become worst.
  • Betwon - Friday, December 30, 2005 - link

    edit: T1 have no branch prediction and it has only one_inst_issue/core.
  • Brian23 - Friday, December 30, 2005 - link

    Obviously the apps that they used to benchmark in this article like running on the chip. Also, this chip doesn't run windows. It runs Sun's proprietary operating system. (I forgot what it's called.) Sun will give this new chip software support because they want it to do well.

    I think I read in the article that the chip is backwards compatable with the previous design Sun chips, meaning a lot of software is already available that will run on the chip.
  • Betwon - Friday, December 30, 2005 - link

    NO!

    It is too narrow for the areas of 32-thread-parallel-well apps.

    'have many threads' is not equal to '32-thread-parallel-well'!

    Even there are 32 threads, but without parallel-well , This new CPU will waste more than 90% of it's potential.

    The efficiency of Itanium( Itanium is capable of a 1.3-1.5 IPC) is much better than x86-CPU(0.7-0.9 IPC). Itanium never used OOO logic and long pipelines.
  • Betwon - Friday, December 30, 2005 - link

    The efficiency of Itanium2 is still better than IBM's POWER5, and a Itanium2 core may retire 6 instrutions/cycle,and POWER5's can retire 5-instrutions/cycle.

    But a core of this new CPU is only one instrutions/cycle.
  • Brian23 - Friday, December 30, 2005 - link

    I think you missed the part where x86 chips spend 400 cycles waiting on memory accesses when the Sun chip just keeps chugging with another thread while the load is happening.
  • Calin - Tuesday, January 3, 2006 - link

    Those 400 cycles are related to the higher clock speed (if your processor would be twice as slow, it would wait only 200 cycles). I assume the 400 cycles are based on the Xeon processor (that has high clock speed and slower FSB).
  • Betwon - Friday, December 30, 2005 - link

    NO!
    It is not true for all the x86 CPU.When Athlon64 spend many cycles waiting on memory accesses,
    For P4 with HT,P4 just keeps chugging with another thread while the load is happening.

    Do you understand what I want to say?

Log in

Don't have an account? Sign up now