Thread Machine Gun

Besides hard-to-predict branches and high memory latency, server applications on MP systems also get slowed down by high latency network communication and cache coherency (keeping all the data coherent across the different caches; read more here).

To summarize, the challenges and problems that server CPUs face are:
  1. Memory latency, load to load dependencies
  2. Branch misprediction
  3. Cache Coherency overhead
  4. Keeping Power consumption low
  5. Latency of the Network subsystem
So, how did Les Kohn, Dr. Marc Trembley, Poonacha Kongetira, Kathirgamar Aingaran and other engineers at SUN attack these problems? Let us take deeper look at Niagara or the UltraSparc T1.

Memory latency is by far the worst problem, causing a typical server CPU to be idle for 75% of the time. So, this is the first problem that the SUN/Afara engineers attacked.

The 8 cores of the 64 bit T1 can process 8 instructions per cycle, each of a different thread, so you might think that it is a just a massive multi-core CPU. However, the register file of each core keeps track of 4 different active threads contexts. This means that 4 threads are "kept alive" all the time by storing the contents of the General Purpose Registers (GPR), the different status registers and the instruction pointer register (which points to the instruction that should be executed next). Each core has a register file of no less than 640 64-bit registers, 5.7 KB big. That is pretty big for a register file, but it can be accessed in 1 cycle.

Each core has only one pipeline. During every cycle, each core switches between the 4 different active threads contexts that share a pipeline. So at each clock cycle, a different thread is scheduled on the pipeline in a round robin order; to put it more violently: it is a machine gun with threads instead of bullets.

In a conventional CPU, such a switch between two threads would cause a context switch where the contents of the different registers are copied to the L1-cache, and this would result in many wasted CPU cycles when switching from one thread to another thread. However, thanks to the large register file which keeps all information in the registers and Special Thread Select Logic, a context switch doesn't require any wasted CPU cycles. The CPU can switch between the 4 active threads without any penalty, without losing a cycle. This is called Fine Grained Multi-threading or FMT.


Fig 3: The SUN T1 Pipeline. Source:SUN [1].

This Gatling gun of 4 shooting threads per core solves both the branch and the memory latency problem. If a thread issues a load, it takes 3 cycles to get the data from the 8 KB L1-cache. If the 4 threads are active, the thread that issued the load will not be active for 3 cycles anyway, as the other 3 threads get their one cycle. By the time that the first thread is ready to make use of the load of data, the load is finished. If the load takes longer, because it has to access the L2-cache or the memory, the thread is simply skipped, and one of the other 3 threads gets its timeslot until the load is finished.

If a branch is encountered, no branch prediction is performed: it would only waste power and transistors. No, the condition on which the branch is based is simply resolved. The CPU doesn't have to guess anymore. The pipeline is not stalled because other threads are switched in while the branch is resolved. So, instead of accelerating the little bit of compute time (10-15%) that there is, the long wait periods (memory latencies, branches) of each thread is overlapped with the compute time of 3 other threads.


Fig 4: Fine Grained Chip Multi-threading in action. Source:SUN.

The result is a win-win situation: the overall throughput increases, as 4 threads get done in a little bit more time than what it would take to perform one thread on a conventional CPU. In other words, the throughput increases significantly. Secondly, the single issue pipeline is used much more efficiently: SUN claims that the pipeline is used 70 to 85% of the time.

So, one core gets an IPC of about 0.7, which is very roughly twice as good as a big 3-way superscalar CPU with branch prediction and big OOO buffers would do, and it takes less chip logic to accomplish.
Index The 8 little cores that could
Comments Locked

49 Comments

View All Comments

  • thesix - Thursday, December 29, 2005 - link

    "Hypervisor" is a technology used mostly by IBM from mainframe days. Every system vendor can implement this technology in their systems.
  • pmurphy - Thursday, December 29, 2005 - link

    Actually lets start by saying you're missed on aceshardware.. and I do have to wonder how you felt about the oath of allegiance to Intel anandtech requires?

    Ah well, all that aside the most glaring omission with respect to the Niagara II is the fact that it has a full floating point component in each core - meaning that the current floating point limitation will largely go away.

    In addition: you cite (as a lot of other people do to) this 1.2Ghz "maximum" as if it had reality - it does not. As issued, the T1 incorporates some design trade-offs that make higher cycle rates impractical, but those are the result of engineering vs. marketing (time and cost) trade-offs, not inherent consequences of the technology. Sun has faster test units running now - with very high end products in the pipeline.
  • defter - Thursday, December 29, 2005 - link

    "Ah well, all that aside the most glaring omission with respect to the Niagara II is the fact that it has a full floating point component in each core - meaning that the current floating point limitation will largely go away."

    Floating point limitation won't go away, 8 FPUs@1.4GHz will just make floating point capabilities of the chip somehow useful. For the comparison dual-core Opteron has 6 FPUs@2.4GHz NOW and in 2007 there will be quad-core Opterons (12 FPUs) available.

    As somebody already mentioned, performance/$ is also very important. While T1 is way faster than any other chip, I guess it will cost much more, probably more than 2 high end dual-core Opterons.

    I'm not saying that T1 isn't good. It is, but only in certain tasks.
  • JohanAnandtech - Thursday, December 29, 2005 - link

    I don't think it is a tradition at Anandtech to swear allegiance to Intel, or either they have forgotten to tell me.:-)

    All jokes aside, When I say Intel has the advantage on hardware VT technology and the software support needed, that is solely based on facts. Sun is actively trying to get full support of Xen (VM), and also Linux and FreeBSD OS support, but for the moment T1 is Solaris only if you want good software support.

    AFAIK there is no indication that SUN can go much faster than 1.2 GHz. To let the 4 threads access a 5.7 KB register file in one cycle is probably limiting the clockspeed, and the 6 stage pipeline is another clear indication that this CPU won't clock much higher. SUN counting on 65 nm to increase the clockspeed higher (1.4 GHz and more) is another indication.


  • ravedave - Thursday, December 29, 2005 - link

    When might we expect to see Anandtech benchmarks? 1-2 months?
  • Puddleglum - Thursday, December 29, 2005 - link

    The [2] SUN T1 benchmarks reference link is pointing to a bizarre location at intel.com. The text says sun.com, but the link points to intel.com.

    It should be fixed to point to: http://www.sun.com/servers/coolthreads/t1000/bench...">http://www.sun.com/servers/coolthreads/t1000/bench...
  • ncage - Thursday, December 29, 2005 - link

    It looks like sun is back with a vengance. This thing seems perfect for the server market. I am really suprised that they were able to get their $hit back together. I dought the single threaded performance on this thing would be that great but, then again, who cares this thing is a server not a workstation made for single threaded use. This thing would be perfect for virtualization. I don't know if this is possible for solaris or maybe vmware/ms virtual server will have this feature in the future but hopefully they will allow you to allocate which core to which virtualization layer that you want. So say your running 4 OS and you have 8 cores. You allocate 2 cores to each OS. You notice that 2 of the four high really high cpu utilization. You could then dynamically add one more core to each of the virtualized OS that had high cpu usage from the ones that had low cpu usage. For those of you who think virtualization isn't a big deal...now wouldnt' this be cool.
  • Slaimus - Thursday, December 29, 2005 - link

    Are these benchmarks all running similar TCP/IP stacks? We all know solaris 10 has a new TCP/IP stack that is much faster than linux.
  • Puddleglum - Thursday, December 29, 2005 - link

    The benchmarks are from Sun's website (http://www.sun.com/servers/coolthreads/t1000/bench...">link)
    "SPECjAppServer2004 is the only industry-standard benchmark used for Java Enterprise Edition application servers."

    So, yes, you can assume they're all using the same TCP/IP stack. But, as the article mentions: "Of course, this is an ideal benchmark for the T1 with many java threads."

Log in

Don't have an account? Sign up now