The 8 little cores that could

Each core is pretty small, as it has only one pipeline, no Branch Prediction Unit, no OOO buffers, and none of the OOO pipeline stages that search for independent instructions. Only the large register file and thread select logic make the very simple core a bit fatter and more complex.

An 8 KB data cache and 16 KB instruction cache give an L1 hit rate of 90% or less, but they also help to keep each core small. To keep 8 cores with such tiny L1-caches running at 70% efficiency with so many threads, a big L2-cache and massive memory bandwidth are needed.


Fig 5: 8 cores fed by a 3 MB L2-cache and 4 integrated memory controllers. Source: SUN.

SUN provided the T1 with 3 MB of shared L2-cache in 4 banks, and 4 memory controllers, each with 128-bit (16 byte) access to the memory. At 400 MHz, this means that the T1 has access to 400 MHz x 16 B x 4, or no less than 25.6 GB/s. A very fast on-chip crossbar interconnect links all the different components (4 L2 banks, 4 memory controllers, 8 cores and 1 FPU) with a 200 GB/s communication lane. This minimizes the cache coherency overhead: the 8 L1-caches talk to each other over a very fast on-chip interconnect, similar to the dual core Opteron.
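The bandwidth figure is simple arithmetic, and a short Python sketch makes the units explicit. The function and constant names are hypothetical; the inputs are the numbers quoted in the text, and 1 GB is taken as 10^9 bytes.

```python
# Theoretical peak memory bandwidth from the figures quoted in the article.
# All names here are illustrative, not taken from any SUN documentation.
CLOCK_HZ = 400_000_000    # 400 MHz, as stated in the text
BYTES_PER_ACCESS = 16     # 128-bit interface per memory controller
CONTROLLERS = 4           # 4 integrated memory controllers


def peak_bandwidth_gb_s(clock_hz, bytes_per_access, controllers):
    """Return the theoretical peak bandwidth in GB/s (1 GB = 10**9 bytes)."""
    return clock_hz * bytes_per_access * controllers / 1e9


peak = peak_bandwidth_gb_s(CLOCK_HZ, BYTES_PER_ACCESS, CONTROLLERS)
print(f"{peak:.1f} GB/s")
```

This is a peak number only: it assumes every controller transfers 16 bytes on every cycle, which real workloads will not sustain.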

We have quantified this effect of faster cache coherency in our Linux database server article. A dual core Opteron was about 13% faster than two single core CPUs at the same clock speed. With 8 cores that might share data, cache coherency has an even bigger impact on performance. Sharing the L2-cache also ensures that no coherency traffic is necessary on the level 2 cache.


Fig 6: One (yellow) of the 8 cores (gray) of the T1. Source: SUN.

This does not mean that no concessions have been made to keep the die size at 340 mm² and the separate cores cool and small. As you can see in figure 5, only one FPU is available for the 8 cores, and each FP instruction takes no less than 40 cycles. From SUN's developer guide for the UltraSparc T1 [6]:
"As a rough guideline, performance degrades if the number of floating-point instructions exceeds 1 percent of total instructions."
Some instructions like division have long latencies, causing the thread to be skipped. The situation is then similar to a thread with a long latency load. To keep power consumption and die size per core low, each core has a very shallow six-stage pipeline: fetch, thread select, decode, execute, memory, and write back. The result is an architecture that does not need branch prediction, thanks to a shallow pipeline and FMT. However, this limits clock speed to 1.2 GHz in 90 nm, while competing chips are clocking between 2 and 4 GHz.
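To illustrate why skipping a stalled thread keeps a single pipeline busy, here is a toy Python model of thread selection. It is a sketch under stated assumptions, not SUN's actual select logic: the round-robin policy, the thread counts, and the rate of one long-latency instruction in ten are all made up for illustration. Only the 40-cycle latency is borrowed from the text.

```python
from collections import deque


def simulate(num_threads, cycles, stall_every, stall_cycles):
    """Toy model of one single-issue core with several hardware threads.

    Each cycle the core issues one instruction from the next ready thread
    in round-robin order (an assumed policy). Every `stall_every`-th
    instruction of a thread is a long-latency one that makes that thread
    unavailable for `stall_cycles` cycles.
    Returns the fraction of cycles in which an instruction was issued.
    """
    ready_at = [0] * num_threads        # first cycle each thread may issue again
    issued = [0] * num_threads          # instructions issued per thread
    order = deque(range(num_threads))   # round-robin selection order
    busy = 0
    for cycle in range(cycles):
        for _ in range(num_threads):
            t = order[0]
            order.rotate(-1)            # next candidate moves to the front
            if ready_at[t] <= cycle:
                issued[t] += 1
                busy += 1
                if issued[t] % stall_every == 0:
                    # Long-latency instruction: skip this thread for a while.
                    ready_at[t] = cycle + 1 + stall_cycles
                break                   # only one issue slot per cycle
    return busy / cycles


single = simulate(1, 1000, 10, 40)
four = simulate(4, 1000, 10, 40)
print(f"1 thread: {single:.2f}, 4 threads: {four:.2f}")
```

With these invented parameters, the four-thread run issues in a larger share of cycles than the single-thread run, because other threads fill the slots while one thread waits. The exact numbers mean nothing beyond this toy model.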

There is more. Each core has a modular arithmetic unit (MAU) that supports modular multiplication and exponentiation to speed up Secure Sockets Layer (SSL) processing. This compensates for the lack of the FPU and the low clock speed. A single 1.2 GHz MAU seems to "sign" as fast as a 1.8 GHz Opteron, but is quite a bit slower at verifying authenticity.

49 Comments


  • thesix - Thursday, December 29, 2005

    "Hypervisor" is a technology used mostly by IBM from mainframe days. Every system vendor can implement this technology in their systems.
  • pmurphy - Thursday, December 29, 2005

    Actually, let's start by saying you're missed on aceshardware.. and I do have to wonder how you felt about the oath of allegiance to Intel anandtech requires?

    Ah well, all that aside the most glaring omission with respect to the Niagara II is the fact that it has a full floating point component in each core - meaning that the current floating point limitation will largely go away.

    In addition: you cite (as a lot of other people do too) this 1.2 GHz "maximum" as if it had reality - it does not. As issued, the T1 incorporates some design trade-offs that make higher cycle rates impractical, but those are the result of engineering vs. marketing (time and cost) trade-offs, not inherent consequences of the technology. Sun has faster test units running now - with very high end products in the pipeline.
  • defter - Thursday, December 29, 2005

    "Ah well, all that aside the most glaring omission with respect to the Niagara II is the fact that it has a full floating point component in each core - meaning that the current floating point limitation will largely go away."

    The floating point limitation won't go away; 8 FPUs@1.4GHz will just make the floating point capabilities of the chip somewhat useful. For comparison, a dual-core Opteron has 6 FPUs@2.4GHz NOW and in 2007 there will be quad-core Opterons (12 FPUs) available.

    As somebody already mentioned, performance/$ is also very important. While T1 is way faster than any other chip, I guess it will cost much more, probably more than 2 high end dual-core Opterons.

    I'm not saying that T1 isn't good. It is, but only in certain tasks.
  • JohanAnandtech - Thursday, December 29, 2005

    I don't think it is a tradition at Anandtech to swear allegiance to Intel, or else they have forgotten to tell me. :-)

    All jokes aside, when I say Intel has the advantage on hardware VT technology and the software support needed, that is solely based on facts. Sun is actively trying to get full support of Xen (VM), and also Linux and FreeBSD OS support, but for the moment T1 is Solaris only if you want good software support.

    AFAIK there is no indication that SUN can go much faster than 1.2 GHz. To let the 4 threads access a 5.7 KB register file in one cycle is probably limiting the clockspeed, and the 6 stage pipeline is another clear indication that this CPU won't clock much higher. SUN counting on 65 nm to increase the clockspeed higher (1.4 GHz and more) is another indication.


  • ravedave - Thursday, December 29, 2005

    When might we expect to see Anandtech benchmarks? 1-2 months?
  • Puddleglum - Thursday, December 29, 2005

    The [2] SUN T1 benchmarks reference link is pointing to a bizarre location at intel.com. The text says sun.com, but the link points to intel.com.

    It should be fixed to point to: http://www.sun.com/servers/coolthreads/t1000/bench...
  • ncage - Thursday, December 29, 2005

    It looks like Sun is back with a vengeance. This thing seems perfect for the server market. I am really surprised that they were able to get their $hit back together. I doubt the single threaded performance on this thing would be that great but, then again, who cares? This thing is a server, not a workstation made for single threaded use. This thing would be perfect for virtualization. I don't know if this is possible for Solaris, or maybe VMware/MS Virtual Server will have this feature in the future, but hopefully they will allow you to allocate which core goes to which virtualization layer. So say you're running 4 OSes and you have 8 cores. You allocate 2 cores to each OS. You notice that 2 of the four have really high CPU utilization. You could then dynamically add one more core to each of the virtualized OSes that had high CPU usage, taken from the ones that had low CPU usage. For those of you who think virtualization isn't a big deal... now wouldn't this be cool.
  • Slaimus - Thursday, December 29, 2005

    Are these benchmarks all running similar TCP/IP stacks? We all know Solaris 10 has a new TCP/IP stack that is much faster than Linux.
  • Puddleglum - Thursday, December 29, 2005

    The benchmarks are from Sun's website (http://www.sun.com/servers/coolthreads/t1000/bench...)
    "SPECjAppServer2004 is the only industry-standard benchmark used for Java Enterprise Edition application servers."

    So, yes, you can assume they're all using the same TCP/IP stack. But, as the article mentions: "Of course, this is an ideal benchmark for the T1 with many java threads."
