The Slim T1 CPU

It is very unfair of us to compare one of the eight very slim T1 cores to mammoths like the Opteron or the Xeon, which have about 10 to 20 times more transistors. Still, we are curious. We know that Sun sacrificed single-threaded performance on the altar of power consumption, multi-threaded performance and die space. How far did they go? Let us find out with LMBench 3.0a. By the way, you can find much more information about the T1 CPU in our previous article.

First, we check the cache latency and RAM latency. For fat modern superscalar cores like the Opteron and Xeon, these numbers are extremely important. The T1 is less sensitive to memory latency as long as it has enough threads to run: when a thread stalls waiting for memory, the core switches to another thread that is ready to execute.
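To make the latency-hiding argument concrete, here is a minimal sketch of a simplified round-robin model. The model and the cycle counts are illustrative assumptions, not figures from Sun: each thread alternates between a burst of work and a memory stall, and the core runs another thread whenever one is stalled.

```python
def core_utilization(threads: int, compute_cycles: float, stall_cycles: float) -> float:
    """Fraction of cycles in which the core has a ready thread to run.

    Simplified model (an assumption, not Sun's description): every thread
    repeats `compute_cycles` of work followed by `stall_cycles` of waiting
    on memory, and switching between threads costs nothing.
    """
    if threads < 1 or compute_cycles <= 0 or stall_cycles < 0:
        raise ValueError("need threads >= 1, compute_cycles > 0, stall_cycles >= 0")
    # One thread keeps the core busy for compute/(compute + stall) of the time;
    # additional threads fill the stall slots until the core is saturated.
    return min(1.0, threads * compute_cycles / (compute_cycles + stall_cycles))


if __name__ == "__main__":
    # Hypothetical workload: 25 cycles of work per 75-cycle memory stall.
    for n in (1, 2, 4):
        print(n, "thread(s):", core_utilization(n, 25, 75))
```

In this toy model a single thread keeps the core busy a quarter of the time, while four threads (the number of hardware threads in one T1 core) are enough to hide the stall completely.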

CPU (LMBench)          | OS            | Clock (MHz) | L1 (ns) | L1 (cycles) | L2 (ns) | L2 (cycles) | RAM (ns) | RAM (cycles)
Opteron 275            | SunOS 5.10    | 2211        | 1.357   | 3           | 5.436   | 12          | 67.5     | 149
Pentium-M 1.6 GHz      | Linux 2.6.15- | 1593        | 1.880   | 3           | 6       | 10          | 92.1     | 147
Sun T1 1 GHz           | SunOS 5.10    | 980         | 3.120   | 3           | 22.1    | 22          | 107.5    | 105
Opteron 275            | Linux 2.6.15- | 2209        | 1.357   | 3           | 5       | 12          | 73       | 161
Xeon Irwindale 3.6 GHz | Linux 2.6.15- | 3594        | 1.110   | 4           | 8       | 28          | 48.8     | 175

Sun has definitely favoured power consumption here. A 3-cycle L1 latency at 1 GHz on a 90 nm process is very conservative. A 22-cycle L2-cache latency is even a bit slow, but again, the thread Gatling gun takes care of that. The built-in memory controllers pay off: RAM latency is about 105 cycles, while even the Pentium-M needs 147 cycles. This helps to keep the average latency low from the CPU's point of view.
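The cycle counts follow from the measured nanoseconds and the measured clock speed. A small sketch of that conversion, using three rows of the table above (the function and variable names are ours, chosen for illustration):

```python
def ns_to_cycles(latency_ns: float, clock_mhz: float) -> float:
    """Convert a latency in nanoseconds to clock cycles.

    A clock of f MHz ticks f / 1000 times per nanosecond.
    """
    return latency_ns * clock_mhz / 1000.0


# (clock in MHz, L1 ns, L2 ns, RAM ns), copied from the LMBench table.
MEASURED = {
    "Sun T1 (SunOS 5.10)": (980, 3.120, 22.1, 107.5),
    "Opteron 275 (SunOS 5.10)": (2211, 1.357, 5.436, 67.5),
    "Pentium-M (Linux)": (1593, 1.880, 6.0, 92.1),
}

if __name__ == "__main__":
    for cpu, (mhz, l1, l2, ram) in MEASURED.items():
        cycles = [round(ns_to_cycles(ns, mhz)) for ns in (l1, l2, ram)]
        print(cpu, "L1/L2/RAM cycles:", cycles)
```

Rounded to whole cycles, this reproduces the cycle columns of the table, for example 107.5 ns at 980 MHz is about 105 cycles.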

Let us see if there is some integer crunching power in the little Sparc core.

CPU (LMBench)          | OS            | Bit (ns) | Add (ns) | Mul (ns) | Div (ns) | Mod (ns)
Opteron 275            | SunOS 5.10    | 0.45     | 0.45     | 1.36     | 18.60    | 19.00
Pentium-M 1.6 GHz      | Linux 2.6.15- | 0.63     | 0.63     | 2.51     | 19.50    | 11.50
Sun T1 1 GHz           | SunOS 5.10    | 1.01     | 1.00     | 29.10    | 104.00   | 114.00
Opteron 275            | Linux 2.6.15- | 0.45     | 0.45     | 1.36     | 18.60    | 19.00
Xeon Irwindale 3.6 GHz | Linux 2.6.15- | 0.28     | 0.28     | 2.79     | 17.30    | 23.30

The very common ADD instruction is executed in one cycle, but it takes no less than 29 cycles to multiply and 104 to divide. Faster mul and division would have taken up much more die space and consumed much more power. Considering that those instructions are very rare in most server workloads, this is a pretty clever trade-off. Update: the Sun documentation tells us 7-11 cycles for multiply and 72 for division.
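How much the slow multiply and divide hurt depends entirely on how often they occur. The sketch below weights the measured latencies by an instruction mix; the mix itself is a hypothetical assumption for illustration, not a measured server profile, and treating every other integer instruction as costing the same as ADD is a crude simplification.

```python
# Measured LMBench latencies in ns, copied from the table above (SunOS rows).
T1_NS = {"add": 1.00, "mul": 29.10, "div": 104.00}
OPTERON_NS = {"add": 0.45, "mul": 1.36, "div": 18.60}

# Hypothetical mix (assumption): 99% simple ops, 0.8% multiplies, 0.2% divides.
HYPOTHETICAL_MIX = {"add": 0.99, "mul": 0.008, "div": 0.002}


def average_ns(latency_ns: dict, mix: dict) -> float:
    """Average latency per operation for a mix whose shares sum to 1."""
    if abs(sum(mix.values()) - 1.0) > 1e-9:
        raise ValueError("instruction shares must sum to 1")
    return sum(latency_ns[op] * share for op, share in mix.items())


if __name__ == "__main__":
    print("T1 average ns/op:     ", average_ns(T1_NS, HYPOTHETICAL_MIX))
    print("Opteron average ns/op:", average_ns(OPTERON_NS, HYPOTHETICAL_MIX))
```

With no multiplies or divides at all, the average is simply the ADD latency; on the T1, every additional percent of divides in the mix adds roughly one nanosecond to the average, so the trade-off only holds for code where these instructions really are rare.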

Let us check out what the lonely FPU of the T1 can do.

CPU (LMBench)          | OS            | FADD (ns) | FMUL (ns) | FDIV (ns)
Opteron 275            | SunOS 5.10    | 1.80      | 1.80      | 10.90
Pentium-M 1.6 GHz      | Linux 2.6.15- | 1.88      | 3.14      | 23.90
Sun T1 1 GHz           | SunOS 5.10    | 26.50     | 29.30     | 54.20
Opteron 275            | Linux 2.6.15- | 1.81      | 1.81      | 9.58
Xeon Irwindale 3.6 GHz | Linux 2.6.15- | 1.39      | 1.95      | 12.60

FADD and FMUL are a little faster than what we first reported (40 cycles), and the main part of that latency might just consist of getting the data to the FPU of the T1. It is clear that the Sun T1 doesn't like FP code at all.


26 Comments

  • drw - Friday, March 24, 2006

    Based on the kernel versions listed, I assume that a 32-bit distro was used?

    If so, am curious how a 64-bit distro would compare, as both Apache and MySQL benefit greatly by 64 bit.
  • JohanAnandtech - Friday, March 24, 2006

    Fully 64 bit. uname -a clearly indicates 64 bit
  • defter - Friday, March 24, 2006

    quote:

    At first sight, Sun has won the performance/watt battle for now


    The dual Opteron 275HE had 5% higher power consumption (198W vs 188W), but it was 5-30% faster (depending on whether or not gzip was used). These results would suggest that the dual Opteron has won the performance/watt battle in these benchmarks.

    Pricing is also quite important. What's the price for dual Opteron 275HE server with 8GB of memory? About $5000-7000?
  • PeterMobile - Friday, March 24, 2006

    Definitely interesting to see a third-party review of the T2000. I think it could also be interesting to compare both the Sun machine and the x86 servers to an IBM p5 510Q. That's a 4-way 1.5 GHz Power5+, which including 4 GB RAM and 2 Ultra320 disks lists for $8,536.
  • Calin - Friday, March 24, 2006

    I saw there is almost no loss of performance for compressing data... how about encrypting it?
  • cxl - Friday, March 24, 2006

    quote:


    The very common ADD instruction is executed in one cycle, but it takes no less than 29 cycles to multiply and 104 to divide. Faster mul and division would have taken up much more die space and consumed much more power. Considering that those instructions are very rare in most server workloads, this is a pretty clever trade-off.


    Actually, the MOD operation can be very important for servers, as it is the basis for any hashing operation, commonly used in many server applications. E.g. to identify a variable in a script, interpreters routinely use hashtables.

    114 cycles per MOD operation is a performance disaster.
  • Calin - Friday, March 24, 2006

    The performance in the tested configuration was quite good - I wonder what other benchmarks, and maybe other "twists" of the tested benchmark, would look like.
  • cosmotic - Friday, March 24, 2006

    quote:

    Last, but certainly least, Sun’s solid engineering has impressed us.


    Did you mean certainly NOT least?
  • JohanAnandtech - Friday, March 24, 2006

    definitely ... Fixed. Just checking if you read it carefully :-)
  • cosmotic - Friday, March 24, 2006

    Why no graphs? It makes reading benchmarks SO much easier.
