The SUN benchmarks ...

Although we haven't run benchmarks yet, the benchmarks that SUN presents[2] are still interesting. We'll delve deeper once we have our own benchmarks. The power consumption numbers are estimates. We tried to give you both the typical and the maximum values. Some manufacturers give only typical numbers (Intel, IBM) while others only give maximum numbers (AMD), so we had to find other sources and base our estimates upon them.

JBB2005 represents an order processing application for a wholesale supplier written in Java.

Specjbb2005

System CPU Power Dissipation CPUs (Estimated) Number of cores Number of Active threads Score Percentage score
Sun Fire T2000 1x 1.2GHz UltraSPARC T1 72-79 W 8 32 63,378 160%
Sun Fire X4200 2x 2.4GHz DC Opteron 150-180 W 4 4 45,124 114%
IBM p5 550 2x 1.9GHz POWER5+ 320-360 W 4 8 61,789 156%
IBM xSeries 346 2x 2.8GHz DC Xeon 270-300 W 4 8 39,585 100%

The performance of the T1 is simply amazing. Of course, this is an ideal benchmark for the T1 with many java threads. The Power 5+ is the only one that comes close, as it can process 8 threads simultaneously just like the T1. But it consumes +/- 4 times more than the T1.

SPECweb2005 emulates users sending browser requests over broadband Internet connections to a web server. It provides three new workloads: a banking site (HTTPS), an e-commerce site (HTTP/HTTPS mix), and a support site (HTTP). Dynamic content is implemented in PHP and JSP.

Specweb2005

System Processors Power Dissipation CPUs (Estimated) Number of cores Number of Active threads Score Percentage score
Sun Fire T2000 1x 1.2GHz UltraSPARC T1 72-79 W 8 32 14,001 289%
IBM p5 550 2x 1.9GHz POWER5+ 320-360 W 4 8 7,881 162%
IBM xSeries 346 2x 3.8GHz Xeon 220-260 W 4 4 4,348 90%
Dell 2850 2x 2.8GHz DC Xeon 260-300 W 4 8 4,85 100%

Here, the T1 is by far the best CPU. This is, however, a very hard to interpret benchmark. For example, back in 2003, I did some benchmarking on a JSP server. Our first results were very weird: a single Xeon performed just as well as a dual Xeon, despite the fact that the Gigabit PCI NIC was not at its limits at all (about 180 Mbit/s). Once we used an Intel NIC, things became better, but the network bottleneck wasn't gone before we used a CSA (directly connected to the Northbridge) Intel NIC. The benchmark depends more on the quality of the NIC driver, the latency from the NIC to the memory (DMA) and of course, the quality of the NIC chip itself than on the CPU. That being said, it is clear that Web servers spawn a lot of threads that do not require a lot of processing unless they are encrypted. So, this is the natural habitat of the T1 CPU. As long as you can make sure that the CPU is the bottleneck, the CPU which can perform the most threads per cycle will win.

SAP 2 Tier is based on the number one ERP software. The database back-end and application run on the same machine.

System Processors Power Dissipation CPUs (Estimated) Number of cores Number of Active threads Score Percentage score
Sun Fire T2000 1x 1.2GHz UltraSPARC T1 72-79 W 8 32 4780 97%
IBM p5 550 2x 1.9GHz POWER5+ 320-360 W 4 8 5020 102%
HP DL580 4x 3.33GHz Xeon MP 440-520 W 4 8 4700 96%
HP DL385 2x 2.2GHz DC Opteron 140-180 W 4 4 4920 100%

SAP 2-tier is a typical example of a benchmark with very low IPC. However, some of the queries are more complex, so the T1 cannot outperform the fatter cores. Still, the performance per watt is unbeatable.


Unbeatable?

The words "paradigm shift" and "disruptive" technology have been abused so many times that we don't like to use them. But in the case of the T1 CPU, it wouldn't be exaggerated to say that it is the herald of a new generation of server CPUs, and that it has disrupted the server market. Single core, single threaded CPUs do not have a chance in this market anymore. Does this also signal the end of superscalar CPUs in the server environment? Is the massive multi-core with scalar cores the future for the entire server world? The SUN UltraSparc T1 simply wipes the floor with the competition when it comes to performance per Watt. According to this metric, the UltraSparc T1 is 4 to 12 times better.


Fig 7: The cores of the T1 processor are hardly warmer than the rest of the die. A "fat" core has much more hotspots.

However, we think that there are also opportunities for the fatter cores. The main weakness of the T1 is the shallow pipeline and clock speed. The need to be compatible with the previous Sparcs and thus, the need for the relatively big Register Window system (with 1 cycle access) also limits clock speed. While the competition has bigger cores, it does not need as many cores as the T1. Each superscalar core could make better use of its resources by using Coarse Grained Multi threading (Montecito), FMT or SMT (Power 5). That should allow these kinds of cores to achieve higher IPC per core. Clock speed can be 2- 3 times higher, allowing two dual cores or one quad core "fat" CPUs to outperform the T1.

These kinds of CPUs consume quite a bit more power, but as long as this extra power usage is not dramatically higher, fat cores might still have a good chance in the market. After all, it is total system power that counts, and large RAID arrays and AC units often represent larger power draws than just the CPU. With the exception of the web server market, power consumption is not the number one priority most of the time, although it is important.

A study sponsored by SUN[3] shows that the best results in commercial server loads are achieved with 4 to 6 threads per core, combined with 2 to 3-way superscalar in order cores. This is another indication that there is a lot of room for very different multi-core approaches such as Intel's Montecito, IBM Power 6+ and upcoming multi-core Xeons and Opterons. A multi-threaded 64-bit version of Sossaman (31 Watt TDP per two cores) could also threaten the UltraSparc T1.

In some server related markets, fat multi-cores might even be more preferable. Once such market is the OLAP databases, where very complex queries are sent by a limited number of users. The response time of the T1 could be rather mediocre there, while a higher clocked CPU with fewer cores could be quite a bit more responsive in these loads. Also, OLAP queries that calculate statistical data will use more FP instructions.
The 8 little cores that could Virtualization
Comments Locked

49 Comments

View All Comments

  • Betwon - Thursday, December 29, 2005 - link

    Why? Really?

    quote:

    only one FPU is available for the 8 cores, and each FP instruction takes no less than 40 cycles.


    It shows that the performance of FP apps is very very poor!!!
    We can't believe it.

    It is terrible for many FP apps.

    Now, we know that the new CPU is only for the integer/32-thread-parallel-well apps.
  • JarredWalton - Friday, December 30, 2005 - link

    How many FP instructions do you think a high-end web server runs? Try to think outside the box for a minute, rather than comparing it to HPC-oriented chips. Itanium wastes more than 80% of it's potential when running many database loads, and it does better than some of the other alternatives. Spending lots of die space on OOO logic and long pipelines isn't always the best solution, especially if you can guarantee that most code will have many threads. Quit thinking Half-Life and other games for a minute and try to shift to the big iron server world.
  • Betwon - Friday, December 30, 2005 - link

    Only one FP unit? not less than 40 cycles latency?

    If it is true:

    The new CPU will be slower than P3@450MHz in the area of FP apps.
  • Brian23 - Friday, December 30, 2005 - link

    who cares. That's not what it's designed to do. The only reason that it has the floating point core is for the rare occation when a FP op is needed.
  • Betwon - Friday, December 30, 2005 - link

    It means that this new CPU does not fit for the FP apps. Maybe a old CPU(10 years old) can beat it.

    Now, we know that the apps-area of this new CPU is very very spec...

    It is too difficult to find apps(2-thread-paralle-well or more) for P4, how to find the apps(32-thread-paralle-well) easily?
  • thesix - Friday, December 30, 2005 - link

    Betwon,

    You seem to be confused with the concept of multi-threading v.s. multi-tasking.

    You do NOT need to find an app that runs 32 parallel threads in one process.
    You can simple run 32 _instances_ of that app, for example,
    or, run 32 different apps even if everyone of them is single-threaded.
    A typical server environment is just like that.

    When we talk about Chip Multi-Threading (CMT), it's the _hardware_ thread, which is a totally different concept than software thread. Once hardward thread represents the capability of running one computing task, it does care where this task comes from the same app/process or not.

    A perfect example is the Apache webserver, IIRC, at least in version 1.x.y (which is still the most popular version), the apache http server process is single-threaded. A new process is forked for each (or a group of) new http request. The more hardware threads you have, the more requests you can handle in parallel. Of course, the faster each hardward thread (or core, or cpu) is, the more requests it can handle in a given amount of time, but not in parallel.

    It is also true that _most_ database out there doesn't use _any_ floatpoint computation.

    So, if you think about it, the market for T1 type of CPU/server is not a small one.

    The bottom line is, T1 excels at througput/Watt and througput/chip.
    It's a well kown fact that it sucks at single-task or floatpoint computation.
  • Betwon - Friday, December 30, 2005 - link

    NO!

    You seem to be confused with the concept of SMT v.s. CMT.

    It is very low efficient, if T1 only use CMT but not use SMT.

    T1 have no branch prediction and one_inst_issue/core, very very poor FP performacne.
    The only explain about how to improve the efficiency(very poor) is to use SMT to hide the latency(by branch miss/cache miss ect.)

    But it has only 8KB L1(which will be used by 4 threads), the cache miss will increase. It is possible to become worst.
  • thesix - Friday, December 30, 2005 - link

    Explain to me the conceptual difference between SMT and CMT?
    All you have said is the (component) _implementation_ difference between T1 and POWER in achieving hardware threading.

    Since you appear to know this topic quite well, why the ignorant comment like this:
    "It is too difficult to find apps(2-thread-paralle-well or more) for P4, how to find the apps(32-thread-paralle-well) easily?"
    and kept screaming about the lack of floatpoint performance?

    I simply don't understand why you're so upset.
  • Betwon - Friday, December 30, 2005 - link

    My english has some problem.
    I think that T1 use both CMT and SMT.
    SMT -- one core with four threads
    CMT -- one CPU with eight cores

    If without SMT, cores of T1 will be very poor efficient (because of the stall's latency caused by branch miss/cache miss).

    The very very poor FP performance of T1 is the truth.
    We have to remind ourselves that it is only a integer CPU. It's FP performance is too terrible.
  • fitten - Sunday, January 1, 2006 - link

    There is no "reminding" anyone of the poor FPU performance. The thing was never designed to be strong in FPU (quite obviously). It has "enough" FPU so that it doesn't have to do software emulation and that's it. So... going on and on about FPU performance is a useless argument here. Sun (nor anyone talking about the T1s) has ever said that it would be good at FPU perforamnce because it wasn't designed to be.

    This CPU was designed for servers. Servers typically have high cache miss rates anyway because of a number of things (streaming any kind of I/O doesn't have much data locality advantages). Server processes also typically have lots of I/O stalls. When a context stalls, each core has multiple other contexts to chose from in order to keep running.

    So, I think the points you are trying to stress are quite obvious from the design of the CPU and the types of loads it was designed to handle. Yes, poor FPU performance obvious from having very limited (and slow) FPU resources. Yes, if you aren't running lots of threads the machine is inefficient because the thing is designed to take advantage of server type threads where there will be lots of I/O stalling and if there is nothing else to run while waiting on the I/O requests to finish, it sits idle (much like any other machine). Yes, in-order execution and the lack of branch prediction will not mask any stalls the instruction stream will generate (which is OK because the design of the CPU actually counts on these stalls to happen so that lots of nice SMT can happen).

    It sounds like you are in violent agreement with everyone :)

Log in

Don't have an account? Sign up now