The SUN benchmarks ...

Although we haven't run our own benchmarks yet, the benchmarks that SUN presents[2] are still interesting; we'll delve deeper once we have our own numbers. The power consumption figures are estimates. We tried to give you both typical and maximum values, but some manufacturers publish only typical numbers (Intel, IBM) while others publish only maximum numbers (AMD), so we had to find other sources and base our estimates on them.

SPECjbb2005 represents an order-processing application for a wholesale supplier, written in Java.

SPECjbb2005

System | CPU | Power dissipation CPUs (estimated) | Cores | Active threads | Score | Percentage score
Sun Fire T2000 | 1x 1.2GHz UltraSPARC T1 | 72-79 W | 8 | 32 | 63,378 | 160%
Sun Fire X4200 | 2x 2.4GHz DC Opteron | 150-180 W | 4 | 4 | 45,124 | 114%
IBM p5 550 | 2x 1.9GHz POWER5+ | 320-360 W | 4 | 8 | 61,789 | 156%
IBM xSeries 346 | 2x 2.8GHz DC Xeon | 270-300 W | 4 | 8 | 39,585 | 100%

The performance of the T1 is simply amazing. Of course, this is an ideal benchmark for the T1, with many Java threads. The POWER5+ is the only one that comes close, as it, like the T1, can process multiple threads simultaneously (8 in total). But it consumes roughly four times more power than the T1.
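
To make the thread-level parallelism argument concrete, here is a minimal, hypothetical Java sketch (not SPECjbb itself) that spawns 32 independent worker threads, one per hardware thread of the T1, each doing only simple integer work. The thread count and the toy "transaction" loop are our own assumptions for illustration.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

// Toy illustration only: many independent threads, each doing simple integer
// work with no floating point. This is the kind of workload that lets the T1
// keep all 32 hardware threads busy.
public class ManyThreadsDemo {
    public static void main(String[] args) throws InterruptedException {
        final int threads = 32;                 // one per T1 hardware thread (assumption)
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        CountDownLatch done = new CountDownLatch(threads);

        for (int t = 0; t < threads; t++) {
            pool.submit(() -> {
                long checksum = 0;
                // Stand-in for an "order processing" transaction: pure integer work.
                for (int order = 0; order < 1_000_000; order++) {
                    checksum += (order * 31L) ^ (order >>> 3);
                }
                System.out.println(Thread.currentThread().getName() + ": " + checksum);
                done.countDown();
            });
        }
        done.await();
        pool.shutdown();
    }
}
```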

SPECweb2005 emulates users sending browser requests over broadband Internet connections to a web server. It provides three new workloads: a banking site (HTTPS), an e-commerce site (HTTP/HTTPS mix), and a support site (HTTP). Dynamic content is implemented in PHP and JSP.

SPECweb2005

System | Processors | Power dissipation CPUs (estimated) | Cores | Active threads | Score | Percentage score
Sun Fire T2000 | 1x 1.2GHz UltraSPARC T1 | 72-79 W | 8 | 32 | 14,001 | 289%
IBM p5 550 | 2x 1.9GHz POWER5+ | 320-360 W | 4 | 8 | 7,881 | 162%
IBM xSeries 346 | 2x 3.8GHz Xeon | 220-260 W | 4 | 4 | 4,348 | 90%
Dell 2850 | 2x 2.8GHz DC Xeon | 260-300 W | 4 | 8 | 4,850 | 100%

Here, the T1 is by far the best CPU. This is, however, a benchmark that is very hard to interpret. For example, back in 2003 we did some benchmarking on a JSP server. Our first results were very weird: a single Xeon performed just as well as a dual Xeon, even though the Gigabit PCI NIC was nowhere near its limits (about 180 Mbit/s). Once we used an Intel NIC, things improved, but the network bottleneck wasn't gone until we switched to a CSA Intel NIC (connected directly to the Northbridge). The benchmark depends more on the quality of the NIC driver, the latency from the NIC to memory (DMA) and, of course, the quality of the NIC chip itself than on the CPU. That being said, it is clear that web servers spawn a lot of threads that do not require much processing unless the traffic is encrypted. So this is the natural habitat of the T1 CPU. As long as you can make sure that the CPU is the bottleneck, the CPU that can handle the most threads per cycle will win.
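
As a rough illustration of that kind of workload, the sketch below uses the JDK's built-in com.sun.net.httpserver to serve trivial responses from a pool of handler threads; each thread does almost no CPU work and spends most of its time waiting on the network. The port number and pool size are arbitrary assumptions.

```java
import com.sun.net.httpserver.HttpServer;
import java.io.OutputStream;
import java.net.InetSocketAddress;
import java.util.concurrent.Executors;

// Minimal sketch of a "many threads, little work per thread" web workload.
public class TinyWebServer {
    public static void main(String[] args) throws Exception {
        HttpServer server = HttpServer.create(new InetSocketAddress(8080), 0);

        server.createContext("/", exchange -> {
            byte[] body = "hello".getBytes();           // trivial per-request work
            exchange.sendResponseHeaders(200, body.length);
            try (OutputStream out = exchange.getResponseBody()) {
                out.write(body);
            }
        });

        // Dozens of handler threads that mostly sit and wait on the network:
        // exactly the pattern that favors many slower hardware threads over one fast core.
        server.setExecutor(Executors.newFixedThreadPool(64));
        server.start();
    }
}
```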

SAP 2-tier is based on the number one ERP software; the database back-end and the application run on the same machine.

System | Processors | Power dissipation CPUs (estimated) | Cores | Active threads | Score | Percentage score
Sun Fire T2000 | 1x 1.2GHz UltraSPARC T1 | 72-79 W | 8 | 32 | 4,780 | 97%
IBM p5 550 | 2x 1.9GHz POWER5+ | 320-360 W | 4 | 8 | 5,020 | 102%
HP DL580 | 4x 3.33GHz Xeon MP | 440-520 W | 4 | 8 | 4,700 | 96%
HP DL385 | 2x 2.2GHz DC Opteron | 140-180 W | 4 | 4 | 4,920 | 100%

SAP 2-tier is a typical example of a benchmark with very low IPC. However, some of the queries are more complex, so the T1 cannot outperform the fatter cores. Still, the performance per watt is unbeatable.


Unbeatable?

The words "paradigm shift" and "disruptive technology" have been abused so many times that we don't like to use them. But in the case of the T1 CPU, it wouldn't be an exaggeration to say that it is the herald of a new generation of server CPUs, and that it has disrupted the server market. Single-core, single-threaded CPUs no longer stand a chance in this market. Does this also signal the end of superscalar CPUs in the server environment? Is the massive multi-core with scalar cores the future for the entire server world? The SUN UltraSPARC T1 simply wipes the floor with the competition when it comes to performance per Watt: according to this metric, the UltraSPARC T1 is 4 to 12 times better.
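
As a quick sanity check of that claim, the performance-per-Watt figures can be recomputed from the tables above. The sketch below uses the SPECjbb2005 scores and the midpoint of each estimated power range; the midpoints are our own assumption, and the exact ratios shift depending on whether typical or maximum power is used. The same calculation can be repeated for the SPECweb2005 and SAP tables.

```java
// Back-of-the-envelope performance-per-Watt check, using the SPECjbb2005 table
// above and the midpoint of each estimated CPU power range (our assumption).
public class PerfPerWatt {
    record Entry(String system, double score, double wattsLow, double wattsHigh) {}

    public static void main(String[] args) {
        Entry[] specjbb = {
            new Entry("Sun Fire T2000 (UltraSPARC T1)", 63_378, 72, 79),
            new Entry("Sun Fire X4200 (DC Opteron)",    45_124, 150, 180),
            new Entry("IBM p5 550 (POWER5+)",           61_789, 320, 360),
            new Entry("IBM xSeries 346 (DC Xeon)",      39_585, 270, 300),
        };

        double t1 = perfPerWatt(specjbb[0]);
        for (Entry e : specjbb) {
            double ppw = perfPerWatt(e);
            System.out.printf("%-34s %7.0f score/W   T1 advantage: %.1fx%n",
                    e.system(), ppw, t1 / ppw);
        }
    }

    static double perfPerWatt(Entry e) {
        return e.score() / ((e.wattsLow() + e.wattsHigh()) / 2.0);
    }
}
```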


Fig 7: The cores of the T1 processor are hardly warmer than the rest of the die. A "fat" core has many more hotspots.

However, we think that there are also opportunities for the fatter cores. The main weakness of the T1 is its shallow pipeline and low clock speed. The need to be compatible with previous SPARCs, and thus the need for the relatively big register window system (with 1-cycle access), also limits clock speed. While the competition has bigger cores, it does not need as many cores as the T1. Each superscalar core can make better use of its resources by using coarse-grained multithreading (Montecito), FMT or SMT (POWER5), which should allow these kinds of cores to achieve a higher IPC per core. Clock speeds can be 2-3 times higher, allowing two dual-core or one quad-core "fat" CPU to outperform the T1.
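
To put rough numbers on that argument, here is a minimal cores x IPC x clock sketch; the per-core IPC figures and the hypothetical "fat" quad-core configuration are illustrative assumptions, not measurements.

```java
// Rough throughput model: aggregate throughput ~ cores x effective IPC x clock.
// All IPC values below are assumptions chosen for illustration.
public class ThroughputModel {
    // cores x IPC x clock (GHz) = billions of instructions per second
    static double throughput(int cores, double ipcPerCore, double clockGHz) {
        return cores * ipcPerCore * clockGHz;
    }

    public static void main(String[] args) {
        // UltraSPARC T1: 8 scalar cores at 1.2GHz; 4 threads/core hide most stalls,
        // so assume each core sustains roughly 0.7 instructions per cycle.
        double t1 = throughput(8, 0.7, 1.2);

        // Hypothetical "fat" quad-core: wide out-of-order cores with SMT/CMT,
        // assumed to sustain about 1.2 IPC per core at a 2.5x higher clock.
        double fatQuad = throughput(4, 1.2, 3.0);

        System.out.printf("T1  (8 x 0.7 IPC x 1.2GHz): %.1f G instr/s%n", t1);
        System.out.printf("Fat (4 x 1.2 IPC x 3.0GHz): %.1f G instr/s%n", fatQuad);
    }
}
```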

These kinds of CPUs consume quite a bit more power, but as long as this extra power usage is not dramatically higher, fat cores might still have a good chance in the market. After all, it is total system power that counts, and large RAID arrays and AC units often represent larger power draws than just the CPU. With the exception of the web server market, power consumption is not the number one priority most of the time, although it is important.

A study sponsored by SUN[3] shows that the best results in commercial server loads are achieved with 4 to 6 threads per core, combined with 2- to 3-way superscalar in-order cores. This is another indication that there is a lot of room for very different multi-core approaches such as Intel's Montecito, IBM's Power 6+ and the upcoming multi-core Xeons and Opterons. A multi-threaded 64-bit version of Sossaman (31 W TDP for two cores) could also threaten the UltraSPARC T1.

In some server-related markets, fat multi-cores might even be preferable. One such market is OLAP databases, where very complex queries are sent by a limited number of users. The response time of the T1 could be rather mediocre there, while a higher-clocked CPU with fewer cores could be quite a bit more responsive under these loads. Also, OLAP queries that calculate statistical data will use more FP instructions.

Comments (49)

  • sgtroyer - Wednesday, January 4, 2006 - link

    Another fascinating article, Johan. It's fun to see Anandtech spending more time delving into architecture and non x86 processors, and doing more analysis and less benchmarking. Keep it up!
  • Scarceas - Friday, December 30, 2005 - link

    Remember the converse: Most CPUs give up thread-level performance...

    Remember the intended market...

    Remember not all 32 threads have to come from one app...

  • Betwon - Friday, December 30, 2005 - link

    Remember: "not all coming from one app" is not the same as "parallelizes well".
    It is possible that running many apps together at the same time is slower than running them one by one.

    A core has only 8KB of L1, which has to be split between 4 threads. That is too little L1 for 4 threads!

    The Xeon has 16KB for 2 threads, the POWER5 has 32KB for 2 threads.

    "Most CPUs give up thread-level performance..."?
    Remember: the Xeon from Intel and the POWER5 from IBM are both multi-threaded CPUs.
  • Cerb - Friday, December 30, 2005 - link

    Sun stands to gain quite a bit from this, but not really at the expense of IBM, AMD, or Intel. This is doing something that the other guys aren't trying to do, rather than competing against them at what they do well. It is not the future of desktop CPUs. It will not be even a good general-purpose server CPU. It takes a lot of data in, and pushes a lot of data out. A workload that hinges on doing that, without much actual work done to that data, is all it is made to do.

    It is basically a network appliance that happens to run generic programs. If you need what it offers, it will be Lord and Master of your rack. If you're not sure, you will pass it by, because you know that the Opteron over here can take anything you throw at it pretty well.
  • Betwon - Friday, December 30, 2005 - link

    If everyone thinks your words are correct, SUN may cry.
    Such a small area for its apps.
    quote:

    It is not the future of desktop CPUs. It will not be even a good general-purpose server CPU. It takes a lot of data in, and pushes a lot of data out.

  • Cerb - Friday, December 30, 2005 - link

    Why would they cry? They even go as far as pointing out that it's crap for FPU tasks (well, if you notice that FP is lacking entirely from the PR stuff for it), and for tasks with high ILP and IPC (where our mainstream CPUs excel). They also still have a full line-up of other servers, including those based on their own updated SPARCs. It appears their buzzword for this stuff is 'throughput computing'. Their own brochure for this thing also clearly sells it for high-TLP, large-data workloads. For more general work, they've got Opterons, and the UltraSPARC IV+ does not appear to be a slouch.

    Let's look at their own "key applications":
    * Web and application tier workloads
    Lots of web server threads. Lots of DB threads. Simple integer logic.

    * Multithreaded workloads
    See above.

    * Java application servers and Java Virtual Machines
    They're Sun. Regardless of how good it may or may not be here, they must market Java™. McNealy has to eat, you know :).

    * Consolidated web servers
    Basically the first one/two, but worded differently, to point out that it can do 2x as much web serving work as other servers in the rack with it, and maybe even more, while using little power.

    * Infrastructure services (portal, directory, identity)
    Data in, shuffle it, pump it out. Only slightly different than the rest so far (except Java).

    * Enterprise applications (ERP, CRM, SCM business logic)
    Again, mostly simple DB work where a lot of things may be going on at once, but plenty of them will really be separate from each other. What each task lags in will be made up for by being able to run another 30 at the same time.

    Note that nothing like engineering, scientific simulations, etc., is on that list (things that do a lot of FPU work in parallel). It's basically web and DB said in different ways, and a plug for Java. In addition, their benchmarks look carefully chosen, but not cooked, like Apple's.
  • Betwon - Friday, December 30, 2005 - link

    Just for web server/links, very local java apps?

    You think that the key to DB work and web serving is multi-threaded parallel performance. It seems that multi-threaded processors (such as the P4 Xeon with HT and the POWER5 with SMT) would then be more competitive than a single-threaded processor (such as the Opteron).
  • Cerb - Friday, December 30, 2005 - link

    Just for web server/links, very local java apps?
    No, not local Java apps. Java is only there as marketing, because this is something Sun is trying to sell. Java probably works fine on it, but really has nothing to do with any of it, except that the same company is behind both it and this chip.

    You think that the key to DB work and web serving is multi-threaded parallel performance. It seems that multi-threaded processors (such as the P4 Xeon with HT and the POWER5 with SMT) would then be more competitive than a single-threaded processor (such as the Opteron).
    All of those are multithreaded processors. A 386 is a multithreaded processor (in fact, its ability to handle threading in hardware is part of how Linux got created!). However, except for the Power5, none of those can run more than a single thread at a time per core. They can run tons and tons of instructions at a time, but not separate threads (yes, even with HT).

    I don't know how IBM's SMT works, but Intel's is nothing like what the T1 is doing. The T1 seems to be made to send out threads without regard to whether one needs replacing or not.

    Let's say your task has an IPC of 3, and you have 4 paths to use at a time.
    ***0
    Not bad, 75% used. Now, let's say it's only 1.
    *000
    25%, not so great. But, because you have to send them in sets of 4, you can't get 3 more in.

    OK, now, enter Hyperthreading. Let's go to 2 of those 3-IPC tasks.
    ***0 ***0
    Hey, wait, it didn't use that extra one. 75% again. With HT, the CPU switches the whole thing between threads quickly. This helps make up for the stalling that happens a fair bit on the Netburst chips. You can't actually get more done--you just don't have to wait as long when one thread can't go on, because another one is ready to take its place. Yes, it may help a little, but there is also the possibility that the CPU gets too loaded down and performance decreases.

    So, apply that to two 1 IPC tasks.
    *000 *000
    You're still only using 25%, there. You may get a little boost here or there, when one stalls and the other does not, but you've still got about 3/4 of it wasted.

    Now, let's take one core of that T1. It runs four threads, each single-width. So, to that 3 IPC task:
    *000, *000, *000
    One path of the four is used, going over it three times, because it can't span them out and run them in parallel. So, it will take 3 passes to do the work the others can do in 1. Even with a very short pipeline, that hurts. For this task, the 'fat' CPUs, like the Opteron, are excellent.

    But, let's go and run 3 1 IPC tasks, instead:
    ***0
    Now, it got 75% used. Now, running 3x 1 IPC tasks on the Xeon or Opteron:
    *000, *000, *000
    Not so great. The OoOE, branch prediction, and large local caches help, but it just can't keep up, because it's only one thread at a time.

    While this is a very specific kind of workload, the majority of machines that you use on the internet, and many that you may use within a large company, run basically that kind of workload.

    Get request for data.
    Fetch data.
    Check where data needs to go.
    Send it there.

    The thing about it is that this workload accounts for the majority of what goes on over the internet, and most other networks. As long as your servers have enough work to do during peak times to keep one of these machines somewhat busy, it could save rack space, power use, and increase performance in the process.

    Hopefully I didn't screw too much of that up--I did ramble a bit.
  • Schmide - Saturday, December 31, 2005 - link

    As always correct me where I’m wrong.

    OK, this all works fine if you're dealing with a non-superscalar 386. But the processors you're referring to are fully pipelined, out-of-order, micro-op architectures.

    I believe the Opteron can have 72 instructions in flight at any one given time, the Power something like 200(x2?), and the P4 126. Each in various levels of decode, process and write.

    As for the thread-level parallelism, it is in no way as granular as you portray it. Think in milliseconds, not ticks. I believe thread quantums (time slices) in Windows are on the order of 30ms. So on a 2GHz processor, if a thread holds its slice for its full allotted time, a task switch occurs only after some 60 million clock ticks.

    HyperThreading does, by definition, feed the execution units from two threads at a time; however, it never reaches the level of instruction-level parallelism that you portray. It just kind of fills in the gaps.

    Each core of the Niagara can in theory achieve an IPC of 0.7. Multiply that by 8 and you get a theoretical 5.6 IPC (but even the Itanium II never reaches its theoretical peak). Something always gets in the way.

    I think the Niagara has some promise.
  • Betwon - Thursday, December 29, 2005 - link

    How terrible for the single thread apps!
    NO branch prediction!

    Someone must be crazy!

    quote:

    If a branch is encountered, no branch prediction is performed: it would only waste power and transistors. No, the condition on which the branch is based is simply resolved. The CPU doesn't have to guess anymore. The pipeline is not stalled because other threads are switched in while the branch is resolved. So, instead of accelerating the little bit of compute time (10-15%) that there is, the long wait periods (memory latencies, branches) of each thread is overlapped with the compute time of 3 other threads.
