The G5 a.k.a. Power 970FX

You might not have noticed it, but there is in fact a lot of good news in this article for owners of current Apple systems. GCC 4.0 promises much better floating-point (FP) performance in open source software. The improvement of GCC 4.0 over GCC 3.3.3 and 3.3 is amazing on the Power 970FX family: almost 70% better FP performance!

Now that the open source community finally has a decent compiler for the Apple platform, Apple's management has decided to switch to another architecture. Ironically, right now, the Intel architecture needs a super-optimized compiler (Intel's own) to reach the FP performance that the G5 now reaches with a very popular but far less aggressive compiler (GCC).

Combined with the data from our first article, we can safely say that the FP performance of the 2.7 GHz G5 is at least as good as that of the best x86 CPUs. Integer performance seems to be between 70% and 80% of that of the fastest x86 CPUs, while FP/SIMD performance can actually surpass x86 in certain situations.

With the dual-core Power 970MP available and IBM's outstanding track record when it comes to multi-core CPUs, it is very questionable whether the switch to Intel CPUs will, from a technical point of view, be as big a step forward as Steve Jobs claims. There is more: each core of the 970MP has 1 MB of cache instead of the current 512 KB, which will improve integer performance quite a bit, as it lessens the impact of the G5's biggest problem: the high latency of access to the memory system.


Xserve, silently cooled. Below the G5 with the cover, you can see the heatsink.

It is again ironic that the Power 970MP is far more advanced than Intel's current dual-core CPUs when it comes to power management. Each core can be placed independently in a power-saving state called doze, while the other core continues operation.

A low power Power 970FX is also available and consumes about 16 Watts at 1.6 GHz; so it seems that IBM, although slightly late, could have provided everything that Apple needs. The G5 with its 58 million transistors and 66 mm² die size is not really a hot CPU. The Xserve (2 x 2.3 GHz G5) was by far the quietest 1U air-cooled server that ever entered our lab in Kortrijk.

The Usual Suspects

The Mac OS X kernel environment includes the Mach kernel, BSD, the I/O Kit, file systems, and networking components. Some of these components slow down MySQL significantly. While our rough profiling has not identified the true culprits, we think that we can narrow the possible suspects to:
  1. The relatively high TCP latency that we measured.
  2. The implementation of the threading system. Does the mapping of pthreads to Mach threads involve some overhead, or is this the traditional performance problem of the microkernel, namely the high latency of system calls on such a kernel? While Mac OS X is not a microkernel, the problem might still exist, as the Mach core sits deep inside that kernel. Is there IPC overhead? The lmbench signaling benchmarks seem to suggest that there is.
  3. The finer-grained locking in the current version of Tiger does not appear to be working for some reason, so we still have the "two lock" system of Panther.
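
Suspects 1 and 2 can be probed with micro-benchmarks in the spirit of lmbench. What follows is a minimal Python sketch, not the lmbench code used for this article: the function names and iteration counts are hypothetical, and interpreter overhead inflates the absolute numbers, so the results are only useful for comparing operating systems on the same hardware with the same Python version.

```python
import socket
import threading
import time


def thread_create_join_latency(iterations=200):
    """Average seconds to create, start, and join one do-nothing thread."""
    start = time.perf_counter()
    for _ in range(iterations):
        t = threading.Thread(target=lambda: None)
        t.start()
        t.join()
    return (time.perf_counter() - start) / iterations


def tcp_loopback_round_trip(iterations=200):
    """Average seconds for a 1-byte ping-pong over a loopback TCP connection."""
    server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    server.bind(("127.0.0.1", 0))  # port 0: let the OS pick a free port
    server.listen(1)
    port = server.getsockname()[1]

    def echo():
        # Accept a single client and echo every byte until it disconnects.
        conn, _ = server.accept()
        with conn:
            while True:
                data = conn.recv(1)
                if not data:
                    break
                conn.sendall(data)

    worker = threading.Thread(target=echo)
    worker.start()

    client = socket.create_connection(("127.0.0.1", port))
    # Disable Nagle's algorithm so tiny packets are sent immediately.
    client.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, 1)
    start = time.perf_counter()
    for _ in range(iterations):
        client.sendall(b"x")
        client.recv(1)
    elapsed = time.perf_counter() - start
    client.close()
    worker.join()
    server.close()
    return elapsed / iterations


if __name__ == "__main__":
    print(f"thread create+join: {thread_create_join_latency() * 1e6:.1f} us")
    print(f"TCP round trip:     {tcp_loopback_round_trip() * 1e6:.1f} us")
```

Running the same script on Mac OS X and on Linux on the same machine would show whether thread creation or the loopback TCP path stands out, although it cannot separate kernel overhead from library overhead.
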
Will this performance problem only be visible in MySQL? At this point, we can only speculate, but we strongly suspect that this is not the case. Unlike workstation applications, server workloads spend a substantial portion of their execution time in the kernel and the TCP stack. Porting such applications to Mac OS X is more complicated than just recompiling code. We didn't have to search long before we found examples of companies that increased their number of servers or upgraded in order to run MySQL faster.

We look forward to testing other database and server apps on the Mac OS X platform. Critical reports that point out weaknesses can only help the Apple community move forward and keep the Apple people on their toes.

References

[1] Threading on OS X
http://developer.apple.com/technotes/tn/tn2028.html

[2] Lmbench: Portable Tools for Performance Analysis
Larry McVoy, Silicon Graphics
Carl Staelin, Hewlett-Packard Laboratories

[3] Performance Characterization of a Quad Pentium Pro SMP Using OLTP
Kimberly Keeton, David A. Patterson, Yong Qiang He, Roger C. Raphael, and Walter E. Baker
Computer Science Division
University of California at Berkeley


47 Comments


  • Lori - Friday, September 2, 2005

    http://en.wikipedia.org/wiki/Microkernel

    MacOS X uses a modified microkernel (a monolithic / microkernel hybrid). The idea was to cut down IPC costs by putting servers that would be IPC heavy directly into the kernel. However, there has recently been a lot of work in the microkernel world to reduce this IPC cost and bring its speed near that of a monolithic kernel.

    L4Ka::Pistachio is an example of this:
    http://www.l4ka.org/
  • leviat - Thursday, September 1, 2005

    If the problem is indeed in the thread creation portion of the OS, it would be interesting to see how a single-threaded web server fares. I would love to see a benchmark test of Lighttpd (www.lighttpd.org) to see a comparison of how that runs on Darwin vs linux-ppc.

    Another interesting test would be to see whether MySQL can be configured to precreate the handler threads. This might allow us to see how it handles the context switching between the multiple threads and allow it to compete.

    Anyways, great article!
  • JohanAnandtech - Friday, September 2, 2005

    What exactly do you mean by single-threaded? Because Apache 1.3 works with processes, and is thus single-threaded per user.

    MySQL can make use of a thread cache; we played with it, but it didn't give any substantial boost. I don't see how the software would be able to precreate all threads, as it has to close and open connections. If you have some insight, please share :-).

    Context switching is quite fast on the G5 under OS X, give or take a few percent compared to Linux x86 or G5 Linux, as we tested with lmbench.
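
The idea of precreated handler threads raised in this exchange can be illustrated with a fixed worker pool. This is only a sketch in Python under stated assumptions: the function name `run_with_precreated_workers` and the use of plain callables as stand-ins for client connections are hypothetical, and it does not show how MySQL's thread cache is actually implemented.

```python
import queue
import threading


def run_with_precreated_workers(jobs, num_workers=4):
    """Process jobs on a fixed pool of threads created once, up front."""
    tasks = queue.Queue()
    results = []
    results_lock = threading.Lock()

    def worker():
        while True:
            job = tasks.get()
            if job is None:  # sentinel: no more work for this worker
                break
            outcome = job()
            with results_lock:
                results.append(outcome)

    # Pay the thread-creation cost once, before any "connection" arrives.
    workers = [threading.Thread(target=worker) for _ in range(num_workers)]
    for w in workers:
        w.start()

    # Each job stands in for one client connection to be handled.
    for job in jobs:
        tasks.put(job)
    for _ in workers:
        tasks.put(None)  # one sentinel per worker
    for w in workers:
        w.join()
    return results


if __name__ == "__main__":
    handled = run_with_precreated_workers([lambda i=i: i * i for i in range(10)])
    print(sorted(handled))
```

With this design, connections can open and close freely while the threads themselves persist; only the hand-off through the queue is paid per connection.
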
  • Lori - Friday, September 2, 2005

    Actually, there is more than one way to handle multiple connections in a server application.

    To give you some examples...

    1. Multi process
    2. Multi thread
    3. Some hybrid of the two

    You can see combinations of these types all provided by Apache 2's MPMs. (perchild, prefork, threadpool, worker, leader.. etc)

    4. Asynchronous multiplexing.

    Your program becomes its own scheduler. You can do all your processing within a single thread. Also read up on non-blocking I/O. I am actually surprised Apache does not have an MPM to handle this type of connection multiplexing, but I also read it's harder to get OS support.

    Letsee... links... umm... ahh...:

    http://www.kegel.com/c10k.html
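
The asynchronous multiplexing approach described in this comment can be sketched as a single-threaded echo server built on Python's standard `selectors` module. The helper names `make_listener` and `echo_event_loop` are hypothetical, and calling `sendall` on a non-blocking socket is a simplification that is only safe for small payloads; a real server would buffer unsent data and wait for the socket to become writable.

```python
import selectors
import socket
import threading


def make_listener(host="127.0.0.1", port=0):
    """Create a non-blocking listening socket (port 0: the OS picks one)."""
    listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    listener.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    listener.bind((host, port))
    listener.listen()
    listener.setblocking(False)
    return listener


def echo_event_loop(listener, max_connections):
    """Echo bytes for all clients from one thread.

    Returns after max_connections clients have connected and disconnected.
    """
    sel = selectors.DefaultSelector()
    sel.register(listener, selectors.EVENT_READ)
    closed = 0
    while closed < max_connections:
        # Block until at least one registered socket is readable.
        for key, _ in sel.select():
            sock = key.fileobj
            if sock is listener:
                conn, _ = listener.accept()
                conn.setblocking(False)
                sel.register(conn, selectors.EVENT_READ)
                continue
            data = sock.recv(4096)
            if data:
                sock.sendall(data)  # simplification: assumes a small payload
            else:
                # Empty read: the client closed its side of the connection.
                sel.unregister(sock)
                sock.close()
                closed += 1
    sel.unregister(listener)
    sel.close()


if __name__ == "__main__":
    listener = make_listener()
    port = listener.getsockname()[1]
    # The thread here only lets the demo client run in the same process;
    # the server logic itself never uses more than one thread.
    loop = threading.Thread(target=echo_event_loop, args=(listener, 1))
    loop.start()
    with socket.create_connection(("127.0.0.1", port)) as client:
        client.sendall(b"hello")
        print(client.recv(4096))
    loop.join()
    listener.close()
```
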
  • Avalon - Thursday, September 1, 2005

    Seems like once you remove the G5 from OSX, it's a very capable chip.
  • jamawass - Thursday, September 1, 2005

    Great article. In response to the previous post: Anand has posted tons of server articles on x86 systems, so Apple is fair game here. Secondly, Apple servers on the market are based on OS X, and corporations want to know the real-world performance, not the desktop feel. Also, Johan's speculation on Apple's move to Intel raises some troubling questions for Apple execs.
  • karlreading - Thursday, September 1, 2005

    a lot of people are commenting on how apple has made a wrong decision turning to intel.
    possibly, but IMHO, and if i'm not mistaken, didn't the opteron dominate all the tests?
    so in my mind, whilst people may be right to doubt apple for going intel, x86 on the whole is still a very viable option if you go the AMD route.
    yes, i know people will say AMD doesn't have the capacity, but AMD powered macs should be how x86 macs are done.
    karlos
  • karlreading - Thursday, September 1, 2005

    also worth noting is that they say the FP performance is as good as the fastest x86 chip. well excuse me, but isn't that a 2.7ghz g5 part they're testing there? that's the fastest g5 currently available, isn't it? well then why not test the opteron 254? that's the fastest x86 chip, running at 2.8ghz, rather than the 850/250 2.4ghz part tested. that would put some lead against the g5, and also, 2.8ghz is a lot closer than 2.4ghz to the 2.7ghz g5's core speed, if we're trying to be fair.
    if we were being really picky we would be stating the dual core opteron as the fastest x86, but i digress....
    karlos
  • JohanAnandtech - Friday, September 2, 2005

    You are right about the recently introduced 2.8 GHz Opteron. Well, to be really accurate, at the time of the introduction of the 2.7 GHz G5, a 2.6 GHz Opteron was available.

    Anyway, it was not my intention to be "accurate"; it was more a general impression. Give or take a few percent, the G5 can compete FP-wise :-).
  • Pannenkoek - Thursday, September 1, 2005

    I would think the reason for the bad performance is a matter of scalability and SMP support, and not so much of how fast some system calls are executed. Linux is the most used OS for superclusters these days; Mac OS remains a desktop OS. It's no wonder that it performs poorly as a serious server on a multiprocessor/multi-core system. It would have been interesting to see how Windows would have fared (on x86, of course), if we are testing OSes in this way.

    However, MySQL benchmarks say little about desktop performance. Anandtech's audience consists of desktop users, and the reason people love or hate Mac OS is its desktop. Nevertheless, almost a great article. It would have been one if the author had resisted the temptation of too much speculation and stuck to honest benchmark numbers.
