No More Apple Mysteries, Part Two

by Johan De Gelas on 9/1/2005 12:05 AM EST
  • Posted in
  • Mac
POST A COMMENT

47 Comments

Back to Article

  • tthiel - Wednesday, May 24, 2006 - link

    You need to redo this entire test. So much has come out about how poorly this was done its hard to believe it came from Anandtech. Reply
  • iggie - Friday, January 13, 2006 - link

    I'm surprised you didn't post the raw VM latency results from lmbench. I found http://www-128.ibm.com/developerworks/library/l-yd...">another article that did a similar performance comparison (Darwin vs. Linux on G5).
    mmap latency is 3x greater, but most tellingly, page fault latency is > 900 x greater!

    Did you observe similar results in your tests?

    I would imagine that page faults would play a greater and greater role as more and more independent clients connect to a server. I have experienced a huge disparity in http://www.openmicroscopy.org/api/omeis/">our own server software implementation for scientific imaging. In our case, all disk access is done via mmap and page faults (its a shared-VM-based image server system meant to serve many terabytes of image data)
    Reply
  • asifyoucare - Sunday, September 04, 2005 - link

    Interesting article.

    If you suspect that thread performance is the bottleneck, why not write a short program to measure how many threads can be created and destroyed per second?

    Reply
  • DoctorBooze - Saturday, September 03, 2005 - link

    quote:

    In the case of Linux, creating a thread is very similar to creating a process. [...] So, if you test fork() on Linux, you also get a rough idea of how fast threads are created

    I'm no guru but I don't think that's true now with Native Posix Threads, which you get in 2.6 kernels with a suitable libc (and some distros with 2.4 kernels). Check what your program's linked with: on my Fedora Core 3 system `ldd /usr/libexec/mysqld` shows me MySQL is linked with /lib/tls/libc.so.6 and running that shows it has NPTL. The API may be similar but what happens in the kernel isn't and it makes a big, big difference to MySQL. Still, Linux now has fast native POSIX threads and it looks like OS X doesn't.
    Reply
  • ikruusa - Saturday, September 03, 2005 - link

    Indeed, as mentioned previously there was some mistakes in gcc options. And SIMD optimization is really basic in 4.0.x - only certain loops can be vectorized automatically. But loops around arrays are most significant part in signal processing and that is where SIMD really matters :)
    As we know for NetBurst arch it is recommended to use XMM registers (that is registers for SSE/SSE2) for FP calculations. And that is what gcc 3.x does (4.x too): -mfpmath=sse triggers all x87 stuff to run as scalar math using SSE command-set. As I know AltiVec is SIMD unit which is smoothly added to PowerPC pipeline. How useful there is scalar math instead of usual FP - I have no idea.
    What I want to say - my opinion is that if MySQL team has something to say about compiler options then they have documents about it. Using SIMD style processing in DB engine is very challenging exercise for coders. Dont expect magic from compiler here. Hint: maybe Intel's own icc compiler provide some magic but you have to prove it ;) I still believe that the most useful options can be -O[2,3] -funroll-loops and -ffast-math (as you mentioned) with -arch=[processor]. The last one should provide basic branching elimination (e.g. using cmov for x86) and correct instr. ordering.
    About testing Linux. I have some skills in Apache testing with JMeter. I have been quite stuck but kernel developers were kind enough to help: http://marc.theaimsgroup.com/?l=linux-kernel&m...">http://marc.theaimsgroup.com/?l=linux-kernel&m...
    Then I discovered all OS tuning possibilities in /proc Well, most are still unknown for me but I just want to get your attention here. Oracle talks about shared memory and number of semaphores and some particular Linux /proc parameters. Of course there should be all written in MySQL manual too if any parameter needs tuning. But is it enough to read MySQL manual and create profile for OS'es IPC and process management if we need to stress test MySQL on e.g 8-way SMP?
    But still - good start of interesting investigation, anandtech.com!! Thank you and keep going!
    Reply
  • kvs - Saturday, September 03, 2005 - link

    If thread-creation is extremely slow in Darwin, maybe MySQL-performance could be helped by enabled the thread cache? A look at 'mysqladmin extended-status' would show how many threads had been created and cached, and should reveal if thread_cache would be needed.
    Reply
  • tester2 - Friday, September 02, 2005 - link

    Well if ab on Mac OS X was the problem you could have easily tested this from a Linux box over the network.

    Because you probably did this as well, and found out that performance tuning done by Apple outperformed the Linux/PPC and Linux/Opteron system by a substantial amount you keept this out of the story ...

    So I did some testing, and yes when using ab from a Mac OS X I find the exact figures you report. Using a Linux Pentium 4 based system over Gb network gave me 6150 req/sec substantially faster then anything out there.
    Look here for numbers from another source; http://www.pcmag.com/article2/0,1895,1637655,00.as...">http://www.pcmag.com/article2/0,1895,1637655,00.as...

    The webserver runs around 60 threads ... go figure.

    Yes there is a problem with the Mac OS X - Mysql combo if you are looking for performance, but jugging this as Mac OS X for server applications is a nono is drawing the wrong conclusion. I hope someone with good development skills will look at the mysql code and tune it to work well with Mac OS X.

    Reply
  • benh - Friday, September 02, 2005 - link

    Interesting article ! One thing that is worth looking into however is wether the YDL kernel is actually a 32 or a 64 bits kernel. This would probably have an impact on some of the numbers. I would expect the ppc64 kernel to perform faster overall on a 64 bits CPU with a small overhead on syscalls from 32 bits applications due to the argument size translation.

    Also, the problem with the 2.7Ghz on linux is indeed a slight change in the firmware. It in fact looks like a bug in Apple Open Firmware device tree on those machine where they left out the properties providing the interrupt routing of the i2c controller in the north bridge used to drive the fan controller among others. The OS X driver silently falls back to a polled mecanism, while the linux driver doesn't and (shame on me!) used to have a small bug that would cause it crash when unable to locate those properties.

    I posted a patch a while ago fixing that up, I would expect YDL to have an updated kernel/installer available by now.

    Finally, you are right about the U3 northbridge having a quite high memory latency, that is definitely not helping the G5. There have been rumours floating around that Apple now has a new bridge that improves that significantly, though it's pretty much impossible to tell if/when they will release a machine using it. IBM also had multicore G5s available for some time now, though Apple is still not releasing any machine using them.

    Regards,
    Ben.
    Reply
  • JohanAnandtech - Friday, September 02, 2005 - link

    Thanks for the very helpful feedback.

    Do you have any idea why the U3 came with such high latency. Lack of development time? Lack of expertise? A inherent problem with the FSB of the G5? Rather old technology? You see I am very curious, and couldn't find much info on it.



    Reply
  • benh - Friday, September 02, 2005 - link

    I don't know for sure. I wouldn't blame the FSB though. I remember reading somewhere that the memory controller in U3 was similar if not identical to the old one they used in U2 on G4 machines and was to blame but I can't guarantee the reliability of that information.
    Reply
  • Gandalf90125 - Friday, September 02, 2005 - link

    From the article:

    "...so it seems that IBM, although slightly late, could have provided everything that Apple needs."

    I'd say not everything Apple needs as I suspect the switch to Intel was driven more by marketing than any technical aspect of the IBM vs. the Intel chips.
    Reply
  • Illissius - Friday, September 02, 2005 - link

    A few notes:

    - you mention trying a --fast-math option, which I've never heard of... presumably this was a typo for -ffast-math?

    - when I tried using -mcpu (which you say you used for YDL) on GCC 3.4, it told me the option had been deprecated, and -mtune has to be used instead (dunno whether it told me this latter part itself or I read it somewhere else, but it's true). I'm not sure whether GCC4 has the same behaviour (I'd think so), whether it still has the intended effect despite the warning, or whether it matters at all.

    - was there a reason for using -march on one, and -mcpu/-mtune on the other? (the difference is that -mcpu/-mtune optimize the code for that processor as much as possible while still keeping the code compatible with everything else in the architecture, while -march does the same without care for compatibility -- on x86 at least, not sure whether it's the same on PPC)

    - you mention using the same compiler because, err, you wanted to use the same compiler... if this was done in the hopes of it generating code of similar speed for each architecture, though, then your own results show there isn't much point -- seems GCC, 3.3 at least, is much better at generating x86 code than PPC (which isn't surprising, given much more work probably went into it due to the larger userbase). Not saying it was a bad idea to use GCC on both platforms (it's a good one, if for no other reason than most code, on the Linux side at least and OSX itself (not sure about the apps) are compiled with it), just that if the above was the reason, it wasn't a very good one ;).

    - Continuing the above, I was a bit surprised at the, *ahem*, noticeable difference in speed between not even two different compilers, but two versions of the same. (I was expecting something like 1-5, maybe 10% difference, not 100). Maybe this could warrant yet another followup article, this time on compilers? :)
    Reply
  • Pannenkoek - Friday, September 02, 2005 - link

    The reason is that GCC 4.0 incorporated infrastrucure for vector optimization (tree-ssa), which can give, especially in synthetic benchmarks, huge increase in FP performance. GCC can now finally optimize for SSE, Altivec, etc., a reason why in the past optimizing specifically for newer Pentiums did not yield much improvement.

    Althougn compiler benchmarks would be interesting, I doubt it is a task for Anandtech. Normal desktop users do not have to worry about whether or not their applications are optimized optimally, and any differences between, say GCC and ICC, are small or negligible for ordinary desktop programs. (Multimedia programs often have inline assembly for performance critical parts anyway).

    GCC is free, supports about any platform and improves continually while it's already a first class compiler.
    Reply
  • javaxman - Friday, September 02, 2005 - link

    While I generally love this article, I have to wonder...
    why not write a simple benchmark for pthread(), if you think that's the bottleneck? Surely it'd be a simple thing to write a page of code which creates a bunch of threads in a loop, then issues a thread count and/or timestamp. It seems like a blindingly obvious test to run. Please run it.

    I have to say that I *do* think pthread() is the likely bottleneck, possibly due to BSD4.9-derivative code, but why not test that if we think that's the problem? I understand wanting to see real-world MySQL performance, but if you're trying to find a system-level bottleneck, that's not the right type of testing to do...

    Now that I metion it, Darwinx86 vs. BSD 4.9 ( on the same system ) vs. BSD 5.x ( on the same system ) vs. Linux ( on the same system ) would really be a more interesting test at this point... I'm really not caring about PPC these days unless it's an IBM blade system, to be honest... testing an Apple PPC almost seems silly, they'll be gone before too long... Apple's decision to move away from PPC has more to do with *future* chip development than *current* offerings, anyway... Intel and AMD are just putting more R&D into their x86 chips, IBM's not matching it, and Apple knows it...

    but even if you are going to look at PPC systems, if you're trying to find a system-level bottleneck, write and run system-level tests... a pthread() test is what is needed here.
    Reply
  • rhavenn - Friday, September 02, 2005 - link

    If I remember correctly, OS X is forked off of the FreeBSD 4.9 codebase. The 4.x series of BSD always had a crappy threading system and didn't handled threaded apps well at all. I doubt Apple really touched those internals all that much.

    FreeBSD 5.x has a much better time of it. I'm wondering if the switch back to a Intel platform will make it easier for Apple to integrate the BSD 5.x codebase into their OS? or even if they plan on using the BSD 6.x codebase for a future release? The threading models have vastly improved.

    Just a thought :)
    Reply
  • JohanAnandtech - Friday, September 02, 2005 - link

    http://www.apple.com/education/hed/compsci/tiger.h...">http://www.apple.com/education/hed/compsci/tiger.h... :

    "FreeBSD 5.0
    The upgraded kernel in Tiger, based on mach and FreeBSD, provides optimized resource locking for better scalability across multiple processors, support for 64-bit memory pointers through the System library and standards-based access control lists"

    Where did you see FreeBSD 4.9?
    Reply
  • mbe - Friday, September 02, 2005 - link

    Readers also pointed out that LMBench uses "fork", which is the way to create a process and not threads in all Unix variants, including Mac OS X and Linux. I fully agree, but does this mean that the benchmark tells us nothing about the way that the OS handles threading? The relation between a low number in this particular Lmbench benchmark and a slow creating of threads may or may not be the answer, but it does give us some indication of a performance issue. Allow me to explain...

    This misses the point, your claim in the last article was that MacOS X used userspace threads. Mentioning that LMbench uses processes still rules out userspace threads having any part to play. This is since processes can't in any meaningful way (short of violating some pretty basic principles) be implemented around userspace threads. The point is that a process is a virtual memory space attached to a main system thread, not a userspace thread which are not normally even considered threads on this level.

    This is necessary since the virtual memory attached to the thread has to be managed when doing context switches, and by its very definition userspace code cannot directly touch the memory mappings.
    Reply
  • JohanAnandtech - Friday, September 02, 2005 - link

    Yes, it could be. The interesting questions are:
    - Is the only culprit for the 8 time lower performance. Microkernels are reported to be 66 to 5% slower depending on who benchmarked it. But not 8 times slower.
    - What makes it still interesting for the apple devs to use it?

    I hope Apple will be a bit more keen to defend their product, because their might be interesting technical reasons to keep the Mach kernel.
    Reply
  • sdf - Friday, September 02, 2005 - link

    Is Mac OS X really a microkernel? I understood it to be designed to function as a microkernel, but compiled and shipped as a macrokernel for performance reasons. Reply
  • JohanAnandtech - Sunday, September 04, 2005 - link

    I am sorry if I wasn't clear. As I state in the article clearly: Mac OS X is ** NOT ** a microkernel, but based on a microkernel as the Mach kernel is burried inside the FreeBSD monolithic kernel.

    Most of the tasks are done by a FreeBSD alike kernel, but threading is done by the Mach kernel.
    Reply
  • Lori - Friday, September 02, 2005 - link

    http://en.wikipedia.org/wiki/Microkernel">http://en.wikipedia.org/wiki/Microkernel

    MacOS X uses a modified microkernel (a monolithic / microkernel hybrid). The idea was to cut down IPC costs by putting servers that would be IPC heavy directly into the kernel. However, there has recently been a lot of work in the microkernel world to reduce this IPC cost and bring its speed near that of a monolithic kernel.

    L4Ka::Pistachio is an example of this:
    http://www.l4ka.org/">http://www.l4ka.org/
    Reply
  • leviat - Thursday, September 01, 2005 - link

    If the problem is indeed in the thread creation portion of the OS, it would be interesting to see how a single threaded webserver fairs. I would love to see a benchmark test of Lighttpd (www.lighttpd.org) to see a comparison of how that runs on Darwin vs linux-ppc.

    Another interesting test would be to see MySQL can be configured to precreate the handler threads. This might allow us to see how it handles the context-switching between the multiple threads and allow for it to compete.

    Anyways, great article!
    Reply
  • JohanAnandtech - Friday, September 02, 2005 - link

    What exactly to do you mean by single threaded? Because Apache 1.3 works with processes, and is thus single-threaded per user.

    MySQL can make use of a Thread cache, we played with it but it didn't give any substantial boost. I don't see how the software would be able to precreate all threads as it has close down and open connections. If you got some insight, please share :-).

    Context switching is quite fast on the G5 OS X, give or take a few percentages compared to Linux x86 or G5 Linux, as we tested with lmbench.
    Reply
  • Lori - Friday, September 02, 2005 - link

    Actually there are more than one way to handle multiple connections in a server application.

    To give you some examples...

    1. Multi process
    2. Multi thread
    3. Some hybrid of the two

    You can see combinations of these types all provided by Apache 2's MPMs. (perchild, prefork, threadpool, worker, leader.. etc)

    4. Asynchronus multiplexing.

    Your program becomes its own schedular. You can do all your processing within a single thread. Also read up on non blocking i/o. I am actually surprised apache does not have a MPM to handle this type of connection multiplexing but I also read its harder to get OS support.

    Letsee... links... umm... ahh...:

    http://www.kegel.com/c10k.html">http://www.kegel.com/c10k.html
    Reply
  • Avalon - Thursday, September 01, 2005 - link

    Seems like once you remove the G5 from OSX, it's a very capable chip. Reply
  • jamawass - Thursday, September 01, 2005 - link

    Great article, in response to the previous post Anand has posted tons of server articles on x86 systems so Apple is fair game here. Secondly Apple servers are based on OSX in the market, corporations want to know the real world performance not the desktop feel. Also Johan's speculation on Apple's move to Intel raises some troubling questions for Apple execs. Reply
  • karlreading - Thursday, September 01, 2005 - link

    a lot of people commenting on how apple have mad a wrong dicision turning to intel.
    possibly, but IMHO, and, if im not mistaken, didnt the opteron dominate all the tests.
    so in my mind whilst its true for people to doubt apple for going intel, x86 on the whole is still a very viable option if you go the AMD route.
    yes i know people will say AMD dont hae the capacity, but amd powered macs should be how x86 macs are done.
    karlos
    Reply
  • karlreading - Thursday, September 01, 2005 - link

    also worth noting is that they say the FP poerformance is as good as the fastest x86 chip. well scuse me, but isnt that a 2.7ghz g5 part there testing there? thats the fastest g5 currently avalible isnt it? well then why not test the opteron 254. thats the fastest x86 chip, running 2.8ghz, rather than the 850/250 2.4ghz part tested? that would put some lead against the g5 and also, 2.8ghz is a lot closer than 2.4ghz is to the 2.7ghz g5's core speed. if were trying to be fair.
    if we was being really picky we would be stating duakl core opteron as the fastest x86, but i digress....
    karlos
    Reply
  • JohanAnandtech - Friday, September 02, 2005 - link

    You are right about the recentely introduced 2.8 GHz Opteron. Well, to be really accurate, at the time of the introduction of the 2.7 GHz G5, a 2.6 Ghz opteron was available.

    Anyway, It was not my intention to be "accurate", it was more a general impression. Give or take a few percent, the G5 can compete FP wise :-).
    Reply
  • Pannenkoek - Thursday, September 01, 2005 - link

    It's a matter of scalability, SMP support and not so much of how fast some system calls are executed as the reason for the bad performance I would think. Linux is the most used OS for superclusters these days, Mac OS remains a desktop OS. It's no wonder that it performs poorly as a serious server on a multiprocessor/core system. It would have been interesting to see how Windows would have faired (on the x86 of course), if we are testing OSes in this way.

    However, MySQL benchmarks say little about desktop performance, Anandtech's audience consists of desktop users and the reason people love or hate Mac OS is its desktop. Nevertheless, almost a great article. It should have been if the autor could have resisted the temptation of too much speculation, instead of honest benchmark numbers.
    Reply
  • JohanAnandtech - Friday, September 02, 2005 - link

    Sorry couldn't resist :-). (for the rest of the world, pannekoek is dutch for Pancake)

    Desktop performance is ok, as desktop apps are similar to the workstation apps we tested in the first article. Those apps spend from 5-20% in the OS, while server apps spend up to 80% of their time in the OS!

    However, I should point out that we tested Mac OS X SERVER, so it is a problem for the Xserves.
    Reply
  • Pannenkoek - Friday, September 02, 2005 - link

    I stand corrected then. However, my reasoning still applies, it's just that Apple relies even more on its brand than on technology to sell server systems apparently. Who runs Mac OS servers anyway, it's an oxymoron. ;-)

    P.S. Do not mock my nick, it served well in beating godlike UT bots, and should be honoured as much as Loque.
    Reply
  • Tanclearas - Thursday, September 01, 2005 - link

    "Apple told us that the problem lies in the Apachebench (the client side), which stalls from time to time and thus, generates too low of a load on the (Apache) server."

    How does this explanation make any sense? Linux obviously doesn't have a problem with these "stalls".
    Reply
  • JohanAnandtech - Friday, September 02, 2005 - link

    What follows is not what Apple said, but my interpretation...

    They are probably pointing out that the version for Mac OS X has a Mac OS X specific bug. Of course, who is to blame? I am sceptical like you.
    Reply
  • mariush - Thursday, September 01, 2005 - link

    Page 4 :

    We used the following on the Opteron based PCs:

    Gcc -O2 -mcpu=G5 flops.c -o flops

    And, on the G5 machines, we used:

    Gcc -O2 -march=k8 flops.c -o flops

    I think it's the other way around.
    Reply
  • Houdani - Thursday, September 01, 2005 - link

    Aye, was gonna point that out also.

    In addition, on page 3 should you list the Yellow Dog Linux along with OSX in the Software section for the Apple PowerMac G5?
    Reply
  • Shinei - Thursday, September 01, 2005 - link

    My question is, would the memory latencies be so high for the 970FX if high-end RAM was used for the Linux tests (like, say, some TCCD or BH-5 at 2-2-2-5), instead of the standard 3-3-3-8 SPD that ships with the G5 system? Or is there some limitation to the G5 motherboard that prevents posting with performance RAM as a way for Apple to ensure that only certain, accepted DIMMs are used with their computers?
    Anyway, these results are very telling about what the OSX86 Macs are going to perform like--that is to say, ~25% slower than the equivalent Windows/Linux boxes running the same hardware...
    Reply
  • IntelUser2000 - Sunday, September 04, 2005 - link

    quote:

    My question is, would the memory latencies be so high for the 970FX if high-end RAM was used for the Linux tests (like, say, some TCCD or BH-5 at 2-2-2-5), instead of the standard 3-3-3-8 SPD that ships with the G5 system? Or is there some limitation to the G5 motherboard that prevents posting with performance RAM as a way for Apple to ensure that only certain, accepted DIMMs are used with their computers?


    That doesn't matter since they are testing workstations, Irwindale and Opteron is also using CAS3 RAM. No workstations/servers use 2-2-2-5 RAM.


    The poor scores of OS X compared to Linux makes sense. G5 was rumored to be fast in speccpu benchmarks but came out to be slower. Must be that rumor systems were benched with Linux and the production was benched with OSX.

    I am impressed with OS X's features though.
    Reply
  • Jedi2155 - Thursday, September 01, 2005 - link

    The G5 motherboard has the limitations due to Apple's way to insure you only buy certified ram. The SPD settings must be perfect. Reply
  • ceefka - Thursday, September 01, 2005 - link

    I am humbled by the sheer expertise of Johan. Amazing work, Johan!

    This makes me even more curious about Intel's contribution to the next generation of Macs. How will they compare to the best G5s?
    Reply
  • stmok - Thursday, September 01, 2005 - link

    LOL...As everyday passes, it seems more "interesting things" are revealed from Apple solutions. Reply
  • ViRGE - Thursday, September 01, 2005 - link

    Granted, some of this was over my head(more than I'd like to admit to), but your results are none the less very interesting Johan. Now that we have the Linux/G5 numbers, there's no arguing that there's a weakness in MacOSX somewhere, which is a bit depressing as a Mac user, but still a very useful insight as to how there's obviously something very broken in some design aspect of the OS(it simply shouldn't be getting crushed like it is). My only question now is how Apple and its devs will respond to this - it is pretty damning after all.

    Thanks for finally getting some Linux/G5 numbers out to settle this.
    Reply
  • sdf - Friday, September 02, 2005 - link

    By changing hardware platforms.

    No, seriously.

    A transition from PowerPC to Intel would be the perfect time to correct ABI flaws like this. It isn't that the G5 causes the slow down, it's that the slow down (maybe) can't really be fixed without breaking binary compatibility. A CPU transition is clearly going to do that anyway, so maybe they'll just wait...
    Reply
  • toelovell - Thursday, September 01, 2005 - link

    I am kind of curious to see how Darwin would work on an x86 based system for these same tests. There are x86 binaries for Darwin 8. So it should be possible to run these tests and compare Darwin with Linux on an x86 platform. This would help to see if the OS really is the limitation. Just a thought. Reply
  • JohanAnandtech - Thursday, September 01, 2005 - link

    If linux is capable of pushing the G5 8 times higer than with Mac OS X, there is little doubt on my mind that the OS is the problem. Or did I understand you wrong?

    Anyway, I have no experience whatsoever with Darwin. My first impression is that installing Darwin on x86 is probably a very masochistic experience, due to lack of proper drivers. We might get it working but can it really run MySQL and other apps? THere are probably libraries missing... Will the results be representative of anything as it is probably tuned for just getting it running instead of performance? Anyone with Darwin x86 experience?
    Reply
  • wjcott - Thursday, September 01, 2005 - link

    The only interest I have in a mac OS is if they are going to sell it without a computer. I would love to have OS X, but I must build the machine. Reply
  • Quanticles - Thursday, September 01, 2005 - link

    Every component must be fine tuned to the upmost degree... Every BIOS Setting... Every Hidden Register... *crazy eyes* =) Reply

Log in

Don't have an account? Sign up now