Mac OS X's Achilles Heel

It is clear that profiling MySQL at the kernel level is the only way we are going to be able to pinpoint exactly why MySQL is so slow on Mac OS X. So why did I state that I believe the threading engine in Mac OS X to be rather slow? I admit that I should have made it clearer in the article that I didn't have rock-solid evidence. However, my suspicion is based on more than speculation.

First of all, notice that Mac OS X performance is decent at a concurrency of one, or one simulated user. It still performs well when a second user is simulated, as the second CPU can kick in and push performance higher. Let us check the scaling by putting the numbers from our MySQL graph into a table.

| Concurrency | Dual G5 2.5 GHz Tiger | Scaling (concurrency 1 = 100%) | Dual G5 2.5 GHz Linux 2.6 | Scaling (concurrency 1 = 100%) | Dual Opteron 2.4 GHz | Scaling (concurrency 1 = 100%) |
|---|---|---|---|---|---|---|
| 1 | 192 | 100% | 228 | 100% | 290 | 100% |
| 2 | 274 | 143% | 349 | 153% | 438 | 151% |
| 5 | 113 | 59% | 411 | 180% | 543 | 187% |
| 10 | 62 | 32% | 443 | 194% | 629 | 217% |
| 20 | 50 | 26% | 439 | 193% | 670 | 231% |
| 35 | 50 | 26% | 422 | 185% | 650 | 224% |
| 50 | 47 | 25% | 414 | 182% | 669 | 230% |

The performance at a concurrency of 1 and 2 is mediocre, but not really bad. Notice that the scaling of Mac OS X from one to two users is not fantastic, but it is almost as good as that of the Linux machines. Once we work with 5 concurrent users, however, performance collapses on Mac OS X: we get only 59% of the performance at concurrency one. On Linux, the two CPUs are not yet fully stressed at a concurrency of two, and increasing the load makes them work harder.

The G5 (Linux) reaches its peak sooner, as it is a bit slower in this integer-intensive task than the Opteron. It is important to remark, however, that while performance begins to decline very slowly as we increase the number of users, there is no collapse: at a concurrency of 50, we still get over 80% more performance than at a concurrency of one, showing that Linux handles the load of the extra threads very well. On Mac OS X, performance has plummeted to a quarter of our initial performance, showing that the extra threads somehow create additional overhead.

Secondly, our benchmark is not disk-limited, and in that case it is well documented that MySQL performance depends on the threading performance of the OS. A few examples:
MySQL Reference Manual for version 5.0.3-alpha:

"MySQL is very dependent on the thread package used. So when choosing a good platform for MySQL, the thread package is very important"
More:
"The capability of the kernel and the thread library to run many threads that acquire and release a mutex over a short critical region frequently without excessive context switches. If the implementation of pthread_mutex_lock() is too anxious to yield CPU time, this will hurt MySQL tremendously. If this issue is not taken care of, adding extra CPUs will actually make MySQL slower"
Darwin (6.x and older) used to be quite a bit slower at context switching, but our own LMbench testing shows that the latest Darwin 8.0 performs context switches nearly as fast as Linux kernel 2.6. So a possible explanation might be that more context switches happen, but we still have to find a method to measure this. Suggestions are welcome.
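One way to measure it would be a small pthread microbenchmark along the following lines. This is a minimal sketch of our own, not LMbench or MySQL code, and the thread and iteration counts are arbitrary illustrations: a few threads repeatedly acquire and release one shared mutex around a very short critical region, exactly the pattern the manual describes. If pthread_mutex_lock() is too eager to yield the CPU, the operations-per-second figure should collapse as the thread count rises, just as our MySQL numbers do.

```c
/* Sketch of a mutex-contention microbenchmark (our illustration).
 * NTHREADS workers fight over one mutex protecting a tiny critical
 * region; a pthread implementation that yields the CPU too eagerly
 * on contention shows a sharp drop in ops/s as NTHREADS grows. */
#include <pthread.h>
#include <stdio.h>
#include <sys/time.h>

#define NTHREADS   5        /* arbitrary; 5 is where Mac OS X collapses */
#define ITERATIONS 1000000  /* lock/unlock pairs per thread */

static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static long counter = 0;    /* the short critical region */

static void *worker(void *arg)
{
    (void)arg;
    for (int i = 0; i < ITERATIONS; i++) {
        pthread_mutex_lock(&lock);
        counter++;                       /* very short hold time */
        pthread_mutex_unlock(&lock);
    }
    return NULL;
}

int main(void)
{
    pthread_t threads[NTHREADS];
    struct timeval start, end;

    gettimeofday(&start, NULL);
    for (int i = 0; i < NTHREADS; i++)
        pthread_create(&threads[i], NULL, worker, NULL);
    for (int i = 0; i < NTHREADS; i++)
        pthread_join(threads[i], NULL);
    gettimeofday(&end, NULL);

    double secs = (end.tv_sec - start.tv_sec)
                + (end.tv_usec - start.tv_usec) / 1e6;
    printf("%d threads: %.0f lock/unlock pairs per second\n",
           NTHREADS, (double)NTHREADS * ITERATIONS / secs);
    return 0;
}
```

Compiled with gcc -O2 -pthread on both operating systems, comparing the ops/s at 1, 2, 5 and 10 threads would separate mutex behavior from raw CPU speed.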

From the MySQL site:
"As a multithreaded server, MySQL is most efficient on an operating system that has a well implemented threading system"
Thirdly, we have the LMbench benchmarks, which are not conclusive but point in the same direction. Even the high latency of the TCP measurements on Mac OS X (see above) might indicate relatively poor threading performance. MySQL has a TCP/IP connection thread that handles all connection requests; this thread creates a new dedicated thread to handle the authentication and SQL query processing for each connection.
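This thread-per-connection design also makes raw thread-creation latency relevant. A second sketch (again our own illustration, not MySQL source; the connection count is arbitrary) times how quickly the OS can spawn and retire the dedicated thread that each new connection would get:

```c
/* Sketch of a thread-creation microbenchmark (our illustration of
 * MySQL's thread-per-connection model, not actual MySQL code):
 * time the spawn-and-join cycle a server pays for every new client. */
#include <pthread.h>
#include <stdio.h>
#include <sys/time.h>

#define CONNECTIONS 10000   /* arbitrary number of simulated connections */

static void *handle_connection(void *arg)
{
    (void)arg;  /* a real server would authenticate and run queries here */
    return NULL;
}

int main(void)
{
    struct timeval start, end;
    pthread_t t;

    gettimeofday(&start, NULL);
    for (int i = 0; i < CONNECTIONS; i++) {
        pthread_create(&t, NULL, handle_connection, NULL);
        pthread_join(t, NULL);
    }
    gettimeofday(&end, NULL);

    double usec = (end.tv_sec - start.tv_sec) * 1e6
                + (end.tv_usec - start.tv_usec);
    printf("%d create/join cycles, %.1f microseconds per thread\n",
           CONNECTIONS, usec / CONNECTIONS);
    return 0;
}
```

If Mac OS X were to need several times longer per create/join cycle than Linux on the same G5, the connection thread itself would become a bottleneck long before the CPUs are saturated.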

The split funnel suspect

The last suspect is the locking system. In Panther, only two threads at a time could hold a lock on kernel code: one thread could hold the lock on the networking part, while the other held the lock on the rest of the kernel services.

In Tiger, the locking is finer grained. Although Apple's documents indicate that it is still rather coarse, it is clear that more than two locks on the kernel can exist at the same time. For MySQL, this should be a very important improvement, yet we did not see any improvement at all when running the tests on both Panther and Tiger. This is speculation, but based on our data we are tempted to hypothesize that the new locking system isn't really working yet, and that Tiger continues to behave like Panther.

Does it affect you?

What does this all mean? Whether or not you skipped the technical part, you probably want to know how it affects your Apple (server) experience.

It is clear that if you plan to run MySQL on Apple hardware, it is better to install YDL Linux than to use Mac OS X. If you need excellent read performance, the maximum performance of your server will be up to 8 times better. If your server only has to serve a limited number of users, YDL Linux will allow you to run with a less expensive system.

If the usage pattern of your server is more OLTP (transaction processing) oriented, we give you the same advice. Our quick tests with InnoDB show the same kind of behavior, and we have noticed very slow file system performance. At this point we do not have enough data to be conclusive, but we noticed, for example, that importing data into our database (via the ">" command) took up to 8 times longer.

Comments

  • Gandalf90125 - Friday, September 2, 2005 - link

    From the article:

    "...so it seems that IBM, although slightly late, could have provided everything that Apple needs."

    I'd say not everything Apple needs as I suspect the switch to Intel was driven more by marketing than any technical aspect of the IBM vs. the Intel chips.
  • Illissius - Friday, September 2, 2005 - link

    A few notes:

    - you mention trying a --fast-math option, which I've never heard of... presumably this was a typo for -ffast-math?

    - when I tried using -mcpu (which you say you used for YDL) on GCC 3.4, it told me the option had been deprecated, and -mtune has to be used instead (dunno whether it told me this latter part itself or I read it somewhere else, but it's true). I'm not sure whether GCC4 has the same behaviour (I'd think so), whether it still has the intended effect despite the warning, or whether it matters at all.

    - was there a reason for using -march on one, and -mcpu/-mtune on the other? (the difference is that -mcpu/-mtune optimize the code for that processor as much as possible while still keeping the code compatible with everything else in the architecture, while -march does the same without care for compatibility -- on x86 at least, not sure whether it's the same on PPC)

    - you mention using the same compiler because, err, you wanted to use the same compiler... if this was done in the hopes of it generating code of similar speed for each architecture, though, then your own results show there isn't much point -- seems GCC, 3.3 at least, is much better at generating x86 code than PPC (which isn't surprising, given much more work probably went into it due to the larger userbase). Not saying it was a bad idea to use GCC on both platforms (it's a good one, if for no other reason than most code, on the Linux side at least and OSX itself (not sure about the apps) are compiled with it), just that if the above was the reason, it wasn't a very good one ;).

    - Continuing the above, I was a bit surprised at the, *ahem*, noticeable difference in speed between not even two different compilers, but two versions of the same. (I was expecting something like 1-5, maybe 10% difference, not 100). Maybe this could warrant yet another followup article, this time on compilers? :)
  • Pannenkoek - Friday, September 2, 2005 - link

    The reason is that GCC 4.0 incorporated infrastructure for vector optimization (tree-ssa), which can give a huge increase in FP performance, especially in synthetic benchmarks. GCC can now finally optimize for SSE, Altivec, etc., a reason why in the past optimizing specifically for newer Pentiums did not yield much improvement.

    Although compiler benchmarks would be interesting, I doubt it is a task for AnandTech. Normal desktop users do not have to worry about whether or not their applications are optimally optimized, and any differences between, say, GCC and ICC are small or negligible for ordinary desktop programs. (Multimedia programs often have inline assembly for performance-critical parts anyway.)

    GCC is free, supports about any platform and improves continually while it's already a first class compiler.
  • javaxman - Friday, September 2, 2005 - link

    While I generally love this article, I have to wonder...
    why not write a simple benchmark for pthread(), if you think that's the bottleneck? Surely it'd be a simple thing to write a page of code which creates a bunch of threads in a loop, then issues a thread count and/or timestamp. It seems like a blindingly obvious test to run. Please run it.

    I have to say that I *do* think pthread() is the likely bottleneck, possibly due to BSD4.9-derivative code, but why not test that if we think that's the problem? I understand wanting to see real-world MySQL performance, but if you're trying to find a system-level bottleneck, that's not the right type of testing to do...

    Now that I mention it, Darwin x86 vs. BSD 4.9 (on the same system) vs. BSD 5.x (on the same system) vs. Linux (on the same system) would really be a more interesting test at this point... I'm really not caring about PPC these days unless it's an IBM blade system, to be honest... testing an Apple PPC almost seems silly, they'll be gone before too long... Apple's decision to move away from PPC has more to do with *future* chip development than *current* offerings, anyway... Intel and AMD are just putting more R&D into their x86 chips, IBM's not matching it, and Apple knows it...

    but even if you are going to look at PPC systems, if you're trying to find a system-level bottleneck, write and run system-level tests... a pthread() test is what is needed here.
  • rhavenn - Friday, September 2, 2005 - link

    If I remember correctly, OS X is forked off of the FreeBSD 4.9 codebase. The 4.x series of BSD always had a crappy threading system and didn't handle threaded apps well at all. I doubt Apple really touched those internals all that much.

    FreeBSD 5.x has a much better time of it. I'm wondering if the switch back to an Intel platform will make it easier for Apple to integrate the BSD 5.x codebase into their OS, or even if they plan on using the BSD 6.x codebase for a future release? The threading models have vastly improved.

    Just a thought :)
  • JohanAnandtech - Friday, September 2, 2005 - link

    http://www.apple.com/education/hed/compsci/tiger.h... :

    "FreeBSD 5.0
    The upgraded kernel in Tiger, based on mach and FreeBSD, provides optimized resource locking for better scalability across multiple processors, support for 64-bit memory pointers through the System library and standards-based access control lists"

    Where did you see FreeBSD 4.9?
  • mbe - Friday, September 2, 2005 - link

    Readers also pointed out that LMbench uses "fork", which is the way to create a process, not threads, in all Unix variants, including Mac OS X and Linux. I fully agree, but does this mean that the benchmark tells us nothing about the way the OS handles threading? The relation between a low number in this particular LMbench benchmark and slow creation of threads may or may not be the answer, but it does give us some indication of a performance issue. Allow me to explain...

    This misses the point: your claim in the last article was that Mac OS X used userspace threads. Mentioning that LMbench uses processes still rules out userspace threads having any part to play, since processes can't in any meaningful way (short of violating some pretty basic principles) be implemented around userspace threads. The point is that a process is a virtual memory space attached to a main system thread, not a userspace thread, which is not normally even considered a thread on this level.

    This is necessary since the virtual memory attached to the thread has to be managed when doing context switches, and by its very definition userspace code cannot directly touch the memory mappings.
  • JohanAnandtech - Friday, September 2, 2005 - link

    Yes, it could be. The interesting questions are:
    - Is it the only culprit for the 8 times lower performance? Microkernels are reported to be 5 to 66% slower, depending on who benchmarked it, but not 8 times slower.
    - What makes it still interesting for the Apple devs to use it?

    I hope Apple will be a bit more keen to defend their product, because there might be interesting technical reasons to keep the Mach kernel.
  • sdf - Friday, September 2, 2005 - link

    Is Mac OS X really a microkernel? I understood it to be designed to function as a microkernel, but compiled and shipped as a macrokernel for performance reasons.
  • JohanAnandtech - Sunday, September 4, 2005 - link

    I am sorry if I wasn't clear. As I state clearly in the article, Mac OS X is ** NOT ** a microkernel, but it is based on a microkernel, as the Mach kernel is buried inside the FreeBSD monolithic kernel.

    Most of the tasks are done by a FreeBSD-like kernel, but threading is done by the Mach kernel.
