Mac OS X Achilles Heel

It is clear that profiling MySQL at the kernel level is the only way that we are going to pinpoint exactly why MySQL is so slow on Mac OS X. So, why did I state that I believe the threading engine in Mac OS X to be rather slow? Well, I admit that I should have made it clearer in the article that I didn't have rock-solid evidence. However, my suspicion is based on more than speculation.

First of all, notice that the Mac OS X performance is decent with a concurrency of one, or one simulated user. It still performs well when a second user is simulated, as the second CPU can kick in and push performance higher. Let us check the scaling, by putting the numbers of our MySQL graphic into a table.

| Concurrency | Dual G5 2.5 GHz, Tiger | Scaling (concurrency one = 100%) | Dual G5 2.5 GHz, Linux 2.6 | Scaling (concurrency one = 100%) | Dual Opteron 2.4 GHz | Scaling (concurrency one = 100%) |
|---|---|---|---|---|---|---|
| 1 | 192 | 100% | 228 | 100% | 290 | 100% |
| 2 | 274 | 143% | 349 | 153% | 438 | 151% |
| 5 | 113 | 59% | 411 | 180% | 543 | 187% |
| 10 | 62 | 32% | 443 | 194% | 629 | 217% |
| 20 | 50 | 26% | 439 | 193% | 670 | 231% |
| 35 | 50 | 26% | 422 | 185% | 650 | 224% |
| 50 | 47 | 25% | 414 | 182% | 669 | 230% |

The performance at concurrency 1 and 2 is mediocre, but not really bad. Notice that the scaling of Mac OS X from one to two is not fantastic, but is almost as good as that of the Linux machines. Once we move to 5 concurrent users, however, performance collapses on Mac OS X: we get only about 59% of the performance at concurrency one. With Linux, the two CPUs are not yet fully stressed at a concurrency of two, and increasing the load makes the CPUs work harder.

The G5 (Linux) reaches its peak sooner, as it is a bit slower than the Opteron in this integer-intensive task. However, it is important to remark that while performance begins to decline very slowly as we increase the number of users, there is no collapse! At a concurrency of 50, we still have about 80% more performance than at a concurrency of one, showing that Linux handles the extra load of the extra threads very well. On Mac OS X, performance has plummeted to one quarter of our initial performance, suggesting that the threads are somehow creating additional overhead.
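The scaling columns can be recomputed from the raw numbers. What follows is only a small sketch: the throughput values are copied from the table above, while the dictionary keys and the function name `scaling` are made up for this illustration.

```python
# Throughput values copied from the table above, keyed by concurrency level.
# The system labels are just dictionary keys chosen for this sketch.
RESULTS = {
    "Dual G5 2.5 GHz, Tiger": {1: 192, 2: 274, 5: 113, 10: 62, 20: 50, 35: 50, 50: 47},
    "Dual G5 2.5 GHz, Linux 2.6": {1: 228, 2: 349, 5: 411, 10: 443, 20: 439, 35: 422, 50: 414},
    "Dual Opteron 2.4 GHz": {1: 290, 2: 438, 5: 543, 10: 629, 20: 670, 35: 650, 50: 669},
}


def scaling(throughput):
    """Return throughput at each concurrency as a percentage of concurrency one."""
    baseline = throughput[1]
    return {level: 100.0 * value / baseline for level, value in throughput.items()}


if __name__ == "__main__":
    for system, throughput in RESULTS.items():
        print(system)
        for level, percent in scaling(throughput).items():
            print(f"  concurrency {level:>2}: {percent:5.1f}%")
```

Because the table rounds to whole percentages, a recomputed value can differ from a printed one by a point.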

Secondly, our benchmark is not disk limited. In that situation, it is well documented that MySQL performance depends on the threading performance of the OS. A few examples:
MySQL Reference Manual for version 5.0.3-alpha:

"MySQL is very dependent on the thread package used. So when choosing a good platform for MySQL, the thread package is very important"
More:
"The capability of the kernel and the thread library to run many threads that acquire and release a mutex over a short critical region frequently without excessive context switches. If the implementation of pthread_mutex_lock() is too anxious to yield CPU time, this will hurt MySQL tremendously. If this issue is not taken care of, adding extra CPUs will actually make MySQL slower"
Darwin (6.x and older) used to be quite a bit slower when it came to context switches, but our own LMBench testing shows that the latest Darwin 8.0 performs context switches almost as fast as the Linux 2.6 kernel. So, a possible explanation might be that more context switches happen, but we still have to find a method to measure this. Suggestions are welcome.
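One possible way to count context switches, offered only as a rough sketch and not as the method used for this article: on Unix-like systems, `getrusage()` reports voluntary and involuntary context switches for a process, and Python exposes them as `ru_nvcsw` and `ru_nivcsw`. The workload below (several threads taking and releasing one shared lock) and the names `contended_workload` and `measure` are assumptions made for illustration. Python's global interpreter lock also serializes the threads, so this is not a faithful model of MySQL's pthread usage; a C version built around `pthread_mutex_lock()` would be closer.

```python
import resource
import threading


def context_switches():
    """Return (voluntary, involuntary) context switches of this process so far."""
    usage = resource.getrusage(resource.RUSAGE_SELF)
    return usage.ru_nvcsw, usage.ru_nivcsw


def contended_workload(num_threads, iterations):
    """Have several threads repeatedly acquire and release one shared lock."""
    lock = threading.Lock()
    counter = [0]

    def worker():
        for _ in range(iterations):
            with lock:  # short critical region, as described in the manual's quote
                counter[0] += 1

    threads = [threading.Thread(target=worker) for _ in range(num_threads)]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return counter[0]


def measure(num_threads, iterations=20000):
    """Run the workload and report how many context switches it caused."""
    before = context_switches()
    total = contended_workload(num_threads, iterations)
    after = context_switches()
    return {
        "increments": total,
        "voluntary": after[0] - before[0],
        "involuntary": after[1] - before[1],
    }


if __name__ == "__main__":
    for n in (1, 2, 5, 10):
        print(n, measure(n))
```

Running the same script on both operating systems at several thread counts would show whether the number of switches, rather than the cost of each switch, grows differently, on systems whose `getrusage()` fills in these fields.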

From the MySQL site:
"As a multithreaded server, MySQL is most efficient on an operating system that has a well implemented threading system"
Thirdly, we have the LMBench benchmarks, which are not conclusive, but point in the same direction. Even the high latency for the TCP measurements (see above) on Mac OS X might indicate relatively poor threading performance. MySQL has a TCP/IP connection thread, which handles all connection requests. This thread creates a new dedicated thread to handle the authentication and SQL query processing for each connection.
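Since every new connection costs a thread creation, the rate at which the OS can create and tear down threads is worth measuring directly. The following is a minimal sketch of such a test, not something that was run for this article. The function name `thread_creation_rate` and the default count are arbitrary, and the figure includes Python's own overhead on top of the underlying thread library, so it is only useful for comparing the same script across systems.

```python
import threading
import time


def thread_creation_rate(count=2000):
    """Create, start and join `count` trivial threads; return threads per second."""

    def noop():
        pass  # the thread does no work, so we mostly time creation and teardown

    start = time.perf_counter()
    for _ in range(count):
        thread = threading.Thread(target=noop)
        thread.start()
        thread.join()
    elapsed = time.perf_counter() - start
    return count / elapsed


if __name__ == "__main__":
    print(f"{thread_creation_rate():.0f} threads created and joined per second")
```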

The split funnel suspect

The last suspect is the locking system. In Panther, only two threads could be inside the kernel executing kernel code at the same time: one could hold the lock on the networking part, the other the lock on the rest of the kernel services.

In Tiger, the locking is finer. Although Apple's documents indicate that it is still rather coarse grained, it is clear that more than two locks into the kernel can exist at the same time. In the case of MySQL, this should be a very important improvement, but we didn't see any improvement at all when performing the tests on both Panther and Tiger. This is speculation, but based on our data, we are tempted to hypothesize that the new locking system isn't really working right now, and that Tiger continues to behave like Panther.

Does it affect you?

What does this all mean? Whether or not you skipped the technical part, you probably want to know how it affects your Apple (server) experience.

It is clear that if you plan to run MySQL on Apple hardware, it is better to install YDL Linux than to use OS X. If you need excellent read performance, the maximum performance of your server will be up to 8 times better. If your server is only going to serve a limited number of users, YDL Linux will allow you to run with a less expensive system.

If the usage pattern of your server is more OLTP (transaction processing) oriented, we give you the same advice. Our quick tests with InnoDB show the same kind of behavior, and we have noticed very slow file system performance. At this point, we do not have enough data to be conclusive. We noticed, for example, that importing data into our database (via the ">" command) took up to 8 times longer.


47 Comments


  • tthiel - Wednesday, May 24, 2006 - link

    You need to redo this entire test. So much has come out about how poorly this was done, it's hard to believe it came from Anandtech.
  • iggie - Friday, January 13, 2006 - link

    I'm surprised you didn't post the raw VM latency results from lmbench. I found another article (http://www-128.ibm.com/developerworks/library/l-yd...) that did a similar performance comparison (Darwin vs. Linux on G5).
    mmap latency is 3x greater, but most tellingly, page fault latency is > 900 x greater!

    Did you observe similar results in your tests?

    I would imagine that page faults would play a greater and greater role as more and more independent clients connect to a server. I have experienced a huge disparity in our own server software implementation for scientific imaging (http://www.openmicroscopy.org/api/omeis/). In our case, all disk access is done via mmap and page faults (it's a shared-VM-based image server system meant to serve many terabytes of image data).
  • asifyoucare - Sunday, September 4, 2005 - link

    Interesting article.

    If you suspect that thread performance is the bottleneck, why not write a short program to measure how many threads can be created and destroyed per second?

  • DoctorBooze - Saturday, September 3, 2005 - link

    quote:

    In the case of Linux, creating a thread is very similar to creating a process. [...] So, if you test fork() on Linux, you also get a rough idea of how fast threads are created

    I'm no guru but I don't think that's true now with Native Posix Threads, which you get in 2.6 kernels with a suitable libc (and some distros with 2.4 kernels). Check what your program's linked with: on my Fedora Core 3 system `ldd /usr/libexec/mysqld` shows me MySQL is linked with /lib/tls/libc.so.6 and running that shows it has NPTL. The API may be similar but what happens in the kernel isn't and it makes a big, big difference to MySQL. Still, Linux now has fast native POSIX threads and it looks like OS X doesn't.
  • ikruusa - Saturday, September 3, 2005 - link

    Indeed, as mentioned previously there was some mistakes in gcc options. And SIMD optimization is really basic in 4.0.x - only certain loops can be vectorized automatically. But loops around arrays are most significant part in signal processing and that is where SIMD really matters :)
    As we know for NetBurst arch it is recommended to use XMM registers (that is registers for SSE/SSE2) for FP calculations. And that is what gcc 3.x does (4.x too): -mfpmath=sse triggers all x87 stuff to run as scalar math using SSE command-set. As I know AltiVec is SIMD unit which is smoothly added to PowerPC pipeline. How useful there is scalar math instead of usual FP - I have no idea.
    What I want to say - my opinion is that if MySQL team has something to say about compiler options then they have documents about it. Using SIMD style processing in DB engine is very challenging exercise for coders. Dont expect magic from compiler here. Hint: maybe Intel's own icc compiler provide some magic but you have to prove it ;) I still believe that the most useful options can be -O[2,3] -funroll-loops and -ffast-math (as you mentioned) with -arch=[processor]. The last one should provide basic branching elimination (e.g. using cmov for x86) and correct instr. ordering.
    About testing Linux. I have some skills in Apache testing with JMeter. I have been quite stuck but kernel developers were kind enough to help: http://marc.theaimsgroup.com/?l=linux-kernel&m...
    Then I discovered all OS tuning possibilities in /proc Well, most are still unknown for me but I just want to get your attention here. Oracle talks about shared memory and number of semaphores and some particular Linux /proc parameters. Of course there should be all written in MySQL manual too if any parameter needs tuning. But is it enough to read MySQL manual and create profile for OS'es IPC and process management if we need to stress test MySQL on e.g 8-way SMP?
    But still - good start of interesting investigation, anandtech.com!! Thank you and keep going!
  • kvs - Saturday, September 3, 2005 - link

    If thread-creation is extremely slow in Darwin, maybe MySQL-performance could be helped by enabling the thread cache? A look at 'mysqladmin extended-status' would show how many threads had been created and cached, and should reveal if thread_cache would be needed.
  • tester2 - Friday, September 2, 2005 - link

    Well if ab on Mac OS X was the problem you could have easily tested this from a Linux box over the network.

    Because you probably did this as well, and found out that performance tuning done by Apple outperformed the Linux/PPC and Linux/Opteron system by a substantial amount you kept this out of the story ...

    So I did some testing, and yes when using ab from a Mac OS X I find the exact figures you report. Using a Linux Pentium 4 based system over Gb network gave me 6150 req/sec, substantially faster than anything out there.
    Look here for numbers from another source: http://www.pcmag.com/article2/0,1895,1637655,00.as...

    The webserver runs around 60 threads ... go figure.

    Yes there is a problem with the Mac OS X - Mysql combo if you are looking for performance, but judging from this that Mac OS X for server applications is a no-no is drawing the wrong conclusion. I hope someone with good development skills will look at the mysql code and tune it to work well with Mac OS X.

  • benh - Friday, September 2, 2005 - link

    Interesting article! One thing that is worth looking into however is whether the YDL kernel is actually a 32 or a 64 bits kernel. This would probably have an impact on some of the numbers. I would expect the ppc64 kernel to perform faster overall on a 64 bits CPU with a small overhead on syscalls from 32 bits applications due to the argument size translation.

    Also, the problem with the 2.7Ghz on linux is indeed a slight change in the firmware. It in fact looks like a bug in Apple Open Firmware device tree on those machines where they left out the properties providing the interrupt routing of the i2c controller in the north bridge used to drive the fan controller among others. The OS X driver silently falls back to a polled mechanism, while the linux driver doesn't and (shame on me!) used to have a small bug that would cause it to crash when unable to locate those properties.

    I posted a patch a while ago fixing that up, I would expect YDL to have an updated kernel/installer available by now.

    Finally, you are right about the U3 northbridge having a quite high memory latency, that is definitely not helping the G5. There have been rumours floating around that Apple now has a new bridge that improves that significantly, though it's pretty much impossible to tell if/when they will release a machine using it. IBM also had multicore G5s available for some time now, though Apple is still not releasing any machine using them.

    Regards,
    Ben.
  • JohanAnandtech - Friday, September 2, 2005 - link

    Thanks for the very helpful feedback.

    Do you have any idea why the U3 came with such high latency? Lack of development time? Lack of expertise? An inherent problem with the FSB of the G5? Rather old technology? You see I am very curious, and couldn't find much info on it.



  • benh - Friday, September 2, 2005 - link

    I don't know for sure. I wouldn't blame the FSB though. I remember reading somewhere that the memory controller in U3 was similar if not identical to the old one they used in U2 on G4 machines and was to blame but I can't guarantee the reliability of that information.
