Original Link: http://www.anandtech.com/show/1778
Introduction
A little bit more than a month ago, AnandTech published "No more mysteries: Apple's G5 versus x86, Mac OS X versus Linux" with the ambitious goal of finding out how the Apple platform compares, performance-wise, to the x86 PC platform. The objective was to find out how much faster or slower the Apple machines were compared to their PC alternatives in a variety of server and workstation applications.
Some of the results were very surprising and caught the attention of millions of AnandTech readers. We found out that the Apple platform was a winner when it came to workstation applications, but there were serious performance problems when you run server applications such as MySQL (Relational Database) or Apache (Webserver). The MySQL database running on Mac OS X and the Dual G5 was up to 10 times slower than on the Dual Opteron running Linux.
We suspected that Mac OS X was to blame as low level OS benchmarks (Lmbench 3.0) indicated low OS performance. The whole article was a first attempt to understand better how the Apple platform - Mac OS X + G5 - performs, and as always, first attempts are never completely successful or accurate. As we found more and more indications that the OS, not the CPU, was the one to blame, it became obvious that we should give more proof for our "Mac OS X has a weak spot" theory by testing both the Apple and x86 machines with Linux. My email was simply flooded with hundreds of requests for some Linux Mac testing...even a month after publication!
That is what we'll be doing in this article: we will shed more light on the whole Apple versus x86 PC, IBM G5 versus Intel CPU discussion by showing you what the G5 is capable of when running Linux. This gives us insight on the strength and weakness of Mac OS X, as we compare Linux and Mac OS X on the same machine.
The article won't answer all the questions that the first one had unintentionally created. As we told you in the previous article, Apple pointed out that Oracle and Sybase should fare better than MySQL on the Xserve platform. We will postpone the more in-depth database testing (including Oracle) to a later point in time, when we can test the new Apple Intel platform.
Why Bother?
Why do we bother, now that Apple has announced clearly that the next generation of the Apple machines will be based on Intel? Well, this makes our research even more interesting. As you will see further in the article, the G5 is not the reason why we saw terrible, slow performance. In fact, we found that the IBM PowerPC 970FX, a.k.a. "G5", has a few compelling advantages.
As Apple moves to Intel, the only thing that makes Apple unique, and not yet another x86 PC OEM, is Mac OS X. That is why Apple will attempt to prevent you from running an x86 version of Mac OS X on anything else but their own hardware (using various protection schemes), as Anand reported in "Apple's Move to x86: More Questions Answered". Mac OS X will be the main reason why a consumer will choose an Apple machine instead of a Dell one. So, as we get to know the strengths and weaknesses about this complex but unique OS, we'll get insight into the kind of consumers who would own an Intel based machine with Mac OS X - besides the people who are in love with Apple's gorgeous cases of course....
We also gain insight into the real reasons behind the move to Intel, and what impact it will have for the IT professional. Positive but very vague statements about the move to the Intel architecture have already been preached to the Apple community. For example, it was reported that the "Speed of Apple Intel dev systems impress developers". Proudly, it was announced that the current Apple Intel dev systems - based on a 3.6 GHz Intel Pentium 4 with 2 MB L2 Cache - were faster than a dual G5 2 GHz Mac. That is very ironic for three reasons.
Firstly, Apple's own website contradicts this at every turn. Secondly, we found a 2.5 GHz G5 to perform more or less like a 3 to 3.2 GHz Pentium 4 in integer tasks. So, a 2 GHz G5 is probably around the speed of a 2.6 GHz Pentium 4. It is only natural that a much faster single CPU with a better disk and memory system outpaces a slower dual CPU in single threaded booting and development tasks. Thirdly, the whole CPU industry is now focused on convincing consumers of how much better multi-core CPUs are compared to their "old" single core brethren.
The Aftermath of the First Article
We received a flood of mails and posts from readers requesting that we test the Apple machines with Linux too, and asking why we hadn't done that in the first article. We have to point out that the objective of the first article was to compare the platforms, and therefore, it is only natural to use Mac OS X on the Apple machine. Very few Apple machines run Linux, but in this article, we test this combination to shed more light on our findings.
Secondly, we spent most of our time trying out different MySQL setups to find out whether or not the poor MySQL numbers were a result of bad tuning. We tested and tried with, for example, the "skip-locking", "key_buffer" and "thread_cache" parameters, but none of them could help the Apple platform to perform significantly better. The out-of-the-box MySQL setup on Tiger is not very different from a typical SUSE Linux out-of-the-box installation, except that skip-locking is not enabled on the Apple platform. The reason seems to be that quite a few Xserves are used in clusters. Enabling "Skip-locking" gives a 1-3% performance boost to the Xserve and PowerMacs. We can say with 99% certainty that the MySQL configuration was not the cause of the poor MySQL performance.
The vast majority of the reactions of the Apple user community were very positive, despite our low server benchmark numbers. Many Apple users told us that they were glad that we had pointed out that Mac OS X still needs a bit of performance tuning. Anand reported the same thing that many Apple users pointed out, which is that the responsiveness of the OS is not spectacular:
"The overall responsiveness of the system was decent, but go back to using a top-of-the-line PC in Windows for a few minutes, and you definitely feel a bit sluggish on the G5"
We still receive suggestions because of the first article, and one question that was asked a lot was: "why not test with different compilers?" The reason is that gcc is the default compiler on both Mac OS X and Linux. Testing with different compilers would widen the scope of this kind of article too much, and we wanted to use the same compiler on all CPUs. That being said, we retested with the gcc 4.0 compiler because the 3.3 version performed pretty poorly on the PowerPC 970FX platform.
I would like to thank the readers for the valuable feedback. In this second part, we'll correct the inaccuracies in the first.
Scope and Focus
Again, we are focusing on workstation and server applications, especially the open source ones (MySQL, Apache), as Apple is heavily touting the importance of their move to an "open source foundation".
The 64 bit Apple machines were running OS X Server 10.4.1 (Tiger) and Yellow Dog Linux 4.0 (kernel 2.6.10-1.ydl.1g5-smp). The reason we chose Yellow Dog is that Terra Soft, the company behind this Linux distribution, optimizes only for the G5. So, Yellow Dog is by far the most PowerPC optimized Linux distribution out there.
Our x86 machines are still running a 64 bit server version of a popular open source Unix-like operating system: SUSE Linux SLES 9 Service Pack 1 (kernel 2.6.5).
Benchmark Configuration
We used the MySQL version (4.0.18) that came with the SUSE SLES 9 CDs and Mac OS X Tiger 10.4.1, which was certified to work on our OS. Our YDL Linux reported: "Linux version 2.6.10-1.ydl.1g5-smp (gcc version 3.3.3 (Yellow Dog Linux 3.3.3-16.ydl.7))"
SUSE SLES 9 (SUSE Enterprise Edition), Linux kernel 2.6.5, 64 bit
Workstation tests: Windows XP SP2
Apple PowerMac G5
OS X 10.4.1 Tiger, 64 bit (partially)
Yellow Dog Linux 3.3.3-16.ydl.7
MySQL 4.0.18, 32 and 64 bit, MyISAM engine
gcc 3.3.3 and 4.0
Hardware
Here is the list of the different configurations:
Apple PowerMac Dual 2.7 GHz, Dual 2.5 GHz
4 GB (8x512 MB) Corsair XMS3200 running at CAS 3-3-3
Dual Intel Xeon DP Irwindale 3.6 GHz 2 MB L2-cache, 800 MHz FSB - Lindenhurst Chipset
Intel® Server Board SE7520AF2
4 GB (4x1024 MB) Micron Registered DDR-II PC2-3200R, 400 MHz CAS 3, ECC enabled
NIC: Dual Intel® PRO/1000 Server NIC (Intel® 82546GB controller)
Opteron Server: Dual Opteron 250 (2.4 GHz)
Iwill DK8ES Bios version 1.20
4 GB (4x1 GB) Registered Transcend (Hynix 503A) DDR400 - (3-3-3-6)
NIC: Broadcom BCM5721 (PCI-E)
Client Configuration: Dual Opteron 250
MSI K8T Master1-FAR
4x512 MB Infineon PC2700 Registered, ECC enabled
NIC: Broadcom 5705
1 Seagate Cheetah 36 GB (15000 RPM, SCSI Ultra320, 8MB cache)
Maxtor 120 GB DiamondMax Plus 9 (7200 RPM, ATA-100/133, 8MB cache)
Words of thanks
A lot of people gave us assistance with this project, and we would like to thank them, of course:
Frank Balzer, IBM DB2/SUSE Linux Expert
Jasmin Ul-Haque, Novell Corporate Communications
Matty Bakkeren, Intel Benelux
Trevor E. Lawless, Intel US
Larry D. Gray, Intel US
Damon Muzny, AMD US
Nick Leman, MySQL expert
Ruben Demuynck, Vtune and OS X expert
David Van Dromme, Iwill Benelux Helpdesk
I also would like to thank Lode De Geyter, Manager of the PIH, for letting us use the infrastructure of the TUK to test the database servers.
Micro CPU Benchmarks: Isolating the FPU
Although it surely wasn't the main subject of our first article, the FLOPS (Floating Point Operations Per Second) portion was one part where I clearly made a mistake. Indeed, the --noaltivec flag and the comment that Altivec was enabled by default in the gcc 3.3 compiler docs made me believe that some Altivec SIMD optimization was being done when compiling flops, a synthetic micro FPU benchmark. That was not true: flops is double precision and gcc 3.3 did not support vectorisation.
As I wrote in the article, we used -O2 and then tried a bucket load of other options like -ffast-math and -mtune=G5, but it didn't make any significant difference.
Again, note that benchmarking with flops is not real world, but it isolates the FPU power. Flops shows the maximum double precision power that the core has by making sure that the program fits in the L1-cache. Flops consists of 8 tests, and each test has a different but well known instruction mix. The most frequently used instructions are FADD (addition), FSUB (subtraction) and FMUL (multiplication). We used the following on the Opteron based PCs:
gcc -O2 -march=k8 flops.c -o flops
And, on the G5 machines, we used:
gcc -O2 -mcpu=G5 flops.c -o flops
The command "gcc --version" gave this output: "gcc (GCC) 4.0.0 Copyright (C) 2005 Free Software Foundation, Inc."
Let us check out the results:
Table: flops results per test, listing the instruction mix (MOD, FADD, FSUB, FMUL, FDIV) and the scores for the Powermac G5 2.7 GHz, Powermac G5 2.5 GHz and Opteron 850 2.4 GHz.
As Gabriel Svelto and other readers pointed out, the problem with gcc 3.3 on PowerPC is that it generates very poorly scheduled code for these CPUs. The result is that gcc 3.3 does not make good use of the FP units of the G5 core, which are capable of FMADD (fused multiply-add) instructions. An FMADD multiplies the 64-bit, double-precision operand in floating-point register (FPR) "FRA" by the 64-bit, double-precision operand in FPR "FRC", and then adds the 64-bit, double-precision operand in FPR "FRB" to the product. Thus, if the code allows it, you can do a multiplication and an addition while executing only one instruction. gcc 4.0 is a lot better at using these capabilities, as you can see.
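To make the multiply-add pattern concrete, here is a minimal Python sketch. It is not taken from the flops source; the function name `horner` and the two counters are illustrative assumptions. It evaluates a polynomial with Horner's rule, where every step is one multiplication followed by one addition, which is the shape of work that a compiler can map onto a single FMADD when the hardware offers one. Python itself does not emit FMADD; the counters only show how the instruction count would halve.

```python
def horner(coeffs, x):
    """Evaluate coeffs[0] + coeffs[1]*x + coeffs[2]*x**2 + ... with Horner's rule.

    Returns (value, separate_ops, fused_ops), where the two counters are the
    number of FP instructions needed without and with a fused multiply-add.
    """
    value = 0.0
    separate_ops = 0  # one FMUL plus one FADD per step
    fused_ops = 0     # one FMADD per step
    for c in reversed(coeffs):
        value = value * x + c  # multiply, then add: the FMADD pattern
        separate_ops += 2
        fused_ops += 1
    return value, separate_ops, fused_ops


value, separate_ops, fused_ops = horner([1.0, 2.0, 3.0], 2.0)
print(value, separate_ops, fused_ops)
```

For this three-term polynomial, the loop needs six separate multiply and add instructions, or three fused ones.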
A bit disappointing is the fact that gcc 4.0 lowers the performance of the Opteron compared to gcc 3.3.3, but this article is not about compiler technology; rather, it is about comparing the G5 and the Apple platform to the x86 platform. With our current benchmark data, we can conclude that the G5's FPU performance is as good as the best x86 FP chip, the AMD Athlon 64 / Opteron. Using IBM's compiler for the G5 and Intel's compiler on the Opteron, there will be higher results for both platforms, but we wanted a comparison with exactly the same compiler technology.
The Xserve Server Platform
The most surprising and even astonishing results of the previous article were, of course, the MySQL and Apache server benchmarks. A powerful Windows XP based client (see above: "Client Configuration: Dual Opteron 250") fires off an enormous amount of Select, grouping and ordering read intensive queries and simulates 1 to 50 concurrent clients. All that query data is sent over a direct Gigabit Ethernet link to the tested server; in this case, a PowerMac Dual G5 2.5 GHz running OS X Server (Tiger). In part I, we discovered that performance of the Apple machine completely collapsed once there were more than 2 concurrent clients.
The solution? Install a Linux distribution to verify whether our suspicion that the OS is to blame is on the mark. We chose Yellow Dog Linux (YDL). Terra Soft, the company behind Yellow Dog, is an Apple Authorized OEM Value Added Reseller, so you could say that Apple has no objection to installing YDL on your Apple machines. There is more: Terra Soft specializes in optimizing for the G5 processor. The version that we used, Yellow Dog Linux 4.0.1, is based on the Linux Kernel version 2.6.10-1.ydl.1g5-smp.
Let us see how the Dual 2.5 GHz G5 performed in MySQL when running Yellow Dog Linux. Please note: YDL 4.0 wouldn't run on the 2.7 GHz Apple machine, so we do not have results for that platform.
The difference between the PowerMac running Linux and Mac OS X Server is absolutely striking. Mac OS X server shows better performance going from one to a second connection (and thus thread) because the second CPU steps in and helps carry the load. After that, however, performance completely collapses and stabilizes at around 50 queries per second.
While the G5 is not the best integer processing unit out there, it is not the one to blame for the poor performance that we experienced in our first tests. Running Yellow Dog Linux, the Dual G5 was capable of performing similarly to a 3 GHz Xeon. Notice that more concurrent connections give better performance from 1 to 20. At 5 concurrent simulated users, YDL simply wipes the floor with Mac OS X: 411 versus 113 queries per second. It gets worse at 10 concurrent users: 443 queries per second on Linux versus 62 on Mac OS X. Around 20 connections, performance declines only very slowly, just like on all the x86/Linux machines.
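As a quick worked check of those figures, the following Python sketch turns the four quoted queries-per-second numbers into speedup factors. Only the four numbers come from the text; the dictionary layout and the helper name `speedup` are illustrative.

```python
# Queries per second quoted in the text for the Dual G5 2.5 GHz.
results = {
    5: {"ydl": 411, "osx": 113},
    10: {"ydl": 443, "osx": 62},
}


def speedup(ydl_qps, osx_qps):
    """How many times faster Yellow Dog Linux is than Mac OS X."""
    return ydl_qps / osx_qps


for users, qps in results.items():
    factor = speedup(qps["ydl"], qps["osx"])
    print(f"{users} concurrent users: Linux is {factor:.1f}x faster")
```

The gap roughly doubles between 5 and 10 users, from about 3.6x to about 7.1x, because the Linux result keeps rising while the Mac OS X result keeps falling.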
With the MySQL performance woes now clearly caused by OS X, let us see if Apache tells us the same story. We tested with Apachebench, with "n" being the total number of requests and "c" the number of concurrent connections:
ab -n 100000 -c 100 http://localhost/
Some people suggested that we should test with both Apache 1.3 and 2.0, so we gave Apache 2.0 a test run.
| Unit: Requests per second | Powermac Dual G5 2.5 GHz OS X | Powermac Dual G5 2.5 GHz YDL | Dual Xeon 3.6 GHz |
On OS X, we noticed that the activity monitor was telling us that the CPUs were not working very hard and were underutilized. This seems to indicate that the problem with Apache is somewhat different from MySQL, as MySQL showed a CPU load between 165% and 190%. (200% is the maximum, as there are 2 CPUs in the system.)
Apple told us that the problem lies in Apachebench (the client side), which stalls from time to time and thus generates too low a load on the (Apache) server. The weird thing is that this does not happen with few connections (up to 10,000). When we repeated the test, Apachebench on Mac OS X got into trouble again. Version 2.0 is slightly faster on OS X, but it still trails by a significant margin. On the other hand, YDL and the Xeon platform are roughly 3X as fast with version 2.0.
According to Apple, this is a bug in Apachebench. Now, we can accept that explanation, as it is clear that the server is not loaded and can still accept a lot more web requests. However, the Apachebench problem is still interesting. Why exactly does the client stall? Is it really a bug or is it running out of some resources? We didn't delve deeper, as we are developing a less synthetic, closer to the real world benchmark to test web servers.
Even if we ignore the Apache results, our MySQL tests - and the queries used in these tests - are based on a real world usage pattern of a real world database. The G5 is partially crippled by a chipset that takes a long time to access the memory, and it's not the fastest integer CPU; still, it performs like a 3 GHz Xeon on Linux. The problem clearly lies in Mac OS X, and is worth further investigation.
Bottleneck Search
We did some basic profiling, and this allows us to eliminate a few bottlenecks as the cause of the performance issues. As we discussed in the first article, network performance wasn't an issue: we used a direct Gigabit Ethernet link between client and server. On average, the server received 4 Mbit/s and sent 19 Mbit/s of data, with a peak of 140 Mbit/s. That peak of 140 Mbit/s is only achieved when running at the highest performance (500-600 queries per second); the Apple machine stayed well below that peak.
Another theory was published in a personal blog: the fsync() theory. Basically, this system call forces the OS to write all the pending data to the disk drive, and then forces the disk drive to write all the data in its write cache to the platters. The theory is that most OSes do not force the last step, while Mac OS X does. However, this theory does not explain the lackluster performance that we noticed.
First of all, we saw at most 23 KB/s writes, again at peak performance, in the case of the Dual G5 running Mac OS X at 274 queries per second. To avoid excessive writing, our Dbbench client has a warm-up period where the database is put under load but no measurements take place. This makes our benchmarking consistent, and lowers the pressure on the disk system. You can read more about our MySQL test methods here. Secondly, we were using the MyISAM database engine, which does not support this "transactional safe writing".
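For readers who have not met fsync() before, here is a small Python sketch of the call in question. It is not part of our benchmark; the helper name `timed_write` and the payload size are illustrative, with the size picked to match roughly one second of the 23 KB/s write rate quoted above. Whether the drive's own write cache is flushed as well depends on the OS and the drive, and this sketch cannot show that.

```python
import os
import tempfile
import time


def timed_write(path, payload, sync):
    """Write payload to path and return the elapsed time in seconds."""
    start = time.perf_counter()
    with open(path, "wb") as f:
        f.write(payload)
        f.flush()                 # push Python's own buffer to the OS
        if sync:
            os.fsync(f.fileno())  # ask the OS to push its cache to the device
    return time.perf_counter() - start


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "fsync_demo.bin")
    payload = b"x" * 23_000
    buffered = timed_write(path, payload, sync=False)
    synced = timed_write(path, payload, sync=True)
    print(f"buffered: {buffered * 1e6:.0f} us, with fsync: {synced * 1e6:.0f} us")
```

With so little data written per second, even an expensive fsync() has few opportunities to hurt, which is in line with the reasoning above.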
MySQL Configuration
We played around with all the configuration variables mentioned here, but none of them made any real difference for the Mac OS X MySQL performance. Again, the "query cache" was off, as we wanted to test worst case performance. More info on why we test this way can be found here.
For those who are curious, we did a quick test with "query cache on". The Apple machine scored about 500 queries per second. In the case of the Linux x86 machines, we had to use several clients. It seems that each client can fire off at most 1000 queries per second. This appears to be a Windows 2003 limitation, since faster Opterons (2.6 GHz instead of 2.4 GHz) or quad Opteron clients (instead of dual) couldn't get us past this limit either. With several clients firing off queries, the Linux machines were capable of a peak of 2700 queries per second (and probably more - we had 3 clients at most), while the Mac was still limited to 500 queries per second. Note that this is "best case" performance, since up to 60% of the queries were picked out of the cache. With more random queries, these numbers are significantly lower.
Let us see if LMBench can make us wiser, now that we can compare Linux and Mac OS X on the Apple PowerMac.
Low level benchmarking on Mac OS X and Linux
Lmbench 3.0 provides a suite of micro-benchmarks that measure the bottlenecks at the Unix operating system and CPU level. We were not able to install Yellow Dog Linux on the 2.7 GHz Apple machine, although the paper specs (excluding the CPU) were exactly the same as our 2.5 GHz PowerMac. Some small chipset tweak or firmware change is probably the cause of our YDL installation failure on the newest and latest Apple PowerMac.
So, we'll give Mac OS X a small advantage by running it on the 2.7 GHz and Linux on the 2.5 GHz machine. Frankly, I don't care much about an 8% clock difference, as the main goal is to find out why MySQL runs between 5 and 8 times slower on Mac OS X!
The Unix process/thread creation is called "forking" as a copy of the calling process is made. lmbench "fork" measures simple process creation by creating a process and immediately exiting the child process. The parent process waits for the child process to exit. lmbench "exec" measures the time to create a completely new process, while "sh" measures the time to start a new process and run a little program via /bin/sh (complicated new process creation).
Everything is expressed in micro seconds, lower is better.
| System | OS | MHz | fork | exec | sh |
|---|---|---|---|---|---|
| G5 2.7 GHz | Darwin | 2700 | 659 | 2308 | 4960 |
| G5 2.5 GHz | Linux | 2500 | 182 | 748 | 2259 |
| Xeon 3.6 GHz | Linux | 3585 | 158 | 467 | 2688 |
In the previous article, I wrote: "Mac OS X is incredibly slow - between 2 and 5(!) times slower - in creating new threads, as it doesn't use kernelthreads, and has to go through extra layers (wrappers)"
Readers pointed out that there were two errors in this sentence. The first one is that Mac OS X does use kernelthreads, and this is completely true. My confusion came from the fact that FreeBSD 4.x and older - which was part of the OS X kernel until Tiger came along - did not implement kernelthreads; rather, only userthreads. It was one of the reasons why MySQL ran badly on FreeBSD 4.x and older. In the case of userthreads, it is not the kernel that manages the threads, but an application running on top of the kernel in userspace.
However, this is not the case of Mac OS X. Pthreads, available to the programmer, map directly to a Mach thread, and thread handling is the very heart of the Mach kernel inside the OS X kernel.
Mac OS X thread layering hierarchy (Courtesy: Apple)
"POSIX thread (commonly referred to as a "pthread") is a lightweight wrapper around a Mach thread that enables it to be used by user-level processes. POSIX threads are the basis for all of the application-level threads."
Readers also pointed out that LMBench uses "fork", which is the way to create a process and not a thread in all Unix variants, including Mac OS X and Linux. I fully agree, but does this mean that the benchmark tells us nothing about the way that the OS handles threading? A poor result in this particular Lmbench benchmark does not prove that thread creation is slow, but it does give us some indication of a performance issue. Allow me to explain.
In the Unix world, threads are often described as lightweight processes. A thread is a sequential flow of control within a program, whereas a process is one or more threads plus its own (virtual) address space. Threads share the same address space and thus, share memory and can exchange information very quickly. However, it is important to understand that threads and processes are not completely different, but they are related to each other.
In the case of Linux, creating a thread is very similar to creating a process. In fact, you use the same procedure, only with different flags or parameters. Linux implements all threads as if they were standard processes. You create a new thread with the clone() call and the necessary flags, which describe the resources (memory) to be shared. Process creation is done with fork(), but fork() is nothing more than a clone() without the flags that describe the resources that must be shared. So, if you test fork() on Linux, you also get a rough idea of how fast threads are created.
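A thread-creation counterpart to the fork() test is easy to sketch in Python. The function name `thread_create_us` is illustrative. The sketch assumes CPython, whose threads are created through the platform's pthread library; on Linux that library in turn relies on clone(). The interpreter adds overhead of its own, so treat the result as a relative figure for comparing operating systems on the same machine.

```python
import threading
import time


def thread_create_us(iterations=500):
    """Average time in microseconds to start a do-nothing thread and join it."""
    def noop():
        pass

    start = time.perf_counter()
    for _ in range(iterations):
        worker = threading.Thread(target=noop)
        worker.start()   # the OS-level thread is created here
        worker.join()    # wait until it has finished
    return (time.perf_counter() - start) / iterations * 1e6


if __name__ == "__main__":
    print(f"thread create+join: {thread_create_us():.0f} us")
```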
What about Mac OS X? Well, when the Mach kernel is asked to create a Unix process (fork()), the Mach kernel creates a task (which describes the resources available) and one thread. So, thread creation time is included in the fork() benchmark of Lmbench.
What can we conclude from this? First, the table above demonstrates clearly that the creation of UNIX processes is much slower on Mac OS X, and the G5, the CPU, is not to blame. In the first test, the G5 2.5 GHz running Linux is only slightly slower than a Pentium 4 at 3.6 GHz. The third test shows that the G5 is even capable of outperforming the other CPUs, which points towards Mac OS X being the problem here. Even with a faster CPU, the OS X scores are all slower than the Linux scores on the G5.
Does this give us an idea of why MySQL performs so badly? Unfortunately, it makes us suspect that not only process, but thread creation is also slow. We can suspect it, since the process creation, which includes the creation of one thread, takes up to 5 times longer. We can't prove it, as the thread creation time is a small part of the total benchmark time, and we are not sure that the time to create a thread compared to the total time is the same proportionally on both Mac OS X and Linux. LMBench gives us a rough indication that we might be right, but it doesn't give us cold hard facts. We need to look elsewhere for those.
Interprocess Communication (IPC) and Signaling
Signals allow processes (and thus threads) to interrupt other processes. Although MySQL is only one process, it must cooperate with other processes via IPC. Benchmarking signal and interprocess communication latency allows us to understand how quickly the MySQL process can cooperate with other processes and the Operating System. Contrary to workstation and gaming applications, access to the operating system and other processes is critical for database server performance. For example, our client in our database testing setup sends the queries via a Gigabit Ethernet connection (hardware - Layer 1) and via TCP-IP (Network stack Layer 2-4) to the server.
Larry McVoy (SGI) and Carl Staelin (HP) on signaling:
"Lmbench measures both signal installation and signal dispatching in two separate loops, within the context of one process. It measures signal handling by installing a signal handler and then repeatedly sending itself the signal."
All numbers are expressed in microseconds; thus, lower is better.
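| System | OS | MHz | null call | null I/O | stat | open/close | select TCP | sig inst | sig hndl |
|---|---|---|---|---|---|---|---|---|---|
| G5 2.7 GHz | Darwin | 2700 | 1.13 | 1.91 | 4.64 | 8.6 | 21.9 | 1.67 | 6.2 |
| G5 2.5 GHz | Linux | 2500 | 0.14 | 0.26 | 3.41 | 4.16 | 18.9 | 0.38 | 1.9 |
| Xeon 3.6 GHz | Linux | 3585 | 0.19 | 0.25 | 2.3 | 2.88 | 9.0 | 0.28 | 2.7 |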
Signaling needs significantly more time in Mac OS X (Darwin) than on Linux. The processor plays a minor role: the Opteron at 2.4 GHz is a bit faster than the Xeon 3.6 GHz running exactly the same (x86) code. However, it is clear that the operating system plays a much bigger role: a 2.5 GHz G5 running Linux easily beats the identical system with a 2.7 GHz G5 running Mac OS X. Despite the FreeBSD heritage, the TCP signals are very slow (4 times slower!) on Mac OS X.
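The signal handling test quoted above (install a handler, then repeatedly send yourself the signal) can be sketched in Python as follows. This is an approximation with an illustrative function name, not the lmbench code. It assumes a Unix system and must run in the main thread, which is where Python allows signal handlers to be installed. Python runs the handler slightly after the kernel delivers the signal, so the figure includes interpreter overhead.

```python
import os
import signal
import time


def signal_roundtrip_us(iterations=10_000):
    """Return (average microseconds per signal sent, number of handler calls)."""
    handled = 0

    def handler(signum, frame):
        nonlocal handled
        handled += 1

    previous = signal.signal(signal.SIGUSR1, handler)   # signal installation
    try:
        pid = os.getpid()
        start = time.perf_counter()
        for _ in range(iterations):
            os.kill(pid, signal.SIGUSR1)                # send ourselves the signal
        elapsed = time.perf_counter() - start
    finally:
        # Restore whatever was installed before (default action if unknown).
        signal.signal(signal.SIGUSR1, previous if previous is not None else signal.SIG_DFL)
    return elapsed / iterations * 1e6, handled


if __name__ == "__main__":
    per_signal_us, handled = signal_roundtrip_us()
    print(f"{per_signal_us:.2f} us per signal, {handled} handler calls")
```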
The slower signaling results likely contribute to the overall unimpressive MySQL performance. There are still other factors that also play a part. Let us check out Inter Process Communication (IPC).
| G5 2.7 GHz | Darwin | 9.496 | 13.1 | 34.8 | 44.5 | 61 |
| G5 2.5 GHz | Linux | 11.6 | 16.4 | 19.1 | 19.6 | 34 |
| Xeon 3.6 GHz | Linux | 9.909 | 19.0 | 16.0 | 19.3 | 40 |
As TCP is connection based, you get Synchronize (SYN) and Acknowledgement (ACK) messages to establish a reliable connection, before any data can be transferred. Lmbench measures this startup time (TCP conn). Notice how the G5 performs this task quite quickly with Linux, but much slower with Mac OS X. The latency to connect to a TCP server is also measured (TCP) and Mac OS X is measured to be more than twice as slow compared to the Linux based machines, including the same G5 machine. So, although network bandwidth might not be a problem for our benchmark, network latency might have an influence.
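The connection setup time discussed above can be approximated on a single machine with the Python sketch below. It is a minimal stand-in for the lmbench "TCP conn" test, with an illustrative function name: a listening socket on the loopback interface, and a loop that times only the connect() call, which is where the SYN and ACK exchange happens.

```python
import socket
import time


def tcp_connect_latency_us(iterations=200):
    """Average time in microseconds to establish a loopback TCP connection."""
    server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    server.bind(("127.0.0.1", 0))    # port 0: let the OS pick a free port
    server.listen(16)
    address = server.getsockname()
    total = 0.0
    try:
        for _ in range(iterations):
            client = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
            start = time.perf_counter()
            client.connect(address)              # three-way handshake
            total += time.perf_counter() - start
            accepted, _peer = server.accept()    # drain the queue so it never fills
            accepted.close()
            client.close()
    finally:
        server.close()
    return total / iterations * 1e6


if __name__ == "__main__":
    print(f"TCP connect: {tcp_connect_latency_us():.0f} us")
```

Because both ends live on the same host, this measures the network stack of the operating system rather than the network itself, which is the part that matters for the comparison between Linux and Mac OS X.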
Some studies show that there is a direct relationship between these TCP benchmarks and some aspects of Database performance. For example, it was reported that "The TCP latency benchmark is an accurate predictor of the Oracle distributed lock manager's performance." 
Mac OS X Achilles Heel
It is clear that profiling MySQL on the kernel is the only way that we are going to be able to pin-point why exactly MySQL is so slow on Mac OS X. So, why did I state that I believe the threading engine in Mac OS X to be rather slow? Well, I admit that I should have made it more clear in the article that I didn't have rock-solid evidence. However, my suspicion is based on more than speculation.
First of all, notice that the Mac OS X performance is decent with a concurrency of one, or one simulated user. It still performs well when a second user is simulated, as the second CPU can kick in and push performance higher. Let us check the scaling, by putting the numbers of our MySQL graphic into a table.
| Concurrency | Dual G5 2.5 GHz Tiger | Scaling (Concurrency one=100%) | Dual G5 2.5 GHz Linux 2.6 | Scaling (Concurrency one=100%) | Dual Opteron 2.4 GHz | Scaling (Concurrency one=100%) |
The performance at concurrency 1 and 2 is mediocre, but not really bad. Notice that the scaling of Mac OS X from one to two is not fantastic, but is almost as good as the Linux machines. Once we work with 5 concurrent users, however, performance collapses on Mac OS X: we get only 60% of the performance at concurrency one. With Linux, both CPUs are not stressed at a concurrency of two, and increasing the load makes the CPUs work harder.
The G5 (Linux) achieves its peak quicker as it is a bit slower in this integer intensive task than the Opteron. However, it is important to remark that while performance begins to decline very slowly as we increase the number of users, there is no collapse! At a concurrency of 50, we still have 80% more performance than at a concurrency of one, showing that Linux handles the extra load of the extra threads very well. On Mac OS X, performance has plummeted to one quarter of our initial performance, showing that the threads are creating an additional overhead somehow.
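The scaling percentages are simple ratios against the result at concurrency one. The Python sketch below shows the arithmetic for Mac OS X. Note the assumption: the baseline is not a measured number here but is derived from two statements in this article, namely 113 queries per second at 5 users and the remark that this is about 60% of the single-user result. The 50 queries per second figure is the plateau mentioned earlier.

```python
def scaling_percent(qps, baseline_qps):
    """Throughput as a percentage of the concurrency-one throughput."""
    return qps / baseline_qps * 100.0


# Derived, not measured: 113 qps is described as about 60% of the
# concurrency-one result, which implies a baseline near 113 / 0.60.
implied_baseline = 113 / 0.60

for label, qps in [("5 users", 113), ("10 users", 62), ("plateau", 50)]:
    print(f"{label}: {scaling_percent(qps, implied_baseline):.0f}% of concurrency one")
```

Under that assumption, the plateau of about 50 queries per second works out to roughly 27% of the single-user result, which agrees with the "one quarter" mentioned above.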
Secondly, it is a fact that our benchmark is not disk limited. In that case, it is well documented that MySQL performance depends on the threading performance of the OS. A few examples:
MySQL Reference Manual for version 5.0.3-alpha:
"MySQL is very dependent on the thread package used. So when choosing a good platform for MySQL, the thread package is very important"
More:
"The capability of the kernel and the thread library to run many threads that acquire and release a mutex over a short critical region frequently without excessive context switches. If the implementation of pthread_mutex_lock() is too anxious to yield CPU time, this will hurt MySQL tremendously. If this issue is not taken care of, adding extra CPUs will actually make MySQL slower"
Darwin (6.x and older) used to be quite a bit slower when it came to context switches, but our own LMBench testing shows that the latest Darwin 8.0 performs context switches nearly as fast as Linux kernel 2.6. So, a possible explanation might be that more context switches happen, but we still have to find a method to measure this. Suggestions are welcome....
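The pattern that the manual describes, many threads taking and releasing a mutex around a very short critical region, can be sketched in Python, together with one possible way of counting context switches. This is only an illustration with an assumed function name: `resource.getrusage` is a Unix-only call, its context switch counters are not maintained equally well on every OS, and CPython's global interpreter lock adds switching of its own, so the numbers are indicative at best.

```python
import resource
import threading


def contended_increments(num_threads=4, per_thread=20_000):
    """Return (final counter, voluntary switches, involuntary switches)."""
    lock = threading.Lock()
    counter = 0

    def worker():
        nonlocal counter
        for _ in range(per_thread):
            with lock:          # short critical region, entered very frequently
                counter += 1

    before = resource.getrusage(resource.RUSAGE_SELF)
    threads = [threading.Thread(target=worker) for _ in range(num_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    after = resource.getrusage(resource.RUSAGE_SELF)

    voluntary = after.ru_nvcsw - before.ru_nvcsw      # thread gave up the CPU
    involuntary = after.ru_nivcsw - before.ru_nivcsw  # thread was preempted
    return counter, voluntary, involuntary


if __name__ == "__main__":
    total, voluntary, involuntary = contended_increments()
    print(f"counter={total}, voluntary={voluntary}, involuntary={involuntary}")
```

Comparing the two counters for the same workload on two operating systems would be one way to test the "more context switches happen" hypothesis.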
From the MySQL site:
"As a multithreaded server, MySQL is most efficient on an operating system that has a well implemented threading system"
Thirdly, we have the Lmbench benchmarks, which are not conclusive, but point in the same direction. Even the high latency for the TCP measurements (see above) on Mac OS X might indicate relatively poor threading performance. MySQL has a TCP/IP connection thread, which handles all connection requests. This thread creates a new dedicated thread to handle the authentication and SQL query processing for each connection.
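The thread-per-connection design described above looks roughly like the following Python sketch. It is not MySQL code; the function names and the trivial "ok:" reply are illustrative. One accepting thread hands every incoming TCP connection to a freshly created worker thread, so each new client costs one TCP connection setup plus one thread creation.

```python
import socket
import threading


def handle_client(conn):
    """Dedicated per-connection thread: answer one request and close."""
    with conn:
        data = conn.recv(1024)
        conn.sendall(b"ok: " + data)


def accept_loop(server, max_connections):
    """Connection thread: accept clients and give each one its own thread."""
    workers = []
    for _ in range(max_connections):
        conn, _addr = server.accept()
        worker = threading.Thread(target=handle_client, args=(conn,))
        worker.start()   # one thread creation per incoming connection
        workers.append(worker)
    for worker in workers:
        worker.join()


def demo(num_clients=3):
    """Start the server on a free loopback port and send one query per client."""
    server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    server.bind(("127.0.0.1", 0))
    server.listen(num_clients)
    address = server.getsockname()
    acceptor = threading.Thread(target=accept_loop, args=(server, num_clients))
    acceptor.start()
    replies = []
    for i in range(num_clients):
        with socket.create_connection(address) as client:
            client.sendall(f"query {i}".encode())
            replies.append(client.recv(1024))
    acceptor.join()
    server.close()
    return replies


if __name__ == "__main__":
    print(demo())
```

With this structure, slow TCP connection setup and slow thread creation both land on the path of every new client, which is why the Lmbench numbers for both are relevant.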
The split funnel suspect
The last suspect is the locking system. In Panther, only two threads could lock into the kernel to execute code of the kernel. One thread could lock into the networking part, while the other into the rest of the kernel services.
In Tiger, the locking is finer. Although Apple's documents indicate that it is still rather coarse grained, it is clear that more than two locks into the kernel can exist at the same time. In the case of MySQL, this should be a very important improvement, but we didn't see any improvement at all when performing the tests on both Panther and Tiger. This is speculation, but based on our data, we are tempted to hypothesize that the new locking system isn't really working right now, and that Tiger continues to behave like Panther.
Does it affect you?
What does this all mean? Whether or not you skipped the technical part, you probably want to know how it affects your Apple (server) experience.
It is clear that if you plan to run MySQL on Apple hardware, it is better to install YDL Linux than to use OS X. If you need excellent read performance, the maximum performance of your server will be up to 8 times better. If your server is only going to serve a limited number of users, YDL Linux will allow you to run with a less expensive system.
If the usage pattern of your server is more OLTP (transaction processing) oriented, we give you the same advice. Our quick tests with InnoDB show the same kind of behavior and we have noticed very slow file system performance. We noticed, for example, that importing data in our database (via the ">" command) took up to 8 times longer. At this point, however, we do not have enough data to be conclusive.
The G5 a.k.a. Power 970FX
You might not have noticed it, but there is in fact a lot of good news in this article for owners of current Apple systems. Gcc 4.0 promises much better (FP) performance in open source software. The improvement from gcc 4.0 over gcc 3.3.3 and 3.3 is amazing on the PowerFX family: almost 70% better FP performance!
Now that the open source community finally has a decent compiler for the Apple platform, Apple management has decided to switch to another architecture. Ironically, right now, the Intel architecture needs a super-optimized compiler (Intel's own) to reach the FP performance that the G5 now reaches with a very popular but far less aggressive compiler (gcc).
Combined with the data from our first article, we can safely say that the FP performance of the 2.7 GHz G5 is at least as good as that of the best x86 CPUs. Integer performance seems to be between 70% and 80% of that of the fastest x86 CPUs, while FP/SIMD performance can actually surpass x86 in certain situations.
With the dual-core Power 970MP available and IBM's current outstanding track record when it comes to multi-core CPUs, big question marks can be placed on whether or not the switch to Intel CPUs will - from a technical point of view - be such a big step forward as Steve Jobs claims. There is more: each core has 1 MB cache instead of the current 512 KB, which will improve integer performance quite a bit as it lessens the impact of the biggest problem of the G5 - the high latency access to the memory system.
It is again ironic that the Power 970MP is far more advanced than the current Intel Dual-cores when it comes to power management. Each core can be placed independently in a power-saving state called doze, while the other core continues operation.
Xserve, silently cooled. Below the G5 with the cover, you can see the heatsink.
A low power Power 970FX is also available and consumes about 16 Watts at 1.6 GHz; so it seems that IBM, although slightly late, could have provided everything that Apple needs. The G5 with its 58 million transistors and 66 mm² die size is not really a hot CPU. The Xserve (2 x 2.3 GHz G5) was by far the quietest 1U air-cooled server that ever entered our lab in Kortrijk.
The Usual Suspects
The Mac OS X kernel environment includes the Mach kernel, BSD, the I/O Kit, file systems, and networking components. Some of these components slow down MySQL significantly. While our rough profiling has not identified the true culprits, we think that we can narrow the possible suspects to:
- Relatively high TCP Latency that we measured
- The implementation of the threading system. Does the pthread to Mach threads mapping involve some overhead, or is this the result of the traditional performance problem of the microkernel, namely the high latency of such a kernel on system calls? While Mac OS X is not a microkernel, the problem might still exist as the Mach core is deep inside that kernel. Is there IPC overhead? Lmbench signaling benchmarks seem to suggest that there is.
- The finer grained locking in the current version of Tiger does not appear to be working for some reason and we still have the "two lock" system of Panther.
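For readers who want a first impression of system call latency on their own machine, here is a rough sketch in the spirit of Lmbench's system call latency test. It is not Lmbench itself, and the function name `null_syscall_latency_ns` is illustrative. It assumes that `os.getppid()` results in a real system call on the platform, and the figure it returns includes Python interpreter overhead, so it is only useful for comparing the same script on two operating systems.

```python
import os
import time


def null_syscall_latency_ns(iterations=100_000):
    """Average time per os.getppid() call in nanoseconds.

    Assumption: getppid() is a cheap system call that is not cached in
    user space, which makes it a common choice for a "null" syscall test.
    The result includes the overhead of the Python call itself.
    """
    start = time.perf_counter_ns()
    for _ in range(iterations):
        os.getppid()
    elapsed = time.perf_counter_ns() - start
    return elapsed / iterations


if __name__ == "__main__":
    print(f"{null_syscall_latency_ns():.0f} ns per call")
```

A clearly higher figure on one OS than on another, on identical hardware, would point in the same direction as the Lmbench results mentioned in the list above. It would not prove where the time is spent.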
We look forward to testing other database and server apps on the Mac OS X platform. Critical reports that point out weaknesses can only help the Apple community move forward and keep the Apple people on their toes.
References
 Threading on OS X
 Lmbench: Portable Tools for Performance Analysis
Larry McVoy, Silicon Graphics
Carl Staelin, Hewlett-Packard Laboratories
 Performance Characterization of a Quad Pentium Pro SMP Using OLTP
Kimberly Keeton*, David A. Patterson*, Yong Qiang He+, Roger C. Raphael+, and Walter E. Baker
Computer Science Division
University of California at Berkeley