Setting Expectations: A Preview of What's to Come in Mobile

Sitting in the audience at the iPhone 5s launch, I remember seeing a graph showing how iPhone CPU performance has increased since the first iPhone. Apple claimed a 40x increase in CPU performance if you compared the Cyclone cores in its A7 SoC to the ARM11 core in the first iPhone. What’s insane is just how short a time period that comparison spans: 2007 - 2013.

I ran SunSpider on all of the iPhones in our 5s review to validate Apple’s numbers. I came out with roughly a 100x increase in performance, or something closer to half of that once you account for the fact that the newer phones could run later versions of iOS (with their Safari/JavaScript performance improvements). SunSpider is a very CPU and browser bound workload, but even if we turn to something a bit closer to real world usage like Browsermark 2.0, I measured a 5x increase in CPU performance over the past 6 years of iPhones.
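
To make that software-versus-hardware split concrete, here's a minimal back-of-the-envelope sketch; the 2x software factor is simply implied by the "closer to half" figure above, not a separately measured number.

```python
# Rough split of a measured benchmark speedup into hardware and software
# components. The total speedup is (roughly) the product of the two.
measured_total_speedup = 100.0   # SunSpider, original iPhone -> iPhone 5s
assumed_software_speedup = 2.0   # assumed gain from newer iOS/Safari JS engines

hardware_only_speedup = measured_total_speedup / assumed_software_speedup
print(f"Estimated hardware-only speedup: ~{hardware_only_speedup:.0f}x")  # ~50x
```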

I frequently refer to the progress we’ve seen in mobile as being hyper Moore’s Law. Until recently, the gains in mobile hadn’t approached the point where they were limited by process technology. Instead it was variables like cost and time to market that governed how much performance was delivered each year. We’re at the beginning of all of this changing, and mobile will eventually look a lot like what we’ve had in the desktop and notebook CPU space for years now.

When performance results from the new Mac Pro first hit, there seemed to be disappointment in how small some of the gains were. If you compare it to the progress in CPU performance Apple has demonstrated on the other side of the fence, you’re bound to be underwhelmed.

Having personally reviewed every CPU architecture that has gone into the Mac Pro since its launch, I had a rough idea of what to expect from each generation - so I decided to put it all in a chart.

I went back through all of my Conroe, Penryn, Nehalem, Westmere and Ivy Bridge data, looked at IPC improvements in video encoding/3D rendering workloads, and used that data to come up with the charts below. I made a table of every CPU offered in the Mac Pro and scaled expected performance according to max single and multi-core turbo frequencies.
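
For the curious, a minimal sketch of that scaling model looks something like the following. The IPC factors are illustrative placeholders (Conroe normalized to 1.0), not the values measured from the encode/render data, so the numbers it produces won't exactly match the charts.

```python
# Estimate relative CPU performance from architecture IPC and turbo clocks:
#   single threaded  ~ IPC factor x max single-core turbo
#   multithreaded    ~ IPC factor x max all-core turbo x total cores
# The IPC factors below are illustrative placeholders (Conroe = 1.0 baseline),
# not the values measured from the video encode/3D render data.
IPC = {"Conroe": 1.00, "Penryn": 1.05, "Nehalem": 1.15, "Westmere": 1.15, "Ivy Bridge": 1.30}

def relative_perf(arch, cores, one_core_turbo_ghz, all_core_turbo_ghz):
    st = IPC[arch] * one_core_turbo_ghz
    mt = IPC[arch] * all_core_turbo_ghz * cores
    return st, mt

# Entry-level 2006 vs. 2013 Mac Pro CPUs (cores/clocks from the first table below);
# dividing the scores gives a crude estimate of the generational gain.
print(relative_perf("Conroe", cores=4, one_core_turbo_ghz=2.0, all_core_turbo_ghz=2.0))
print(relative_perf("Ivy Bridge", cores=4, one_core_turbo_ghz=3.9, all_core_turbo_ghz=3.7))
```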

Let’s start by looking at what you can expect if you always buy the absolute cheapest Mac Pro. That means starting off with the Xeon 5130, moving to the E5462, then the W3520, W3530 and W3565, and ending up with the E5-1620 v2 in today’s Mac Pro. I’ve put all of the choices in the table below:

Mac Pro - Cheapest Configuration Upgrade Path
             CPU               Chips  Cores per Chip  Total Cores / Threads  Clock Base/1C Turbo/All-Core Turbo  Launch Price
Mid 2006     Xeon 5130         2      2               4 / 4                  2.0/2.0/2.0 GHz                     $2199
Early 2008   Xeon E5462        1      4               4 / 4                  2.8/2.8/2.8 GHz                     $2299
Early 2009   Xeon W3520        1      4               4 / 8                  2.66/2.93/2.8 GHz                   $2499
Mid 2010     Xeon W3530        1      4               4 / 8                  2.8/3.06/2.93 GHz                   $2499
Mid 2012     Xeon W3565        1      4               4 / 8                  3.2/3.46/3.33 GHz                   $2499
Late 2013    Xeon E5-1620 v2   1      4               4 / 8                  3.7/3.9/3.7 GHz                     $2999

If you always bought the cheapest Mac Pro CPU offering, this is what your performance curve in both single and multithreaded workloads would look like:

The first thing that stands out is that both workloads follow roughly the same curve. The entry-level Mac Pro has always been a quad-core option, so you get no additional multithreaded scaling from extra cores (if you exclude the initial Nehalem bump from enabling Hyper-Threading, which all subsequent Mac Pros have supported).

If you’ve always bought the slowest Mac Pro, you’ll end up with a machine today that’s roughly 2.2x the performance of the very first Mac Pro. It’s a substantial increase in performance, but definitely not the sort of gains we’ve seen in mobile. For anyone who has been following the x86 CPU evolution over the past decade, this shouldn’t come as a surprise. There are huge power tradeoffs associated with aggressively scaling single threaded performance. Instead what you see at the core level is a handful of conservatively selected improvements. Intel requires that any new microarchitectural feature introduced has to increase performance by 2% for every 1% increase in power consumption. The result is an end to the unabated increases in single threaded performance we were once accustomed to. The gains you see in the curve above are more or less as good as it gets.

I should point out that this obviously ignores the ~10% IPC gains offered by Haswell (since we don’t yet have a Haswell-EP). It’s also worth noting that Intel presently delivers the best single threaded performance in the industry. Compared to AMD you’re looking at somewhere around a 40% advantage, and ARM doesn’t yet offer anything that competes at these performance levels. It’s bound to be harder to deliver big gains when you’re already operating at this performance level.
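
As a toy illustration of that 2%-per-1% gate (my simplification of the rule described above, not anything resembling Intel's internal methodology), the check boils down to a ratio:

```python
# Toy illustration of the 2:1 efficiency gate described above: a proposed
# feature is accepted only if it delivers at least 2% more performance for
# every 1% of added power consumption.
def passes_efficiency_gate(perf_gain_pct: float, power_increase_pct: float) -> bool:
    if power_increase_pct <= 0:
        return perf_gain_pct > 0  # free (or better) performance always passes
    return perf_gain_pct / power_increase_pct >= 2.0

print(passes_efficiency_gate(perf_gain_pct=3.0, power_increase_pct=1.0))  # True
print(passes_efficiency_gate(perf_gain_pct=3.0, power_increase_pct=2.0))  # False
```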

Back to the curve at hand, the increase in performance the 2013 Mac Pro offers is arguably one of the best upgrades over the life of the system - assuming you always opted for the entry-level quad-core configuration.

What if you always did the opposite though and picked the highest-end CPU configuration? Same deal as before: I’ve documented the upgrade path in the table below:

Mac Pro - Most Expensive Configuration Upgrade Path
             CPU               Chips  Cores per Chip  Total Cores / Threads  Clock Base/1C Turbo/All-Core Turbo  Launch Price
Mid 2006     Xeon X5365        2      4               8 / 8                  3.0/3.0/3.0 GHz                     $3999
Early 2008   Xeon X5482        2      4               8 / 8                  3.2/3.2/3.2 GHz                     $4399
Early 2009   Xeon X5570        2      4               8 / 16                 2.93/3.33/3.06 GHz                  $5899
Mid 2010     Xeon X5670        2      6               12 / 24                2.93/3.33/3.06 GHz                  $6199
Mid 2012     Xeon X5675        2      6               12 / 24                3.06/3.46/3.2 GHz                   $6199
Late 2013    Xeon E5-2697 v2   1      12              12 / 24                2.7/3.5/3.0 GHz                     $6999

Now things start to get interesting. For starters, single and multithreaded performance scaling diverge. The high-end CPU option started as two quad-core CPUs but after three generations moved to a total of twelve cores. What this means is that after the early 2009 model you see a pretty significant increase in multithreaded performance for the fastest Mac Pro configuration. Scaling since then has been comparatively moderate, as you’re mostly looking at IPC and frequency improvements with no change in core count.
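
A quick back-of-the-envelope check of that 2009-to-2010 jump, using only the core counts and all-core turbo clocks from the table above (a naive cores-times-clock estimate that ignores IPC and memory bandwidth):

```python
# Naive multithreaded throughput estimate: total cores x all-core turbo clock.
# Ignores IPC, memory bandwidth and scaling efficiency; it's only meant to show
# why the 2010 move from 8 to 12 cores dominates the high-end MT curve.
mt_2009 = 8 * 3.06    # 2x Xeon X5570: 8 cores at 3.06 GHz all-core turbo
mt_2010 = 12 * 3.06   # 2x Xeon X5670: 12 cores at 3.06 GHz all-core turbo

print(f"Estimated MT uplift, 2009 -> 2010 high-end config: {mt_2010 / mt_2009:.2f}x")  # 1.50x
```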

The single threaded performance improvement, by comparison, is fairly mild. If you bought the most expensive Mac Pro configuration back in 2006 you had a 3GHz part. In the past 7 years the peak single core turbo frequency offered in a Mac Pro has only improved by 30%, to 3.9GHz (the 12-core E5-2697 v2 in the top configuration tops out at 3.5GHz). Granted, there are other efficiency gains that help push the overall improvement north of 50%, but that’s assuming you haven’t purchased anything since 2006. If you bought into the Mac Pro somewhere in the middle and opted for a high-end configuration, you definitely won’t see an earth-shattering increase in single threaded CPU performance. Note that we’re only looking at one vector of overall performance here; we aren’t taking into account things like storage and GPU performance improvements (yet).
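
For reference, the arithmetic behind those two percentages; the ~1.15x efficiency factor is simply backed out from the 30% and "north of 50%" figures above rather than independently measured:

```python
# Frequency-only gain in peak single-core turbo across the lineup, 2006 -> 2013.
freq_2006_ghz = 3.0
freq_2013_ghz = 3.9
freq_gain = freq_2013_ghz / freq_2006_ghz - 1
print(f"Peak single-core turbo gain: {freq_gain:.0%}")  # 30%

# The overall single threaded improvement quoted is "north of 50%", which implies
# roughly a 1.15x contribution from IPC/efficiency on top of the clock increase.
implied_efficiency_factor = 1.5 / (freq_2013_ghz / freq_2006_ghz)
print(f"Implied IPC/efficiency factor: ~{implied_efficiency_factor:.2f}x")  # ~1.15x
```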

For the third configuration I wanted to pick something in the middle. The issue is that there is no consistent middle config across the entirety of the Mac Pro’s history. In some cases shooting for the middle meant you’d end up with 4 cores, while other times it meant 6, 8 or 12. I settled on shooting for a $4000 configuration each time and never going above it. It turns out that if you always had a $4000 budget for a Mac Pro and tried to optimize for CPU performance, you’d end up with a somewhat bizarre upgrade path. The path I took is listed in the table below, with a sketch of the selection rule after it:

Mac Pro - Mid-Range Configuration Upgrade Path
             CPU               Chips  Cores per Chip  Total Cores / Threads  Clock Base/1C Turbo/All-Core Turbo  Launch Price
Mid 2006     Xeon 5160         2      2               4 / 4                  3.0/3.0/3.0 GHz                     $3299
Early 2008   Xeon E5472        2      4               8 / 8                  3.0/3.0/3.0 GHz                     $3599
Early 2009   Xeon W3580        1      4               4 / 8                  3.33/3.6/3.46 GHz                   $3699
Mid 2010     Xeon W3680        1      6               6 / 12                 3.33/3.6/3.46 GHz                   $3699
Mid 2012     Xeon E5645        2      6               12 / 24                2.4/2.67/2.4 GHz                    $3799
Late 2013    Xeon E5-1650 v2   1      6               6 / 12                 3.5/3.9/3.6 GHz                     $3999
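
Here's a minimal sketch of that selection rule: within each generation, take the most performant CPU option whose configuration price stays at or under $4000. The option list and the crude cores-times-clock score below are illustrative stand-ins, not Apple's full BTO price matrix.

```python
# Pick the best CPU option per generation subject to a $4000 budget cap.
# 'score' is a crude stand-in for expected CPU performance (cores x base clock);
# the option list below is illustrative, not Apple's complete BTO matrix.
BUDGET = 4000

options_2009 = [
    {"cpu": "Xeon W3520 (4C, 2.66GHz)",    "price": 2499, "score": 4 * 2.66},
    {"cpu": "Xeon W3580 (4C, 3.33GHz)",    "price": 3699, "score": 4 * 3.33},
    {"cpu": "2x Xeon X5570 (8C, 2.93GHz)", "price": 5899, "score": 8 * 2.93},
]

def pick(options, budget=BUDGET):
    affordable = [o for o in options if o["price"] <= budget]
    return max(affordable, key=lambda o: o["score"])

print(pick(options_2009)["cpu"])  # Xeon W3580 (4C, 3.33GHz) -- the 2009 pick above
```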

Around $4000, the Mac Pro went from a quad-core system to eight cores, back down to four cores, then up to six, then twelve, and finally settled back at six cores this generation. What this means is a cycling between improvements in single and multithreaded performance over the course of the past 7 years:

Here’s where the comparison gets really interesting. If you spent $3799 on a Mac Pro last year, you’d need to spend more this year to see a multithreaded performance uplift on the CPU side. Single threaded performance, on the other hand, sees a big uptick compared to last year. The 2012 $4K config is the outlier, however; otherwise, if you have a budget fixed at $4000, a 2013 Mac Pro will be quicker in all aspects compared to any previous generation Mac Pro at the same price point.
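
Applying the same naive cores-times-clock estimate from earlier to the two $4K configs makes the tradeoff obvious (again ignoring IPC, which narrows but doesn't erase the multithreaded gap):

```python
# 2012 $4K config: 2x Xeon E5645 (12 cores, 2.4 GHz all-core turbo, 2.67 GHz 1C turbo)
# 2013 $4K config: Xeon E5-1650 v2 (6 cores, 3.6 GHz all-core turbo, 3.9 GHz 1C turbo)
# Naive estimates only; IPC gains from Westmere -> Ivy Bridge are not included.
mt_2012, st_2012 = 12 * 2.4, 2.67
mt_2013, st_2013 = 6 * 3.6, 3.9

print(f"Naive MT ratio (2013 vs 2012): {mt_2013 / mt_2012:.2f}x")  # ~0.75x -- fewer cores hurt
print(f"Naive ST ratio (2013 vs 2012): {st_2013 / st_2012:.2f}x")  # ~1.46x -- big clock win
```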

The bigger takeaway from this is the following: the very same limited gains in CPU performance will eventually come to ultra mobile devices as well. It’s only a matter of time before those CPU curves flatten out. What that does to the smartphone/tablet market is a discussion for another day.

Comments

  • uhuznaa - Wednesday, January 1, 2014 - link

    For whatever it's worth: I'm supporting a video pro and what I can see in that crowd is that NOBODY cares for internal storage. Really. Internal storage is used for the software and of course the OS and scratch files and nothing else. They all use piles of external drives which are much closer to actual "media" you can carry around and work with in projects with others and archive.

    In fact I tried for a while to convince him of the advantages of big internal HDDs and he wouldn't have any of it. He found the flood of cheap USB drives you can even pick up at the gas station in the middle of the night the best thing to happen and USB3 a gift from heaven. They're all wired this way. Compact external disks that you can slap paper labels on with the name of the project on it and the version of that particular edit and that you can carry around are the best thing since sliced bread for them. And after a short while I had to agree that they're perfectly right with that for what they do.

    Apple is doing this quite right. Lots of bays are good for servers, but this is not a server. It's a workstation and work here means mostly work with lots of data that wants to be kept in nice little packages you can plug in and safely out and take with you or archive in well-labeled shelves somewhere until you find a use for it later on.

    (And on a mostly unrelated note: Premiere Pro may be the "industry standard" but god does this piece of software suck gas giants through nanotubes. It's a nightmarish UI thinly covering a bunch of code held together by chewing gum and duct tape. Apple may have the chance of a snowflake in hell against that with FCP but they absolutely deserve kudos for trying. I don't know if I love Final Cut, but I know I totally hate Premiere.)
  • lwatcdr - Wednesday, January 1, 2014 - link

    "My one hope is that Apple won’t treat the new Mac Pro the same way it did its predecessor. The previous family of systems was updated on a very irregular (for Apple) cadence. "

    This is the real problem. Haswell-EP will ship this year and it uses a new socket. The proprietary GPU physical interface means those will probably not get updates quickly and they will be expensive. Today the Pro is a very good system but next year it will be falling behind.
  • boli - Wednesday, January 1, 2014 - link

    Hi Anand, cheers for the enjoyable and informative review.

    Regarding your HiDPI issue, I'm wondering if this might be an MST issue? Did you try in SST mode too?

    Just wondering because I was able to add 1920x1080 HiDPI to my 2560x1440 display no problem, by adding a 3840x2160 custom resolution to Switch Res X, which automatically added 1920x1080 HiDPI to the available resolutions (in Switch Res X).
  • mauler1973 - Wednesday, January 1, 2014 - link

    Great review! Now I am wondering if I can replicate this kind of performance in a hackintosh.
  • Technology Never Sleeps - Wednesday, January 1, 2014 - link

    Good article but I would suggest that your editor or proofreader review your article before it's posted. It takes away from the professional nature of the article and website with so many grammatical errors.
  • Barklikeadog - Wednesday, January 1, 2014 - link

    Once again, a standard 2009 model wouldn't fair nearly as well here. Even with a Radeon HD 4870 I bet we'd be seeing significantly lower performance.

    Great review Anand, but I think you meant fare in that sentence.
  • name99 - Wednesday, January 1, 2014 - link

    " Instead what you see at the core level is a handful of conservatively selected improvements. Intel requires that any new microarchitectural feature introduced has to increase performance by 2% for every 1% increase in power consumption."

    What you say is true, but not the whole story. It implies that these sorts of small improvements are the only possibility for the future and that's not quite correct.
    In particular branch prediction has become good enough that radically different architectures (like CFP --- Continuous Flow Processing) become possible. The standard current OoO architecture used by everyone (including IBM for both POWER and z, and the ARM world) grew from a model based on no speculation to some, but imperfect, speculation. So what it does is collect speculated results (via the ROB and RAT) and dribble those out in small doses as it becomes clear that the speculation was valid. This model never goes drastically off the rails, but is very much limited in how many OoO instructions it can process, both at the completion end (size of the ROB, now approaching 200 fused µ-instructions in Haswell) and at the scheduler end (trying to find instructions that can be processed because their inputs are valid, now approaching I think about 60 instructions in Haswell).
    These figures give us a system that can handle most latencies (FP instructions, divisions, reasonably long chains of dependent instructions, L1 latency, L2 latency, maybe even on a good day L3 latency) but NOT memory latency.

    And so we have reached a point where the primary thing slowing us down is data memory latency. This has been a problem for 20+ years, but now it's really the only problem. If you use best of class engineering for your other bits, really the only thing that slows you down is waiting on (data) memory. (Even waiting on instructions should not ever be a problem. It probably still is, but work done in 2012 showed that the main reason instruction prefetching failed was that the prefetcher was polluted by mispredicted branches and interrupts. It's fairly easy to filter both of these once you appreciate the issue, at which point your I prefetcher is basically about 99.5% accurate across a wide variety of code. This seems like such an obvious and easy win that I expect it to move into all the main CPUs within 5 yrs or so.)

    OK, so waiting on memory is a problem. How do we fix it?
    The most conservative answer (i.e. requires the fewest major changes) is data prefetchers, and we've had these growing in sophistication over time. They can now detect array accesses with strides across multiple cache lines, including backwards strides, and we have many (at least 16 on Intel) running at the same time. Each year they become smarter about starting earlier, ending earlier, not polluting the cache with unneeded data. But they only speed up regular array accesses.

    Next we have a variety of experimental prefetchers that look for correlations in the OFFSETs of memory accesses; the idea being that you have things like structs or B-tree nodes that are scattered all over memory (linked by linked lists or trees or god knows what), but there is a common pattern of access once you know the base address of the struct. Some of these seem to work OK, with realistic area and power requirements. If a vendor wanted to continue down the conservative path, this is where they would go.

    Next we have a different idea, runahead execution. Here the idea is that when the “real” execution hits a miss to main memory, we switch to a new execution mode where no results will be stored permanently (in memory or in registers); we just run ahead in a kind of fake world, ignoring instructions that depend on the load that has missed. The idea is that, during this period we’ll trigger new loads to main memory (and I-cache misses). When the original miss to memory returns its result, we flush everything and restart at the original load, but now, hopefully, the runahead code started some useful memory accesses so that data is available to us earlier.
    There are many ways to slice this. You can implement it fairly easily using SMT infrastructure if you don’t have a second thread running on the core. You can do crazy things that try to actually preserve some of the results you generate during the runahead phase. Doing this naively you burn a lot of power, but there are some fairly trivial things you can do to substantially reduce the power.
    In the academic world, the claim is that for a Nehalem type of CPU this gives you about a 20% boost at the cost of about 5% increased power.
    In the real world it was implemented (but in a lousy cheap-ass fashion) on the POWER6 where it was underwhelming (it gave you maybe a 2% boost over the existing prefetchers); but their implementation sucked because it only ran 64 instructions during the runahead periods. The simulations show that you generate about one useful miss to main memory per 300 instructions executed, so maybe two or three during a 400 to 500 cycle load miss to main memory, but 64 is just too short.
    It was also supposed to be implemented in the SUN Rock processor which was cancelled when Oracle bought Sun. Rock tried to be way more ambitious in their version of this scheme AND suffered from a crazy instruction fetch system that had a single fetch unit trying to feed eight threads via round robin (so each thread gets new instructions every eight cycles).
    Both these failures don’t, I think, tell us if this would work well if implemented on, say, an ARM core rather than adding SMT.

    Which gets us to SMT. Seems like a good idea, but in practice it’s been very disappointing, apparently because now you have multiple threads fighting over the same cache. Intel, after trying really hard, can’t get it to give more than about a 25% boost. IBM added 4 SMT threads to POWER7, but while they put a brave face on it, the best the 4 threads give you is about 2x single threaded performance. Which, hey, is better than 1x single threaded performance, but it’s not much better than what they get from their 2 threaded performance (which can do a lot better than Intel given truly massive L3 caches to share between threads).

    But everything so far is just add-ons. CFP looks at the problem completely differently.
    The problem we have is that the ROB is small, so on a load miss it soon fills up completely. You’d want the ROB to be about 2000 entries in size and that’s completely impractical. So why do we need the ROB? To ensure that we write out updated state properly (in small dribs and drabs every cycle) as we learn that our branch prediction was successful.
    But branch prediction these days is crazy accurate, so how about a different idea. Rather than small scale updating successful state every cycle, we do a large scale checkpoint every so often, generally just before a branch that’s difficult to predict. In between these difficult branches, we run out of order with no concern for how we writeback state — and in the rare occasions that we do screw up, we just roll back to the checkpoint. In between difficult branches, we just run on ahead even across misses to memory — kinda like runahead execution, but now really doing the work, and just skipping over instructions that depend on the load, which will get their chance to run (eventually) when the load completes.
    Of course it’s not quite that simple. We need to have a plan for being able to unwind stores. We need a plan for precise interrupts (most obviously for VM). But the basic idea is we trade today’s horrible complexity (ROB and scheduler window) for a new ball of horrible complexity that is not any simpler BUT which handles the biggest current problem, that the system grinds to a halt at misses to memory, far better than the current scheme.

    The problem, of course, is that this is a hell of a risk. It’s not just the sort of minor modification to your existing core where you know the worst that can go wrong; this is a leap into the wild blue yonder on the assumption that your simulations are accurate and that you haven’t forgotten some show-stopping issue.
    I can’t see Intel or IBM being the first to try this. It’s the sort of thing that Apple MIGHT be ambitious enough to try right now, in their current state of so much money and not having been burned by a similar project earlier in their history. What I’d like to see is a university (like a Berkeley/Stanford collaboration) try to implement it and see what the real world issues are. If they can get it to work, I don’t think there’s a realistic chance of a new SPARC or MIPS coming out of it, but they will generate a lot of valuable patents, and their students who worked on the project will be snapped up pretty eagerly by Intel et al.
  • stingerman - Wednesday, January 1, 2014 - link

    I think Intel has another two years left on the Mac. Apple will start phasing it out on the MacBook Air, Mac Mini and iMac, then the MacBook rPros, and finally the Mac Pro. Discrete x86 architecture is dead-ending. Apple's going to move their Macs to SoCs that they design. They will contain most of the necessary components and significantly reduce the costs of the desktops and notebooks. The Mac Pro will get it last, giving time for the Pro Apps to be ported to Apple's new mobile and desktop 64-bit processors.
  • tahoey - Wednesday, January 1, 2014 - link

    Remarkable work as always. Thank you.
  • DukeN - Thursday, January 2, 2014 - link

    Biased much, Anand?

    Here's the Lenovo S30 I bought a couple of weeks back, and no, it wasn't $4000+ like you seem to suggest.

    http://www.cdw.com/shop/products/Lenovo-ThinkStati...

    You picked probably the most overpriced SKU in the bunch just so you can prop up the ripoff that is your typical Apple product.

    Shame.
