by Anand Lal Shimpi, Brian Klug & Vivek Gowri on 3/12/2011 6:01:00 AM

I remember the speculation that led up to Apple's iPad launch. The list of things everyone expected the device to do was absurd, and the theories on the architecture behind Apple's first branded SoC were just as fantastic. The simplest answer is sometimes the right one and, as Ars Technica's Jon Stokes pointed out, the A4 was nothing more than a hardened ARM Cortex A8 core running at 1GHz in the iPad (and 800MHz in the iPhone 4).

The Cortex A8 is something we've covered extensively here so I won't go into great detail right now. It's a dual-issue, in-order architecture with a 13 stage integer pipeline and a non-pipelined FPU.

When Apple announced the iPad 2, it also briefly announced the A5 SoC. The only detail given? The A5 is a dual-core processor with a GPU that's 9x faster than what's in the A4.

There are only two recent ARM architectures that have multicore support: the ARM11 and the ARM Cortex A9. The A8 doesn't come in a multicore variant. Given how many other SoC vendors are shipping dual-core Cortex A9 SoCs, the A5 was likely no different than NVIDIA's Tegra 2, TI's OMAP 4 or Samsung's Exynos in that regard: armed with a pair of Cortex A9s running at 1GHz. Update: Geekbench reports clock speed at 900MHz. Update 2: Apple confirms 1GHz clock speed on the iPad 2 specs page.

Architecture Comparison

| Feature | ARM11 | ARM Cortex A8 | ARM Cortex A9 | Qualcomm Scorpion |
| --- | --- | --- | --- | --- |
| Issue Width | single-issue | dual-issue | dual-issue | dual-issue |
| Pipeline Depth | 8 stages | 13 stages | 9 stages | 13 stages |
| Out of Order Execution | N | N | Y | Partial |
| FPU | Optional VFPv2 (not-pipelined) | VFPv3 (not-pipelined) | Optional VFPv3-D16 (pipelined) | VFPv3 (pipelined) |
| NEON | N/A | Y (64-bit wide) | Optional MPE (64-bit wide) | Y (128-bit wide) |
| Process Technology | 90nm | 65nm/45nm | 40nm | 40nm |
| Typical Clock Speeds | 412MHz | 600MHz/1GHz | 1GHz | 1GHz |

The Cortex A9 is similar to the A8 but with an out-of-order execution engine and a shallower pipeline (9 stages). The result is better-than-A8 performance at the same clock speed. The A9 also adds a fully pipelined FPU.

Now it's unclear what the rest of the A5 SoC looks like, but from the CPU standpoint I think it's safe to say that there is a pair of ARM Cortex A9s in there. We can look at the increase in Geekbench Floating Point scores for some proof:

Geekbench 2 - Floating Point Performance

| Test | Apple iPad | Apple iPad 2 |
| --- | --- | --- |
| Overall FP Score | 456 | 915 |
| Mandelbrot (single-threaded) | 79.5 Mflops | 279.1 Mflops |
| Mandelbrot (multi-threaded) | 79.4 Mflops | 554.7 Mflops |
| Dot Product (single-threaded) | 245.7 Mflops | 221.7 Mflops |
| Dot Product (multi-threaded) | 247.2 Mflops | 436.8 Mflops |
| LU Decomposition (single-threaded) | 54.5 Mflops | 205.4 Mflops |
| LU Decomposition (multi-threaded) | 54.8 Mflops | 421.6 Mflops |
| Primality Test (single-threaded) | 71.2 Mflops | 177.8 Mflops |
| Primality Test (multi-threaded) | 69.3 Mflops | 318.1 Mflops |
| Sharpen Image (single-threaded) | 1.51 Mpixels/s | 1.68 Mpixels/s |
| Sharpen Image (multi-threaded) | 1.51 Mpixels/s | 3.34 Mpixels/s |
| Blur Image (single-threaded) | 760.2 Kpixels/s | 665.5 Kpixels/s |
| Blur Image (multi-threaded) | 753.2 Kpixels/s | 1.32 Mpixels/s |
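To make the floating point table easier to read, here is a minimal sketch that turns each row into two ratios: the single-threaded speedup of the iPad 2 over the iPad, and how much the iPad 2 gains when going from one thread to two. The numbers are copied from the table; the dictionary layout and function names are my own, and the Blur Image multi-threaded result is converted assuming 1 Mpixel = 1000 Kpixels.

```python
# Geekbench 2 floating point results, as (iPad, iPad 2) pairs.
# Units match within each pair, so the ratios are unitless.
# Assumption: 1.32 Mpixels/s is entered as 1320.0 Kpixels/s (1 M = 1000 K).
fp_results = {
    "Mandelbrot":       {"single": (79.5, 279.1),  "multi": (79.4, 554.7)},
    "Dot Product":      {"single": (245.7, 221.7), "multi": (247.2, 436.8)},
    "LU Decomposition": {"single": (54.5, 205.4),  "multi": (54.8, 421.6)},
    "Primality Test":   {"single": (71.2, 177.8),  "multi": (69.3, 318.1)},
    "Sharpen Image":    {"single": (1.51, 1.68),   "multi": (1.51, 3.34)},
    "Blur Image":       {"single": (760.2, 665.5), "multi": (753.2, 1320.0)},
}


def single_thread_speedup(row):
    """iPad 2 single-threaded score divided by the iPad's."""
    ipad, ipad2 = row["single"]
    return ipad2 / ipad


def ipad2_thread_scaling(row):
    """iPad 2 multi-threaded score divided by its single-threaded score."""
    return row["multi"][1] / row["single"][1]


for name, row in fp_results.items():
    print(f"{name:18s} single-thread speedup {single_thread_speedup(row):.2f}x, "
          f"two-thread scaling {ipad2_thread_scaling(row):.2f}x")
```

A scaling figure close to 2x is what you would expect from a second core on a test that threads well, while the single-threaded ratio isolates the per-core change between the A4 and the A5.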

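In most tests, single-threaded FPU performance is a multiple of what we saw with the original iPad: Mandelbrot and LU Decomposition are more than 3.5x faster, although Dot Product and Blur Image are slightly slower. This sort of improvement in single-core performance is likely due to the pipelined Cortex A9 FPU. Looking at Linpack we see the same sort of huge improvement: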

Linpack

Whether this performance advantage matters is another question entirely. Although there aren't many FP intensive iPad apps available today, moving to the A5 is all about enabling developers, not playing catch up to software.

Memory size, bandwidth and operating frequencies are all unknowns that I was hoping to find out more about once I put hands on the iPad 2. Geekbench reports the iPad 2 at 512MB of memory, double the original iPad's 256MB. Remember that Apple has to deal with lower profit margins than it'd like with the iPad, but it refuses to cut corners on screen quality so something else has to give.

L2 cache size has also apparently increased from 512KB to 1MB. The L2 cache is shared among both cores and 1MB seems to be the sweet spot this generation.

Geekbench 2 - Memory Performance

| Test | Apple iPad | Apple iPad 2 |
| --- | --- | --- |
| Overall Memory Score | 644 | 787 |
| Read Sequential (single-threaded scalar) | 340.6 MB/s | 334.2 MB/s |
| Write Sequential (single-threaded scalar) | 842.4 MB/s | 1.07 GB/s |
| Stdlib Allocate (single-threaded scalar) | 1.74 Mallocs/s | 1.86 Mallocs/s |
| Stdlib Write (single-threaded scalar) | 1.20 GB/s | 2.30 GB/s |
| Stdlib Copy (single-threaded scalar) | 740.6 MB/s | 522.0 MB/s |

Geekbench's memory tests show an improvement in effective bandwidth as well. The biggest improvement is in the stdlib write test which shows a near doubling of bandwidth from 1.2GB/s to 2.3GB/s. Unfortunately this isn't enough data to draw conclusions about bus width or DRAM operating frequency. Given the increases in CPU and GPU performance, an increase in memory bandwidth to go along with the two isn't surprising.

Geekbench shows a healthy increase in integer performance in both single- and multi-threaded scenarios. The multi-threaded advantage makes sense (two cores are better than one), but the lead in single-threaded tests shows the benefit the A9 can deliver thanks to its shorter pipeline and ability to reorder instructions around stalls.

Geekbench 2 - Integer Performance

| Test | Apple iPad | Apple iPad 2 |
| --- | --- | --- |
| Overall Integer Score | 365 | 688 |
| Blowfish (single-threaded) | 13.9 MB/s | 13.2 MB/s |
| Blowfish (multi-threaded) | 14.3 MB/s | 26.1 MB/s |
| Text Compression (single-threaded) | 1.23 MB/s | 1.50 MB/s |
| Text Compression (multi-threaded) | 1.20 MB/s | 2.82 MB/s |
| Text Decompression (single-threaded) | 1.11 MB/s | 2.09 MB/s |
| Text Decompression (multi-threaded) | 1.08 MB/s | 3.28 MB/s |
| Image Compress (single-threaded) | 3.36 Mpixels/s | 3.79 Mpixels/s |
| Image Compress (multi-threaded) | 3.41 Mpixels/s | 7.51 Mpixels/s |
| Image Decompress (single-threaded) | 6.02 Mpixels/s | 6.68 Mpixels/s |
| Image Decompress (multi-threaded) | 5.98 Mpixels/s | 13.1 Mpixels/s |
| Lua (single-threaded) | 172.1 Knodes/s | 273.4 Knodes/s |
| Lua (multi-threaded) | 171.9 Knodes/s | 542.9 Knodes/s |

On average Geekbench shows a 31% increase in single threaded integer performance over the A4 in the original iPad. NVIDIA told me they saw a 20% increase in instructions executed per clock for the A9 vs. A8 and if we remove the one outlier (text decompression) that's about what we see here as well.
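For anyone who wants to check the averages quoted above, here is a small sketch that recomputes them from the single-threaded rows of the integer table. It assumes a plain arithmetic mean of the per-test iPad 2 / iPad ratios, which happens to reproduce both figures; the function and variable names are illustrative only.

```python
from statistics import mean

# Single-threaded Geekbench 2 integer results, as (iPad, iPad 2) pairs,
# copied from the integer performance table.
single_threaded = {
    "Blowfish": (13.9, 13.2),
    "Text Compression": (1.23, 1.50),
    "Text Decompression": (1.11, 2.09),
    "Image Compress": (3.36, 3.79),
    "Image Decompress": (6.02, 6.68),
    "Lua": (172.1, 273.4),
}


def average_gain(results, exclude=()):
    """Arithmetic mean of the iPad 2 / iPad ratios, minus 1, skipping excluded tests."""
    ratios = [
        ipad2 / ipad
        for name, (ipad, ipad2) in results.items()
        if name not in exclude
    ]
    return mean(ratios) - 1.0


overall = average_gain(single_threaded)
without_outlier = average_gain(single_threaded, exclude=("Text Decompression",))

print(f"All tests: {overall:.1%}")
print(f"Without text decompression: {without_outlier:.1%}")
```

Since both the A4 in the iPad and the A5 in the iPad 2 are specified at 1GHz, a per-test ratio at the same clock is a rough stand-in for the per-clock improvement NVIDIA was describing, although the two chips differ in more than just the CPU core.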

Geekbench 2

| Device | Overall | Integer | FP | Memory | Stream |
| --- | --- | --- | --- | --- | --- |
| Apple iPad | 448 | 365 | 456 | 644 | 325 |
| Apple iPad 2 | 750 | 688 | 915 | 787 | 324 |

The increases in integer performance and memory bandwidth are likely what will have the largest impact on your experience. The fact that we're seeing big gains in single as well as multi-threaded workloads means the performance improvement should be universal across all CPU-bound apps.

What does all of this mean for performance in the real world? The iPad 2 is much faster than its predecessor. Let's start with our trusty JavaScript benchmarks: SunSpider and BrowserMark.

SunSpider Javascript Benchmark 0.9

Apple improved the Safari JavaScript engine in iOS 4.3, which right off the bat helped the original iPad become more competitive in this test. Even with both pads running iOS 4.3, the iPad 2 is 80% faster than the original iPad here.

The Motorola Xoom we recently reviewed scored a few percent slower than the iPad 2 in SunSpider as well. With the two tablets running different OSes and browsers, it's difficult to conclude much when comparing the A5 to Tegra 2.

A bug in BrowserMark kept us from running it for the Xoom review but it's since been fixed. Again we're looking at mostly JavaScript performance here. Rightware modeled its benchmark after the JavaScript frameworks and functions used by websites like Facebook, Amazon and Gmail among others. The results are simply one aspect of web browsing performance, but an important one:

Rightware BrowserMark

The move from the A4 in the iPad 1 to the A5 in the iPad 2 boosts scores by 47%. More impressive however is just how much faster the Xoom is here. I suspect this has more to do with Google's software optimizations in the Honeycomb browser than hardware, but let's see how these tablets fare in our web page loading tests.

We debuted an early version of our 2011 web page loading tests in the Xoom review. Two things have changed since then: 1) iOS 4.3 came out, and 2) we changed our timing methods to produce more accurate results. It turns out that Honeycomb's browser was stopping our page load timer sooner than iOS', which resulted in some funny numbers when we got to the 4.3/Honeycomb comparison. To ensure accuracy we went back to timing by hand (each test was repeated at least 5 times and we present an average of the results). We also added two more pages to the test suite (Digg and Facebook).

2011 Page Load Test - Average

The iPad 2 generally loads web pages faster than the Xoom. On average it's a ~20% increase in performance. I wouldn't say that the improvement is necessarily noticeable when surfing most sites, but it's definitely measurable.

The move to iOS 4.3 really narrowed the gap between the original iPad and the Xoom. In some cases the two actually render pages in the same amount of time, however that's typically for lighter pages that are easy to render. Up the complexity and the Xoom easily distances itself from the original iPad.

2011 Page Load Test - AnandTech.com

2011 Page Load Test - Amazon.com

2011 Page Load Test - CNN.com

2011 Page Load Test - Digg.com

2011 Page Load Test - Engadget.com

2011 Page Load Test - Facebook.com

2011 Page Load Test - NYTimes.com

2011 Page Load Test - Reddit.com

We'll touch on this more in the full review but it's not all about performance when talking about web browsing between the iPad 2 and the Xoom. Although the iPad 2 may have faster render times on average, the Xoom still supports tabbed browsing which definitely has its advantages.

Regarding Dual Core ARM Cortex A9 by Destiny on Saturday, March 12, 2011
If Apple's iPad 2, NVIDIA's Tegra 2, TI's OMAP 4 and Samsung's Exynos all use the same dual-core ARM Cortex A9... why are there performance differences shown in your testing and benchmarks of these products?
Destiny
The iPad uses iOS; the others use variations of Android with who knows what loaded in the background.

But the simple reason is that different OSes provide different performance characteristics, as they handle processes and memory loads differently.
StevoLincolnite
RE: Regarding Dual Core ARM Cortex A9 by Destiny on Saturday, March 12, 2011
Thank-you for the reply... now my knowledge and processor IQ just went up a notch... : )
Destiny
RE: Regarding Dual Core ARM Cortex A9 by solgae1784 on Saturday, March 12, 2011
Yep. All those hardware specs mean nothing if your software can't utilize them. That much was clear even way back in the day.
solgae1784
RE: Regarding Dual Core ARM Cortex A9 by vol7ron on Saturday, March 12, 2011
It's not just due to the OS, it is also due to the other hardware coupled with the A9. For instance, more RAM means application data can be loaded more quickly from memory rather than from storage. The GPU and screen size/resolution also affect benchmarks; how much depends on the type of test.

Also the different hardware vendors may have modified some of the firmware instruction sets to make it more efficient.

But that's a big reason why these benchmarks are used, to have some sort of common ground that more accurately compares the different hardware/software combinations.
vol7ron
RE: Regarding Dual Core ARM Cortex A9 by geekfool on Thursday, March 17, 2011
Actually, it's more a case that most current Linux ARM app writers have apparently not learned the basic lessons from the old Linux PPC days, where Apple devs clearly and logically used to optimise all their apps' C routines and wrote macros with SIMD code, etc.

as can be clearly seen here for instance SIMD makes a massive speed difference ,as can be seen in just one sample routine from the x86
http://pastebin.com/T0jt8VUB
"x264: All tests passed Yeah :)
nop: 196
optimize_chroma_dc_c: 415
optimize_chroma_dc_sse2: 203
optimize_chroma_dc_ssse3: 200
optimize_chroma_dc_sse4: 190
optimize_chroma_dc_avx: 177
"
It's not uncommon to see up to *18 times faster* speeds in SIMD than in C routines, for instance.

Until today's ARM NEON devs stop relying on GCC's crappy auto-vectorisation to get a boost, start looking at the actual output of the compiler to see it generate brain-dead SIMD code in many cases, and then write better SIMD code by hand for their most-used C routines, most Apple devs will continue to beat Linux ARM devs on speed and data throughput, it seems.

ARM NEON SIMD was a 2009 GSOC project for x264

checkasm --bench
from the x264 git should give you interesting results on NEON-capable ARM SoCs too, even if it's missing lots of SIMD code right now. It's good for seeing the generic C routines' speeds and comparing them across different ARM SoCs, for instance; try it and report back your speeds.

This real-life benchmark test came from the old x264-dev logs, if anyone's interested in the real-life numbers.

and StippenG's number's came from an older Quad A9/NEON developer board at his Uni apparently, so a current Marvell ARM v7 A9 quad /SIMD at 1.6 GHz for instance plus any/all the higher clocked 1GHz dual core ARM cortex A9 would produce a better result today OC.

640x360 at Ultrafast: 38.59 seems like a very good real life start for encoding on ARM cortex even without those extra SIMD patches being written yet

"2010-08-24 15:39:19 StippenG Some X264 Benchmarks (Rush Hour 640x360,preset=medium, crf=24): 4-core Cortex-A9 @ 400 MHz gives 5.55 fps,Beagleboard (A8 @ 720MHz) gives 1,65. Really nice speedup, considering the much higher frequency of the A8

2010-08-24 15:39:56 Dark_Shikari It'd go a lot faster if you used a faster preset.
2010-08-24 15:40:01 Dark_Shikari Or if you wrote some of the asm we're missing
2010-08-24 15:40:26 Dark_Shikari But yeah, that scales surprisingly well. about 3.5x faster

2010-08-24 15:40:27 StippenG Yes. Superfast gives 22.07. Ultrafast: 38.59
2010-08-24 15:41:23 StippenG Guess the out-of-order execution and shorter pipeline is really quite a bit better for performance

2010-08-24 15:41:35 < Dark_Shikari> Well, the A9 is known to be a lot faster
"

finally , making the effort to actually port "yasm" to ARM NEON would be a very good thing if you care to try and improve speed there
geekfool
RE: Regarding Dual Core ARM Cortex A9 by MonkeyPaw on Saturday, March 12, 2011
We also don't know the clocks of the A5. Maybe it's not safe to assume it's running at 1.0ghz?
MonkeyPaw
Er you're very right about that. Geekbench reports 900MHz :)

Take care,
Anand
Anand Lal Shimpi
RE: Regarding Dual Core ARM Cortex A9 by tipoo on Saturday, March 12, 2011
http://www.apple.com/ca/ipad/specs/

It's 1GHz. Geekbench reports the instantaneous speed, so you'll see different numbers from it depending on what the chip ramps its speed down to in order to save power.
tipoo
RE: Regarding Dual Core ARM Cortex A9 by dagamer34 on Saturday, March 12, 2011
It's all about the OS at that point, just like how iOS 4.3 gives 2.5x increase in Javascript performance compared to iOS 4.0 even using the same original iPad.
dagamer34