Original Link: http://www.anandtech.com/show/5559/qualcomm-snapdragon-s4-krait-performance-preview-msm8960-adreno-225-benchmarks



If you've been following our SoC and smartphone related coverage over the past couple of years, you'll probably remember how Qualcomm let us take home an MDP8660 from MWC 2011 and thoroughly benchmark it. Qualcomm has done essentially the same thing this year, this time sending their latest and greatest MSM8960 SoC inside the aptly named MSM8960 Mobile Development Platform (MDP) just before MWC 2012. The timing is impeccable as we're fully expecting to start seeing MSM8960 based phones next week at MWC, and we've been telling you to hold off on any smartphone purchases until the 8960's arrival. Today we're finally able to give you an indication of just how fast Qualcomm's next-generation Snapdragon S4 will be.

We've already seen a few teasers of MSM8960 and Krait, and have talked about the architecture and what to expect from the SoC itself. The super short recap is this: Krait is the name of Qualcomm's new out-of-order ARMv7-A compatible CPU architecture (previous generations of Snapdragon used Scorpion), which is designed for TSMC's 28nm process. Inside MSM8960 are two Krait cores running at up to 1.5 GHz, Adreno 225 graphics, an improved ISP, and Qualcomm's new baseband with support for nearly every air interface out there.

The MDP we were sampled is clearly a descendant of the MDP MSM8660 we were given last year, sharing the same black utilitarian look and purpose-built design, though it's notable that the new device is markedly thinner. MDP MSM8960 is running Android 4.0.3 at 1.5 GHz, and includes 1 GB of LPDDR2 and a 4" 1024x600 display. Interestingly, both Intel and Qualcomm have somehow settled on 1024x600 for their reference designs.

We also asked Qualcomm for a copy of the old MDP8660 for comparison purposes, just to see how far we've come since the first dual core Snapdragon SoC.

Qualcomm Mobile Development Platforms (MDPs)

| | MDP MSM8660 | MDP MSM8960 |
|---|---|---|
| SoC | 1.5 GHz 45nm MSM8660 | 1.5 GHz 28nm MSM8960 |
| CPU | Dual Core Scorpion | Dual Core Krait |
| GPU | Adreno 220 | Adreno 225 |
| RAM | 1 GB LPDDR2 | 1 GB LPDDR2 |
| NAND | 8 GB integrated, microSD slot | 16 GB integrated, microSD slot |
| Cameras | 13 MP Rear Facing with Autofocus and LED Flash, Front Facing (? MP) | 13 MP Rear Facing with Autofocus and LED Flash, Front Facing (? MP) |
| Display | 3.8" WVGA LCD-TFT | 4.03" SWVGA (1024x600) LCD-TFT |
| Battery | 3.3 Whr removable | 5.6 Whr removable |
| OS | Android 2.3.2 (Gingerbread) | Android 4.0.3 (ICS) |

We're taking a look at just CPU, GPU, and power performance today on MSM8960, as cellular baseband is disabled on the MDP just like it was when we looked at the previous MSM8660 MDP. We'll get a chance to investigate the baseband in the future; for now, the key areas are CPU, GPU, and power.

As we talked about in the previous MDP piece, the purpose of the MDP is to serve as a reference design, both for Qualcomm to get its Android port running on and for individual developers to profile and test their applications against. It's analogous to TI's OMAP Blaze platforms - you'll likely never see one out in the wild, but it's a reference target that the silicon vendor leverages to port Android, and a piece of hardware that OEMs can use as a reference when they start customizing and building handsets.

Just like last time, the MDP also comes with a software build that lets us easily enable or disable vsync on the device, restart surfaceflinger, and then run benchmark tests. There's no such analog for shipping retail devices; only development builds contain this functionality. As there are parts of each benchmark that could instantaneously peak over vsync's 60 FPS on some of the shipping platforms we're comparing to, we're providing results with vsync on and off.
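To see why both sets of numbers matter, here is a minimal Python sketch of how a 60 FPS cap can pull down a benchmark's average. It assumes a simplified model in which each benchmark segment is merely clamped at the refresh rate (real vsync also quantizes frame times), and the per-segment frame rates are hypothetical, not measured data.

```python
VSYNC_HZ = 60.0  # refresh rate, taken from the 60 FPS vsync limit mentioned above


def average_fps(segment_fps, vsync=False):
    """Average frame rate over equal-length benchmark segments.

    With vsync on, no segment can be reported above the refresh rate.
    Simplified model: real vsync also quantizes individual frame times.
    """
    if not segment_fps:
        raise ValueError("need at least one segment")
    capped = [min(f, VSYNC_HZ) if vsync else f for f in segment_fps]
    return sum(capped) / len(capped)


# Hypothetical per-segment frame rates, chosen only for illustration.
example = [45.0, 75.0, 90.0, 50.0]
print("vsync off:", average_fps(example, vsync=False))
print("vsync on: ", average_fps(example, vsync=True))
```

Any segment that would have run faster than 60 FPS is clamped, so the vsync-on average understates what the GPU can do in those segments.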

In addition, when we ran tests on the MDP MSM8660 last time, I noted that the governor was set to "performance" mode, which means it doesn't adaptively change CPU frequency as a function of load - it was 1.5 GHz all the time. This time the MDP MSM8960 came running with the much more typical "ondemand" governor selected, which does scale CPU frequency as a function of load. There's therefore less concern that results are non-comparable to shipping devices because the CPU never pauses to change performance states.
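For readers who want to check the same setting on their own hardware, the sketch below reads the active governor for each core from the standard Linux cpufreq sysfs files. It assumes a Linux or Android device that exposes `/sys/devices/system/cpu/cpuN/cpufreq/scaling_governor` and a shell with permission to read it; the helper names are ours, not part of any Qualcomm tool.

```python
from pathlib import Path

# Standard Linux cpufreq sysfs location; assumed to be present and readable.
CPUFREQ_ROOT = Path("/sys/devices/system/cpu")


def read_governors(root=CPUFREQ_ROOT):
    """Return {cpu_name: governor} for every core that exposes cpufreq."""
    governors = {}
    pattern = "cpu[0-9]*/cpufreq/scaling_governor"
    for gov_file in sorted(Path(root).glob(pattern)):
        cpu_name = gov_file.parts[-3]  # e.g. "cpu0"
        try:
            governors[cpu_name] = gov_file.read_text().strip()
        except OSError:
            # A core that has been taken offline may not be readable.
            governors[cpu_name] = None
    return governors


def is_fixed_frequency(governors):
    """True when every readable core uses the 'performance' governor."""
    active = [g for g in governors.values() if g is not None]
    return bool(active) and all(g == "performance" for g in active)


if __name__ == "__main__":
    found = read_governors()
    print(found, "fixed frequency:", is_fixed_frequency(found))
```

A result of "performance" on every core corresponds to the fixed 1.5 GHz behavior described for last year's MDP, while "ondemand" corresponds to this year's configuration.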



We won't go too deep into Krait's CPU architecture, because we've already done so in an earlier piece. What we can provide however is a quick recap. Architecturally Krait isn't a design of tradeoffs, rather it's a significant step forward along almost all vectors. Each core can fetch, decode and execute more instructions in parallel than its predecessor (Scorpion, Snapdragon S1/S2/S3).

Qualcomm Architecture Comparison

| | Scorpion | Krait |
|---|---|---|
| Pipeline Depth | 10 stages | 11 stages |
| Decode | 2-wide | 3-wide |
| Issue Width | 3-wide? | 4-wide |
| Execution Ports | 3 | 7 |
| L2 Cache (dual-core) | 512KB | 1MB |
| Core Configurations | 1, 2 | 1, 2, 4 |

Even if you're not comparing to Qualcomm's previous architecture, Krait maintains the same low level advantage over any ARM Cortex A9 based design (NVIDIA Tegra 2/3, TI OMAP 4, Apple A5). Clock speeds are up with only a small increase in pipeline depth. The combination of these two factors alone should result in significant performance improvements for even single threaded applications. If you want to abstract by one more level: Krait will be faster regardless of application, regardless of usage model. You're looking at a generational gap in architecture here, not simply a clock bump.

Architecture Comparison

| | ARM11 | ARM Cortex A8 | ARM Cortex A9 | Qualcomm Scorpion | Qualcomm Krait |
|---|---|---|---|---|---|
| Decode | single-issue | 2-wide | 2-wide | 2-wide | 3-wide |
| Pipeline Depth | 8 stages | 13 stages | 8 stages | 10 stages | 11 stages |
| Out of Order Execution | N | N | Y | Partial | Y |
| FPU | VFP11 (pipelined) | VFPv3 (not-pipelined) | Optional VFPv3 (pipelined) | VFPv3 (pipelined) | VFPv4 (pipelined) |
| NEON | N/A | Y (64-bit wide) | Optional MPE (64-bit wide) | Y (128-bit wide) | Y (128-bit wide) |
| Process Technology | 90nm | 65nm/45nm | 40nm | 40nm | 28nm |
| Typical Clock Speeds | 412MHz | 600MHz/1GHz | 1.2GHz | 1GHz | 1.5GHz |

The memory interface of the chip has been improved tremendously. At a high level, the MSM8960 is Qualcomm's first SoC to feature PoP support for two LPDDR2 memory channels. We suspect there are lower level improvements to the memory interface as well, but we don't have more details from Qualcomm, and the current state of memory latency/bandwidth testing on Android is pretty abysmal.

Quantifying the Krait performance advantage requires a mixture of synthetic and application level tests. We'll start with Linpack, a Java port of the classic memory bandwidth/FPU test:

Linpack - Single-threaded

Linpack - Multi-threaded

Occasionally we'll see performance numbers that just make us laugh at their absurdity. Krait's Linpack performance is no exception. The performance advantage here is insane. The MSM8960 is able to deliver more than twice the performance of any currently shipping SoC. The gains are likely due in no small part to improvements in Krait's cache/memory controller. Krait can also multi-issue FP instructions, whereas A9 class architectures can apparently only dual-issue integer instructions.

Moving on we have our standard JavaScript benchmarks: Sunspider and Browsermark. Both of these tests show significant performance improvements, although understandably not by the margins we saw above in Linpack:

SunSpider Javascript Benchmark 0.9.1 - Stock Browser

BrowserMark

Krait and the MSM8960 are 20 - 35% faster than the dual-core Cortex A9s used in Samsung's Galaxy Nexus. For a look at how overall web page loading is impacted we loaded AnandTech.com three times and averaged the results. We presented results with the browser cache cleared after each run as well as results after all assets were cached:

AnandTech.com Page Loading Comparison (Stock ICS Browser)

| | Browser Cache Cleared | Cache In Use |
|---|---|---|
| Qualcomm MDP MSM8960 (Krait) | 5.5 seconds | 3.0 seconds |
| Samsung Galaxy Nexus (ARM Cortex A9) | 5.8 seconds | 4.4 seconds |

There's hardly any advantage when you're network bound, which is to be expected. However whenever the device can pull assets from a local cache (something that is quite common as images, CSS and even many page elements remain static between loads) the advantage grows considerably. Here we're seeing a 46% advantage from Krait over the Cortex A9 in the Galaxy Nexus.
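The arithmetic behind those two cases can be sketched in a few lines of Python. The load times are the ones from the table above; defining the advantage as how much longer the slower device takes relative to the faster one is our assumption about how the percentage was derived.

```python
# Page load times in seconds, from the page loading table above.
LOAD_TIMES = {
    "cache cleared": {"krait": 5.5, "cortex_a9": 5.8},
    "cache in use": {"krait": 3.0, "cortex_a9": 4.4},
}


def advantage_percent(faster_s, slower_s):
    """Extra time the slower device needs, as a percentage of the faster time."""
    return (slower_s / faster_s - 1.0) * 100.0


for scenario, times in LOAD_TIMES.items():
    pct = advantage_percent(times["krait"], times["cortex_a9"])
    print(f"{scenario}: Krait advantage of about {pct:.1f}%")
```

Under this definition the network-bound case works out to a single-digit gap, while the cached case lands in the mid-40s.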

We turn to Qualcomm's own Vellamo as a system/CPU/browser performance test:

Vellamo Overall Score

Again, we're showing a huge performance advantage here thanks to Krait. Seeing as how Vellamo is a Qualcomm benchmark don't get too attached to the advantage here, but it does echo some of what we've seen earlier.

Finally we have Rightware's Basemark OS 1.1 RC, which is fast becoming an impressively polished system benchmark, one which will hopefully eventually take the place of the likes of Quadrant.

Basemark OS - System

| | HTC Rezound | Galaxy Nexus | MDP MSM8960 |
|---|---|---|---|
| System Overall Score | 658 | 538 | 907 |
| Simple Java 1 | 298 loops/s | 210 loops/s | 375 loops/s |
| Simple Java 2 | 7.28 loops/s | 8.61 loops/s | 10.8 loops/s |
| SMP Test | 35.3 loops/s | 49.2 loops/s | 64.4 loops/s |
| 100K File (eMMC->SD) | 6.49 MB/s | 9.52 MB/s | 8.64 MB/s |
| 100K File (SD->eMMC) | 33.0 MB/s | 17.8 MB/s | 39.8 MB/s |
| 100K File (eMMC->eMMC) | 37.8 MB/s | 34.5 MB/s | 48.9 MB/s |
| 100K File (SD->SD) | 8.47 MB/s | 8.30 MB/s | 12.7 MB/s |
| Database Operation | 10.0 ops/s | 5.73 ops/s | 19.4 ops/s |
| Zip Compression | 0.509 s | 0.848 s | 0.561 s |
| Zip Decompression | 0.097 s | 0.206 s | 0.073 s |

On the CPU centric tests Basemark OS is showing anywhere from a 20% - 80% increase in performance over the 1.5 GHz APQ8060 based HTC Rezound. IO performance is also tangibly improved although that could be a function of NAND performance rather than the SoC specifically.
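As a rough check on that range, the following Python sketch computes the MSM8960's gain over the HTC Rezound for each CPU centric row of the Basemark OS table. Treating Simple Java 1, Simple Java 2, and the SMP test as the CPU centric group, and handling the timed zip rows as lower-is-better, are our assumptions.

```python
# (HTC Rezound, MDP MSM8960) values from the Basemark OS table above.
THROUGHPUT_TESTS = {  # loops/s, higher is better
    "Simple Java 1": (298.0, 375.0),
    "Simple Java 2": (7.28, 10.8),
    "SMP Test": (35.3, 64.4),
}
TIMED_TESTS = {  # seconds, lower is better
    "Zip Compression": (0.509, 0.561),
    "Zip Decompression": (0.097, 0.073),
}


def gain_percent(baseline, new, lower_is_better=False):
    """Percentage speedup of `new` over `baseline`; negative means slower."""
    ratio = baseline / new if lower_is_better else new / baseline
    return (ratio - 1.0) * 100.0


for name, (rezound, mdp) in THROUGHPUT_TESTS.items():
    print(f"{name}: {gain_percent(rezound, mdp):+.1f}%")
for name, (rezound, mdp) in TIMED_TESTS.items():
    print(f"{name}: {gain_percent(rezound, mdp, lower_is_better=True):+.1f}%")
```

Flipping the ratio for timed tests keeps the sign convention consistent: a positive number always means the MSM8960 is faster.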

These results as a whole simply quantify what we've felt during our use of the MSM8960 MDP: this is the absolute smoothest we've ever seen Ice Cream Sandwich run.



The MSM8960 is an unusual member of the Krait family in that it doesn't use an Adreno 3xx GPU. In order to get the SoC out quickly, Qualcomm paired the two Krait cores in the 8960 with a tried and true GPU design: the Adreno 225. Adreno 225 itself hasn't been used in any prior Qualcomm SoC, but it is very closely related to the Adreno 220 used in the Snapdragon S3 that we've seen in a number of recent handsets.

Compared to Adreno 220, 225 primarily adds support for Direct3D 9_3 (which includes features like multiple render targets). The resulting impact on die area is around 5% and required several months of work on Qualcomm's part.

From a compute standpoint however, Adreno 225 looks identical to Adreno 220. The big difference is that, thanks to the 8960's 28nm manufacturing process, Adreno 225 can now run at up to 400MHz compared to 266MHz in Adreno 220 designs. A 50% increase in GPU clock frequency combined with a doubling in memory bandwidth compared to Snapdragon S3 gives the Adreno 225 a sizable advantage over its predecessor.

Mobile SoC GPU Comparison

| | Adreno 225 | PowerVR SGX 540 | PowerVR SGX 543 | PowerVR SGX 543MP2 | Mali-400 MP4 | GeForce ULP | Kal-El GeForce |
|---|---|---|---|---|---|---|---|
| SIMD Name | - | USSE | USSE2 | USSE2 | Core | Core | Core |
| # of SIMDs | 8 | 4 | 4 | 8 | 4 + 1 | 8 | 12 |
| MADs per SIMD | 4 | 2 | 4 | 4 | 4 / 2 | 1 | 1 |
| Total MADs | 32 | 8 | 16 | 32 | 18 | 8 | 12 |
| GFLOPS @ 200MHz | 12.8 GFLOPS | 3.2 GFLOPS | 6.4 GFLOPS | 12.8 GFLOPS | 7.2 GFLOPS | 3.2 GFLOPS | 4.8 GFLOPS |
| GFLOPS @ 300MHz | 19.2 GFLOPS | 4.8 GFLOPS | 9.6 GFLOPS | 19.2 GFLOPS | 10.8 GFLOPS | 4.8 GFLOPS | 7.2 GFLOPS |
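The GFLOPS rows in the table follow from the MAD counts: each multiply-add counts as two floating point operations per clock. The Python sketch below reproduces that calculation. The function name is ours, and the final 400 MHz figure is simply the same formula applied to the Adreno 225's quoted maximum clock, not a number supplied by Qualcomm.

```python
FLOPS_PER_MAD = 2  # one multiply plus one add per MAD unit per clock

# Total MADs per GPU, from the comparison table above.
TOTAL_MADS = {
    "Adreno 225": 32,
    "PowerVR SGX 540": 8,
    "PowerVR SGX 543": 16,
    "PowerVR SGX 543MP2": 32,
    "Mali-400 MP4": 18,
    "GeForce ULP": 8,
    "Kal-El GeForce": 12,
}


def peak_gflops(total_mads, clock_mhz):
    """Theoretical peak throughput in GFLOPS at the given clock."""
    return total_mads * FLOPS_PER_MAD * clock_mhz / 1000.0


for name, mads in TOTAL_MADS.items():
    at_200 = peak_gflops(mads, 200)
    at_300 = peak_gflops(mads, 300)
    print(f"{name}: {at_200:.1f} GFLOPS @ 200MHz, {at_300:.1f} GFLOPS @ 300MHz")

# Extrapolation (our arithmetic): Adreno 225 at its quoted 400 MHz maximum.
print(f"Adreno 225 @ 400MHz: {peak_gflops(TOTAL_MADS['Adreno 225'], 400):.1f} GFLOPS")
```

These are theoretical peaks only; as the benchmarks show, delivered performance depends heavily on the workload and on memory bandwidth.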

We turn to GLBenchmark and Basemark ES 2.0 V1 to measure the Adreno 225's performance:

GLBenchmark 2.1 - Egypt

GLBenchmark 2.1 - Pro

Limited by Vsync the Adreno 225 can actually deliver similar performance to the PowerVR SGX 543MP2 in Apple's A5. However if we drive up the resolution, avoid vsync entirely and look at 720p results the Adreno 225 falls short. Its performance is measurably better than anything else available on the Android side in the Egypt benchmark, however the older Pro test still shows the SGS2's Mali-400 implementation as quicker. The eventual move to Adreno 3xx GPUs will likely help address this gap.

GLBenchmark 2.1 - Egypt - Offscreen (720p)

GLBenchmark 2.1 - Pro - Offscreen (720p)

Basemark ES 2.0 tells a similar story (updated: notes below):

RightWare Basemark ES 2.0 V1 - Taiji

RightWare Basemark ES 2.0 V1 - Hoverjet

In the original version of this article we noticed some odd behavior on the part of the Mali-400 MP4 based Samsung Galaxy S 2. Initially we thought the ARM based GPU was simply faster than the Adreno 225 implementation in the MSM8960, however it turns out there was another factor at play. The original version of Basemark ES 2.0 V1 had anti-aliasing enabled and requested 4X MSAA from all devices that ran it. Some GPUs will run the test with AA disabled for various reasons (e.g. some don't technically support 4X MSAA), while others (Adreno family included) will run with it enabled. This resulted in the Adreno GPUs being unfairly penalized. We've since re-run all of the numbers with AA disabled and at WVGA (to avoid hitting vsync on many of the devices).

Basemark clearly favors Qualcomm's Adreno architecture; whether or not that's representative of real world workloads is another discussion entirely.

The results above are at 800 x 480. We're unable to force display resolution on the iOS version of Basemark so we've got a native resolution comparison below:

RightWare Basemark ES 2.0 V1 Comparison (Native Resolution)

| | Taiji | Hoverjet |
|---|---|---|
| Apple iPhone 4S (960 x 640) | 16.623 fps | 30.178 fps |
| Qualcomm MDP MSM8960 (1024 x 600) | 40.576 fps | 59.586 fps |

Even at its lower native resolution, Apple's iPhone 4S is unable to outperform the MSM8960 based MDP here. It's unclear why there's such a drastic reversal in standing between the Adreno 225 and PowerVR SGX 543MP2 compared to the GLBenchmark results. Needless to say, 3D performance can easily vary depending on the workload. We're still in dire need of good 3D game benchmarks on Android. Here's hoping that some cross platform iOS/Android game developers using Epic's UDK will expose frame rate counters/benchmarking tools in their games.



Power Measurements using Trepn

Measuring power draw is an interesting and unique capability of Qualcomm's MDPs. Using their Trepn Profiler software and measurement hardware integrated into the MDP, we can measure a number of different power rails on the device, including power draw from each CPU core, the digital core (including video decoder and modem), and a number of other rails.

Measuring and keeping track of how different SoCs consume power is something we've wanted to do for a while, and at least under the Qualcomm MDP umbrella, it's now possible to measure right on the device.

The original goal was to compare power draw on 45nm MSM8660 versus 28nm MSM8960, however we encountered stability issues with Trepn profiler on the older platform that are still being resolved. Thankfully it is possible to take measurements on MSM8960, and for this we turned to a very CPU intensive task that would last long enough to get a good measurement, and also load both cores so we can see how things behave. That test is the Moonbat Benchmark, which is a web-worker wrapper of the SunSpider 0.9 test suite. We fired up a test consisting of 4 workers and 50 runs inside Chrome beta (which is web-worker enabled), and profiled using Trepn.

If you squint at the graph, you can see that one Krait core can use around 750 mW at maximum load. I didn't enable the CPU frequency graph (just to keep things simple above) but this 750 mW number happens right at 1.5 GHz. The green spikes from battery power are when we're drawing more than the available current from USB - this is also why you see devices sometimes discharge even when plugged in. There's an idle period at the end that I also left visible - you can see how quickly Qualcomm's governor suspends the second core completely after our moonbat test finishes running.

Here's another run of moonbat on Chrome Beta where we can see the same behavior, but zoomed in a bit better - each Krait core will consume anywhere between 450 mW and 750 mW depending on the workload, which does change during our run while V8 does its JIT compilation and Chrome dispatches work to each CPU.
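A power trace like this can be reduced to a single energy figure by integrating power over time. The Python sketch below does that with the trapezoidal rule. The sample values are hypothetical, chosen only to sit inside the 450 mW to 750 mW range described above, and the (seconds, milliwatts) layout is our assumption rather than Trepn's actual export format.

```python
def energy_joules(samples):
    """Trapezoidal integral of (time_s, power_mw) samples, in joules."""
    if len(samples) < 2:
        raise ValueError("need at least two samples")
    total_mj = 0.0
    for (t0, p0), (t1, p1) in zip(samples, samples[1:]):
        if t1 <= t0:
            raise ValueError("timestamps must be strictly increasing")
        total_mj += (p0 + p1) / 2.0 * (t1 - t0)  # mW * s = mJ
    return total_mj / 1000.0


def average_power_mw(samples):
    """Mean power over the trace, in milliwatts."""
    duration_s = samples[-1][0] - samples[0][0]
    return energy_joules(samples) * 1000.0 / duration_s


# Hypothetical one-core trace, not measured data.
trace = [(0.0, 450.0), (1.0, 750.0), (2.0, 750.0), (3.0, 450.0)]
print(f"energy: {energy_joules(trace):.2f} J")
print(f"average power: {average_power_mw(trace):.0f} mW")
```

Comparing energy per completed run, rather than peak power, would be one way to put two SoCs on equal footing once traces from both platforms are available.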

The next big question is obviously - well how much does GPU contribute to power drain? The red "Digital Core Rail Power" lines above include the Adreno 225 GPU, video decode, and "modem digital" blocks. Cellular is disabled on the MDP MSM8960, and we're not decoding any video, so in the right circumstances we can somewhat isolate out the GPU. To find out, I profiled a run of GLBenchmark Egypt on High settings (which is an entirely GPU compute bound test) and let it run to completion. You can see how the digital rail bounces between 800 mW and 1.2 W while the test is running. Egypt's CPU portions are pretty much single-threaded as well, as shown by the yellow and green lines above.

Another interesting case is what this looks like when browsing the web. I fired up the analyzer and loaded the AnandTech homepage followed by an article, and scrolled the page in the trace above. Chrome and "Browser" on Android now use the GPU for composition and rendering the page, and you can see the red line in the plot spike up when I'm actively panning and translating around on the page. In addition, the second CPU core only really wakes up when loading the page or parsing HTML.

One thing we unfortunately can't measure is how much power having the baseband lit up on each different air interface (CDMA2000 1x, EV-DO, WCDMA, LTE, etc.) consumes, as the MDP MSM8960 we were sampled doesn't have cellular connectivity enabled. This is something that we understand in theory (at least for the respective WCDMA and LTE radio resource states), but remains to be empirically explored. It's unfortunate that we also can't compare to the MDP MSM8660 quite yet, but that might become possible pretty quickly.



Final Words

It goes without saying that MSM8960 is a hugely important SoC release for Qualcomm. It's the first release with Qualcomm's new Krait CPU architecture, an entirely new cellular baseband with support for nearly every air interface, and is manufactured on TSMC's 28nm process. It says something that we're able to hold 28nm TSMC silicon in our hands in the form of the MDP, and it's only a matter of time before we start seeing Krait show up in devices in 2012.

We've gone over basically all of the benchmarks available to us on Android right now, and yet subjective performance impressions are still valuable. The MDP8960 is the absolute fastest we've seen Ice Cream Sandwich thus far - the UI is absolutely butter smooth everywhere, and web browsing in either Chrome or the stock Android Browser is also the smoothest we've seen it. There's no stutter bringing up the application switcher, or taking screenshots, two places that 4.0.3 still drops frames on the Galaxy Nexus.

Krait offers another generational leap in mobile SoC performance. The range of impact depends entirely on the workload but it's safe to say that it's noticeable. The GPU side of the equation has been improved tremendously as well, although that's mostly a function of 28nm enabling a very high clock speed for Qualcomm's Adreno 225. We are eager to see what the Adreno 3xx GPUs that will pair up with future Krait SoCs can do.

The big unknowns today are power consumption and the performance of shipping devices. While we were able to provide power numbers using Qualcomm's handy Trepn tool, we couldn't produce a reference point on older silicon. The move to 28nm and a second generation of cellular basebands has generally been heralded as being the answer to our battery life issues, particularly with LTE. It remains to be seen just how much of an improvement we'll see there. Knowing how much power MSM8960's cellular architecture uses is especially relevant when you consider that MDM9615 includes the exact same modem as MSM8960.

These initial results look extremely promising, however. Krait based devices should begin shipping sometime next quarter; the wait is almost over.
