The CPU Comparison: NVIDIA, TI & Qualcomm in 2011

NVIDIA makes two versions of the Tegra 2, one for tablets and one for smartphones. The difference between the two boils down to packaging size and TDP. NVIDIA hasn’t been too forthcoming with information but here’s what I know thus far:

NVIDIA Tegra 2
SoC Part Number CPU Clock GPU Clock Availability
NVIDIA Tegra 2 T20 1GHz 333MHz Now
NVIDIA Tegra 2 AP20H 1GHz 300MHz Now
NVIDIA Tegra 2 3D T25 1.2GHz 400MHz Q2 2011
NVIDIA Tegra 2 3D AP25 1.2GHz 400MHz Q2 2011

The T25/AP25 are believed to be the upcoming Tegra 2 3D SoCs. They increase CPU clock speed to 1.2GHz and GPU clock to 400MHz. The T20/AP20H are the current Tegra 2 models, with the T20 aimed at tablets and AP20H for smartphones. The Tegra 2 T20 and AP20H both run their CPU cores at up to 1GHz depending on software load.

Including NVIDIA’s Tegra 2 there are three competing CPU architectures at play in the 2011 SoC race: the ARM Cortex A8, ARM Cortex A9 and Qualcomm Scorpion (the CPU core at the heart of the Snapdragon SoC).

NVIDIA chose to skip the A8 generation and instead would jump straight to the Cortex A9. For those of you who aren’t familiar with ARM microprocessor architectures, the basic breakdown is below:

ARM Architecture Comparison
  ARM11 ARM Cortex A8 ARM Cortex A9
Issue Width single-issue dual-issue dual-issue
Pipeline Depth 8 stages 13 stages 9 stages
Out of Order Execution N N Y
Process Technology 90nm 65nm/45nm 40nm
Typical Clock Speeds 412MHz 600MHz/1GHz 1GHz

ARM11 was a single-issue, in-order architecture. Cortex A8 moved to dual-issue and A9 adds an out-of-order execution engine. The A9’s integer pipeline is also significantly shortened from 13 stages down to 9. The combination of out-of-order execution and a reduction in pipeline depth should give the Cortex A9 a healthy boost over the A8 at the same clock speed. The Cortex A8 is only supported in single-core configurations, while the Tegra 2 and TI’s OMAP 4 both use two A9 cores.

Each jump (ARM11 to A8 to A9) is good for at least a generational performance improvement (think 486 to Pentium, Pentium to Pentium Pro/II).

With each new generation of ARM architecture we also got a new manufacturing process and higher clock speeds. ARM11 was largely built at 90nm, while Cortex A8 started at 65nm. We saw most A8 SoCs transition to 40/45nm in 2010, which is where Cortex A9 will begin. The cadence will continue with A9 scaling down to 28nm in 2012 and the new Cortex A15 picking up where A9 leaves off.

Qualcomm’s Scorpion Core

The third contender in 2011 is Qualcomm’s Scorpion core. Scorpion is a dual-issue, mostly in-order microprocessor architecture developed entirely by Qualcomm. The Scorpion core implements the same ARMv7-A instruction set as the Cortex A8 and A9, however the CPU is not based on ARM’s Cortex A8 or A9. This is the point many seem to be confused about. Despite high level similarities, the Scorpion core is not Qualcomm’s implementation of a Cortex A8. Qualcomm holds an ARM architecture license which allows it to produce microprocessors that implement an ARM instruction set. This is akin to AMD holding an x86 license that allows it to produce microprocessors that are binary compatible with Intel CPUs. However calling AMD’s Phenom II a version of Intel’s Core i7 would be incorrect. Just like calling Scorpion a Cortex A8 is incorrect.

I mention high level similarities between Scorpion and the Cortex A8 simply because the two architectures appear alike. They both have dual-issue front ends and a 13-stage integer pipeline. Qualcomm claims the Scorpion core supports some amount of instruction reordering, however it’s not clear to what extent the CPU is capable of out-of-order execution. Intel’s Atom for example can reorder around certain instructions however it is far from an out-of-order CPU.

Architecture Comparison
  ARM11 ARM Cortex A8 ARM Cortex A9 Qualcomm Scorpion
Issue Width single-issue dual-issue dual-issue dual-issue
Pipeline Depth 8 stages 13 stages 9 stages 13 stages
Out of Order Execution N N Y Partial
FPU Optional VFPv2 (not-pipelined) VFPv3 (not-pipelined) Optional VFPv3-D16 (pipelined) VFPv3 (pipelined)
NEON N/A Y (64-bit wide) Optional MPE (64-bit wide) Y (128-bit wide)
Process Technology 90nm 65nm/45nm 40nm 40nm
Typical Clock Speeds 412MHz 600MHz/1GHz 1GHz 1GHz

Scorpion has some big advantages on the floating point side. Qualcomm implements ARM’s VFPv3 vector floating point instruction set on Scorpion, the same instructions supported by the Cortex A8. The Cortex A8’s FPU wasn’t pipelined. A single instruction had to make it through the FP pipeline before the next instruction could be issued. For those of you who remember desktop processors, the Cortex A8’s non-pipelined FPU is reminiscent of Intel’s 486 and AMD’s K6. It wasn’t until the Pentium processor that Intel gained a pipelined FPU, and for AMD that came with the Athlon. As a result, floating point code runs rather slowly on the Cortex A8. You can get around the A8’s poor FP performance for some workloads by using NEON, which is a much higher performance SIMD engine paired with the Cortex A8.

The Scorpion’s VFPv3 FPU is fully pipelined. As a result, floating point performance is much improved. Qualcomm also implements support for NEON, but with a wider 128-bit datapath (compared to 64-bit in the A8 and A9). As a result, Qualcomm should have much higher VFP and NEON performance than the Cortex A8 (we see a good example of this in our Linpack performance results).

While all Cortex A8 designs incorporated ARM’s NEON SIMD engine, A9 gives you the option of integrating either a SIMD engine (ARM’s Media Processing Engine, aka NEON) or a non-vector floating point unit (VFPv3-D16). NVIDIA chose not to include the A9’s MPE and instead opted for the FPU. Unlike the A8’s FPU, in the A9 the FPU is fully pipelined—so performance is much improved. The A9’s FPU however is still not as quick at math as the optional SIMD MPE.

Minimum Instruction Latencies (Single Precision)
Instruction FADD FSUB FMUL FMAC
ARM Cortex A8 (FPU) 9 cycles 9 cycles 10 cycles 18 cycles
ARM Cortex A9 (FPU) 4 cycles 4 cycles 5 cycles 8 cycles
ARM Cortex A8 (NEON) 5 cycles 5 cycles 5 cycles 9 cycles
ARM Cortex A9 (MPE/NEON) 5 cycles 5 cycles 5 cycles 9 cycles

Remember that the A8's FPU isn't pipelined so it can't complete these instructions every cycle, resulting throughput that's nearly equal to the instruction latency. The A9's FPU by comparison is fully pipelined, giving it much higher instruction throughput compared to the A8.

NVIDIA claims implementing MPE would incur a 30% die penalty for a performance improvement that impacts only a minimal amount of code. It admits that at some point integrating a SIMD engine makes sense, just not yet. The table above shows a comparison of instruction latency on various floating point and SIMD engines in A8 and A9. TI’s OMAP 4 on the other hand will integrate ARM’s Cortex A9 MPE. Depending on the code being run, OMAP 4 could have a significant performance advantage vs. the Tegra 2. Qualcomm's FPU/NEON performance should remain class leading in non-bandwidth constrained applications.

Unfortunately for Qualcomm, much of what impacts smartphone application performance today isn’t bound by floating point performance. Future apps and workloads will definitely appreciate Qualcomm’s attention to detail, but loading a web page won’t touch the FPU.

The Scorpion core remains largely unchanged between SoC generations. It won’t be until 28nm in 2012 that Qualcomm introduces a new microprocessor architecture. Remember that as an architecture licensee Qualcomm is going to want to create architectures that last as long as possible in order to recover its initial investment. Microprocessor licensees however have less invested into each generation and can move to new architectures much faster.

Cache & Memory Controller Comparison

NVIDIA outfits the Tegra 2 with a 1MB L2 cache shared between the two Cortex A9 cores. A shared L2/private L1 structure makes the most sense for a dual-core CPU as we’ve learned from following desktop CPUs for years. It’s only once you make the transition to 3 or 4 cores that it makes sense to have private L2s and introduce a large, catch-all shared L3 cache.

Qualcomm’s upcoming QSD8660 only has a 512KB L2 cache shared between its two cores, while TI’s OMAP 4 has a Tegra 2-like 1MB L2. In these low power parts, having a large L2 with a good hit rate is very important. Moving data around a chip is always very power intensive. The close proximity of the L2 cache to the CPU cores helps keep power consumption down. Any data that has to be fetched from main memory requires waking up the external memory interface as well as the external or on-package DRAMs. A trip to main memory easily requires an order of magnitude more power than pulling data from a local cache.

While the OMAP 4 and Tegra 2 both have a larger L2 cache, it’s unclear what frequency the cache operates at. Qualcomm’s L2 operates at core frequency and as a result could offer higher bandwidth/lower latency operation.

NVIDIA opted for a single-channel 32-bit LPDDR2 memory interface. Qualcomm’s QSD8660/MSM8x60 and TI’s upcoming OMAP 4 have two LPDDR2 channels. NVIDIA claims that a narrower memory bus with more efficient arbitration logic is the best balance for power/performance at the 40nm process node. In order to feed the data hungry CPUs and GPU, NVIDIA specs the Tegra 2 for use with 600MHz datarate LPDDR2 memory (although the LG Optimus 2X actually has 800MHz datarate DRAM on package, it still only runs at 600MHz).

Assuming all memory controllers are equally efficient (which is an incorrect assumption), NVIDIA should have half the bandwidth of TI’s OMAP 4. NVIDIA’s larger L2 cache gives it an advantage over the QSD8660, giving Tegra 2 an effective memory latency advantage for a percentage of memory requests. Depending on the operating frequency of NVIDIA’s L2, Qualcomm could have a cache bandwidth advantage. The take away point here is that there’s no clear winner in this battle of specifications, just a comparison of tradeoffs.

The Dual-Core Comparison in 2011

In 2011 Qualcomm will introduce the QSD8660, a Snapdragon SoC with two 45nm Scorpion cores running at 1.2GHz. With a deeper pipeline, smaller cache and a largely in-order architecture, the QSD8660 should still trail NVIDIA’s Cortex A9 based Tegra 2 at the same clock speed. However Tegra 2 launches at 1GHz and it won’t be until Tegra 2 3D that we see 1.2GHz parts. Depending on timing we could see dual-core Qualcomm phones running at 1.2GHz competing with dual-core NVIDIA phones running at 1.0GHz. The winner between those two may not be as clear—it’ll likely vary based on workload.

At 1.2GHz I’d expect the Tegra 2 3D to be the fastest SoC for the entirety of 2011. Once 2012 rolls around we’ll reset the clock as Qualcomm introduces its next-generation microprocessor architecture.

NVIDIA clearly has an execution advantage as it is the first SoC maker to ship an ARM Cortex A9. NVIDIA’s biggest weakness on the CPU side is the lack of NEON support in Tegra 2, something that doesn’t appear to be an issue today but could be a problem in the future depending on how widespread NEON code becomes in apps. TI’s OMAP 4 includes both a NEON unit and a wider memory bus, the latter could be a performance advantage depending on how well designed the memory controller is.

Qualcomm is a bit behind on the architecture side. The Scorpion core began shipping in Snapdragon SoCs at the end of 2008 and the architecture won’t be refreshed until late 2011/2012. As Intel discovered with NetBurst, 4—5 year runs for a single microprocessor architecture are generally too long. Luckily Qualcomm appears to be on a ~3 year cadence at this point.

The QSD8660 running at 1.2GHz should be sufficient to at least keep Qualcomm competitive until the Scorpion’s replacement arrives (although I suspect NVIDIA/TI will take the crown with their A9 designs). One aspect we haven’t talked about (mostly because there isn’t any good data available) is power consumption. It’s unclear how the Scorpion core compares to the Cortex A9 in terms of power consumption at the same process node.

NVIDIA's Tegra 2 The GeForce ULV
POST A COMMENT

75 Comments

View All Comments

  • GoodRevrnd - Tuesday, February 08, 2011 - link

    TV link would be awesome, but why would you need the phone to bridge the TV and network?? Reply
  • aegisofrime - Monday, February 07, 2011 - link

    May I suggest x264 encoding as a test of the CPU power? There's a version of x264 available for ARM chips, along with NEON optimizations. Should be interesting! Reply
  • Shadowmaster625 - Monday, February 07, 2011 - link

    What is the point in having a high performance video processor when you cannot do the two things that actually make use of it? Those two things are: 1. Watch any movie in your collection without transcoding? (FAIL) 2. Play games. No actual buttons = FAIL. If you think otherwise then you dont actually play games. Just stick with facebook flash trash. Reply
  • TareX - Wednesday, February 09, 2011 - link

    The only reason I'd pay for a dual core phone is smooth flash-enabled web browsing, not gaming. Reply
  • zorxd - Monday, February 07, 2011 - link

    Stock Android has it too. There is also E for EDGE and G for GPRS. Reply
  • Exophase - Monday, February 07, 2011 - link

    Hey Anand/Brian,

    There are some issues I've found with some information in this article:

    1) You mention that Cortex-A8 is available in a multicore configuration. I'm pretty sure there's no such thing; you might be thinking of ARM11MPCore.

    2) The floating point latencies table is just way off for NEON. You can find latencies here:
    http://infocenter.arm.com/help/index.jsp?topic=/co...
    It's the same in Cortex-A9. The table is a little hard to read; you have to look at the result and writeback stages to determine the latency (it's easier to read the A9 version). Here's the breakdown:
    FADD/FSUB/FMUL: 5 cycles
    FMAC: 9 cycles (note that this is because the result of the FMUL pipeline is then threaded through the FADD pipeline)
    The table also implies Cortex-A9 adds divide and sqrt instructions to NEON. In actuality, both support reciprocal approximation instructions in SIMD and full versions in scalar. The approximation instructions have both initial approximation with ~9 bits of precision and Newton Rhapson step instructions. The step instructions function like FMACs and have similar latencies. This kind of begs the question of where the A9 NEON DIV and SQRT numbers came from.

    The other issue I have with these numbers is that it only mentions latency and not throughput. The main issue is that the non-pipelined Cortex-A8 FPU has throughput almost as bad as its latency, while all of the other implementations have single cycle throughput for 2x 64-bit operations. Maybe throughput is what you mean by "minimum latency", however this would imply that Cortex-A9 VFP can't issue every cycle, which isn't the case.

    3) It's obvious from the GLBenchmark 2.0 Pro screenshot that there are some serious color limitations from Tegra 2 (look at the woman's face). This is probably due to using 16-bit. IMG has a major advantage in this area since it renders at full 32-bit (or better) precision internally and can dither the result to 16-bit to the framebuffer, which looks surprisingly similar in quality to non-dithered 32-bit. This makes a 16-bit vs 16-bit framebuffer comparison between the two very unbalanced - it's far more fair to just do both at 32-bit, but it doesn't look like the benchmark has any option for it. Furthermore, Tegra 2 is limited to 16-bit (optionally non-linear) depth buffers, while IMG utilizes 32-bit floating point depth internally. This is always going to be a disadvantage for Tegra 2 and is definitely worth mentioning in any comparison.

    Finally I feel like ranting a little bit about your use of the Android Linpack test. Anyone with a little common sense can tell that a native implementation of Linpack on these devices will yield several dozen times more than 40MFLOPS (should be closer to 1-4 FLOP/CPU cycle). What you see here is a blatant example of Dalvik's extreme inability to perform with floating point code that extends well beyond an inability to perform SIMD vectorization.
    Reply
  • metafor - Monday, February 07, 2011 - link

    According to the developer of Linpack on Android:

    http://www.greenecomputing.com/category/android/

    It is mostly FP64 calculations done on Dalvik. While this may not be the fastest way to go about doing linear algebra, it is a fairly good representation of relative FP64 performance (which only exist in VFP).

    And let's face it, few app developers are going to dig into Android's NDK and write NEON optimized code.
    Reply
  • Exophase - Monday, February 07, 2011 - link

    Then let's ask this instead: who really cares about FP64 performance on a smartphone? I'd also argue that it is not even a good representation of relative FP64 performance since that's being obscured so much by the quality of the JITed code. Hence why you see Scorpion and A9 perform a little over twice as fast as A8 (per-clock) instead of several times faster. VFP is still in-order on Cortex-A9, competent scheduling matters.

    Maybe a lot of developers won't write NEON code on Android, but where it's written it could very well matter. For one thing, in Android itself. And theoretically one day Dalvik could actually be generating NEON competently.. so some synthetic tests of NEON could be a good look at what could be.
    Reply
  • metafor - Monday, February 07, 2011 - link

    Well, few people really :)

    Linpack as it currently exists on Android probably doesn't tell very much at all. But if you're just going to slap together an FP heavy app (pocket scientific computing anyone?) and aren't a professional programmer, this likely represents the result you see.

    I wouldn't mind seeing SpecFP ported natively to Android and running NEON. But alas, we'd need someone to roll up their sleeves and do that.

    I did do a native compile of Linpack using gcc to test on my Evo, though. It's still not SIMD code, of course, but native results using VFP were around the 70-80MFLOPS mark. Of course, it's scheduling for the A8's FPU and not Scorpion's.
    Reply
  • Anand Lal Shimpi - Monday, February 07, 2011 - link

    Thanks for your comment :)

    1) You're very right, I was thinking about the ARM11 - fixed :)

    2) Make that 2 for 2. You're right on the NEON values, I mistakenly grabbed the values from the cycles column and not the result column. The DIV/SQRT columns were also incorrect, I removed them from the article.

    I mentioned the lack of pipelining in the A8 FPU earlier in the article but I reiterated it underneath the table to hammer the point home. I agree that the lack of pipelining is the major reason for the A8's poor FP performance.

    3) Those screenshots were actually taken on IMG hardware. IMG has some pretty serious rendering issues running GLBenchmark 2.0.

    4) I'm not happy with the current state of Android benchmarks - Linpack included. Right now we're simply including everything we can get our hands on, but over the next 24 months I think you'll see us narrow the list and introduce more benchmarks that are representative of real world performance as well as contribute to meaningful architecture analysis.

    Take care,
    Anand
    Reply

Log in

Don't have an account? Sign up now