The CPU Comparison: NVIDIA, TI & Qualcomm in 2011

NVIDIA makes two versions of the Tegra 2, one for tablets and one for smartphones. The difference between the two boils down to package size and TDP. NVIDIA hasn’t been too forthcoming with information, but here’s what I know thus far:

NVIDIA Tegra 2
SoC               | Part Number | CPU Clock | GPU Clock | Availability
NVIDIA Tegra 2    | T20         | 1GHz      | 333MHz    | Now
NVIDIA Tegra 2    | AP20H       | 1GHz      | 300MHz    | Now
NVIDIA Tegra 2 3D | T25         | 1.2GHz    | 400MHz    | Q2 2011
NVIDIA Tegra 2 3D | AP25        | 1.2GHz    | 400MHz    | Q2 2011

The T25/AP25 are believed to be the upcoming Tegra 2 3D SoCs. They increase CPU clock speed to 1.2GHz and GPU clock to 400MHz. The T20/AP20H are the current Tegra 2 models, with the T20 aimed at tablets and AP20H for smartphones. The Tegra 2 T20 and AP20H both run their CPU cores at up to 1GHz depending on software load.

Including NVIDIA’s Tegra 2, there are three competing CPU architectures at play in the 2011 SoC race: the ARM Cortex A8, the ARM Cortex A9 and Qualcomm’s Scorpion (the CPU core at the heart of the Snapdragon SoC).

NVIDIA chose to skip the A8 generation entirely and jump straight to the Cortex A9. For those of you who aren’t familiar with ARM microprocessor architectures, the basic breakdown is below:

ARM Architecture Comparison
                       | ARM11        | ARM Cortex A8 | ARM Cortex A9
Issue Width            | single-issue | dual-issue    | dual-issue
Pipeline Depth         | 8 stages     | 13 stages     | 9 stages
Out of Order Execution | N            | N             | Y
Process Technology     | 90nm         | 65nm/45nm     | 40nm
Typical Clock Speeds   | 412MHz       | 600MHz/1GHz   | 1GHz

ARM11 was a single-issue, in-order architecture. Cortex A8 moved to dual-issue and A9 adds an out-of-order execution engine. The A9’s integer pipeline is also significantly shortened from 13 stages down to 9. The combination of out-of-order execution and a reduction in pipeline depth should give the Cortex A9 a healthy boost over the A8 at the same clock speed. The Cortex A8 is only supported in single-core configurations, while the Tegra 2 and TI’s OMAP 4 both use two A9 cores.
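
To make that out-of-order benefit a bit more concrete, here is a minimal C sketch of the kind of instruction-level parallelism involved; the function and the cycle commentary are illustrative assumptions, not measurements of any of these cores.

    #include <stddef.h>

    /* Illustrative only: an in-order core that misses in cache on data[i]
     * stalls until the load returns, even though the "independent" work
     * below doesn't depend on that load at all. An out-of-order core like
     * the Cortex A9 can keep executing the independent adds while the
     * load is still outstanding. */
    long sum_with_independent_work(const long *data, size_t n)
    {
        long dependent = 0;   /* consumes each loaded value         */
        long independent = 0; /* pure register arithmetic, no loads */

        for (size_t i = 0; i < n; i++) {
            dependent += data[i];        /* may miss in L1/L2 and go out to DRAM      */
            independent += (long)i * 3;  /* can proceed during the miss on an OoO core */
        }
        return dependent + independent;
    }

Deeper pipelines also magnify the cost of every stall, which is part of why the shorter 9-stage pipeline plus out-of-order execution makes the A9 such a potent combination.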

Each jump (ARM11 to A8 to A9) is good for at least a generational performance improvement (think 486 to Pentium, Pentium to Pentium Pro/II).

With each new generation of ARM architecture we also got a new manufacturing process and higher clock speeds. ARM11 was largely built at 90nm, while Cortex A8 started at 65nm. We saw most A8 SoCs transition to 40/45nm in 2010, which is where Cortex A9 will begin. The cadence will continue with A9 scaling down to 28nm in 2012 and the new Cortex A15 picking up where A9 leaves off.

Qualcomm’s Scorpion Core

The third contender in 2011 is Qualcomm’s Scorpion core. Scorpion is a dual-issue, mostly in-order microprocessor architecture developed entirely by Qualcomm. The Scorpion core implements the same ARMv7-A instruction set as the Cortex A8 and A9, however the CPU is not based on ARM’s Cortex A8 or A9. This is the point many seem to be confused about. Despite high level similarities, the Scorpion core is not Qualcomm’s implementation of a Cortex A8. Qualcomm holds an ARM architecture license which allows it to produce microprocessors that implement an ARM instruction set. This is akin to AMD holding an x86 license that allows it to produce microprocessors that are binary compatible with Intel CPUs. However calling AMD’s Phenom II a version of Intel’s Core i7 would be incorrect. Just like calling Scorpion a Cortex A8 is incorrect.

I mention high level similarities between Scorpion and the Cortex A8 simply because the two architectures appear alike. They both have dual-issue front ends and a 13-stage integer pipeline. Qualcomm claims the Scorpion core supports some amount of instruction reordering; however, it’s not clear to what extent the CPU is capable of out-of-order execution. Intel’s Atom, for example, can reorder around certain instructions, yet it is far from an out-of-order CPU.

Architecture Comparison
                       | ARM11                          | ARM Cortex A8         | ARM Cortex A9                  | Qualcomm Scorpion
Issue Width            | single-issue                   | dual-issue            | dual-issue                     | dual-issue
Pipeline Depth         | 8 stages                       | 13 stages             | 9 stages                       | 13 stages
Out of Order Execution | N                              | N                     | Y                              | Partial
FPU                    | Optional VFPv2 (not pipelined) | VFPv3 (not pipelined) | Optional VFPv3-D16 (pipelined) | VFPv3 (pipelined)
NEON                   | N/A                            | Y (64-bit wide)       | Optional MPE (64-bit wide)     | Y (128-bit wide)
Process Technology     | 90nm                           | 65nm/45nm             | 40nm                           | 40nm
Typical Clock Speeds   | 412MHz                         | 600MHz/1GHz           | 1GHz                           | 1GHz

Scorpion has some big advantages on the floating point side. Qualcomm implements ARM’s VFPv3 vector floating point instruction set on Scorpion, the same instructions supported by the Cortex A8. The Cortex A8’s FPU, however, wasn’t pipelined: a single instruction had to make it all the way through the FPU before the next one could be issued. For those of you who remember desktop processors, the Cortex A8’s non-pipelined FPU is reminiscent of Intel’s 486 and AMD’s K6. It wasn’t until the Pentium processor that Intel gained a pipelined FPU, and for AMD that came with the Athlon. As a result, floating point code runs rather slowly on the Cortex A8. You can get around the A8’s poor FP performance for some workloads by using NEON, a much higher-performance SIMD engine paired with the Cortex A8.

The Scorpion’s VFPv3 FPU is fully pipelined. As a result, floating point performance is much improved. Qualcomm also implements support for NEON, but with a wider 128-bit datapath (compared to 64-bit in the A8 and A9). As a result, Qualcomm should have much higher VFP and NEON performance than the Cortex A8 (we see a good example of this in our Linpack performance results).
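
For readers who haven’t seen NEON code, here is a minimal C sketch (using the standard arm_neon.h intrinsics) of the kind of loop a wide SIMD unit accelerates; the function below is an illustration of the concept, not code from this article or any vendor library.

    #include <arm_neon.h>  /* requires an ARMv7 toolchain with NEON enabled, e.g. -mfpu=neon */

    /* dst[i] += a[i] * b[i], four single-precision floats per iteration.
     * This is exactly the kind of multiply-accumulate work where a wider
     * 128-bit NEON datapath (Scorpion) can outpace the 64-bit
     * implementations in the Cortex A8/A9. For brevity, n is assumed to
     * be a multiple of 4. */
    void mac_f32(float *dst, const float *a, const float *b, int n)
    {
        for (int i = 0; i < n; i += 4) {
            float32x4_t va = vld1q_f32(a + i);   /* load 4 floats              */
            float32x4_t vb = vld1q_f32(b + i);
            float32x4_t vd = vld1q_f32(dst + i);
            vd = vmlaq_f32(vd, va, vb);          /* 4-wide multiply-accumulate */
            vst1q_f32(dst + i, vd);              /* store 4 results            */
        }
    }

Whether an app benefits, of course, depends on its code actually containing loops like this, which ties into NVIDIA’s NEON-coverage argument further down.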

While all Cortex A8 designs incorporated ARM’s NEON SIMD engine, A9 gives you the option of integrating either a SIMD engine (ARM’s Media Processing Engine, aka NEON) or a non-vector floating point unit (VFPv3-D16). NVIDIA chose not to include the A9’s MPE and instead opted for the FPU. Unlike the A8’s FPU, in the A9 the FPU is fully pipelined—so performance is much improved. The A9’s FPU however is still not as quick at math as the optional SIMD MPE.

Minimum Instruction Latencies (Single Precision)
Instruction              | FADD     | FSUB     | FMUL      | FMAC
ARM Cortex A8 (FPU)      | 9 cycles | 9 cycles | 10 cycles | 18 cycles
ARM Cortex A9 (FPU)      | 4 cycles | 4 cycles | 5 cycles  | 8 cycles
ARM Cortex A8 (NEON)     | 5 cycles | 5 cycles | 5 cycles  | 9 cycles
ARM Cortex A9 (MPE/NEON) | 5 cycles | 5 cycles | 5 cycles  | 9 cycles

Remember that the A8's FPU isn't pipelined, so it can't complete these instructions every cycle, resulting in throughput that's nearly equal to the instruction latency. The A9's FPU, by comparison, is fully pipelined, giving it much higher instruction throughput than the A8.
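
As a back-of-the-envelope illustration of what pipelining buys, the small C sketch below applies the FADD latencies from the table to a run of independent adds; the issue-one-per-cycle model is a simplifying assumption that ignores issue width and memory effects.

    #include <stdio.h>

    /* Rough cycle counts for n independent, back-to-back FADDs using the
     * 9-cycle (A8 FPU) and 4-cycle (A9 FPU) latencies from the table above.
     * A non-pipelined FPU must drain each instruction before accepting the
     * next; a fully pipelined FPU fills once and then accepts a new
     * instruction every cycle (idealized model). */
    int main(void)
    {
        const int n = 100;
        long a8_cycles = (long)n * 9;   /* non-pipelined: roughly latency per op       */
        long a9_cycles = 4 + (n - 1);   /* pipelined: fill the pipe, then one per cycle */

        printf("%d independent FADDs: ~%ld cycles on the A8 FPU, ~%ld on the A9 FPU\n",
               n, a8_cycles, a9_cycles);
        return 0;
    }

Even with this crude model the pipelined FPU comes out nearly an order of magnitude ahead on dense, independent floating point code.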

NVIDIA claims implementing MPE would incur a 30% die area penalty for a performance improvement that benefits only a small amount of code. It admits that at some point integrating a SIMD engine makes sense, just not yet. The table above compares instruction latencies on the various floating point and SIMD engines in the A8 and A9. TI’s OMAP 4, on the other hand, will integrate ARM’s Cortex A9 MPE. Depending on the code being run, OMAP 4 could have a significant performance advantage over the Tegra 2. Qualcomm's FPU/NEON performance should remain class leading in non-bandwidth-constrained applications.

Unfortunately for Qualcomm, much of what determines smartphone application performance today isn’t bound by floating point throughput. Future apps and workloads will definitely appreciate Qualcomm’s attention to detail, but loading a web page won’t touch the FPU.

The Scorpion core remains largely unchanged between SoC generations. It won’t be until 28nm in 2012 that Qualcomm introduces a new microprocessor architecture. Remember that as an architecture licensee, Qualcomm wants each architecture to last as long as possible in order to recover its initial investment. Companies that simply license ARM’s cores, however, have less invested in each generation and can move to new architectures much faster.

Cache & Memory Controller Comparison

NVIDIA outfits the Tegra 2 with a 1MB L2 cache shared between the two Cortex A9 cores. A shared L2/private L1 structure makes the most sense for a dual-core CPU as we’ve learned from following desktop CPUs for years. It’s only once you make the transition to 3 or 4 cores that it makes sense to have private L2s and introduce a large, catch-all shared L3 cache.

Qualcomm’s upcoming QSD8660 only has a 512KB L2 cache shared between its two cores, while TI’s OMAP 4 has a Tegra 2-like 1MB L2. In these low power parts, having a large L2 with a good hit rate is very important. Moving data around a chip is always very power intensive. The close proximity of the L2 cache to the CPU cores helps keep power consumption down. Any data that has to be fetched from main memory requires waking up the external memory interface as well as the external or on-package DRAMs. A trip to main memory easily requires an order of magnitude more power than pulling data from a local cache.
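
To put that order-of-magnitude claim in perspective, here is a toy C model of average energy per memory access as a function of L2 hit rate; the per-access energy numbers are placeholder assumptions chosen only to reflect the rough cache-versus-DRAM gap, not measurements of any of these SoCs.

    #include <stdio.h>

    /* Toy model: average energy per access = hit_rate * E(L2 hit) +
     * miss_rate * E(DRAM access). The energy values are assumed
     * placeholders; only their roughly 10x ratio matters here. */
    int main(void)
    {
        const double l2_hit_nj = 0.5;   /* assumed energy for an L2 hit (nJ)     */
        const double dram_nj   = 5.0;   /* assumed energy for a DRAM access (nJ) */

        for (double hit_rate = 0.80; hit_rate < 0.99; hit_rate += 0.05) {
            double avg_nj = hit_rate * l2_hit_nj + (1.0 - hit_rate) * dram_nj;
            printf("L2 hit rate %.0f%% -> ~%.2f nJ per access\n",
                   hit_rate * 100.0, avg_nj);
        }
        return 0;
    }

Under these placeholder numbers, moving from an 80% to a 95% hit rate nearly halves the average energy per access, which is why the 1MB-versus-512KB L2 difference matters for power as well as performance.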

While the OMAP 4 and Tegra 2 both have a larger L2 cache, it’s unclear what frequency the cache operates at. Qualcomm’s L2 operates at core frequency and as a result could offer higher bandwidth/lower latency operation.

NVIDIA opted for a single-channel 32-bit LPDDR2 memory interface. Qualcomm’s QSD8660/MSM8x60 and TI’s upcoming OMAP 4 have two LPDDR2 channels. NVIDIA claims that a narrower memory bus with more efficient arbitration logic is the best balance for power/performance at the 40nm process node. In order to feed the data hungry CPUs and GPU, NVIDIA specs the Tegra 2 for use with 600MHz datarate LPDDR2 memory (although the LG Optimus 2X actually has 800MHz datarate DRAM on package, it still only runs at 600MHz).
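
A quick peak-bandwidth calculation makes the tradeoff concrete; this is just bus width times data rate, and applying the same 600MHz data rate to the dual-channel parts is an assumption made for an apples-to-apples comparison rather than a published spec.

    #include <stdio.h>

    /* Theoretical peak DRAM bandwidth = channel width (bytes) x data rate.
     * Tegra 2: one 32-bit LPDDR2 channel at a 600MHz data rate.
     * OMAP 4 / QSD8660-class parts: two LPDDR2 channels, assumed here to
     * run at the same data rate for comparison purposes. */
    int main(void)
    {
        const double data_rate_mt_s     = 600.0;  /* million transfers per second */
        const double bytes_per_transfer = 4.0;    /* 32-bit channel               */

        double single_gb_s = data_rate_mt_s * bytes_per_transfer / 1000.0;
        double dual_gb_s   = 2.0 * single_gb_s;

        printf("Single 32-bit LPDDR2-600 channel: ~%.1f GB/s peak\n", single_gb_s);
        printf("Two 32-bit LPDDR2-600 channels:   ~%.1f GB/s peak\n", dual_gb_s);
        return 0;
    }

That works out to roughly 2.4GB/s for Tegra 2 versus roughly 4.8GB/s for the dual-channel parts, which is where the "half the bandwidth" figure below comes from.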

Assuming all memory controllers are equally efficient (which is an incorrect assumption), NVIDIA should have half the bandwidth of TI’s OMAP 4. NVIDIA’s larger L2 cache gives Tegra 2 an effective memory latency advantage over the QSD8660 for a percentage of memory requests. Depending on the operating frequency of NVIDIA’s L2, Qualcomm could have a cache bandwidth advantage. The takeaway here is that there’s no clear winner in this battle of specifications, just a comparison of tradeoffs.

The Dual-Core Comparison in 2011

In 2011 Qualcomm will introduce the QSD8660, a Snapdragon SoC with two 45nm Scorpion cores running at 1.2GHz. With a deeper pipeline, smaller cache and a largely in-order architecture, the QSD8660 should still trail NVIDIA’s Cortex A9 based Tegra 2 at the same clock speed. However Tegra 2 launches at 1GHz and it won’t be until Tegra 2 3D that we see 1.2GHz parts. Depending on timing we could see dual-core Qualcomm phones running at 1.2GHz competing with dual-core NVIDIA phones running at 1.0GHz. The winner between those two may not be as clear—it’ll likely vary based on workload.
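
To frame that 1.2GHz-versus-1.0GHz matchup, the short sketch below works out how much per-clock advantage the A9 needs to offset Scorpion's clock edge; the swept IPC figures are arbitrary placeholders, not measured values for either core.

    #include <stdio.h>

    /* Relative performance is roughly IPC x clock. The IPC advantages swept
     * below are placeholders; the point is only that a ~20% clock deficit
     * is erased once the A9 averages ~20% more work per clock. */
    int main(void)
    {
        const double a9_clock_ghz       = 1.0;
        const double scorpion_clock_ghz = 1.2;

        for (double ipc_adv = 1.0; ipc_adv < 1.35; ipc_adv += 0.1) {
            double ratio = (a9_clock_ghz * ipc_adv) / scorpion_clock_ghz;
            printf("A9 with %2.0f%% higher IPC vs 1.2GHz Scorpion: %.2fx\n",
                   (ipc_adv - 1.0) * 100.0, ratio);
        }
        return 0;
    }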

At 1.2GHz I’d expect the Tegra 2 3D to be the fastest SoC for the entirety of 2011. Once 2012 rolls around we’ll reset the clock as Qualcomm introduces its next-generation microprocessor architecture.

NVIDIA clearly has an execution advantage as it is the first SoC maker to ship an ARM Cortex A9. NVIDIA’s biggest weakness on the CPU side is the lack of NEON support in Tegra 2, something that doesn’t appear to be an issue today but could be a problem in the future depending on how widespread NEON code becomes in apps. TI’s OMAP 4 includes both a NEON unit and a wider memory bus, the latter could be a performance advantage depending on how well designed the memory controller is.

Qualcomm is a bit behind on the architecture side. The Scorpion core began shipping in Snapdragon SoCs at the end of 2008 and the architecture won’t be refreshed until late 2011/2012. As Intel discovered with NetBurst, four to five year runs for a single microprocessor architecture are generally too long. Luckily Qualcomm appears to be on a ~3 year cadence at this point.

The QSD8660 running at 1.2GHz should be sufficient to at least keep Qualcomm competitive until the Scorpion’s replacement arrives (although I suspect NVIDIA/TI will take the crown with their A9 designs). One aspect we haven’t talked about (mostly because there isn’t any good data available) is power consumption. It’s unclear how the Scorpion core compares to the Cortex A9 in terms of power consumption at the same process node.

Comments

  • matt b - Tuesday, February 8, 2011 - link

    Just curious because I've heard rumors that HP will use the Qualcomm chipset, and I've also heard rumors that they will stick with TI for their new tablets/phones. I just wondered if you know for sure, since I know that you met with folks at CES. I hope that we all find out tomorrow at the HP event.
    Great review.
  • TareX - Wednesday, February 9, 2011 - link

    I'd like to see Tegra 2 on the Xoom compared to Tegra 2 on the Optimus 2X.

    Why? Well, simply put, the only Android version that seems to be optimized for dual-core is Honeycomb.
  • Dark Legion - Wednesday, February 9, 2011 - link

    Why is there no Incredible on 2.2? I could understand if you had both 2.1 and 2.2, like the Evo, but as it is now it doesn't show the phone's full/current performance.
  • Morke - Thursday, February 10, 2011 - link

    "It’s a strange dichotomy that LG sets up with this launcher scheme that divides “downloaded” apps from “system applications,” one that’s made on no other Android device I’ve ever seen but the Optimus One. The end result is that most of the stuff I want (you know, since I just installed it) is at the very last page or very bottom of the list, resulting in guaranteed scrolling every single time. If you’re a power user, just replace the launcher with something else entirely."

    You are not quite right there.
    First, you can create additional categories (aside from system applications and downloads) and move applications between them.
    Secondly, you can rearrange the ordering of the applications inside a category (allowing you to keep the ones you access most frequently on top). You can also delete applications right away in this edit mode.

    There is a youtube video demonstrating this:
    http://www.youtube.com/watch?v=Dvvtl6pSNp8
    See time index starting with 4:21.

    Maybe you should correct your review on this?
  • Morke - Thursday, February 10, 2011 - link

    The correct youtube URL demonstrating application launcher management is actually
    http://www.youtube.com/watch?v=lDo-1-jwLko&fea...
  • brj_texas - Thursday, February 10, 2011 - link

    Anand,
    A question on the statement in the benchmarking section, "the SunSpider benchmark isn't explicitly multithreaded, although some of the tests within the benchmark will take advantage of more than one core. "

    My understanding was that all of the tests within sunspider are single-threaded, but a dual-core processor can run the javascript engine (and the sunspider tests) in a separate thread from the main browser thread when you call sunspider from a browser window.

    Can you clarify which tests support multi-threading in sunspider if that is in fact what you meant?

    On the topic of multi-threading, we've used moonbat, a multi-core variant of sunspider, to explicitly test multi-core performance with javascript code. I wonder if you have any other benchmarks under investigation that measure multi-core performance?
    Thanks

    -Brian
  • worldbfree4me - Saturday, February 12, 2011 - link

    Thanks for another thorough and in-depth analysis. But I have a question to ask,

    Should we upgrade (break our 2 year contract agreement for this phone) or ride out our contract?

    We trust and value your opinion. Tom's Hardware does a GPU hierarchy chart every few months; can you do a phone hierarchy in the future?
  • lashton - Sunday, February 13, 2011 - link

    They have a really good idea and lead the market, but it falls short because it's not quite right.
  • tnepres - Tuesday, April 5, 2011 - link

    I now own an Optimus 2X. The first was dead on arrival, but this one is perfect. The LG software is innovative and pleasing to the eye. In various places they made real improvements to the UI that are just brilliant, e.g. the ability to sort and categorize apps. At times the UI is not as fast as you would expect, especially when adding apps/widgets to one of the 7 pages. It seems LG generates a list of widgets for you, so you can see which apps support this mode, and that takes about a second. As I recall, on HTC devices you are just presented with a list of apps and you have to try each one to see if it can be used as a widget.

    The LG keyboard has a brilliant feature: you tap the side of the phone to move the cursor. Sadly, in other respects the keyboard is lacking, e.g. when you long-press you do not get the alternate characters you might wish for, such as numbers.

    The battery life is superb; using the UI consumes much less power than on my Desire.

    Copy/paste in the browser does not activate via long-press, you have to hit the menu button, but on the plus side it's easier to use than what HTC made.

    During 2 days of very intensive use I have had 1 app (partially) crash, and that was the Marketplace. No other issues so far; my verdict is that the instability issues are overrated.

    No problems with WiFi using the stock ISP-supplied (TDC) router (Sagemcom).

    To Engadget: how on earth (!!?!!?) can you state there is no use for dual-core? When browsing, one core loads Flash and the other loads the rest. It's so fast you can't believe it. Try loading www.ufc.com on a non-dual-core phone and you get my drift.

    I do not hesitate to give the Optimus 2X my warm recommendation.

    VERDICT: 9/10 (missing 4G)
  • Sannat - Thursday, May 12, 2011 - link

    GSMArena's sound benchmark for the Optimus 2X isn't great... could it be a s/w issue?
