NVIDIA's Tegra 2

The old definition of a computer is a microprocessor, memory, storage, input and output. The modern definition expands it a bit to include coprocessors (e.g. GPUs) as well as multiple types and levels of memory and storage. The type of input and output devices have changed as well. In smartphones keyboards are virtual and output is integrated into the phone. Although the definition of a computer has evolved, it’s not all that different.

In the old days, almost each one of these parts of a computer was a discrete component. You had a CPU, memory, a hard drive, a video processor (before they were GPUs) all independent of one another. Go back prior to the 486 and you’ll even find that your CPU had to rely on an external FPU for all floating point math.

Moore’s Law has given us bigger, faster, better in all of these areas. Intel’s 486 was the first million transistor x86 processor, introduced in 1989. Sandy Bridge, introduced in January, is just shy of a billion transistors. Sixteen megabytes of memory was a big deal 20 years ago, today high end PCs ship with several gigabytes of memory. Another side effect of Moore’s Law however is integration.

At first integration brought things like the FPU and a second level cache onto the processor die. Over time more components were brought in to the fold. AMD integrated the memory controller in its Athlon 64 processor. Intel brought graphics on-package with Clarkdale/Arrandale, and later on-die with Sandy Bridge. AMD is doing the same with Fusion.

In the smartphone space, the integration is even more pronounced. With physical space and power as major constraints, smartphone chip makers have been forced to further trade performance for integration. The level of integration is so high within a smartphone that you almost never hear about what CPU a phone uses, but rather what Application Processor it uses otherwise known as an SoC (System on Chip or System on a Chip).


Package on Package (DRAM on top, SoC on bottom)—source: statschippack.com

Integrate a CPU, GPU, memory controller, video decoder, audio decoder, video encoder (sometimes), camera processor, system memory and maybe even a modem onto a single chip and you’ve got something that can only be described as a System on a Chip. It’s a single physical chip that integrates nearly all of the functions of the entire computer. Nearly all of the aforementioned components are on a single piece of silicon, with the exception of any integrated memory. To save board real estate and enable smaller form factors, it’s not uncommon to stack DRAM on top of the SoC package instead of beside it. The SoC in a PoP (Package on Package) stack has contacts on its top surface that line up with the balls on the DRAM for power and signaling. PoP stacks work because the SoC underneath doesn’t dissipate much heat and thus doesn’t mind being insulated by some DRAM up top.

Examples of SoCs are Qualcomm’s Snapdragon, Texas Instruments’ OMAP 4 and of course the subject at hand, NVIDIA’s Tegra 2. Although this is a review of LG’s Optimus 2X, it’s just as much a review of NVIDIA’s Tegra 2.

Tegra 2: The SoC

As a System on a Chip, NVIDIA’s Tegra 2 has a number of processors that make up the whole. Having its roots in the PC industry and being used to briefing inquisitive press, NVIDIA put together this handy die shot that shows the various parts of the Tegra 2:

With the exception of two blocks, the Tegra 2 die is entirely NVIDIA’s own creation. The A7 and Cortex A9 blocks are IP licensed from ARM. The entire die is manufactured at TSMC on a 40nm process, similar to NVIDIA’s high end GPUs. While its GPUs are built on TSMC’s 40nm “G” process, Tegra 2 is a slightly different beast.

Most foundries offer two variants of the same manufacturing process: general purpose (G) and low power (LP). The feature size is the same, however the transistors are tuned differently. TSMC’s general purpose process transistors are very fast and low voltage, unfortunately they have very high leakage current. Transistors, as you may know, are electrical on/off switches. Apply a voltage to them and current flows, remove the voltage and current stops flowing. In reality sometimes current flows when you don’t want it to, and this is referred to as leakage. High leakage is a side effect of the nice high performance transistors we need to run the fastest processors.

TSMC’s 40nm LP process uses lower voltage, slower switching transistors (can’t run at as high of a clock speed) that, as a result, have very low leakage characteristics. The lower your leakage and the lower your voltage, the lower your overall power is.

For an SoC, you’d assume that the whole thing would be built at the 40nm LP process. See those two Cortex A9 cores in the diagram above? Remember how they’re licensed from ARM? Those things are pretty high performance, they run at 1GHz. Everything else in the chip runs at 300MHz or less for the most part. On top of that, the libraries ARM provides are optimized for TSMC’s 40nm G process.

As a result, Tegra 2 uses a mixture of G and LP 40nm transistors on two separate voltage rails. The two Cortex A9 cores and the L2 cache are built using TSMC’s 40nm G process transistors, while the rest of the SoC (including the GPU) is built using 40nm LP transistors. The pair of A9s can be powered down together although not independently. We’ll get to a deeper discussion of the ARM Cortex A9 shortly.

The ISP (Image Signal Processor), located in the upper left of the die shot above, is responsible for taking the output from a camera (still/video) controller and processing into a usable video state. The Tegra 2 ISP is capable of processing images at a rate of 80 megapixels per second. The ISP supports two cameras, a 12 megapixel primary and a 5 megapixel secondary. The math works out to a maximum of 6 frames per second captured from the primary sensor at 12MP. LG uses the Tegra 2’s ISP to enable a 6 fps burst mode as you’ll see later on in the review, unfortunately it only works at a 2MP resolution. NVIDIA’s strong ISP looks better on paper than in practice it seems.

The video encode processor does real time H.264 video encoding, which is used when capturing video from the camera sensor. NVIDIA doesn’t provide any specs on what the encoder is capable of, but we’re not too impressed with the quality of its output (again, you’ll see more later).

The audio processor is dedicated hardware for audio encoding and decoding. This is used for audio capture as well as audio playback. Even MP3 playback is done on the dedicated audio processor so the Cortex A9s remain powered down to maximize battery life.

The Tegra 2 video decoder can fully accelerate the decode of 1080p H.264 Baseline profile videos at up to 20Mbps. The specs mostly look good on paper as you won’t be playing anything near that bitrate on your smartphone. NVIDIA includes dual-display capabilities with Tegra 2. The SoC can output the frame buffer to the smartphone’s display as well as an external display via HDMI out.

The ARM A7 nestled in between the video decoder and the L2 cache is used as a system management core. It handles communication between blocks, power management and general SoC management. The remaining blocks (outside of I/O) are the two CPU cores and the GPU. Those both require a lot more detail.

LG Optimus 2X Hardware Impressions The CPU Comparison: NVIDIA, TI & Qualcomm in 2011
POST A COMMENT

75 Comments

View All Comments

  • GoodRevrnd - Tuesday, February 08, 2011 - link

    TV link would be awesome, but why would you need the phone to bridge the TV and network?? Reply
  • aegisofrime - Monday, February 07, 2011 - link

    May I suggest x264 encoding as a test of the CPU power? There's a version of x264 available for ARM chips, along with NEON optimizations. Should be interesting! Reply
  • Shadowmaster625 - Monday, February 07, 2011 - link

    What is the point in having a high performance video processor when you cannot do the two things that actually make use of it? Those two things are: 1. Watch any movie in your collection without transcoding? (FAIL) 2. Play games. No actual buttons = FAIL. If you think otherwise then you dont actually play games. Just stick with facebook flash trash. Reply
  • TareX - Wednesday, February 09, 2011 - link

    The only reason I'd pay for a dual core phone is smooth flash-enabled web browsing, not gaming. Reply
  • zorxd - Monday, February 07, 2011 - link

    Stock Android has it too. There is also E for EDGE and G for GPRS. Reply
  • Exophase - Monday, February 07, 2011 - link

    Hey Anand/Brian,

    There are some issues I've found with some information in this article:

    1) You mention that Cortex-A8 is available in a multicore configuration. I'm pretty sure there's no such thing; you might be thinking of ARM11MPCore.

    2) The floating point latencies table is just way off for NEON. You can find latencies here:
    http://infocenter.arm.com/help/index.jsp?topic=/co...
    It's the same in Cortex-A9. The table is a little hard to read; you have to look at the result and writeback stages to determine the latency (it's easier to read the A9 version). Here's the breakdown:
    FADD/FSUB/FMUL: 5 cycles
    FMAC: 9 cycles (note that this is because the result of the FMUL pipeline is then threaded through the FADD pipeline)
    The table also implies Cortex-A9 adds divide and sqrt instructions to NEON. In actuality, both support reciprocal approximation instructions in SIMD and full versions in scalar. The approximation instructions have both initial approximation with ~9 bits of precision and Newton Rhapson step instructions. The step instructions function like FMACs and have similar latencies. This kind of begs the question of where the A9 NEON DIV and SQRT numbers came from.

    The other issue I have with these numbers is that it only mentions latency and not throughput. The main issue is that the non-pipelined Cortex-A8 FPU has throughput almost as bad as its latency, while all of the other implementations have single cycle throughput for 2x 64-bit operations. Maybe throughput is what you mean by "minimum latency", however this would imply that Cortex-A9 VFP can't issue every cycle, which isn't the case.

    3) It's obvious from the GLBenchmark 2.0 Pro screenshot that there are some serious color limitations from Tegra 2 (look at the woman's face). This is probably due to using 16-bit. IMG has a major advantage in this area since it renders at full 32-bit (or better) precision internally and can dither the result to 16-bit to the framebuffer, which looks surprisingly similar in quality to non-dithered 32-bit. This makes a 16-bit vs 16-bit framebuffer comparison between the two very unbalanced - it's far more fair to just do both at 32-bit, but it doesn't look like the benchmark has any option for it. Furthermore, Tegra 2 is limited to 16-bit (optionally non-linear) depth buffers, while IMG utilizes 32-bit floating point depth internally. This is always going to be a disadvantage for Tegra 2 and is definitely worth mentioning in any comparison.

    Finally I feel like ranting a little bit about your use of the Android Linpack test. Anyone with a little common sense can tell that a native implementation of Linpack on these devices will yield several dozen times more than 40MFLOPS (should be closer to 1-4 FLOP/CPU cycle). What you see here is a blatant example of Dalvik's extreme inability to perform with floating point code that extends well beyond an inability to perform SIMD vectorization.
    Reply
  • metafor - Monday, February 07, 2011 - link

    According to the developer of Linpack on Android:

    http://www.greenecomputing.com/category/android/

    It is mostly FP64 calculations done on Dalvik. While this may not be the fastest way to go about doing linear algebra, it is a fairly good representation of relative FP64 performance (which only exist in VFP).

    And let's face it, few app developers are going to dig into Android's NDK and write NEON optimized code.
    Reply
  • Exophase - Monday, February 07, 2011 - link

    Then let's ask this instead: who really cares about FP64 performance on a smartphone? I'd also argue that it is not even a good representation of relative FP64 performance since that's being obscured so much by the quality of the JITed code. Hence why you see Scorpion and A9 perform a little over twice as fast as A8 (per-clock) instead of several times faster. VFP is still in-order on Cortex-A9, competent scheduling matters.

    Maybe a lot of developers won't write NEON code on Android, but where it's written it could very well matter. For one thing, in Android itself. And theoretically one day Dalvik could actually be generating NEON competently.. so some synthetic tests of NEON could be a good look at what could be.
    Reply
  • metafor - Monday, February 07, 2011 - link

    Well, few people really :)

    Linpack as it currently exists on Android probably doesn't tell very much at all. But if you're just going to slap together an FP heavy app (pocket scientific computing anyone?) and aren't a professional programmer, this likely represents the result you see.

    I wouldn't mind seeing SpecFP ported natively to Android and running NEON. But alas, we'd need someone to roll up their sleeves and do that.

    I did do a native compile of Linpack using gcc to test on my Evo, though. It's still not SIMD code, of course, but native results using VFP were around the 70-80MFLOPS mark. Of course, it's scheduling for the A8's FPU and not Scorpion's.
    Reply
  • Anand Lal Shimpi - Monday, February 07, 2011 - link

    Thanks for your comment :)

    1) You're very right, I was thinking about the ARM11 - fixed :)

    2) Make that 2 for 2. You're right on the NEON values, I mistakenly grabbed the values from the cycles column and not the result column. The DIV/SQRT columns were also incorrect, I removed them from the article.

    I mentioned the lack of pipelining in the A8 FPU earlier in the article but I reiterated it underneath the table to hammer the point home. I agree that the lack of pipelining is the major reason for the A8's poor FP performance.

    3) Those screenshots were actually taken on IMG hardware. IMG has some pretty serious rendering issues running GLBenchmark 2.0.

    4) I'm not happy with the current state of Android benchmarks - Linpack included. Right now we're simply including everything we can get our hands on, but over the next 24 months I think you'll see us narrow the list and introduce more benchmarks that are representative of real world performance as well as contribute to meaningful architecture analysis.

    Take care,
    Anand
    Reply

Log in

Don't have an account? Sign up now