Performance

When I sat down with the 2X at CES, naturally the first thing we did was run our usual suite of benchmarking tools on the phone. The phone was running Android 2.2.1 then, and even though the numbers were good, LG hadn’t quite finalized the software and didn’t think those numbers were representative. We didn’t publish, but knew that performance was good.

Obviously now we don’t have such limitations, and won’t keep you waiting any longer to see how Tegra 2 compares to other phones we’ve benchmarked already. We’ve already talked about performance a bit in our initial preview from back when we got the 2X, but there’s obviously a lot more we have now.

Before we start that discussion however, we need to talk about multithreading in Android. Android itself already is multithreaded natively, in fact, that’s part of delivering speedy UI. The idea is to render the UI using one thread and distribute slow tasks into a background threads as necessary. In the best case multithreaded scenario on Android, the main thread communicates to child threads using a handler class, and hums along until they come back with results and messages. It’s nothing new from a UI perspective—keep the thread drawing the screen active and speedy while longer processes run in the background. The even better part is that multiprocessor smartphones can immediately take advantage of multiple cores and distribute threads appropriately with Android. That said, Android 3.x (Honeycomb) brings a much tighter focus on multithreading and bringing things like garbage collecting off of the first CPU and onto the second. In case you haven't figured it out by now, Android releases generally pair with and are tailored to a specific SoC. If you had to assign those, it'd look something like this: 2.0-2.1—TI OMAP3, 2.2—Qualcomm Snapdragon, 2.3—Samsung Hummingbird, 3.0—Tegra 2.

Back to the point however, the same caveats we saw with multithreading on the PC apply in the mobile space. Applications need to be developed with the expressed intent of being multithreaded to feel faster. The big question on everyone's mind is whether Android 2.2.x can take advantage of those multiple cores. Turns out, the answer is yes.

First off, we can check that Android 2.2.1 on the 2X is indeed seeing the two Cortex-A9 cores by checking dmesg, which thankfully is quite easy to do over adb shell after a fresh boot. Sure enough, inside we can see two cores being brought up during boot by the kernel:

<4>[  118.962880] CPU1: Booted secondary processor
<6>[  118.962989] Brought up 2 CPUs
<6>[  118.963003] SMP: Total of 2 processors activated (3997.69 BogoMIPS).
<7>[  118.963025] CPU0 attaching sched-domain:
<7>[  118.963036]  domain 0: span 0-1 level CPU
<7>[  118.963046]   groups: 0 1
<7>[  118.963063] CPU1 attaching sched-domain:
<7>[  118.963072]  domain 0: span 0-1 level CPU
<7>[  118.963079]   groups: 1 0
<6>[  118.986650] regulator: core version 0.5

The 2X runs the same 2.6.32.9 linux kernel common to all of Android 2.2.x, but in a different mode. Check out the first line of dmesg from the Nexus One:

<5>[ 0.000000] Linux version 2.6.32.9-27240-gbca5320 (android-build@apa26.mtv.corp.google.com) (gcc version 4.4.0 (GCC) ) #1 PREEMPT Tue Aug 10 16:42:38 PDT 2010

Compare that to the 2X:

<5>[ 0.000000] Linux version 2.6.32.9 (sp9pm_9@sp9pm2pl3) (gcc version 4.4.0 (GCC) ) #1 SMP PREEMPT Sun Jan 16 20:58:43 KST 2011

The major difference is the inclusion of “SMP” which shows definitively that Symmetric Multi-Processor support is enabled on the kernel, which means the entire platform can use both CPUs. PREEMPT of course shows that kernel preemption is enabled, which both have turned on. Again, having a kernel that supports multithreading isn’t going to magically make everything faster, but it lets applications with multiple threads automatically spread them out across multiple cores.

Though there are task managers on Android, seeing how many threads a given process has running isn’t quite as easy as it is on the desktop, however there still are ways of gauging multithreading. The two tools we have are both checking “dumpsys cpuinfo” from over adb shell, and simply looking at the historical CPU use reported in a monitoring program we use called System Panel which likely looks at the same thing.

The other interesting gem we can glean from the dmesg output are the clocks NVIDIA has set for most of the interesting bits of Tegra 2 in the 2X. There’s a section of output during boot which looks like the following:

<4>[  119.026337] ADJUSTED CLOCKS:
<4>[  119.026354] MC clock is set to 300000 KHz
<4>[  119.026365] EMC clock is set to 600000 KHz (DDR clock is at 300000 KHz)
<4>[  119.026373] PLLX0 clock is set to 1000000 KHz
<4>[  119.026379] PLLC0 clock is set to 600000 KHz
<4>[  119.026385] CPU clock is set to 1000000 KHz
<4>[  119.026391] System and AVP clock is set to 240000 KHz
<4>[  119.026400] GraphicsHost clock is set to 100000 KHz
<4>[  119.026408] 3D clock is set to 100000 KHz
<4>[  119.026415] 2D clock is set to 100000 KHz
<4>[  119.026423] Epp clock is set to 100000 KHz
<4>[  119.026430] Mpe clock is set to 100000 KHz
<4>[  119.026436] Vde clock is set to 240000 KHz

We can see the CPU set to 1 GHz, but the interesting bits are that LPDDR2 runs at 300 MHz (thus with DDR we get to 600 MHz), and the GPU is clocked at a relatively conservative 100 MHz, compared to the majority of PowerVR GPUs which run somewhere around 200 MHz. It turns out that AP20H lets the GPU clock up to 300 MHz under load as we'll discuss later. The other clocks are a bit more mysterious, Vde could be the Video Decode Engine, Mpe could be Media Processing Engine (which is odd since Tegra 2 uses the A9 FPU instead of MPE), the others are even less clear.

The other interesting bit is how RAM is allocated on the 2X—there’s 512 MB of it, of which 384 MB is accessible by applications and Android. 128 MB is dedicated entirely to the GPU. You can pull that directly out of some other dmesg trickery as well:

mem=383M@0M nvmem=128M@384M

The first 384 are for general RAM, the remaining 128 MB is NVIDIA memory which we can only assume is dedicated entirely to the GPU.

The Partners and the Landscape Performance: Tegra 2 Benchmarked
POST A COMMENT

75 Comments

View All Comments

  • GoodRevrnd - Tuesday, February 8, 2011 - link

    TV link would be awesome, but why would you need the phone to bridge the TV and network?? Reply
  • aegisofrime - Monday, February 7, 2011 - link

    May I suggest x264 encoding as a test of the CPU power? There's a version of x264 available for ARM chips, along with NEON optimizations. Should be interesting! Reply
  • Shadowmaster625 - Monday, February 7, 2011 - link

    What is the point in having a high performance video processor when you cannot do the two things that actually make use of it? Those two things are: 1. Watch any movie in your collection without transcoding? (FAIL) 2. Play games. No actual buttons = FAIL. If you think otherwise then you dont actually play games. Just stick with facebook flash trash. Reply
  • TareX - Wednesday, February 9, 2011 - link

    The only reason I'd pay for a dual core phone is smooth flash-enabled web browsing, not gaming. Reply
  • zorxd - Monday, February 7, 2011 - link

    Stock Android has it too. There is also E for EDGE and G for GPRS. Reply
  • Exophase - Monday, February 7, 2011 - link

    Hey Anand/Brian,

    There are some issues I've found with some information in this article:

    1) You mention that Cortex-A8 is available in a multicore configuration. I'm pretty sure there's no such thing; you might be thinking of ARM11MPCore.

    2) The floating point latencies table is just way off for NEON. You can find latencies here:
    http://infocenter.arm.com/help/index.jsp?topic=/co...
    It's the same in Cortex-A9. The table is a little hard to read; you have to look at the result and writeback stages to determine the latency (it's easier to read the A9 version). Here's the breakdown:
    FADD/FSUB/FMUL: 5 cycles
    FMAC: 9 cycles (note that this is because the result of the FMUL pipeline is then threaded through the FADD pipeline)
    The table also implies Cortex-A9 adds divide and sqrt instructions to NEON. In actuality, both support reciprocal approximation instructions in SIMD and full versions in scalar. The approximation instructions have both initial approximation with ~9 bits of precision and Newton Rhapson step instructions. The step instructions function like FMACs and have similar latencies. This kind of begs the question of where the A9 NEON DIV and SQRT numbers came from.

    The other issue I have with these numbers is that it only mentions latency and not throughput. The main issue is that the non-pipelined Cortex-A8 FPU has throughput almost as bad as its latency, while all of the other implementations have single cycle throughput for 2x 64-bit operations. Maybe throughput is what you mean by "minimum latency", however this would imply that Cortex-A9 VFP can't issue every cycle, which isn't the case.

    3) It's obvious from the GLBenchmark 2.0 Pro screenshot that there are some serious color limitations from Tegra 2 (look at the woman's face). This is probably due to using 16-bit. IMG has a major advantage in this area since it renders at full 32-bit (or better) precision internally and can dither the result to 16-bit to the framebuffer, which looks surprisingly similar in quality to non-dithered 32-bit. This makes a 16-bit vs 16-bit framebuffer comparison between the two very unbalanced - it's far more fair to just do both at 32-bit, but it doesn't look like the benchmark has any option for it. Furthermore, Tegra 2 is limited to 16-bit (optionally non-linear) depth buffers, while IMG utilizes 32-bit floating point depth internally. This is always going to be a disadvantage for Tegra 2 and is definitely worth mentioning in any comparison.

    Finally I feel like ranting a little bit about your use of the Android Linpack test. Anyone with a little common sense can tell that a native implementation of Linpack on these devices will yield several dozen times more than 40MFLOPS (should be closer to 1-4 FLOP/CPU cycle). What you see here is a blatant example of Dalvik's extreme inability to perform with floating point code that extends well beyond an inability to perform SIMD vectorization.
    Reply
  • metafor - Monday, February 7, 2011 - link

    According to the developer of Linpack on Android:

    http://www.greenecomputing.com/category/android/

    It is mostly FP64 calculations done on Dalvik. While this may not be the fastest way to go about doing linear algebra, it is a fairly good representation of relative FP64 performance (which only exist in VFP).

    And let's face it, few app developers are going to dig into Android's NDK and write NEON optimized code.
    Reply
  • Exophase - Monday, February 7, 2011 - link

    Then let's ask this instead: who really cares about FP64 performance on a smartphone? I'd also argue that it is not even a good representation of relative FP64 performance since that's being obscured so much by the quality of the JITed code. Hence why you see Scorpion and A9 perform a little over twice as fast as A8 (per-clock) instead of several times faster. VFP is still in-order on Cortex-A9, competent scheduling matters.

    Maybe a lot of developers won't write NEON code on Android, but where it's written it could very well matter. For one thing, in Android itself. And theoretically one day Dalvik could actually be generating NEON competently.. so some synthetic tests of NEON could be a good look at what could be.
    Reply
  • metafor - Monday, February 7, 2011 - link

    Well, few people really :)

    Linpack as it currently exists on Android probably doesn't tell very much at all. But if you're just going to slap together an FP heavy app (pocket scientific computing anyone?) and aren't a professional programmer, this likely represents the result you see.

    I wouldn't mind seeing SpecFP ported natively to Android and running NEON. But alas, we'd need someone to roll up their sleeves and do that.

    I did do a native compile of Linpack using gcc to test on my Evo, though. It's still not SIMD code, of course, but native results using VFP were around the 70-80MFLOPS mark. Of course, it's scheduling for the A8's FPU and not Scorpion's.
    Reply
  • Anand Lal Shimpi - Monday, February 7, 2011 - link

    Thanks for your comment :)

    1) You're very right, I was thinking about the ARM11 - fixed :)

    2) Make that 2 for 2. You're right on the NEON values, I mistakenly grabbed the values from the cycles column and not the result column. The DIV/SQRT columns were also incorrect, I removed them from the article.

    I mentioned the lack of pipelining in the A8 FPU earlier in the article but I reiterated it underneath the table to hammer the point home. I agree that the lack of pipelining is the major reason for the A8's poor FP performance.

    3) Those screenshots were actually taken on IMG hardware. IMG has some pretty serious rendering issues running GLBenchmark 2.0.

    4) I'm not happy with the current state of Android benchmarks - Linpack included. Right now we're simply including everything we can get our hands on, but over the next 24 months I think you'll see us narrow the list and introduce more benchmarks that are representative of real world performance as well as contribute to meaningful architecture analysis.

    Take care,
    Anand
    Reply

Log in

Don't have an account? Sign up now