ARM told us to expect some of the first 64-bit ARMv8 based SoCs to ship in 2014, and it looks like we're seeing just that. Today Qualcomm is officially announcing its first 64-bit SoC: the Snapdragon 410 (MSM8916). 

Given that there's no 64-bit Android available at this point, most of the pressure to go to 64-bit in the Android space is actually coming from OEMs, who view 64-bit support as a necessary checkbox feature thanks to Apple's move with the A7. Combine that with the fact that the most ready 64-bit IP from ARM is the Cortex A53 (successor to the Cortex A5/A7 line), and all of a sudden it makes sense why Qualcomm's first 64-bit mobile SoC is aimed at the mainstream market (Snapdragon 400 instead of 600/800).

I'll get to explaining ARM's Cortex A53 in a moment, but first let's look at the specs of the SoC:

Qualcomm Snapdragon 410
Internal Model Number    MSM8916
Manufacturing Process    28nm LP
CPU                      4 x ARM Cortex A53 1.2GHz+
GPU                      Qualcomm Adreno 306
Memory Interface         1 x 64-bit LPDDR2/3
Integrated Modem         9x25 core, LTE Category 4, DC-HSPA+

At a high level we're talking about four ARM Cortex A53 cores, likely running at around 1.2 - 1.4GHz. Having four cores unfortunately still seems to be a requirement for OEMs in many emerging markets, although I'd personally much rather see two higher clocked A53s. Qualcomm said the following about 64-bit in its 410 press release:

"The Snapdragon 410 chipset will also be the first of many 64-bit capable processors as Qualcomm Technologies helps lead the transition of the mobile ecosystem to 64-bit processing."

Keep in mind that Qualcomm presently uses a mix of ARM and custom developed cores in its lineup. The Snapdragon 400 line already includes ARM (Cortex A7) and Krait based designs, so the move to Cortex A53 in the Snapdragon 410 isn't unprecedented. It will be very interesting to see what happens in the higher-end SKUs. I don't expect that Qualcomm will want a split between 32-bit and 64-bit designs, which means we'll either see a 64-bit Krait successor this year or we'll see more designs that leverage ARM IP in the interim.

As you'll see from my notes below however, ARM's Cortex A53 looks like a really good choice for Qualcomm. It's an extremely power efficient design that should be significantly faster than the Cortex A5/A7s we've seen Qualcomm use in this class of SoC in the past.

The Cortex A53 CPU cores are paired with an Adreno 306 GPU, a variant of the Adreno 305 used in Snapdragon 400 based SoCs (MSM8x28/8x26).

The Snapdragon 410 also features an updated ISP compared to previous 400 offerings, adding support for up to a 13MP primary camera (no word on max throughput however).

Snapdragon 410 also integrates a Qualcomm 9x25 based LTE modem block (also included in the Snapdragon 800/MSM8974), featuring support for LTE Category 4, DC-HSPA+ and the usual legacy 3G air interfaces.

All of these IP blocks sit behind a single-channel 64-bit LPDDR2/3 memory interface.
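For a rough sense of what a single-channel 64-bit interface can deliver, peak theoretical bandwidth is just bus width times transfer rate. Qualcomm hasn't quoted memory speeds here, so the data rates in the sketch below are hypothetical examples chosen for illustration, not Snapdragon 410 specifications, and the function name is made up for this example.

```python
def peak_bandwidth_gb_s(bus_width_bits, mega_transfers_per_s, channels=1):
    """Peak theoretical bandwidth in GB/s (1 GB = 1e9 bytes).

    bus_width_bits:       width of one channel in bits (64 for this SoC)
    mega_transfers_per_s: data rate in MT/s (hypothetical here, see below)
    channels:             number of independent channels (1 for this SoC)
    """
    bytes_per_transfer = bus_width_bits / 8
    return channels * bytes_per_transfer * mega_transfers_per_s * 1e6 / 1e9


# Hypothetical data rates, NOT confirmed Snapdragon 410 figures.
for label, rate in (("assumed 800 MT/s", 800), ("assumed 1066 MT/s", 1066)):
    print(f"{label}: {peak_bandwidth_gb_s(64, rate):.2f} GB/s peak")
```

Real-world throughput will land below these peaks, and the CPU, GPU, ISP and modem all share the same channel.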

The SoC is built on a 28nm LP process and will be sampling in the first half of 2014, with devices shipping in the second half of 2014. Given its relatively aggressive schedule, the Snapdragon 410 may be one of the first (if not the first) Cortex A53 based SoCs in the market. 

A Brief Look at ARM's Cortex A53

ARM's Cortex A53 is a dual-issue in-order design, similar to the Cortex A7. Although the machine width is unchanged, the A53 is far more flexible in how instructions can be co-issued compared to the Cortex A7 (e.g. branch, data processing, load-store, & FP/NEON all dual-issue from both decode paths). 

The A53 is fully ISA compatible with the upcoming Cortex A57, making the A53 the first ARMv8 LITTLE processor (for use in big.LITTLE configurations with an A57).

The overall pipeline depth hasn't changed compared to the Cortex A7. We're still dealing with an 8-stage pipeline (3-stage fetch pipeline + 5-stage decode/execute for integer, or 7 stages for NEON/FP). The vast majority of instructions will execute in one cycle, leaving branch prediction as a big lever for increasing performance. ARM significantly increased branch prediction accuracy with the Cortex A53, so much so that the improved predictor was also leveraged in the dual-issue, out-of-order Cortex A12. ARM also improved the back end a bit, increasing datapath throughput.
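A back-of-the-envelope CPI model shows why branch prediction is such a big lever when most instructions already execute in one cycle. The sketch below is illustrative only: the base CPI, branch frequency, misprediction rates and the 8-cycle flush penalty (taken here as roughly the pipeline depth) are all assumptions, not figures from ARM.

```python
def effective_cpi(base_cpi, branch_fraction, mispredict_rate, penalty_cycles):
    """Cycles per instruction once branch misprediction stalls are added.

    Every mispredicted branch is charged a fixed flush penalty on top of
    the base CPI; all other stall sources are ignored in this toy model.
    """
    return base_cpi + branch_fraction * mispredict_rate * penalty_cycles


def speedup_from_better_prediction(old_rate, new_rate, base_cpi=1.0,
                                   branch_fraction=0.2, penalty_cycles=8):
    """Speedup from lowering the misprediction rate, all else equal.

    The defaults are hypothetical: 1 instruction per cycle, 20% branches,
    and an 8-cycle penalty per mispredicted branch.
    """
    old = effective_cpi(base_cpi, branch_fraction, old_rate, penalty_cycles)
    new = effective_cpi(base_cpi, branch_fraction, new_rate, penalty_cycles)
    return old / new


# Hypothetical example: halving the misprediction rate from 10% to 5%.
print(f"{speedup_from_better_prediction(0.10, 0.05):.3f}x")
```

Under these assumed numbers the gain is several percent with no change to pipeline width or depth, which is the kind of improvement a better predictor buys on an in-order design.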

The result of all of this is a dual-issue design that's pushed pretty much as far as you can without going out-of-order. Below are some core-level performance numbers, all taken in AArch32 mode, comparing the Cortex A53 to its A5/A7 competitors:

Core Level Performance Comparison (all cores running at 1.2GHz)
Core                  DMIPS    CoreMark    SPECint2000
ARM Cortex A5         1920     -           350
ARM Cortex A7         2280     3840        420
ARM Cortex A9 r4p1    -        -           468
ARM Cortex A53        2760     4440        600

Even ignoring any uplift from new instructions or 64-bit, the Cortex A53 is going to be substantially faster than its predecessors. I threw in hypothetical SPECint2000 numbers for a 1.2GHz Cortex A9 to put the A53's performance in even better perspective. You should expect to see better performance than a Cortex A9 r4 at the same frequency, but the A9 r4 is expected to hit much higher frequencies (e.g. 2.3GHz for Cortex A9 r4p1 in NVIDIA's Tegra 4i).
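To put the table in relative terms, the short sketch below computes the A53's scores as ratios against the other cores, plus DMIPS/MHz at the common 1.2GHz clock. The scores are copied from the table above; the dictionary layout and function names are just illustrative, and the linear clock scaling used for the 2.3GHz A9 estimate is a simplifying assumption (real workloads rarely scale perfectly with frequency).

```python
CLOCK_MHZ = 1200  # all cores in the table run at 1.2GHz

# Scores from the core-level comparison table; None = not provided.
SCORES = {
    "Cortex A5":      {"DMIPS": 1920, "CoreMark": None, "SPECint2000": 350},
    "Cortex A7":      {"DMIPS": 2280, "CoreMark": 3840, "SPECint2000": 420},
    "Cortex A9 r4p1": {"DMIPS": None, "CoreMark": None, "SPECint2000": 468},
    "Cortex A53":     {"DMIPS": 2760, "CoreMark": 4440, "SPECint2000": 600},
}


def speedup(core, baseline, benchmark):
    """Ratio of core to baseline on one benchmark, or None if a score is missing."""
    a = SCORES[core][benchmark]
    b = SCORES[baseline][benchmark]
    return None if a is None or b is None else a / b


def dmips_per_mhz(core):
    """DMIPS normalized by clock, or None if no DMIPS score was provided."""
    d = SCORES[core]["DMIPS"]
    return None if d is None else d / CLOCK_MHZ


def scaled_score(core, benchmark, target_mhz):
    """Estimate a score at another clock, ASSUMING perfectly linear scaling."""
    s = SCORES[core][benchmark]
    return None if s is None else s * target_mhz / CLOCK_MHZ


for bench in ("DMIPS", "CoreMark", "SPECint2000"):
    print(f"A53 vs A7, {bench}: {speedup('Cortex A53', 'Cortex A7', bench):.2f}x")
print(f"A53 vs A9 r4p1, SPECint2000: "
      f"{speedup('Cortex A53', 'Cortex A9 r4p1', 'SPECint2000'):.2f}x")
print(f"A53 DMIPS/MHz: {dmips_per_mhz('Cortex A53'):.2f}")
print(f"A9 r4p1 at 2.3GHz (linear estimate): "
      f"{scaled_score('Cortex A9 r4p1', 'SPECint2000', 2300):.0f}")
```

At equal clocks the A53 leads the A9 r4p1 on SPECint2000, but under the linear-scaling assumption a 2.3GHz A9 would still come out well ahead of a 1.2GHz A53.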

ARM included a number of power efficiency improvements and is targeting 130mW single-core power consumption at 28nm HPM (running SPECint 2000). I'd expect slightly higher power consumption at 28nm LP but we're still talking about an extremely low power design.

I'm really excited to see what ARM's Cortex A53 can do. It's a potent little architecture, one that I wish we'd see taken to higher clock speeds and maybe even used in higher end devices at the same time. The most obvious fit for these cores however is something like the Moto G, which presently uses the 32-bit Cortex A7. Given Qualcomm's schedule, I wouldn't be surprised to see something like a Moto G update late next year with a Snapdragon 410 inside. Adding LTE and four Cortex A53s would really make that the value smartphone to beat.

95 Comments

  • Wilco1 - Monday, December 09, 2013

    No, the amount of logic involved doing the actual addition is a small proportion of the total involved in the execution of a single instruction. So a 64-bit addition might use maybe 5% more power than a 32-bit addition, not twice as much.
  • Exophase - Tuesday, December 10, 2013

    It's moot anyway (of course you're well aware of this, just explaining for everyone else): AArch64 has 32-bit arithmetic operations and most code is limited to 32-bit integers outside of pointer arithmetic.
  • ciplogic - Wednesday, December 11, 2013

    Dan, most CPU logic is not in math, but in a lot of other components like: CPU cache (which is sometimes more than 1/2 of the entire CPU), branch prediction, memory addressing unit, etc. Also, when you use 64 bit CPUs, the code is still using 32 bit integers, making the transistor count the same. Without knowing the full specifics, most 64 bit integer operations can be implemented by using 32 bit integer math, so the extra added logic can be reduced even further (as an uncommon path).

    Are more registers faster? Oh, yeah. By a large amount because the registers run like 4 times faster than L1 cache (or even more), like 10-20 times faster than L2 cache, and the L2 cache is typically 10x faster than the memory access. A compiler that can have 2x more registers on the target CPU will likely give code that is not 4 times faster, but a 30-50% speedup is doable in a lot of real code. LLVM (the main backend optimizer) reported that improving the register allocation by 10% gave a speedup of up to 20% http://bit.ly/1d7B3aw
  • xdrol - Monday, December 09, 2013

    That would sound nice, but you miss the point that ARMv8 has a 32 bit mode that is compatible with ARMv7 (and transitively with older ARM ISAs). So they cannot "wipe that slate clean" at all, everything has to be there.
  • Wilco1 - Monday, December 09, 2013

    More registers are generally better indeed, however the gain from 14 to 31 is not that large - studies indicated around 20-24 is optimal. Note there are drawbacks as well to having more registers such as a slower process switch.

    The A53 includes all 32-bit instructions, so can run all existing binaries. So nothing has been ditched at all. The power savings are not due to being 64-bit and not due to the new ISA either. The efficiency improvements are simply due to it being newer and better than its predecessors (if it had been 32-bit then any gains would be the same).

    64-bit code will often run a little faster than 32-bit but not hugely so. While the 64-bit ISA allows for power savings in decoding, 64-bit pointers and registers increase power slightly, so which effect is larger will depend on each particular application.
  • Exophase - Tuesday, December 10, 2013

    Do you have links for any studies outside of the ones AMD did when evaluating x86-64? Those are good for a single data point, but they're a bit limited given that they were specific to x86-64 and a relatively wide OoO uarch. In-order uarchs, in comparison, benefit from code that's more aggressively scheduled to hide latency which increases register pressure.
  • Wilco1 - Tuesday, December 10, 2013

    I was thinking of the original RISC studies for MIPS and SPARC. They are old now but I confirmed those results for ARM - basically the benefit of each extra register goes down exponentially. If you have a good pressure-aware scheduler (few compilers get it right...) then you only need a few extra registers.
  • Exophase - Tuesday, December 10, 2013

    Thanks for the clarification. I'd say that even a sweet spot of 20-24 justifies 31 GPRs (plus SP). I'd also argue that research done with the original MIPS and SPARC isn't perfectly representative of something like Cortex-A53 either. In my experience, going from hand coding ARM9 to Cortex-A8 assembly presented a lot of new challenges in scheduling which absolutely increased register pressure. Dual issue means you have to hide more instructions in a similar latency, and generally more latencies were added, like for address generation or shifts. The original 5 stage RISC CPUs like the first MIPS uarchs would be a lot closer to ARM9 than Cortex-A8. Cortex-A53 probably doesn't have as many interlock conditions as A8 but it should still be substantially worse than ARM9.

    One particular application I know I'd appreciate having 31 GPRs for is emulating another ISA with 16 GPRs, like x86-64.
  • blanarahul - Thursday, December 12, 2013

    1) Yes. I got confused in the quantity vs. width argument.
    2) Sorry. I was trying to comment on a topic about which I have little to no knowledge.
  • ChipNano - Saturday, February 08, 2014

    I think a 64 bit processor was not really required, but it's just competing with APPLE!!!