Let's recap the current smartphone/tablet SoC landscape. Everything shipping today is built on a 4x-nm process at GlobalFoundries, Samsung, TSMC or UMC. Next year we'll see a move to 28nm (bringing better performance and power characteristics), but between now and the end of 2012 there will be a myriad of designs available on the market.

The table below encapsulates much of what you can expect over the next 12+ months:

2011/2012 SoC Comparison
| SoC | Process Node | CPU | GPU | Memory Bus | Release |
|---|---|---|---|---|---|
| Apple A5 | 45nm | 2 x ARM Cortex A9 w/ MPE @ 1GHz | PowerVR SGX 543MP2 | 2 x 32-bit LPDDR2 | Now |
| NVIDIA Tegra 2 | 40nm | 2 x ARM Cortex A9 @ 1GHz | GeForce | 1 x 32-bit LPDDR2 | Now |
| NVIDIA Tegra 3/Kal-El | 40nm | 4 x ARM Cortex A9 w/ MPE @ ~1.3GHz | GeForce++ | 1 x 32-bit LPDDR2 | Q4 2011 |
| Samsung Exynos 4210 | 45nm | 2 x ARM Cortex A9 w/ MPE @ 1.2GHz | ARM Mali-400 MP4 | 2 x 32-bit LPDDR2 | Now |
| Samsung Exynos 4212 | 32nm | 2 x ARM Cortex A9 w/ MPE @ 1.5GHz | ARM Mali-400 MP4 | 2 x 32-bit LPDDR2 | 2012 |
| ST-Ericsson NovaThor LP9600 (Nova A9600) | 28nm | 2 x ARM Cortex A15 @ 2.5GHz | IMG PowerVR Series 6 (Rogue) | Dual Memory | 2013 |
| ST-Ericsson NovaThor L9540 (Nova A9540) | 32nm | 2 x ARM Cortex A9 @ 1.85GHz | IMG PowerVR Series 5 | 2 x 32-bit LPDDR2 | 2H 2012 |
| ST-Ericsson NovaThor U9500 (Nova A9500) | 45nm | 2 x ARM Cortex A9 @ 1.2GHz | ARM Mali-400 MP1 | 1 x 32-bit LPDDR2 | Now |
| ST-Ericsson NovaThor U8500 | 45nm | 2 x ARM Cortex A9 @ 1.0GHz | ARM Mali-400 MP1 | 1 x 32-bit LPDDR2 | Now |
| TI OMAP 4430 | 45nm | 2 x ARM Cortex A9 w/ MPE @ 1.2GHz | PowerVR SGX 540 | 2 x 32-bit LPDDR2 | Now |
| TI OMAP 4460 | 45nm | 2 x ARM Cortex A9 w/ MPE @ 1.5GHz | PowerVR SGX 540 | 2 x 32-bit LPDDR2 | Q4 11 - 1H 12 |
| TI OMAP 4470 | 45nm | 2 x ARM Cortex A9 w/ MPE @ 1.8GHz | PowerVR SGX 544 | 2 x 32-bit LPDDR2 | 1H 2012 |
| TI OMAP 5 | 28nm | 2 x ARM Cortex A15 @ 2GHz | PowerVR SGX 544MPx | 2 x 32-bit LPDDR2 | 2H 2012 |
| Qualcomm MSM8x60 | 45nm | 2 x Scorpion @ 1.5GHz | Adreno 220 | 1 x 32-bit LPDDR2* | Now |
| Qualcomm MSM8960 | 28nm | 2 x Krait @ 1.5GHz | Adreno 225 | 2 x 32-bit LPDDR2 | 1H 2012 |
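
The Memory Bus column lists only channel count and width. Peak bandwidth also depends on the memory's transfer rate, which the table doesn't give. As a rough sketch of why the second channel matters, the snippet below computes theoretical peak bandwidth for single- and dual-channel 32-bit configurations. The 800 MT/s figure is purely an assumption for illustration, not a spec for any SoC listed here, and `peak_bandwidth_gbs` is a made-up helper name.

```python
def peak_bandwidth_gbs(channels: int, bus_width_bits: int, mega_transfers_per_s: float) -> float:
    """Theoretical peak bandwidth in GB/s (1 GB = 10**9 bytes)."""
    bytes_per_transfer = channels * bus_width_bits / 8
    return bytes_per_transfer * mega_transfers_per_s * 1e6 / 1e9


# Assumption: an illustrative LPDDR2 data rate, not taken from the table above.
ASSUMED_MTS = 800.0

single_channel = peak_bandwidth_gbs(1, 32, ASSUMED_MTS)
dual_channel = peak_bandwidth_gbs(2, 32, ASSUMED_MTS)

print(f"1 x 32-bit: {single_channel:.1f} GB/s")  # 3.2 GB/s
print(f"2 x 32-bit: {dual_channel:.1f} GB/s")    # 6.4 GB/s
```

Whatever the real transfer rate turns out to be, the ratio holds: at the same memory clock, a 2 x 32-bit interface has twice the theoretical peak of a 1 x 32-bit one. Sustained bandwidth will be lower than either number.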

The key is this: other than TI's OMAP 5 in the second half of 2012 and Qualcomm's Krait, no one else has announced plans to release a new microarchitecture in the near term. Furthermore, if we only look at the first half of next year, Qualcomm is the only company that's focused on significantly improving per-core performance through a new architecture. Everyone else is either scaling up in core count (NVIDIA) or clock speeds. As we've seen in the PC industry however, generational performance gaps are hard to overcome - even with more cores or frequency.

Qualcomm has an ARM architecture license enabling it to build its own custom microarchitectures that implement the ARM instruction set. This is similar to how AMD has an x86 license but designs its own chips rather than just producing clones of Intel processors. Qualcomm remains the only active player in the smartphone/tablet space that uses its architecture license to put out custom designs. The benefit to a custom design is typically better power and performance characteristics compared to the more easily synthesizable designs you get directly from ARM. The downside is that development time and costs go up tremendously.

Scorpion was Qualcomm's first Snapdragon CPU architecture. At a high level, it looked very much like an optimized ARM Cortex A8 design although the two had nothing in common outside of instruction set. Scorpion was a dual-issue, in-order architecture that eventually scaled to dual-core and 1.5GHz variants.

Scorpion was pretty much the CPU architecture of choice in the 2009 - 2010 timeframe. Throughout 2011 however, Qualcomm has been very quiet as dual Cortex A9 designs from NVIDIA, Samsung and TI have surpassed it in terms of performance.

Going into 2012, Qualcomm is set for a return to glory as it will be the first to deliver a brand new microprocessor architecture and the first to ship 28nm SoCs in volume. Qualcomm's next-generation SoCs will also be the first to integrate an LTE modem on-die, which should enable LTE on nearly all high-end devices at much better power levels than current multi-chip 4x-nm solutions. Today we're able to talk a bit about the architecture details and performance expectations of Qualcomm's next-generation SoC due out in the first half of 2012.

Krait Architecture

The Krait processor is the heart of Qualcomm's second generation Snapdragon and it's the core of all Snapdragon S4 SoCs. Krait takes the aging base of Scorpion and gives it a much needed dose of adrenaline.

Krait's front end is significantly wider. The architecture can fetch and decode three instructions per clock, and each decoder can handle any ARMv7-A instruction. The wider front end is a significant improvement over the 2-wide Scorpion core. It alone will be responsible for a tangible increase in IPC.

Architecture Comparison
| | ARM11 | ARM Cortex A8 | ARM Cortex A9 | Qualcomm Scorpion | Qualcomm Krait |
|---|---|---|---|---|---|
| Decode | single-issue | 2-wide | 2-wide | 2-wide | 3-wide |
| Pipeline Depth | 8 stages | 13 stages | 8 stages | 10 stages | 11 stages |
| Out of Order Execution | N | N | Y | Partial | Y |
| FPU | VFP11 (pipelined) | VFPv3 (not-pipelined) | Optional VFPv3-D16 (pipelined) | VFPv3 (pipelined) | VFPv3 (pipelined) |
| NEON | N/A | Y (64-bit wide) | Optional MPE (64-bit wide) | Y (128-bit wide) | Y (128-bit wide) |
| Process Technology | 90nm | 65nm/45nm | 40nm | 40nm | 28nm |
| Typical Clock Speeds | 412MHz | 600MHz/1GHz | 1.2GHz | 1GHz | 1.5GHz |

The execution back-end receives a similar expansion. Whereas the original Scorpion core only had three ports to its execution units, Krait increases that to seven. Krait can issue up to four instructions in parallel. The additional execution ports simply help prevent any artificial constraints on ILP. This is another area where Krait will be able to see significant IPC gains.

Krait's fetch and decode stages are obviously in-order, but the back-end is entirely out-of-order. Qualcomm claims that any instruction can be executed out of order, assuming that doing so doesn't create any new hazards. Instructions are retired in order.
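
To see why letting ready instructions slip past a stalled one helps, here is a toy issue model. It is not a description of Krait's actual scheduler: the latencies, the six-instruction program and the `cycles_to_finish` helper are all invented for illustration, and in-order retirement is not modeled. Each instruction lists the earlier instructions it depends on. In the in-order mode, the first instruction that isn't ready blocks everything behind it. In the out-of-order mode, any ready instruction may issue, up to the issue width.

```python
from typing import List, Optional, Tuple

# (latency in cycles, indices of EARLIER instructions this one depends on)
Instr = Tuple[int, List[int]]


def cycles_to_finish(program: List[Instr], issue_width: int, out_of_order: bool) -> int:
    """Return the cycle at which the last result becomes available."""
    done_at: List[Optional[int]] = [None] * len(program)  # None = not issued yet
    cycle = 0
    while any(d is None for d in done_at):
        slots = issue_width
        for i, (latency, deps) in enumerate(program):
            if slots == 0:
                break
            if done_at[i] is not None:
                continue  # already issued
            ready = all(done_at[d] is not None and done_at[d] <= cycle for d in deps)
            if ready:
                done_at[i] = cycle + latency
                slots -= 1
            elif not out_of_order:
                break  # in-order: a stalled instruction blocks younger ones
        cycle += 1
    return max((d for d in done_at if d is not None), default=0)


# Hypothetical program: a slow load, one consumer of it, and independent work.
PROGRAM: List[Instr] = [
    (3, []),      # 0: load (assumed 3-cycle latency)
    (1, [0]),     # 1: uses the load
    (1, []),      # 2: independent
    (1, []),      # 3: independent
    (1, [2, 3]),  # 4: combines 2 and 3
    (1, [1, 4]),  # 5: combines 1 and 4
]

print(cycles_to_finish(PROGRAM, issue_width=4, out_of_order=True))   # 5
print(cycles_to_finish(PROGRAM, issue_width=4, out_of_order=False))  # 6
```

In this contrived case the out-of-order machine hides the independent work under the load's latency, while the in-order machine waits for the load before touching anything else. Real gains depend heavily on the code being run.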

Qualcomm lengthened Krait's integer pipeline slightly from 10 stages in Scorpion to 11 stages in Krait. Load/store operations tack on another two cycles and instructions that go through the NEON/VFP path further lengthen the pipe. ARM's Cortex A15 design by comparison features a 15-stage integer pipeline. Qualcomm's design does contain more custom logic than ARM's stock A15, something that has typically given Qualcomm a clock speed advantage. The A15's deeper pipeline, on the other hand, should give ARM's design a clock speed advantage of its own. Whether the two effectively cancel each other out remains to be seen.

Qualcomm Architecture Comparison
| | Scorpion | Krait |
|---|---|---|
| Pipeline Depth | 10 stages | 11 stages |
| Decode | 2-wide | 3-wide |
| Issue Width | 3-wide? | 4-wide |
| Execution Ports | 3 | 7 |
| L2 Cache (dual-core) | 512KB | 1MB |
| Core Configurations | 1, 2 | 1, 2, 4 |

Krait has been upgraded to support the new virtualization instructions added in Cortex A15. Also like the A15, Krait enables LPAE for 40-bit memory addressing.
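
For a sense of scale, the quick calculation below compares the physical address space reachable with ARMv7's conventional 32-bit physical addresses against LPAE's 40-bit ones. It is plain arithmetic rather than anything Krait-specific, and `addressable_bytes` is just an illustrative helper. Note that LPAE widens physical addresses only; each process still sees a 32-bit virtual address space.

```python
GIB = 2 ** 30  # bytes in one gibibyte


def addressable_bytes(address_bits: int) -> int:
    """Number of distinct byte addresses for a given address width."""
    return 2 ** address_bits


print(addressable_bytes(32) // GIB)  # 4    (GiB without LPAE)
print(addressable_bytes(40) // GIB)  # 1024 (GiB, i.e. 1 TiB, with LPAE)
```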

At a high level, Qualcomm has built a 3-wide, out-of-order engine that feels very much like a modern version of Intel's old P6. Whereas designs from the A8 generation looked like modern Pentiums, Krait takes us into the era of the Pentium II.

Note that courtesy of the wider front-end and OoO execution engine, Krait should be a higher performance architecture than Intel's Atom. That's right, you'll be able to get better performance than some of the very first Centrino notebooks in your smartphones come 2012.

Performance Expectations

Performance of ARM cores has always been characterized by DMIPS (Dhrystone Millions of Instructions per Second). An extremely old integer benchmark, Dhrystone was popular in the PC market when I was growing up but was abandoned long ago in favor of more representative benchmarks. You can get a general idea of performance improvements across similar architectures assuming there are no funny compiler tricks at play. The comparison of single-core DMIPS/MHz is below:

ARM DMIPS/MHz
| | ARM11 | ARM Cortex A8 | ARM Cortex A9 | Qualcomm Scorpion | Qualcomm Krait |
|---|---|---|---|---|---|
| DMIPS/MHz | 1.25 | 2.0 | 2.5 | 2.1 | 3.3 |

At 3.3 DMIPS/MHz, Krait should be around 30% faster than a Cortex A9 running at the same frequency. At launch Krait will also be clocked about 25% higher than most A9s on the market today (1.5GHz versus a typical 1.2GHz), a gap that will only grow as Qualcomm introduces subsequent versions of the core. It's not unreasonable to expect a 30 - 50% gain in performance over existing smartphone designs. ARM hasn't published DMIPS/MHz numbers for the Cortex A15, although rumors place its performance around 3.5 DMIPS/MHz.
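
The arithmetic behind those percentages is easy to reproduce. The sketch below takes the DMIPS/MHz figures from the table above and the typical clock speeds from the architecture comparison (1.2GHz for the A9, 1.5GHz for Krait); the helper names are ours. Keep in mind this is Dhrystone only, so the combined figure is best read as an optimistic ceiling rather than a prediction of real-world gains.

```python
# DMIPS/MHz figures as quoted in the table above.
DMIPS_PER_MHZ = {
    "ARM11": 1.25,
    "ARM Cortex A8": 2.0,
    "ARM Cortex A9": 2.5,
    "Qualcomm Scorpion": 2.1,
    "Qualcomm Krait": 3.3,
}


def dmips(core: str, mhz: float) -> float:
    """Dhrystone MIPS for one core at the given clock."""
    return DMIPS_PER_MHZ[core] * mhz


def speedup(core: str, mhz: float, base_core: str, base_mhz: float) -> float:
    """Ratio of Dhrystone throughput between two core/clock pairs."""
    return dmips(core, mhz) / dmips(base_core, base_mhz)


same_clock = speedup("Qualcomm Krait", 1500, "ARM Cortex A9", 1500)
typical_clocks = speedup("Qualcomm Krait", 1500, "ARM Cortex A9", 1200)

print(f"Same clock:     +{(same_clock - 1) * 100:.0f}%")      # +32%
print(f"Typical clocks: +{(typical_clocks - 1) * 100:.0f}%")  # +65%
```

The same-clock result matches the roughly 30% quoted above. The 65% figure stacks the clock advantage on top of it, which is more than Dhrystone-style scaling usually delivers in practice, hence the more conservative 30 - 50% expectation.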

Updated VeNum Unit

ARM's NEON instruction set is handled by a dedicated unit in all of its designs. Krait is no different. Qualcomm calls its NEON engine VeNum and has increased its issue capabilities by 50%. Whereas Scorpion could only issue two NEON instructions in parallel, Krait can do three.

Qualcomm's NEON data paths are still 128-bits wide.
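
To put the datapath width in perspective, the sketch below counts how many elements fit in one vector for each of NEON's integer element sizes (8, 16, 32 and 64 bits), comparing the 64-bit-wide datapaths listed for the Cortex A8/A9 with the 128-bit-wide ones in Scorpion and Krait. It also checks the 50% issue figure. The `lanes` helper is illustrative only and says nothing about actual instruction throughput.

```python
def lanes(datapath_bits: int, element_bits: int) -> int:
    """Elements processed per pass through a datapath of the given width."""
    if datapath_bits % element_bits != 0:
        raise ValueError("element size must divide the datapath width")
    return datapath_bits // element_bits


for element_bits in (8, 16, 32, 64):
    narrow = lanes(64, element_bits)   # 64-bit-wide datapath
    wide = lanes(128, element_bits)    # 128-bit-wide datapath
    print(f"{element_bits:>2}-bit elements: {narrow} vs {wide} per pass")

# Issue capability: Scorpion issues 2 NEON instructions in parallel, Krait 3.
issue_gain = (3 - 2) / 2
print(f"NEON issue increase: {issue_gain:.0%}")  # 50%
```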

Update: Qualcomm published its whitepaper on the Snapdragon S4.


107 Comments


  • metafor - Friday, October 07, 2011

    Scorpion does support dual-channel; however, the 8x60 series does not have two controllers. The 8x55/7x30 does, however, and in most cases is used in the configuration you described in the article.
  • ArunDemeure - Friday, October 07, 2011

    I knew MSM7x30/8x55 was dual-channel but I thought it was also available as a 64-bit LPDDR2 PoP solution? While it makes sense for most people to use it as single-channel LPDDR2 as opposed to dual-channel LPDDR1 these days, why would anyone ever have used both PoP and non-PoP DRAM at the same time? Maybe that old leaked presentation on Baidu listing all the MSM7x30/8x55 packages is wrong though.
  • metafor - Friday, October 07, 2011

    A single Scorpion and Adreno 205 just didn't need both channels. It makes more sense for a lot of OEMs to use single 32-bit LPDDR2.
  • ArunDemeure - Friday, October 07, 2011

    Hmm, that would certainly be news to me, it's possible but you'd still need a second memory controller and PHY so it makes very little sense. I can see a few possibilities:
    - The LPDDR2 and DDR2 subsystems aren't shared so in theory for tablets you could do 32-bit SiP LPDDR2+32-bit off-chip DDR2. Seems weird but not impossible.
    - You can do 32-bit ISM+32-bit PoP. Once again, why do this? Were they limited by package pins with a 0.4mm pitch? Seems unlikely with a 14x14 package but who knows.
    - You can genuinely do 32-bit PoP+32-bit on the PCB. Still seems really weird to me.

    The MSM7200(A) had a separate small LPDDR1 chip (16-bit bus with SiP) reserved mostly for the baseband while the primary OS-accessible DRAM was off-chip. This was obviously rather expensive (fwiw Qualcomm only 'won' that generation on software and weak competition IMO) and they removed it to reduce cost (making the chip's memory arbitration more complex) on the MSM7227. I'm not sure about the QSD8650, maybe it still optionally had that extra memory bus (SiP-only) but it was more flexible and never used, it's hard to find that kind of info.

    Cheers,
    Arun
  • mythun.chandra - Friday, October 07, 2011

    Anand,

    Isn't this what I had pinged you about earlier?
  • z0mb13n3d - Friday, October 07, 2011

    I suggest you look into the facts before passing such statements.

    I don't know where you or the OP are getting your information from (3GHz A15's, quad 2.5GHz Kraits hitting next year, Kraits using HKMG etc.), but that's been pretty inaccurate. All you're doing is speculating based on bits and pieces floating around in PDF's and slides. I still remember one of his claims from the previous thread: '2x A15's > 4xA9's'. While no one in their right mind would argue against the wider, deeper, single A15 being better than a single A9, to make such an uninformed, blanket statement (and to back it up with useless DMIPS numbers!) just doesn't bode very well.
  • ArunDemeure - Friday, October 07, 2011

    ST-Ericsson has publicly indicated the A9600's A15s can run at up to 2.5GHz, and GlobalFoundries has publicly said that the A9600 uses their 28nm SLP process which uses High-K but not SiGe strain. Is it really hard to believe a 28HPM or 28HP A15 could easily reach 3GHz? I'm not sure anyone will do that in the phone/tablet market, but remember ARM also wants A15 to target slightly larger Windows 8 notebooks and (I'm not as optimistic about this) servers.

    As for Krait, Qualcomm's initial PR mentioned 2.5GHz (not just random slides) and APQ8064 is on TSMC 28HPM which uses High-K. If you don't trust either me or metafor on that, Qualcomm has also publicly stated that most of their chips will run on SiON but that they were considering High-K for chips running at 2GHz or above: http://semimd.com/blog/2011/02/07/qualcomm-shies-a...

    As for 2xA15 vs 4xA9, metafor's point is that most applications are still not sufficiently multithreaded. It has very little to do with DMIPS which is a worthless outdated benchmark (not that Coremark is perfect mind you - where oh where is my SPECInt for handhelds? Development platforms could support enough RAM to run it by now). Unlike him I think 4xA9 should be relatively competitive even if clearly inferior in some important cases, and as you imply it's a difficult and even fairly subjective topic, but I don't think metafor's opinion is unreasonable.
  • z0mb13n3d - Friday, October 07, 2011

    That is the point I'm trying to make! Semiconductor companies, by virtue of the fact that they have to sign OEM/ODM deals before they really even have working products almost always posture about how much their designs can go 'up to' or 'indicate' ratings and numbers. My beef with the earlier thread was that statements were being passed on as facts based purely on stuff posted in press releases. I can tell you, for a fact, that no 2.5GHz Krait (dual or quad) based product will be shipping in '12. I can also tell you for a fact that you will not see anything more than 1.8-2.2GHz (optimistic) in shipping A15's for mobile devices. I understand the A15 architecture is capable of much more, but to try and draw comparisons between a near-shipping mobile-spec quad-core A9 and an on-paper 3GHz A15 powering servers is not correct!

    If you did follow the previous thread closely, you will see that this was the only point I was trying to get across, in vain. No matter how you slice and dice it, the 2xA15 > 4xA9 argument is wrong. This is very similar to what we're seeing in the x86 market with Intel and AMD where the older, tri and quad core AMD's are still able to keep-up with or beat dual-core Intel's in threaded situations. Now it is an entirely different argument as to whether or not Google/MS/whoever else makes effective use of multi-core CPU's in their current mobile platforms and their relatively crude/simple kernels (as compared to desktop operating systems), but come Windows 8, I am willing to bet that quad core (or multi-core in general) SoC's will prove their worth.
  • ArunDemeure - Friday, October 07, 2011

    ST-E could underdeliver on the A9600, sure, but they've got a better process than OMAP5 and enough clever power saving tricks up their sleeve (some of which still aren't public) that I feel it's quite likely they won't. Remember 2.5GHz is only their peak frequency when a single core is on - they have not disclosed their throttling algorithms (which will certainly be more aggressive for everyone in the 28nm generation, especially on smartphone SKUs as opposed to tablets where higher TDPs are acceptable).

    Also multiple companies will be making A15s on 28HPM eventually. TSMC has indicated they have a lot of interest in HPM, and that should certainly clock at least 25% higher than GF's Gate-First Non-SiGe 28SLP. However the problem is that the A15 is quite power hungry, so I expect people will use that frequency headroom to undervolt and reduce power although a few might expose it with a TurboBoost-like mechanism. On the other hand, exposing the full 3GHz for Windows 8 on ARM mini-notebooks should be a breeze, and I don't see why you'd expect that to be a problem.

    As for 2.5GHz Quad-Core Krait in 2012 - I think they're still on schedule for tablets in late 2012, but then again NVIDIA was still on schedule for tablets in August 2011 back in February, so it's impossible to predict these things. Delays happen, and it'd be foolish not to take metafor seriously simply because he is unable to predict the unpredictable.

    Finally, 2xA15 vs 4xA9... metafor's point is that given the lower maturity of multithreading on handheld devices, it's more like high-end quad-core Intel CPUs beating eight-core AMD CPUs in the real world. As I said I'm not sure I agree, but it's fairly reasonable.
  • dagamer34 - Saturday, October 08, 2011

    I doubt it was a delay as much as nVidia being boastful. They're quite known for that.
