Original Link: http://www.anandtech.com/show/4940/qualcomm-new-snapdragon-s4-msm8960-krait-architecture
Qualcomm's New Snapdragon S4: MSM8960 & Krait Architecture Explored
by Brian Klug & Anand Lal Shimpi on October 7, 2011 12:35 PM EST
Let's recap the current smartphone/tablet SoC landscape. Everything shipping today is built on a 4x-nm process at Global Foundries, Samsung, TSMC or UMC. Next year we'll see a move to 28nm (bringing better performance and power characteristics), but between now and the end of 2012 there will be a myriad of designs available on the market.
The table below encapsulates much of what you can expect over the next 12+ months:
|2011/2012 SoC Comparison|
|SoC||Process Node||CPU||GPU||Memory Bus||Release|
|Apple A5||45nm||2 x ARM Cortex A9 w/ MPE @ 1GHz||PowerVR SGX 543MP2||2 x 32-bit LPDDR2||Now|
|NVIDIA Tegra 2||40nm||2 x ARM Cortex A9 @ 1GHz||GeForce||1 x 32-bit LPDDR2||Now|
|NVIDIA Tegra 3/Kal-El||40nm||4 x ARM Cortex A9 w/ MPE @ ~1.3GHz||GeForce++||1 x 32-bit LPDDR2||Q4 2011|
|Samsung Exynos 4210||45nm||2 x ARM Cortex A9 w/ MPE @ 1.2GHz||ARM Mali-400 MP4||2 x 32-bit LPDDR2||Now|
|Samsung Exynos 4212||32nm||2 x ARM Cortex A9 w/ MPE @ 1.5GHz||ARM Mali-400 MP4||2 x 32-bit LPDDR2||2012|
|ST-Ericsson NovaThor LP9600 (Nova A9600)||28nm||2 x ARM Cortex-A15 @ 2.5GHz||IMG PowerVR Series 6 (Rogue)||Dual Memory||2013|
|ST-Ericsson Novathor L9540 (Nova A9540)||32nm||2 x ARM Cortex A9 @ 1.85GHz||IMG PowerVR Series 5||2 x 32-bit LPDDR2||2H 2012|
|ST-Ericsson NovaThor U9500 (Nova A9500)||45nm||2 x ARM Cortex A9 @ 1.2GHz||ARM Mali-400 MP1||1 x 32-bit LPDDR2||Now|
|ST-Ericsson NovaThor U8500||45nm||2 x ARM Cortex A9 @ 1.0GHz||ARM Mali-400 MP1||1 x 32-bit LPDDR2||Now|
|TI OMAP 4430||45nm||2 x ARM Cortex A9 w/ MPE @ 1.2GHz||PowerVR SGX 540||2 x 32-bit LPDDR2||Now|
|TI OMAP 4460||45nm||2 x ARM Cortex A9 w/ MPE @ 1.5GHz||PowerVR SGX 540||2 x 32-bit LPDDR2||Q4 11 - 1H 12|
|TI OMAP 4470||45nm||2 x ARM Cortex A9 w/ MPE @ 1.8GHz||PowerVR SGX 544||2 x 32-bit LPDDR2||1H 2012|
|TI OMAP 5||28nm||2 x ARM Cortex A15 @ 2GHz||PowerVR SGX 544MPx||2 x 32-bit LPDDR2||2H 2012|
|Qualcomm MSM8x60||45nm||2 x Scorpion @ 1.5GHz||Adreno 220||1 x 32-bit LPDDR2*||Now|
|Qualcomm MSM8960||28nm||2 x Krait @ 1.5GHz||Adreno 225||2 x 32-bit LPDDR2||1H 2012|
The key is this: other than TI's OMAP 5 in the second half of 2012 and Qualcomm's Krait, no one else has announced plans to release a new microarchitecture in the near term. Furthermore, if we only look at the first half of next year, Qualcomm is the only company that's focused on significantly improving per-core performance through a new architecture. Everyone else is either scaling up in core count (NVIDIA) or clock speeds. As we've seen in the PC industry however, generational performance gaps are hard to overcome - even with more cores or frequency.
Qualcomm has an ARM architecture license enabling it to build its own custom micro architectures that implement the ARM instruction set. This is similar to how AMD has an x86 license but designs its own chips rather than just producing clones of Intel processors. Qualcomm remains the only active player in the smartphone/tablet space that uses its architecture license to put out custom designs. The benefit to a custom design is typically better power and performance characteristics compared to the more easily synthesizable designs you get directly from ARM. The downside is development time and costs go up tremendously.
Scorpion was Qualcomm's first Snapdragon CPU architecture. At a high level, it looked very much like an optimized ARM Cortex A8 design although the two had nothing in common outside of instruction set. Scorpion was a dual-issue, in-order architecture that eventually scaled to dual-core and 1.5GHz variants.
Scorpion was pretty much the CPU architecture of choice in the 2009 - 2010 timeframe. Throughout 2011 however, Qualcomm has been very quiet as dual Cortex A9 designs from NVIDIA, Samsung and TI have surpassed it in terms of performance.
Going into 2012, Qualcomm is set for a return to glory as it will be the first to deliver a brand new microprocessor architecture and the first to ship 28nm SoCs in volume. Qualcomm's next-generation SoCs will also be the first to integrate an LTE modem on-die, which should enable LTE on nearly all high-end devices at much better power levels than current multi-chip 4x-nm solutions. Today we're able to talk a bit about the architecture details and performance expectations of Qualcomm's next-generation SoC due out in the first half of 2012.
The Krait processor is the heart of Qualcomm's second generation Snapdragon and it's the core of all Snapdragon S4 SoCs. Krait takes the aging base of Scorpion and gives it a much needed dose of adrenaline.
Krait's front end is significantly wider. The architecture can fetch and decode three instructions per clock. The decoders are equally capable of decoding any ARMv7-A instructions. The wider front end is a significant improvement over the 2-wide Scorpion core. It alone will be responsible for a tangible increase in IPC.
| ||ARM11||ARM Cortex A8||ARM Cortex A9||Qualcomm Scorpion||Qualcomm Krait|
|Pipeline Depth||8 stages||13 stages||8 stages||10 stages||11 stages|
|Out of Order Execution||N||N||Y||Partial||Y|
|FPU||VFP11 (pipelined)||VFPv3 (not-pipelined)||Optional VFPv3-D16 (pipelined)||VFPv3 (pipelined)||VFPv3 (pipelined)|
|NEON||N/A||Y (64-bit wide)||Optional MPE (64-bit wide)||Y (128-bit wide)||Y (128-bit wide)|
|Typical Clock Speeds||412MHz||600MHz/1GHz||1.2GHz||1GHz||1.5GHz|
The execution back-end receives a similar expansion. Whereas the original Scorpion core only had three ports to its execution units, Krait increases that to seven. Krait can issue up to four instructions in parallel. The additional execution ports simply help prevent any artificial constraints on ILP. This is another area where Krait will be able to see significant IPC gains.
Krait's fetch and decode stages are obviously in-order, but the back-end is entirely out-of-order. Qualcomm claims that any instruction can be executed out of order, assuming that doing so doesn't create any new hazards. Instructions are retired in order.
Qualcomm lengthened Krait's integer pipeline slightly from 10 stages in Scorpion to 11 stages in Krait. Load/store operations tack on another two cycles and instructions that go through the Neon/VFP path further lengthen the pipe. ARM's Cortex A15 design by comparison features a 15-stage integer pipeline. Qualcomm's design does contain more custom logic than ARM's stock A15, which has typically given it a clock speed advantage. The A15's deeper pipeline should give it a clock speed advantage as well. Whether the two effectively cancel each other out remains to be seen.
|Qualcomm Architecture Comparison|
| ||Scorpion||Krait|
|Pipeline Depth||10 stages||11 stages|
|L2 Cache (dual-core)||512KB||1MB|
|Core Configurations||1, 2||1, 2, 4|
Krait has been upgraded to support the new virtualization instructions added in Cortex A15. Also like the A15, Krait enables LPAE for 40-bit memory addressing.
At a high-level Qualcomm has built a 3-wide, out-of-order engine that feels very much like a modern version of Intel's old P6. Whereas designs from the A8 generation looked like modern Pentiums, Krait takes us into the era of the Pentium II.
Note that courtesy of the wider front-end and OoO execution engine, Krait should be a higher performance architecture than Intel's Atom. That's right, you'll be able to get better performance than some of the very first Centrino notebooks in your smartphones come 2012.
Performance of ARM cores has always been characterized by DMIPS (Dhrystone Millions of Instructions per Second). An extremely old integer benchmark, Dhrystone was popular in the PC market when I was growing up but was abandoned long ago in favor of more representative benchmarks. You can get a general idea of performance improvements across similar architectures assuming there are no funny compiler tricks at play. The comparison of single-core DMIPS/MHz is below:
|ARM11||ARM Cortex A8||ARM Cortex A9||Qualcomm Scorpion||Qualcomm Krait|
At 3.3 DMIPS/MHz, Krait should be around 30% faster than a Cortex A9 running at the same frequency. At launch Krait will run 25% faster than most A9s on the market today, a gap that will only grow as Qualcomm introduces subsequent versions of the core. It's not unreasonable to expect a 30 - 50% gain in performance over existing smartphone designs. ARM hasn't published DMIPS/MHz numbers for the Cortex A15, although rumors place its performance around 3.5 DMIPS/MHz.
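The per-clock and clock-speed gains above can be separated with a little arithmetic. The sketch below is illustrative only: the 3.3 DMIPS/MHz figure for Krait and the 1.2GHz and 1.5GHz clocks come from this article, while the 2.5 DMIPS/MHz figure for the Cortex A9 is an assumption based on ARM's commonly quoted number.

```python
# Dhrystone throughput estimate: DMIPS = (DMIPS per MHz) * (clock in MHz).
KRAIT_DMIPS_PER_MHZ = 3.3  # from the article
A9_DMIPS_PER_MHZ = 2.5     # assumption: ARM's commonly quoted Cortex A9 figure

KRAIT_CLOCK_MHZ = 1500     # launch MSM8960 clock, from the article
A9_CLOCK_MHZ = 1200        # typical Cortex A9 clock, from the article


def dmips(dmips_per_mhz: float, clock_mhz: float) -> float:
    """Single-core Dhrystone MIPS at a given clock speed."""
    return dmips_per_mhz * clock_mhz


per_clock_gain = KRAIT_DMIPS_PER_MHZ / A9_DMIPS_PER_MHZ  # roughly 1.3x
clock_gain = KRAIT_CLOCK_MHZ / A9_CLOCK_MHZ              # 1.25x
combined_gain = (dmips(KRAIT_DMIPS_PER_MHZ, KRAIT_CLOCK_MHZ)
                 / dmips(A9_DMIPS_PER_MHZ, A9_CLOCK_MHZ))

print(f"per-clock: {per_clock_gain:.2f}x, clock: {clock_gain:.2f}x, "
      f"combined: {combined_gain:.2f}x")
```

The combined Dhrystone figure is an upper bound on paper. Dhrystone scaling rarely carries over fully to real applications, which is one reason to expect smaller real-world gains.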
Updated VeNum Unit
ARM's NEON instruction set is handled by a dedicated unit in all of its designs. Krait is no different. Qualcomm calls its NEON engine VeNum and has increased its issue capabilities by 50%. Whereas Scorpion could only issue two NEON instructions in parallel, Krait can do three.
Qualcomm's NEON data paths are still 128-bits wide.
Update: Qualcomm published its whitepaper on the Snapdragon S4.
Cache & Memory Hierarchy
Qualcomm has a three level exclusive cache hierarchy in Krait. The lower two levels are private per core, while the third level is shared among all cores. Qualcomm calls these caches L0, L1 and L2.
Each Krait core has an 8KB L0 cache (4KB instruction + 4KB data cache). The L0 cache is direct mapped and accessible in a single cycle. Qualcomm claims an 85% hit rate in this level 0 cache, which helps save power by not firing up the larger L1 cache. The hierarchy is exclusive so L0 data isn't necessarily duplicated in L1.
Each core also has a 32KB L1 cache (16KB instruction + 16KB data). The L1 is 4-way set associative and can also be accessed in a single cycle. There's no way prediction at work here. With 1 cycle latency to both L0 and L1, the primary advantage of the L0 is power.
|Krait Cache Architecture|
|Level||Size||Associativity||Clock|
|L0||4KB + 4KB||Direct Mapped||Core|
|L1||16KB + 16KB||4-way set associative||Core|
|L2||1MB (dual core) or 2MB (quad core)||8-way set associative||1.3GHz max|
The L2 cache is shared among all cores. In dual-core designs the L2 cache is sized at 1MB (up from 512KB in Scorpion), while quad-core Krait SoCs will have a 2MB L2. Krait's L2 cache is 8-way set associative.
While the L0 and L1 caches operate at core frequency and are on the same voltage plane as their associated core, the L2 cache is separate. To save power the L2 cache runs at its own frequency (up to 1.3GHz depending on the currently requested performance level). The L2 cache is on its own power plane and can be power gated if necessary.
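To make the cache organization more concrete, here is a rough sketch of how capacity and associativity translate into a number of sets. The sizes and associativities are the ones given above. The 64-byte line size is purely an assumption, since Qualcomm's line size isn't stated here, so treat the set counts as illustrative.

```python
KB = 1024
LINE_BYTES = 64  # assumption: the cache line size is not given in the article


def num_sets(size_bytes: int, ways: int, line_bytes: int = LINE_BYTES) -> int:
    """Number of sets = capacity / (associativity * line size)."""
    return size_bytes // (ways * line_bytes)


# (capacity in bytes, associativity); a direct mapped cache has 1 way.
caches = {
    "L0 data": (4 * KB, 1),
    "L1 data": (16 * KB, 4),
    "L2 (dual-core)": (1024 * KB, 8),
    "L2 (quad-core)": (2048 * KB, 8),
}

for name, (size, ways) in caches.items():
    print(f"{name}: {num_sets(size, ways)} sets x {ways} way(s)")

# With Qualcomm's claimed 85% L0 hit rate, only the misses wake the L1.
L0_HIT_RATE = 0.85
accesses = 1000
l1_lookups = round(accesses * (1 - L0_HIT_RATE))
print(f"{l1_lookups} of {accesses} accesses go on to the L1")
```

Under these assumptions the L0 filters out most lookups before the larger L1 array is touched, which is where the power saving would come from.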
Although Scorpion featured a dual-channel LPDDR2 memory controller, in a PoP configuration only one channel was available to any stacked DRAM. In order to get access to both 32-bit memory channels the OEM had to implement a DRAM on-package as well as an external DRAM on the PCB. Memory requests could be interleaved between the two DRAM, however Qualcomm seemed to prefer load balancing between the two with CPU/GPU accesses being directed to the lower latency PoP DRAM. Very few OEMs seemed to populate both channels and thus Scorpion based designs were effectively single-channel offerings.
Krait removes this limitation and now OEMs can utilize both memory channels in a PoP configuration (simply put two 32-bit DRAM die on the PoP stack) or in an external configuration. The split PoP/external DRAM organization is no longer supported. This change will hopefully mean we'll see more dual-channel Krait designs than we saw with Scorpion, which will in turn improve performance.
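The benefit of populating both channels is easy to estimate. The sketch below computes theoretical peak bandwidth for one and two 32-bit channels. The 800 MT/s data rate is a hypothetical placeholder rather than a figure from Qualcomm, so only the 2x ratio between the two results should be read as meaningful.

```python
def peak_bandwidth_gbs(channels: int, bus_bits: int, mega_transfers: float) -> float:
    """Theoretical peak bandwidth in GB/s.

    channels * (bytes per transfer) * (transfers per second) / 1e9
    """
    return channels * (bus_bits / 8) * mega_transfers * 1e6 / 1e9


DATA_RATE_MTS = 800  # hypothetical LPDDR2 data rate, not from the article

single_channel = peak_bandwidth_gbs(1, 32, DATA_RATE_MTS)
dual_channel = peak_bandwidth_gbs(2, 32, DATA_RATE_MTS)

print(f"1 x 32-bit: {single_channel:.1f} GB/s")
print(f"2 x 32-bit: {dual_channel:.1f} GB/s")
```

Real-world bandwidth will be lower than these peaks, but the doubling is what matters for the GPU and any other bandwidth-hungry block sharing the memory controller.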
Process Technology and Clock Speeds
Krait will be the world's first smartphone CPU built on a 28nm process. Qualcomm is working with both TSMC and Global Foundries, although TSMC will produce the first chips. Krait will be built, at first, on TSMC's standard 28nm LP process. According to Qualcomm there's less risk associated with TSMC's non-HKMG process. Qualcomm was quick to point out that the entire MSM8960 SoC is built on a 28nm LP process compared to NVIDIA's 40nm LPG design in Kal-El. From Qualcomm's perspective, 40nm G transistors are only useful at reducing leakage at high temperatures but for the majority of the time a homogeneous LP design makes more sense.
Just like Scorpion, Krait places each core on its own voltage plane driven at its own clock frequency. Cores can be clocked independently of one another, which Qualcomm insists gives it a power advantage in many workloads.
The first implementation of Krait will be in a dual-core 1.5GHz MSM8960, however a second revision of the silicon will be introduced next year that increases clock speed to 1.7 - 2.0GHz. Qualcomm claims that at the same 1.05V core voltage, Krait can run at 1.7GHz vs. 1.55GHz for Scorpion. At these two clock speeds and at the same voltage, Qualcomm tells us that Krait consumes 265mW vs. 432mW for Scorpion while running an undisclosed workload. Although it should be possible for Krait to draw more power than Scorpion under load, Krait should be able to improve overall power efficiency by completing tasks quicker and thus dropping down to idle faster than its predecessor. As a result, smartphone and tablet battery life should remain the same at worst and improve at best.
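One way to read Qualcomm's power figures is as energy per clock cycle, which is simply power divided by frequency. The sketch below uses the 265mW at 1.7GHz and 432mW at 1.55GHz numbers quoted above. The workload is undisclosed, so this is a back-of-the-envelope comparison rather than a measurement.

```python
def energy_per_cycle_pj(power_mw: float, clock_ghz: float) -> float:
    """Energy per clock cycle in picojoules.

    mW / GHz = (1e-3 J/s) / (1e9 cycles/s) = 1e-12 J per cycle = 1 pJ.
    """
    return power_mw / clock_ghz


krait_pj = energy_per_cycle_pj(265, 1.7)      # Qualcomm's Krait figures
scorpion_pj = energy_per_cycle_pj(432, 1.55)  # Qualcomm's Scorpion figures
ratio = krait_pj / scorpion_pj

print(f"Krait:    {krait_pj:.1f} pJ/cycle")
print(f"Scorpion: {scorpion_pj:.1f} pJ/cycle")
print(f"Krait uses {ratio:.0%} of Scorpion's energy per cycle")
```

Since Krait should also retire more instructions per cycle than Scorpion, the energy needed to finish a given task could drop by more than the per-cycle ratio suggests.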
The Adreno 225 GPU
Qualcomm has historically been pretty silent about its GPU architectures. You'll notice that specific details of Adreno GPU execution resources have been absent from most of our SoC comparisons. Starting with MSM8960 however, this is starting to change.
The MSM8960 uses a current generation Adreno GPU with a couple of changes. Qualcomm calls this GPU the Adreno 225, a follow-on to Adreno 220. Subsequent Krait designs will use Adreno 3xx GPUs based on a brand new architecture.
As we discussed in our Samsung Galaxy S 2 review, Qualcomm's Adreno architecture is a tile based immediate mode renderer with early-z rejection. By Qualcomm's own admission, Adreno is somewhere in the middle of the rendering spectrum between IMRs and Imagination Technologies' TBDR architectures. One key difference is Adreno's tiling isn't as fine grained as IMG's.
Architecturally the Adreno 225 and 220 are identical. Adreno 2xx is a DX9-class unified shader design. There's a ton of compute on-board with eight 4-wide vector units and eight scalar units.
Each 4-wide vector unit is capable of a maximum of 8 MADs per clock, while each scalar unit is similarly capable of 2 MADs per clock. That works out to 160 floating point operations per clock, or 32 GFLOPS at 200MHz.
Update: Qualcomm has clarified the capabilities of its 4-wide vector ALUs. Similar to the PowerVR SGX 543, each 4-wide vector ALU is capable of four MADs (one per component). The scalar units cannot be combined to do any MADs. Although they are helpful, we haven't been tracking scalar units in this table (IMG has something similar), so we've excluded them for now.
|Mobile SoC GPU Comparison|
| ||Adreno 225||PowerVR SGX 540||PowerVR SGX 543||PowerVR SGX 543MP2||Mali-400 MP4||GeForce ULP||Kal-El GeForce|
|# of SIMDs||8||4||4||8||4 + 1||8||12|
|MADs per SIMD||4||2||4||4||4 / 2||1||?|
|GFLOPS @ 200MHz||12.8 GFLOPS||3.2 GFLOPS||6.4 GFLOPS||12.8 GFLOPS||7.2 GFLOPS||3.2 GFLOPS||?|
|GFLOPS @ 300MHz||19.2 GFLOPS||4.8 GFLOPS||9.6 GFLOPS||19.2 GFLOPS||10.8 GFLOPS||4.8 GFLOPS||?|
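The GFLOPS rows in the table follow directly from the SIMD counts and MADs per SIMD, counting each MAD as two floating point operations. The sketch below reproduces that arithmetic. The Mali-400 MP4 entry is read here as four units doing 4 MADs each plus one unit doing 2 MADs, which is an interpretation of the "4 + 1" and "4 / 2" cells that happens to match the table's totals. Kal-El is left out because its MADs per SIMD figure is unknown.

```python
def peak_gflops(simds: int, mads_per_simd: int, clock_mhz: float) -> float:
    """Peak GFLOPS, counting each multiply-add (MAD) as 2 floating point ops."""
    return simds * mads_per_simd * 2 * clock_mhz / 1000


# Each GPU is a list of (number of SIMDs, MADs per SIMD) groups.
GPUS = {
    "Adreno 225": [(8, 4)],
    "PowerVR SGX 540": [(4, 2)],
    "PowerVR SGX 543": [(4, 4)],
    "PowerVR SGX 543MP2": [(8, 4)],
    "Mali-400 MP4": [(4, 4), (1, 2)],  # interpretation of "4 + 1" and "4 / 2"
    "GeForce ULP": [(8, 1)],
}


def gpu_peak_gflops(name: str, clock_mhz: float) -> float:
    """Sum the peak GFLOPS of every SIMD group in the named GPU."""
    return sum(peak_gflops(s, m, clock_mhz) for s, m in GPUS[name])


for name in GPUS:
    print(f"{name}: {gpu_peak_gflops(name, 200):.1f} GFLOPS @ 200MHz, "
          f"{gpu_peak_gflops(name, 300):.1f} GFLOPS @ 300MHz")
```

These are theoretical peaks at a normalized clock. They say nothing about how well a shader compiler can keep the units busy.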
Looking at the table above you'll see that this is the same amount of computing power as IMG's PowerVR SGX 543MP2. However as we've already seen in our tests, Adreno 220 isn't anywhere near as quick.
Shader compiler efficiency and data requirements to actually populate a Vec4+1 array are both unknowns, and I suspect both significantly gate overall Adreno performance. There's also the fact that the Adreno 22x family only has two TMUs compared to four in the 543MP2, limiting texturing performance. Combine that with the fact that most Adreno 220 GPUs have been designed into single-channel memory controller systems and you've got a recipe for tons of compute potential limited by other bottlenecks.
With Adreno 225 Qualcomm improves performance along two vectors, the first being clock speed. While Adreno 220 (used in the MSM8660) ran at 266MHz, Adreno 225 runs at 400MHz thanks to 28nm. Secondly, Qualcomm tells us Adreno 225 is accompanied by "significant driver improvements". Keeping in mind the sheer amount of compute potential of the Adreno 22x family, it only makes sense that driver improvements could unlock a lot of performance. Qualcomm expects the 225 to be 50% faster than the outgoing 220.
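It's worth checking how much of that 50% the clock bump could account for on its own. The sketch below assumes peak throughput scales linearly with clock speed and uses the corrected figure of 8 vector units at 4 MADs each. Any gain from the driver improvements is not modeled.

```python
ADRENO_220_MHZ = 266  # MSM8660, from the article
ADRENO_225_MHZ = 400  # MSM8960, from the article

# 8 vector units * 4 MADs each * 2 floating point ops per MAD
FLOPS_PER_CLOCK = 8 * 4 * 2

clock_ratio = ADRENO_225_MHZ / ADRENO_220_MHZ
peak_220_gflops = FLOPS_PER_CLOCK * ADRENO_220_MHZ / 1000
peak_225_gflops = FLOPS_PER_CLOCK * ADRENO_225_MHZ / 1000

print(f"clock ratio: {clock_ratio:.2f}x")
print(f"Adreno 220 peak: {peak_220_gflops:.1f} GFLOPS")
print(f"Adreno 225 peak: {peak_225_gflops:.1f} GFLOPS")
```

The clock ratio alone lands at roughly 1.5x, so Qualcomm's 50% estimate is consistent with frequency scaling before any driver work is counted.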
Qualcomm claims that MSM8960 will be able to outperform Apple's A5 in GLBenchmark 2.x at qHD resolutions. We'll have to wait until we have shipping devices in hand to really put that claim to the test, but if true it's good news for Krait as the A5 continues to be the high end benchmark for mobile GPU performance.
While Adreno 225 is only Direct3D feature level 9_3 compliant, Qualcomm insisted that when the time is right it will have a D3D11 capable GPU using its own IP - putting to rest rumors of Qualcomm looking to license a third party GPU in order to be competitive in Windows 8 designs. Although Qualcomm committed to delivering D3D11 support, it didn't commit to a timeframe.
MSM8960 Cellular Connectivity
Until now, getting 4G LTE connectivity in a smartphone has required two basebands: one delivering 4G LTE connectivity, and a more traditional smartphone-geared baseband for 2G voice and 3G data. Take Verizon’s 4G LTE smartphone lineup for example, where many devices combine MSM8655 for camping a 1x voice session alongside MDM9600 for EVDO and LTE, or some other similar combination. Further, all those LTE basebands are built on a 45nm process and really geared towards data specific applications.
For a while now we’ve also been talking about 28nm LTE basebands, and specifically the multimode connectivity on MSM8960. This is the first of Qualcomm’s S4 SoCs, and includes 4G LTE connectivity alongside the usual assortment of WCDMA/GSM/CDMA2000 standards. MSM8960’s cellular baseband is based around Qualcomm’s second generation (3GPP Rel.9) LTE modem, which is exactly what’s inside MDM9x15 which we’ve talked about in the past.
The full list of air interfaces MSM8960 supports is impressive: LTE FDD/TDD, UMTS, CDMA, TD-SCDMA (for Chinese markets), and GERAN (GSM/EDGE). The table below gives the details.
|Snapdragon S4 - MSM8960 Cellular Support|
|LTE FDD||100 Mbps DL / 50 Mbps UL (Cat. 3, 3GPP Rel.9)|
|LTE TDD||68 Mbps DL / 17 Mbps UL (Cat. 3, 3GPP Rel.9)|
|UMTS||DC-HSPA+ 42 Mbps DL (Cat. 24) / 11 Mbps UL (Cat. 8)|
|CDMA2000||1xAdvanced, EVDO Rev.B (14.7 Mbps DL / 5.4 Mbps UL)|
|TD-SCDMA||TD-SCDMA 4.2 Mbps DL / 2.2 Mbps UL|
What’s new is the inclusion of a category 3 4G LTE baseband in the SoC, alongside DC-HSPA+ and TD-SCDMA for the Chinese market. This is a substantial increase in the number of air interfaces supported onboard the SoC, and it enables tighter integration and lower power because the baseband is manufactured on the same 28nm process. There’s still the requirement for external RF and a transceiver (RTR8600 or something similar) which houses all the analog, but that’s the same everywhere else.
Since the baseband in MSM8960 is shared with MDM9x15, both are 3GPP Release 9 devices, whereas MDM9600 and other launch LTE devices are 3GPP Release 8, which was the launch standard. The newer 3GPP release brings a number of improvements and moves closer to Voice over LTE (VoLTE) and SRVCC (single radio voice call continuity) for fallback to GSM/UMTS or 1x voice when 4G LTE coverage fades. The present combination of a camped 1x voice session alongside 4G LTE for data is also possible on MSM8960, which is exactly what’s done in the case of the HTC Thunderbolt.
In time, carriers will transition to using VoLTE and enrich the voice experience by offering services that work across the data session, alongside some circuit switched (CS) traditional 2G/3G voice to fall back to. For CDMA networks that’ll continue being the dual RF scenario which uses 1x for voice, and for UMTS networks that’ll be a SRVCC augmented fast handover to 3G for voice calls. This handover and call setup is targeted to take place in under one second.
There’s more to the connectivity situation as well, as MSM8960 includes built in WLAN 802.11b/g/n (single spatial stream), Bluetooth, and GPS. These are integrated directly into the MSM8960 the same way the cellular modem is and only require some external RF to use.
Of course, it’s one thing to talk about all this connectivity on MSM8960 and something else entirely to see it. With MSM8660, Qualcomm gave us one of their Mobile Development Platforms (MDPs) which is something of a reference design and development board for each SoC generation.
This time was no exception, and they showed off their new MSM8960 MDP connected to Verizon’s 4G LTE network streaming 1080p YouTube video, loading pages, and finally running a few speedtests using the Speedtest.net application.
This was all over Verizon’s 4G LTE network at Qualcomm HQ in San Diego and worked impressively well for hardware and software that still isn’t production level. In spite of marginal signal in the room we performed testing in, the MDP finished tests with pretty decent results. I ran some more tests on a Droid Bionic in the same room and saw similar results.
Qualcomm has had MSM8960 silicon back in house for the past 3 months and is on-track for a release sometime in the first half of next year. Assuming Qualcomm can deliver on its claims, performance alone would be enough to sell this chip. Improved power characteristics and an integrated LTE baseband really complete the package though.
The implications for a 1H 2012 MSM8960 release are tremendous. Android users will have to choose between a newer software platform (OMAP 4 running Ice Cream Sandwich) or much faster hardware (MSM8960). Windows Phone users may finally get a much needed performance boost if Microsoft chooses to standardize on Krait for its Windows Phone hardware refresh next year. End users will benefit as next year's smartphones and tablets will see, once again, a generational performance improvement over what's shipping today. LTE should also start to see much more widespread adoption (at the high end) as a result of Qualcomm's integrated LTE baseband.