NVIDIA Tegra 4 Architecture Deep Dive, Plus Tegra 4i, Icera i500 & Phoenix Hands On

Name: NVIDIA Tegra 4 Architecture Deep Dive, Plus Tegra 4i, Icera i500 & Phoenix Hands On
Item: NVIDIA Tegra 4 Architecture Deep Dive, Plus Tegra 4i, Icera i500 & Phoenix Hands On

by Anand Lal Shimpi & Brian Klug on February 24, 2013 3:00 PM EST

75 Comments | Add A Comment

75 Comments

It basically goes without saying, but Tegra 4i, formerly Grey, is primarily a story about the modem. About halfway through 2011, NVIDIA acquired Icera, a player in the handset cellular baseband market who had managed to fit a software-defined radio into the power budget for a mobile device. NVIDIA clearly recognized the importance of being able to deliver an SoC with an integrated modem, and after the Icera acquisition, and later Grey’s appearance on the roadmap, it just became a matter of when that would happen. NVIDIA executed pretty quickly, it turns out, and the fruits of that work manifest themselves in two products — Icera’s standalone i500 and Tegra 4i which contains the same i500 IP block, minus USB interfaces and a different memory subsystem.

Icera was an interesting player in the baseband space because their architecture was that of a software defined radio on the digital baseband side. The reality is that nobody is either fully software defined or burned out to ASIC (hardware), but rather somewhere inbetween. For example, Qualcomm is a combination of software and hardware, though it’s never been entirely clear what functional blocks are ASIC and which other blocks are software, though I’ve been told this is often a matter of whatever is most advantageous for power and what gets re-used most. That said, Icera’s implementation is the furthest towards being pure software defined of anyone, with the entire digital baseband being just one big platform to run their own software atop. There’s an external transceiver which does downconversion, but after that it’s pure software. The question has always been how Icera could afford to build a power competitive platform with an entirely software designed stack, and the clue lies in their choice of 28 HPM instead of LP or HPL silicon for i500 and 4i. Icera designs to a high performance process, then turns off blocks when they’re not in use, rather than make a larger SoC that’s lower leakage. The result is that NVIDIA claims a 40% smaller die for i500 than MDM9x15.

Icera has maintained the same software platform across its modem portfolio, each time taking advantage of increased processing performance afforded by faster silicon to include more air interfaces and capabilities. The result is an interesting advantage when it comes to operator certification — they can make a compelling case for each new version being close to the previous one, since they run the same PHY and protocol stack entirely in software.

In the case of i500, it runs the exact same multimode modem software as the i4xx series, just with increased DXP support for 15 and 20 MHz channel bandwidths on LTE.

The entire stack ends up being 1.2 million lines of C and DXP code, with a total size of 7.7 MB compiled. NVIDIA gave a great breakdown of the protocol stack as well.

The real name for Icera i500 is ICE9045, and it is paired with a ICE9245 transceiver. ICE9045 is built on 28nm HPM as I mentioned before, and ICE9245 remains 65nm TSMC LP CMOS process which is RF friendly. ICE9045 supports basically all the 3GPP air interfaces, as mentioned in the earlier announcement piece. There’s up to Category 3 LTE on the baseband at launch, with Category 4 in the future. For WCDMA, up to Category 24 (42 Mbps) (dual carrier with 64QAM), and interestingly enough the same Category 18 16 QAM with 2x2 MIMO (28 Mbps) as earlier implemented in Icera 450, and an optional future upgrade to Category 28 64 QAM with 2x2 MIMO (84 Mbps). Of course there’s also TD-SCDMA, GSM/EDGE, full support for voice including AMR-WB and VoLTE/IMS. In addition to 2x2 MIMO the ICE9045 can also do 4x4 MIMO on LTE with a second ICE9245 transceiver.

NVIDIA broke down the ICE9045 functionally, which consists of two large DXP units and one smaller DXP unit which runs the rest of the software and management stack. The two larger DXP units run at up to 1.3 GHz. The Icera instruction set consists of two different fundamental sets. Icera refers to these as the “C” and “D” side, with C being rather obvious. The C side is unsurprisingly a C-complier targeted version of the 3GPP protocol stack, and manages the higher level functions of the modem above physical interface, and is a scalar machine. In the block diagram, the C side runs on DXP1, the D side runs on the larger beefier DXP0 and DXP2 machines.

The D side is a proprietary assembly language vector instruction set that runs the physical layer of the modem, this is a combination of specific libraries that really make up the magic and give the Icera platform its reconfigurability. NVIDIA gave an excellent breakdown of the data paths inside both sides. It turns out that in an LTE configuration one core does all the inverse FFTs and MIMO matrix math, the second core does rate matching and decoding. The ultimate goal is to have each of the cores processing around the same equal workload, and since it’s software these tasks can be shuffled in-between to get the i500 running each core at the lowest possible frequency and voltage. Each of the cores can also be individually power collapsed.

On the transceiver side we have ICE9245 which is again TSMC 65nm LP CMOS. This talks over a digital interface to the baseband, and increases the number of primary RX ports from 6 to 8. All filtering and channelization is onboard the ISP. The goal is to keep all the A2D and D2A on the transceiver and away from the baseband. NVIDIA hasn’t quite given me the frequency breakdown for their low band, high band, and super high band, but you can figure it out just by looking. There’s also diversity ports for the same set of bands. All these ports are created equal and can handle 20 MHz LTE, in addition to inter-band and intra-band CA.

The big question is how well Icera’s i500 will stack up against the MDM9x15s and MDM9x25s of the world. Icera believes its ability to switch between four fundamentally different receiver types on the fly and dedicate processing to different functions based on the cellular geometry at the moment offers it an advantage in the long run.

NVIDIA demonstrated a Phoenix with Tegra 4i cabled up to a base station emulator doing full Category 3 LTE at 100 Mbps nice and steady. I asked whether they could go any higher to demonstrate the headroom in the current i500 architecture for upgrading to Category 4 and was shown just short of 120 Mbps on the two week old Tegra 4i silicon.

Gallery: Phoenix LTE Demonstration

Tegra 4 Performance ISP - Chimera Computational Camera

PRINT THIS ARTICLE

Post Your Comment
Please log in or sign up to comment.

Comments Locked

75 Comments

View All Comments

PingviN - Monday, February 25, 2013 - link
Tegra made an operating loss of $150 million for fiscal year 2012, despite getting into both the Nexus 7 (the refresh coming this year has been lost to Qualcomm) and the Surface RT. Sales prognosis cut almost in half for the fiscal year 2013. To date, Nvidia hasn't had any profit coming out of Tegra and now it's in limbo mode until Tegra 4 is released because Tegra 3 gets smashed by it's competition.

It's been a pretty crappy year for Tegra.
guilmon14 - Tuesday, February 26, 2013 - link
I don't know anything about this company "tegra", but have you heard about Nvidia? I heard they're doing great!

http://nvidianews.nvidia.com/Releases/NVIDIA-Repor...

According to this Nvidia is up in income, revenue, and equity.

If you wanted to check the easy way just look at nvidia's wikipedia page, gives you all the nice money numbers.
http://en.wikipedia.org/wiki/Nvidia
trajan2448 - Monday, February 25, 2013 - link
5 years down the road phones will be cooking our dinner. It's amazing how fast the tech is advancing now.
Scannall - Monday, February 25, 2013 - link
If they don't hustle right along, SOC's with the PowerVR 6 series (Rogue) will beat them to market. And considering their GPU just barely squeaks by the iPad as it is, it will be behind early on.
Khato - Monday, February 25, 2013 - link
Was it specifically stated that the Tegra 4 SPECint/W figure was running on the high speed cores? As is mentioned later on the page, a SPECint2000 of 520 is within reach of the power optimized companion core, so the only reason I'd expect NVIDIA to not use the companion core for this data is if they explicitly stated that it wasn't.

Part of the cause for my suspicion is that the Power vs DMIPS chart that Samsung recently provided for the Exynos 5 Octa shows 8k DMIPS at 1 watt... and from the press coverage back in 2009 for the A9 hard macros there's both the 10k DMIPS at 1.9 watts and 2GHz with the speed speed optimized and 4k DMIPS at 250 mW and 800 MHz for the power optimized. Which equate to 5.26 DMIPS/mW and 8 DMIPS/mW, respectively. Now the 2GHz data point should be even worse off than Tegra 3 and yet it only shows the Samsung Exynos 5 Octa as being 52% more efficient.

Going into estimating rather than published numbers, if we up the efficiency of Tegra 3 a bit compared to that 2GHz figure then it's likely going to be closer to A15 being 30% more efficient... which you then add the known ~40% efficiency bump going from a performance to power implementation and you get the kind of drastic increase NVIDIA is touting.
Wilco1 - Monday, February 25, 2013 - link
It doesn't matter whether they used the 5th core or one of the fast ones. By definition the cross over point is where the 5th core uses as much power as a fast core. Since that is ~800MHz, the power efficiency is the same. The 5th core can likely clock to well over 1GHz, but then it uses more power than a fast core.

You are basically right that some of the 73% MIPS/W improvement comes from the 40-28nm process change. However the combined improvement of process and micro architecture means that you can use the low power core far more often. The 5th core in Tegra 4 is effectively more than 3 times as fast than the one in Tegra 3. So that means lots of tasks which needed 1-2 fast Tegra 3 cores can now run on the 5th Tegra 4 core. That means the power efficiency will actually improve by what NVidia suggests.
Khato - Monday, February 25, 2013 - link
Mind sharing the source for that? The wording in this article implies differently - "That 825MHz mark ends up being an important number, because that’s where the fifth companion Cortex A15 tops out at." Given 1.9GHz for the performance-optimized cores, something around 800 MHz sounds about right for the max frequency of a power-optimized version.

Anyway, there's no question that Tegra 4 will be quite a bit more power efficient simply by virtue of being able able to run more workloads exclusively on the companion core. As said before, in exchange for a much lower cap on maximum frequency a power optimized synthesis gives at least a 40% bump in efficiency... and now that power optimized core will still deliver respectable performance.
Wilco1 - Monday, February 25, 2013 - link
Read http://www.nvidia.com/content/PDF/tegra_white_pape... it explains the difference between leakage and active power on low power and high performance transistors. It explicitly says the 5th core in Tegra is capped at 500MHz as that is where it is as power efficient as a fast core. The graphs and the word capped suggest the 5th core can go faster but there is no point.

Note that Tegra 3 uses a different process with low power transistors for the 5th core rather than a low power synthesis (not that they couldn't have done that too, but it is never mentioned and the 5th core looks pretty much the same in the die plots). I presume Tegra 4 does the same on the 28nm process.
Khato - Tuesday, February 26, 2013 - link
Okay, so your commentary is based on the Tegra 3 which is using an entirely different approach to power savings for the companion core. Note that all of the data I was referencing for the difference in efficiency between ARM's two A9 hard macros was on the same process and hence is more applicable to the case of Tegra 4. As you correctly state, Tegra 3 gains its power efficiency for the companion core by using the LP process rather than a low power synthesis, likely due to it being a simpler and faster route to the desired end result and equally effective for their design goals.

Tegra 4 isn't playing process games for the companion core. How do you gain efficiency on the same process? You loosen timings to allow for the usage of smaller transistors, less flop stages, so on so forth. The end result being that you sacrifice maximum switching speed to reduce both leakage and dynamic power. From all the information that NVIDIA has made available it's a completely different implementation from Tegra 3.
Wilco1 - Tuesday, February 26, 2013 - link
Tegra 4 does exactly the same as Tegra 3. According to NVidia's white paper on Tegra 4 (http://www.nvidia.com/docs/IO/116757/NVIDIA_Quad_a... it also uses low power transistors for the 5th core. Again if you look at the die photos of Tegra 4 all 5 cores are identical just like Tegra 3. So that seems to exclude a different synthesis.

The way NVidia get a low power core is by using low power transistors. TSMC 28nm process supports several different transistor libraries, from high performance high leakage to low performance low leakage. Based on the information we have all they have done is swap the transistor libraries.

NVIDIA Tegra 4 Architecture Deep Dive, Plus Tegra 4i, Icera i500 & Phoenix Hands On

Post Your Comment

75 Comments

View All Comments

PingviN - Monday, February 25, 2013 - link

guilmon14 - Tuesday, February 26, 2013 - link

trajan2448 - Monday, February 25, 2013 - link

Scannall - Monday, February 25, 2013 - link

Khato - Monday, February 25, 2013 - link

Wilco1 - Monday, February 25, 2013 - link

Khato - Monday, February 25, 2013 - link

Wilco1 - Monday, February 25, 2013 - link

Khato - Tuesday, February 26, 2013 - link

Wilco1 - Tuesday, February 26, 2013 - link

Log in

Don't have an account? Sign up now