NVIDIA Tegra K1 Preview & Architecture Analysis

Name: NVIDIA Tegra K1 Preview & Architecture Analysis
Item: NVIDIA Tegra K1 Preview & Architecture Analysis

by Brian Klug & Anand Lal Shimpi on January 6, 2014 6:31 AM EST

88 Comments | Add A Comment

88 Comments

Final Words

NVIDIA’s challenge with Tegra has always been getting design wins. In the past NVIDIA offered quirky alternatives to Qualcomm, most of the time at a more attractive price point. With Tegra K1, NVIDIA offers a substantial feature and performance advantage thanks to its mobile Kepler GPU. I still don’t anticipate broad adoption in the phone space. If NVIDIA sees even some traction among Android tablets that’s enough to get to the next phase, which is trying to get some previous generation console titles ported over to the platform.

NVIDIA finally has the hardware necessary to give me what I’ve wanted ever since SoC vendors first started focusing on improving GPU performance: the ability to run Xbox 360 class titles in mobile. With Tegra K1 the problem goes from being a user interface, hardware and business problem to mostly a business problem. Android support for game controllers is reasonable enough, and K1 more or less fixes the hardware limitations, leaving only the question of how do game developers make enough money to justify the effort of porting. I suspect if we’re talking about moving over a library of existing titles that have already been substantially monetized, there doesn’t need to be all that much convincing. NVIDIA claims it’s already engaged with many game developers on this front, but I do believe it’ll still be an uphill battle.

If I were in Microsoft’s shoes, I’d view Tegra K1 as an opportunity to revolutionize my mobile strategy. Give users the ability to run games like Grand Theft Auto V on a mobile device in the not too distant future and you’ve now made your devices more interesting to a large group of users. I don’t anticipate many wanting to struggle to play console games on a 5-inch touchscreen, but with a good controller dock (or tablet with a kickstand + wireless controller) the interface problem goes away.

For the first time I’m really excited about an NVIDIA SoC. It took the company five generations to get here, but we finally have an example of NVIDIA doing what it’s really good at (making high performance GPUs) in mobile. NVIDIA will surely update its Tegra Note 7 to a Tegra K1 version (most of its demos were run in a Tegra Note 7 chassis), but even if that and Shield are the best we get the impact on the rest of the market will be huge. With Tegra K1, NVIDIA really raised the bar on the GPU performance.

The CPU side, at least for the Cortex A15 version is less interesting to me. ARM’s Cortex A15, particularly at high clocks, has proven to be a decent fit for tablets. I am curious to see how the Denver version of Tegra K1 turns out. If the rumors are true, Denver could very well be one of the biggest risks we’ve seen taken in pursuit of building a low power mobile CPU. I am eager to see how this one plays out.

Finally: two big cores instead of a silly number of tiny cores

NVIDIA hasn’t had the best track record of meeting shipping goals on previous Tegra designs. I really hope we see Tegra K1, particularly the Denver version, ship on time (although I'm highly doubtful this will happen - new custom CPU core, GPU and process all at the same time?). I’m very eager not only for an install base of mobile devices with console-class GPUs to start building, but also to see what Denver can do. I suspect we’ll find out more at GTC about the latter. It’s things like Tegra K1 that really make covering the mobile space exciting.

Tegra K1 ISP & Video

PRINT THIS ARTICLE

Post Your Comment
Please log in or sign up to comment.

Comments Locked

88 Comments

View All Comments

name99 - Monday, January 6, 2014 - link
This is not especially new (though it might have been in Transmeta's time).

Given the existence of robust and generally accurate branch prediction, a number of architectures have been proposed that are based on checkpoints and rollbacks rather than a ROB. There are a number of ways you can slice this, with the newest, richest, ideas having names like CFP (Continuous Flow Processing) and DOE (Distributed OutOfOrder Execution), both created by folks with Intel affiliations.

What these architectures do is help you with long memory latency delays because (in spite of what the above author said) OoO doesn't help much there. OoO covers L1 delays, most L2 delays, some L3 delays if you're lucky, and very little of the main memory delay. That's why prefetching is still an active area of research (e.g. there were some minor but cute improvements to prefetch in Ivy Bridge). The problem is the length of the ROB limits how far you can cover latency in a ROB architecture, and you can't make the ROB much larger because that increases the size (and slows down) the register file. Checkpoint architectures are not constrained in this way.

HOWEVER all this is neither here nor there.
There are three interesting claims being made about Denver
- it uses a checkpoint architecture. Interesting if true, because this type of architecture has the potential to be the general replacement for ROB OoO; even if the first implementation is only equivalent of ROB OoO, there are many new optimizations it opens up
- it uses some sort of "Code Morphing". Who knows WTF this means. Could be anything from rewriting ARM assembly to an internal ISA (like Apple have done many times, from 68K->PPC to Rosetta; likewise DEC did this to run x86 binaries on Alpha) to PPro style µOps to something very minor like the way POWER "cracks" a few instructions to simpler instructions.
- it is "7-wide". If this is an issue width, it's a bullshit measure that no-one who knows anything cares about. If this is a Decode/Rename/Dispatch width, it is a major leap forward, and the only likely way it is doable at such low power is through use of a trace cache which records dependency and remap information. If nVidia has this, it would be very cool.

Given that this is nVidia, my betting would be that every one of these is underwhelming. The exciting checkpoint architecture is in fact a standard ROB (with standard ROB limitations). The code morphing is minor cracking of a few "hard" instructions. The 7-wide refers to issue width so, ho-hum.
Loki726 - Tuesday, January 7, 2014 - link
"This is not especially new."

Agreed. I mainly posted it for reference in case someone had not seen it before.
Da W - Monday, January 6, 2014 - link
For that matter i would prefer a Kabini surface mini and for AMD to follow Nvidia in game streaming (from PC or from Xbox one).
chizow - Monday, January 6, 2014 - link
Great write-up guys, you're right, this is the most exciting announcement I've seen in the CPU/GPU/SoC space in a very long time, similar to A7 Cyclone but 2x that due to both CPU and GPU bombshells. It's probably the first analysis I've read in full because everything was just that interesting relative to what the rest of the industry is doing.

One burning question that I did not see touched upon at all, here or elsewhere:

****What does Tegra K1 do for Nvidia's Kepler IP tech licensing prospects?

It seems to me, even if Tegra itself is not a smash hit for Nvidia in terms of design wins, the GPU technology is so disruptive that even if it gets into a few major designs (Surface 3, Nexus 7 2014, Asus Transformer for example) it may very well *FORCE* the other major industry players (Intel, Samsung, Apple) that don't have their own in-house graphics IP to license Kepler to remain competitive?

What do you all think? Any buzz on that front at CES?
OreoCookie - Friday, January 10, 2014 - link
As far as I can tell, nVidia only compared the GPU performance of the A7 to Tegra K1 but not the CPU performance. I'd be very curious to see how the Denver cores compare to Apple's Cyclone cores, though.

Also, given Tegra's release date, it'll compete with Apple's A8.
Krysto - Saturday, January 11, 2014 - link
Based on the (limited) technical description and how massive those cores are, along with clock speeds that are almost twice as high as what Apple typically uses, I'd say they will beat Apple's A8 (probably just an upgraded Cyclone) pretty easily - unless Nvidia did something stupid with that software translation that adds too much overhead and and cuts the performance too much.

But since we don't know exactly what's going on inside of those CPU cores, we'll have to wait for more details or a direct comparison (and hopefully Denver actually arrives this fall, and not next year).
OreoCookie - Sunday, January 12, 2014 - link
Initially, I thought so, too, but knowing it's a Transmeta Crusoe-like design, I'd be much more cautious about performance. At the same clockspeed, the Crusoe was about half or a third as fast as a Pentium III. The advantage was that the cpus consumed much less power.

Of course that tells us nothing of a comparison between the A7 or A8 and a Denver-based K1 other than that the architectures are not directly comparable.
name99 - Monday, January 6, 2014 - link
"
We’ve seen code morphing + binary translation done in the past, including famously in Transmeta’s offerings in the early 2000s, but it’s never been done all that well at the consumer client level.
"

Actually we've seen a few different versions of it which have worked just fine.
One obvious example (not consumer, but transparent) was IBM's switch over from custom cores to POWER cores for i-Series.
More on the consumer end, Apple have been doing this for years if you use OpenCL on their products --- they convert, on the fly, a byte code version of the GPU instructions to the target GPU. And of course anything that uses a JIT, whether it's targeting Java or JS (or Dalvik for that matter) is doing a similar sort of thing.

There may be uniquely painful aspects to doing this for x86/Windows, especially 15 years ago, but I don't think Transmeta's failure tells us anything --- this mainstream-ish tech. Especially now, in a world with hypervisors, where you have a more well-defined "space" for control code to run and bring up the OS step by step.
ruthan - Tuesday, January 7, 2014 - link
Ok, they maybe have enough GPU performance in this chip on paper. But how is final TDP SOC power consumation for 64 bit piece?
But if you want to have realy PS3 or Xbox performance, which was advertised / promised till original Ipad and we still arent here at all.
Other problem are game engines middleware performance, because 80% of mobile games using Unity3D engine, which in by my experience, much more HW resources greedy and inefficient (C# - has automatic garbage collection, all in Unity running in single thread, GUI performance is terrible, PhysX implementation is signle thread) that, that console developement kits.
Back into problem, GPU is maybe ok, but for final overall performance you need also CPU with desktop like performance and to freed GPU with data and im dont think so that these weak ARM is nearly here.

So in overall i dont agree with these big perfromance and desktop like performance promises at all, would be ok, but it is only empty words.
kwrzesien - Tuesday, January 7, 2014 - link
I think nVidia has finally done it with a great SoC/GPU! I hope they get a few very solid design wins, it could change alot.

Looking at those beautiful chip diagrams I think they have the CPU/CPU balance just right.

NVIDIA Tegra K1 Preview & Architecture Analysis

Final Words

Post Your Comment

88 Comments

View All Comments

name99 - Monday, January 6, 2014 - link

Loki726 - Tuesday, January 7, 2014 - link

Da W - Monday, January 6, 2014 - link

chizow - Monday, January 6, 2014 - link

OreoCookie - Friday, January 10, 2014 - link

Krysto - Saturday, January 11, 2014 - link

OreoCookie - Sunday, January 12, 2014 - link

name99 - Monday, January 6, 2014 - link

ruthan - Tuesday, January 7, 2014 - link

kwrzesien - Tuesday, January 7, 2014 - link

Log in

Don't have an account? Sign up now