The GPU

Despite the Denver surprise, the big story behind Tegra K1 is its GPU. Prior to K1, all previous Tegra designs implemented some derivative of what became known as the GeForce ULP core. This was a non-unified architecture that, at times, looked a lot like NV40. The design was never all that impressive from a performance or power efficiency standpoint. It was cost effective and often constrained by a narrow memory interface.

Going into Project Logan, which became Tegra K1, NVIDIA made the decision (around 3 years ago) to abandon the GeForce ULP roadmap and instead combine mobile and PC GPU roadmaps. Tegra K1 would be the first design to leverage a PC GPU, in this case Kepler. The bigger implication is that all future Tegra SoCs will integrate PC GPUs. The even crazier part of all of this is that all future NVIDIA GPUs will start out as mobile first designs (including Maxwell). Productization and market availability may happen in a different order, but all architectures will start as mobile designs and then be adopted to fit other, higher power segments. This is very much like Intel’s mobile-first realization of the mid-2000s with regards to notebook processors, but with NVIDIA and smartphone/tablet GPUs.

Kepler makes the move into mobile largely unchanged. This is a full Kepler implementation with the same size register file, shared L1 and is 100% ISA compatible with its big brother. It turns out that Kepler, as it was originally designed, was pretty good for mobile. If you take a GeForce 740M (2 SMX/384 CUDA core design), you’re looking at roughly a 19W GPU. Of that 19W, around 3W is memory IO, PCIe and other non-GPU things. You can subtract another 6W for leakage, bringing you down to 10W. Now that’s a 2 SMX design, so divide it in half and now you’re down to 5W. Drop the clock from 1GHz down to 900MHz, and the voltage as well, and now we’re talking around 2 - 3W for the GPU core and that’s without any re-architecting. Granted you can’t just subtract out things like leakage like that, but you get the point. Kepler wasn’t a bad starting point for a good mobile GPU design.

Tegra K1 features a single SMX (in a single GPC), which amounts to 192 CUDA cores. NVIDIA made the rookie mistake of calling Tegra K1 a 192-core processor, which made for some great headlines but largely does the industry a disservice.

Tessellation and geometry engines aren’t crippled compared to desktop Kepler. FP64 support is also present, at 1/24 the FP32 rate. There are 4 ROPs and 8 texture units, down from 16 in the PC version of Kepler. The big changes however are in the interconnects between all of the parts of the GPU.

The bigger implementations of Kepler have to be able to efficiently move data between multiple SMXes, ROPs and memory controllers. The interconnect fabric needed to do that doesn’t scale down well for mobile, where in many cases we’re dealing with one or two of those things instead of a dozen. By removing the complexity that exists in the bigger Kepler’s fabric you limit the ability for mobile Kepler to scale, but then again mobile Kepler is never going to scale to the sizes of big desktop GPUs so it’s not an issue. There are other changes outside of interconnect, with improved clock gating among other focuses on power efficiency.

NVIDIA updated the texture units to support ASTC, something that isn’t present in the desktop Kepler variants at this point. NVIDIA also hopes to use the GPU’s color compression features to reduce memory bandwidth requirements in UI rendering and not just 3D games.

With the changes NVIDIA made to the design, Kepler ends up being a < 2W GPU perfect for mobile. NVIDIA provided us with some data showing SoC + DRAM power while running GFXBench 3.0 (Manhattan), an OpenGL ES 3.0 test:

The data is presented in NVIDIA’s usual way where we’re not looking at peak performance but rather how Tegra K1 behaves when normalized to the performance of Apple’s A7 or Qualcomm’s Snapdragon 800. In both cases NVIDIA is claiming the ability to deliver equal performance at substantially better power efficiency.

NVIDIA shared some live demos that echoed the data above. Peak performance was capped to that of the A7 or Snapdragon 800, but SoC level power was always lower. It remains to be seen what power consumption looks like in a shipping configuration (which is almost always optimized for peak performance not equal performance at lower power), but it’s safe to say that concerns about Kepler being too power hungry for mobile are overrated.

The most compelling argument in favor of putting Kepler in a mobile SoC actually has to do with its API support. In one swift move NVIDIA goes from being disappointing in API support to industry leading. Since this is a full Kepler implementation (just a lower power/performing version) Tegra K1 maintains full API compatibility with NVIDIA’s flagship GeForce products. OpenGL ES 3.0 is supported but so are full OpenGL 4.4, DX11 and CUDA 6.0.

NVIDIA made it a point to say that high-end games developed for the PC or even current generation consoles could be ported over to Tegra K1 without issue. It’s perhaps over reaching a bit to claim the latter given the delta in performance (which NVIDIA hopes to make up in 4 generations!), but you can definitely argue that titles built for the previous generation of consoles (Xbox 360/PS3) could easily be ported to Tegra K1.

At its CES press conference NVIDIA teased the idea that Tegra K1 is actually more powerful than the last generation of consoles. The slide below attempts to drive that point home:

With a GPU clock of 950MHz (admittedly, a bit on the high end), NVIDIA can deliver substantially more raw horsepower than either previous generation console (192 CUDA cores * 2 FLOPS per core * 950MHz). Peak texture filtering performance and more importantly, memory bandwidth are lower than what was possible on these consoles but the numbers we’re talking about here aren’t substantial enough to prevent porting from happening. There may be some optimization needed but it definitely looks like Tegra K1 is the first mobile platform that can more or less run Xbox 360/PS3 titles, at least from a performance standpoint.

In pursuit of making porting and game development as simple as possible, NVIDIA demonstrated its NSight Tegra plugins for Visual Studio. Without changing the IDE that developers are used to, NSight Tegra allows developers to use the NDK toolchain all within Visual Studio. I’m not enough of a developer to know whether or not NVIDIA’s efforts in this space truly make life easy enough to port Xbox 360/PS3 games over to Android, but its VS integration demos looked convincing at least.

NVIDIA had a port of Serious Sam 3 running on Tegra K1 demo hardware just fine. Any games that are prepped for Steam OS are very easy to port over to Android. Once you make the move to OpenGL, the rest is allegedly fairly simple. The Serious Sam 3 port apparently took a matter of a couple of weeks to get ported over, with the bulk of the effort going into mapping controls to an Android environment.

CPU Option 2: Dual-Core 64-bit NVIDIA Denver Tegra K1 ISP & Video
POST A COMMENT

88 Comments

View All Comments

  • eddman - Monday, January 6, 2014 - link

    I'm wondering the same thing.

    Xbox and PS are gaming machines, running specialized OSes, and programmers can utilize low-level APIs and extract as much performance as possible.

    Tegra K1 might be powerful, but it'll still be running general purpose OSes like android and windows RT.

    Is there any way to know the performance gain by going low-level vs. high-level APIs for a video game? How much it really is? 5%? 15%? 40%?!!
    Reply
  • Krysto - Monday, January 6, 2014 - link

    Xbox360 and PS3 support DirectX9 and OpenGL ES 2.0+extensions. Many developers can and have made games with those APIs. Not all games on the consoles are "bare metal". So the "overall" difference in gaming, is probably not going to be very different.

    The real "problem" is that, those games will need to come to devices that have 1080p or even 2.5k resolutions, which will cut the graphics performance of the games by 2-4x, compared to Xbox/PS3. This is why I hate the OEMs for being so dumb and pushing resolutions even further on mobile.

    It's a waste of component money (could be used for different stuff instead), of battery life, and also of GPU resources.
    Reply
  • nicolapeluchetti - Monday, January 6, 2014 - link

    I guess really good games use low levels API, i mean GTA V looks amazing and the specs of the X-Box 360 are what they are.
    I agree with you that resolution will be a problem, but actually i really like the added resolution in everyday use. i recently switched from a note 2 to a nexus 5 and the extra resolution is fantastic.
    They will probably have to upscale things, render at 720p and render at 1080p
    Reply
  • Krysto - Tuesday, January 7, 2014 - link

    AMD said Mantle is pretty much bare-metal console API. And they said at their conference at CES that Battlefield 4 with that is 50 percent faster. So the difference is not huge, but significant.

    By far the biggest impact will be made by the resolution of the device. While games on Xbox 360 run at 720p, most devices with Tegra K1 will probably have at least a 1080p resolution, which is twice as many pixels, so it cuts the performance in half (or the graphics quality).
    Reply
  • TheJian - Sunday, January 12, 2014 - link

    Link please, and at what point in the vid do they say it (because some of those vids are 1hr+ for conferences)? I have seen only ONE claim and by a single dev who said you might get 20% if lucky. It is telling that we have NO benchmarks yet.

    But I'm more than happy to read about someone using Mantle actually saying they expect 45-50% IN GAME over a whole benchmark (not some specific operation that might only be used once). But I don't expect it to go ever 20%.

    Which makes sense given AMD shot so low with their comment of "we wouldn't do it for 5%". If it was easy to get even 40% wouldn't you say "we wouldn't do it for 25%"? Reality is they have to spend to get a dev to do this at all, because they gain NOTHING financially for using Mantle unless AMD is paying them.

    I'll be shocked to see BF4 over 25% faster than with it off (I only say 25% for THIS case because this is their best case I'm assuming, due to AMD funding it big time as a launch vehicle). Really I might be shocked at 20% but you gave me such a wide margin to be right saying 50%. They may not even get 20%.

    Why would ANY dev do FREE work to help AMD, and when done be able to charge ZERO over the cost of the game for everyone else that doesn't have mantle? It would be easier to justify it's use if devs could charge Mantle users say $15 extra per game. But that just won't work here. So you're stuck with amd saying "please dev, I know its more work and you won't ever make a dime from it, but it would be REALLY nice for us if you did this work free"...Or "Hi, my name is AMD, here's $8 Million dollars, please use Mantle". Only the 2nd option works at all, and even then you get Mantle being back burner the second the game needs to be fixed for the rest of us (BF4 for instance, all stuff on back burner until BF4 is fixed for regular users). This story is no different than Phsyx etc.
    Reply
  • nicolapeluchetti - Tuesday, January 7, 2014 - link

    Mantle is said to have 45% performance bonus compared to DirectX on Battlfield. Those are the rumours. Reply
  • OreoCookie - Monday, January 6, 2014 - link

    It's great to see that finally the SoC makers are being serious about GPU compute, now it's up to software developers to take advantage of all that compute horsepower. Given Apple's focus on GPU performance in the past, I'm curious to see what their A8 looks like and how it stacks up against Tegra K1 (in particular the Denver version). Reply
  • timchen - Monday, January 6, 2014 - link

    The Denver speculation really needs some justification.

    Doesn't common sense say that the same task is always more power efficient done with hardware rather than software? It would at least need a paragraph or two to explain how OoO or speculative execution or ILC can be more power efficient in software.

    Now if it is just that you need to build different binaries specifically for these cores, it then sounds a lot more like a compute GPU actually-- but as far as I understand so far general tasks are not suitable to run on those configurations, and parallelization for general problems is pretty much a dead horse (similar to P=NP?) now.
    Reply
  • KAlmquist - Monday, January 6, 2014 - link

    That speculation didn't make a lot of sense to me, either.

    One of the reasons that out of order execution improves performance is that cache misses are expensive. In an out of order processor, when a cache miss occurs the processor can defer the instructions that need that particular piece of data, and execute other instructions while waiting for the read to complete. To create "nice bundles of instructions that are already optimized for peak parallelism," you have to know how long each memory read is going to take.

    The writers mention the Transmeta Efficeon processor, which translated x86 instructions to native instructions and then executed them on an in-order processor. That was a fairly effective approach, but doesn't demonstrate that an in-order processor can compete with a modern out of order processor. After all, ARM started out producing in-order processors, which were very energy efficient, but eventually they had to produce an out of order design in order to increase performance without increasing the clock rate.
    Reply
  • Loki726 - Monday, January 6, 2014 - link

    Transmeta didn't have an in-order design in the same way that a normal CPU is in order. See their CGO paper: http://people.ac.upc.edu/vmoya/docs/transmeta-cgo....

    Here's the relevant text:

    "Compilers typically deal with recovery from speculation by generating compensation code, which re-
    executes incorrectly sequenced operations, performs operations omitted from the speculative code path, and
    corrects mismatches in register assignments (Freudenberger et al. [13]). With this approach, hardware
    support is required to defer faults of potentially faulting instructions moved above branches (e.g.,
    boosting,Smith et al. [23]), to detect overlapping memory operations scheduled out of sequence, and to branch to the
    compensation code (e.g., memory conflict buffers, Gallagher et al. [14], or the Intel IA-64 ALAT[18]).

    In contrast, Crusoe native VLIW processors provide an elegant hardware solution that supports arbitrary kinds of
    speculation and subsequent recovery and works hand-in-hand with the Code Morphing Software [8]. All registers
    holding x86 state are shadowed; that is, there exist two copies of each register, a working copy and a shadow
    copy. Normal atoms only update the working copy of the register. If execution reaches the end of a translation, a
    special commit operation copies all working registers into their corresponding shadow registers, committing the
    work done in the translation. On the other hand, if any exceptional condition, such as the failure of one of CMS’s
    translation assumptions, occurs inside the translation, the runtime system undoes the effects of all molecules
    executed since the last commit via a rollback operation that copies the shadow register values (committed at the
    end of the previous translation) back into the working registers.

    Following a rollback, CMS usually interprets the x86 instructions corresponding to the faulting translation, executing
    them in the original program order, handling any special cases that are encountered, and invoking the x86
    exception-handling procedure if necessary.

    Commit and rollback also apply to memory operations. Store data are held in a gated store buffer, from which they
    are only released to the memory system at the time of a commit. On a rollback, stores not yet committed can
    simply be dropped from the store buffer. To speed the common case of no rollback, the mechanism was designed so
    that commit operations are effectively “free”[27], while rollback atoms cost less than a couple of branch mispredictions."
    Reply

Log in

Don't have an account? Sign up now