Tegra X1's GPU: Maxwell for Mobile

Going into today’s announcement of the Tegra X1, while NVIDIA’s choice of CPU had been something of a wildcard, the GPU was a known variable. As announced back at GTC 2014, Erista – which we now know as Tegra X1 – would be a future Tegra product with a Maxwell GPU.

Maxwell of course already launched on the PC desktop as a discrete GPU last year in the Maxwell 1 based GM107 and Maxwell 2 based GM204. However despite this otherwise typical GPU launch sequence, Maxwell marks a significant shift in GPU development for NVIDIA that is only now coming to completion with the launch of the X1. Starting with Maxwell, NVIDIA has embarked on a “mobile first” design strategy for their GPUs; unlike Tegra K1 and its Kepler GPU, Maxwell was designed for Tegra from the start rather than being ported after the fact.

By going mobile-first NVIDIA has been able to reap a few benefits. On the Tegra side in particular, mobile-first means that NVIDIA’s latest and greatest GPUs are appearing in SoCs earlier than ever before – the gap between Maxwell 1 and Tegra X1 is only roughly a year, versus nearly two years for Kepler in Tegra K1. But it also means that NVIDIA is integrating deep power optimizations into their GPU architectures at an earlier stage, which for their desktop GPUs has resulted in chart-topping power efficiency, and these benefits are meant to cascade down to Tegra as well.

Tegra X1 then is the first SoC to be developed under this new strategy, and for NVIDIA this is a very big deal. From a feature standpoint NVIDIA gets to further build on their already impressive K1 feature set with some of Maxwell’s new features, and meanwhile from a power standpoint NVIDIA wants to build the best A57 SoC on the market. With everyone else implementing (roughly) the same CPU, the GPU stands to be a differentiator and this is where NVIDIA believes their GPU expertise translates into a significant advantage.

Diving into the X1’s GPU then, what we have is a Tegra-focused version of Maxwell 2. Compared to Kepler before it, Maxwell 2 introduced a slew of new features into the NVIDIA GPU architecture, including 3rd generation delta color compression, streamlined SMMs with greater efficiency per CUDA core, and graphics features such as conservative rasterization, volumetric tiled resources, and multi-frame anti-aliasing. All of these features are making their way into Tegra X1, and for brevity’s sake rather than rehashing all of this we’ll defer to our deep dive on the Maxwell 2 architecture from the launch of the GeForce GTX 980.

For X1 in particular, while every element helps, NVIDIA’s memory bandwidth and overall efficiency increases are going to be among the most important of these improvements, since they address two of the biggest performance bottlenecks facing SoC-class GPUs. Memory bandwidth has long been a bottleneck at higher performance levels and resolutions; while it’s a solvable problem, the usual solution is to build a wider (96-bit or 128-bit) memory bus, which is very effective but also drives up the cost and complexity of the SoC and the supporting hardware. NVIDIA is sticking with a 64-bit memory bus for X1, so memory compression becomes very important for keeping the GPU fed. This, coupled with a generous increase in raw memory bandwidth from the move to LPDDR4, helps to ensure that X1’s more powerful GPU won’t immediately get starved at the memory stage.
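
As a rough illustration of what the LPDDR4 move is worth on a 64-bit bus, the sketch below works out peak theoretical bandwidth from the memory clocks quoted in the specification table later in this article, assuming those are DDR clocks (two transfers per clock) on an 8-byte-wide bus; effective bandwidth is then further helped by whatever the color compression saves.

```cuda
// Peak theoretical memory bandwidth for K1 and X1, treating the quoted
// memory clocks as DDR clocks (2 transfers per clock) on a 64-bit (8-byte)
// bus. Plain host-side arithmetic; compression gains come on top of this.
#include <cstdio>

static double peak_gb_per_s(double ddr_clock_mhz)
{
    // MT/s * bytes per transfer, converted to GB/s
    return ddr_clock_mhz * 2.0 * 8.0 / 1000.0;
}

int main()
{
    std::printf("Tegra K1, LPDDR3 @ 930MHz:  %.1f GB/s\n", peak_gb_per_s(930.0));
    std::printf("Tegra X1, LPDDR4 @ 1600MHz: %.1f GB/s\n", peak_gb_per_s(1600.0));
    return 0;
}
```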

Meanwhile, just about everything that can be said about SoC TDP has been said. TDP is the limiting factor in all modern mobile devices, which means decreased power consumption directly translates into increased performance, especially under sustained loads. Coupled with TSMC’s 20nm SoC process, Maxwell’s power optimizations should further improve NVIDIA’s sustained SoC GPU performance.

Double Speed FP16

Last but certainly not least, X1 will also be launching with a new mobile-centric GPU feature not found on desktop Maxwell. For X1 NVIDIA is implementing what they call “double speed FP16” support in their CUDA cores, which is to say that they are adding support for higher-performance FP16 operations in limited circumstances.

As with Kepler and Fermi before it, Maxwell only features dedicated FP32 and FP64 CUDA cores, and this is still the same for X1. However in recognition of how important FP16 performance is, NVIDIA is changing how they are handling FP16 operations for X1. On K1 FP16 operations were simply promoted to FP32 operations and run on the FP32 CUDA cores; but for X1, FP16 operations can in certain cases be packed together as a single Vec2 and issued over a single FP32 CUDA core.

There are several special cases here, but in a nutshell NVIDIA can pack together FP16 operations as long as they’re the same operation, e.g. both FP16s are undergoing addition, multiplication, etc. Fused multiply-add (FMA/MADD) is also a supported operation here, which is important given how frequently it is used and is necessary for extracting the maximum throughput out of the CUDA cores.
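
To make the Vec2 idea concrete, below is a minimal CUDA sketch of how packed FP16 math is typically expressed through the cuda_fp16.h intrinsics. This is an illustration under assumptions rather than anything NVIDIA has published for X1: it presumes a device exposing native FP16 arithmetic (compute capability 5.3 on X1’s GPU) and a CUDA toolkit new enough to provide the half2 intrinsics.

```cuda
// Minimal sketch of packed (Vec2) FP16 math via CUDA's half2 intrinsics.
// Assumes native FP16 arithmetic support (compute capability 5.3) and a
// toolkit that ships cuda_fp16.h.
#include <cuda_fp16.h>

__global__ void axpy_half2(int n, __half2 a, const __half2* x, __half2* y)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        // One __hfma2 performs two FP16 fused multiply-adds at once --
        // the pair of FP16 values rides through a single FP32 CUDA core.
        y[i] = __hfma2(a, x[i], y[i]);
    }
}
```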

In this respect NVIDIA is playing a bit of catch up to the competition, and overall it’s hard to escape the fact that this solution is a bit hack-ish, but credit where credit is due to NVIDIA for at least recognizing and responding to what their competition has been doing. Both ARM and Imagination have FP16 capabilities on their current generation parts (be it dedicated FP16 units or better ALU decomposition), and even AMD is going this route for GCN 1.2. So even if it only works for a few types of operations, this should help ensure NVIDIA doesn’t run past the competition on FP32 only to fall behind on FP16.

So why are FP16 operations so important? The short answer is that there are a few reasons. FP16 operations are heavily used in Android’s display compositor due to the simplistic (low-precision) nature of the work and the power savings, and FP16 operations are also used in mobile games at certain points. More critical to NVIDIA’s goals however, FP16 can also be leveraged for computer vision applications such as image recognition, which NVIDIA needs for their DRIVE PX platform (more on that later). In all of these cases FP16 does present its own limitations – 16 bits just isn’t very many bits to hold a floating point number – but there are enough cases where it’s still precise enough that it’s worth the time and effort to build in the ability to process it quickly.

Tegra X1 GPU By The Numbers

Now that we’ve covered the X1’s GPU from a feature perspective, let’s take a look at the GPU from a functional unit/specification perspective.

Overall the X1’s GPU is composed of 2 Maxwell SMMs inside a single GPC, for a total of 256 CUDA cores. This compares very favorably to the single SMX in K1, as it means certain per-SMM/SMX resources such as the geometry and texture units have been doubled. Furthermore Maxwell’s more efficient CUDA cores mean that X1 is capable of further extending its lead over Kepler, as we’ve already seen in the desktop space.

NVIDIA Tegra GPU Specification Comparison

                        K1                 X1
CUDA Cores              192                256
Texture Units           8                  16
ROPs                    4                  16
GPU Clock               ~950MHz            ~1000MHz
Memory Clock            930MHz (LPDDR3)    1600MHz (LPDDR4)
Memory Bus Width        64-bit             64-bit
FP16 Peak               365 GFLOPS         1024 GFLOPS
FP32 Peak               365 GFLOPS         512 GFLOPS
Architecture            Kepler             Maxwell
Manufacturing Process   TSMC 28nm          TSMC 20nm SoC

Meanwhile outside of the CUDA cores NVIDIA has also made an interesting move in X1’s ROP configuration. At 16 ROPs the X1 has four times the ROPs of K1, and is consequently comparatively ROP heavy; this is as many ROPs as are on a GM107 GPU, for example. With that said, given NVIDIA’s overall performance goals and their desire to drive 4K displays at 60Hz, there is a definite need to go ROP-heavy to make sure they can push the necessary number of pixels. This also goes hand-in-hand with NVIDIA’s memory bandwidth improvements (efficiency and actual), which will make it much easier to feed those ROPs. This also puts the ROP:memory controller ratio at 16:1, the same ratio as on NVIDIA’s desktop Maxwell parts.
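
As a rough sanity check on the 4K60 goal (and assuming the ~1GHz GPU clock worked out in the next paragraph), the sketch below compares the raw pixel output rate of a 4K60 display against the peak fillrate of 16 ROPs; real workloads blend and overdraw each pixel several times per frame, which is where that headroom and the accompanying memory bandwidth get spent.

```cuda
// Fillrate headroom for a 4K60 target: displayed pixel rate vs. peak ROP
// throughput, assuming 16 ROPs at roughly 1GHz (one pixel per ROP per clock).
#include <cstdio>

int main()
{
    const double output_mpix_s    = 3840.0 * 2160.0 * 60.0 / 1e6; // ~498 Mpix/s shown
    const double peak_fill_mpix_s = 16.0 * 1000.0;                // 16 ROPs * ~1000MHz
    std::printf("4K60 output rate:  %.0f Mpix/s\n", output_mpix_s);
    std::printf("Peak ROP fillrate: %.0f Mpix/s (~%.0fx per displayed pixel)\n",
                peak_fill_mpix_s, peak_fill_mpix_s / output_mpix_s);
    return 0;
}
```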

Finally, let’s talk about clockspeeds and expected performance. While NVIDIA is not officially publishing the GPU clockspeeds for the X1, based on their performance figures it’s easy to figure out. With NVIDIA’s quoted (and promoted) 1 TFLOPS FP16 performance figure for the X1, the clockspeed works out to a full 1GHz for the GPU (1GHz * 256 CUDA cores * 2 FP16 operations per Vec2 * 2 FLOPs per FMA = 1 TFLOPS).
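
For reference, here is the same back-of-the-envelope arithmetic as a tiny host-side sketch (using the figures quoted above, not anything published by NVIDIA), cross-checking both of the table’s peak throughput numbers against that 1GHz clock.

```cuda
// Cross-checking the spec table's peak throughput figures against a 1GHz clock.
#include <cstdio>

int main()
{
    const double cores = 256.0, clock_ghz = 1.0;
    // FP32: 2 FLOPs per FMA per core per clock
    std::printf("FP32 peak: %.0f GFLOPS\n", cores * 2.0 * clock_ghz);
    // FP16: 2 FP16 values per Vec2, 2 FLOPs per FMA
    std::printf("FP16 peak: %.0f GFLOPS\n", cores * 2.0 * 2.0 * clock_ghz);
    return 0;
}
```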

A 1GHz GPU clock is basically a desktop-class clockspeed, and it goes without saying that this is very aggressive for an SoC-class part. We’re going to have to see what design wins X1 lands and what the devices are like, but right now it’s reasonable to expect that mobile devices will only burst to these clockspeeds for short periods of time at best. However NVIDIA’s fixed-platform DRIVE devices are another story; those can conceivably be powered and cooled well enough that the X1’s GPU can hit and sustain these clockspeeds.

Comments

  • juicytuna - Monday, January 5, 2015 - link

    Well said. Apple's advantage is parallel development and time to market. Their GPU architecture is not that much *better* than their competitors'. In fact I'd say that Nvidia has had a significant advantage when it comes to feature set and performance per watt on a given process node since the K1.
  • GC2:CS - Monday, January 5, 2015 - link

    Maybe an advantage in feature set, but performance per watt?

    So if you want to compare, then for example the Xiaomi MiPad consumes around 7.9W when running the GFXBench battery life test, and that is with performance throttled down to around 30.4 fps on-screen. A very similar tablet, the iPad mini with Retina display and its A7 processor (actually a 28nm part!), consumes just 4.3W, and that is running at 22.9 fps for the whole time.

    So I am asking: where is that "class leading" efficiency and "significant advantage when it comes to performance per watt" that nvidia is claiming to achieve? Because I actually don't see anything like that.
  • Yojimbo - Monday, January 5, 2015 - link

    Looking at the gfxbench website, under "long-term performance" I see 21.4 fps listed for the iPad Mini Retina and 30.4 fps listed for the Mi Pad, maybe this is what you are talking about. That is a roughly 40% advantage in performance for the Mi Pad. I can't find anything that says anything about throttling or the number of watts being drawn during this test. What I do see is another category listed immediately below that says "battery lifetime", where the iPad Mini Retina is listed at 303 minutes and the Mi Pad is listed at 193 minutes. The iPad Mini Retina has a 23.8 watt-hour battery and the Mi Pad has a 24.7 watt-hour battery. So this seems to imply that the iPad Mini Retina is drawing about 4.7 watts and the Mi Pad is drawing about 7.7 watts, which comes out to the Mi Pad using about 63% more power. 40% more performance for 63% more power is a much closer race than the numbers you quoted (yours come out to about a 33% increase in performance and an 84% increase in power consumption, which is very different), and one must remember the circumstances of the comparison. Firstly, it is a comparison at different performance levels (this part is fair, since juicytuna claimed that NVIDIA has had a performance per watt advantage); secondly, it is a long-term performance comparison for a particular testing methodology; and lastly and most importantly, it is a whole-system comparison, not just comparing the GPU power consumption or even the SoC power consumption.
  • GC2:CS - Monday, January 5, 2015 - link

    Yeah exactly, when you have two similar platforms with different chips, I think it's safe to say that Tegra pulls significantly more than the A7, because those ~3 additional watts (I don't know where you got your numbers; I know the Xiaomi has 25.46Wh, that the iPad lasts 330 minutes, and A7 iPads also push out T-Rex at around 23 fps since the iOS 8 update) have to go somewhere. What I am trying to say is: imagine how low-powered the A7 is if the entire iPad mini at half brightness consumes 4.7W, and how huge those 3W that more or less come from the SoC actually are.
    You increase the power draw of the entire tablet by over a half, just to get 40% more performance out of your SoC. The Tegra K1 in the MiPad has a 5W TDP, or more than the entire iPad mini! Yet it can't deliver performance that's competitive enough at that power.
    It's like you are a 140 lb man who can lift 100 pounds, but you train a lot until you put on 70 pounds of muscle (pump more power into the SoC) to weigh 210 or more, and you can still only lift like 140 pounds. What a disappointment!

    What I see is a massive increase in power consumption with not-so-massive gains in performance, which is not typical of efficient architectures like nvidia claims the Tegra K1 is.
    That's why I think nvidia just kind of failed to deliver on their promise of "revolution" in mobile graphics.
  • Yojimbo - Monday, January 5, 2015 - link

    I got my benchmark and battery life numbers from the gfxbench.com website, as I said in my reply. I got the iPad's battery capacity from the Apple website. I got the Mi Pad's battery capacity from a review page that I can't find again right now, but looking at other sources it may have been wrong; WCCFtech lists 25.46 W-h like you did. I don't know where you got YOUR numbers. You cannot say they are "two similar platforms" and conclude that the comparison is a fair comparison of the underlying SoCs. Yes the screen resolutions are the same, but just imagine that Apple managed to squeeze an extra 0.5 watts from the display, memory, and all other parts of the system beyond what the "foolish chinesse manufacteurs (sic)" were able to do. Adding this hypothetical 0.5 watts back would put the iPad Mini Retina at 5.2 watts, and the Mi Pad would then be operating at 40% more performance for 48% (or 52%, using the larger battery size you gave for the MiPad) more power usage. Since power usage does not scale linearly with performance, this could potentially be considered an excellent trade-off.

    Your analogy, btw, is terrible. The Mi Pad does not have the same performance as the bulked-up man in your analogy; it has a whole 40% more. Your use of inexact words to exaggerate is also annoying: "I see massive increases in power consumption, with not-so massive gains in performance" and "You increase the power draw by over half just to get 40% more performance". You increase the power by 60% to get 40% more performance. That has all the information. But the important point is that it is not an SoC-only measurement, and so the numbers are very non-conclusive from an analytical standpoint.
  • GC2:CS - Tuesday, January 6, 2015 - link

    What I see from those numbers is the fact that Tegra is nowhere near 50% more efficient than the A7 like nvidia is claiming.

    When the GFXBench battery life test runs, the display and the SoC are the two major power draws, so I thought it reasonable to treat the other power-consuming parts as negligible.
    So the entire iPad mini pulls 4.9W (I don't know why I should add another 0.5W if it doesn't pull that much) and the MiPad pulls 7.9W. Those are your numbers, which actually favor nvidia a bit.

    To show you that there is no way around that fact, I will lower the consumption of the MiPad by a watt, just to favor nvidia even more.

    Now that we have 4.9W and 6.9W for the two tablets, I will subtract around 1.5W for the display power, which should be more or less the same for both tablets.

    So we have 3.4W and 5.4W for everything but the display, and most of this will be the SoC power. And we find that the Tegra K1 uses more or less 50% more power than the A7 for 40% more performance, in a scenario that favors nvidia so much it's extremely unfair.

    And even if we take this absurd scenario and scale the power consumption of the Tegra K1 back quadratically (1.5*(1.4)^(-2)), we still get that even at the A7's level of performance the K1 would consume over 75% of the A7's power for the same performance.
    That is a number that is way, way, way off in favor of nvidia, and yet it still doesn't come close to the "50% more efficient" claim, which would require the K1 to consume just 2/3 the power for the same performance.

    So please tell me how you can assume that increasing the power draw of the ENTIRE tablet by 60%, just to get 40% more GPU performance out of your SoC, which is a SINGLE part, just a subset of total tablet power draw, can be interpreted as nvidia's SoC being more efficient. Because however I spin it, I am not seeing 3x performance and 50% better efficiency from K1 tablets compared to A7 tablets. I see that K1 tablets throttle to nowhere near 3x faster than A7 iPads and they run down their battery significantly faster. And if the same is true for the Tegra X1, I don't know why anybody should be excited about these chips.
  • Yojimbo - Tuesday, January 6, 2015 - link

    You don't think it's possible to save power in any other component of the system than the SOC? I think that's a convenient and silly claim. You can't operate under the assumption that the rest of the two very different systems draw the exact same amount of power and so all power difference comes from the SOC. Obviously if you want to compare SOC power draw you look at SOC power draw. Anything else is prone to great error. You can do lots of very exact and careful calculations and you will probably be completely inaccurate.
  • juicytuna - Monday, January 5, 2015 - link

    That's comparing whole SoC power consumption. There's no doubt Cyclone is a much more efficient architecture than A15/A7. Do we know how much this test stresses the CPU? Can it run entirely on the A7s or is it lighting up all 4 A15s? Not enough data.

    Furthermore, the performance/watt curve on these chips is non-linear, so if the K1 were downclocked to match the performance of the iPad I've no doubt its results would look much more favourable. I suspect that is why they compare the X1 to the A8X at the same FPS rather than at the same power consumption.
  • Jumangi - Monday, January 5, 2015 - link

    No, it should be done on the actual real-world products people can buy. That's the only thing that should ever matter.
  • Yojimbo - Monday, January 5, 2015 - link

    Not if one wants to compare architectures, no. There is no reason why in an alternate universe Apple doesn't use NVIDIA's GPU instead of IMG's. In this alternate universe, NVIDIA's K1 GPU would then benefit from Apple's advantages the same way the Series 6XT GPU benefits in the A8X, and then the point GC2:CS is trying to make, that the K1 is inherently inferior, would, I think, not hold up.
