Maxwell 1 Architecture: The Story So Far

Before we dive into the story and architecture of Maxwell 2, we’d like to spend a bit of time recapping what we’ve seen so far with Maxwell 1 and the GM107 GPU. While both GPUs are distinctly Maxwell, Maxwell 2 is essentially a second, more feature-packed version of the architecture, one that retains all of the base optimizations that went into Maxwell 1, implemented at a larger scale for a larger GPU.

Beginning with the Maxwell family of architectures, NVIDIA embarked on a “mobile first” design strategy for GPUs, marking a significant change in the company’s product design philosophy. The top-down approach that saw high-end desktop-class GPUs launch first has come to an end; NVIDIA has chosen to embrace power efficiency and mobile-friendly designs as the foundation of their GPU architectures, and with Maxwell they have made the complete transition, now designing GPUs bottom-up instead of top-down.

By going mobile first NVIDIA is aiming to address several design considerations at once. First and foremost is the fact that NVIDIA is heavily staking the future of their company on mobile, and that means they need GPU designs suitable for such a market. This mobile first view is primarily focused on SoC-class products – the Tegra family – but it extends even to mobile PC form factors such as laptops, where discrete GPUs can play an important role but face strict thermal requirements. By designing GPUs mobile first, NVIDIA starts with a design that is already suitable for Tegra and can then scale it up as necessary for laptop and desktop GeForce products. Graphics is – as we like to say – embarrassingly parallel, so if you can build one small module then it’s relatively easy to scale up performance by building chips with more modules and tying them together. This is the mobile first philosophy.
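A back-of-the-envelope sketch of that scale-up idea in Python (the SMM counts match NVIDIA’s published GM107/GM204 configurations, but the near-linear 95% scaling efficiency is purely our own assumption for illustration):

```python
def scaled_throughput(modules, units_per_module, scaling_efficiency=0.95):
    """Toy model: total throughput from replicating one small module N times."""
    return modules * units_per_module * scaling_efficiency

# GM107 ships 5 SMMs and GM204 ships 16, each with 128 FP32 CUDA cores
gm107_units = scaled_throughput(5, 128)   # ~608 effective cores
gm204_units = scaled_throughput(16, 128)  # ~1946 effective cores
```

Because the workload is embarrassingly parallel, the efficiency factor stays close to 1 as module count grows, which is exactly what makes the bottom-up design approach viable.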

What this means is that NVIDIA is focused on power efficiency more than ever before. The SoC market is brutal for both the demands placed on the hardware and for the competitive nature of that market, and given the fact that SoCs are so heavily constrained by thermal and power considerations, every bit of power saved can be reinvested in additional performance. This in turn calls for a GPU that is especially power efficient, as it is efficiency that will win the market for NVIDIA.

Maxwell then is an interesting take on NVIDIA’s designs: it does not radically alter NVIDIA’s architecture, but every accommodation has been made to improve energy efficiency. The result is a Kepler-like architecture with a number of small design tweaks, each improving efficiency in some manner. As NVIDIA tells it, there is no single aspect of Maxwell that is disproportionately responsible for the energy improvements; rather, it is the culmination of these small changes. Through these changes NVIDIA has been able to come close to doubling their performance per watt versus Kepler, which is nothing short of amazing given that all of this is being done on the same 28nm process as Kepler.

Starting with the Maxwell 1 SMM, NVIDIA has adjusted their streaming multiprocessor layout to achieve better efficiency. Whereas the Kepler SMX was for all practical purposes a large, flat design with 4 warp schedulers and 15 different execution blocks, the SMM has been heavily partitioned. Physically each SMM is still one contiguous unit, not really all that different from an SMX. But logically the execution blocks which each warp scheduler can access have been greatly curtailed.

The end result is that in an SMX the 4 warp schedulers would share most of their execution resources and work out which warp was on which execution resource for any given cycle. But on an SMM, the warp schedulers are removed from each other and given complete dominion over a far smaller collection of execution resources. No longer do warp schedulers have to share FP32 CUDA cores, special function units, or load/store units, as each of those is replicated across each partition. Only texture units and FP64 CUDA cores are shared.
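To make the partitioning concrete, here is the SMM layout written down as a simple Python data model (unit counts follow NVIDIA’s published Maxwell 1 block diagrams; treat them as illustrative rather than authoritative):

```python
# Each of the four partitions owns its warp scheduler and execution resources
# outright; only texture units and FP64 cores remain shared across the SMM.
SMM = {
    "shared": {"texture_units": 8, "fp64_cores": 4},
    "partitions": [
        {"warp_schedulers": 1, "fp32_cores": 32, "load_store_units": 8, "sfus": 8}
        for _ in range(4)
    ],
}

total_fp32 = sum(p["fp32_cores"] for p in SMM["partitions"])  # 4 x 32 = 128
```

The key point the model captures is ownership: a scheduler in one partition never has to arbitrate with its neighbors for FP32 cores, SFUs, or load/store units.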

Of the changes NVIDIA made to reduce power consumption, this is one of the greatest. Shared resources, though extremely useful when you have the workloads to fill them, do have drawbacks: they waste space and power if not fed, the crossbar to connect all of them is not particularly cheap on a power or area basis, and there is additional scheduling overhead from having to coordinate the actions of the warp schedulers. By forgoing the shared resources NVIDIA loses out on some of the performance benefits of the design, but what they gain in power and space efficiency more than makes up for it.

NVIDIA still isn’t sharing hard numbers on SMM power efficiency, but for space efficiency a single 128 CUDA core SMM can deliver 90% of the performance of a 192 CUDA core SMX at a much smaller size.
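A quick sanity check on what that 90% claim implies per core (the 90% figure is NVIDIA’s; the rest is simple arithmetic):

```python
smm_cores, smx_cores = 128, 192
relative_perf = 0.90  # NVIDIA's claim: an SMM delivers 90% of SMX performance

# Normalizing by core count, each SMM core does ~35% more work than an SMX core
per_core_gain = relative_perf / (smm_cores / smx_cores)
print(round(per_core_gain, 2))  # 1.35
```

In other words, the partitioned design extracts roughly a third more useful work out of each CUDA core, which is where the area savings come from.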

Moving on, along with the SMM layout changes NVIDIA has also made a number of small tweaks to improve the IPC of the GPU. The scheduler has been rewritten to avoid stalls and otherwise behave more intelligently. Furthermore by achieving higher utilization of their existing hardware, NVIDIA doesn’t need as many functional units to hit their desired performance targets, which in turn saves on space and ultimately power consumption.

NVIDIA has also been focused on memory efficiency, for both performance and power reasons, resulting in the L2 cache size being greatly increased. NVIDIA has gone from 256KB of L2 in GK107 to 2MB on GM107, and from 512KB on GK104 to the same 2MB on GM204. This cache size increase reduces the amount of traffic that needs to cross the memory bus, both reducing the power spent on the memory bus and improving overall performance.
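The traffic argument can be sketched with a toy model (the 32-byte DRAM access granularity and the hit rates below are illustrative assumptions, not NVIDIA figures):

```python
def bus_traffic_bytes(requests, l2_hit_rate, line_bytes=32):
    """Bytes that must cross the memory bus after the L2 filters requests."""
    return requests * (1.0 - l2_hit_rate) * line_bytes

# If a larger L2 lifts the hit rate from 50% to 75%, DRAM traffic is halved
before = bus_traffic_bytes(1_000_000, 0.50)
after = bus_traffic_bytes(1_000_000, 0.75)
print(after / before)  # 0.5
```

Since every byte kept on-die is a byte that does not drive the power-hungry external memory interface, even a modest hit-rate improvement pays off twice, in bandwidth and in watts.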

Increasing the amount of cache always represents an interesting tradeoff since cache is something of a known quantity and is rather dense, but it’s only useful if there are memory stalls or other memory operations that it can cover. Consequently we often see cache implemented in relation to whether there are any other optimizations available. In some cases it makes more sense to use the transistors to build more functional units, and in other cases it makes sense to build the cache. The use of 2MB of L2 cache in both GM107 and GM204 – despite the big differences in ROP count and memory bus size – indicates that NVIDIA is settling on 2MB as their new sweet spot for consumer graphics GPUs.

Finally there are the lowest of low-level optimizations: transistor-level optimizations. These are something of a secret sauce for NVIDIA, but they tell us they have gone through their designs at the transistor level, squeezing out additional energy efficiency wherever they could find it. Given that TSMC 28nm is now a very mature process with well understood abilities and quirks, NVIDIA should be able to design and build their circuits to a tighter tolerance now than they could when working on GK107 and GK104 over 2 years ago.

The NVIDIA GeForce GTX 980 Review

Comments

  • squngy - Wednesday, November 19, 2014 - link

    It is explained in the article.

    Because the GTX 980 makes so many more frames, the CPU is worked a lot harder. The W in those charts are for the whole system, so when the CPU uses more power it makes it harder to directly compare GPUs.
  • galta - Friday, September 19, 2014 - link

    The simple fact is that a GPU more powerful than a GTX 980 does not make sense right now, no matter how much we would love to see it.
    See, most folks are still gaming @ 1080, and some of us are moving up to 1440. Under these scenarios, a GTX 980 is more than enough, even if quality settings are maxed out. Early reviews show that it can even handle 4K with moderate settings, and we should expect further performance gains as drivers improve.
    Maybe in a year or two, when 4K monitors become more relevant, a more powerful GPU would make sense. Now they simply don't.
    For the moment, nVidia's movement is smart and commendable: power efficiency!
    I mean, such a powerful card at only 165W! If you are crazy/wealthy enough to have two of them in SLI, you can cut your power demand by 170W, with corresponding gains in temps and/or noise, and a less expensive PSU if you're building from scratch.
    In the end, are these new cards great? Of course they are!
    Does it make sense to upgrade right now? Only if you're running a 5xx or 6xx series card, or if your demands have increased dramatically (multi-monitor set-up, higher res, etc.).
  • Margalus - Friday, September 19, 2014 - link

    A more powerful gpu does make sense. Some people like to play their games with triple monitors, or more. A single gpu that could play at 7680x1440 with all settings maxed out would be nice.
  • galta - Saturday, September 20, 2014 - link

    How many of us demand such power? The ones who really do can go SLI and OC the cards.
    nVidia would be spending billions for a card that would sell thousands. As I said: we would love the card, but it still makes no sense.
    Again, I would love to see it, but in the foreseeable future I won't need it. Happier with noise, power and heat efficiency.
  • Da W - Monday, September 22, 2014 - link

    Here's one that demands such power. I play 3600*1920 using 3 screens, almost 4k, 1/3 the budget, and still useful for, you know, working.
    Don't want sli/crossfire. Don't want a space heater either.
  • bebimbap - Saturday, September 20, 2014 - link

    Gaming at 1080@144, or 1080 with a min fps of 120 for ULMB, is no joke when it comes to GPU requirements. Most modern games max out at 80-90fps on an OC'd GTX 670; you need at least an OC'd GTX 770-780, and I'd recommend a 780 Ti. And though a 24" 1080 monitor might seem "small", you only have so much focus. You can't focus with peripheral vision; you'd have to move your eyes to focus on another piece of the screen. The 24"-27" size seems perfect so you don't have to move your eyes/head much or at all.

    The next step is 1440@144, or a min fps of 120, which requires more GPU than 4K@60. As 1440 is about 2x 1080, you'd need a GPU 2x as powerful. So you can see why NVIDIA must put out a powerful card at a moderate price point: they need it for their 144Hz G-Sync tech and 3D Vision.

    IMO the PPI race isn't as beneficial as higher refresh rates. For TVs, manufacturers are playing this game of misinformation so consumers get the short end of the stick, but having a monitor running at 144Hz is a world of difference compared to 60Hz for me. You can tell just from the mouse cursor moving across the screen. As I age I realize every day that my eyes will never be as good as yesterday, and knowing that, I'd take a 27" 1440p @ 144Hz any day over a 28" 5K @ 60Hz.
  • Laststop311 - Sunday, September 21, 2014 - link

    Well, it all depends on viewing distance. I use a 30" 2560x1600 Dell U3014 to game on currently; since it's larger I can sit further away and still have just as good of an experience as a 24" or 27" that's closer. So you can't just say larger monitors mean you can't focus on it all, because you can just sit at a further distance.
  • theuglyman0war - Monday, September 22, 2014 - link

    The power of the newest technology is and has always been an illusion because the creation of games will always be an exercise in "compromise". Even a game like WOW that isn't crippled by console consideration is created by the lowest common denominator demographic in the PC hardware population. In other words... ( if u buy it they will make it vs. if they make it I will upgrade ). Besides the unlimited reach of an openworld's "possible" textures and vtx counts.
    "Some" artists are of the opinion that more hardware power would result in a less aggressive graphic budget! ( when the time spent wrangling a synced normal mapped representation of a high resolution sculpt or tracking seam problems in lightmapped approximations of complex illumination with long bake times can take longer than simply using that original complexity ). The compromise can take more time than if we had hardware that could keep up with an artist's imagination.
    In which case I gotta wonder about the imagination of the end user that really believes his hardware is the end to any graphics progress?
  • ppi - Friday, September 19, 2014 - link

    On desktop, all AMD needs to do is to lower price and perhaps release OC'd 290X to match 980 performance. It will reduce their margins, but they won't be irrelevant on the market, like in CPUs vs Intel (where AMD's most powerful beasts barely touch Intel's low-end, apart from some specific multi-threaded cases)

    Why so simple? On desktop:
    - Performance is still #1 factor - if you offer more per your $, you win
    - Noise can be easily resolved via open air coolers
    - Power consumption is not such a big deal

    So ... if AMD card at a given price is as fast as Maxwell, then they are clearly worse choice. But if they are faster?

    In mobile, however, they are screwed big time, unless they have something REAL good up their sleeve. (Looking at Tonga, I do not think they do; I am convinced AMD intended to pull off another HD5870 (i.e. be on the new process node first), but it apparently did not work this time around.)
  • Friendly0Fire - Friday, September 19, 2014 - link

    The 290X already is effectively an overclocked 290 though. I'm not sure they'd be able to crank up power consumption reliably without running into heat dissipation or power draw limits.

    Also, they'd have to invest in making a good reference cooler.
