As someone who analyzes GPUs for a living, one of the more vexing things in my life has been NVIDIA’s Maxwell architecture. The company’s 28nm refresh offered a huge performance-per-watt increase for only a modest die size increase, essentially allowing NVIDIA to offer a full generation’s performance improvement without a corresponding manufacturing improvement. We’ve had architectural updates on the same node before, but never anything quite like Maxwell.

The vexing aspect to me has been that while NVIDIA shared some details about how they improved Maxwell’s efficiency over Kepler, they have never disclosed all of the major improvements under the hood. We know, for example, that Maxwell implemented a significantly altered SM structure that was easier to reach peak utilization on, and thanks to its partitioning wasted much less power on interconnects. We also know that NVIDIA significantly increased the L2 cache size and did a number of low-level (transistor level) optimizations to the design. But NVIDIA has also held back information – the technical advantages that are their secret sauce – so I’ve never had a complete picture of how Maxwell compares to Kepler.

For a while now, a number of people have suspected that one of the ingredients of that secret sauce was that NVIDIA had applied some mobile power efficiency technologies to Maxwell. It was, after all, their first mobile-first GPU architecture, and now we have some data to back that up. Friend of AnandTech and all-around tech guru David Kanter of Real World Tech has gone digging through Maxwell/Pascal, and in an article & video published this morning, he outlines how he has uncovered very convincing evidence that NVIDIA implemented a tile based rendering system with Maxwell.

In short, by playing around with some DirectX code specifically designed to look at triangle rasterization, he has come up with some solid evidence that NVIDIA’s handling of triangles has significantly changed since Kepler, and that their current method of triangle handling is consistent with a tile based renderer.
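
To get a feel for what such a probe reveals, the idea boils down to giving every shaded pixel a global sequence number and then looking at the pattern those numbers form across the screen: an immediate mode GPU sweeps each primitive in submission order, while a tiled GPU finishes one screen tile before moving on to the next. The toy CPU-side sketch below illustrates only that concept; it is not Kanter’s actual DirectX test, and the resolution, tile size, and traversal orders are assumptions made purely for illustration.

```cpp
// Toy model of a rasterization-order probe: give every shaded pixel a
// global sequence number, then inspect the pattern those numbers form.
// (Illustrative only; a real probe would run on the GPU, not as CPU code.)
#include <cstdio>
#include <vector>

constexpr int W = 32, H = 16, TILE = 8;  // made-up sizes for illustration

int main() {
    std::vector<int> immediate(W * H), tiled(W * H);
    int seq = 0;

    // Immediate-mode style: pixels are touched in plain scanline order
    // as the primitive is swept top to bottom.
    for (int y = 0; y < H; ++y)
        for (int x = 0; x < W; ++x)
            immediate[y * W + x] = seq++;

    // Tile-based style: one screen tile is finished before the next starts.
    seq = 0;
    for (int ty = 0; ty < H; ty += TILE)
        for (int tx = 0; tx < W; tx += TILE)
            for (int y = ty; y < ty + TILE; ++y)
                for (int x = tx; x < tx + TILE; ++x)
                    tiled[y * W + x] = seq++;

    // Bucket the sequence numbers into 8 shades; the tiled buffer comes out
    // blocky, while the immediate buffer forms smooth horizontal bands.
    auto dump = [](const char* name, const std::vector<int>& buf) {
        std::printf("%s\n", name);
        for (int y = 0; y < H; ++y) {
            for (int x = 0; x < W; ++x)
                std::putchar(".:-=+*#%"[buf[y * W + x] * 8 / (W * H)]);
            std::putchar('\n');
        }
    };
    dump("immediate-mode order:", immediate);
    dump("tile-based order:", tiled);
    return 0;
}
```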


NVIDIA Maxwell Architecture Rasterization Tiling Pattern (Image Courtesy: Real World Tech)

Tile based rendering is something we’ve seen for some time in the mobile space, with both Imagination’s PowerVR and ARM’s Mali implementing it. The significance of tiling is that by splitting a scene up into tiles, the GPU can rasterize it piece by piece almost entirely on die, as opposed to the more memory (and power) intensive process of rasterizing the entire frame at once via immediate mode rendering. The trade-off with tiling, and why it’s a bit surprising to see it here, is that the PC legacy is immediate mode rendering, and this is still how most applications expect PC GPUs to work. So implementing tile based rasterization on Maxwell means that NVIDIA has found a practical way to overcome the drawbacks of the method and the potential compatibility issues.
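
For a concrete picture of what “splitting a scene up into tiles” means in practice, here is a toy binning pass: each triangle’s screen-space bounding box is tested against a grid of tiles and the triangle is appended to the list of every tile it may touch, so each tile can later be shaded with its whole working set kept on die. The tile size, structs, and names here are illustrative assumptions using a common bounding-box approach, not a description of NVIDIA’s actual implementation.

```cpp
// Toy tile-binning pass: assign each triangle to every screen tile its
// bounding box overlaps, so tiles can later be rasterized independently.
// (Illustrative sketch only -- tile size, structs, and names are made up.)
#include <algorithm>
#include <cstdio>
#include <vector>

struct Vec2 { float x, y; };
struct Triangle { Vec2 v[3]; };

constexpr int SCREEN_W = 1920, SCREEN_H = 1080, TILE = 16;
constexpr int TILES_X = (SCREEN_W + TILE - 1) / TILE;
constexpr int TILES_Y = (SCREEN_H + TILE - 1) / TILE;

// One triangle list ("bin") per tile.
std::vector<std::vector<int>> bins(TILES_X * TILES_Y);

void binTriangle(const Triangle& t, int index) {
    // Conservative test: use the triangle's screen-space bounding box.
    float minX = std::min({t.v[0].x, t.v[1].x, t.v[2].x});
    float maxX = std::max({t.v[0].x, t.v[1].x, t.v[2].x});
    float minY = std::min({t.v[0].y, t.v[1].y, t.v[2].y});
    float maxY = std::max({t.v[0].y, t.v[1].y, t.v[2].y});

    int tx0 = std::max(0, static_cast<int>(minX) / TILE);
    int ty0 = std::max(0, static_cast<int>(minY) / TILE);
    int tx1 = std::min(TILES_X - 1, static_cast<int>(maxX) / TILE);
    int ty1 = std::min(TILES_Y - 1, static_cast<int>(maxY) / TILE);

    for (int ty = ty0; ty <= ty1; ++ty)
        for (int tx = tx0; tx <= tx1; ++tx)
            bins[ty * TILES_X + tx].push_back(index);
}

int main() {
    std::vector<Triangle> scene = {
        {{{10, 10}, {200, 40}, {60, 300}}},    // bounding box spans many tiles
        {{{500, 500}, {510, 505}, {505, 510}}} // fits entirely inside one tile
    };
    for (int i = 0; i < static_cast<int>(scene.size()); ++i)
        binTriangle(scene[i], i);

    // Rasterizing then proceeds tile by tile: each tile only touches its own
    // short list, so its color/depth data can stay in on-die memory.
    int nonEmpty = 0;
    for (const auto& bin : bins) nonEmpty += !bin.empty();
    std::printf("%d of %d tiles received work\n", nonEmpty, TILES_X * TILES_Y);
    return 0;
}
```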

In any case, Real World Tech’s article goes into greater detail about what’s going on, so I won’t spoil it further. But with this information in hand, we now have a more complete picture of how Maxwell (and Pascal) work, and consequently how NVIDIA was able to improve over Kepler by so much. Finally, at this point in time Real World Tech believes that NVIDIA is the only PC GPU manufacturer to use tile based rasterization, which also helps to explain some of NVIDIA’s current advantages over Intel’s and AMD’s GPU architectures, and gives us an idea of what we may see them do in the future.

Source: Real World Tech

Comments

  • kn00tcn - Monday, August 1, 2016 - link

    he was with oculus quite a while before facebook... & it doesn't have to be gospel, just a consideration that motivates people to make experiments, benchmarks, code changes
  • wumpus - Monday, August 1, 2016 - link

    Michael Abrash has been insisting that "latency is everything" in VR, even before he joined Oculus. See http://blogs.valvesoftware.com/abrash/ for plenty about graphics and VR.

    Using tiling is going to add nearly a frame's worth of latency to VR (well, to everything. But nobody cares for non-VR issues). If you need to collect triangles and sort them into tiles, you are going to have wildly greater latency than if you just draw the triangles as they are called in the API. Vulkan/DirectX 12 should help, but only if you are willing to give up a lot of the benefits of tiling.
  • Scali - Tuesday, August 2, 2016 - link

    "Using tiling is going to add nearly a frame's worth of latency to VR"

    Firstly, no... I think you are confusing deferred rendering with tiling.
    Tiling is just the process of cutting up triangles into a tile-grid, and then you can process the tiles independently (even in parallel). This doesn't have to be buffered, it can be done on-the-fly.

    Secondly, latency and frame rendering time are not the same thing. Just because you have to collect triangles first doesn't mean it takes as long as it would to render them.
    It may take *some* time to collect them, there's a tiny bit of extra overhead there (depending also on how much of the process is hardware/cache-assisted). However, it allows you to then render the triangles more efficiently, so in various cases (especially with a lot of overdraw) you might actually complete the tiling+rendering faster than you would render the whole thing immediately. Which actually LOWERS your latency.

    Also, I don't quite see how Vulkan/DX12 would 'help', or how you would have to 'give up a lot of the benefits of tiling' by using these APIs.
    In terms of rendering, they are still doing exactly the same thing: they pass a set of shaders, textures and batches of triangles to the rendering pipeline.
    Nothing changes there. The difference is in how the data is prepared before it is handed off to the driver.

    A lot of people seem to think DX12/Vulkan are some kind of 'magic' new API, which somehow does things completely differently, and requires new GPUs 'designed for DX12/Vulkan' to benefit.
    Which is all... poppycock.
    The main things that are different are:
    1) Lower-level access to the driver/GPU, so resource management and synchronization are now done explicitly in the application, rather than implicitly in the driver.
    2) Multiple command queues allow you to run graphics and compute tasks asynchronously/concurrently/in parallel.

    In terms of rasterization, texturing, shading etc, nothing changed. So nothing that would affect your choice of rasterizing, be that immediate, tiled, deferred, or whatever variation you can think of.
  • wumpus - Tuesday, August 2, 2016 - link

    The whole point about buffering is that it *will* take a full frame. And while you can do it in parallel, it won't speed anything up (you just draw in smaller tiles).

    A simpler argument against my brainfart is that you don't want to display the thing until you are done. So it doesn't matter what order you do it.

    The other thing is that eventually this type of thing can seriously improve latency (especially in VR headsets). What nvidia needs to do is twofold:

    1: create some sort of G-sync 2.0 (or G-sync VR, but I'm sure fanboys will run out and buy g-sync 2.0 200Hz displays or something). This should let them display each *line* as it appears, not just each frame. This will of course be mostly fake, since neither device really works on the line level, but the idea is to get them both in sync up to about quarter screens or so. Drawing the screen a 1/4 at a time will reduce latency by whatever time it takes to gather up the API calls and arrange them for rasterization + 1/4 of a frame (or however large the tiles are. Pretty darn small for high antialiasing and HDR).
    2. Assuming that "gathering up the API calls" takes roughly as long as the rasterizing (it should if they don't share hardware, otherwise they are wasting transistors), then get the engines to break the screen into horizontal strips and send the API calls separately. This should be easy at the high level (just change the vertical resolution and render 4-8 times while "looking" further downward), but likely a royal pain at the low level getting rid of all the "out of sight, out of mind" assumptions about memory allocation and caching. But it buys you absolutely nothing if you can't "race the beam" and I doubt you can do that with current hardware with or without G-sync/free-sync (I can't believe they needed g-sync in the first place).

    It might take a while, but I suspect that the 3rd or 4th generation of VR will absolutely depend on this tech.
  • Scali - Tuesday, August 2, 2016 - link

    "The whole point about buffering is that it *will* take a full frame."

    Firstly, not in terms of time. Buffering the draw calls for a frame is faster than executing the draw calls for a frame.
    Secondly, you are assuming that they buffer a whole frame at a time. But you can buffer arbitrarily large or small subsections of frames, and render out intermediate results (even PowerVR does that, you have to, in order to handle arbitrarily complex frames).

    "And while you can do it in parallel, it won't speed anything up (you just draw in smaller tiles)."

    Actually, it can. If you design the hardware for it, you can split up your triangle into multiple tiles, and then have multiple rasterizers work on individual tiles in parallel. Which would not be too dissimilar to how nVidia currently handles tessellation and simultaneous multi-projection with their PolyMorph-architecture.
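
To make the "tiles processed independently, even in parallel" point concrete, here is a toy CPU sketch in which a few worker threads stand in for parallel rasterizers, each owning a disjoint subset of pre-binned tiles so no synchronization on the framebuffer is needed. This is purely illustrative and says nothing about how nVidia's PolyMorph hardware actually partitions the work.

```cpp
// Toy sketch of rasterizing pre-binned tiles in parallel: each worker owns
// a disjoint set of tiles, so no locking of the framebuffer is required.
// (Illustration only; hardware rasterizers are not std::thread workers.)
#include <cstdio>
#include <thread>
#include <vector>

constexpr int TILES = 64;    // tiles in the frame (made-up number)
constexpr int WORKERS = 4;   // "rasterizers" running in parallel

// Stand-in for shading one tile from its own pre-binned triangle list.
void rasterizeTile(int tile) {
    std::printf("tile %d done\n", tile);
}

int main() {
    std::vector<std::thread> workers;
    for (int w = 0; w < WORKERS; ++w) {
        // Static partitioning: worker w takes every WORKERS-th tile, so the
        // workers never touch the same tile or the same framebuffer region.
        workers.emplace_back([w] {
            for (int t = w; t < TILES; t += WORKERS)
                rasterizeTile(t);
        });
    }
    for (auto& th : workers) th.join();
    return 0;
}
```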
  • wumpus - Tuesday, August 2, 2016 - link

    Great post, but nothing like what I meant.

    The reason for assuming a latency hit was that there are two parts to rendering a frame when tiling: breaking down all the API calls into triangles to figure out which go with which tile, and then running the tiles (this gets a little weird in that, since it is technically not a "deferred renderer", it has to fake that it isn't exactly doing this).

    The only reason you would take a latency hit is if the bit that sorts API calls into tiles can't operate while the rasterizer is rendering each tile. Obviously some API calls will stop things dead (anything that needs previously calculated pixels), but I suspect the sorter can simply mark it and keep sorting tiles.

    It still shouldn't be much of a hit (even if the API never asks for something that hasn't been tiled yet), simply because sorting the tiles should be a quick process, much faster than rasterizing them. Basically, at absolute worst case, it should add one frame times the "length of time to sort into tiles"/"length of time to draw the screen".

    DX12/Vulkan isn't an issue (unless you are playing games creating textures that aren't being directly used, but that likely means writing out some on-chip memory to DRAM, creating the texture when needed, and writing it back to on-chip memory). It simply is an issue of how long it takes to sort tiles, and whether you can increase framerate by overlapping them. I'd be shocked silly if you are claiming that nvidia somehow did this at a significant framerate penalty because it has to sit on its hands while it sorts tiles.
  • Scali - Tuesday, August 2, 2016 - link

    I think the problem is still your idea of 'latency'.
    The 'latency' people worry about in VR is the time between the start of a frame (taking user input, preparing and sending draw calls to the API) and the time that frame is actually displayed on screen.

    The 'latency' in the case of tile rendering is at a few levels lower in the abstraction. Yes, it may be possible that if you want to draw a single triangle, there's a slight bit of extra latency between sending the triangle to the driver and getting it on screen.
    However, the point of tile-based rendering is not to speed up the rendering of a single triangle, but rather to speed up the rendering of entire frames, which is millions of triangles, with lots of textures and complex shaders to be evaluated.

    So the equation for 'latency' in terms of VR is this:
    Total frame latency = tile preparation time + rendering time.

    Now, for an immediate mode renderer, 'tile preparation time' is 0.
    Let's say an immediate mode renderer takes N time rendering, so total frame latency is 0 + N = N.

    The tile-based renderer will have some non-zero tile preparation time, say K > 0.
    However, because of the tiles removing redundant work, and improving overall cache coherency, the rendering time goes down. So its rendering time is L < N.
    Now, it may well be possible that K + L < N.
    In which case, it actually has *lower* total frame latency for VR purposes.

    "Obviously some API calls will stop things dead (anything that needs previously calculated pixels), "

    Again, thinking at the wrong level of abstraction.
    Such calls can only use buffers as a whole. So only the result of complete render passes, not of individual pixels or triangles.
    The whole tile-based rendering doesn't even apply here. All drawing is done (hit a fence) before you would take the next step. This is no different for immediate mode renderers.
  • HollyDOL - Monday, August 1, 2016 - link

    There are lots of notes about this technology coming from mobiles to GPUs... but wasn't tile rasterization implemented on PCs already back in times when mobile phones resembled a brick? Or is the current tile rasterization not related to the old PC one at all?
    Not trying to pick at words, just trying to understand why articles relate it to the mobile implementations rather than the old PC one.
  • jabber - Monday, August 1, 2016 - link

    Yeah...even the Dreamcast used it. The late 1990's was a fun but frustrating time with all sorts of paths for 3D. https://www.youtube.com/watch?v=SJKradGC9ao
  • BrokenCrayons - Monday, August 1, 2016 - link

    The Dreamcast used a PowerVR graphics chip. Those things were tile-based even in PC form and their add-in card did pretty well throwing around original Unreal back in their heyday. When I was still kicking around an aging Diamond Viper V550 (nvidia's TNT graphics chip), I briefly considered a Kyro as an upgrade, but finally settled on a GeForce 256 when the DDR models finally fed them the memory bandwidth that SDR memory couldn't.
