Building an Optimized Rasterizer for Larrabee

We've touched on the latency focus. We talked about caches and internal memory buses. But what about external memory? To be honest, the answer is that we don't know. But we have an idea of the direction Intel wants to move in. Lower external bandwidth and possibly a smaller framebuffer than traditional hardware seem to be the goal. If Intel can maintain good performance, reducing the amount of memory and the number of traces on the board will reduce the cost to add-in card vendors who may want to sell cards based on Larrabee (and in turn could reduce cost to the end user).

This bit of speculation isn't just based on what we know about the hardware so far. It's also based on the direction Intel decided to take with its rasterizer: Intel is implementing a tile based rasterizer to support DirectX and OpenGL as well as their own software renderer. Speaking of their software renderer, they did state that it would be available for use by developers so that they don't have to start from nothing. When we asked whether it would be available only as a set of binaries or as source, the answer was that this was still under discussion. We put in our two cents and suggested that distributing the source is the way to go.

Anyway, we haven't discussed tile based rasterization in quite a while on AnandTech as the Kyro line didn't stick around on the desktop. To briefly run it down, screen space is broken up into tiles. For each tile, primitives (triangles) are set aside. Fragments are created for a tile based on all the geometry therein. Since none of these fragments are further processed or shaded until the entire tile is finished, only visible fragments are sent on to be shaded (at least, this is how it used to be: some aspects of DX10+ may require occluded fragments to hang around in some cases). Occluded fragments are thrown out during rasterization. Intel does also support Z culling at geometry, fragment and pixel levels, which is also very useful as the actual rasterization, blending etc. must occur in software as well. Cutting down work at every point possible is the modus operandi of optimizing graphics.
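To make the first step of that rundown concrete, here is a minimal sketch of how triangles might be set aside per tile. It is our own illustration and not Intel's code: the 64-pixel tile size, the `bin_triangles` name, and the use of a simple bounding-box overlap test are all assumptions. A bounding box is conservative, so a triangle can land in a tile it doesn't actually touch; a real binner would likely refine that.

```python
from collections import defaultdict

TILE = 64  # hypothetical tile edge in pixels, not a Larrabee figure


def bin_triangles(triangles, width, height, tile=TILE):
    """Return {(tile_x, tile_y): [triangle indices]} using bounding boxes.

    Each triangle is a list of three (x, y) screen-space points.
    """
    bins = defaultdict(list)
    for idx, tri in enumerate(triangles):
        xs = [x for x, _ in tri]
        ys = [y for _, y in tri]
        # Skip triangles whose bounding box lies entirely off screen.
        if max(xs) < 0 or max(ys) < 0 or min(xs) >= width or min(ys) >= height:
            continue
        # Clamp the bounding box to the screen, then convert to tile indices.
        x0 = int(max(min(xs), 0)) // tile
        x1 = int(min(max(xs), width - 1)) // tile
        y0 = int(max(min(ys), 0)) // tile
        y1 = int(min(max(ys), height - 1)) // tile
        for ty in range(y0, y1 + 1):
            for tx in range(x0, x1 + 1):
                bins[(tx, ty)].append(idx)
    return dict(bins)
```

Once every triangle has been binned, each tile can be rasterized on its own using only the triangles in its list.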

This is in stark contrast to immediate mode renderers, which are what ATI and NVIDIA have been building for the past decade. Immediate mode rendering requires more memory bandwidth than tile based rendering because it processes every fragment in the scene, sometimes even fragments that aren't visible and can't easily be thrown out by pre-shading depth test techniques. Immediate mode renderers have some tricks that can let them know what fragments will be visible in the scene to help cut down on work, but there are still cases where the GPU does extra work that it doesn't need to because the fragment it is processing and shading isn't even visible in the scene. The trade-off is that some algorithms and features have been easier to implement with immediate mode.
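A toy model shows where the extra work comes from. The sketch below is ours and heavily simplified: it assumes opaque fragments only, a "nearer is smaller" depth convention, and an immediate mode renderer whose only trick is an early depth test against what has been drawn so far. Both function names are hypothetical.

```python
def immediate_shaded(fragments):
    """Count fragments shaded when each one is tested and shaded as it arrives.

    fragments: iterable of (x, y, depth) in submission order.
    """
    depth = {}
    shaded = 0
    for x, y, z in fragments:
        if z < depth.get((x, y), float("inf")):
            depth[(x, y)] = z
            shaded += 1  # shaded now, even if something nearer arrives later
    return shaded


def deferred_shaded(fragments):
    """Count fragments shaded when visibility is resolved before any shading."""
    nearest = {}
    for x, y, z in fragments:
        if z < nearest.get((x, y), float("inf")):
            nearest[(x, y)] = z
    return len(nearest)  # one shaded fragment per covered pixel


# Three opaque fragments on one pixel, submitted back to front.
back_to_front = [(0, 0, 0.9), (0, 0, 0.5), (0, 0, 0.1)]
print(immediate_shaded(back_to_front), deferred_shaded(back_to_front))
```

In this model the immediate mode path only matches the deferred path when geometry happens to arrive front to back, which is why submission order matters so much more on immediate mode hardware.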

STMicro had a short run of popular tile (or deferred) renderers in the early 2000s with the Kyro series. This style of rendering still lives on in cell phone/smart phone and other ultra low power devices that need graphics. While performance on this hardware is very low, memory efficiency is important in this space and thus tile based renderers are preferred.

The technique dropped out of the desktop space not because it was inherently unable to perform, but simply because the players that won out in the era didn't choose to make use of it. With smaller process technology, larger on die cache sizes, larger tile sizes, and smaller geometry (meaning fewer triangles span multiple tiles), some advantages of tile based rendering have gotten ... well, more advantageous with advancements in technology.

Getting into the details of tile based rendering is a bit beyond where we want to go right now. But the point is that with this technique, fewer occluded fragments end up being shaded. Additionally, the grouping of fragments into tiles helps with breaking up the workload and could help to optimize prefetching and caching so that fragments are only ever fetched once from external memory (tiles on Larrabee will fit into less than half the L2 space per core). These and other features help to reduce bandwidth needs compared to immediate mode renderers.
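For a rough feel of what "fits in L2" means, the footprint of a tile is just its pixel count times the bytes stored per pixel. The numbers in this sketch are our assumptions for illustration, not confirmed Larrabee parameters: 256 KB of L2 per core, 4 bytes of color and 4 bytes of depth per pixel, and hypothetical square tile sizes.

```python
L2_PER_CORE = 256 * 1024  # assumed L2 bytes per core, for illustration only


def tile_bytes(tile_w, tile_h, color_bytes=4, depth_bytes=4):
    """Bytes needed to hold one tile's color and depth data."""
    return tile_w * tile_h * (color_bytes + depth_bytes)


def l2_fraction(tile_w, tile_h, l2_bytes=L2_PER_CORE):
    """Fraction of the assumed per-core L2 that one tile occupies."""
    return tile_bytes(tile_w, tile_h) / l2_bytes


for edge in (32, 64, 128):  # hypothetical tile edges in pixels
    print(edge, tile_bytes(edge, edge), l2_fraction(edge, edge))
```

Under these assumptions the tile size is a tuning knob: larger tiles mean fewer triangles straddle tile boundaries, but they leave less cache for everything else a core is working on.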

Looking a little deeper, it is both the burden and advantage of Larrabee that it implements all steps of the traditional graphics pipeline in software. While current GPUs have hardware for geometry setup, rasterization, texturing, filtering, compressing, decompressing, blending and much more, Larrabee keeps only a minimum of fixed function features (related to texturing). Often, for a specific purpose, fixed function hardware can be more efficient and faster than general purpose hardware. But at the same time, the needs of individual games shift, and being able to allocate more or fewer resources to a specific component of the rendering pipeline does have advantages over fixed function hardware. Current GPUs can't shift resources to offer faster rasterization if needed. They can't devote more flops to speeding up stenciling or blending.

The flexibility of Larrabee allows it to best fit any game running on it. But keep in mind that just because software has a greater potential to better utilize the hardware, we won't necessarily see better performance than today's GPUs deliver. The burden is still on Intel to build a part that offers real-world performance that matches or exceeds what is currently out there. Efficiency and adaptability are irrelevant if real performance isn't there to back it up.

101 Comments

  • DerekWilson - Monday, August 4, 2008 - link

    this is a pretty good observation ...

    but no matter how much potential it has, performance in games is going to be the thing that actually makes or breaks it. it's of no use to anyone if no one buys it. and no one is going to buy it because of potential -- it's all about whether or not they can deliver on game performance.
  • Griswold - Monday, August 4, 2008 - link

    Well, it seems you dont get it either.
  • helms - Monday, August 4, 2008 - link

    I decided to check out the development of this game I heard about ages ago that seemed pretty unique not only the game but the game engine for it. Going to the website it seems Intel acquired them at the end of February.

    http://www.projectoffset.com/news.php
    http://www.projectoffset.com/technology.php

    I wonder how significant this is.
  • iwodo - Monday, August 4, 2008 - link

    I forgot to ask, how will the Software Render works out on Mac? Since all Direct X code are run to Software renderer doesn't that fundamentally mean most of the current Windows based games could be run on Mac with little work?
  • MamiyaOtaru - Monday, August 4, 2008 - link

    Not really. Larrabee will be translating directx to its software renderer. But unless Microsoft ports the directX API to OSX, there will be nothing for Larrabee to translate.
  • Aethelwolf - Monday, August 4, 2008 - link

    I wonder if game devs can write their games in directx then have the software renderer convert it into larrabee's ISA on windows platform, capturing the binary somehow. Distribute the directx on windows and the software ISA for mac. No need for two separate code paths.
  • iwodo - Monday, August 4, 2008 - link

    If anyone can just point out the assumption anand make are false? Then it would be great, because what he is saying is simply too good to be true.

    One point to mention the 4Mb Cache takes up nearly 50% of the die size. So if intel could rely more on bandwidth and saving on cache they could put in a few more core.

    And am i the only one who think 2010 is far away from Introduction. I think 2009 summer seems like a much better time. Then they will have another 6 - 8 months before they move on to 32nm with higher clock speed.

    And for the Game developers, with the cash intel have, 10 Million for every high profile studio like Blizzard, 50 Million to EA to optimize for Intel. It would only cost them 100 million of pocket money.
  • ZootyGray - Monday, August 4, 2008 - link

    I was thinking of all the p90's I threw away - could have made a cpu sandwich, with a lil peanut software butter, and had this tower of babel thing sticking out the side of the case with a fan on top, called lazarus, or something - such an opportunity to utilize all that old tek - such imagery.

    griswold u r funny :)
  • Griswold - Monday, August 4, 2008 - link

    You definitely are confused. Time for a nap.
  • paydirt - Monday, August 4, 2008 - link

    STFU Griswald. It's not helpful for you to grade every comment. Grade the article if you like... Anandtech, is it possible to add an ignore user function for the comments?
