Building an Optimized Rasterizer for Larrabee

We've touched on the latency focus. We talked about caches and internal memory busses. But what about external memory? To be honest, the answer is that we don't know. But we have an idea of the direction Intel wants to move in. Lower external bandwidth and possibly a smaller framebuffer than traditional hardware seem to be the goal. If Intel can maintain good performance, reducing the amount of memory and the number of traces on the board will reduce the cost to add-in card vendors who may want to sell cards based on Larrabee (and in turn could reduce cost to the end user).

This bit of speculation isn't just based on what we know about the hardware so far. It's also based on the direction Intel decided to take with its rasterizer: Intel is implementing a tile based rasterizer to support DirectX and OpenGL as well as its own software renderer. Speaking of that software renderer, Intel did state that it would be available for use by developers so that they don't have to start from nothing. When we asked whether it would be available only as a set of binaries or as source, the answer was that this was still under discussion. We put in our two cents and suggested that distributing the source is the way to go.

Anyway, we haven't discussed tile based rasterization in quite a while on AnandTech as the Kyro line didn't stick around on the desktop. To briefly run it down, screen space is broken up into tiles. For each tile, primitives (triangles) are set aside. Fragments are created for a tile based on all the geometry therein. Since none of these fragments are further processed or shaded until the entire tile is finished, only visible fragments are sent on to be shaded (at least, this is how it used to be: some aspects of DX10+ may require occluded fragments to hang around in some cases). Occluded fragments are thrown out during rasterization. Intel does also support Z culling at geometry, fragment and pixel levels, which is also very useful as the actual rasterization, blending etc. must occur in software as well. Cutting down work at every point possible is the modus operandi of optimizing graphics.
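The binning step described above can be sketched in a few lines. This is only an illustration under our own assumptions, not Intel's implementation: the function name `bin_triangles`, the 64-pixel tile edge, and the use of a bounding box test are all hypothetical choices made for the example. A bounding box is conservative, so a triangle may be listed for a tile it only nearly touches.

```python
import math
from collections import defaultdict


def bin_triangles(triangles, width, height, tile_size=64):
    """Map each tile (tx, ty) to the indices of triangles that may touch it.

    triangles: list of three (x, y) screen-space vertices per triangle.
    tile_size: assumed tile edge in pixels (illustrative, not a Larrabee figure).
    """
    bins = defaultdict(list)
    max_tx = (width - 1) // tile_size
    max_ty = (height - 1) // tile_size
    for index, tri in enumerate(triangles):
        xs = [x for x, _ in tri]
        ys = [y for _, y in tri]
        # Skip triangles whose bounding box lies entirely off screen.
        if max(xs) < 0 or max(ys) < 0 or min(xs) >= width or min(ys) >= height:
            continue
        # Clamp the bounding box to the range of valid tile coordinates.
        tx0 = max(0, math.floor(min(xs)) // tile_size)
        tx1 = min(max_tx, math.floor(max(xs)) // tile_size)
        ty0 = max(0, math.floor(min(ys)) // tile_size)
        ty1 = min(max_ty, math.floor(max(ys)) // tile_size)
        for ty in range(ty0, ty1 + 1):
            for tx in range(tx0, tx1 + 1):
                bins[(tx, ty)].append(index)
    return dict(bins)


triangles = [
    ((10, 10), (20, 10), (10, 20)),  # sits inside a single tile
    ((60, 60), (70, 60), (60, 70)),  # straddles a tile corner
]
bins = bin_triangles(triangles, width=128, height=128)
```

Once every tile has its own triangle list, each tile can be rasterized and depth-resolved independently before any shading happens.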

This is in stark contrast to immediate mode renderers, which are what ATI and NVIDIA have been building for the past decade. Immediate mode rendering processes every fragment in the scene as it is submitted, sometimes even those that aren't visible (and that can't easily be thrown out by pre-shading depth test techniques). Immediate mode renderers have some tricks that let them know which fragments will be visible in the scene to help cut down on work, but there are still cases where the GPU shades a fragment that never appears on screen. As a result, immediate mode renderers require more memory bandwidth than tile based renderers, but some algorithms and features have been easier to implement with immediate mode.
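To make the difference concrete, here is a toy count of shader invocations under each approach. It is a simplified model built on our own assumptions: all geometry is opaque, smaller depth values are closer, the immediate mode renderer has an early depth test, and the function names are made up for the example. It does not describe any specific GPU.

```python
def shaded_immediate(fragments):
    """Count fragments shaded when each one is depth tested and shaded on arrival.

    fragments: iterable of (pixel, depth) pairs in submission order.
    """
    depth = {}
    shaded = 0
    for pixel, z in fragments:
        # The early depth test only rejects fragments behind what is already drawn.
        if z < depth.get(pixel, float("inf")):
            depth[pixel] = z
            shaded += 1
    return shaded


def shaded_deferred(fragments):
    """Count fragments shaded when visibility is resolved before any shading."""
    nearest = {}
    for pixel, z in fragments:
        if z < nearest.get(pixel, float("inf")):
            nearest[pixel] = z
    # Only the single nearest fragment per covered pixel is shaded.
    return len(nearest)


# Three layers over pixel (0, 0) submitted back to front, plus one other pixel.
back_to_front = [((0, 0), 0.9), ((0, 0), 0.5), ((0, 0), 0.1), ((1, 0), 0.3)]
front_to_back = [((0, 0), 0.1), ((0, 0), 0.5), ((0, 0), 0.9), ((1, 0), 0.3)]
```

In this model the immediate mode count depends on submission order, while the deferred count does not. That order dependence is the extra work the paragraph above refers to.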

STMicro had a short run of popular tile (or deferred) renderers in the early 2000s with the Kyro series. This style of rendering still lives on in cell phone/smart phone and other ultra low power devices that need graphics. While performance on this hardware is very low, memory efficiency is important in this space and thus tile based renderers are preferred.

The technique dropped out of the desktop space not because it was inherently unable to perform, but simply because the players that won out in the era didn't choose to make use of it. With smaller process technology, larger on die cache sizes, larger tile sizes, and smaller geometry (meaning fewer triangles span multiple tiles), some advantages of tile based rendering have gotten ... well, more advantageous with advancements in technology.

Getting into the details of tile based rendering is a bit beyond where we want to go right now. But the point is that this technique results in fewer occluded fragments being shaded. Additionally, the grouping of fragments into tiles helps with breaking up the workload and could help to optimize prefetching and caching so that fragments are only ever fetched once from external memory (tiles on Larrabee will fit into less than half the L2 space per core). These and other features help to reduce bandwidth needs compared to immediate mode renderers.
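As a rough feel for the cache arithmetic, the sketch below computes how much memory a square tile needs and the largest power-of-two tile that fits a given budget. The numbers are our assumptions for illustration only: this section does not state the L2 size per core or the pixel formats, so the 256 KB figure, the 4 bytes of color, and the 4 bytes of depth per pixel are placeholders, and the function names are hypothetical.

```python
def tile_bytes(tile_edge, color_bytes=4, depth_bytes=4):
    """Bytes needed to hold one square tile's color and depth values."""
    return tile_edge * tile_edge * (color_bytes + depth_bytes)


def largest_tile_edge(budget_bytes, color_bytes=4, depth_bytes=4):
    """Largest power-of-two tile edge whose footprint fits in budget_bytes.

    Returns 0 if even a 1x1 tile does not fit.
    """
    if tile_bytes(1, color_bytes, depth_bytes) > budget_bytes:
        return 0
    edge = 1
    while tile_bytes(edge * 2, color_bytes, depth_bytes) <= budget_bytes:
        edge *= 2
    return edge


ASSUMED_L2_BYTES = 256 * 1024  # placeholder L2 size per core, not from this article
half_l2 = ASSUMED_L2_BYTES // 2

edge_at_half = largest_tile_edge(half_l2)        # footprint equal to half the cache
edge_below_half = largest_tile_edge(half_l2 - 1)  # footprint strictly under half
```

Under these assumptions a 128x128 tile lands exactly on half the cache, so staying strictly under half means either a smaller tile or fewer bytes per pixel.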

Looking a little deeper, it is both the burden and advantage of Larrabee that it implements all steps of the traditional graphics pipeline in software. While current GPUs have hardware for geometry setup, rasterization, texturing, filtering, compressing, decompressing, blending and much more, Larrabee maintains a minimum of fixed function features (related to texturing). Often, for a specific purpose, fixed function hardware can be more efficient and faster than general purpose hardware. But at the same time, the needs of individual games shift, and allocating greater or fewer resources to a specific component of the rendering pipeline does have advantages over fixed function hardware. Current GPUs can't shift resources to offer faster rasterization if needed. They can't devote more flops to speeding up stenciling or blending.

The flexibility of Larrabee allows it to best fit any game running on it. But keep in mind that just because software has a greater potential to better utilize the hardware, we won't necessarily see better performance than what is currently out there. The burden is still on Intel to build a part that offers real-world performance that matches or exceeds what is currently out there. Efficiency and adaptability are irrelevant if real performance isn't there to back it up.

101 Comments

  • ocyl - Monday, August 4, 2008 - link

    Larrabee will be shipped when Diablo III is, and it will mark the beginning of the end for DirectX.

    Calling it first here at AnandTech.

    Thanks go to Anand and Derek for the very well written article. You are the ones who keep tech journalism alive.
  • erikespo - Monday, August 4, 2008 - link

    "At 143 mm^2, Intel could fit 10 Larrabee-like cores so let's double that. Now we're at 286mm^2 (still smaller than GT200 and about the size of AMD's RV770) and 20-cores. Double that once more and we've got 40-cores and have a 572mm^2 die, virtually the same size as NVIDIA's GT200 but on a 65nm process. "

    this math is way off

    143 mm^2 is 20449mm.. if they fit 10 there that is 2044.9 per core
    286mm^2 is 81796mm.. that is 4X the space so 40 cores in 286^2
    and 572mm^2 is 327184mm is 160 cores..

    double length will double area.. doubling length and width will quadruple area.
  • bauerbrazil - Monday, August 4, 2008 - link

    Hahahaha, YOUR math is way off!!!

    Jesus.
  • erikespo - Monday, August 4, 2008 - link

    I see where the article and you got your math..
    you both did 143mm^2 / 10 and got 14.3 then divided 286^2 by 14.3 and got 20.. this math is only acting on the one number..

    I know this because the area of 14.3 is 204.49 mm. 10 of those would be 2044.9mm. but the area of 143mm^2 is 20449mm.
  • WeaselITB - Monday, August 4, 2008 - link

    Wow ... No.
    143mm^2 is NOT equivalent to 143^2 mm ... Your analysis is flawed.

    If we use your example, 2mm^2 is NOT 2mm x 2mm ... it's actually root(2)mm x root(2)mm ... 4mm^2 is 2mm x 2mm, not 4mm x 4mm (that'd be 16mm).

    Maybe you should examine in depth that Wikipedia article you linked earlier ...

    Thanks,
    -Weasel
  • MamiyaOtaru - Monday, August 4, 2008 - link

    143mm^2 is NOT equivalent to 143^2 mm

    ^^THIS

    That's it in a nutshell. mm² doesn't mean you square 143, it refers to Square Millimeters, a unit of area (unlike Millimeters, a unit of distance).

    Revised mspaint illustration: http://img379.imageshack.us/my.php?image=squaremmh...
  • erikespo - Monday, August 4, 2008 - link

    Anandtech Comment Section.. Forever record of my retardedness
  • erikespo - Monday, August 4, 2008 - link

    Dang.. Many apologies..
    got my square area and squared numbers confused..
  • WeaselITB - Monday, August 4, 2008 - link

    [quote]4mm^2 is 2mm x 2mm, not 4mm x 4mm (that'd be 16mm).[/quote]

    Dang, that was supposed to read "(that'd be 16mm^2)."

    Thanks,
    -Weasel
  • erikespo - Monday, August 4, 2008 - link

    another way to look at it is how many 143mm^2 squares does it take to make up 286mm^2?

    only 2 would only be 143mm x 286mm

    since 10 cores fit into 143 x 143, 20 will fit into 143 x 286mm
    286 x 286 (which is double that of 143 x 286mm) the 286mm^2 would fit 40
