Drilling Deeper and Making the AMD/NVIDIA Comparison

Don't be fooled by the initial diagram; this simple x86 core gets far more complex. In the image below, the block on the left is the Larrabee core we mentioned earlier, and on the right we've blown up the vector unit and its associated parts:

The vector unit is key and within that unit you've got a ton of registers and a very wide vector ALU, which leads us to the fundamental building block of Larrabee. NVIDIA's GT200 is built out of Streaming Processors, AMD's RV770 out of Stream Processing Units and Larrabee's performance comes from these 16-wide vector ALUs:

The vector ALU can behave as a 16-wide single precision ALU or an 8-wide double precision ALU, although that doesn't necessarily translate into equivalent throughput (which Intel would not clarify at this point). Compared to AMD and NVIDIA, here's how Larrabee looks at a basic execution unit level:

NVIDIA's SPs work on a single operation, AMD's can work on five, and Larrabee's vector unit can work on sixteen. NVIDIA has a couple hundred of these SPs in its high end GPUs, AMD has 160 of its 5-wide SPs, and Intel is expected to have anywhere from 16 to 32 of these cores in Larrabee. If NVIDIA is on the tons-of-simple-hardware end of the spectrum, Intel is on the exact opposite end of the scale.
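To put those widths and unit counts side by side, here is a rough back-of-the-envelope sketch that multiplies unit count by unit width. AMD's 160 and Larrabee's 16 to 32 come from the numbers above; the 240 used for NVIDIA is our assumption for "a couple hundred" (the GTX 280's SP count). This counts lanes only and says nothing about clock speed or real throughput.

```python
# Width = operations one execution unit can work on at once (from the text).
# Unit counts: AMD's 160 and Larrabee's 16 to 32 come from the text; NVIDIA's
# 240 is an assumption standing in for "a couple hundred".
UNITS = {
    "NVIDIA GT200": (240, 1),
    "AMD RV770": (160, 5),
    "Larrabee (16 cores)": (16, 16),
    "Larrabee (32 cores)": (32, 16),
}


def total_lanes(unit_count, unit_width):
    """Total parallel lanes = number of units * operations per unit."""
    return unit_count * unit_width


LANES = {name: total_lanes(count, width) for name, (count, width) in UNITS.items()}

for name, lanes in LANES.items():
    print(f"{name}: {lanes} lanes")
```

The same totals can come from very different building blocks: many narrow units on NVIDIA's side, a handful of very wide ones on Intel's.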

We've already shown that AMD's architecture requires a lot of help from the compiler to properly schedule and maximize the utilization of the execution resources within each of its 5-wide SPs. With Larrabee's 16-wide vector units, the importance of the compiler is tremendous. Luckily for Larrabee, some of the best (if not the best) compilers are made by Intel. If anyone could get away with this sort of an architecture, it's Intel.
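To see why the compiler matters so much for a wide unit, consider a toy scheduler that packs independent operations into fixed-width issue slots. This is only an illustrative sketch: `pack_bundles`, `utilization`, and the example dependency graphs are hypothetical names we made up, and real shader compilers are far more sophisticated than this greedy pass.

```python
def pack_bundles(deps, width=5):
    """Greedy packing: deps maps each op name to the set of ops it depends on.

    An op may only be placed in a bundle once everything it depends on has
    been issued in an earlier bundle.
    """
    done = set()
    remaining = list(deps)  # preserves program order
    bundles = []
    while remaining:
        # Ops whose dependencies were all issued in earlier bundles.
        ready = [op for op in remaining if deps[op] <= done][:width]
        if not ready:
            raise ValueError("dependency cycle")
        bundles.append(ready)
        done.update(ready)
        remaining = [op for op in remaining if op not in ready]
    return bundles


def utilization(bundles, width=5):
    """Fraction of issue slots that actually hold an operation."""
    return sum(len(b) for b in bundles) / (len(bundles) * width)


# Five independent ops fill one 5-wide bundle completely.
independent = {f"op{i}": set() for i in range(5)}
# Five ops in a strict chain can only issue one per bundle.
chain = {"a": set(), "b": {"a"}, "c": {"b"}, "d": {"c"}, "e": {"d"}}

print(utilization(pack_bundles(independent)))
print(utilization(pack_bundles(chain)))
```

The independent case keeps every slot busy, while the dependent chain leaves four of five slots idle in each bundle. The wider the unit, the more work the compiler has to find to avoid that second case.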

At the same time, while we don't have a full understanding of the details yet, we get the idea that Larrabee's vector unit is sort of a chameleon. From the information we have, these vector units could execute atomic 16-wide ops for a single thread of a running program and can handle register swizzling across all 16 execution units. This implies something very AMD-like and wide. But it also looks like each of the 16 vector execution units can, using the mask registers, branch independently (which looks much more like NVIDIA's solution).
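As a rough illustration of how a mask register can let each lane of a wide vector unit take its own side of a branch, here is a minimal sketch of predicated execution in general. It is not Larrabee's actual instruction set: the function names and the example kernel are our own assumptions, and plain lists stand in for 16-lane vector registers.

```python
LANES = 16  # one element per vector lane


def masked_select(mask, if_true, if_false):
    """Per lane, keep the 'then' result where the mask bit is set, else the 'else' result."""
    return [t if m else f for m, t, f in zip(mask, if_true, if_false)]


def branchy_kernel(x):
    """Vector form of the scalar code: y = -x if x < 0 else x * 2."""
    mask = [v < 0 for v in x]          # the comparison fills the mask register
    then_result = [-v for v in x]      # every lane computes the 'then' side
    else_result = [v * 2 for v in x]   # every lane computes the 'else' side
    return masked_select(mask, then_result, else_result)


x = list(range(-8, 8))  # 16 values, one per lane
print(branchy_kernel(x))
```

Every lane executes both sides of the branch and the mask decides which result survives, so the unit still issues single wide instructions while each lane appears to branch on its own.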

We've already seen how AMD's and NVIDIA's architectural differences show distinct advantages and disadvantages against each other in different games. If Intel is able to adapt the way the vector unit is used to suit specific situations, they could have something huge on their hands. Again, we don't have enough detail to tell what's going to happen, but things do look very interesting.


101 Comments

  • Griswold - Monday, August 4, 2008 - link

    You seem to be confused. Time for a nap.
  • MDme - Monday, August 4, 2008 - link

    but AMD will have Cinema 2.0. did you see that demo? by 2010, AMD will have the RV990 or whatever...and Nvidia will have GT400?
  • phaxmohdem - Monday, August 4, 2008 - link

    Considering how long it took nVidia to release a single GPU significantly faster than G80, I'd be shocked if we see GT300 by 2009/2010. However, a GTX 295GT X2 ULTRA OC is not out of the question ;)
  • shuffle2 - Monday, August 4, 2008 - link

    mm², how hard is that to write? >.>
  • 1prophet - Monday, August 4, 2008 - link

    They need to hit one out of the park with the drivers (software) as well.
  • jltate - Tuesday, August 5, 2008 - link

    I've got a bunch of comments, so I'll just list them all here.

    SSE doesn't have fused multiply-add operations. Larrabee does -- thus that 10 core processor could perform a peak of 320 floating point operations per cycle (it's mentioned in the SIGGRAPH paper).

    Larrabee's programming model is variable width -- the hardware can and likely will be augmented in the future to perform more than just 16 operations in parallel.

    The ring bus between cores was stated to be for each group of 16. Intel stated that for more than 16 cores they'd use "multiple short-linked rings".

    Also, the diagram only shows one memory controller on one side with fixed function logic on the other, not two memory controllers as you showed on page 5 of your article. However, Intel stated in the paper that the configuration and number of processors, fixed function blocks and I/O controllers would be implementation dependent. So in effect it could very well have a half-dozen 64-bit interfaces like G80.

    My forecast? This thing will rock. I for one simply cannot wait.
  • Laura Wilson - Monday, August 4, 2008 - link

    that's the truth

    they say they know this. it sounds like they know this ... we'll see what happens :-)
  • gigahertz20 - Monday, August 4, 2008 - link

    I'm going to predict Larrabee will provide a huge boost of performance over Intel's current crappy integrated graphic solutions, but will not be able to compete with AMD/ATI's and Nvidia's high end GPU's when it (Larrabee) finally launches. If Intel can deliver a monster that can push 100+ FPS in Crysis and doesn't cost so much that it breaks the bank like the current Nvidia GTX 280's, then they will have a real winner! When it finally launches though, who knows what AMD/ATI and Nvidia will have out to compete against it, wonder if Intel is just trying to push out a mainstream chip or go high end as well...guess I need to read the rest of the article :)
  • JEDIYoda - Tuesday, August 5, 2008 - link

    dreaming again huh??? you people who want top notch performance without having to pay for it....rofl..hahaha
  • FITCamaro - Monday, August 4, 2008 - link

    This isn't meant to compete with their IGPs. At least not initially.
