Cache and Memory Hierarchy: Architected for Low Latency Operation

Intel has a lot of experience building very high performance caches. Intel's caches are denser than anything AMD has been able to produce on the x86 front, and as we saw in our Nehalem preview, Intel can also deliver significantly lower latency caches than the competition. It should come as no surprise, then, that Larrabee's strengths come from being built on fully programmable x86 cores and from having very large, very fast coherent caches.

Each Larrabee core features 4x the L1 cache of the original Pentium. The Pentium had an 8KB L1 data cache and an 8KB L1 instruction cache; each Larrabee core has a 32KB/32KB L1 D/I cache. The reasoning is that each Larrabee core can work on 4x as many threads as the original Pentium, so with an L1 that is 4x as large, the architecture remains balanced: the per-thread L1 budget is unchanged. The original Pentium didn't have an integrated L2 cache, but each Larrabee core has access to its own L2 cache partition, 256KB in size.
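That balance claim is easy to sanity check with quick arithmetic. The C sketch below uses only the cache sizes and thread counts quoted above; it's our own back-of-the-envelope check, not anything from Intel.

    #include <stdio.h>

    /* Back-of-the-envelope check of the "4x the threads, 4x the L1" balance
       claim, using only the sizes quoted in the article. */
    int main(void) {
        const int pentium_l1d_kb  = 8,  pentium_l1i_kb  = 8,  pentium_threads  = 1;
        const int larrabee_l1d_kb = 32, larrabee_l1i_kb = 32, larrabee_threads = 4;

        printf("Pentium : %dKB L1D + %dKB L1I per thread\n",
               pentium_l1d_kb / pentium_threads, pentium_l1i_kb / pentium_threads);
        printf("Larrabee: %dKB L1D + %dKB L1I per thread\n",
               larrabee_l1d_kb / larrabee_threads, larrabee_l1i_kb / larrabee_threads);

        /* Both lines print 8KB + 8KB: the per-thread L1 budget is unchanged. */
        return 0;
    }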

Larrabee's L2 pool grows with the core count. An 8-core Larrabee would have 2MB of total L2 cache (256KB per core x 8 cores); a 32-core Larrabee would have 8MB. Each core can directly read and write only its own 256KB partition of the pool. Communication with other Larrabee cores happens over the ring bus: a core first looks for data in its own L2 partition, and if the data isn't there it places a request on the ring bus and eventually receives the line, typically from another core's L2 partition.
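To make that lookup order concrete, here is a toy model of the flow in C. This is an illustration of the behavior described above, not Intel's actual coherence protocol; the partition structure, sizes, and function names are all invented for the sketch.

    #include <stdio.h>
    #include <stdbool.h>
    #include <stdint.h>

    #define NUM_CORES 8
    #define LINES_PER_PARTITION 4   /* tiny on purpose; a real partition is 256KB */

    /* Each core owns one private slice of the shared L2 pool. */
    typedef struct {
        uint64_t tags[LINES_PER_PARTITION];
        bool     valid[LINES_PER_PARTITION];
    } l2_partition_t;

    static l2_partition_t l2[NUM_CORES];

    static bool partition_has(int core, uint64_t addr) {
        for (int i = 0; i < LINES_PER_PARTITION; i++)
            if (l2[core].valid[i] && l2[core].tags[i] == addr)
                return true;
        return false;
    }

    /* A load checks the core's own partition first, then asks the ring. */
    static const char *load(int core, uint64_t addr) {
        if (partition_has(core, addr))
            return "hit in local L2 partition";
        for (int other = 0; other < NUM_CORES; other++)   /* ring bus walk */
            if (other != core && partition_has(other, addr))
                return "forwarded from another core's partition via the ring";
        return "miss everywhere: fetch from memory";
    }

    int main(void) {
        l2[3].tags[0] = 0x1000; l2[3].valid[0] = true;    /* core 3 holds the line */
        printf("core 3 loads 0x1000: %s\n", load(3, 0x1000));
        printf("core 0 loads 0x1000: %s\n", load(0, 0x1000));
        printf("core 0 loads 0x2000: %s\n", load(0, 0x2000));
        return 0;
    }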

Intel doesn't attempt to hide latency nearly as much as NVIDIA does, relying instead on its high speed, low latency caches. The ratio of compute resources to cache size is much lower with Larrabee than with either AMD's or NVIDIA's architecture.

                            AMD RV770   NVIDIA GT200   Intel Larrabee
Scalar ops per L1 cache     80          24             16
L1 cache size               16KB        unknown        32KB
Scalar ops per L2 cache     100         30             16
L2 cache size               unknown     unknown        256KB

While both AMD and NVIDIA are very shy about giving out cache sizes, we do know that RV670 had a 256KB L2 cache for the entire chip, and we can expect RV770 to have something larger, though not large enough to come close to what Intel has with Larrabee. NVIDIA is much closer to Larrabee in compute-to-cache ratio than AMD is, which makes sense given its approach of designing much larger GPUs, but we have no reason to believe that NVIDIA has larger caches on the GT200 die than Intel has with Larrabee.

The caches are fully coherent, just like they are on a multi-core desktop CPU. Fully coherent caches make for some interesting possibilities when looking at multi-GPU configurations. While Intel wouldn't get specific about multi-GPU Larrabee plans, it did state that with a multi-GPU Larrabee setup Intel doesn't "expect to have quite as much pain as they [AMD/NVIDIA] do".

We asked whether there was any limitation to maintaining cache coherence across multiple chips, and the answer was that it should be possible with enough bandwidth between the two chips. While NVIDIA and AMD are still adding bits and pieces to refine multi-GPU rendering, Intel could have a very robust solution right out of the gate if it desired (think shared framebuffer and much more efficient workload division for a single frame).
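For a sense of what full coherence buys the programmer, it helps to look at the kind of code it permits. The sketch below is ordinary multi-core CPU code in C, nothing Larrabee-specific: the point is simply that coherent caches let cores share one data structure through plain loads, stores, and a lock, with no explicit copies or DMA transfers between them.

    #include <pthread.h>
    #include <stdio.h>

    static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
    static long shared_counter = 0;   /* lives in one coherent address space */

    static void *worker(void *arg) {
        (void)arg;
        for (int i = 0; i < 100000; i++) {
            pthread_mutex_lock(&lock);
            shared_counter++;         /* hardware keeps every core's view consistent */
            pthread_mutex_unlock(&lock);
        }
        return NULL;
    }

    int main(void) {
        pthread_t a, b;
        pthread_create(&a, NULL, worker, NULL);
        pthread_create(&b, NULL, worker, NULL);
        pthread_join(a, NULL);
        pthread_join(b, NULL);
        printf("counter = %ld\n", shared_counter);   /* always 200000 */
        return 0;
    }

On a GPU without coherent caches, the same sharing pattern would require explicit synchronization and data movement managed by the programmer or driver.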

Comments

  • Shinei - Monday, August 4, 2008 - link

    Some competition might do NVIDIA good: if Larrabee manages to outperform them, you know NVIDIA will go berserk and release another hammer like the NV40 after the R3x0 spanked them for a year.

    Maybe we'll start seeing the price/performance gains we were spoiled with before ATI/AMD stopped being competitive.

    Overall, this can only mean good things, even if Larrabee itself ultimately fails.
  • Griswold - Monday, August 4, 2008 - link

    Wake-up call, dumbo. AMD just started to mop the floor with NVIDIA's products as far as price/performance goes.
  • watersb - Monday, August 4, 2008 - link

    great article!

    You compare Larrabee to a Core 2 Duo: for SIMD instructions, you multiplied by a (hypothetical) 10 cores to show Larrabee at 160 SIMD instructions per clock. But you show non-vector IPC as 2.

    For a 10-core Larrabee, shouldn't that be x10 as well, for 20 scalar IPC?
  • Adamv1 - Monday, August 4, 2008 - link

    I know Intel has been working on Ray Tracing and I'm really curious how this is going to fit into the picture.

    From what I remember, ray tracing is highly parallel and scales quite well with more cores, and they were talking about introducing it on 8-core processors; it seems to me this would be a great platform to try it on.
  • SuperGee - Thursday, August 7, 2008 - link

    How it fits:
    GPUs from ATI and NVIDIA are what you'd call hardware renderers. They still carry a lot of fixed function: ROPs, TMUs, blenders, the rasterizer, etc. Unified shaders are evolving toward general purpose, but they aren't fully GP.
    Larrabee is an exotic x86 massive multi-core. It will act just like a multi-core CPU, but optimized for GPU tasks and deployed as a GPU.
    So Intel uses a software renderer and will first emulate DirectX/OpenGL on it with its drivers.
    NVIDIA and ATI are more HAL (hardware abstraction layer) with HEL (hardware emulation layer) as backup, where Larrabee is pure HEL. But its parallel power will boost software methods, since it's essentially a large bunch of x86 cores.
    HEL will run fast on Larrabee, as if it were HAL, because the software computing power for such tasks is available.

    What this means is that as a graphics engine developer you get full freedom if you target Larrabee directly.

    Like they say, first with a DirectX/OpenGL driver, later also with a CPU driver where it can be targeted directly, like GPGPU tasks; Larrabee could even pop up as extra cores in Windows. Whatever you do is effectively a software solution.
    You could build a software renderer on a ray tracing method, and a voxel engine could be done too. That software renderer would be accelerated by the Larrabee massive multi-core CPU, which can do GPU stuff very well but will boost any software renderer. Of course it must be fully optimized for Larrabee to get the most out of it, using those vector units and the Larrabee x86 extensions.

    Novalogic could have used this too, for their voxel game engine back in the days of the Pentium III.

    It could accelerate any software renderer which depends heavily on parallel computing.
  • icrf - Monday, August 4, 2008 - link

    Since I don't play many games anymore, that aspect of Larrabee doesn't interest me beyond the economies of scale that might let me buy one cheap. I'm very interested in seeing how well something like POV-Ray or an H.264 encoder could be implemented, and what kind of speed increase it'd see. Sure, these things could be implemented on current GPUs through CUDA/CTM, but that's such a different kind of task that it's not at all quick or easy. If Larrabee is significantly simpler to program, we'd actually see supporting software sooner.
  • cyberserf - Monday, August 4, 2008 - link

    one word: MATROX
  • Guuts - Monday, August 4, 2008 - link

    You're going to have to use more than one word, sorry... I have no idea what in this article has anything to do with Matrox.
  • phaxmohdem - Monday, August 4, 2008 - link

    What, you mean you DON'T have a Parhelia card in your PC? WTF is wrong with you?
  • TonyB - Monday, August 4, 2008 - link

    but can it play crysis?!
