The Front End

Sandy Bridge’s CPU architecture is evolutionary from a high-level viewpoint but far more revolutionary in terms of the number of transistors that have been changed since Nehalem/Westmere.

In Core 2, Intel introduced a block of logic called the Loop Stream Detector (LSD). The LSD would detect when the CPU was executing a software loop, turn off the branch predictor and fetch/decode engines, and feed the execution units with micro-ops cached by the LSD. This approach saves power by shutting off the front end while the loop executes and improves performance by feeding the execution units directly out of the LSD.

In Sandy Bridge, there’s now a micro-op cache that caches instructions as they’re decoded. There’s no sophisticated algorithm here; the cache simply grabs instructions as they’re decoded. When SB’s fetch hardware grabs a new instruction, it first checks whether the instruction is in the micro-op cache. If it is, the cache services the rest of the pipeline and the front end is powered down. The decode hardware is a very complex part of the x86 pipeline, so turning it off saves a significant amount of power. While Sandy Bridge is a high-end architecture, I feel that the micro-op cache would probably benefit Intel’s Atom lineup down the road, as the burden of x86 decoding is definitely felt in these very low power architectures.

The cache is direct mapped and can store approximately 1.5K micro-ops, which is effectively the equivalent of a 6KB instruction cache. The micro-op cache is fully included in the L1 i-cache and enjoys approximately an 80% hit rate for most applications. You get slightly higher and more consistent bandwidth from the micro-op cache than from the instruction cache. The actual L1 instruction and data caches haven’t changed; they’re still 32KB each (for a total of 64KB of L1).
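As a rough sanity check on the 6KB equivalence, the arithmetic can be sketched as follows. Reading "1.5K" as 1536 entries and assuming each micro-op stands in for about 4 bytes of x86 instruction stream are my assumptions, chosen to match the approximate figures above; they are not Intel specifications.

```python
# Back-of-the-envelope check of the "1.5K micro-ops ~ 6KB" equivalence.
# Assumptions (not from Intel): "1.5K" means 1536 entries, and each
# micro-op stands in for roughly 4 bytes of x86 instruction stream.
UOP_ENTRIES = 1536
ASSUMED_BYTES_PER_UOP = 4

equivalent_bytes = UOP_ENTRIES * ASSUMED_BYTES_PER_UOP
equivalent_kb = equivalent_bytes / 1024

# The L1 instruction cache size is taken from the text above.
l1_icache_kb = 32
fraction_of_l1i = equivalent_kb / l1_icache_kb

print(f"{equivalent_kb:.0f}KB equivalent, {fraction_of_l1i:.1%} of the L1 i-cache")
```

Under those assumptions the micro-op cache covers a little under a fifth of what the L1 i-cache holds.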

All instructions that are fed out of the decoder can be cached by this engine and, as I mentioned before, it’s a blind cache: all instructions are cached. The least recently used data is evicted as the cache runs out of space.
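The "cache everything, evict the least recently used" policy can be illustrated with a toy model. This is only a sketch: the class name, the `fake_decode` helper and the tiny capacity are hypothetical, and the model ignores how the real structure is organized and counts entries per address rather than per micro-op.

```python
from collections import OrderedDict


class MicroOpCache:
    """Toy model: caches every decoded instruction, evicts the least recently used."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.entries = OrderedDict()  # address -> decoded micro-ops
        self.hits = 0
        self.misses = 0

    def fetch(self, address, decode):
        if address in self.entries:
            # Hit: the decoders are bypassed entirely.
            self.entries.move_to_end(address)
            self.hits += 1
            return self.entries[address]
        # Miss: run the (expensive) decoder and blindly cache the result.
        self.misses += 1
        uops = decode(address)
        self.entries[address] = uops
        if len(self.entries) > self.capacity:
            self.entries.popitem(last=False)  # drop the least recently used entry
        return uops


def fake_decode(address):
    # Hypothetical stand-in for the x86 decoders.
    return (f"uop@{address:#x}",)


cache = MicroOpCache(capacity=4)
loop_body = [0x100, 0x104, 0x108]
for _ in range(10):  # a small loop that fits in the cache
    for addr in loop_body:
        cache.fetch(addr, fake_decode)

print(cache.hits, cache.misses)  # 27 3
```

Only the first pass through the loop touches the decoder; the remaining nine iterations are served from the cache, which is the case where the front end could be powered down.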

This may sound a lot like Pentium 4’s trace cache but with one major difference: it doesn’t cache traces. It really looks like an instruction cache that stores micro-ops instead of macro-ops (x86 instructions).

Along with the new micro-op cache, Intel also introduced a completely redesigned branch prediction unit (BPU). The new BPU has roughly the same footprint as its predecessor but is much more accurate. The increase in accuracy is the result of three major innovations.

The standard branch predictor is a 2-bit predictor. Each branch is marked in a table as taken/not taken with an associated confidence (strong/weak). Intel found that nearly all of the branches predicted by this bimodal predictor have a strong confidence. In Sandy Bridge, the bimodal branch predictor uses a single confidence bit for multiple branches rather than using one confidence bit per branch. As a result, you have the same number of bits in your branch history table representing many more branches, which can lead to more accurate predictions in the future.
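To make the trade-off concrete, here is a minimal sketch that compares a classic 2-bit predictor with a variant that shares one confidence bit across a group of branches. Intel has not published the exact scheme, so the update rule, the group size of 8 and all names below are assumptions for illustration only.

```python
class TwoBitPredictor:
    """Classic bimodal predictor: one 2-bit saturating counter per entry."""

    def __init__(self, entries):
        self.counters = [2] * entries  # 0-1 predict not taken, 2-3 predict taken

    def predict(self, index):
        return self.counters[index] >= 2

    def update(self, index, taken):
        c = self.counters[index]
        self.counters[index] = min(c + 1, 3) if taken else max(c - 1, 0)


class SharedConfidencePredictor:
    """Assumed scheme: one direction bit per entry, one confidence bit per group."""

    def __init__(self, entries, group_size):
        self.group_size = group_size
        self.direction = [True] * entries
        self.strong = [False] * ((entries + group_size - 1) // group_size)

    def predict(self, index):
        return self.direction[index]

    def update(self, index, taken):
        group = index // self.group_size
        if self.direction[index] == taken:
            self.strong[group] = True      # correct: reinforce
        elif self.strong[group]:
            self.strong[group] = False     # first miss: only weaken
        else:
            self.direction[index] = taken  # miss while weak: flip direction

    def bits_per_branch(self):
        return 1 + 1 / self.group_size


def mispredictions(predictor, index, outcomes):
    misses = 0
    for taken in outcomes:
        if predictor.predict(index) != taken:
            misses += 1
        predictor.update(index, taken)
    return misses


# A loop branch: taken 9 times, then falls through, repeated 5 times.
loop_branch = ([True] * 9 + [False]) * 5
print(mispredictions(TwoBitPredictor(64), 0, loop_branch))                        # 5
print(mispredictions(SharedConfidencePredictor(64, group_size=8), 0, loop_branch))  # 5
print(SharedConfidencePredictor(64, group_size=8).bits_per_branch())              # 1.125
```

On this single loop branch both predictors miss only the loop exit, yet the shared variant spends 1.125 bits per branch instead of 2, so the same storage budget can track more branches. The sketch does not model interference between branches that share a confidence bit.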

Branch targets also got an efficiency makeover. Previous architectures used a single size for branch targets, but it turns out that most targets are relatively close. Rather than storing all branch targets in large structures capable of addressing faraway targets, SNB now supports multiple branch target sizes. With smaller target sizes there’s less wasted space, and the CPU can keep track of more targets, improving prediction speed.
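The space saving from multiple target sizes can be sketched with some simple bookkeeping. Everything specific here is hypothetical: the 8/16/32-bit size classes, the idea of storing targets as signed offsets from the branch, and the example addresses are made up for illustration, since the real SNB encodings are not public.

```python
# Hypothetical size classes, in bits, for a stored target offset. The real
# SNB encodings are not given here; these are made up for illustration.
SIZE_CLASSES = (8, 16, 32)


def signed_bits_needed(offset):
    """Width of the smallest two's-complement field that can hold offset."""
    magnitude = offset if offset >= 0 else ~offset
    return magnitude.bit_length() + 1


def pick_size_class(branch_addr, target_addr):
    """Smallest size class that can hold the branch-to-target offset."""
    needed = signed_bits_needed(target_addr - branch_addr)
    for size in SIZE_CLASSES:
        if needed <= size:
            return size
    raise ValueError("target too far away for any size class")


# Made-up (branch address, target address) pairs.
branches = [
    (0x401000, 0x401010),  # short forward jump
    (0x401000, 0x400F00),  # short backward jump (e.g. a loop)
    (0x401000, 0x7F0000),  # far-away target
]

variable_bits = sum(pick_size_class(b, t) for b, t in branches)
fixed_bits = len(branches) * max(SIZE_CLASSES)
print(variable_bits, fixed_bits)  # 56 96
```

With these made-up numbers the three targets need 56 bits instead of 96, which is the kind of saving that lets the same structure hold more targets.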

Finally, we have the conventional method of increasing the accuracy of a branch predictor: using more history bits. Unfortunately, this only works well for certain types of branches that require looking at long patterns of instructions, and not well for shorter, more common branches (e.g. loops, if/else). Sandy Bridge’s BPU partitions branches into those that need a short history and those that need a long history for accurate prediction.
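The short vs. long history distinction can be illustrated with a toy history-indexed predictor. This is a generic textbook-style sketch, not Sandy Bridge's actual design; the `simulate` function, the two outcome patterns and the warm-up length are all assumptions.

```python
def simulate(pattern, history_bits, repeats=200, warmup=100):
    """Run a table of 2-bit counters indexed by the last history_bits outcomes.

    Returns (mispredictions, table entries touched), both counted after warm-up.
    """
    table = [2] * (1 << history_bits)  # 2-bit counters, start weakly taken
    mask = (1 << history_bits) - 1
    history = 0
    misses = 0
    entries_used = set()
    for i, taken in enumerate(pattern * repeats):
        predicted_taken = table[history] >= 2
        if i >= warmup:
            entries_used.add(history)
            if predicted_taken != taken:
                misses += 1
        counter = table[history]
        table[history] = min(counter + 1, 3) if taken else max(counter - 1, 0)
        history = ((history << 1) | int(taken)) & mask
    return misses, len(entries_used)


T, N = True, False
if_else = [T, N]                         # alternating branch
long_pattern = [T, T, T, T, N, N, N, N]  # needs more context to disambiguate

for name, pattern in (("if/else", if_else), ("long", long_pattern)):
    for bits in (1, 2, 4, 8):
        print(name, bits, simulate(pattern, bits))
```

In this toy model the alternating pattern is predicted perfectly with a single history bit, while the eight-outcome pattern still mispredicts at 1 and 2 bits and is only predicted perfectly at 4 and 8. At 8 bits the alternating branch touches just 2 of the 256 table entries, which hints at why giving every branch a long history is wasteful.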

62 Comments


  • FXi - Tuesday, September 14, 2010

    Only thing I am saddened by is that hybrid graphics apparently won't be "working" on the mobile high end chipset with the dual pci-e x8 lanes. It's extremely nice to have 2x a good modern mobile GPU, but still be able to switch to the Intel built-in GPU when you want longer battery life on the road.

    That ability, in the 2920, was something I was truly hoping for.

    The rest of its abilities are quite nice and very welcome. USB 3 really is something to be sure they didn't miss. But otherwise kudos Intel.
  • Drazick - Tuesday, September 14, 2010

    Anand, a few questions with your permission:

    I wonder if we could use a discrete graphics card and still enable the Media Engine.
    What about the DMI bus? Hasn't it become a bottleneck with SSD drives and USB3?
    Does Intel have plans to address it?

    Thanks.
  • EricZBA - Tuesday, September 14, 2010

    Someone please release a decent 13.3-inch laptop using Sandy Bridge.
  • bitcrazed - Tuesday, September 14, 2010

    I have a sneaking suspicion that Intel will be at the core of Apple's next laptop platform refresh with both SandyBridge and LightPeak.

    Apple's MacBook lineup is starting to feel a little pressure from the other PC laptop vendors who are starting to produce some nicely designed tin and will need to stay current in order to continue to sell their products at such high premiums.

    I'm imagining the next MacBook Pro lineup to offer 13" MBP's running i3 2120's and the 15" and 17" models running i5 2400/2500's or i7 2600's.

    Apple already have their own dynamic integrated/discrete GPU switching technology (as do nVidia) and can make even better use of SB's integrated GPU augmented by a modest discrete GPU to deliver the performance that most users need but with much reduced power drain.

    So how to differentiate themselves? LightPeak. Apple was the instigator of LightPeak to start with and Intel claimed at CES 2010 that it'd appear around a year later. That's next spring.

    One thing's for sure: 2011 is going to be a VERY interesting year for new laptop and desktop devices :)
  • name99 - Tuesday, September 14, 2010

    LightPeak WITHOUT USB3 will go over like a lead zeppelin.
    There are already plenty of USB3 peripherals available. I have never in my life seen a LightPeak peripheral, or even a review or sneak peek of one. Light Peak is coming, but I'm not sure that 2011 is its year.

    The rate at which CPU speeds now increase is low enough that very few buyers feel any sort of pressure to upgrade the machine they bought 3 years ago. Apple can't deal with that by simply offering new iMacs and MacBooks with the newest Intel offering, since no normal person is much excited by another 10% CPU boost.

    They have done an adequate job of dealing with this so far by boosting battery life, something (some) portable users do care about.

    They have done a mixed job of making more cores, hyperthreading and better GPUs a reason to upgrade. We have some low-level infrastructure in Snow Leopard, but we have fsckall user level apps that take advantage of this. Where is the multi-threaded Safari? Where is the iTunes that utilizes multiple cores, and the GPU for transcoding audio? Does FileVault use AES-NI --- apparently not.

    But Apple has done a truly astonishingly lousy job of tracking the one remaining piece of obvious slowness --- IO. Still no TRIM, still no eSATA, still no USB3.

    My point is that I don't know the Apple politics, but I do know that they are doing a very very bad job of shipping machines that compel one to upgrade. There is no need for me to upgrade my 3+yr old Penryn iMac, for example --- I'd get a replacement with more cores (not used by any of my software), a better GPU (but what I have plays video just fine), and most importantly, NO FASTER IO.

    Adding LightPeak to this mix without USB3 is not going to help any. People are still going to hold off on upgrades until USB3 is available, and no-one is going to rush to buy a LightPeak system so that they can then NOT run any of the many unavailable LightPeak peripherals on the shelves at Fry's.
  • NaN42 - Tuesday, September 14, 2010

    On page 3: "Compared to an 8-core Bulldozer a 4-core Sandy Bridge has twice the 256-bit AVX throughput."
    WTF? 8*128 = 4*256. Based on the premise that the fp-scheduler of one Bulldozer module (two cores) can schedule e.g. one add and one mul AVX instruction per clock cycle, they have the same throughput. I think both architectures will have a delay for e.g. shuffling ymm registers (compared to current xmm instructions) because data has to be exchanged between different pipelines/ports (hopefully the picture provided by Intel is correct). Perhaps the delay is smaller in Sandy Bridge cores. I expect some delays when one mixes floating-point and integer instructions on Sandy Bridge. (Currently I don't know whether there exists a VEX prefix for xmm integer instructions. If there's no VEX prefix the delays will be great on both platforms.)
  • gvaley - Tuesday, September 14, 2010

    "...you get two 256-bit AVX operations per clock."

    "AMD sees AVX support in a different light than Intel. Bulldozer features two 128-bit SSE paths that can be combined for 256-bit AVX operations. "

    So it's actually 8*256 = 4*2*256. At least this is how I see it.
  • NaN42 - Tuesday, September 14, 2010

    "So it's actually 8*256 = 4*2*256. At least this is how I see it. "

    Ok, my calculation was a bit different. 4*2*256 will be true, but only if you mix additions and multiplications. Whether AMD is 8*2*128 depends on the fp-scheduler (based on the premise that one SIMD unit consists of an fmul, fadd and fmisc unit or something similar).
  • NaN42 - Tuesday, September 14, 2010

    ... one can do another floating point operation which goes through port 5, but the peak performance of additions and multiplications is more relevant in applications.
  • Spacksack - Tuesday, September 14, 2010

    I think you are right. I would think Bulldozer can manage the same theoretical throughput by issuing one combined FMA instruction (16 flop) / clock and module.

    More importantly, Bulldozer will achieve high throughput for all the existing SSE code by having two independent FMA units. I have no idea how Anand could make such a mistake.