A Physical Register File

Just like AMD announced in its Bobcat and Bulldozer architectures, in Sandy Bridge Intel moves to a physical register file. In Core 2 and Nehalem, every micro-op had a copy of every operand that it needed. This meant the out-of-order execution hardware (scheduler/reorder buffer/associated queues) had to be much larger as it needed to accommodate the micro-ops as well as their associated data. Back in the Core Duo days that was 80-bits of data. When Intel implemented SSE, the burden grew to 128-bits. With AVX however we now have potentially 256-bit operands associated with each instruction, and the amount that the scheduling/reordering hardware would have to grow to support the AVX execution hardware Intel wanted to enable was too much.

A physical register file stores micro-op operands in the register file; as the micro-op travels down the OoO engine it only carries pointers to its operands and not the data itself. This significantly reduces the power of the out of order execution hardware (moving large amounts of data around a chip eats tons of power), it also reduces die area further down the pipe. The die savings are translated into a larger out of order window.

The die area savings are key as they enable one of Sandy Bridge’s major innovations: AVX performance.

The AVX instructions support 256-bit operands, which as you can guess can eat up quite a bit of die area. The move to a physical register file enabled Intel to increase OoO buffers to properly feed a higher throughput floating point engine. Intel clearly believes in AVX as it extended all of its SIMD units to 256-bit wide. The extension is done at minimal die expense. Nehalem has three execution ports and three stacks of execution units:

Sandy Bridge allows 256-bit AVX instructions to borrow 128-bits of the integer SIMD datapath. This minimizes the impact of AVX on the execution die area while enabling twice the FP throughput, you get two 256-bit AVX operations per clock (+ one 256-bit AVX load).

Granted you can’t mix 256-bit AVX and 128-bit integer SSE ops, however remember SNB now has larger buffers to help extract more ILP.

The upper 128-bits of the execution hardware and paths are power gated. Standard 128-bit SSE operations will not incur an additional power penalty as a result of Intel’s 256-bit expansion.

AMD sees AVX support in a different light than Intel. Bulldozer features two 128-bit SSE paths that can be combined for 256-bit AVX operations. Compared to an 8-core Bulldozer a 4-core Sandy Bridge has twice the 256-bit AVX throughput. Whether or not this is an issue going forward really depends on how well AVX is used in applications.

The improvements to Sandy Bridge’s FP performance increase the demands on the load/store units. In Nehalem/Westmere you had three LS ports: load, store address and store data.

In SNB, the load and store address ports are now symmetric so each port can service a load or store address. This doubles the load bandwidth which is important as Intel doubled the peak floating point performance in Sandy Bridge.

There are some integer execution improvements in Sandy Bridge, although they are more limited. Add with carry (ADC) instruction throughput is doubled, while large scale multiplies (64 * 64) see a ~25% speedup.

The Front End The Ring Bus & System Agent
Comments Locked

62 Comments

View All Comments

  • yuhong - Tuesday, September 14, 2010 - link

    There is no VEX.256 for 256-bit integer ops, but there is a VEX.128 prefix that zeros the upper part of YMM registers to reduce the delays..
  • NaN42 - Tuesday, September 14, 2010 - link

    Well, I found a summary of the prefixes. Interestingly there are some exception, like I guessed, e.g. a VEX.128 prefix does not exist for conversion of packed floating points<->packed integers and for CRC32c + POPCNT.
  • CSMR - Tuesday, September 14, 2010 - link

    Anand:
    The best info available on an exciting platform, good job.

    I wonder if for the next article you could test DirectX / OpenGL compatibility? Intel advertises compliance for a lot of its products, but in reality the support is partial, and some applications that use DirectX / OpenGL entirely correctly are not supported by Intel graphics, including the current HD graphics.

    I've found this with fastpictureviewer (DirectX, I think 9) and Photoshop CS5 (OpenGL 2)

    This is quite shocking. Given that Intel is doing this currently, it would be great if reviewers could prod it into action, but unfortunately they tend to place speed first, correctness second or nowhere.
  • marass31 - Thursday, September 16, 2010 - link

    Hi CSMR,
    Could you please write more details about problems with DX and OGL on Intel HD graphics( including gfx driver version, system config ...). You mentioned about two applications: Fastpictureviewer and PSCS5, so could you please write some steps to reproduce to each of them - THX a lot.
  • ssj4Gogeta - Tuesday, September 14, 2010 - link

    What's the point of extreme editions if we're going to have affordable K SKUs?
    Or will socket 2011 not have any K SKUs? I'm guessing they'll leave the BCLCK unlocked on the 2011, and only have normal and extreme processors (no K processors). Or maybe extreme editions will just have more cores like 980X?
  • DanNeely - Tuesday, September 14, 2010 - link

    The extreme editions have always been for people who buy retail or who're playing with LN2 and need the most insanely binned part available. They've never been a mainstream OCer part.
  • MonkeyPaw - Tuesday, September 14, 2010 - link

    I have a bad feeling about the "k" chips and the future of overclocking. Sure, intel gave us turbo mode, but that almost seems like appeasement before the last shoe drops. First, limited turbo with good overclockng, then better turbo and less overclocking, and now it's sounding like slightly better turbo and even less overclocking. It looks like we are moving to intel-controlled overclocking. There's virtually no value left for the enthusiast--a user that is already just a small part of the market. Intel just decided what the enthusiast needs, but I don't think they get what those users actually want.

    I just don't buy that these limits are to prevent fraud. Mom and Pop stores are virtually all gone now, and I'd hate to think what Intel would do to a Dell or HP if they got caught overclocking desktops.

    I guess this leaves another door open for AMD. Sad, cause SnB looks like a great design.
  • This Guy - Wednesday, September 15, 2010 - link

    Hopefully Intel will allow the 'energy budget' to be increased when an extreme edition processer detects less thermal resistance (i.e. a bloody big heat sink). This would allow an EE CPU to either run with a higher multiplier or run at it's turbo frequency longer. (I'ld like this feature on all CPU's)

    This would make EE CPU's interesting if K CPU's catch up in terms of cores.
  • Shadowmaster625 - Tuesday, September 14, 2010 - link

    What are the prospects for using Intel's transcoder to convert DVDs to 700MB avi files? Either DivX, Xvid, or H.264? Or anything else better than MPEG-2?
  • Dfere - Tuesday, September 14, 2010 - link

    Since this seems to be, overall, a refinement, and not so much an improvement with new capabilities, and Anand's comments about the scalability of GPU related enhancements, that Intel is taking a two step approach towards CPU releases, in addition to its fab strategies? E.G, we see a new CPU, then it gets shrunk, then it gets improved (like this), then it gets bells and whistles (like a GPU etc), then we start over again with a really new architecture.....

Log in

Don't have an account? Sign up now