Decoupled L3 Cache

With Nehalem, Intel introduced an on-die L3 cache behind a smaller, low-latency private L2 cache. At the time, Intel maintained two separate clock domains for the CPU (core + uncore) and a third for what was then an off-die integrated graphics core. The core clock referred to the CPU cores, while the uncore clock controlled the speed of the L3 cache. Intel believed that its L3 cache wasn't especially latency sensitive and could run at a lower frequency to burn less power. Core CPU performance typically mattered more to most workloads than L3 cache performance, so Intel was OK with the tradeoff.

In Sandy Bridge, Intel revised its beliefs and moved to a single clock domain for the core and uncore, while keeping a separate clock for the now on-die processor graphics core. Intel now felt that race to sleep was a better philosophy for dealing with the L3 cache and that it would rather keep things simple by running everything at the same frequency. Obviously there were performance benefits, but there was one major downside: with the CPU cores and L3 cache running in lockstep, there was concern over what would happen if the GPU ever needed to access the L3 cache while the CPU (and thus L3 cache) was in a low frequency state. The options were either to force the CPU and L3 cache into a higher frequency state together, or to keep the L3 cache at a low frequency even when it was in demand to prevent waking up the CPU cores. Ivy Bridge saw the addition of a small graphics L3 cache to mitigate this situation, but ultimately giving the on-die GPU independent access to the big, primary L3 cache without worrying about power concerns remained a big issue for the design team.

When it came time to define Haswell, the engineers once again went back to Nehalem's three clock domains. Ronak (Nehalem & Haswell architect, insanely smart guy) tells me that the switching between designs is simply a product of the team learning more about the architecture and understanding the best balance. To me, it says that these guys are still human and don't always have the right answer for the long term without some trial and error.

The three clock domains in Haswell are roughly the same as what they were in Nehalem; they just all happen to be on the same die. The CPU cores all run at the same frequency, the on-die GPU runs at a separate frequency, and now the L3 + ring bus are in their own independent frequency domain.

Now that CPU requests to L3 cache have to cross a frequency boundary, there will be a latency impact to L3 cache accesses. Sandy Bridge had an amazingly fast L3 cache; Haswell's L3 accesses will be slower.
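
To make the latency argument concrete, here is a minimal sketch of the arithmetic. It assumes a simple model in which an L3 access costs a fixed number of cycles in the cache's own clock domain, plus a few core-clock cycles to synchronize across the domain boundary. The function name, the cycle counts and the frequencies are all hypothetical placeholders chosen for illustration, not measured or Intel-published figures.

```python
def l3_latency_ns(cache_cycles, cache_ghz, sync_cycles=0, core_ghz=None):
    """Estimate L3 access latency in nanoseconds under a toy model.

    cache_cycles: cycles spent in the L3/ring clock domain (hypothetical)
    cache_ghz:    frequency of the L3/ring clock domain in GHz
    sync_cycles:  extra core-clock cycles spent crossing the domain
                  boundary (hypothetical; zero when clocks are shared)
    core_ghz:     core frequency in GHz; defaults to cache_ghz
    """
    if core_ghz is None:
        core_ghz = cache_ghz
    # cycles / GHz gives nanoseconds
    return cache_cycles / cache_ghz + sync_cycles / core_ghz


# Shared clock domain: the cache runs at the core clock, no boundary to cross.
shared = l3_latency_ns(cache_cycles=30, cache_ghz=3.0)

# Decoupled clock domain: the cache runs slower than the cores and each
# request pays a synchronization penalty at the boundary.
decoupled = l3_latency_ns(cache_cycles=30, cache_ghz=2.5,
                          sync_cycles=4, core_ghz=3.0)

print(f"shared domain:    {shared:.2f} ns")
print(f"decoupled domain: {decoupled:.2f} ns")
```

Under these made-up inputs the decoupled case comes out slower for two separate reasons: the cache itself may be clocked lower than the cores, and the crossing adds cycles of its own even when both domains run at the same speed.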

The benefit is obviously power. If the GPU needs to fire up the ring bus to give/get data, it no longer has to drive up the CPU core frequency as well. Furthermore, Haswell's power control unit can dynamically allocate budget between all areas of the chip when power limited.
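
As a rough illustration of what sharing a power budget between domains can look like, the sketch below scales each domain's requested power proportionally whenever the total exceeds a package limit. This is purely a toy policy of my own for illustration: Intel has not described the power control unit's algorithm here, and the domain names, wattages and the `allocate_budget` helper are all hypothetical.

```python
def allocate_budget(requests_w, package_limit_w):
    """Fit per-domain power requests (watts) under a package limit.

    Toy policy: if the requests fit, grant them unchanged; otherwise
    scale every request by the same factor so the total equals the limit.
    """
    total = sum(requests_w.values())
    if total <= package_limit_w:
        return dict(requests_w)
    scale = package_limit_w / total
    return {domain: watts * scale for domain, watts in requests_w.items()}


# Hypothetical requests from the three clock domains discussed above.
requests = {"cores": 30.0, "gpu": 20.0, "l3_ring": 10.0}
granted = allocate_budget(requests, package_limit_w=45.0)

for domain, watts in granted.items():
    print(f"{domain}: {watts:.1f} W")
```

The point of the independent L3 + ring domain is visible in the inputs: the GPU can raise the `l3_ring` request without the `cores` request having to rise with it.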

Although L3 latency is up in Haswell, there's more access bandwidth offered to each slice of the L3 cache. There are now dedicated pipes for data and non-data accesses to the last level cache.

Haswell's memory controller is also improved, with better write throughput to DRAM. Intel has been quietly telling the memory makers to push for even higher DDR3 frequencies in anticipation of Haswell.

245 Comments

  • tipoo - Sunday, October 7, 2012 - link

    I don't think so. Doesn't the HD4000 have more bandwidth to work with than AMD's APUs, yet offer worse performance? They still had headroom there. I think it's just for TDP; they limit how much power the GPUs can use since the architecture is oriented at mobile.
  • magnimus1 - Friday, October 5, 2012 - link

    Would love to hear your take on how Intel's latest and greatest fares against Qualcomm's latest and greatest!
  • cosmotic - Friday, October 5, 2012 - link

    Ah, an MPEG2 encoder. Just in time!
  • jamyryals - Friday, October 5, 2012 - link

    This made me :)
  • name99 - Friday, October 5, 2012 - link

    We laugh, but one possibility is that Intel hopes to sell Haswells inside US broadcast equipment.
    There isn't much broadcast equipment sold, but the costs are massive, and there's no obvious reason not to replace much of that custom hardware with Intel chips.
    And much of the existing broadcast hardware (at least the MPEG2-encoding part) is obviously garbage --- the artifacts I see on broadcast TV are bad even for the prime-time networks, and are truly awful for the budget independent operators.

    Much like they have written a cell-tower stack to run on i7's to replace the similarly grossly over-priced custom hardware that lives in cell towers, and are currently deploying in China. Anand wrote about this about two weeks ago.
  • vt1hun - Friday, October 5, 2012 - link

    Do you have an idea when Intel will move to DDR4? Not with Haswell, according to this article.

    Thank you
  • tipoo - Friday, October 5, 2012 - link

    Haswell EX for servers will support DDR4, but even Broadwell on desktops is only DDR3. We won't see DDR4 in desktops until 2015.
  • jwcalla - Friday, October 5, 2012 - link

    We'll probably see DDR4 in the ARM space before we have it on Intel.

    Maybe this should be AMD's focus of attack: if they can't compete on performance, at least try on chipset features.

    Perhaps Intel's biggest concern would be if somebody comes along with a super-efficient x86 emulator for ARM. Going forward, "legacy applications" is going to be an increasingly important selling point to prevent ARM inroads on the low end.

    Microsoft keeping their Windows ARM version locked-down is a key to that too, and likely a deference to their relationship with Intel. But Apple is less likely to similarly constrain themselves.
  • meloz - Saturday, October 6, 2012 - link

    >We'll probably see DDR4 in the ARM space before we have it on Intel.

    >Maybe this should be AMD's focus of attack: if they can't compete on performance, at least try on chipset features.

    The problem with DDR4 is likely going to be the price. We all know how the memory industry likes to jack up the prices whenever a new spec comes out. Remember how expensive DDR3 was when it started to replace DDR2?

    Some people joke that this transition is the only time they make any money in the RAM business, and considering the low prices of DDR3 you have to wonder.

    DDR4 might offer some performance and power advantage on release, but it will likely be more expensive and take time (12-18 months?) to offer a compelling performance / $ advantage over cheap DDR3 variants.

    If AMD is trying to position itself as 'value' brand, chaining themselves to DDR4 (before Intel's volume brings down the prices for everyone) could spell their doom.
  • Kevin G - Friday, October 5, 2012 - link

    Intel is set to launch Ivy Bridge EX on a new socket late in 2013. The on-die controller will likely use memory buffering similar to what Nehalem-EX and Westmere-EX use. The buffer chips may initially use DDR3 but this would allow for a trivial migration to DDR4 since the on-die controller doesn't communicate directly with the memory chips.

    Come to think of it, Intel could migrate Nehalem-EX/Westmere-EX to DDR4 with a chipset upgrade. Vendors like HP put the buffer chips and memory slots on a daughter card so only that part would need replacement.
