The Haswell Front End

Conroe was a very wide machine. It brought us the first 4-wide front end of any x86 micro-architecture, meaning it could fetch and decode up to 4 instructions in parallel. We've seen improvements to the front end since Conroe, but the overall machine width hasn't changed - even with Haswell.

Haswell leaves the overall pipeline untouched. It's still the same 14 - 19 stage pipeline that we saw with Sandy Bridge depending on whether or not the instruction is found in the uop cache (which happens around 80% of the time). L1/L2 cache latencies are unchanged as well. Since Nehalem, Intel's Core micro-architectures have supported execution of two instruction threads per core to improve execution hardware utilization. Haswell also supports 2-way SMT/Hyper Threading.

The front end remains 4-wide, although Haswell features a better branch predictor and hardware prefetcher so we'll see better efficiency. Since the pipeline depth hasn't increased but overall branch prediction accuracy is up we'll see a positive impact on overall IPC (instructions executed per clock). Haswell is also more aggressive on the speculative memory access side.

The image below is a crude representation I put together of the Haswell front end compared to the two previous tocks. If you click the buttons below you'll toggle between Haswell, Sandy Bridge and Nehalem diagrams, with major changes highlighted.


In short, there aren't many major, high-level changes to see here. Instructions are fetched at the top, sent through a bunch of steps before getting to the decoders where they're converted from macro-ops (x86 instructions) to an internally understood format known to Intel as micro-ops (or µops). The instruction fetcher can grab 4 - 5 x86 instructions at a time, and the decoders can output up to 4 micro-ops per clock.

Sandy Bridge introduced the 1.5K µop cache that caches decoded micro-ops. When future instruction fetch requests are made, if the instructions are contained within the µop cache everything north of the cache is powered down and the instructions are serviced from the µop cache. The decode stages are very power hungry so being able to skip them is a boon to power efficiency. There are also performance benefits as well. A hit in the µop cache reduces the effective integer pipeline to 14 stages, the same length as it was in Conroe in 2006. Haswell retains all of these benefits. Even the µop cache size remains unchanged at 1.5K micro-ops (approximately 6KB in size).

Although it's noted above as a new/changed block, the updated instruction decode queue (aka allocation queue) was actually one of the changes made to improve single threaded performance in Ivy Bridge.

The instruction decode queue (where instructions go after they've been decoded) is no longer statically partitioned between the two threads that each core can service.

The big changes in Haswell are at the back end of the pipeline, in the execution engine.

CPU Architecture Improvements: Background Prioritizing ILP
POST A COMMENT

248 Comments

View All Comments

  • sean.crees - Friday, October 05, 2012 - link

    I also worry about AMD. AMD has been 1-2 steps behind Intel for a while now, and now it seems Intel is at least 1 or 2 steps behind ARM and the future. Is that going to mean AMD is just too far behind to stay relevant now? If nothing else, i suppose AMD can fall back on graphic cards with it's ATI acquisition. Reply
  • Da W - Friday, October 05, 2012 - link

    If Haswell keeps x86 relevant in the tablet space and thus Windows 8 has the upper edge over Windows RT and Windows tablets can grab +-50% market share from the iPad, then it can be good for AMD, provided they survive that long. Reply
  • RedemptionAD - Friday, October 05, 2012 - link

    If AMD can create a team to focus on increasing IPC with a goal to one up Intel and have the ATI graphics people keep doing what they do with a time goal of say 2 years, (Note: Portables/Notebooks/Desktops should all be x64 by then), then I think that AMD will be able to return to their Athlon 64 glory days or better. Reply
  • Da W - Friday, October 05, 2012 - link

    AMD spend 1/10th of Intel in R&D. There are things they just cant do, i suspect pursuing higher x86 single trend performance is one of them. Reply
  • StevoLincolnite - Saturday, October 06, 2012 - link

    However, allot of the R&D Intel spends is on lithography type technologies, AMD doesn't have to spend Billions on such things anymore.

    Besides, a simple way for AMD to beat Intel when Intel is a node ahead is to throw more transistors at the problem which they have succeeded very well at doing in the past.
    Mind you, that comes at the cost of power and die size, however with stuff like clock mesh it can negate some of that.
    Reply
  • Kevin G - Friday, October 05, 2012 - link

    Being four steps behind ARM isn't necessarily a bad thing unless you're trying to leap frog them. AMD appears to be content with letting Intel spearhead the effort to get into the ultramobile market. With Intel two steps behind of ARM and they couldn't leap frog over ARM, there is little chance that AMD would be able to do the same. It isn't just knowing what battles to fight but also when to fight them. Reply
  • abufrejoval - Friday, October 05, 2012 - link

    It was only when I was reading Jana Rutkowska's notes on the current UI limitations within Qubes, that I finally understood (I believe!) the message which AMD has been pushing for quite a few years now: GPU compute will truly be an integral part of their future APUs in one or two generations, becoming almost an augmented instruction set instead of just a SoC.

    Currently all Qubes "user" applications, that is everything except the Dom0, can't use the GPU to render their graphics: It's basically software rendering into an off-screen composition buffer and then GPU assisted composition of these software buffers onto the visible screen (this time with all the wobble and transition effects we've all come to expect and love ;-)

    That's because although the GPU is on the same die even on the newest Trinity class APUs, it's still logically very separate, sharing only some stuff but bypassing, I believe, the ordinary page tables (not the IOMMU ones) and the snooping logic for caches. So even if GPU and CPU sit on the same die and use the same phyiscal DRAM bus, doing GPU compute implies using a dedicated part of that RAM in a way, which doesn't mesh seamlessly with CPU compute.

    But the roadmap seems to imply, that this limitation will go away, which would allow e.g. Qubes to use GPU assisted rendering anywhere in user space memory and thus also into a per DomU virtual framebuffer composed of quite ordinary paged virtual memory, which could then be assembled by the Dom0 for the visible screen or for video encoding and streaming to a remote display device e.g. for cloud gaming.

    This easy feeding of GPU "results" into another software layer is currently either impossible or requires major fiddling with device drivers so it's limited to the GPU vendors and bilateral deals such as nVidia and Splashtop. Once the GPU becomes more of an augmented instruction set, allowing OpenCL or even hardware primitives on ordinary user space paged virtual memory, this becomes as natural as running virtual machines with hardware virtualization.

    And at that point even the new 256bit FMA may look pretty lame compared to what hundreds of APU EUs could do. That to me explains rather well, why AMD isn't spending more transistors on a vastly improved CPU only x86 ISA: It truly belives it's a dead end for both personal and scientific workloads.

    It's a very daring bet and I very much admire them for having the vision and the balls to tie the company's survival to it. Over the last 40 years Intel seems to have failed with most of its visions (80432, i860, Itanium), but excelled on evolving x86. AMD, however, seems better on vision and noticably 2nd rate on execution.

    APUs are potentially quite dangerous both to nVidia and to Intel, because both can't easily duplicate them: The AMD/Intel cross licensing deal IMHO won't cover the GPU portion. Unless nVidia and Intel join, which would only happen if either of the two is in truly dire straights.

    But quite a few things need to fall in place over the next couple of years and AMD needs to survive them for that potential to develop. And it looks like all ther other players aren't standing still.

    Events like Apple potentially using Samsung augmented cash billions to turn TMSC into a private provider of 1x nm ARM SoCs are sending shock waves into the market, which may force "strange" alliances.

    These days when even trival things like "swipe to unlock" can be patented and used to bloodlet competitors I'm surprised to see IBM and Intel use things like transactional memory, which saw silocon first with Sun's Rock, I believe, or Intel turning to eDRAM for caches and frame buffers, which IBM's been implementing first on the p-Series.

    That leads me to an open question on the commercial workloads, which is almost the only domain, where I have difficulties seeing the immediate benefit of APUs, at least after Oracle's grab on Java and their expressed intent to make commercial workloads a SPARC exclusive (please see Larry's opening remarks on Openworld 2012): How can AMD make APUs the better Java and database engines? How can they make search, big data, map reduced or JavaScript run better on APUs?

    I can only guess that having managed CPU+GPU AMD would be in a better position to add xPU for all of the above.
    Reply
  • ltcommanderdata - Friday, October 05, 2012 - link

    A great, detailed description of Haswell's architecture. I do have some questions though.

    You mentioned that Intel will be including up to 1 redundant EU in the GPU array. Does that mean only GT3 will have the 1 redundant EU (41 total, 40 usable) with GT2 having no redundancy? Or is it 1 redundant EU per sub-slice, so GT2 will have 1 and GT3 will have 2?

    Will the embedded DRAM be implemented PoP like in SoC? When you say we'll see a version of Haswell with embedded DRAM do will all GT3 have embedded DRAM or will only some GT3 have embedded DRAM (kind of a GT4)?

    Given the long timescales of CPU design, there would be overlap between the Haifa team working on Sandy Bridge/Ivy Bridge (particularly Ivy Bridge) and the Hillsboro team working on Haswell. I was wondering if you knew how much opportunity there is for learning between consecutive designs in terms of magnitude of changes possible and timescales before things are pretty much fixed? I'm in no position to judge, but I was also wondering based on your knowledge of the architectures and/or interactions with members of the design teams if you sense any distinct difference in design philosophies between the Haifa and Hillsboro teams. Afterall, the Haifa team's background was in power-efficient, mobile-oriented designs whereas Hillsboro was high-performance, desktop/server oriented. You mentioned in the article that Haswell goes back to Nehalem's 3 clock domains due to lessons learned from Sandy Bridge/Ivy Bridge. While I don't doubt that's the primary reason, I wonder if design philosophy played a role too since Nehalem and Haswell are both Hillsboro designs and maybe they like 3 clock domains.
    Reply
  • Anand Lal Shimpi - Friday, October 05, 2012 - link

    Unfortunately that's all the info I have on redundancy in the GPU array. I think we'll have to wait until we're closer to launch to know more. The same goes for the nature of the on-package memory.

    I wondered the same thing about the correlation between design teams and decisions in Nehalem/Haswell, I refrained from speculating on it in the article because I didn't necessarily see any reason to doing so, but I definitely noticed the same correlation. It could just be a coincidence though. Nothing else beyond the L3 cache frequency really stood out to me as being an obvious common thread between Nehalem and Haswell though.

    Take care,
    Anand
    Reply
  • ltcommanderdata - Friday, October 05, 2012 - link

    Thanks again for your insights. Reply

Log in

Don't have an account? Sign up now