Haswell's Wide Execution Engine

Conroe introduced the six execution ports that we've seen used all the way up to Ivy Bridge. Sandy Bridge saw significant changes to the execution engine to enable 256-bit AVX operations but without increasing the back end width. Haswell does a lot here.

Just as before, I put together a few diagrams that highlight the major differences throughout the past three generations for the execution engine.


The reorder buffer is one giant tracking structure for all of the micro-ops that are in various stages of execution. The size of this buffer is directly impacted by the accuracy of the branch predictor as that will determine how many instructions can be kept in flight at a given time.

The reservation station holds micro-ops as they wait for the data they need to begin execution. Both of these structures grow by low double-digit percentages in Haswell.
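To put "low double-digit percentages" into numbers, here is a quick sketch. The entry counts are the commonly cited figures for the two cores; they are not stated in the text above, so treat them as an assumption of this example.

```python
# Commonly cited entry counts (assumed here, not given in the text above).
SANDY_BRIDGE = {"reorder_buffer": 168, "reservation_station": 54}
HASWELL = {"reorder_buffer": 192, "reservation_station": 60}


def growth_percent(old: int, new: int) -> float:
    """Relative growth from old to new, in percent."""
    return (new - old) / old * 100.0


for name in SANDY_BRIDGE:
    pct = growth_percent(SANDY_BRIDGE[name], HASWELL[name])
    print(f"{name}: {SANDY_BRIDGE[name]} -> {HASWELL[name]} (+{pct:.1f}%)")
```

Under those assumed counts both structures land in the low teens, percentage-wise.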

Simply being able to pick from more instructions to execute in parallel is one thing, but we haven't seen an increase in the number of parallel execution ports since Conroe. Haswell changes that.

From Conroe to Ivy Bridge, Intel's Core micro-architecture has supported the execution of up to six micro-ops in parallel. While there are more than six execution units in the system, there are only six ports to stacks of execution units. Three ports are used for memory operations (loads/stores) while three are on math duty. Over the years Intel has added additional types and widths of execution units (e.g. Sandy Bridge added 256-bit AVX operations) but it hasn't strayed from the six-port architecture.

Haswell finally adds two more execution ports, one for integer math and branches (port 6) and one for store address calculation (port 7). Including both additional compute and memory hardware is a balanced decision on Intel's part.

The extra ALU and port do one of two things: either improve performance for integer-heavy code, or allow integer work to continue while FP math occupies ports 0 and 1. Remember that Haswell, like its predecessors, is an SMT design, meaning each core will see instructions from up to two threads at the same time. Although a single app is unlikely to mix heavy vector FP and integer code, it's quite possible that two applications running at the same time may produce such varied instructions. Having more integer ALUs is never a bad thing.

Also using port 6 is another unit that can handle x86 branch instructions. Branch-heavy code can now enjoy two independent branch units, or if port 0 is occupied with other math the machine can still execute branches on port 6. Haswell moved the original Core branch unit from port 5 over to port 0, the most capable port in the system, so a second branch unit on a lightly populated port helps ensure there's no performance regression as a result of the change.
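A toy model makes the benefit of port 6 easier to see. The sketch below assumes every micro-op is independent and occupies one port for one cycle, which is far simpler than a real scheduler. The port tables only cover the math ports, and the placement of integer ALUs on ports 0, 1 and 5 is an assumption of this example rather than something spelled out above. The names `cycles_to_issue`, `SANDY_BRIDGE_PORTS` and `HASWELL_PORTS` are made up for illustration.

```python
# Toy issue model: independent micro-ops, one cycle each, one uop per port
# per cycle. Port contents are a simplified assumption, math ports only.
SANDY_BRIDGE_PORTS = {
    0: {"int", "fp"},
    1: {"int", "fp"},
    5: {"int", "branch"},
}
HASWELL_PORTS = {
    0: {"int", "fp", "branch"},
    1: {"int", "fp"},
    5: {"int"},
    6: {"int", "branch"},  # the new port described above
}

# Kinds with fewer capable ports are tried first on each port.
PRIORITY = ("fp", "branch", "int")


def cycles_to_issue(ports, uops):
    """Cycles needed to issue all uops with a simple greedy assignment."""
    supported = set().union(*ports.values())
    for kind, count in uops.items():
        if count and (kind not in supported or kind not in PRIORITY):
            raise ValueError(f"no port can execute {kind!r}")
    pending = dict(uops)
    cycles = 0
    while any(pending.values()):
        cycles += 1
        for kinds in ports.values():
            for kind in PRIORITY:
                if kind in kinds and pending.get(kind, 0) > 0:
                    pending[kind] -= 1
                    break
    return cycles


fp_plus_int = {"fp": 200, "int": 200}
branchy = {"branch": 100, "int": 100}
for name, ports in (("Sandy Bridge", SANDY_BRIDGE_PORTS), ("Haswell", HASWELL_PORTS)):
    print(name, cycles_to_issue(ports, fp_plus_int), cycles_to_issue(ports, branchy))
```

In this model the mixed FP and integer stream needs 134 cycles on the three-port layout and 100 on the four-port one, while the branch-heavy stream drops from 100 cycles to 50. Real code has dependencies and cache misses, so the numbers only illustrate where the extra port helps.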

Sandy Bridge made ports 2 & 3 equal class citizens, with both capable of being used for load or store address calculation. In the past you could only do loads on port 2 and store addresses on port 3. Sandy Bridge's flexibility did a lot for load heavy code, which is quite common. Haswell's dedicated store address port should help in mixed workloads with lots of loads and stores.
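The effect of a dedicated store address port can be sketched with a little arithmetic. This is a back-of-the-envelope model, not a description of the real memory pipeline: it assumes independent memory operations, one address generated per port per cycle, and a single store data port that limits the machine to one store per cycle (that last limit is an assumption of the example). The function names are hypothetical.

```python
from math import ceil


def memory_cycles_sandy_bridge(loads: int, stores: int) -> int:
    """Ports 2 and 3 share all load and store address generation."""
    # Assumed: at most one store per cycle because of a single store data port.
    return max(ceil((loads + stores) / 2), stores)


def memory_cycles_haswell(loads: int, stores: int) -> int:
    """Ports 2 and 3 can serve loads while port 7 takes store addresses."""
    # Stores fit on port 7 at one per cycle, so loads get ports 2 and 3.
    return max(ceil(loads / 2), stores)


# A stream with two loads for every store.
print(memory_cycles_sandy_bridge(200, 100))
print(memory_cycles_haswell(200, 100))
```

With two loads per store, the two shared ports need 150 cycles in this model while the three-port arrangement needs 100, which is the "mixed workloads with lots of loads and stores" case described above.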

The other major addition to the execution engine is support for Intel's AVX2 instructions, including FMA (Fused Multiply-Add). Ports 0 & 1 now include newly designed 256-bit FMA units. As each FMA operation is effectively two floating point operations, these two units double the peak floating point throughput of Haswell compared to Sandy/Ivy Bridge. A side effect of the FMA units is that you now get two ports worth of FP multiply units, which can be a big boon to legacy FP code.
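The doubling of peak throughput falls out of simple lane counting. The sketch below assumes the Sandy/Ivy Bridge baseline is one 256-bit FP multiply unit plus one 256-bit FP add unit, each retiring one operation per lane per cycle; the helper name `peak_flops_per_cycle` is made up for this example.

```python
VECTOR_BITS = 256  # AVX register width


def peak_flops_per_cycle(element_bits: int, units: int, flops_per_lane: int) -> int:
    """Peak floating point operations per cycle for one core."""
    lanes = VECTOR_BITS // element_bits
    return lanes * units * flops_per_lane


for label, bits in (("single precision", 32), ("double precision", 64)):
    # Assumed baseline: one multiply unit + one add unit, 1 FLOP per lane each.
    sandy = peak_flops_per_cycle(bits, units=2, flops_per_lane=1)
    # Haswell: two FMA units, each lane does a multiply and an add (2 FLOPs).
    haswell = peak_flops_per_cycle(bits, units=2, flops_per_lane=2)
    print(f"{label}: {sandy} -> {haswell} FLOPs/cycle")
```

These are peak figures that only apply when every operation in the stream can be expressed as an FMA.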

Fused Multiply-Add operations are incredibly handy in all sorts of media processing and 3D work. Rather than having to independently multiply and add values, being able to execute both in tandem via a single execution port increases the effective execution width of the machine. Note that a single FMA operation takes 5 cycles in Haswell, which is the same latency as an FP multiply from Sandy/Ivy Bridge. In the previous generation a floating point multiply+add took 8 cycles, so there's a good latency improvement here as well as the throughput boost from having two FMA units.
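The latency win matters most when each multiply-add depends on the previous result, as in Horner-style polynomial evaluation. Here is a rough sketch using the cycle counts quoted above; the 3-cycle add is inferred from the 8-cycle multiply+add minus the 5-cycle multiply, and `chain_latency` is a hypothetical helper.

```python
FP_MUL_LATENCY = 5   # cycles, Sandy/Ivy Bridge FP multiply (from the text)
FP_ADD_LATENCY = 3   # cycles, inferred: 8-cycle multiply+add minus the multiply
FMA_LATENCY = 5      # cycles, Haswell FMA (from the text)


def chain_latency(steps: int, fused: bool) -> int:
    """Cycles for a chain where each multiply-add waits on the previous one."""
    per_step = FMA_LATENCY if fused else FP_MUL_LATENCY + FP_ADD_LATENCY
    return steps * per_step


steps = 10  # e.g. a degree-10 polynomial evaluated with Horner's rule
print("separate multiply and add:", chain_latency(steps, fused=False), "cycles")
print("fused multiply-add:", chain_latency(steps, fused=True), "cycles")
```

For a ten-step chain that works out to 80 cycles versus 50, ignoring everything else the core is doing at the same time.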

Intel focused a lot on adding more execution horsepower in Haswell without creating a power burden for legacy use cases. All of the new units can be shut off when not in use. Furthermore, Intel went in and ensured that this applied to the older execution units as well: in Haswell if you're not doing work, you're not consuming power.


245 Comments


  • Astarael - Monday, October 15, 2012 - link

    Then get out of the comments section.
  • Old_Fogie_Late_Bloomer - Tuesday, October 9, 2012 - link

    I finally made it through this article...hell, I took a course in organization and architecture earlier this year and I didn't come close to understanding everything written here.

    Still, it was a great read. Thanks for going to the trouble, Anand. :-)
  • IKeelU - Friday, October 5, 2012 - link

    What's great is that Anand's been doing this for 15 years, has hired new editors along the way, and the quality hasn't wavered. I'm glad they haven't polluted their front page with shallow tech blogging like other sites I once enjoyed.

    I can't imagine this hobby without this site. I got into PC building just as it came online and have depended on it ever since.
  • TheJian - Monday, October 8, 2012 - link

    I disagree. Ryan Smith's 660TI article had some ridiculous conclusions and went on and on about a bandwidth issue that isn't an issue at 1920x1200. As evidenced by the fact that in their own tests it beat the 7950B in 6 games by OVER 20% but lost in one game by less than 10 at 1920x1200. Read the comments section where I reduced his arguments to rubble. He went on about a dumb Korean monitor you'd have to EBAY to get (or amazon from a guy with ONE review, no phone, no faq page, no domain, and a gmail account for help...LOL), and runs in 2560x1440. If his conclusions were based on 1920x1200 like he said (which he repeated to me in the comments yet touts some "enthusiast 2560x1440" korean monitor as an excuse for his conclusions), he would have been forced to say the truth which was as his benchmarks showed and hardocp stated. It wipes the floor with the 7950B, just as the 680 does with the 7970ghz (yea, even in MSAA 8x) where they also proved only 1 in 4 games was even above 30fps...@2560x1600 with high AA which is why its pointless to draw conclusions based on 2560x1600 as Ryan did. Heck 2 of the 4 games at hardocp's high AA article didn't even reach above 20fps (15 & 17, and if bandwidth is an issue how come the 660TI won anyway?...LOL)

    Ryan was reduced to being a fool when I was done with him, and then Jarred W. came in and insinuated I was a Ahole & uninformed...ROFL. I used all of his own data from the 660TI & 7970B & 7970ghz edition articles (all by Ryan!) to point out how ridiculous his conclusions were. When a card loses 6 out of 7 games, you leave out Starcraft 2 (which you used for 2 previous articles 1 & 2 months before, then again IMMEDIATELY after) which would have shown it beating even the 7970ghz edition (as all the nv cards beat it in that game, hence he left it out), you claim some Korean Ebay'd monitor as a reason for your asinine conclusions (clear bias to me), in the 6 games it loses by an avg of 20% or more at the ONLY res 68 24in monitors on newegg use (or below, most 1920x1080, not even 1920x1200, only <2% in steampowered.com hardware survey have above 1920x1200 and most with dual cards in that case), you've clearly WAVERED in your QUALITY since Anand took up mac's/phones.

    I'm all for trying to save AMD (quit lowering your prices idiots, maybe you'll make some money), but stooping to dumb conclusions when all of your own evidence points in the exact opposite direction is really shady. Worse it was BOTH editors, as Ryan gave up (the evidence was voluminous, he wisely ran and hid) Jarred stepped in to personally attack me instead of the data...ROFLMAO. You know you've lost when you say nothing about my numbers at all, and resort to personal attacks. Ryan nor Jarred are dumb. They should have just admitted the article was full of bias or just changed the conclusion and moved on. With all the evidence I pointed out I wouldn't have wanted it to be in print any longer. It's embarrassing if you read the comments section after the article. You go back and realize what they did and wonder what the heck Ryan was thinking. He said that same crap in his next article. Either he loves AMD, gets money/hardware or something or maybe he just isn't as smart as I thought :)

    Anand's last hardware article on haswell said it would be a "MONSTER" but it's graphics won't catch AMD's integrated gpu and we only get 5-15% on the cpu side for a TOCK release. 2x gpu doesn't mean much with it being 9 months away and won't even catch AMD if they sit still. OUCH. So basically much ado about nothing on the desktop side, with a hope they can do something with it in mobile below 10w (only a tablet even then). I was pondering waiting for the "MONSTER" but now I know I'll just buy an Ivy at black friday...ROFL. What monster? In this article he says Broadwell is now the "monster"...heh. Bah...At least I got to read this before black friday. I would have been ticked had I read this after it hoping for the desktop monster. Since AMD now sucks on the cpu side we get speed bin bumps for microarchitecure TOCK's instead of 25-40% like the old days. I pray AMD stops the price war with NV and starts taking profits soon.

    If it wasn't for their advantage on the integrated gpu, they'd be bankrupt already and they will be there by xmas 2014 at the current burn of 650mil/year losses (they only have 1.5Bil in the bank and billions in debt compared to 3.5B cash for NV and no debt, never mind giving up the race to Intel who dwarfs NV by 10x on all fronts). AMD's only choice will be to further reduce their stock value by dilution of shares (AGAIN!) which will finally put them out to pasture. Hopefully someone will pick up their IP, put a few billion in it and compete again with Intel (samsung, ibm, NV if amd stock drops to $1 by then, even they could do it). Otherwise, my next card/cpu upgrade after black friday will cost $1000 each as NV/INTC suck us all dry. There stock is already WAY down in credit rating (B+ last I checked, FAR from NV AAA), and they are listed as 50% chance of bankruptcy vs. all their competitors at 1% chance (intc, qcom, nvda, samsung etc). The idea they'll take over mobile is far fetched at best. I see nowhere but down for their share price. That sucks. I hate apple, but at this point I wouldn't even mind if they picked them up and ran with AMD's cpu mantle. We might start getting ivy 3770's (or the next king) at prices less than $329 then! The first sale I've seen was $309 in my email from newegg this weekend and that sucks in 7 months. No speed upgrades, no price drops, just the same thing for 7 months with no pressure from a competitive-less AMD. Their gpu sucks compared to 660ti (hotter, noisy, less perf), so no black friday discount. You either go AMD for worse but savings or pay through the nose for NV. Same with Intel and the cpu. In that respect I guess I get Ryan trying to save them...ROFL. But prolonging the inevitable isn't helping, I'd rather have them go belly up now and someone buy the cpu and run with it before it's so far behind Intel they can't fix it no matter who buys the IP. I digress...
  • Spunjji - Thursday, October 18, 2012 - link

    God that was painful to even attempt to read. :/ Comparing AMD vs. nVidia to AMD vs. Intel is foolish in the extreme (there's a rather significant difference in the cost/performance balance, where AMD and nVidia are actually competitors) so I feel justified in not reading most of that screed.
  • ananduser - Friday, October 5, 2012 - link

    Yes...Anand's quite the loss for the PC crowd. He's reviewing macs nowadays.
  • A5 - Friday, October 5, 2012 - link

    If you owned a site and could delegate reviews you don't find interesting (oooh boy, another 15-pound overpriced gaming laptop!), wouldn't you do the same thing?
  • Kepe - Friday, October 5, 2012 - link

    Mmh, I've also noticed how Anand seems to have become quite an Apple fan. Don't get me wrong, I love his reviews, and Anandtech as a whole. But the fact that Anand always keeps talking about Apple is an eyesore to me. Particularly annoying in this article was how he mentioned "iPad form factor" as if it was the only tablet out there. Why not say "tablet form factor" instead? Would have been a lot more neutral. Also it seemed to confuse someone into thinking Apple might be putting Haswell into a new iPad.
  • meloz - Friday, October 5, 2012 - link

    Agreed. The Apple devotion has gone too far and the editorial balance has been lost. The podcasts -in particular- are basically an advertising campaign for Apple and a thinly disguised excuse for Anand & Friends to praise everything Apple. So I do not listen to them.

    The articles though -like this one about Haswell- are still worth reading. You still get as many gratuitous Apple references as Anand can throw in but there is also plenty of substance for everyone else.
  • ravisurdhar - Friday, October 5, 2012 - link

    It's not "devotion", it's simply an accurate description of the market. How many iPads are out there? 100 million. One tenth of a BILLION. One for every 70 people on the planet. Well over half of Fortune 500 companies use them. Hospitals use them. Pilots use them. Name one other tablet that comes close to that sort of market penetration. When Apple decides to make their own silicon for their devices, it's a big, big deal.

    For the record, I don't have one. I just understand the significance of the 800 pound gorilla.
