Core: Decoding, and Two Goes Into One

The role of the decoder is to decipher the incoming instruction (opcode, addresses) and translate the 1-15 byte variable-length instruction into a fixed-length, RISC-like instruction that is easier to schedule and execute: a micro-op. The Core microarchitecture has four decoders: three simple and one complex. A simple decoder translates an instruction into a single micro-op, while the complex decoder can convert one instruction into up to four micro-ops (longer instructions are handled by a microcode sequencer). It’s worth noting that simple decoders draw less power and take up less die area than complex decoders. This style of pre-fetch and decode occurs in all modern x86 designs; by comparison, AMD’s K8 design has three complex decoders.

The Core design came with two techniques to assist this part of the core. The first is macro-op fusion. When two common x86 instructions (or macro-ops) can be decoded together, they can be combined into one micro-op that holds both instructions, which increases throughput. The upshot is that four decoders can decode five instructions in one cycle.
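As a rough illustration of macro-op fusion, the Python sketch below pairs a compare with the conditional jump that directly follows it and counts the resulting ops. The function name `fuse_macro_ops` and the two mnemonic sets are hypothetical, and the compare-plus-jump pairing is an assumption about which instructions fuse rather than something the text above states; real hardware applies further conditions.

```python
# Toy model of macro-op fusion (illustrative only, not Intel's actual rules).
COMPARES = {"cmp", "test"}                            # assumed fusible first halves
COND_JUMPS = {"je", "jne", "jl", "jge", "ja", "jb"}   # illustrative subset


def fuse_macro_ops(instructions):
    """Return a list of ops, pairing a compare with an adjacent conditional jump."""
    ops = []
    i = 0
    while i < len(instructions):
        current = instructions[i]
        nxt = instructions[i + 1] if i + 1 < len(instructions) else None
        if current in COMPARES and nxt in COND_JUMPS:
            ops.append((current, nxt))   # one fused op holding two instructions
            i += 2
        else:
            ops.append((current,))       # ordinary one-instruction op
            i += 1
    return ops


program = ["mov", "add", "cmp", "jne", "mov"]
ops = fuse_macro_ops(program)
print(len(program), "instructions ->", len(ops), "ops")  # 5 instructions -> 4 ops
```

With one fusible pair in the window, five instructions occupy four decode slots, which is the "five instructions from four decoders" case described above.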

According to Intel at the time, for a typical x86 program, 20% of macro-ops can be fused in this way. With two instructions held in one micro-op, there is more decode bandwidth for other instructions, and further down the pipe less space is taken in the various buffers and the Out of Order (OoO) queue. If 20% of macro-ops pair up, then 1-in-10 instructions is fused with another instruction, and adjusting the pipeline this way should account for an 11% uptick in performance for Core. It’s worth noting that macro-op fusion (and macro-op caches) has become an integral part of Intel’s microarchitecture (and other x86 microarchitectures) as a result.
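The 11% figure can be reproduced with a back-of-the-envelope calculation. The sketch below assumes, purely for illustration, that performance scales with the number of ops that have to flow through the pipeline; `speedup_from_fusion` is a hypothetical helper name, not anything from Intel.

```python
def speedup_from_fusion(fused_fraction):
    """Relative gain when `fused_fraction` of instructions are absorbed into a partner op.

    Assumes throughput is inversely proportional to the op count (a simplification).
    """
    remaining_ops = 1.0 - fused_fraction   # ops left per original instruction
    return 1.0 / remaining_ops - 1.0


# 1-in-10 instructions fused: ten instructions become nine ops, 10/9 - 1
print(f"{speedup_from_fusion(0.10):.1%}")  # prints 11.1%
```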

The second technique is a specific fusion of instructions related to memory addresses rather than registers. An instruction that requires an addition of a register to a memory address, according to RISC rules, would typically require three micro-ops:

Pseudo-code                                 Instructions
read contents of memory to register2        MOV EBX, [mem]
add register1 to register2                  ADD EBX, EAX
store result of register2 back to memory    MOV [mem], EBX

However, since Banias (which preceded Yonah) and subsequently in Core, the first two of these micro-ops can be fused. This is called micro-op fusion. The pre-decode stage recognizes that these micro-ops can be kept together, using smarter but larger circuitry without lowering the clock frequency. Again, op fusion helps in more ways than one: more throughput, less pressure on buffers, higher efficiency and better performance. Alongside this simple example of memory address addition, micro-op fusion can play heavily in SSE/SSE2 operations as well. This is primarily where Core had an advantage over AMD’s K8.

AMD’s definitions of macro-ops and micro-ops differ from Intel’s, which makes it a little confusing when comparing the two.

As mentioned above, AMD’s K8 has three complex decoders compared to Core’s 3 simple + 1 complex decoder arrangement. We also mentioned that simple decoders are smaller, use less power, and spit out one Intel micro-op per incoming variable-length instruction. AMD’s K8 decoders, on the other hand, are dual purpose: each can implement Direct Path decoding, which is kind of like Intel’s simple decoder, or Vector decoding, which is kind of like Intel’s complex decoder. In almost all circumstances the Direct Path is preferred as it produces fewer ops, and it turns out most instructions go down the Direct Path anyway, including floating point and SSE instructions in K8, resulting in fewer instructions than on K7.

While K8’s decoders are extremely powerful in what they do, K8’s limitation compared to Intel’s Core is two-fold. First, AMD cannot perform Intel’s version of macro-op fusion, so where Intel can pack operations such as the load and execute steps in SSE into one fused instruction to increase decode throughput, AMD has to rely on two instructions. Second, by virtue of having more decoders (4 vs 3), Intel can decode more per cycle, and the gap widens with macro-op fusion: where Intel can decode five instructions per cycle, AMD is limited to just three.

As Johan pointed out in the original article, this makes it hard for AMD’s K8 to have had an advantage here. K8 would only pull ahead if three instructions in a row each needed the complex decoder on Intel without kicking in the microcode sequencer, because Core has just one complex decoder while K8 has three. Since the most frequent x86 instructions map to one Intel micro-op, this situation is pretty unlikely.
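To make the decoder comparison concrete, here is a deliberately simplified Python model. It treats every instruction as either "simple" or "complex", decodes strictly in order, and ignores op fusion, the microcode sequencer and fetch limits. The function `decode_cycles` and the two parameter sets are hypothetical names chosen for this sketch, and the widths are taken from the 4 vs 3 decoder counts described above.

```python
def decode_cycles(stream, width, max_complex):
    """Count cycles needed to decode `stream` in order.

    stream:      list of "simple" / "complex" instruction labels
    width:       instructions the decoders accept per cycle
    max_complex: how many of those may be complex instructions
    """
    cycles = 0
    i = 0
    while i < len(stream):
        taken = 0
        complex_taken = 0
        while i < len(stream) and taken < width:
            if stream[i] == "complex":
                if complex_taken == max_complex:
                    break                  # no complex decoder left this cycle
                complex_taken += 1
            taken += 1
            i += 1
        cycles += 1
    return cycles


CORE = dict(width=4, max_complex=1)   # 3 simple + 1 complex decoder
K8 = dict(width=3, max_complex=3)     # 3 dual-purpose decoders

typical = ["simple"] * 12             # the common case: one micro-op each
awkward = ["complex"] * 12            # back-to-back complex instructions

print(decode_cycles(typical, **CORE), decode_cycles(typical, **K8))  # 3 4
print(decode_cycles(awkward, **CORE), decode_cycles(awkward, **K8))  # 12 4
```

In this toy model K8 only wins on the all-complex stream, which matches the point that the favourable case for K8 is an unusual one.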


158 Comments


  • Hrel - Thursday, July 28, 2016

    10 years to double single core performance, damn. Honestly thought Sandy Bridge was a bigger improvement than that. Only 4 times faster in multi-core too.

    Glad to see my 4570S is still basically top of the line. Kinda hard to believe my 3 year old computer is still bleeding edge, but I guess that's how little room for improvement there is now that Moore's law is done.

    Guess if Windows 11 brings back normal functionality to the OS and removes "apps" entirely I'll have to upgrade to a DX12 capable card. But I honestly don't think that's gonna happen.

    I really have no idea what I'm gonna do OS wise. Like, I'm sure my computers won't hold up forever. But Windows 10 is unusable and Linux doesn't have proper support still.

    Computer industry, once a bastion of capitalism and free markets, rife with options and competition is now become truly monastic. Guess I'm just lamenting the old days, but at the same time I am truly wondering how I'll handle my computing needs in 5 years. Windows 10 is totally unacceptable.
  • Michael Bay - Thursday, July 28, 2016

    I like how desperate you anti-10 shills are getting.
    More!
  • Namisecond - Thursday, July 28, 2016

    I do not think that word means what you think it means...
  • TormDK - Thursday, July 28, 2016

    You are right - there is not going to be a Windows 11, and Microsoft is not moving away from "apps".

    So you seem stuck between a rock and a hard place if you don't want to go on Linux or a variant, and don't want to remain in the Microsoft ecosystem.
  • mkaibear - Thursday, July 28, 2016

    >Windows 10 is unusable

    Now, just because you're not capable of using it doesn't mean everyone else is incapable. There are a variety of remedial computer courses available, why not have a word with your local college?
  • AnnonymousCoward - Thursday, July 28, 2016

    4570S isn't basically top of the line. It and the i5 are 65W TDP. The latest 91W i7 is easily 33% faster. Just run the benchmark in CPU-Z to see how you compare.
  • BrokenCrayons - Thursday, July 28, 2016

    Linux Mint has been my primary OS since early 2013. I've been tinkering with various distros starting with Slackware in the late 1990s as an alternative to Windows. I'm not entirely sure what you mean by "doesn't have proper support" but I don't encourage people to make a full conversion to leave Windows behind just because the current user interface isn't familiar.

    There's a lot more you have to figure out when you switch from Windows to Linux than you'd need to learn if going from say Windows 7 to Windows 10 and the transition isn't easy. My suggestion is to purchase a second hand business class laptop like a Dell Latitude or HP Probook being careful to avoid AMD GPUs in doing so and try out a few different mainstream distros. Don't invest a lot of money into it and be prepared to sift through forums seeking out answers to questions you might have about how to make your daily chores work under a very different OS.

    Even now, I still keep Windows around for certain games I'm fond of but don't want to muck around with in Wine to make work. Steam's Linux-friendly list had gotten a lot longer in the past couple of years thanks to Valve pushing Linux for the Steam Box and I think by the time Windows 7 is no longer supported by Microsoft, I'll be perfectly happy leaving Windows completely behind.

    That said, 10 is a good OS at its core. The UI doesn't appeal to everyone and it most certainly is collecting and sending a lot of data about what you do back to Microsoft, but it does work well enough if your computing needs are in line with the average home user (web browsing, video streaming, gaming...those modest sorts of things). Linux can and does all those things, but differently using programs that are unfamiliar...oh and GIMP sucks compared to Photoshop. Just about every time I need to edit an image in Linux, I get this urge to succumb to the Get Windows 10 nagware and let Microsoft go full Big Brother on my computing....then I come to my senses.
  • Michael Bay - Thursday, July 28, 2016

    GIMP is not the only, ahem, "windows ecosystem alternative" that is a total piece of crap on loonixes. Anything outside of the browser window sucks, which tends to happen when your code maintainers are all dotheads and/or 14 years old.
  • Arnulf - Thursday, July 28, 2016

    I finally relegated my E6400-based system from its role as my primary computer and bought a new one (6700K, 950 Pro SSD, 32 GB RAM) a couple of weeks ago.

    While the new one is certainly faster at certain tasks the biggest advantage for me is significantly lower power consumption (30W idle, 90W under load versus 90W idle and 160-180W under load for the old one) and consequently less noise and less heat generation.

    Core2 has aged well for me, especially after I added a Samsung 830 to the system.
  • Demon-Xanth - Thursday, July 28, 2016

    I still run an i5-750, NVMe is pretty much the only reason I want to upgrade at all.
