Core: Load Me Up

When discussing the size of the reorder buffer, I mentioned that for ops that rely on the data of others, the order in which they are processed has to remain consistent: the load for the second op has to follow the store from the first in order for the calculation to be correct. This works for data that is read from and written to the same location in the same data stream; however, for other operations, the memory addresses for loads and stores are not known until they pass the address generation units (AGUs).

This makes reordering a problem at a high level. You ultimately do not want a memory location to be written to by two different operations at the same time, or the same memory address to be used by different ops while one of those ops is sitting in the reorder queue. When a load micro-op enters the buffer, the memory addresses of previous stores are not known until those stores pass the AGUs. Note that this applies to memory addresses in the caches as well as in main memory. However, if one can speed up loads and load latency in the buffers, it typically has a positive impact in most software scenarios.
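To make the problem concrete, here is a minimal C sketch (the function and variable names are purely illustrative, not from the article). Whether the load of *src depends on the preceding store to *dst is only knowable once both addresses have been computed, which on the hardware side means after the ops have passed the AGUs.

```c
/* Minimal illustrative sketch (hypothetical names): neither the compiler nor
 * the CPU front-end can tell whether 'dst' and 'src' point to the same
 * location, so the load in the second statement may depend on the store in
 * the first. The addresses are only resolved at the AGUs. */
#include <stdio.h>

void scale(int *dst, const int *src)
{
    *dst = 10;            /* store: address of *dst resolved at the AGU        */
    int x = *src * 2;     /* load:  may alias *dst - can it go ahead early?    */
    printf("%d\n", x);
}

int main(void)
{
    int a = 3;
    scale(&a, &a);        /* aliasing case: the load must see the new value    */
    int b = 3, c = 5;
    scale(&b, &c);        /* non-aliasing case: the load could safely go early */
    return 0;
}
```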

With Core, Intel introduced ‘New Memory Disambiguation’. For lack of a better analogy, this gives the issue of loads preceding stores a ‘do it and we’ll clean up after’ approach. Intel stated at the time that the risk of a load reading a value from an address that is still being written to by an unfinished store is pretty small (1-2%), and the chance decreases with larger caches. Allowing loads to go ahead of stores therefore gives a speedup, but there has to be a safety net for when it goes wrong. To that end, a dynamic alias predictor tries to spot the issue; if a conflict does occur, the load has to be repeated, with a penalty of about 20 cycles.
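A back-of-the-envelope sketch, using only the figures quoted above (a 1-2% conflict rate and a roughly 20-cycle replay penalty, both taken as assumptions here), shows why the gamble pays off on average:

```c
/* Expected per-load cost of misspeculation, using the figures quoted in the
 * text as assumptions: 1-2% of hoisted loads conflict, ~20 cycles per replay. */
#include <stdio.h>

int main(void)
{
    const double conflict_rate_low  = 0.01;   /* 1% of hoisted loads replayed */
    const double conflict_rate_high = 0.02;   /* 2% of hoisted loads replayed */
    const double replay_penalty     = 20.0;   /* cycles lost per replay       */

    /* Average cost, in cycles, added to each speculatively hoisted load. */
    printf("expected penalty: %.1f - %.1f cycles per load\n",
           conflict_rate_low * replay_penalty,
           conflict_rate_high * replay_penalty);
    return 0;
}
```

An average cost on the order of 0.2-0.4 cycles per hoisted load is small compared to the latency avoided by no longer stalling every load behind unresolved stores.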

Unofficial AnandTech Diagram

The predictor gives permission for a load to move ahead of a store, and after execution the conflict logic scans the Memory reOrder Buffer (MOB) to detect an issue. If one is found, the load is reprocessed back up the chain. In the worst-case scenario this might reduce performance, but as Johan said back in 2006: ‘realistically it is four steps forward and one step back, resulting in a net performance boost’.
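The flow can be sketched in a few lines of C. This is a toy model of the check described above, not Intel’s actual MOB logic, and all names, sizes, and addresses are assumptions for illustration only.

```c
/* Toy model of the speculate/check/replay idea (not Intel's real logic):
 * a load is allowed to execute ahead of older stores, and once those store
 * addresses resolve, a small buffer of in-flight stores is scanned for an
 * overlap. On a hit, the load's result is discarded and the load replays. */
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

#define MOB_STORES 4

struct pending_store {
    bool     address_known;   /* has the store passed the AGU yet? */
    uint64_t address;         /* resolved byte address             */
};

/* Returns true if the hoisted load must be replayed. */
static bool load_conflicts(const struct pending_store *stores, int n,
                           uint64_t load_address)
{
    for (int i = 0; i < n; i++) {
        /* Replay when a now-resolved older store turns out to hit the
         * same address the speculatively executed load already used.  */
        if (stores[i].address_known && stores[i].address == load_address)
            return true;
    }
    return false;
}

int main(void)
{
    struct pending_store mob[MOB_STORES] = {
        { true, 0x1000 }, { true, 0x2040 }, { false, 0 }, { true, 0x1000 }
    };

    uint64_t load_addr = 0x2040;
    if (load_conflicts(mob, MOB_STORES, load_addr))
        printf("conflict: replay the load (roughly a 20-cycle penalty)\n");
    else
        printf("no conflict: the speculation paid off\n");
    return 0;
}
```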

Using this memory disambiguation technique, Intel reported a 40% performance boost in a number of synthetic workloads (or 10-20% in the real world) purely from allowing loads to be more flexible, along with L1 and L2 performance boosts. It is worth noting that this feature affects INT workloads more often than FP workloads, purely on the basis that FP workloads tend to be more ordered by default. This is why AMD’s K8 lost ground to Intel on INT workloads, despite having a lower-latency memory system and more INT resources, but stayed on track in FP.

Core: No Hyper-Threading, No Integrated Memory Controller

In 2016, HT and an integrated memory controller (IMC) are fundamental parts of the x86 processors we can buy. It seems crazy that one of the most significant upticks in x86 performance in the last decade lacked both of these features, but at the time Intel gave reasons for omitting each of them.

Simultaneous multithreading (SMT), which Intel brands as Hyper-Threading, is the act of having two threads funnel data through a single core. It requires large buffers to cope with the potential doubling of data in flight, and arguably halves the cache resources available to each thread, producing more cache pressure. However, Intel gave different reasons at the time: while SMT gave a 40% performance boost, Intel only saw it as a positive in server applications. Intel also said that SMT makes hotspots even hotter, meaning that consumer devices would become power hungry and hot without any reasonable performance improvement.

On the IMC, Intel stated at the time that it had two options: an IMC, or a larger L2 cache. Which one would have been better is a matter for debate, but Intel in the end went with a 4 MB L2 cache. Such a cache uses less power than an IMC, and leaving the memory controller on the chipset allows for support of a wider range of memory types (in this case DDR2 for consumers and FB-DIMM for servers). However, an on-die IMC improves memory latency significantly, and Intel stated that techniques such as memory disambiguation and improved prefetch logic can soak up this disparity.

As we now know, on-die IMCs are the big thing.

158 Comments

  • perone - Friday, July 29, 2016 - link

    My E6300 is still running fine in a PC I have donated to a friend.
    It was set to 3GHz within a few days from purchase and never moved from that speed.
    Once or twice I changed the CPU fan as it was getting noisy.

    Great CPU and great motherboard the Asus P5B
  • chrizx74 - Saturday, July 30, 2016 - link

    These PCs are still perfectly fine if you install an SSD. I did it recently on an Acer Aspire t671 desktop. After modding the BIOS to enable AHCI I put in an 850 Evo (runs at SATA 2 speed) and a pretty basic Nvidia GFX card. The system turned super fast and runs Windows 10 perfectly fine. You don't need faster processors; all you need is to get rid of the HDDs.
  • Anato - Saturday, July 30, 2016 - link

    I'm still running an AMD Athlon X2 4850 2.5GHz as a file server + MythTV box. It supports ECC, is stable, and has enough grunt to do its job, so why replace it? Yes, I could get a bit of energy efficiency, but in my climate heating is needed >50% of the time, and new hardware has its risks of compatibility issues etc.

    +10 for anandtech again, article was great as always!
  • serendip - Sunday, July 31, 2016 - link

    I'm posting this on a Macbook with an E6600 2.4 GHz part. It's still rockin' after six years of constantly being tossed into a backpack. The comparisons between C2D and the latest i5 CPUs don't show how good these old CPUs really are - they're slow for hard number crunching and video encoding but they're plenty fast for typical workday tasks like Web browsing and even running server VMs. With a fast SSD and lots of RAM, processor performance ends up being less important.

    That's too bad for Intel and computer manufacturers because people see no need to upgrade. A 50% performance boost may look like a lot on synthetic benchmarks but it's meaningless in the real world.
  • artifex - Monday, August 1, 2016 - link

    "With a fast SSD and lots of RAM, processor performance ends up being less important."

    I remember back when I could take on Icecrown raids in WoW with my T7200-based Macbook.
    And I actually just stopped using my T7500-based Macbook a few months ago. For a couple years I thought about seeing if an SSD would perk it back up, but decided the memory bandwidth and size limitation, and graphics, was just not worth the effort. Funny that you're not impressed by i5s; I use a laptop with an i5-6200U, now. (Some good deals with those right now, especially if you can put up with the integrated graphics instead of a discrete GPU.) But then, my Macbooks were about 3 years older than yours :)
  • abufrejoval - Sunday, July 31, 2016 - link

    Replaced three Q6600 on P45 systems with socket converted Xeon X5492 at $60 off eBay each. Got 3.4GHz Quads now never using more than 60 Watts under Prime95 (150 Watts "official" TDP), with 7870/7950 Radeon or GTX 780 running all modern games at 1080p at high or ultra. Doom with Vulkan is quite fun at Ultra. Got my kids happy and bought myself a 980 ti off the savings. If you can live with 8GB (DDR2) or 16GB (DDR3), it's really hard to justify an upgrade from this 10 year old stuff.

    Mobile is a different story, of course.
  • seerak - Monday, August 1, 2016 - link

    My old Q6600 is still working with a friend.

    The laugher is that he (used to) work for Intel, and 6 months after I gave it to him in lieu of some owed cash, he bought a 4790K through the employee program (which isn't nearly as good as you'd think) and built a new system with it.

    The Q6600 works so well he's never gotten around to migrating to the new box - so the 4790k is still sitting unused! I'm thinking of buying it off him. I do 3D rendering and can use the extra render node.
  • jeffry - Monday, August 1, 2016 - link

    That's a good point. Like, answering the question "are you willing to pay $800 for a new CPU to double the computer's speed?" Most consumers say no. It all comes down to the mass market price.
  • wumpus - Thursday, August 4, 2016 - link

    Look up what Amazon (and anybody else buying a server) pays for the rest of the computer and tell me they won't pay $800 (per core) to double the computer's speed. It isn't a question of cost, Intel just can't do it (and nobody else can make a computer as fast as Intel, although IBM seems to be getting close, and AMD might get back in the "almost as good for cheap" game).
  • nhjay - Monday, August 1, 2016 - link

    The Core 2 architecture has served me well. Just last year I replaced my server at home which was based on a Core 2 Duo E6600 on a 965 chipset based motherboard. The only reason for the upgrade is that the CPU was having a difficult time handling transcoding jobs to several Plex clients at once.

    The desktop PC my kids use is Core 2 based, though slightly newer. It's a Core 2 Quad Q9400-based machine. It is the family "gaming" PC, if you dare call it that. With a GT 730 in it, it runs the older games my kids play very well and Windows 10 hums along just fine.
