...and Now We Understand Why: Hyper Threading

Years ago I asked Pat Gelsinger what he was most excited about in the microprocessor industry, he answered: threading. This was during the early days of Hyper Threading but since the demise of the Pentium 4, we never really got another look at HT on any of the subsequent microprocessors based on Core.

Hyper Threading was Intel’s marketing name for simultaneous multi-threading (SMT), the ability to fetch instructions from two threads at the same time instead of just a single one. The OS sees a HT enabled processor as multiple processors, in this case two, and send two threads of instructions to the CPU.

Nehalem sees the return of Hyper Threading and it is met with much, much better performance than we ever saw with Pentium 4 for a number of reasons:

1) Nehalem has much more memory bandwidth and larger caches than Pentium 4, giving it the ability to get data to the core faster and more predictably

2) Nehalem is a much wider architecture than Pentium 4, taking advantage of it demands the use of multiple threads per core.

Just as the first Pentium 4 didn’t ship with Hyper Threading support, the architectures that Nehalem was derived from didn’t have HT either. The time is right with Nehalem for the reasons mentioned above and because there are simply many more applications that can take advantage of HT today than ever before.

The chart below shows the performance gain from enabling HT on a Nehalem processor, no other variables were changed:

Just like on Atom, enabling HT on Nehalem required very little die area. Only the register state, renamed return stack buffer and large page instruction TLBs are duplicated. The rest of the data structures are either partitioned in half when HT is enabled or are “competitively” shared, in that they dynamically determine how much of each resource is allocated per thread:

Enabling HT was a very power efficient move for Nehalem, and you can expect that it will improve performance in a number of applications - far more regularly than it ever did with the Pentium 4.

Now it should also make sense why Intel increased buffer sizes in Nehalem, because many of those buffers now have to deal with ops coming from two threads instead of just one. That’s more instructions to decode, better utilization of the front end, more micro-ops in the pipeline, more micro-ops in the re-order engine and more micro-ops executing concurrently.

New TLBs, Faster Unaligned Cache Accesses Cache Hierarchy
POST A COMMENT

36 Comments

View All Comments

  • steveyballme - Monday, November 03, 2008 - link

    All you really need to know is that it was designed to run Vista only!

    http://fakesteveballmer.blogspot.com">http://fakesteveballmer.blogspot.com
    Reply
  • rflcptr - Wednesday, September 24, 2008 - link

    The origin of Turbo Mode isn't Penryn, but rather Intel's DAT first found in the Santa Rosa mobile chipset, of which both Merom and Penryn were compatible (and functioning with the tech activated). Reply
  • hoohoo - Wednesday, September 10, 2008 - link

    Outside of games the only area where raw performance matters to me is running high performance code - for me this is graphics processing and 3D rendering code, and it's just a hobby. I have some dealings with the HPC crowd though.

    In the HPC biz memory bandwidth is a big issue. AMD has won hands down on that metric against Intel until perhaps the past six months. Nehalem server chip looks like it will beat Opteron on this metric.

    Another important metric for HPC is linear algebra performance. GPUs are very good at linear algebra, but GPUs have strange programming requirements for the people who understand scientific programming - these people want to worry about the science or engineering and not about the specific cache architecture of an Nvidia or ATI GPU.

    Just because I could, last winter I wrote some 2D graphics processing routines for an 8800GT+CUDA+AthlonX2-5200: gaussian blur, sharp filter, the like. I achieved on the order of 20x speed improvement on the 8800GT vs the AthlonX2 all on Linux - but it was a moderately brutal programming experience and I doubt your average researcher will do it. And, well, the PCIe bandwidth bottleneck would be a problem for large scale batch processing for such a simple calculation.

    I don't know about ATI GPUs yet. I got a 3870 eight months ago and installed the AMD HPC GPU SDK ('nuff acromyms for ya?), but I can't face the pain of using it if after booting all the way into XP I could be fragging away in HL2 or Q4, or conquering the world in Civ2 Gold instead - and nobody really uses Windows servers for HPC clusters anyway. I think about write Brook+ code on my 3870 sometimes but honestly I don't care that much. It'll be similar performance to the 8800, and it'll be *Windows* code.

    If Intel can produce a chip that is somewhere between Larabee and Nehalem, matching memory bandwidth with an easily programmed but highly parallel chip then Intel will have an opportunity to define a new sub-market: HPC processors.

    It is indicative of the deficiencies of AMD marketing that it has a good GPGPU and the only way to program it is on the one OS that HPC shies away from: Windows. Clusters mostly run on Linux or UNIX.

    But, AMD is working on a CPU+GPU product that could compete in that market.

    Which of AMD or Intel will realize that there is money to be made with a chip that combines 2 or 4 CPU cores + 4 or 8 GPU style linear algebra cores, all with IEEE double precision ability?

    Whither the Cell?

    :-)
    Reply
  • hooflung - Monday, November 03, 2008 - link

    I think you are minimizing what an operating system is. While it is true that Linux, AIX and Solaris account for a large number of HPC and cluster environments that doesn't mean Windows is poor in this regard.

    There are solid options for windows HPC where infiniband is very, very solid. Microsoft helped define the spec. However, it isn't done often enough in the public's eye like Linux. Remember, AIX and Windows were the most solid platforms for J2EE for a long, long time.

    Also, Windows Clusters can be the better TCO solution for people. EVE-Online used Windows 2000 (now 2003 x86 and x64 ) and wrote their own load balancing software in Stackless ( and now have their own async IO stackless library ) which holds 33k at any given time of the day.

    You really just have to decide what market you are trying to reach when considering OS choices. They all can provide similar performance.


    Reply
  • Pixy - Friday, September 05, 2008 - link

    All this sounds nice... but I have a question: when will laptops become fanless? The CPU is fast enough, work on turning down the heat! Reply
  • Davinchy - Tuesday, August 26, 2008 - link

    I thought I read somewhere that if the other processor cores where not working then they shut down and the one that was working got more juice and overclocked.. So wouldn't that suggest that for the average consumer This chip will game much faster than a penryn?

    Dunno Maybe I read it wrong
    Reply
  • jediknight - Sunday, August 24, 2008 - link

    For desktop builders.
    My aging S754 Athlon64 is dying.. so it's time to start thinking of building a new one. My laptop can only hold me out for so long, though..

    Will I be able to buy a quad-core Nehalem processor in about the $250-300 range by the end of the year?
    Reply
  • UnlimitedInternets36 - Saturday, August 23, 2008 - link

    Core i7 wins big time for 3D rendering, modeling, and CAD programs.

    Turbo is the best feature. I hope, at least in the Extreme Edition we can set the Turbo headroom to like 5Ghz!!! and have a totally dynamic over-clock scaling FTW!

    Zbrush can utilize 256 processors, so I think a 2 socket Core i7 will help me out just fine. Sure is don't automatic boost FPS in games, but but that's partially a programming issue as well. Sooner or later the coding will catch up.
    Reply
  • munyaka - Friday, August 22, 2008 - link

    I have always stuck with amd but this is the final Nail in the coffin. Reply
  • X1REME - Friday, August 22, 2008 - link

    Is there anything you see that we don't, please explain why? Reply

Log in

Don't have an account? Sign up now