Dual Core and Hyper Threading: Detriment or Not?

A question that we've always had is whether or not the inclusion of Hyper Threading support on Intel's dual-core Extreme Edition processors actually improves performance.  To answer that question, we have to look at two separate situations: multithreaded application performance and multitasking performance. 

For multithreaded application performance, we can now turn to a number of benchmarks.  We'll start off with 3dsmax 7 (higher numbers are better for the composite score, lower numbers are better for the rest of the numbers):

3dsmax 7      Composite Score   3dsmax 5 rays   CBALLS2    SinglePipe2   UnderWater
HT Enabled    3.0               12.922s         17.297s    83.515s       119.641s
HT Disabled   2.51              14.937s         21.141s    102.734s      141.641s

Here, the performance advantage is clear - enabling Hyper Threading provides Intel with another 14-19% over the base dual core Presler.  The same applies to almost all of the media encoding tests (if minutes or seconds are specified, lower numbers mean better performance):

Media Encoding   DVD Shrink   WME9      H.264   iTunes
HT Enabled       7.1m         46.5fps   9.96m   38s
HT Disabled      8.0m         38.6fps   8.53m   40s

Our Quicktime 7 H.264 encoding test is, generally speaking, an outlier from what we've seen of the impact of HT on multithreaded applications.  The rest of the applications show a clear benefit to being able to execute four threads simultaneously, even if the execution resources of the cores are shared with the remaining two threads. 
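The relative gains in the two tables above can be recomputed with a few lines of Python. This is a minimal sketch, not the method used for the article: the helper name `ht_gain_percent` is our own, and it assumes that a gain on a timed test means the reduction in time relative to the HT-disabled run, while a gain on a score or fps test means the increase relative to the HT-disabled run.

```python
# Results copied from the 3dsmax 7 and media encoding tables above.
# Each entry: (HT enabled, HT disabled, True if a higher number is better).
RESULTS = {
    "3dsmax composite": (3.0, 2.51, True),
    "3dsmax 5 rays (s)": (12.922, 14.937, False),
    "CBALLS2 (s)": (17.297, 21.141, False),
    "SinglePipe2 (s)": (83.515, 102.734, False),
    "UnderWater (s)": (119.641, 141.641, False),
    "DVD Shrink (min)": (7.1, 8.0, False),
    "WME9 (fps)": (46.5, 38.6, True),
    "H.264 (min)": (9.96, 8.53, False),
    "iTunes (s)": (38.0, 40.0, False),
}


def ht_gain_percent(ht_on, ht_off, higher_is_better):
    """Percent improvement from enabling HT; negative means HT was slower.

    Assumed definition (not stated in the article): change relative to
    the HT-disabled result.
    """
    if higher_is_better:
        return (ht_on - ht_off) / ht_off * 100.0
    # Timed test: report the reduction in time relative to HT disabled.
    return (ht_off - ht_on) / ht_off * 100.0


if __name__ == "__main__":
    for name, (ht_on, ht_off, higher) in RESULTS.items():
        print(f"{name:20s} {ht_gain_percent(ht_on, ht_off, higher):+6.1f}%")
```

Under that definition, the H.264 row is the only one that comes out negative, which matches its description as the outlier.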

Armed with the latest SMP patches for Call of Duty 2 and Quake 4 (SMP was enabled in both games), we can also take a look at the impact of HT on Presler:

Gaming        Call of Duty 2   Quake 4
HT Enabled    68.4             142.3
HT Disabled   69.3             142.3

Call of Duty 2 is another example where HT actually reduces performance, but given that enabling SMP itself reduces performance, we'd venture a guess that you shouldn't really be drawing any conclusions based on its data.  Quake 4, on the other hand, shows no difference in performance with HT on or off. 

From what we've seen, with most individual multithreaded applications, enabling HT will improve performance even if you have a dual core processor.  The degree of performance improvement will vary from application to application, but generally speaking, it's going to be positive (if anything at all). 

The more interesting situation is what happens when you're multitasking - does Hyper Threading really help on top of the inherent benefits of a dual core processor?  To find out, we put together a couple of multitasking scenarios aided by a tool that Intel provided us to help all of the applications start at the exact same time.  We're not necessarily concerned with the actual performance of these applications, but rather with the impact that the number of simultaneous applications has on each other and how that varies with HT being enabled or not. 

We took five applications (Grisoft AVG Anti-Virus 7, Lame MP3 Encoder 3.97a, Windows Media Encoder 9, Info-ZIP extraction utility and Splinter Cell: Chaos Theory) and used various combinations of them to try to figure out if there are multitasking benefits to a dual core processor with Hyper Threading enabled.  Note that some of these applications are multithreaded themselves, so just because we chose five applications doesn't mean that there are only five threads of execution; in reality, there are many more. 
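The Intel tool itself isn't described here, so the following is only a hypothetical sketch of the general idea: hold every task at a barrier, release them all at once, and record both the time that each task takes and the overall wall clock time. The function name `run_together` and the use of Python callables as stand-ins for real applications are our own assumptions.

```python
import threading
import time


def run_together(tasks):
    """Start every task at the same moment.

    `tasks` maps a label to a zero-argument callable (a stand-in for an
    application). Returns (per_task_seconds, wall_clock_seconds).
    """
    # One extra party for the coordinating thread that releases the workers.
    barrier = threading.Barrier(len(tasks) + 1)
    elapsed = {}

    def worker(name, fn):
        barrier.wait()                      # block until everyone is ready
        start = time.perf_counter()
        fn()
        elapsed[name] = time.perf_counter() - start

    threads = [threading.Thread(target=worker, args=(name, fn))
               for name, fn in tasks.items()]
    for t in threads:
        t.start()

    wall_start = time.perf_counter()
    barrier.wait()                          # release all workers at once
    for t in threads:
        t.join()
    wall = time.perf_counter() - wall_start
    return elapsed, wall


if __name__ == "__main__":
    # Sleeps stand in for real workloads in this demonstration.
    demo = {"short": lambda: time.sleep(0.05), "long": lambda: time.sleep(0.10)}
    per_task, wall = run_together(demo)
    print("per task:", per_task)
    print("sum of task times:", sum(per_task.values()), "wall clock:", wall)
```

Python threads are used only to keep the sketch short; a real harness would more likely launch each application as its own process (for example with `subprocess.Popen`) so that the workloads can actually run in parallel on separate cores.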

We tested four different scenarios:
  1. A virus scan + MP3 encode
  2. The first scenario + a Windows Media encode
  3. The second scenario + unzipping files, and
  4. The third scenario + our Splinter Cell: CT benchmark.
The tables below compare the total time in seconds for all of the timed tasks (everything but Splinter Cell) to complete during the tests:

AMD Athlon 64 X2 4800+          AVG     LAME    WME     ZIP     Total
AVG + LAME                      22.9s   13.8s   -       -       36.7s
AVG + LAME + WME                35.5s   24.9s   29.5s   -       90.0s
AVG + LAME + WME + ZIP          41.6s   38.2s   40.9s   56.6s   177.3s
AVG + LAME + WME + ZIP + SCCT   42.8s   42.2s   46.6s   65.9s   197.5s

Intel Pentium EE 955 (no HT)    AVG     LAME    WME     ZIP     Total
AVG + LAME                      24.8s   13.7s   -       -       38.5s
AVG + LAME + WME                39.2s   22.5s   32.0s   -       93.7s
AVG + LAME + WME + ZIP          47.1s   37.3s   45.0s   62.0s   191.4s
AVG + LAME + WME + ZIP + SCCT   40.3s   47.7s   58.6s   83.3s   229.9s

Intel Pentium EE 955 (HT Enabled)   AVG     LAME    WME     ZIP     Total
AVG + LAME                          25.0s   13.3s   -       -       38.3s
AVG + LAME + WME                    34.4s   21.6s   30.2s   -       86.2s
AVG + LAME + WME + ZIP              41.5s   28.1s   37.7s   54.2s   161.5s
AVG + LAME + WME + ZIP + SCCT       51.4s   33.0s   45.3s   71.1s   200.8s

As you can see, once you get beyond two simultaneous applications, the Presler setup with HT enabled takes less time to complete the tasks than the Presler system without HT enabled.  However, including the Athlon 64 X2 4800+ in the picture, we see that despite only being able to execute two threads at the same time, it does just as good a job as the Presler HT system that can execute twice as many threads.  But to get the full picture, we have to measure one last data point: Splinter Cell performance. 
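The Total column in the tables above is the sum of the individual task times, which is not the same thing as the time at which the last task finished. As a rough sketch, the snippet below recomputes the totals for the fourth scenario and also reports the longest single task. The helper name `summarize` is our own, and reading the longest task as the finish time for the whole bundle assumes that every figure is elapsed time measured from a shared start.

```python
# Per-task times in seconds for the fourth scenario (AVG, LAME, WME, ZIP),
# copied from the tables above.
SCENARIO_4 = {
    "Athlon 64 X2 4800+": [42.8, 42.2, 46.6, 65.9],
    "Pentium EE 955 (no HT)": [40.3, 47.7, 58.6, 83.3],
    "Pentium EE 955 (HT)": [51.4, 33.0, 45.3, 71.1],
}


def summarize(times):
    """Return (sum of task times, longest single task), rounded to 0.1 s."""
    return round(sum(times), 1), round(max(times), 1)


if __name__ == "__main__":
    for system, times in SCENARIO_4.items():
        total, longest = summarize(times)
        print(f"{system}: sum of task times {total}s, longest task {longest}s")
```

The two figures answer different questions: the sum weighs how much every application slowed down, while the longest task only reflects the slowest one.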

In the fourth scenario, we ran a total of five applications: AVG, Lame, WME, Info-ZIP and Splinter Cell.  The first four applications took a total of 197.5 seconds to complete on the Athlon 64 X2 4800+ system, ever so slightly quicker than the 200.8 seconds of the Presler HT system.  However, that does not take into account Splinter Cell performance - now let's see how our fifth application fared:

Splinter Cell: CT                   Average    Min        Max
Intel Pentium EE 955 (no HT)        71.0 fps   27.8 fps   128.1 fps
Intel Pentium EE 955 (HT enabled)   77.2 fps   32.5 fps   139.6 fps
AMD Athlon 64 X2 4800+              66.9 fps   10.5 fps   185.0 fps

The Athlon 64 X2 4800+ actually is faster in the Splinter Cell: CT benchmark without anything else running, but here we see a very different story.  Although its 66.9 fps average frame rate is reasonably competitive with the Presler HT system, its minimum frame rate is barely over 10 fps - approximately 1/3 that of the Presler HT. 

While the regular Presler setup without HT managed to pull in higher frame rates than the AMD system, it did so while performing significantly worse in the remaining four applications.  The Presler HT vs. Athlon 64 X2 comparison is important because the two are virtually tied in the performance of the first four applications - but juggling all five of the applications is better done on the Presler HT system. 

We would say that if implemented properly, the benefits of an SMT system like Hyper Threading are definitely a good companion to a dual core desktop processor.  The usable limit, even for today's applications and usage models, is far from just two threads.  


84 Comments


  • Betwon - Saturday, December 31, 2005 - link

    NO, 2. is wrong.

    We need to know the end time of all tasks.

    The sum of each task's time will mislead.

    Because it cannot show the real time spent to complete those tasks. (The times overlap.)
  • Viditor - Saturday, December 31, 2005 - link

    quote:

    The sum of each task's time will mislead

    That's what I thought you meant...it's not misleading to me (nor to most of the other readers I gather, since nobody else has come forward). If you want to know the time to complete all tasks, then just take the largest time number of whatever test you wish.
    The reason that the setup they used appeals to me is that it helps me understand how an individual application is affected under those conditions, and the totals give me a relative picture of each of the apps as a whole. They haven't said that the time listed in the "Total" is actually how long things took in reality, they said it was the total of the times.
    I understand that the difference in those two phrases is perhaps a difficulty that many have when understanding a foreign language...

    In the future, you might want to be less confrontational about your questions...
    Phrases like "There are still many knowledge about CPU that anandtech need to learn" are considered quite inhospitable...
  • fitten - Saturday, December 31, 2005 - link

    No. What is being mentioned here is "Wall Clock Time" vs. summation of execution times. You start a stopwatch at the instant you start your task bundle and when the last task in the bundle is finished, you stop your stopwatch. That's the wall clock time. Measuring CPU utilization time is quite easily seen to be false. With two CPUs, two tasks may take 20s each to finish, but they may start and finish at the same time after 20s of wall clock time... not 20s + 20s = 40s (each task will see 20s of CPU utilization time, but those sets of 20s are simultaneously used... 20s on one CPU and 20s on the other CPU at the same time - for a wall clock finish time of 20s, not 40s).

    And, you cannot simply take the largest time number. For example, suppose a task that runs for 1s is blocked by a second task which takes 10s, then the first task takes another 1s to finish. While 10s is larger than 2s, the wall clock time for this bundle is actually 12s (1s + 10s + 1s), not 10s or 2s.
  • bldckstark - Monday, January 2, 2006 - link

    Ummmmm, all of the times you are screaming about are listed. You can work it out for yourself. Although, when you look at the concurrent timing for each app, you will find that the AMD posted a better score. Concurrent timing results -
    AMD 4800+ - 65.9s
    955EE No HT - 83.3s
    955EE With HT - 71.1s

    Consecutive times of course show a different picture, and most of all, SCCT is a wreck during all of this for AMD.

    I have to say, I can't remember when I last opened 4 huge memory and CPU hogging programs at exactly the same time that I tried to play a game. These CPUs may be great at doing this many activities at once, but I can only do one thing at a time. Each of these programs would be started separately, and when they are on their way, I might start gaming. This is a great test, but not realistic.

  • Betwon - Friday, December 30, 2005 - link

    Your test of the SMP game --Quake4
    Your result is different from the result of the more detailed test from FiringSquad.
    http://www.firingsquad.com/hardware/qua...-core_pe...

    We find that both HT and multi-core will improve the fps. P4 540 HT is about 1x % improvement.

    We need your explanation. Why do you say that HT will not help in the SMP game, Quake 4?

    And we do not find that the Athlon X2 shows a better improvement than the PD when they change from single-core work to multi-core work.

    Where are the benefits of on-die communication? 101ns latency? Why is it slower than the latency of the memory? Is your cache2cache test software wrong?

    The test shows that
    SMPon/SMPoff PD840 102.9 fps/74.8 fps --> 37.6% improvement
    SMPon/SMPoff X2 3800+ 101.1 fps/74.4 fps --> 35.9% improvement
    SMPon/SMPoff X2 4800+ 103.2 fps/87.7 fps --> 17.7% improvement
    AMD test:
    http://www.firingsquad.com/hardware/qua...al-core_...
    Intel test:
    http://www.firingsquad.com/hardware/qua...-core_pe...

    The improvement ratio of PD is better than that of athlonX2.
  • psychobriggsy - Saturday, December 31, 2005 - link

    > SMPon/SMPoff PD840 102.9 fps/74.8 fps --> 37.6% improvement
    > SMPon/SMPoff X2 3800+ 101.1 fps/74.4 fps --> 35.9% improvement
    > SMPon/SMPoff X2 4800+ 103.2 fps/87.7 fps --> 17.7% improvement

    Looks like the issue is an upper performance limit around the 103 fps mark that probably isn't caused by the CPU - e.g., GPU or something else.

    If it is a memory bandwidth issue (which should be easy to test for by using faster memory and running the tests again) then there isn't much that can be done. Then again, the Intel processor uses DDR2 so ...

    If the 4800+ improved by 36% like the 3800+ then it would achieve around 120fps.

    In the end it just shows that the lower-priced dual-cores are still a better deal ... especially as they can be overclocked quite nicely.
  • Viditor - Friday, December 30, 2005 - link

    quote:

    The improvement ratio of PD is better than that of athlonX2.

    I would hope so, since the patch was partially written by Intel...
    quote:

    the 1.0.5 patch mentions Intel by name as a collaborator with no word on AMD...While it isn’t optimized for AMD64, frame rates on a dual-core Athlon 64 X2 3800+ are 63 percent faster at 800x600 with threading enabled. The 4800+ also feeds back good gains

    http://firingsquad.com/hardware/quake_4_dual-core_...
  • Betwon - Saturday, December 31, 2005 - link

    PD840 139.1fps/83fps --> 67.6%
    PD840 is 67.6 percent faster at 800x600 with threading enabled.

    67.6% > 63%

    Patch was partially written by Intel...?
    But the patch is very excellent!

    This patch gives the biggest improvement of any game patch for SMP CPUs.
    We cannot find another SMP game patch that improves game performance so much.

    Good quality code!
  • Viditor - Saturday, December 31, 2005 - link

    quote:

    But the patch is very excellent!

    Possibly, but Intel is well known for creating an imbalance in performance for their processors using software (e.g. the Intel Compiler). Most likely, future versions of the patch will correct for this. Either way, it really says less about the CPU than it does the patch...
