More Sockets, but Lower Performance?

When AMD briefed us on Quad FX, the performance focus was on heavy multitasking (AMD calls this "Megatasking") or heavily multi-threaded tests. We figured it was an innocent attempt to make sure we didn't run a bunch of single-threaded benchmarks on Quad FX and proclaim it a failure. Given that the vast majority of our CPU test suite is multi-threaded to begin with, we didn't think there would be any problems showcasing where four cores are better than two, much like we did in our Kentsfield review.

However, when running our SYSMark 2004SE tests we encountered a situation that didn't make total sense to us at first, and that somewhat explained AMD's desire for us to focus strongly on megatasking/multithreaded tests. If we pulled one of the CPUs out of the Quad FX system, we actually got higher performance in SYSMark than with both CPUs in place. In other words, four cores were slower than two.

CPU                 | SYSMark 2004SE | Internet Content Creation | Office Productivity
2 Sockets (4 cores) | 261            | 373                       | 182
1 Socket (2 cores)  | 288            | 393                       | 211

You'll see that in some of the individual tests there is an advantage to having both CPUs installed, but in the vast majority of them performance goes down with four cores. It turns out that there are two explanations for the anomaly.

CPU                 | Internet Content Creation | 3D Creation | 2D Creation | Web Publication
2 Sockets (4 cores) | 373                       | 245         | 514         | 411
1 Socket (2 cores)  | 393                       | 364         | 453         | 369

First, in the Internet Content Creation suite of SYSMark 2004SE, there appears to be an issue with having two physical CPUs in the system: the 3dsmax rendering test spawns only a single thread, which lowers performance below that of a normal dual-core processor. This problem may be caused by a licensing violation within the benchmark, where it expects to see one physical CPU with multiple cores and isn't prepared to deal with multiple CPUs. Regardless of the exact cause, it doesn't appear to be anything more than a benchmark issue. The performance in the Office Productivity suite is far more worrisome, because there is no benchmark issue causing the problem.

CPU                 | Office Productivity | Communication | Document Creation | Data Analysis
2 Sockets (4 cores) | 182                 | 171           | 259               | 137
1 Socket (2 cores)  | 211                 | 187           | 285               | 176

The Office Productivity suite of SYSMark 2004SE wasn't the only situation where we saw lower performance on Quad FX than with a single dual core setup. 3D games seemed to suffer the most; take a look at what happens in our Oblivion and Half Life 2: Episode One tests:

CPU                 | Oblivion - Bruma | Oblivion - Dungeon | Half Life 2: Episode One
2 Sockets (4 cores) | 67.3             | 78.3               | 155.8
1 Socket (2 cores)  | 75.2             | 90.9               | 165.7
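
As a rough illustration, the cost of populating the second socket can be tallied directly from the scores reported so far. The sketch below is written in Python purely for convenience; the `results` layout and the `relative_change` helper are names made up for this example, and only the numbers are taken from the tables. It treats every score as higher-is-better, which is how the results are discussed in the text.

```python
# Scores copied from the tables: (2 sockets / 4 cores, 1 socket / 2 cores).
# Higher is treated as better for every entry.
results = {
    "SYSMark 2004SE (overall)": (261, 288),
    "Internet Content Creation": (373, 393),
    "Office Productivity": (182, 211),
    "Oblivion - Bruma": (67.3, 75.2),
    "Oblivion - Dungeon": (78.3, 90.9),
    "Half Life 2: Episode One": (155.8, 165.7),
}


def relative_change(two_socket, one_socket):
    """Percent change when going from one socket to two (negative = slower)."""
    return (two_socket - one_socket) / one_socket * 100.0


# Hypothetical helper output: one percentage per benchmark.
changes = {name: relative_change(*scores) for name, scores in results.items()}

for name, pct in changes.items():
    print(f"{name:28s} {pct:+.1f}%")
```

Every entry comes out negative, which is simply another way of stating that the four-core configuration trails the two-core one in each of these tests.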

Once again, populate both sockets in the Quad FX system and performance goes down. The explanation for these anomalies lies in the result of one more benchmark, CPU-Z's memory latency test:

CPU                 | CPU-Z Latency (8192KB, 128-byte)
2 Sockets (4 cores) | 55.3 ns
1 Socket (2 cores)  | 43.3 ns

With both sockets populated, memory latency goes up by around 28% (from 43.3 ns to 55.3 ns), so in applications that are latency sensitive and don't necessarily need all four cores, you get worse performance than with a single dual-core CPU. The added latency comes from the additional coherency probing over the HT bus: whenever a memory request is made, the system has to check where the latest copy of the data resides.
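
The arithmetic behind that latency penalty is short enough to check by hand. A minimal sketch in Python, using only the two CPU-Z figures from the table; the variable names and the `percent_increase` helper are invented for this example:

```python
def percent_increase(baseline, value):
    """Percent by which `value` exceeds `baseline`."""
    return (value - baseline) / baseline * 100.0


# CPU-Z latency results (8192KB, 128-byte) from the table.
one_socket_ns = 43.3
two_socket_ns = 55.3

added_ns = two_socket_ns - one_socket_ns
increase = percent_increase(one_socket_ns, two_socket_ns)

print(f"added latency: {added_ns:.1f} ns ({increase:.1f}% higher)")
# prints: added latency: 12.0 ns (27.7% higher)
```

In other words, the second socket adds about 12 ns to every memory access measured by this test, regardless of whether the application can make use of the extra cores.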

It's a problem that will go away if you have a single quad-core CPU with one memory controller, but one that makes Quad FX a tougher pill to swallow compared to Intel's quad-core offerings.


88 Comments


  • Nighteye2 - Thursday, November 30, 2006 - link

    I'm interested in that as well. NUMA will be an important part of 4x4 performance - so why isn't NUMA used in the benchmark, or at least mentioned? NUMA is the advantage of having 2 sockets - having NUMA disabled in this benchmark by using an OS that does not support it unfairly cripples the 4x4 performance.
  • Viditor - Thursday, November 30, 2006 - link

    quote:

    NUMA will be an important part of 4x4 performance - so why isn't NUMA used in the benchmark, or at least mentioned

    Agreed...I think that one of the reasons that AMD delayed release of this so long is that they wanted to show it on Vista instead of WinXP. It seems to me that there would be a substantial difference between the 2...
  • Viditor - Thursday, November 30, 2006 - link

    As a follow up on just how important NUMA is for 4x4, check out this review (http://babelfish.altavista.com/babelfish/trurl_pag...) which actually compares the 2...
    There is a DRASTIC difference between performance on XP and Vista!
  • Accord99 - Friday, December 1, 2006 - link

    Most of the difference is running in 64-bit mode. The extra bandwidth didn't help the FX-74 in the megatasking bench. They didn't do any game benchmarks but based on past reviews of NUMA, the FX-74 will probably keep on losing to the FX-62 in games.
  • Viditor - Friday, December 1, 2006 - link

    quote:

    Most of the difference is running in 64-bit mode

    I'm not sure I agree...there's a 22.5% increase in performance there, and I haven't seen anything like that on the 64 bit version of 3DS Max before...
    Not to mention that Vista isn't known as a real speed demon (quite the opposite) for these apps...
    What the 64bit version does is allow for larger scene use and stability, not so much faster rendering.
  • photoguy99 - Friday, December 1, 2006 - link

    quote:

    I'm not sure I agree...there's a 22.5% increase in performance there, and I haven't seen anything like that on the 64 bit version of 3DS Max before...


    Sorry totally wrong -

    64-bit can make a big difference in performance depending on the app. Remember you can process 64 bits of data in a typical instruction instead of 32, so theoretically twice as much pixel data at a time for rendering.

    Some apps may not show the full benefit; it depends on how they are coded and compiled, but it's definitely a real potential for speedup.

    Bottom line is 64-bit could easily account for a bigger performance increase than NUMA.
  • Kiijibari - Friday, December 1, 2006 - link

    quote:

    64-bit can make a big difference in performance depending on the app. Remember you can process 64 bits of data in a typical instruction instead of 32, so theoretically twice as much pixel data at a time for rendering.


    quote:

    I'm not sure I agree...there's a 22.5% increase in performance there, and I haven't seen anything like that on the 64 bit version of 3DS Max before...


    You see that he refers already to 3DS MAX .. I have not investigated this, but if he refers to it, then I trust him on that one ...

    Furthermore I miss synthetic Sandra Mem bandwidth benches .. these should easily show what is going on there ...

    Anyways a 4x4 review without mentioning the XP - NUMA problem is just not worth reading ... Sorry Anand ...

    cheers

    Kiijibari
  • Anand Lal Shimpi - Friday, December 1, 2006 - link

    The performance deficit seen when running latency sensitive single and dual threaded applications exists even in a NUMA-aware OS (I've confirmed this under Vista). I'm still running tests under Vista but as far as I see, running in a NUMA-aware OS doesn't seem to change the performance picture at all.

    Take care,
    Anand
  • Kiijibari - Saturday, December 2, 2006 - link

    Hi Anand,

    first of all, thanks for your reply.

    Then, if there is really no performance difference, then I would double check the BIOS, if you have really disabled node interleave.

    Furthermore there seems to be a BIOS bug, with the SRAT ACPI tables, which are necessary for NUMA. It would be nice, if you can dig up some more information about that topic.

    Clearly, that would be not your fault, but AMD's.

    cheers

    Kiijibari
  • Anand Lal Shimpi - Saturday, December 2, 2006 - link

    From what I can tell the Node Interleave option in the BIOS is doing something. Disabling it (enabling NUMA) results in lower latencies than leaving it enabled, but still not as low as running with a single socket.

    CPU-Z offers the following latencies for the three configurations:

    2S, NUMA On: 168 cycles
    2S, NUMA Off: 205 cycles
    1S: 131 cycles

    From my discussions with AMD last week, this behavior is expected. I will do some more digging to see if there's anything else I'm missing though.

    Take care,
    Anand
