Faster Unaligned Cache Accesses & 3D Rendering Performance

3dsmax r9

Our benchmark, as always, is the SPECapc 3dsmax 8 test, but for the purpose of this article we only run the CPU rendering tests and not the GPU tests.

3dsmax 9

The results are reported as render times in seconds and the final CPU composite score is a weighted geometric mean of all of the test scores.

3dsmax Score Breakdown (render times in seconds)

CPU | Radiosity | Throne Shadowmap | CBALLS2 | SinglePipe2 | Underwater | SpaceFlyby | UnderwaterEscape
Nehalem (2.66GHz) | 12.891s | 11.193s | 5.729s | 20.771s | 24.112s | 30.66s | 27.357s
Penryn (2.66GHz) | 19.652s | 14.186s | 13.547s | 30.249s | 32.451s | 33.511s | 31.883s
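To make the per-test gains easier to compare, here is a minimal Python sketch that turns the render times above into Penryn-to-Nehalem speedup ratios. The article does not publish the weights SPECapc uses for its composite, so the geometric mean below is unweighted; treat it as an approximation of the composite under that assumption, not a reproduction of it. The function and variable names are illustrative only.

```python
import math

# Render times in seconds, copied from the table above (lower is better).
TESTS = ["Radiosity", "Throne Shadowmap", "CBALLS2", "SinglePipe2",
         "Underwater", "SpaceFlyby", "UnderwaterEscape"]
NEHALEM = [12.891, 11.193, 5.729, 20.771, 24.112, 30.66, 27.357]
PENRYN = [19.652, 14.186, 13.547, 30.249, 32.451, 33.511, 31.883]


def speedups(baseline, candidate):
    """Return baseline_time / candidate_time per test (>1 means candidate is faster)."""
    return [b / c for b, c in zip(baseline, candidate)]


def geometric_mean(values):
    """Unweighted geometric mean (assumption: the real composite weights are unknown)."""
    return math.exp(sum(math.log(v) for v in values) / len(values))


per_test = dict(zip(TESTS, speedups(PENRYN, NEHALEM)))
overall = geometric_mean(list(per_test.values()))

for name, ratio in per_test.items():
    print(f"{name:18s} {ratio:.2f}x")
print(f"{'Geometric mean':18s} {overall:.2f}x")
```

With equal weights the mean comes out at roughly 1.4x, which is in the same range as the composite result, and CBALLS2 stands out as the only test where Nehalem is more than twice as fast.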


The CBALLS2 workload is where we see the biggest speedup with Nehalem: performance more than doubles. It turns out that CBALLS2 calls a function in the Microsoft C Runtime Library (msvcrt.dll) that can magnify the Core architecture's performance penalty when accessing data that is not aligned with cache line boundaries. Through some circuit tricks, Nehalem now has significantly lower latency for unaligned cache accesses, and thus we see a huge improvement in the CBALLS2 score here. CBALLS2 is the only workload within our SPECapc 3dsmax test that really stresses the unaligned cache access penalty of the current Core architecture, but there's a pretty strong performance improvement across the board in 3dsmax.
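For readers who want a concrete picture of a cache-line-crossing access, the following Python sketch checks whether a load of a given size at a given address straddles a line boundary. It assumes a 64-byte cache line, and the helper name `crosses_cache_line` is made up for illustration. It only models the address arithmetic; it says nothing about the actual latency penalty on either architecture, or about which msvcrt.dll function is involved.

```python
CACHE_LINE_BYTES = 64  # assumption: 64-byte cache lines


def crosses_cache_line(address, size, line_bytes=CACHE_LINE_BYTES):
    """Return True if the byte range [address, address + size) spans two or more lines."""
    first_line = address // line_bytes
    last_line = (address + size - 1) // line_bytes
    return first_line != last_line


# A 16-byte load at offset 48 ends exactly at the line boundary: one line touched.
print(crosses_cache_line(48, 16))
# The same load at offset 56 spills 8 bytes into the next line: two lines touched.
print(crosses_cache_line(56, 16))
```

Accesses of the second kind are the ones the paragraph above describes as costly on the Core architecture.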

Nehalem is just over 40% faster than Penryn, clock for clock, in 3dsmax.

Cinebench R10

A benchmarking favorite, Cinebench R10 is designed to give us an indication of performance in the Cinema 4D rendering application.

Cinebench also shows healthy gains with Nehalem: performance went up about 20% clock for clock over Penryn.

We also ran the single-threaded Cinebench test to see how performance improved on an individual core basis vs. Penryn (Updated: The original single-threaded Penryn Cinebench numbers were incorrect; we've included the correct ones):

Cinebench R10 - Single Threaded Benchmark

Cinebench shows us an increase of just under 3% in core-to-core performance from Penryn to Nehalem at the same clock speed (3015 vs. 2931). For applications that don't go out to main memory much and can stay confined to a single core, Nehalem behaves very much like Penryn. Remember that outside of the memory architecture and HT tweaks to the core, Nehalem's list of improvements is very specific (e.g. faster unaligned cache accesses).

The single thread to multiple thread scaling of Penryn vs. Nehalem is also interesting:

Cinebench R10 | 1 Thread | N-Threads | Speedup
Nehalem (2.66GHz) | 3015 | 12596 | 4.18x
Core 2 Quad Q9450 - Penryn - (2.66GHz) | 2931 | 10445 | 3.56x

 

The speedup confirms what you'd expect in a well-threaded FP test like Cinebench: Nehalem manages to scale better thanks to Hyper-Threading. If Nehalem had the same 3.56x scaling factor that we saw with Penryn, it would score 10733, virtually in line with Penryn's 10445. It's Hyper-Threading that puts Nehalem over the edge and accounts for the rest of the gain here.
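The arithmetic behind that comparison can be checked with a few lines of Python. The scores come from the table above; the variable names are just for illustration, and the hypothetical score uses the scaling factor rounded to two decimals, as the text does.

```python
# Cinebench R10 scores from the table above.
nehalem_1t, nehalem_nt = 3015, 12596
penryn_1t, penryn_nt = 2931, 10445

# Scaling factor: multi-threaded score divided by single-threaded score.
nehalem_scaling = nehalem_nt / nehalem_1t
penryn_scaling = penryn_nt / penryn_1t

# What Nehalem would score if it only scaled as well as Penryn does.
hypothetical_nehalem_nt = nehalem_1t * round(penryn_scaling, 2)

print(f"Nehalem scaling: {nehalem_scaling:.2f}x")
print(f"Penryn scaling:  {penryn_scaling:.2f}x")
print(f"Nehalem at Penryn's scaling: {hypothetical_nehalem_nt:.0f}")
print(f"Share of actual score above that: {1 - hypothetical_nehalem_nt / nehalem_nt:.1%}")
```

The last line estimates how much of Nehalem's multi-threaded score sits above the Penryn-style scaling baseline, which is the portion the text attributes to Hyper-Threading.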

While many 3D rendering and video encoding tests can take at least some advantage of more threads, what about applications that don't? One aspect of Nehalem's performance we're really not stressing much here is its integrated memory controller (IMC), since most of these benchmarks ended up being more compute intensive. Where HT doesn't give it the edge, we can expect some pretty reasonable gains from Nehalem's IMC alone. The Nehalem we tested here is crippled in that respect thanks to a premature motherboard, but gains on the order of 20% in single or lightly threaded applications are a good expectation to have.

 

POV-Ray 3.7 Beta 24

POV-Ray is a popular raytracer that also comes with a built-in benchmark. We used the 3.7 beta, which has SMP support, and ran the built-in multithreaded benchmark.

Finally, POV-Ray echoes what we've seen elsewhere, with a 36% performance improvement over the 2.66GHz Core 2 Quad Q9450. Note that Nehalem continues to be faster than even the fastest Penryns available today, despite the lower clock speed of this early sample.

108 Comments

  • kilkennycat - Thursday, June 5, 2008 - link

    Isn't 6GB of RAM a pretty sweet spot for desktop 64-bit applications, whatever about servers?
  • jimmysmitty - Thursday, June 5, 2008 - link

    Well I have been waiting for Nehalem. I gave in and decided to build a rig with the Q6600 but kinda sad now.

    Anyways. Crank the Planet, he's not showing fanboyism. He stated Intel has been promising a 20-30% increase with Nehalem. They are seeing 20-50% from these benchmarks. Take 21 and divide it by 14; that gives you 1.5. That means that the AMD Phenom's latency is about 50% slower.

    If anything you are showing fanboyism. Nehalem is showing to be one hell of a chip and you are just angry that AMD has nothing to compare to it. Even after AMD finishes absorbing ATI, what's next, K10.5 aka Deneb? That's just a 45nm refresh (just like Penryn was for Conroe). Unless there are some major changes in the architecture it will just, hopefully, make Phenom run at higher clocks and cooler.

    Other than that I can't wait to see what this does for games. I know that most games are more GPU dependent, but I myself play mainly Valve games using Source, and that's very CPU dependent and already runs great on my Q6600, but I want to see what this game will do for their particle and physics system...
  • Nehemoth - Thursday, June 5, 2008 - link

    Please, please, please Intel, I would like to have this monster chip in our servers without the annoying FBD. I don't want hot FBD; bring me normal DDR2 (without FBD) or DDR3.

    Just what I ask.

  • Griswold - Thursday, June 5, 2008 - link

    I'm a big fan of multi-core systems, but I'm not blind to reality: Why no single threaded benchmarks, but only benchmarks that scale very well with more cores/SMT? By the time these things are on the market, most applications will still be single threaded and you know it...

    I just want to know how much faster it is per clock per core.
  • Anand Lal Shimpi - Thursday, June 5, 2008 - link

    Interestingly enough, none of our standard CPU benchmarks are single threaded at all - even the most benign ones are multithreaded (including the games). I did run some single thread Cinebench numbers though:

    Nehalem - 3015
    Q9450 - 2396
  • bradley - Thursday, June 5, 2008 - link

    Why is there such a large discrepancy with the single-threaded Cinebench tests from six months ago, where the Q9450 scored 2944, a mere 2.4% below Nehalem, compared to the current 2396, a more substantial 20.5% below?

    http://www.anandtech.com/printarticle.aspx?i=3153

    I too believe single-threaded benches should be the foundation of any meaningful and relevant cpu review, if time indeed was permitting. To me this is the greatest objective real-world equalizer. There just isn't enough multi-threaded software out there, much less software able to run all eight cores. I would also like to emphasize that unlike server chips, desktop Nehalems will only have two memory channels. And as I understand, hyper threading also will only make an appearance in server and enthusiast chipsets. So already this makes an accurate comparison difficult enough.

    Finally, I understand the avg visitor will treat this like any good entertainment, where one is meant to suspend his-her disbelief. Still I have a hard time believing anyone has the ability to abscond away such important chips from a huge corporation like Intel. "Without Intel's approval, supervision, blessing or even desire - we went ahead and snagged us a Nehalem (actually, two) and spent some time with them." That initial premise does make anything coming after less impactful, or seemingly less than straightforward.

    Certainly if history has taught us anything, we know final shipping silicon is sometimes quite different from test chips. We should also assume it's a lot easier to create one chip than to manufacture hundreds of thousands on a large scale. Nothing is ever a given, which makes it hard to draw much of a conclusion. Interesting preview nonetheless.
  • SiliconDoc - Monday, July 28, 2008 - link

    Shhhhh... gosh we have to have core hype ... and the multicore testers have to optimize for the coming chips... geeze they have to make a living somehow...
    ( You sir, are exactly correct, but we live in a strange world nowadays where the truth is so evident it must be hidden most of the time for various other reasons... )
    Gosh, you want to crash the whole economy with that sane and rational talk ?
    What are you an anarchist ? ( yes I'm kidding, that was a big high five to you)
  • Anand Lal Shimpi - Thursday, June 5, 2008 - link

    Ignore those numbers (check page 6 of the comments for an explanation), the Q9450 comes in at 2931 vs. Nehalem's 3015.

    -A
  • pnyffeler - Thursday, June 5, 2008 - link

    I'm not a Mac person, but I think Macs may benefit from this technology even more than Vista. As I recall from a previous Anandtech article, Macs have an excellent memory management system, which benefits very directly from increased memory size. The increased bandwidth could make the snazzy OS even better...
  • Visual - Thursday, June 5, 2008 - link

    It is great that your "clock for clock" comparisons to the penryn in encoding and rendering are showing an improvement... but could that improvement be from the doubled amount of virtual processors that are visible? Are all of these benchmarks using eight or four threads on the nehalem?
