Fully Buffered DIMM: An Unnecessary Requirement

The Intel D5400XS is quite possibly the most impressive part of the entire Skulltrail platform. Naturally it features two LGA-771 sockets, connected to Intel's 5400 chipset via two 64-bit FSB interfaces. The chipset supports the 1600MHz FSB required by the QX9775, but it will also work with all other LGA-771 Xeon processors, in case you happen to have some lying around your desk.

Thanks to the Intel 5400 chipset, the D5400XS can only use Fully Buffered DIMMs. If you're not familiar with FBD, here's a quick refresher taken from our Mac Pro review:

Years ago, Intel saw two problems with most mainstream memory technologies: 1) as we pushed for higher speed memory, the number of memory slots per channel went down, and 2) the rest of the world was going serial (USB, SATA and more recently HyperTransport, PCI Express, etc.), yet we were still using fairly antiquated parallel memory buses.

The number of memory slots per channel isn't really an issue on the desktop; currently, with unbuffered DDR2-800, we're limited to two slots per 64-bit channel, giving us a total of four slots on a motherboard with a dual channel memory controller. With four slots, just about any desktop user's needs can be met with the right DRAM density. It's in the high end workstation and server space that this limitation becomes an issue, as memory capacity can be far more important there, often requiring 8, 16, 32 or more memory sockets on a single motherboard. At the same time, memory bandwidth matters too, as these workstations and servers will most likely be built around multi-socket, multi-core architectures with high memory bandwidth demands, so simply limiting memory frequency in order to support more memory isn't an ideal solution. You could always add more channels; however, parallel interfaces by nature require more signaling pins than faster serial buses, so adding four or eight channels of DDR2 to get around the DIMMs-per-channel limitation isn't exactly easy.
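
To put rough numbers on that pin-count argument, here's a quick sketch; the 120-pin figure for a parallel DDR2 channel and the 480-pin budget are illustrative assumptions on our part, while the 69-pin FBD interface is discussed below:

```python
# Rough pin-budget illustration of why adding parallel channels is costly.
# The DDR2 per-channel pin count and the total budget are ballpark
# assumptions for illustration only.

DDR2_CHANNEL_PINS = 120   # assumed: ~64 data lines plus strobes, address,
                          # command and control signals per parallel channel
FBD_CHANNEL_PINS = 69     # the narrow serial FBD interface (see below)

PIN_BUDGET = 480          # hypothetical pins a chipset can spend on memory I/O

print(PIN_BUDGET // DDR2_CHANNEL_PINS, "parallel DDR2 channels fit")  # -> 4
print(PIN_BUDGET // FBD_CHANNEL_PINS, "serial FBD channels fit")      # -> 6
```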

Intel's first solution was to totally revamp PC memory technology: instead of going down the path of DDR and eventually DDR2, Intel wanted to move the market to a serial memory technology, RDRAM. RDRAM offered significantly narrower buses (16 bits per channel vs. 64 bits per channel for DDR), much higher bandwidth per pin (at the time, a 64-bit wide RDRAM memory controller could offer 6.4GB/s of memory bandwidth, while a 64-bit DDR266 interface could only offer 2.1GB/s), and of course the ease-of-layout benefits that come with a narrow serial bus.

Unfortunately, RDRAM offered no tangible performance increase, as the demands of processors at the time were nowhere near what the high bandwidth RDRAM solutions could deliver. To make matters worse, RDRAM implementations were plagued by higher latency than their SDRAM and DDR SDRAM counterparts; with no use for the added bandwidth, and with higher latency, RDRAM systems were no faster, and often slower, than their SDR/DDR counterparts. The final nail in the RDRAM coffin on the PC was pricing: at the time, you could either spend $1000 on a 128MB stick of RDRAM or $100 on a stick of equally performing PC133 SDRAM. The market spoke, and RDRAM went the way of the dodo.

Intel quietly shied away from attempting to change the natural evolution of memory technologies on the desktop for a while. It eventually transitioned away from RDRAM, even after its price had dropped significantly, embracing DDR and more recently DDR2 as the memory standards supported by its chipsets. Over the past couple of years, however, Intel got back into the game of shaping the memory market of the future with the idea of Fully Buffered DIMMs.

The approach is quite simple in theory: what caused RDRAM to fail was the high cost of a non-mass-produced memory device, so why not develop a serial memory interface that uses mass-produced commodity DRAMs such as DDR and DDR2? In a nutshell, that's what FB-DIMMs are: regular DDR2 chips on a module, plus a special chip that communicates over a serial bus with the memory controller.

The memory controller in the system no longer has a wide parallel interface to the memory modules; instead it has a narrow 69-pin interface to a device known as the Advanced Memory Buffer (AMB) on the first FB-DIMM in each channel. The memory controller sends all memory requests to the AMB on the first FB-DIMM in each channel, and the AMBs take care of the rest. By fully buffering all requests (data, command and address), the memory controller no longer sees a load that increases significantly with each additional DIMM, so the number of memory modules supported per channel goes up substantially. The FB-DIMM spec says that each channel can support up to 8 FB-DIMMs, although current Intel chipsets can only address 4 FB-DIMMs per channel. And with a significantly lower pin count, you can cram more channels onto your chipset, which is why the Intel 5000 series of chipsets features four FBD channels.

The AMB has two major roles: communicating with the chipset's memory controller (or with other AMBs), and communicating with the memory devices on its own module.

When a memory request is made, the first AMB in the chain figures out whether the request is a read/write for its own module or for another module down the line. If it's the former, the AMB parallelizes the request and sends it off to the DDR2 chips on the module; if not, it passes the request on to the next AMB, and the process repeats.
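
To make the routing concrete, here's a minimal Python sketch of one FBD channel; the class and method names are our own invention, and a real AMB obviously does far more (serialization, timing, error handling):

```python
# Minimal sketch of request routing down a chain of AMBs on one FBD channel.
# Names and structure are illustrative, not a model of the actual silicon.

class AMB:
    def __init__(self, module_id, next_amb=None):
        self.module_id = module_id   # which FB-DIMM this buffer lives on
        self.next_amb = next_amb     # the next AMB down the chain, if any

    def handle(self, target_module, address):
        if target_module == self.module_id:
            # The request is for this module: parallelize it and drive the
            # commodity DDR2 chips over their ordinary parallel interface.
            return f"DIMM {self.module_id}: DDR2 access at {hex(address)}"
        if self.next_amb is not None:
            # Not for this module: pass the request on down the chain.
            return self.next_amb.handle(target_module, address)
        raise ValueError("no such module on this channel")

# One channel populated with four FB-DIMMs (the current Intel limit):
channel = AMB(0, AMB(1, AMB(2, AMB(3))))
print(channel.handle(2, 0x1F00))
```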

As we've seen, the AMB translation process introduces a great deal of latency to all memory accesses (it also adds about 3-6W of power per module), negatively impacting performance. The tradeoff is generally worth it in workstation and server platforms, where the ability to use far more memory modules outweighs the latency penalty. The problem with the D5400XS motherboard is that it only features one memory slot per FBD channel, all but defeating the point of having FBD support in the first place.


Four slots, great. We could've done that with DDR3, guys.

You do get the benefit of added bandwidth, since Intel is able to cram four FBD channels into the 5400 chipset; the problem is that the two CPUs on the motherboard can't use all of it. Serial buses inherently have more overhead than their parallel counterparts, but the 38.4GB/s of memory bandwidth offered by the chipset still sounds impressive for a desktop motherboard. You only get that full bandwidth if all four memory slots are populated, though, and populating them increases latency as well.

Some quick math will show you that peak bandwidth between the CPUs and the chipset is far less than the 38.4GB/s offered between the chipset and memory. Even with a 1600MHz FSB we're only talking about 25.6GB/s of bandwidth across both sockets. We've already seen that the 1333MHz FSB doesn't really do much for a single processor, so a good chunk of that bandwidth will go unused by the four cores connected to each FSB.
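
If you want to check the math, here's a quick sketch; note that treating each FBD channel's 6.4GB/s read path and 3.2GB/s write path as additive is our assumption about how the 38.4GB/s figure is derived:

```python
# Back-of-the-envelope Skulltrail bandwidth math.

# CPU <-> chipset: each 64-bit FSB moves 8 bytes per transfer.
fsb_transfers = 1600e6                       # 1600MHz effective (quad-pumped)
fsb_bytes = 8                                # 64-bit data bus
per_fsb = fsb_transfers * fsb_bytes / 1e9    # 12.8 GB/s per socket
cpu_side = 2 * per_fsb                       # two sockets -> 25.6 GB/s

# Chipset <-> memory: four FBD channels. Assuming DDR2-800 behind each AMB,
# and counting each channel's read lanes (6.4GB/s) and write lanes (3.2GB/s)
# together, yields the chipset's quoted figure.
memory_side = 4 * (6.4 + 3.2)                # 38.4 GB/s

print(f"CPU <-> chipset:    {cpu_side:.1f} GB/s")
print(f"chipset <-> memory: {memory_side:.1f} GB/s")
```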

The X38/X48's dual channel DDR3-1333 memory controller would've offered more than enough bandwidth for the two CPUs, without all of the performance and power penalties associated with FBD. Unfortunately, a side effect of choosing to stick with a Xeon chipset is that FBD isn't optional - you're stuck with it. As you'll soon see, this side effect really hurts Skulltrail.

Comments

  • SiliconDoc - Thursday, February 7, 2008 - link

    You're onto something there, just make it ten times worse and you'll have the real picture. I've seen hardware years ago that puts current hard drives to shame. So for whatever reasons, things are limited, like the 56k modem was.
    A friend just bought an HD2900 Pro (got it yesterday), 256-bit 512MB. There were rumors that there was a 512-bit version; he swore he saw it advertised. Well, to make a long story short, the $163 512-bit version was getting BIOS flashed and overclocked up to the $400 XT 2900 whatever... (it was the same core apparently) and it got sold real quickly and then pulled... it's a ghost now...
    I looked for one since I just found out, and saw one at some place online for nearly $400, and one at a music store online posted but not in stock - special order only, likely a mere pic-e-presence.
    In other words, they can pump those things out like mad, and depending on how much turkey they want in their bank... they start doing calculations, and when the consumer "catches" them, it's like anything else.
    Let's face it, prices have gone a bit wild lately, and the big boys must have ringing cash registers in their eyes.
    If they can pump 300 or 500 or 2 grand out of people instead of Disney World or Vegas, they'll do it, and they see the drooling...
    slobberer out
  • Anonymous Freak - Tuesday, February 5, 2008 - link

    Have you checked the prices of Xeons vs. equivalent Core 2 Extreme recently?

    According to Pricewatch, the Xeon 5472 (3.0 GHz, 1600 MHz bus) is about $1029/$1050. The Core 2 Extreme QX9650 (3.0 GHz, 1600 MHz bus) is $1038/$1166. I can't find the QX9770 on Pricewatch, but other searches find it is about $1600, while the equivalent Xeon is $1400.

    The Core 2 "Extreme" line has, since its inception, been more expensive than equivalent Xeons. Heck, it might be cheaper to pick up the Xeon equivalent of the 9775 than to pick up the 9775 itself.
  • Anonymous Freak - Tuesday, February 5, 2008 - link

    You state: "We tested Skulltrail with only two FB-DIMMs installed, but even in this configuration memory latency was hardly optimal:"

    This is a major flaw in your benchmarking. As [url=http://www.anandtech.com/mac/showdoc.aspx?i=2816&a...]your own[/url] Mac Pro review shows, quad-channel FB-DIMMs have lower latency, and higher bandwidth, than dual-channel. You should have filled all four FB-DIMM sockets. The latency penalty of multiple AMBs only applies to multiple AMBs on the same channel. For example, in a 5400-based server with four sockets per channel, having four total FB-DIMMs (one per channel, say 4 GB each) produces better results than eight total FB-DIMMs (two per channel, 2 GB each), and a sixteen-FB-DIMM system (four per channel, 1 GB each) fares worst of all. Of course, that is assuming the TOTAL amount of RAM remains the same for each configuration. If you have an application that can benefit from massive amounts of RAM, having the extra RAM will far outweigh the performance penalty of the extra AMBs per channel. (In my example, moving from 16 GB of RAM using four 4 GB FB-DIMMs to 64 GB by having sixteen 4 GB FB-DIMMs would produce performance benefits to certain applications just from the amount of RAM.)

    In addition, the new chipset, and newer FB-DIMM modules with newer AMBs, produce better results than their first-generation counterparts. For example, your Mac Pro benchmark showed CPU-Z latencies of 87 ns (quad-channel) and 92 ns (dual-channel, worse) for the Mac Pro, vs. 52 ns for a Core 2 Duo with DDR2-800; the new benchmark shows 79 ns for the 5400 chipset in dual-channel (assuming the same ratio, quad-channel should show about 74 ns) vs. 55 ns for a Core 2 Quad with DDR2-800. Yeah, 74 is still slower than 55, but it's better than the 87 ns the (original) Mac Pro scored. (The new Mac Pro should see an improvement on par with this Skulltrail board over the old Mac Pro.)
  • Anand Lal Shimpi - Tuesday, February 5, 2008 - link

    You are correct on the FBD/latency issue. We didn't have small enough FB-DIMMs on hand to run a 4x1GB configuration, but the difference in latency is still not enough to change the situations where Skulltrail is outperformed by its desktop counterparts. The situation will improve a bit, but the point I was trying to make is that in applications that can't take advantage of all 8 cores, Skulltrail will be slower thanks to its higher-latency memory subsystem.

    Take care,
    Anand
  • Googer - Monday, February 4, 2008 - link

    For being a premium enthusiast product with a $500 price tag and server DNA, this thing better come with an integrated SAS controller too. There are plenty of other server/workstation motherboards in this price range that offer SAS; if performance is the purpose of Skulltrail's existence, there's no reason for it to be left out. 15,000 RPM drives for the win.
  • dansus - Monday, February 4, 2008 - link

    I would imagine you would see more of a difference if you used the multi-threaded .dll (mt.dll) with x264 when encoding.

    Especially if you're doing a 2+ pass encode, where the first pass typically uses 50% CPU.

    I can see myself buying one later in the year as prices come down. At the very least, I can do two quad-core encodes at once.
  • JKing76 - Monday, February 4, 2008 - link

    It's no secret that at a certain point, the computer "enthusiast" market is more about bragging than performance. But this is the most absurd and pointless release of pure e-penis waggling I've ever seen, and as a computer engineer I am literally embarrassed that a legitimate company like Intel is responsible. The EPA should fine Intel for this debacle, penalize them for each machine sold, and confiscate the computers of anyone selfish and stupid enough to buy one.
  • SiliconDoc - Thursday, February 7, 2008 - link

    Wow. I get a kick out of the bloggers that so often find so many problems with the really high end machines. Strangely enough, they never seem to post their "rig" stats when they're having a big fit of complaints.
    I suspect the real problem is massive e-penis envy. Expecting the government to shut down a private firm's product, confiscate purchasers' products unless they pass your "needs" test, and maybe give them a greenpeace fine and carbon tax (I know it crossed someone's mind) seems to me to be the biggest green streak of jealousy I've yet witnessed.
    The bottom line is, more than 99% of the freaks reading this review would wet their pants and float off into heavenly bliss if they found the "Skulltrail" (that's what I find offensive - the sick name) on their desk in the morning.
    I find the whole thing much like a bunch of guys at an auto show putting down the swing-up door 10/80 stainless brand new XXX sports car, when deep down inside not a one of them would turn down the set of keys, no matter how often they'd claim otherwise.
    Suddenly, all that extra CPU horsepower would be the prudent reserve for the upcoming releases that no doubt will very soon make use of it all, since duos and quads are now becoming commonplace.
    It's just all so amusing, when the Joneses hate the new McMansion basically because they aren't living in it.
  • Nihility - Monday, February 4, 2008 - link

    Why didn't AMD make this available with Phenom? It would have won them the performance crown (sort of, since this apparently doesn't scale very well).
  • legoman666 - Monday, February 4, 2008 - link

    The scalability has nothing to do with the platform; it has to do with the apps themselves. 2x Phenoms will scale no better than 2x Intel quads. There are simply few programs out there designed to take advantage of >1 core, much less 8.
