Fully Buffered DIMM: An Unnecessary Requirement

The Intel D5400XS is quite possibly the most impressive part of the entire Skulltrail platform. Naturally it features two LGA-771 sockets, connected to Intel's 5400 chipset via two 64-bit FSB interfaces. The chipset supports the 1600MHz FSB required by the QX9775, but it will also work with all other LGA-771 Xeon processors, in case you happen to have some lying around your desk.

Thanks to the Intel 5400 chipset, the D5400XS can only use Fully Buffered DIMMs. If you're not familiar with FBD, here's a quick refresher taken from our Mac Pro review:

Years ago, Intel saw two problems happening with most mainstream memory technologies: 1) as we pushed for higher speed memory, the number of memory slots per channel went down, and 2) the rest of the world was going serial (USB, SATA and more recently, HyperTransport, PCI Express, etc.), yet we were still using fairly antiquated parallel memory buses.

The number of memory slots per channel isn't really an issue on the desktop; currently, with unbuffered DDR2-800 we're limited to two slots per 64-bit channel, giving us a total of four slots on a motherboard with a dual channel memory controller. With four slots, just about any desktop user's needs can be met with the right DRAM density. It's in the high end workstation and server space that this limitation becomes an issue, as memory capacity can be far more important, often requiring 8, 16, 32 or more memory sockets on a single motherboard. At the same time, memory bandwidth is also important as these workstations and servers will most likely be built around multi-socket multi-core architectures with high memory bandwidth demands, so simply limiting memory frequency in order to support more memory isn't an ideal solution. You could always add more channels, however parallel interfaces by nature require more signaling pins than faster serial buses, and thus adding four or eight channels of DDR2 to get around the DIMMs per channel limitation isn't exactly easy.

Intel's first solution was to totally revamp PC memory technology. Instead of going down the path of DDR and eventually DDR2, Intel wanted to move the market to a serial memory technology: RDRAM. RDRAM offered significantly narrower buses (16-bits per channel vs. 64-bits per channel for DDR), much higher bandwidth per pin (at the time a 64-bit wide RDRAM memory controller would offer 6.4GB/s of memory bandwidth, compared to the 2.1GB/s of a 64-bit DDR266 interface) and of course the ease of layout benefits that come with a narrow serial bus.
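
Those bandwidth figures follow from simple peak-rate arithmetic. As a quick sketch (using the bus widths and transfer rates quoted above; these are theoretical peaks, not measured numbers):

```python
# Peak memory bandwidth: bus width (bits) x transfer rate (MT/s),
# divided by 8 bits-per-byte and 1000 MB-per-GB.
def bandwidth_gbs(bus_width_bits, transfers_mt_s):
    return bus_width_bits * transfers_mt_s / 8 / 1000

# DDR266: 64-bit bus at 266 MT/s -> ~2.1 GB/s
ddr266 = bandwidth_gbs(64, 266)

# PC800 RDRAM: 16-bit channel at 800 MT/s -> 1.6 GB/s per channel;
# four channels make up the "64-bit wide" controller -> 6.4 GB/s
rdram_channel = bandwidth_gbs(16, 800)
rdram_64bit = 4 * rdram_channel

print(f"DDR266: {ddr266:.1f} GB/s, 64-bit RDRAM: {rdram_64bit:.1f} GB/s")
```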

Unfortunately, RDRAM offered no tangible performance increase, as the demands of processors at the time were nowhere near what the high bandwidth RDRAM solutions could deliver. To make matters worse, RDRAM implementations were plagued by higher latency than their SDRAM and DDR SDRAM counterparts; with no use for the added bandwidth and higher latency, RDRAM systems were no faster, and often slower, than their SDR/DDR counterparts. The final nail in the RDRAM coffin on the PC was pricing; your choices at the time were to either spend $1000 on a 128MB stick of RDRAM, or spend $100 on a stick of equally performing PC133 SDRAM. The market spoke and RDRAM went the way of the dodo.

Intel quietly shied away from attempting to change the natural evolution of memory technologies on the desktop for a while, eventually transitioning away from RDRAM (even after its price dropped significantly) and embracing DDR and more recently DDR2 as the memory standards supported by its chipsets. Over the past couple of years, however, Intel got back into the game of shaping the memory market of the future with this idea of Fully Buffered DIMMs.

The approach is quite simple in theory: what caused RDRAM to fail was the high cost of using a non-mass produced memory device, so why not develop a serial memory interface that uses mass produced commodity DRAMs such as DDR and DDR2? In a nutshell, that's what FB-DIMMs are: regular DDR2 chips on a module with a special chip that communicates over a serial bus with the memory controller.

The memory controller in the system no longer has a wide parallel interface to the memory modules; instead, it has a narrow 69-pin interface to a device known as an Advanced Memory Buffer (AMB) on the first FB-DIMM in each channel. The memory controller sends all memory requests to that first AMB, and the AMBs take care of the rest. By fully buffering all requests (data, command and address), the memory controller no longer has a load that significantly increases with each additional DIMM, so the number of memory modules supported per channel goes up significantly. The FB-DIMM spec says that each channel can support up to 8 FB-DIMMs, although current Intel chipsets can only address 4 FB-DIMMs per channel. With a significantly lower pin-count, you can cram more channels onto your chipset, which is why the Intel 5000 series of chipsets feature four FBD channels.
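
A quick sanity check on those per-channel numbers (the DIMM counts come from the text above; the arithmetic just shows why FBD matters for capacity):

```python
# FB-DIMM capacity arithmetic, per the figures in the text.
FBD_SPEC_DIMMS_PER_CHANNEL = 8    # FB-DIMM spec maximum per channel
INTEL_DIMMS_PER_CHANNEL = 4       # what current Intel chipsets can address
CHANNELS = 4                      # FBD channels on the 5000-series chipsets

spec_slots = FBD_SPEC_DIMMS_PER_CHANNEL * CHANNELS   # 32 slots allowed by spec
intel_slots = INTEL_DIMMS_PER_CHANNEL * CHANNELS     # 16 slots addressable today

print(f"Spec allows {spec_slots} FB-DIMMs per board; "
      f"current Intel chipsets address {intel_slots}")
```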

The AMB has two major roles: communicating with the chipset's memory controller (or other AMBs), and communicating with the memory devices on its own module.

When a memory request is made, the first AMB in the chain determines whether the request targets its own module or another one. If it's the former, the AMB parallelizes the request and sends it off to the DDR2 chips on the module; if not, it passes the request on to the next AMB, and the process repeats.
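
That daisy-chain routing can be sketched as a toy model. This is purely illustrative (the module addressing and return values are hypothetical, not real AMB behavior), but it captures the forwarding logic described above:

```python
# Toy model of the AMB daisy chain: each buffer services requests addressed
# to its own module and forwards everything else down the channel.
class AMB:
    def __init__(self, module_id, next_amb=None):
        self.module_id = module_id    # which FB-DIMM this buffer sits on
        self.next_amb = next_amb      # next AMB down the chain, if any

    def handle(self, target_module, op):
        if target_module == self.module_id:
            # Ours: parallelize the serial request and drive the local DDR2 chips
            return f"module {self.module_id}: {op} serviced by local DDR2"
        if self.next_amb is not None:
            # Not ours: forward down the chain (each hop adds latency)
            return self.next_amb.handle(target_module, op)
        return f"module {target_module}: not present on this channel"

# A channel of three FB-DIMMs; the memory controller talks only to dimm0.
dimm2 = AMB(2)
dimm1 = AMB(1, dimm2)
dimm0 = AMB(0, dimm1)

print(dimm0.handle(2, "read"))   # forwarded through two AMBs before service
```

Note how every access to dimm2 traverses two intermediate buffers first, which is exactly where FBD's latency penalty comes from.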

As we've seen, the AMB translation process introduces a great deal of latency to all memory accesses (it also adds about 3-6W of power per module), negatively impacting performance. The tradeoff is generally worth it in workstation and server platforms because the ability to use even more memory modules outweighs the latency penalty. The problem with the D5400XS motherboard is that it only features one memory slot per FBD channel, all but ruining the point of even having FBD support in the first place.


Four slots, great. We could've done that with DDR3 guys.

You do get the benefit of added bandwidth, since Intel is able to cram four FBD channels into the 5400 chipset; the problem is that the two CPUs on the motherboard can't use all of it. Serial buses inherently have more overhead than their parallel counterparts, but the 38.4GB/s of memory bandwidth offered by the chipset sounds impressive for a desktop motherboard. You only get that full bandwidth if all four memory slots are populated, though doing so increases latency as well.

Some quick math will show you that peak bandwidth between the CPUs and the chipset is far less than the 38.4GB/s offered between the chipset and memory. Even with a 1600MHz FSB we're only talking about 25.6GB/s of bandwidth across the two FSBs. We've already seen that the 1333MHz FSB doesn't really do much for a single processor, so a good chunk of that bandwidth will go unused by the four cores connected to each branch.
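
That quick math, spelled out (peak theoretical figures from the text; a 64-bit FSB moves 8 bytes per transfer):

```python
# CPU-side vs memory-side peak bandwidth on the D5400XS.
BYTES_PER_TRANSFER = 8            # 64-bit FSB data bus

def fsb_gbs(mt_per_s):
    """Peak FSB bandwidth in GB/s for a given transfer rate in MT/s."""
    return mt_per_s * BYTES_PER_TRANSFER / 1000

per_fsb = fsb_gbs(1600)           # 12.8 GB/s per 1600MHz FSB
cpu_side = 2 * per_fsb            # two sockets, two FSBs: 25.6 GB/s
memory_side = 38.4                # four FBD channels, per the article

unused = memory_side - cpu_side   # bandwidth the FSBs can never consume
print(f"CPU side: {cpu_side:.1f} GB/s, memory side: {memory_side:.1f} GB/s, "
      f"stranded: {unused:.1f} GB/s")
```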

The X38/X48 dual channel DDR3-1333 memory controller would've offered more than enough bandwidth for the two CPUs, without all of the performance and power penalties associated with FBD. Unfortunately a side effect of choosing to stick with a Xeon chipset is that FBD isn't optional - you're stuck with it. As you'll soon see, this is a side effect that really does hurt Skulltrail.

Comments

  • chizow - Monday, February 4, 2008 - link

    quote:

    we don't have a problem recommending it, assuming you are running applications that can take advantage of it. Even heavy multitasking won't stress all 8 cores, you really need the right applications to tame this beast.


    Not sure how you could come to that conclusion unless you posted some caveats like 1) you're getting it for free from Intel or 2) you're not paying for it yourself or have no concern about costs.

    Besides the staggering price tag associated with it ($500 + 2 x Xeon 9770 @ $1300-1500 + FB-DIMM premium) there are some real concerns with how much benefit this set-up would yield over the best performing single socket solutions. In games, there's no support for Tri-SLI and beyond for NV parts although 3-4 cards may be an option with ATI. 3 seems more realistic as that last slot will be unusable with dual-cards.

    Then there's the actual benefit gained on a practical basis. In games, looks like it's not even worth bothering with as you'd most likely see a bigger boost from buying another card for SLI or CrossFire. For everything else, they're highly input intensive apps, so you spend most of your work day preparing data to shave a few seconds off compute time so you can go to lunch 5 minutes sooner or catch an earlier train home.

    I guess in the end there's a place for products like this, to show off what's possible but recommending it without a few hundred caveats makes little sense to me.
  • chinaman1472 - Monday, February 4, 2008 - link

    The systems are made for an entirely different market, not the average consumer or the hardcore gamer.

    Shaving off a few minutes really adds up. You think people only compile or render one time per project? Big projects take time to finish, and if you can shave off 5 minutes every single time and have it happen across several computers, the thousands of dollars invested comes back. Time is money.
  • chizow - Monday, February 4, 2008 - link

    I didn't focus on real-world applications because the benefits are even less apparent. Save 4s on calculating time in Excel? Spend an hour formatting records/spreadsheets to save 4s...ya that's money well spent. The same is true for many real world applications. Sad reality is that for the same money you could buy 2-3x as many single-CPU rigs and in that case, gain more performance and productivity as a result.
  • Cygni - Monday, February 4, 2008 - link

    As we both noted, 'real world' isn't just Excel. It's also AutoCAD and 3dsmax. These are arenas where we aren't talking about shaving 4 seconds, we are talking shaving whole minutes and in extreme cases even hours on renders.

    This isn't an office computer, this isn't a casual gamer's machine. This is a serious workstation or extreme enthusiast rig, and you are going to pay the price premium to get it. Like I said, this is a CAD and 3D artist's dream machine... not for your secretary to make phonetrees on. ;)

    In this arena? I can't think of any machines that are even close to it in performance.
  • chizow - Monday, February 4, 2008 - link

    Again, in both AutoCAD and 3DSMax, you'd be better served putting that extra money into another GPU or even workstation for a fraction of the cost. 2-3x the cost for uncertain increases over a single-CPU solution or a second/third workstation for the same price. But for a real world example, ILM said it took about 24 hours or something ridiculous to render each Transformers frame. Say it took 24 hours with a single Quad Core with 2 x Quadro FX. Say Skulltrail cut that down to 18 or even 20 hours. Sure, nice improvement, but you'd still be better off with 2 or even 3 single CPU workstations for the same price. If it offered more GPU support and non-buffered DIMM support along with dual CPU support it might be worth it but it doesn't and actually offers less scalability than cheaper enthusiast chipsets for NV parts.
  • martin4wn - Tuesday, February 5, 2008 - link

    You're missing the point. Some people need all the performance they can get on one machine. Sure, batch rendering a movie you just do each frame on a separate core and buy roomfuls of blade servers to run them on. But think of an individual artist on their own workstation. They are trying to get a perfect rendering of a scene. They are constantly tweaking attributes and re-rendering. They want all the power they can get in their own box - it's more efficient than trying to distribute it across a network. Other examples include stuff like particle or fluid simulations. They are done best on a single shared memory system where you can load the particles or fluid elements into a block of memory and let all the cores in your system loose on evaluating separate chunks of it.

    I write this sort of code for a living, and we have many customers buying up 8 core machines for individual artists doing exactly this kind of thing.
  • Chaotic42 - Tuesday, February 5, 2008 - link

    Anyone can come up with arbitrary workflows that don't use all of the power of this system. There are, however, some workflows which would use this system.

    I'm a cartographer, and I deal with huge amounts of data being processed at the same time. I have a mapping program cutting imagery on one monitor, Photoshop performing image manipulation on a second, Illustrator doing TIFF separates on a third, and in the background I have four Excel tabs and enough IE tabs to choke a horse.

    Multiple systems make no sense because you need so much extra hardware to run them (in the case of this system, two motherboards, two cases, etc.) and you'll also need space to put the workstations (assuming you aren't using a KVM). You would also need to clog the network with your multi-gigabyte files to transfer them from one system to another for different processing.

    That seems a bit more of a hassle than a system like the one featured in the article.

  • Cygni - Monday, February 4, 2008 - link

    I don't see any problem with what he said there.

    All you talked about was gaming, but let's be honest here, this is not a system that's going to appeal to gamers, and this isn't a system setup for anyone with price concerns.

    In reality, this is a CAD/CAM dream machine, which is a market where $4-5,000 rigs are the low end. In the long run for even small design or production firms, 5 grand is absolute peanuts and WELL worth spending twice a year to have happy engineers banging away. The inclusion of SLI/Crossfire is going to move these things like hotcakes in this sector. There is nothing that will be able to touch it. And that's not even mentioning its uses for rendering...

    I guess what I'm saying is try to realize the world is a little bit bigger than gaming.
  • Knowname - Sunday, February 10, 2008 - link

    On that note, are there any studies on the gains you get in CAD applications by upgrading your video card? How much does the GPU really play in the process? The only significant gain I can think of for CAD is quad desktop monitors per card with Matrox vid cards. I don't see how the GPU (beyond the RAMDAC or whatever it's called) really makes a difference. Pls tell me this, I keep wasting my money on ATI cards (not to mention my G550 which I like, but it wasn't worth the money I spent on it when I could have gotten a 6600gt...) just on the hunch they'd be better than nvidia due to the 2d filtering and such (not really a big deal now, but...)
  • HilbertSpace - Monday, February 4, 2008 - link

    A lot of the 5000-series Intel chipsets let you use riser cards for more memory slots. Is that possible with skully?
