ASUS ROG Rampage Formula: Why we were wrong about the Intel X48
by Kris Boughton on January 25, 2008 4:30 AM EST - Posted in Motherboards
The Clock Crossing Procedure and MCH Read Delay (tRD)
Have you ever stopped to wonder how the Memory Controller Hub (MCH) manages to transfer data from the system bus (FSB) to the memory bus and vice versa? Maybe you questioned why it was not possible to run your memory at any speed you desired, independent of the selected FSB. While there have been some chipsets that provide this capability, performance was rather spotty at times depending on the particular settings in use. In fact, these chipsets often produced large ranges of bus and memory speed ratios that simply refused to function, stable or otherwise. The concept was certainly fair enough, but at the time a well-engineered implementation turned out to be far more difficult than anyone may have been willing to concede. It should not come as a surprise, then, that Intel chipsets have only allowed the use of certain validated memory ratios - those same numbers we have grown to love (or hate) like 1:1, 5:4, 3:2, and so forth.
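For readers who want to see how these dividers translate into actual memory speeds, here is a short, purely illustrative Python sketch - the 400MHz FSB and the list of ratios are simply examples of the familiar values mentioned above, not a statement of what any particular chipset validates.

    from fractions import Fraction

    def memory_speed(fsb_mhz, dram, fsb):
        # A DRAM:FSB divider scales the FSB clock to produce the memory clock;
        # DDR memory transfers data twice per clock, hence the doubled data rate.
        mem_clock = fsb_mhz * Fraction(dram, fsb)
        return float(mem_clock), float(mem_clock * 2)

    for dram, fsb in [(1, 1), (5, 4), (3, 2), (2, 1)]:
        clock, rate = memory_speed(400, dram, fsb)
        print(f"FSB 400MHz at {dram}:{fsb} -> {clock:.0f}MHz memory clock (DDR-{rate:.0f})")

The 2:1 entry at the end hints at the sort of territory DDR3 was designed for, a point we return to below.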
Back in the days of relatively slow-paced memory performance improvements, Intel had no problems pumping out processors with default FSB values that made running memory at synchronous speeds an expensive proposition. Thus they created the "down" divider that allowed the memory subsystem to run at a bus speed below that of the system bus. This allowed for a period of innovation where processor bus architecture was able to improve without placing an undue burden on the memory industry to create reasonably priced modules capable of these increased speeds.
In a lot of ways this also prevented cost-prohibitive aspects of validating higher data transfer rates from ultimately affecting the market as a whole. Because Intel was not obligated to allocate additional funding for the completion of this expensive process, the savings (however small) could be passed on to the consumer. When coupled with the added benefit of allowing for the purchase and use of lower-binned memory, we can easily see why a move such as this helped stimulate the industry into a period of unprecedented growth.
Today's market is much different. Older signaling technologies like Advanced Gunning Transceiver Logic (AGTL+), used almost exclusively by Intel in their monolithic CPU-to-MCH interface architecture, have just about reached their practical limits. As such, "up" dividers - those that allow for memory bus speeds in excess of the system bus speed - are now the norm in high-performance systems. Indeed, it would be hard to imagine ever being satisfied with a system able to make use of memory operating at transfer rates of up to DDR2-667 and no higher. As odd as it may seem, the next generation of memory (DDR3) is based entirely on the concept of running at speeds that mandate the use of at least a 2:1 memory bus to system bus frequency ratio.
When examined from this perspective, it becomes clear what DDR3 really is - an early look at the next generation's mainstream memory component. Only when Intel has removed the performance barrier of their antiquated FSB and loosely-coupled MCH will we finally be able to experience the real improvements that DDR3 promises to provide. But we digress - speculating about what could have been done better or about things to come is trite. Instead, let's concentrate on what can be done now to improve the collective performance of our systems.
Below is a somewhat basic representation of the problem at hand (and later we will introduce the fairly elegant solution that Intel has crafted in order to tackle said problem). Are you still wondering what the problem is? Put briefly: designing an effective means for transferring data from one bus "domain" to the other, while still respecting the timing requirements for each bus.
We mentioned earlier that we believe Intel's solution to be quite elegant. While this is true, the system is still far from perfect, as it introduces a significant amount of extra data access latency to each and every memory read request. It might seem ironic that the design originally constructed to allow for higher memory speeds ends up being the very reason for the additional latency penalty incurred, but this is the case. A similar accounting of the system's flaws also explains why realized system performance improvements are always far less than predicted when making use of the larger dividers.
Moving data between two simultaneously running parallel buses requires a fine level of timing control and a fair amount of programmable logic to be successful, especially if the scheme is to remain flexible. The graphic below should help to illustrate the concept. Four memory data clock cycles for every three system data bus cycles (a 4:3 DRAM:FSB memory ratio) means that the MCH can do little more than queue up data requested from the memory for transport to the system bus.
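As a rough illustration of the timing involved (the 400MHz FSB here is just an example, not a measured value), consider what a 4:3 ratio means in terms of raw clock periods:

    # Example numbers only: clock periods at a 4:3 DRAM:FSB ratio with a 400MHz FSB.
    fsb_mhz = 400.0
    mem_mhz = fsb_mhz * 4.0 / 3.0          # ~533MHz memory clock

    t_fsb = 1000.0 / fsb_mhz               # 2.5ns per FSB clock
    t_mem = 1000.0 / mem_mhz               # ~1.875ns per memory clock

    # The two clocks only realign every 3 FSB clocks (= 4 memory clocks), so data
    # read from the DRAM must sit in the MCH's buffers until an FSB slot lines up.
    print(f"FSB period {t_fsb:.3f}ns, memory period {t_mem:.3f}ns, "
          f"clocks realign every {3 * t_fsb:.1f}ns")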
This process works to ensure that no FSB cycles are left unused, but it does little to improve actual memory throughput. Because of this, the gains seen are largely due to the increased transfer efficiency. For the same reason, sustained data rates, while improved, come at the expense of the added latency associated with the transfer scheme. This has the effect of retarding the responsiveness of short bursts of random read activity. Some users [Ed: Androids in disguise?] even swear that they can tell when they are using an Intel system because of the suppression of the "snappy" experience that comes when using true low-latency systems.
In order to better quantify exactly how the MCH clock crossing procedure influences system performance, it might be helpful to take a step back and examine the different component parameters that contribute to a system's overall memory read (access) latency. The diagram below helps to illustrate just this point - we have identified each distinct operation that takes place during a CPU data request, and the sum of the individual latencies involved in each part of the process makes up the total system memory access latency, as measured. As we will see, this time - expressed in nanoseconds (ns) - is critical to overall system performance.
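To make the "sum of the parts" idea concrete, the following sketch simply adds up a set of hypothetical per-stage latencies; the stage names and numbers are placeholders chosen for illustration and are not taken from our measurements.

    # Hypothetical per-stage latencies (placeholder values, for illustration only).
    stages_ns = {
        "CPU request issue and FSB transit":        5.0,
        "MCH decode and command scheduling":        5.0,
        "DRAM access (tCL, tRCD, and friends)":    20.0,
        "MCH read delay (tRD) and clock crossing": 17.5,
        "Data return across the FSB to the CPU":    7.5,
    }
    total = sum(stages_ns.values())
    print(f"Modeled read latency: {total:.1f}ns")  # the single figure a tool like Everest reports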
Although the user has at least minor control over just about every aspect of the system as modeled above, there are a few settings that, when altered, have the potential to drastically impact performance for the better (or worse, depending on how well you understand what you are doing). One of the best known is the selection of primary memory timings (tCL-tRCD-tRP-tRAS). You may be surprised to learn that the effect of properly setting these values, represented by part "E", is quite insignificant when compared to your choice of tRD - the singular consideration in determining the latency associated with part "F". This will be our focus for the remainder of our discussion; we already touched briefly on this mysterious setting on the prior page.
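A quick calculation shows why tRD carries so much weight. tRD is counted in FSB clocks while tCL is counted in memory clocks, so each adds latency in proportion to its own bus period; the 400MHz 1:1 figures below are worked examples, not measurements from this board.

    # Example only: latency contribution of tRD (in FSB clocks) vs. tCL (in memory clocks).
    def cycles_to_ns(cycles, clock_mhz):
        return cycles * 1000.0 / clock_mhz

    fsb_mhz = 400.0   # example FSB clock
    mem_mhz = 400.0   # 1:1 memory ratio

    for trd in (8, 7, 6):
        print(f"tRD {trd}: {cycles_to_ns(trd, fsb_mhz):.1f}ns of MCH read delay")
    for tcl in (5, 4):
        print(f"tCL {tcl}: {cycles_to_ns(tcl, mem_mhz):.1f}ns of CAS latency")

    # Dropping tRD from 8 to 6 removes 5.0ns, while tightening CAS from 5 to 4
    # removes only 2.5ns - one reason tRD dominates the latency attributed to part "F".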
73 Comments
Vikendios - Thursday, January 31, 2008 - link
Very interesting. But I believe that AT is also guilty of perpetuating the chipset/multiple GPU incompatibility (or non-optimization) myths, by not giving us systematic reviews of X38/48 and 680/790i using both ATI and Nvidia twinned cards. And if some BIOS adjustments or driver updates are necessary to twin Nvidia cards under Intel chipsets, or ATI/AMD cards under Nvidia's, kindly tell and guide us.
I'm not a conspiracy theorist, but I think there is more than meets the eye in the present situation.
The apparent paradox of Intel (chipsets) pushing AMD (Crossfire) solutions is just marketing cycle hysteresis from the days when ATI was still an independent Canadian company.
But both Intel and AMD resent video card chip manufacturers forcing their way into hard-wired motherboard real estate thru the multiple GPU concept, with attendant slot and chipset modifications. With the demise of Via, Intel and AMD believe they can own the chipsets, as long as the motherboard manufacturers are only assemblers.
For Nvidia, multiple GPU is an easy way to extend the life of a good graphic chip until the next generation comes up, but mostly it provides for a temporary proprietary claim on the motherboard design. 3dfx first tried that years ago in Voodoo days and it worked. It worked again when ATI couldn't follow up fast enough on SLI and had to fall in AMD's arms.
Nvidia gambled that SLI would allow it to impose its own chipset business, either by technical or marketing (SLI endorsement) means. What next? Special gaming CPUs? That's a dangerous taunt, although Intel doesn't yet dare buy them, or compete directly with them with their own GPUs, out of antitrust concerns in Brussels.
Holly - Wednesday, January 30, 2008 - link
Excellent description of memory timing magic. Thumbs up :-)
FSBastrd - Tuesday, January 29, 2008 - link
I may have come off a little brash with my first comment. The article is pretty sweet, and I was able to read through it without the pictures, but that doesn't mean I wouldn't like to view them. It's not just this article either. Pictures pretty much never load on this website for me.
kjboughton - Wednesday, January 30, 2008 - link
Do you run some type of ad blocker? It may be causing problems by incorrectly blocking images from our servers...
FSBastrd - Wednesday, January 30, 2008 - link
I'm basically running a stock version of Firefox, so no. Ironically, the ads are just about the only pictures that do load for me. Also, all of the pictures for the AnandTech homepage load for me; it's just the pics in the articles. This is the only website that really gives me problems. One last thing: some (rare) pictures do load for me from the articles. All in all, it's quite strange, and I can't figure it out.
FSBastrd - Tuesday, January 29, 2008 - link
Am I the only one who can't get pictures to load from this site? It would sure make this article a whole lot easier to follow along.
Excellent review as ever!
Quick question with regard to watercooling blocks for this board. It looks more or less identical to the X38 apart from the different chip in the NB, therefore I'm wondering if you could tell me whether or not you think an ASUS X38 NB block would also fit the ASUS X48 Rampage?
Is the NB under the cooler the same size, and are the mounting screws in the same position as on the X38, e.g. the Maximus?
The SB and the MOSFET coolers will be the same as on the Maximus.
snarfbot - Sunday, January 27, 2008 - link
alright, pretty exciting results here.
at trd of 8 (default) at 400mhz 1:1 cas 4, i got 7687mb/s read, and 64ns latency in everest.
at trd of 6 at the same speed, divider and cas setting i got 8089mb/s read, and 59.8ns latency.
then just for fun i bumped the speed up to 500 and loosened the timings to cas 5, at 5:4, and left the trd at 6. at these settings i got 8640mb/s read, and 57.5ns latency.
the latency surprised me, as the trd remained the same, and i actually loosened the cas latency.
anyways pretty good results.
processor is an e2140@3200mhz.
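[Ed: one possible explanation for the latency drop - assuming the "500" above is the memory clock (DDR2-1000) reached via the 5:4 divider with the FSB still at 400MHz, and that tCL is counted in memory clocks while tRD is counted in FSB clocks - is that a looser CAS at a faster memory clock can cost nothing in absolute time. A quick back-of-the-envelope check:]

    # Back-of-the-envelope check (assumes FSB stays at 400MHz and "500" is the memory clock).
    def cycles_to_ns(cycles, clock_mhz):
        return cycles * 1000.0 / clock_mhz

    print(f"CAS 4 at 400MHz: {cycles_to_ns(4, 400):.1f}ns")     # 10.0ns
    print(f"CAS 5 at 500MHz: {cycles_to_ns(5, 500):.1f}ns")     # 10.0ns - same absolute time
    print(f"tRD 6 at FSB 400MHz: {cycles_to_ns(6, 400):.1f}ns") # unchanged between runs

    # The looser CAS costs nothing in nanoseconds at the higher memory clock, and the
    # remaining DRAM timings complete sooner at 500MHz, so total latency can still fall.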
snarfbot - Sunday, January 27, 2008 - link
alright, i have a ga-p35-ds3l. i'm running the fsb at 400, memory at 1:1 cas 4.
i set trd to 6 in the bios. based on the formula, it shouldn't even post.
trd(6) - tcl(4)/n(1) = 2, versus the fsb-400 limit x/n = 2/1 = 2
2 = 2
i'm gonna run through sandra and see what the difference is, if there is any, or perhaps this setting doesn't work correctly on this board.
Fyl - Sunday, January 27, 2008 - link
not to lower the merits of this great article, but since I've read it I've been experimenting with different settings on my machine, and for some of them your formula doesn't seem to hold; here's an example of a stable configuration, no overvoltage to anything:
E8500@3.6 (400MHz x 9)
P35-DS4 (tRD 7)
2x2G DDR2 800 (400MHz, 5-5-5-12)
based on your formula N = 400:400 = 1 and x = 2
therefore 7-(5/1) > 2/1 => 2 > 2 => false but actually working
am I missing anything?