Real-World Results: What Does a Lower tRD Really Provide?

Up until this point we have spent a lot of time writing about the "performance improvement" available by changing just tRD. First, let's define the gain: a lower tRD setting results in a lower associated TRD value (at an equivalent FSB clock), which in turn lowers memory read latency and ultimately delivers a higher memory read speed (MB/s). Exactly how a system responds to this additional bandwidth depends largely on how sensitive the application/game/benchmark is to variations in memory subsystem performance. It stands to reason that more bandwidth and lower latencies cannot be a bad thing, and we have yet to encounter a situation in which an improvement (i.e. decrease) in tRD resulted in lower observed performance.

EVEREST - a popular diagnostics, basic benchmarking, and system reporting program - gives us a means of quantifying the change in memory read rates experienced when directly altering tRD, through the use of its "Cache & Memory Benchmark" tool. We have collected these results and present them below for your examination. The essential point to remember when reviewing these figures is that all of the data was collected using memory speeds and settings well within the realm of normal achievement - an FSB of 400MHz using a 5:4 divider for DDR2-1000 with 4-4-4-10 primary timings at a Command Rate of 2N. The only change made between data collection runs was a modification to tRD.
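
As a quick sanity check of this configuration, the relationships between the FSB clock, the memory divider, and the FSB cycle time reduce to simple arithmetic. The short Python sketch below is purely illustrative (the variable names are ours, not BIOS settings) and just reproduces the numbers quoted above:

    # Illustrative only: derive the test configuration figures quoted above
    fsb_mhz = 400                        # FSB clock
    mem_clock_mhz = fsb_mhz * 5 / 4      # 5:4 divider -> 500 MHz memory clock
    ddr2_rating = mem_clock_mhz * 2      # double data rate -> DDR2-1000
    tcycle_ns = 1000 / fsb_mhz           # 2.5 ns per FSB clock
    print(f"Memory: DDR2-{ddr2_rating:.0f}, FSB Tcycle: {tcycle_ns:.2f} ns")

Keep the 2.5ns Tcycle figure in mind, as it reappears below as the per-step change in read latency.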


Memory Read Bandwidth - Variable tRD

Using the default tRD of 12, our system reached a maximum memory read bandwidth of 7,597 MB/s - a predictable result considering the rather relaxed configuration. Tightening tRD all the way down to 5 provides dramatically different results: 9,166 MB/s, more than 20% higher total throughput! Keep in mind that this was done completely independently of any memory setting adjustments. The central point of this outcome is that because the MCH alone is responsible for delivering the additional performance, the concept can be applied to any system, regardless of memory type or quality.
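
For those keeping score, the improvement works out as follows (a trivial calculation, included only to show where the 20% figure comes from):

    # Read bandwidth gain from the EVEREST results above
    bw_trd12 = 7597                      # MB/s at tRD 12
    bw_trd5 = 9166                       # MB/s at tRD 5
    gain = (bw_trd5 - bw_trd12) / bw_trd12 * 100
    print(f"Gain: {gain:.1f}%")          # ~20.7%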

The next graph shows how memory access (read) latency changes with each tRD setting. As we can see, the values march steadily downward as we continue to lower tRD. Note also that the change in latency between any two successive steps is always about 2.5ns - the Tcycle value for a 400MHz FSB and the expected change in TRD for a drop in tRD of one. No other single memory-related setting can produce a reduction in read latency of this magnitude, not even the primary memory timings, making tRD unique in this respect. For this reason, tRD is truly the key to unlocking hidden memory performance, much more so than the primary memory timings traditionally associated with latencies.
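
To make the relationship explicit, the sketch below (again, purely illustrative) converts each tRD setting into its equivalent TRD in nanoseconds at 400MHz FSB; every drop of one removes exactly one Tcycle (2.5ns) from the read path:

    # TRD (ns) = tRD (in FSB clocks) x Tcycle, where Tcycle = 1 / FSB frequency
    fsb_mhz = 400
    tcycle_ns = 1000 / fsb_mhz           # 2.5 ns at 400 MHz FSB
    for trd in range(12, 4, -1):         # tRD 12 down to 5
        print(f"tRD {trd:2d} -> TRD = {trd * tcycle_ns:.1f} ns")
    # tRD 12 = 30.0 ns versus tRD 5 = 12.5 ns, a 17.5 ns cut in read latency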


Memory Read Latency - Variable tRD

We realized our best performance by pushing the MCH well beyond its specified range of operation. Not only were we able to overclock the controller to 450MHz FSB, but we also managed to maintain a tRD of 5 (for a TRD of about 11.1ns) at this exceptional bus speed. Using the 3:2 divider and loosening the primary memory timings to 5-5-5-12 allowed us to capture some of the best DDR2 memory bandwidth benchmarks attainable on an Intel platform. As expected, our choice of tRD plays a crucial role in enabling these exceptional results. Screenshots from EVEREST show just how big a difference tRD can make - we have included shots using tRD values of 7, 6, and 5.
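
The arithmetic behind those settings is straightforward; the short sketch below (illustrative only) simply restates the overclocked configuration in numbers:

    # Overclocked configuration: 450 MHz FSB, 3:2 divider, tRD 5
    fsb_mhz = 450
    tcycle_ns = 1000 / fsb_mhz           # ~2.22 ns per FSB clock
    trd_ns = 5 * tcycle_ns               # ~11.1 ns MCH Read Delay
    mem_mhz = fsb_mhz * 3 / 2            # 675 MHz memory clock -> DDR2-1350
    print(f"TRD: {trd_ns:.1f} ns, memory: DDR2-{mem_mhz * 2:.0f}")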

A considerable share of the memory read performance advantage that AMD-based systems hold over Intel-based systems can be directly attributed to the lower memory latency times made possible by the AMD processor's on-die memory controller. So far we have done a lot to show why reducing TRD can make such a positive impact on performance; knowing this, you might suspect that the optimal value would be zero, and you would be right. Eliminating the latency associated with the MCH Read Delay entirely would further reduce total system memory read latency by another 12.5ns (as modeled by the results above).

Given this, Intel-based systems would perform memory read operations roughly on par with last-generation AMD-based systems. Although not the only reason, this is one of the main motivations behind Intel's decision to finally migrate to a direct point-to-point bus interface not unlike the one AMD has used for years. Removing the middleman from each memory access operation will do wonders for performance when Nehalem, Intel's next step in 45nm process technology, hits the shelves in ~Q4'08. Until then we will have to make the best of what we've got.

Comments

  • kjboughton - Sunday, January 27, 2008 - link

    The rules as defined may not apply exactly as provided for P35. The equations have been tested to be true for X38/X48 but additional testing is still needed on P35 in order to validate the results.
  • Super Nade - Saturday, January 26, 2008 - link

    Hi,

    I love the technical depth of the article. Outstanding writeup! I hope you will NOT dumb down future articles, as this is how, IMO, a review should be written.

    S-N
  • Eric Rekut - Saturday, January 26, 2008 - link

    Great article! I have a question: is the X48 faster in Super Pi than the P35/X38?
  • Rajinder Gill - Saturday, January 26, 2008 - link

    Hi,

    In general the X38/X48 chipset outscores the P35 in Super Pi. The X48 can/will pull ahead of the X38 very marginally IF it can handle a lower overall tRD with a higher FSB combination and tighter memory sub-timing ranges - within an available level of Northbridge voltage.

    regards
    Raja
  • Rob94hawk - Saturday, January 26, 2008 - link

    I would love to see you guys do benchmarking and overclocking with the QX9770+DDR3 1800 with this mobo.
  • Rajinder Gill - Saturday, January 26, 2008 - link

    Hi Rob,

    Kris will be testing the Rampage Extreme soon (with DDR3). The QX9770s only show a little more prowess than QX9650s under LN2 cooling (in some instances - not always). With cascade/water/air cooling there's little to separate the QX9650 from the QX9770 (at least in my experience with both processors thus far).


    regards
    Raja

  • enigma1997 - Saturday, January 26, 2008 - link

    Another excellent article after the QX9650 O/C one. Congratulations!!

    I have a few questions: What ram did you use to achieve the amazingly high bandwidth result (the one that goes with the 450FSB and tRD 5)? I understand you are using a divider of 3:2 and CAS5, so I expect the DDR2 speed should be at 10800!!

    Also, I am not sure how you can get a memory read of >9000MB/s with tRD 5. I have a pair of G.Skill F2-8000PHU2-2GBHZ 4-4-4-5 and a DFI X38-T2R motherboard. I set it up with a QX9650 with tRD/FSB/ram timings identical to yours, but I only get around 8800MB/s. Note that the CPU runs at 3000MHz.

    Thanks for the article and your answers to my questions :)
  • kjboughton - Sunday, January 27, 2008 - link

    Memory used for the incredible 450FSB/tRD 5 result was OCZ DDR2 PC-9200 Reaper (2GB kit).

    Regarding the testing you did at equivalent speeds, contrary to popular belief, CPU speed does influence both system memory read latency and bandwidth (add 16 clocks of the CPU's Tcycle to total system latency - about an extra 1.33ns going from the 4GHz where I tested down to the 3GHz used in your system). This is certainly enough to reduce your BW results to below 9GB/s.
  • Jodiuh - Saturday, January 26, 2008 - link

    "we feel there is nothing that needs modification by the end user as long as overclocking aspirations are within reason."

    The current Maximus series requires a bit of work (heat gun, fridge) to pull this off and replace it with the TIM of choice. Also, I noticed a 7C drop on the bench when adding a 5CFM 40mm fan to the NB. Would you mind fleshing out the comment a bit more?

    Thanks for the very thorough information in the article!
  • jedisoulfly - Friday, January 25, 2008 - link

    There is a Patriot Viper DDR3-1600 CL7 kit at Newegg for $295 (out of stock at the time of this post) that is dramatically higher in price than good DDR2-800 or even 1066, but just over a year ago DDR2-800 2GB kits were going for that price. I think once NV and AMD start making chipsets that support DDR3 the prices will start to come down... hopefully.
