Memory Subsystem Overview

We mentioned how changes to the module design can require changes to the memory controller as well. When an address arrives at the memory, it does not simply appear there directly from the CPU; we are really talking about several steps. First, the CPU sends the request to the cache, and if the data is not in the cache, the request is forwarded to the memory controller via the Front Side Bus (FSB). (In some newer systems like the Athlon 64, requests may arrive via a HyperTransport bus, but the net result is basically the same.) The memory controller then sends the request to the memory modules over the memory bus. Once the data is retrieved internally on the memory module, it gets sent from the RAM via the memory bus back to the memory controller. The memory controller then sends it onto the FSB, and eventually, the requested data arrives at the CPU.
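As a rough illustration of that round trip, here is a minimal Python sketch that walks a read request through each hop and totals the delay. The per-hop latencies are placeholder values chosen only to show the idea, not measurements of any real chipset or CPU.

    # Illustrative only: each hop's latency (in nanoseconds) is a placeholder,
    # not a measured figure for any particular chipset or CPU.
    REQUEST_PATH_NS = [
        ("CPU -> cache (miss detected)",           1.0),
        ("Cache -> memory controller (FSB)",      10.0),
        ("Memory controller -> module (mem bus)",  5.0),
        ("DRAM access inside the module",         40.0),
        ("Module -> memory controller (mem bus)",  5.0),
        ("Memory controller -> CPU (FSB)",        10.0),
    ]

    def total_read_latency(path):
        """Sum the per-hop delays to get a ballpark round-trip latency."""
        for hop, ns in path:
            print(f"{hop:42s} {ns:5.1f} ns")
        return sum(ns for _, ns in path)

    print(f"Total round trip: {total_read_latency(REQUEST_PATH_NS):.1f} ns")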

Note that the data could also be requested/sent somewhere else. DMA (Direct Memory Access) allows other devices such as network adapters, sound cards, graphics cards, controller cards, etc. to send requests directly to the memory controller, bypassing the CPU. In this overview, we were talking about the CPU to RAM pathway, but the CPU could be replaced by other devices. Normally, the CPU generates the majority of the memory traffic, and that is what we will mostly cover. However, there are other uses of the RAM that can come into play, and we will address those when applicable.

Now that we have explained how requests actually arrive, we need to cover a few details about how the data is transmitted from the memory module(s). As we said before, when the requested column is ready to be transmitted back to the memory controller, it is sent in "bursts". What this means is that data will be sent on every memory bus clock edge - think of it as a "slot" - for the duration of the RAM's burst length. If the memory bus is running at a different speed than the FSB, though - especially if it's running slower - there can be some additional delays. The significance of these delays varies by implementation, but at best, you will end up with some "bubbles" (empty slots) on the FSB. Consider the following specific example.

On Intel's quad-pumped bus, each non-empty transmission needs to be completely full, so all four slots need to have data. (There are caveats that allow this rule to be "bent", but they incur a loss of performance and so they are avoided whenever possible.) If you have a quad-pumped 200 MHz FSB (the current P4 bus) and the RAM is running on a double-pumped 166 MHz bus, the FSB is capable of transmitting more data than the RAM is supplying. In order to guarantee that all four slots on an FSB clock cycle contain data, the memory controller needs to buffer the data to make sure an "underrun" does not occur - i.e. the memory controller starts sending data and then runs out after the first one or two slots. Each FSB cycle comes at 5 ns intervals, and with a processor running at 3.0 GHz, a delay of 5 ns could mean as many as 15 missed CPU cycles!
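The arithmetic behind that example is easy to verify. The short Python sketch below assumes a 64-bit (8-byte) data path on both buses, which is how the peak figures for a quad-pumped 200 MHz FSB and double-pumped 166 MHz memory bus are derived.

    # Worked numbers for the example above; both buses are assumed to be 64 bits wide.
    FSB_CLOCK_MHZ   = 200     # quad-pumped -> 4 transfers per clock
    MEM_CLOCK_MHZ   = 166     # double-pumped (DDR) -> 2 transfers per clock
    CPU_CLOCK_GHZ   = 3.0
    BUS_WIDTH_BYTES = 8

    fsb_gbs = FSB_CLOCK_MHZ * 1e6 * 4 * BUS_WIDTH_BYTES / 1e9   # ~6.4 GB/s
    mem_gbs = MEM_CLOCK_MHZ * 1e6 * 2 * BUS_WIDTH_BYTES / 1e9   # ~2.7 GB/s per channel

    fsb_cycle_ns    = 1e3 / FSB_CLOCK_MHZ          # 5 ns per FSB clock
    cpu_cycle_ns    = 1.0 / CPU_CLOCK_GHZ          # ~0.33 ns per CPU clock
    cpu_cycles_lost = fsb_cycle_ns / cpu_cycle_ns  # ~15 CPU cycles

    print(f"FSB peak:    {fsb_gbs:.2f} GB/s")
    print(f"Memory peak: {mem_gbs:.2f} GB/s (single channel)")
    print(f"One empty FSB clock = {fsb_cycle_ns:.1f} ns, "
          f"or about {cpu_cycles_lost:.0f} CPU cycles at {CPU_CLOCK_GHZ} GHz")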

There are a couple of options to help speed up the flow of data from the memory controller to the FSB. One is to use dual-channel memory, so the buffer will fill up in half the time. This helps to explain why Intel benefits more from dual-channel RAM than AMD: their FSB and memory controller are really designed for the higher bandwidth. Another option is to simply get faster RAM until it is able to equal the bandwidth of the FSB. Either one generally works well, but having a memory subsystem with less bandwidth than what the FSB can use is not an ideal situation, especially for the Intel design. This is why most people recommend against running your memory and system buses asynchronously. Running RAM that provides a higher bandwidth than what the FSB can use does not really help, other than to reduce latencies in certain situations. If the memory can provide 8.53 GB/s of bandwidth and the FSB can only transmit 6.4 GB/s, the added bandwidth generally goes to waste. For those wondering why benchmarks using DDR2-533 with an 800 FSB P4 do not show much of an advantage for the faster memory, this is the main reason. (Of course, on solutions with integrated graphics, the additional memory bandwidth could be used for graphics work, and in servers, the additional bandwidth can be helpful for I/O.)
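To see where the 8.53 GB/s and 6.4 GB/s figures come from, and why the extra memory bandwidth sits idle, here is a short sketch. It again assumes 64-bit data paths; the 533.33 MT/s and 800 MT/s transfer rates come straight from the example above.

    # Peak-bandwidth comparison for the dual-channel DDR2-533 / 800 MT/s FSB example.
    BUS_WIDTH_BYTES = 8                     # 64-bit data path

    def peak_gbs(mega_transfers_per_sec, channels=1):
        """Peak bandwidth in GB/s for a given transfer rate and channel count."""
        return mega_transfers_per_sec * 1e6 * BUS_WIDTH_BYTES * channels / 1e9

    memory_gbs = peak_gbs(533.33, channels=2)   # ~8.53 GB/s
    fsb_gbs    = peak_gbs(800)                  # 6.4 GB/s

    # The CPU can only consume what the FSB can carry, so the effective ceiling
    # is the smaller of the two numbers; the rest of the memory bandwidth is idle
    # unless something else (integrated graphics, I/O) uses it.
    print(f"Memory peak: {memory_gbs:.2f} GB/s")
    print(f"FSB peak:    {fsb_gbs:.2f} GB/s")
    print(f"Effective ceiling for CPU traffic: {min(memory_gbs, fsb_gbs):.2f} GB/s")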

If you take that entire description of the memory subsystem, you can also see how AMD was able to benefit by moving the memory controller onto the CPU die. Now, the delays associated with the transmission of data over the FSB are almost entirely removed. The memory controller still has to do work, but with the controller running at CPU clock speeds, it will be much faster than before. The remaining performance deficit that Athlon 64 and Opteron processors suffer when running slower RAM can be attributed to the loss of bandwidth and the increased latencies, which we will discuss more in a moment. There are a few other details that we would like to mention first.

Comments

  • 666an666 - Thursday, May 14, 2009 - link

    Thanks for the details. Unfortunately, most sellers of RAM (and most brand packaging) fail to mention these measurement details. They only show obscure model numbers and "PC-3200" or whatever. They usually only offer the choice of various brands, not various CL values.
  • letter rip - Saturday, December 25, 2004 - link

    This is great reading. When's the next installment?
  • Herm0 - Wednesday, November 10, 2004 - link

    There are two things that should greatly improve a DIMM's performance, in addition to the well-known timing figures ("2-2-2-6", etc.), but looking at DIMM specs, they are hard to find:

    - The number of internal banks. When a DIMM uses multiple banks, the DIMM is divided into pieces, each holding its own grid of data and the logic to access it. Going from one bank to another has no penalty: the memory controller has to send the bank address on two physical DIMM pins (so there cannot be more than 4 banks in a DIMM) at each access. Having a 2- or 4-bank DIMM is really like having 2 or 4 DIMMs: while one bank is waiting for a delay to expire (a CAS latency, a RAS latency, a RAS precharge...), the memory controller can send a command or do read/write work on another one... Most manufacturers build 2-bank DIMMs (when they publish that information!); few of them make 4-bank DIMMs.

    - The width of their rows. It is slow to access the first data in a row (1: wait for tRP, the Row Precharge, from the last operation; 2: send the new row address and wait for tRCD, the RAS-to-CAS Delay; 3: send the column address and wait for tCL, the CAS latency, then read the first 64-bit block of data), but it is fast to read from an already activated row (send the starting column, wait for tCL, then read/write data, 1 or 2 transfers per clock (SDRAM or DDR), in the pre-programmed length and order). In an ideal DIMM having only one row, the only penalty would be tCL! The wider a row is, the more data can be accessed before dealing with the row delays (Precharge and RAS-to-CAS). The row size is nearly never published, and I don't know how to get the number from the detailed DIMM/DRAM specs...

    Looking at 1 GB DDR400 DIMM modules too, as in #19, a good one, theoretically, seems to be one of Kingston's DIMMs:
    - Timings = 2.5-3-3-7 (shouldn't the last digit be 2.5+3+2 = 7.5 or 8?); most 1 GB DIMMs are 3-3-3-8 or slower.
    - Banks = 4; most DIMMs, even high-end ones, have only 2 banks.
    - Row size = ??? Unknown...

    Am I right, or do I have to redo the Ars Technica lessons? :-)
  • Gioron - Thursday, September 30, 2004 - link

    In terms of buying 512M of fast memory or 1G of slow memory... here's what a quick look at prices for memory looked like (all Corsair sticks and only from one vendor, because I'm lazy and didn't want to complicate things):
    512M "Value" (CL2.5): $77
    512M "XMS" (CL2): $114
    512M "Xtra low" (2-2-2-5): $135
    1G "Value" kit (CL3, 2x512M):$158

    To me, it looks like the "Xtra low" is indeed not a good bang for the buck, with the 1G upgrade only $20 more. However, the "XMS" 512M might be a good price point if you don't want to go all the way to $158 but have more than $77. Going for insanely low latencies seems to be only worth it if you have plenty of cash to spare and are already at 1G or more. (Or else are optimizing for a single, small application that relies heavily on RAM timings, but I don't think you'll run into that too much in a desktop environment.)

    One thing that might be useful in later articles is a brief discussion of the tradeoffs between size and performance in relation to swapping pages to disk. Not sure if that will fit in with the planned article content, however.
  • JarredWalton - Wednesday, September 29, 2004 - link

    ??? I didn't think I actually started with a *specific* type of RAM - although I suppose it does apply to SDRAM/DDR, it also applies to most other types of RAM at an abstract level. There are lots of abstractions, like the fact that a memory request actually puts the row address and column address on different pins - it doesn't just "arrive". I didn't want to get into really low-level details, but look more at the overall picture. The article was more about the timings and what each one means, but you have to have a somewhat broader understanding of how RAM is accessed before such detail as CAS and RAS can really be explained in a reasonable manner.
  • Lynx516 - Wednesday, September 29, 2004 - link

    Not much has changed fundamentally with SDRAM since the early days of DDR.

    I never actually said a burst was a column, but in fact a continuous set of columns (unless interleaved).

    OK, I admit there aren't many books on processor design and latency; however, there are data sheets and articles that describe the basics. Once you have grasped the basics, you can work it out using the data sheets, etc.

    Probably a better place to start with this series would have been the memory hierarchy instead of starting with a specific type of RAM.
  • JarredWalton - Wednesday, September 29, 2004 - link

    The idea here is to have an article on Anandtech.com. :) I like Ars Technica as much as the next guy, but there are lots of different ways of describing technology. Sometimes you just have to write a new article covering information available elsewhere, you know? How many text books are there on processor design and latency? Well, here's another article discussing memory. Also worth noting is that Ars hasn't updated their memory information since the days of SDRAM and DDR (late 2000), and things certainly have changed since then.

    I should clarify my last comment I made: the column width of DDR is not really 32 bytes or 64 bytes, but that seems to be how many memory companies now refer to it in *layman's* terms. This article is much more of a layman's approach. The deep EE stuff on how everything works is more than most people really want to know or understand (for better or for worse). A column can also be regarded as each piece of a burst, which is probably the correct terminology. We'll be looking at various implementations in the next article - hopefully stuff that you haven't read a lot about yet. :)
  • greendonuts3 - Tuesday, September 28, 2004 - link

    Meh. You kind of started in the middle of the topic and worked your way outward/backward/forward. As a general user, I found the wealth of info more confusing than helpful in understanding ram. Maybe you could focus just on timing issues, which seems to be your intent, and refer the reader to other articles (eg the Ars one mentioned above) for the basics?
    Thanks.
  • JarredWalton - Tuesday, September 28, 2004 - link

    The comparison with set associativity is not that bad, in my opinion. What you have to remember is that we would then be talking about a direct-mapped cache with a whopping four entries (one per sense amp/active row). I guess I didn't explain it too well, and it's not a perfect match, true.

    Regarding burst lengths, each burst is not a column of information, although perhaps it was on older RAM types. For instance, the burst length of DDR can be 4 or 8. Each burst transmits (in the case of single-channel configurations) 64 bits of data, or 8 bytes. The column size is not 8 bytes these days, however - it is either 32 bytes or 64 bytes on DDR. (Dual-channel would effectively double those values.)
  • ss284 - Tuesday, September 28, 2004 - link

    I wouldn't say that the article is that confusing, but there is much truth in the post above ^^^.

    -Steve
