More GDDR5 Technologies: Memory Error Detection & Temperature Compensation

As we previously mentioned, for Cypress AMD’s memory controllers have implemented a greater part of the GDDR5 specification. Beyond gaining the ability to use GDDR5’s power saving abilities, AMD has also been working on implementing features to allow their cards to reach higher memory clock speeds. Chief among these is support for GDDR5’s error detection capabilities.

One of the biggest problems in using a high-speed memory device like GDDR5 is that it requires a bus that’s both fast and fairly wide - properties that generally run counter to each other in designing a device bus. A single GDDR5 memory chip on the 5870 needs to connect to a bus that’s 32 bits wide and runs at base speed of 1.2GHz, which requires a bus that can meeting exceedingly precise tolerances. Adding to the challenge is that for a card like the 5870 with a 256-bit total memory bus, eight of these buses will be required, leading to more noise from adjoining buses and less room to work in.

Because of the difficulty in building such a bus, the memory bus has become the weak point for video cards using GDDR5. The GPU’s memory controller can do more and the memory chips themselves can do more, but the bus can’t keep up.

To combat this, GDDR5 memory controllers can perform basic error detection on both reads and writes by implementing a CRC-8 hash function. With this feature enabled, for each 64-bit data burst an 8-bit cyclic redundancy check hash (CRC-8) is transmitted via a set of four dedicated EDC pins. This CRC is then used to check the contents of the data burst, to determine whether any errors were introduced into the data burst during transmission.

The specific CRC function used in GDDR5 can detect 1-bit and 2-bit errors with 100% accuracy, with that accuracy falling with additional erroneous bits. This is due to the fact that the CRC function used can generate collisions, which means that the CRC of an erroneous data burst could match the proper CRC in an unlikely situation. But as the odds decrease for additional errors, the vast majority of errors should be limited to 1-bit and 2-bit errors.

Should an error be found, the GDDR5 controller will request a retransmission of the faulty data burst, and it will keep doing this until the data burst finally goes through correctly. A retransmission request is also used to re-train the GDDR5 link (once again taking advantage of fast link re-training) to correct any potential link problems brought about by changing environmental conditions. Note that this does not involve changing the clock speed of the GDDR5 (i.e. it does not step down in speed); rather it’s merely reinitializing the link. If the errors are due the bus being outright unable to perfectly handle the requested clock speed, errors will continue to happen and be caught. Keep this in mind as it will be important when we get to overclocking.

Finally, we should also note that this error detection scheme is only for detecting bus errors. Errors in the GDDR5 memory modules or errors in the memory controller will not be detected, so it’s still possible to end up with bad data should either of those two devices malfunction. By the same token this is solely a detection scheme, so there are no error correction abilities. The only way to correct a transmission error is to keep trying until the bus gets it right.

Now in spite of the difficulties in building and operating such a high speed bus, error detection is not necessary for its operation. As AMD was quick to point out to us, cards still need to ship defect-free and not produce any errors. Or in other words, the error detection mechanism is a failsafe mechanism rather than a tool specifically to attain higher memory speeds. Memory supplier Qimonda’s own whitepaper on GDDR5 pitches error correction as a necessary precaution due to the increasing amount of code stored in graphics memory, where a failure can lead to a crash rather than just a bad pixel.

In any case, for normal use the ramifications of using GDDR5’s error detection capabilities should be non-existent. In practice, this is going to lead to more stable cards since memory bus errors have been eliminated, but we don’t know to what degree. The full use of the system to retransmit a data burst would itself be a catch-22 after all – it means an error has occurred when it shouldn’t have.

Like the changes to VRM monitoring, the significant ramifications of this will be felt with overclocking. Overclocking attempts that previously would push the bus too hard and lead to errors now will no longer do so, making higher overclocks possible. However this is a bit of an illusion as retransmissions reduce performance. The scenario laid out to us by AMD is that overclockers who have reached the limits of their card’s memory bus will now see the impact of this as a drop in performance due to retransmissions, rather than crashing or graphical corruption. This means assessing an overclock will require monitoring the performance of a card, along with continuing to look for traditional signs as those will still indicate problems in memory chips and the memory controller itself.

Ideally there would be a more absolute and expedient way to check for errors than looking at overall performance, but at this time AMD doesn’t have a way to deliver error notices. Maybe in the future they will?

Wrapping things up, we have previously discussed fast link re-training as a tool to allow AMD to clock down GDDR5 during idle periods, and as part of a failsafe method to be used with error detection. However it also serves as a tool to enable higher memory speeds through its use in temperature compensation.

Once again due to the high speeds of GDDR5, it’s more sensitive to memory chip temperatures than previous memory technologies were. Under normal circumstances this sensitivity would limit memory speeds, as temperature swings would change the performance of the memory chips enough to make it difficult to maintain a stable link with the memory controller. By monitoring the temperature of the chips and re-training the link when there are significant shifts in temperature, higher memory speeds are made possible by preventing link failures.

And while temperature compensation may not sound complex, that doesn’t mean it’s not important. As we have mentioned a few times now, the biggest bottleneck in memory performance is the bus. The memory chips can go faster; it’s the bus that can’t. So anything that can help maintain a link along these fragile buses becomes an important tool in achieving higher memory speeds.

Lower Idle Power & Better Overcurrent Protection Angle-Independent Anisotropic Filtering At Last
Comments Locked

327 Comments

View All Comments

  • erple2 - Tuesday, September 29, 2009 - link

    What the heck are you talking about? Are you saying that electricity consumed by a device divided by the "volume" of the device is the only way to measure the heat output of the device? Every single Engineering class I took tells me that's wrong, and I'm right. I think you need to take some basic courses in Electrical Engineering and/or Thermodynamics.

    (simplified)
    power consumed = work + waste

    You're looking for the waste heat generated by the device. If something can completely covert every watt of electricity that passes through it to do some type of work (light a light bulb, turn a motor, make some calculation on a GPU etc), then it's not going to heat up. As a result, you HAVE to take into consideration how inefficient the particular device is before you can make any claim about how much the device heats up.

    I'll bet that if you put a Liquid Nitrogen cooler on every ATI card, and used the standard air coolers on every NVidia card, that the ATI cards are going to run crazy cooler than the NVidia cards.

    Ultimately the temperature of the GPU's depends a significant amount on the efficiency of the cooler, and how much heat the GPU is generating as waste. My point is that we don't have enough data to determine whether the ATI die runs hot because the coolers are less than ideal, Nvidia ones are closer to ideal, the die is smaller, or whatever you have. You have to look at a combination of the efficiency of the die (how well it converts input power to "work done"), the efficiency of the cooler (how well it removes heat from it's heat source), and the combination of the two.

    I'd posit that the ATI card is more efficient than the NVidia card (at least in WoW, the only thing we have actual numbers of the "work done" and "input power consumed").

    Now, if you look at the measured temperature of the core as a means of comparing the worthiness of one GPU over another, I think you're making just as meaningful a comparison as comparing the worthiness of the GPU based on the size of the retail box that it comes in.
  • SiliconDoc - Friday, September 25, 2009 - link

    You simply repeated my claim about watts, and replaced core size, with fps, and created a framerate per watt chart, that has near nothing to do with actual heat inside the die, since the SIZE of the die, vs the power traversing through it is the determining factor, affected by fan quality (ram size as well).
    Your argument is "framerate power efficiency", as in watts per framerate, and has nothing to do with core temperature (modified by fan cooling of course to some degree), that the article indeed posts except for the two failed ati cards.
    The problem with your flawwed "science" that turns it into hokum, is that no matter what outputs on the screen, the HEAT generated by the power consumption of the card itself, remains in the card, and is not "pumped through the videoport to the screen".
    If you'd like to claim "wattage vs framerate" efficency for 5870, fine I've got no problem, but claiming that proves core temps are not dependent on power consumption vs die size ( modified by the rest of the card *mem size/power useage/ and the fan heatsink* ) is RIDICULOUS.
    ---
    The cards are generally equivalent manufacturing and component additions, so you take the wattage consumed (by the core) and divide by core size, for heat density.
    Hence, ATI cards, smaller cores and similar power consumption, wind up hotter.
    That's what the charts show, that's what should be stated, that is the rule, and that's the way it plays in the real world, too.
    ---
    The only modification to that is heatsink fan efficiency, and I don't find you fellas claiming stock NVIDIA fans and heatsinks are way better than the ATI versions, hence 66C for NVIDIA, 75C, 85C, etc, and only higher for ATI, in all their cards listed.
    Would you like to try that one on for size ? Should I just make it known that NVIDIA fans and heatsinks are superior to ATI ?
    What is true is a lager surface area (die side squared) dissipates the same amount of heat easier, and that of course is what is going on.
    ATI dies are smaller ( by a marked surface area as has so foten been pointed out), and have similar power consumption, and a higher DENSITY of heat generation, and therefore run hotter.
  • erple2 - Friday, September 25, 2009 - link

    Oops, "milliwatt" should be "kilowatt". I got the decimal place mixed up - I used kilowatt since I thought it was easier to see than 0.247, 0.140, 0.137, 0.181...
  • SiliconDoc - Wednesday, September 23, 2009 - link

    Let's take that LOAD TEMP chart and the article's comments. Right above it, it is stated a good cooler includes the 4850 that ILDE TEMPs in at around 40C (it's actually 42C the highest of those mentioned).
    "The floor for a good cooler looks to be about 40C, with the GTS 250(39C), 3870(41C), and 4850 all turning in temperatures around here"
    OK, so the 4850 has a good cooler, as well as the 3870... then right below is the LOAD TEMP.. and the 4850 is @ 90C -OBVIOUSLY that good cooler isn't up to keeping that tiny hammered core cool...

    3870 is at 89C, 4870 is at 88C, 5870 is at 89C ALL ati....
    but then, nvidia...
    250, 216, 285, 275 all come in much lower at 66C to 85C.... but "temps are all over the place".
    NOT only that crap, BUT the 4890 and 4870x2 are LISTED but with no temps - and take the "coolest position" on the chart!
    Well we KNOW they are in the 90C range or higher...
    So, you NEVER MENTION why 4870x2 and 4980 are "no load temp shown in the chart" - you give them the WINNING SPOTS anyway, you fail to mention the 260's 65C lowest LOAD WIN and instead mention GTX275 at 75C...LOL

    The bias is SO THICK it's difficult to imagine how anyone came up with that CRAP, frankly.
    So the superhot 4980 and 4870x2 are given #1 and #2 spots repsectively, a free ride, the other Nvidia cards KICK BUTT in lower load temps EXCEPT the 295, but it makes sure to mention the 8800GT while leaving the 4980 and 4870x2 LOAD TEMP spots blank ?
    roflmao
    ---
    What were you saying about "why" ? If why the 8800GT was included is TRUE, then comment on the gigantic LOAD TEMP bias... tell me WHY.
  • SiliconDoc - Wednesday, September 23, 2009 - link

    AND, you don't take temps from WOW to use for those two, which no doubt even though it is NOT gpu stressing much, will yeild the 90C for those two cards 4870x2 and 4980, anyway.
    So they FAIL the OCCT, but you have NOTHING on them, which would if listed put EVERY SINGLE ATI CARD @ near 90C LOAD, PERIOD...
    ---
    And we just CANNOT have that stark FACT revelaed, can we ? I mean I've seen this for well over a year here now.
    LET's FINALLY SAY IT.
    ---
    LOAD TEMPS ON THE ATI CARDS ARE ALL, EVERY SINGLE ONE NEAR 90c, much higher than almost ALL of the Nvidia cards.
  • pksta - Thursday, September 24, 2009 - link

    I just want to know...With this much zeal about videocards and more specifically the bias that you see, doesn't it make you sound biased too? Can you say that you have owned the cards you are bashing and seen the differences firsthand? I can say I did. I had an 8800 GT and it was running in the upper 80s under load. I switched to my 4850 with the worst cooler I think I've ever seen mind you, and it stays in the mid to upper 60s under load. The cooler on the 8800 gt was the dual-slot design that was the original reference design. The 4850 had the most pathetic fan I've ever seen. It was similar to the fan and heatsink Intel used on the first Core2 stuff. It was the really cheap aluminum with a tiny copper circle that made contact with the die itself. Now, don't get me wrong I love ATI...But I also love nVidia...Anything that keeps games getting better and prices getting better. I honestly don't think, though, that the article is too biased. I think maybe a little for ATI but nothing to rage on and on about. Besides...Calm down. You know nVidia will have a response for this.
  • SiliconDoc - Sunday, September 27, 2009 - link

    1. Who cares what you think about how you percieve me ? Unless you have a fact to refute, who cares ? What is biased ? There has been quite a DISSSS on PhysX for quite some time here, but the haters have no equal alternative - NOTHING that even comes close. Just ASK THEM. DEAD SILENCE. So, strangely, the very best there is, is BAD.

    Now ask yourself again who is biased, won't you? Ask yourself who is spewing out the endless strings... Do yourself a favor and figure it out. Most of them have NEVER tried PhysX ! They slip up and let it be known, when they are slamming away. Then out comes their PC hate the greedy green rage, and more, because they have to, to fit in the web PC code, instead of thinking for themselves.

    2. Yes, I own more cards currently than you will in your entire life. I started retail computer well over a decade ago.

    3. And now, the standard red rooster tale. It sounds like you were running in 2d clocks 100% of the time, probably on a brand board like a DELL. Happens a lot with red cards. Users have no idea.
    4850 with The worst fan in the World ! ( quick call Keith Olbermann) and it was ice cold, a degree colder than anything else in the review. ROFLMAO
    Once again, the red shorts pinnocchio tale. Forgive me while I laugh, again !
    ROFLMAO
    Did you ever put your finger on the HS under load ? You should have. Did you check your 3D mhz..
    http://forums.anandtech.com/messageview.aspx?catid...">http://forums.anandtech.com/messageview.aspx?catid...
    Not like 90C is offbase, not like I made up that forum thread.

    4. I could care less if nvidia has a response or not. Point is, STOP LYING. Or don't. I certainly have noticed many of the lies I've complained about over a year or so have gone dead silent, they won't pull it anymore, and in at least one case, used in reverse for more red bias, unfortunately, before it became the accepted idea.

    So, I do a service, at the very least people are going to think, and be helped, even if they hate me.
  • SiliconDoc - Wednesday, September 23, 2009 - link

    Well of course that's the excuse, but I'll keep my conclusion considering how the last 15 reviews on the top videocards were done, along with the TEXT that is pathetically biased for ati, that I pointed out. (Even though Derek was often the author).
    --
    You want ot tell me how it is that ONLY the GTX295 is near or at 90C, but ALL the ati cards ARE, and we're told "temperatures are all over the place" ?
    Can you really explain that, sir ?
  • 529th - Wednesday, September 23, 2009 - link

    holy shit, a full review is up already!
  • bill3 - Wednesday, September 23, 2009 - link

    Does the article keep referring to Cypress as "too big"? If Cypress is too big, what the hell is GT200 at 480mm^2 or whatever it was? Are you guys serious with that crap?

    I've heard that the "sweet spot" talk from AMD was a bit of a misdirection from the start anyway. IMO if AMD is going to compete for the performance crown or come reasonably close (and frankly, performance is all video card buyers really care about, as we see with all the forum posts only mentioning that GT300 will supposedly be faster than 58XX and not anything else about it) then they're going to need slightly bigger dies. So Cypress being bigger is a great thing. If anything it's too small. Imagine the performance a 480mm^2 Cypress would have! Yes, Cypress is far too small, period.

    Personally it's wonderful to see AMD engineer two chips this time, a bigger performance one and smaller lower end one. This works out far better all around.

    The price is also great. People expecting 299 are on crack.

Log in

Don't have an account? Sign up now