Diving Deeper: The Maxwell 2 Memory Crossbar & ROP Partitions

Now that we have a basic understanding of the proper specifications of the GTX 970, we can dig deeper into its architecture and layout. As we mentioned previously, the GTX 970 ships with an unusual configuration, one we’ve never quite seen before. In it, thanks to a new feature of the Maxwell architecture, NVIDIA has been able to disable part of a ROP/Memory Controller partition, something that they could not do on Kepler or earlier generation designs. As a result, the intended behavior of the GTX 970 is different from that of the GTX 980 or any theoretical Maxwell part with a completely disabled ROP/MC partition.

The biggest impact of this configuration is that it creates the segmented memory conditions NVIDIA outlined in their earlier statement. Due to performance issues from the unbalanced ROP/MC partition, NVIDIA segments the memory into a high-performance 3.5GB segment – what they refer to as segment 0 – and a low-performance 512MB (0.5GB) segment containing the rest of the RAM. It is this segmentation that results in some of the previously-unusual memory allocation behaviors and memory bandwidth benchmarks on the GTX 970.

But in order to understand just why the segments exist, how they are allocated, and how they influence performance, we first need to learn about the Maxwell memory crossbar.


GTX 970 Memory Crossbar

In the above diagram, NVIDIA illustrates at a high level how the Maxwell memory crossbar is laid out, and specifically how it’s configured for GTX 970. In it we have the 16 GM204 SMMs at the top connected to the crossbar, while below we have 4 ROP/MC partitions, with each partition divided up into individual ROP/L2 units and memory controllers. Note that the specific units disabled on any GTX 970 chip will vary – it’s not necessarily the 1/3/5 SMMs or the last ROP/MC partition – but from a high-level overview this is consistent for GTX 970 regardless of which specific physical units have been disabled.

In any case, while the diagram shows how the SMMs and ROP/MC partitions are connected via the crossbar, memory segmentation is really a story about the ROP/MC partitions. The SMMs do not play a role in how GTX 970’s memory is configured – the crossbar abstracts all of that – and the crossbar itself matters only inasmuch as the ROP/MC partitions are connected to it.

When it comes to the ROP/MC partitions then, we can see first-hand in the GTX 970 what’s new in the architecture. In Kepler and previous generation designs, NVIDIA’s granularity for disabling ROP/MC partitions was all or nothing – either the entire partition was turned on or the entire partition was turned off. However starting with Maxwell, NVIDIA has gained the ability to disable the individual ROP/L2 units within a partition, allowing one of the two ROP/L2 units to be disabled. This essentially introduces a 3rd possible option: a partially-enabled ROP/MC partition, backed by two memory controllers and a single ROP/L2 unit. Meanwhile the memory controllers are still tied together, and disabling one of them requires disabling the entire partition.

NVIDIA ROP/Memory Controller Configuration Options
Kepler | Maxwell
ROP/MC Partition Fully Enabled | ROP/MC Partition Fully Enabled
N/A | ROP/MC Partition Partially Disabled
ROP/MC Partition Fully Disabled | ROP/MC Partition Fully Disabled

It’s this option that the GTX 970 leverages, and in doing so becomes the first Maxwell part to use this feature. NVIDIA has disabled 1 ROP/L2 unit, removing 8 “ROPs” (or rather 1 unit capable of 8 pixels/clock) and 256KB of L2 cache from the GTX 970. This is why the GTX 970 is left with 56 ROPs spread over the 4 partitions, along with 1.75MB of L2 cache.
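The ROP and L2 totals follow directly from the unit counts described above. As a quick sanity check, here is a minimal sketch of that arithmetic; the constant and function names are our own, and only the per-unit figures come from the text.

```python
# Figures taken from the text: GM204 has 4 ROP/MC partitions, each with
# 2 ROP/L2 units; each unit handles 8 pixels/clock and carries 256KB of L2.
PARTITIONS = 4
UNITS_PER_PARTITION = 2
ROPS_PER_UNIT = 8
L2_KB_PER_UNIT = 256


def rop_l2_totals(disabled_units: int) -> tuple[int, float]:
    """Return (ROP count, L2 cache in MB) with some ROP/L2 units disabled."""
    active = PARTITIONS * UNITS_PER_PARTITION - disabled_units
    return active * ROPS_PER_UNIT, active * L2_KB_PER_UNIT / 1024


print(rop_l2_totals(0))  # fully enabled GM204 (GTX 980)
print(rop_l2_totals(1))  # one ROP/L2 unit disabled (GTX 970)
```

With no units disabled this gives 64 ROPs and 2MB of L2; with one unit disabled it gives the GTX 970's 56 ROPs and 1.75MB.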

Meanwhile there’s one other new feature here that’s activated only on the partially disabled partition, and that’s the link between the first and second units of the ROP partition. Typically each ROP/L2 unit would have a link to a port on the crossbar and a link to its own dedicated 32-bit memory controller channel; however because GTX 970 disabled a ROP/L2 unit, the “buddy” link comes into play. This link is essentially the lynchpin of Maxwell’s new partial disable functionality, and allows the second half of the memory controller to stay active. This link only needs to be active when a ROP/L2 unit is disabled, and NVIDIA has confirmed that it is a full bandwidth link identical to the normal ROP/L2 to MC link, meaning it’s capable of four 32-byte requests per clock (2 reads and 2 writes). Ultimately this link is what makes a partially disabled partition possible, and is also what makes it possible to have the full 256-bit memory bus present and active in spite of the lack of a ROP/L2 unit and its associated crossbar port.

Finally, because of this configuration and the lack of a ROP/L2 unit we get to the memory segments. Although the full 256-bit memory bus is present and active on GTX 970 and capable of providing 224GB/sec of combined memory bandwidth between the DRAM modules and the memory controllers, it’s a bit of a misnomer to say the card has that much bandwidth between the memory controllers and everything else, at least in the same sense that the GTX 980 does.

Like virtually every other operation on a GPU, memory accesses are done in parallel to improve throughput. On a fully enabled card such as GTX 980 these operations are striped across all of the memory controllers in what NVIDIA calls a 1KB stride. In doing this, each and every 32-bit memory channel needs a direct link to the crossbar through its partner ROP/L2 unit. However in the case of the GTX 970 a wrench is thrown into the works, as there are 7 crossbar ports and 8 memory channels.

Ultimately due to the design of the crossbars and the memory controllers, it is not possible for 1 crossbar port to carry the full load of 2 memory channels in all circumstances. The crossbar port and its attached ROP/L2 unit can access both memory channels at once, splitting up the 4 operations among them, but there is only 1 read return bus and 1 write data bus, and hence in practice it cannot issue identical operations to both memory channels at once. As a result NVIDIA has segmented the GTX 970’s memory into the now-familiar 3.5GB and 512MB segments. In the case of the 3.5GB segment, this behaves otherwise identically to a fully enabled card such as the GTX 980, with the 1KB stride being striped over 7 crossbar ports, and hence 7 DRAM modules. Meanwhile the 8th and final DRAM module sits in its own 512MB segment, and must be addressed by the crossbar on its own.
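To make the striping concrete, here is a rough sketch of how an address might map to a segment and memory channel. NVIDIA has not published the actual mapping, so everything beyond the 1KB stride and the 7 + 1 channel split is an assumption on our part: we assume simple round-robin striping, that the 3.5GB segment occupies the low addresses, and that the channels are numbered 0 through 7. The function name locate is hypothetical.

```python
STRIDE = 1024                      # 1KB stride, per the text
SEG0_CHANNELS = 7                  # channels with their own crossbar port
SEG0_SIZE = 3584 * 1024 * 1024     # 3.5GB fast segment
TOTAL_SIZE = 4096 * 1024 * 1024    # 4GB total


def locate(addr: int) -> tuple[int, int]:
    """Map a byte address to (segment, channel) under the assumed layout."""
    if not 0 <= addr < TOTAL_SIZE:
        raise ValueError("address outside the 4GB range")
    if addr < SEG0_SIZE:
        # Assumed round-robin: consecutive 1KB strides rotate over 7 channels.
        return 0, (addr // STRIDE) % SEG0_CHANNELS
    # Everything above 3.5GB lives on the 8th channel by itself.
    return 1, 7


print(locate(0))          # first stride of the fast segment
print(locate(7 * 1024))   # eighth stride wraps back to channel 0
print(locate(SEG0_SIZE))  # first byte of the 512MB segment
```

The point of the sketch is only that a large access inside the 3.5GB segment fans out across 7 channels, while any access to the 512MB segment lands on a single channel.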

This in turn is why the 224GB/sec memory bandwidth number for the GTX 970 is technically correct and yet still not entirely useful as we move past the memory controllers, as it is not possible to actually get that much bandwidth at once when doing a pure read or a pure write. In the case of pure reads for example, GTX 970 can read the 3.5GB segment at 196GB/sec (7GHz * 7 ports * 32 bits / 8 bits per byte), or it can read the 512MB segment at 28GB/sec, but it cannot read from both at once; it is a true XOR situation. The same is also true for writes, as only one segment can be written to at a time.
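For reference, the bandwidth figures can be reproduced with a few lines of arithmetic. This is a minimal sketch using the numbers quoted in the text; the helper name is ours.

```python
DATA_RATE_GBPS = 7        # effective GDDR5 data rate per pin ("7GHz" in the text)
CHANNEL_WIDTH_BITS = 32   # width of one memory channel


def bandwidth_gb_per_sec(channels: int) -> float:
    """Peak bandwidth in GB/sec for a given number of 32-bit channels."""
    return DATA_RATE_GBPS * channels * CHANNEL_WIDTH_BITS / 8


print(bandwidth_gb_per_sec(7))  # 3.5GB segment
print(bandwidth_gb_per_sec(1))  # 512MB segment
print(bandwidth_gb_per_sec(8))  # full 256-bit bus
```

These work out to 196, 28 and 224GB/sec respectively, with the last figure only reachable when both segments are busy at the same time.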

Unfortunately what this means is that accessing the weaker 512MB segment blocks access to the stronger 3.5GB segment if both memory operations are identical; or put another way, using the 512MB segment can harm the performance of the 3.5GB segment. For example, if we want to issue reads to both segments at once, reading the 512MB segment blocks any other reads to the 3.5GB segment for that cycle. If the 3.5GB segment is blocked in this fashion and doesn't have a non-blocking write to work on instead, it would have to go idle for that cycle, which would reduce the effective memory bandwidth of the 3.5GB segment. This means that taken over time in our example, the larger the percentage of the time the crossbar is reading the 512MB segment, the lower the effective read memory bandwidth would be from the 3.5GB segment.
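A simple way to picture this is as time-sharing. The sketch below is a deliberately simplified model of our own, not anything NVIDIA has described: it assumes a pure-read workload with no writes available to fill the gaps, and that bandwidth scales linearly with the share of time each segment gets.

```python
FAST_READ_GB_S = 196  # 3.5GB segment, from the text
SLOW_READ_GB_S = 28   # 512MB segment, from the text


def read_bandwidth(slow_fraction: float) -> tuple[float, float]:
    """Return (effective 3.5GB-segment read bandwidth, combined read bandwidth)
    when the crossbar spends `slow_fraction` of its time reading the 512MB
    segment. Assumes pure reads and linear time-sharing."""
    if not 0.0 <= slow_fraction <= 1.0:
        raise ValueError("slow_fraction must be between 0 and 1")
    fast = FAST_READ_GB_S * (1.0 - slow_fraction)
    slow = SLOW_READ_GB_S * slow_fraction
    return fast, fast + slow


for fraction in (0.0, 0.25, 0.5, 1.0):
    print(fraction, read_bandwidth(fraction))
```

Under these assumptions, spending a quarter of the time on the 512MB segment drops the 3.5GB segment to 147GB/sec and the combined read rate to 154GB/sec, which is why touching the slow segment is costly for pure reads.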

Despite all of this, achieving peak memory bandwidth performance on the GTX 970 is still possible, but it requires much more effort since simple striping will not do the trick. The easiest and most effective solution in this regard is to interleave reads and writes over the segments, such that one segment is writing while another segment is reading. Interleaving in this fashion allows both segments to work at once – avoiding the blocking effect of the shared read and write buses – and makes it more likely that both segments are doing useful work rather than waiting for their turn on an operation. However because this is only applicable to situations where more than 3.5GB of VRAM is in use and both segments are necessary, this means it's only theoretically possible to achieve 224GB/sec when more than 3.5GB of VRAM is in use. In any situation below 3.5GB we are de facto limited to just the larger segment, in which case there are only 7 memory channels (196GB/sec) to stripe memory operations across. NVIDIA could of course interleave operations sooner than that and use both segments more often, but due to the blocking effect we've discussed before the performance hit from using the 512MB segment can quickly become greater than any gains.
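The benefit of interleaving can be illustrated with a toy schedule. This is only a sketch under our own assumptions: time is divided into equal slices, each segment issues at most one kind of operation per slice, and when both segments want the same kind of operation the 512MB segment wins and the 3.5GB segment idles, as in the read example above. The names are hypothetical.

```python
SEG_BW = {"fast": 196, "slow": 28}  # GB/sec per segment, from the text


def slice_bandwidth(fast_op, slow_op):
    """Bandwidth achieved in one time slice; ops are 'read', 'write' or None."""
    if fast_op is None and slow_op is None:
        return 0
    if slow_op is None:
        return SEG_BW["fast"]
    if fast_op is None:
        return SEG_BW["slow"]
    if fast_op == slow_op:
        # Identical operations contend for the shared bus; in this model the
        # slow segment proceeds and the fast segment sits idle.
        return SEG_BW["slow"]
    # One segment reads while the other writes, so both can proceed.
    return SEG_BW["fast"] + SEG_BW["slow"]


def average_bandwidth(schedule):
    """Average bandwidth over a list of (fast_op, slow_op) time slices."""
    return sum(slice_bandwidth(f, s) for f, s in schedule) / len(schedule)


clashing = [("read", "read"), ("write", "write")]
interleaved = [("read", "write"), ("write", "read")]
print(average_bandwidth(clashing))
print(average_bandwidth(interleaved))
```

In this model the clashing schedule averages 28GB/sec while the interleaved one reaches the full 224GB/sec, which is the best case and depends on the workload having a suitable mix of reads and writes for both segments.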

The end result of all of this is that the GTX 970 has a unique memory layout that needs to be accounted for by the operating system and NVIDIA’s drivers, as the use of multiple segments adds a new level of complexity to optimizing memory performance. The two segments are unequal in size and performance, with the 3.5GB segment being larger, faster, and otherwise preferable to the 512MB segment. This means that from a logical hierarchical perspective, the 512MB segment essentially serves as an additional layer of memory between the main VRAM and system memory, being slower than the 3.5GB segment but offering almost 2x the performance of going out over the PCIe bus to system memory.
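Treating the segments as tiers suggests a simple fill order. The sketch below is purely illustrative and is not NVIDIA's driver logic: it assumes a first-fit policy that prefers the 3.5GB segment, spills to the 512MB segment, and falls back to system memory, with sizes in MB and names of our own choosing.

```python
# (name, capacity in MB), ordered from most to least preferable.
TIERS = [("segment 0", 3584), ("segment 1", 512)]


def place(allocations_mb):
    """Assign each allocation to the first tier with room, in preference order."""
    free = {name: capacity for name, capacity in TIERS}
    placement = []
    for size in allocations_mb:
        for name, _ in TIERS:
            if free[name] >= size:
                free[name] -= size
                placement.append(name)
                break
        else:
            # Neither VRAM segment has room, so go out over PCIe.
            placement.append("system memory")
    return placement


print(place([3000, 500, 400, 300]))
```

Here the first two allocations land in the 3.5GB segment, the third spills into the 512MB segment, and the fourth ends up in system memory. A real driver would also weigh how hot each resource is, which this sketch ignores.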

The upshot here is that while this is the first time NVIDIA has used this specific ROP/MC configuration in a product, this is not the first product they have designed with segmented or otherwise unbalanced memory configurations. Since the GTX 500 series, on some midrange SKUs NVIDIA has used unbalanced/asymmetrical memory configurations, most recently on the GTX 660 and GTX 660 Ti. In the case of both of those cards, NVIDIA utilized a 192-bit memory bus with 2GB of VRAM attached, which meant that some memory controllers had more VRAM attached to them than others. While NVIDIA has never explained in depth how they handle memory allocation on those cards, it turns out that it’s very similar to GTX 970’s memory segmentation. Which is to say that NVIDIA actually has multiple generations of experience with segmented memory, and this is not the first time they have implemented it. Rather, this is the first time we’ve seen such a configuration on a high-performance card such as the GTX 970.


For Comparison: GTX 660 Ti's Unbalanced Memory Bus

NVIDIA Maxwell Family Memory Controller Configurations
Part | Memory Segments
GTX 980 | 1 Segment (4x8 MC)
GTX 970 | 2 Segments (4x7 MC)
GTX 980M | 1 Segment (4x8 MC)
GTX 970M | 1 Segment (3x6 MC)
GTX 965M | 1 Segment (2x4 MC)

As for why NVIDIA is using such a configuration here, the crux of the matter is money and yields. Without the ability to partially disable a ROP/MC partition, NVIDIA would either have to spec a card to use a fully enabled partition – essentially reducing yields for that card and driving up costs – or disable the entire partition and lose all of the benefits of the additional ROPs, memory, and the memory controller. This finer granularity allows NVIDIA to better control how they harvest bad chips and what resulting configurations they bring to market, along with making a single ROP/L2 defect less harmful to overall performance by keeping the rest of a partition online. Otherwise, to stick with a “balanced” configuration with as many crossbar ports as DRAM modules would result in either a higher spec GTX 970, or a lower spec card with a 192-bit memory bus.

In the grand scheme of things then, depending on how you want to define memory capacity the GTX 970 can be said to have either 3.5GB of VRAM or 4GB of VRAM. Only the first 3.5GB segment is a high-speed (full bandwidth) segment, and for naïve operations this is the segment that software will want to use. However the final 512MB segment is accessible, and despite its lower performance it is still usable and still far better than going to system memory, in which case GTX 970 has and can access a full 4GB of VRAM. This means that both answers can be correct, and like so many other facets of modern technology, the true answer is often more complex than a single number.


398 Comments


  • Will Robinson - Wednesday, January 28, 2015 - link

    You're going to love this then...
    http://gamenab.net/2015/01/26/truth-about-the-g-sy...
  • Oxford Guy - Thursday, January 29, 2015 - link

    Fascinating link, for sure.
  • mudz78 - Wednesday, January 28, 2015 - link

    "we have also been working on cooking up potential corner cases for the GTX 970 and have so far come up empty"

    Riiight.

    "As part of our discussion with NVIDIA, they laid out the fact that the original published specifications for the GTX 970 were wrong, and as a result the “unusual” behavior that users had been seeing from the GTX 970 was in fact expected behavior for a card configured as the GTX 970 was."

    Nvidia has already admitted they had complaints about performance.

    If you want to come up with scenarios where the 970 shits its pants you should really try harder:

    http://www.overclock.net/t/1535502/gtx-970s-can-on...

    http://forums.guru3d.com/showthread.php?t=396064

    https://www.reddit.com/r/hardware/comments/2s333r/...

    http://www.reddit.com/r/pcgaming/comments/2s2968/g...

    All of those threads have been around for weeks before Nvidia's announcement.

    Who cares what Nvidia's take on the situation is? It was an accident? Oh, no worries, mate!

    They are a business that lied, there's consequences to that. Nobody cares that they didn't mean it.

    Refunds will start rolling out in coming weeks.
  • Yojimbo - Wednesday, January 28, 2015 - link

    Hey, can you link to the actual relevant part of those threads where someone is posting his methodology and results for creating a performance problem? The overclocker link seems to be a link to a 106-page thread whose first message is just a link to the other 3 threads you posted. The first message in the guru3d thread claims that the card can't use more than 3.5GB at all, which we now know to be completely false. It's like you're throwing us a cookbook and flour and saying "Here, there's a pie in here somewhere." If it's somewhere in there, and you have seen it before, could you please find and point to the methodology and claimed results so that people can try to repeat it rather than you just saying "you really should try harder"?
  • mudz78 - Wednesday, January 28, 2015 - link

    I think a more fitting analogy would be, somebody is complaining they can't spell and I am handing them a dictionary. I'm telling you the information is in there, so have a read and find it.

    Maybe if you bothered to read beyond the first post in each thread you would have some answers?

    " The first message in the guru3d thread claims that the card can't use more than 3.5GB at all,"

    No it doesn't.

    "I think (maybe) is here a little problem with GTX 970. If I run some games, for example Far Cry 4, GTX 970 allocate only around 3500MB video memory, but in same game and same scene GTX 980 allocate full 4000MB video memory.
    But if I change resolution to higher - 3840x2160, then all memory is allocated.
    Same problem exist in many other games like Crysis 3, Watch Dogs etc..

    Where is problem?? I really dont know..."
    http://forums.guru3d.com/showthread.php?t=396064

    "I didn't believe this at first, but I just decided to try and test it myself with texture modded Skyrim and my SLI 970s. I tried to push the 3.5 GBs barrier by downsampling it from 5120x2880 with the four following experimental conditions:

    1. No MSAA applied on top
    2. 2xMSAA applied on top
    3. 4xMSAA applied on top
    4. 8xMSAA applied on top

    Since MSAA is known to be VRAM heavy, it made sense. I also kept a close eye on GPU usage and FPS with the Rivatuner overlay as well as VRAM usage. All of this was done running around Whiterun to minimize GPU usage. My results were as follows.

    1. Skyrim peaked at about 3600 MBs in usage with occasional brief hitching while loading new textures in and out of VRAM. GPU usage remained well below 99% on each card.

    2. Skyrim once again peaked at about 3600 MBs with the mentioned hitching, this time somewhat more frequently. Once again, GPU usage remained well below 99%.

    3. Skyrim yet again peaked at about 3600 MBs and hitched much more prominently and frequently at the same time as VRAM usage droppped down 100-200 MBs. GPU usage was below 99% again with FPS still at 60 aside from those hitches.

    4. Now Skyrim was using the full 4 GB framebuffer with massive stuttering and hitching from a lack of VRAM. This time, I had to stare at the ground to keep GPU usage below 99% and retain 60 FPS. I ran around Whiterun just staring at the ground and it remained at 60 FPS except with those massive hitches where GPU usage and framerate temporarily plummeted. This last run merely indicated that Skyrim can indeed use more VRAM than it was with the previous 3 settings and so the issue seems to be with the 970s themselves rather than just the game in this example. The performance degradation aside from VRAM was severe, but that could just be 8xMSAA at 5K taking its calculative toll.

    So it seems to me that my 970s refuse to utilize above ~3600 MBs of VRAM unless they absolutely need it, but I've no idea why. Nvidia didn't gimp the memory bus in any overly obvious way from the full GM204 chip therefore the 970s should have no issue using the same VRAM amount as the 980s. I don't like what I see, it's like the situation with the GTX 660 that had 2 GBs but could only effectively use up 1.5 without reducing its bandwidth to a third, so it tried to avoid exceeding 1.5. The difference is that was predictable due to the GK106's 192-bit memory bus, there's nothing about the 970's explicit specifications that indicates the same situation should apply.

    A similar shortcoming was noticed sometime back regarding the 970's ROPs and how the cutting-down of 3 of GM204's 16 SMM units affected the effective pixel fillrate of the 970s despite retaining the full 64 ROPs. It's possible that Maxwell is more tightly-connected to shader clusters and severing them affects a lot about how the chip behaves, but that doesn't really make sense. If this is an issue, it's almost certainly software-related. I'm not happy regardless of the reason and I'll try more games later. Anecdotally, I have noticed recent demanding games peaking at about 3500-3600 MBs and can't actually recall anything going beyond that. I didn't pay attention to it or change any conditions to test it."
    http://www.overclock.net/t/1535502/gtx-970s-can-on...

    "I can reproduce this issue in Hitman: Absolution.
    Once more than 3.5GB get allocated, there is a huge frametime spike.
    The same scene can be tested to get reproducible results.
    In 4k, memory usage stays below 3.5GB and there is no extreme spike. But in 5k (4x DSR with 1440p), at the same scene, there is a huge fps drop once the game wants to allocate 2-300MB at once and burst the 3.5GB.
    It happens in the tutorial mission when encountering the tennis field.

    With older driver (344.11 instead of 347.09), memory usage is lower, but you can enable MSAA to get high VRAM usage and thus be able to reproduce by 100%.

    Could a GTX 980 owner test this?"
    http://www.overclock.net/t/1535502/gtx-970s-can-on...

    "Without AA or just FXAA, I have around 3.5GB used in AC: U and mostly no stuttering. With 2xMSAA it rises to ~3.6-3.7GB and performance is still ok. But when I enable 4xMSAA and it needs ~3.8GB, I often have severe stuttering.
    When I set resolution to 720p and enable 8xMSAA, VRAM usage is well below 3GB and there is no stuttering at all."
    http://forums.guru3d.com/showpost.php?p=4991141&am...

    "In Far Cry 4 @ 1440p
    No AA: 3320MB Max Vram, locked at 60 fps
    2x MSAA: 3405MB Max Vram, locked at 60fps
    4x MSAA: 3500MB Max Vram, 45-60fps
    8x MSAA, starts around 3700-3800MB @ 4-5fps, stabilizes at 3500MB @ 30-40fps."
    http://forums.guru3d.com/showpost.php?p=4991210&am...

    There's plenty more evidence supporting the acknowledged (by Nvidia) fact that the GTX970 has performance issues with VRAM allocation above 3.5GB.

    And all those people posting "my games run fine at 1080p", you are clearly missing the point.
  • aoshiryaev - Wednesday, January 28, 2015 - link

    Why not just disable the slow 512mb of memory?
  • SkyBill40 - Wednesday, January 28, 2015 - link

    Why not just have the full 4GB at the rated speed as advertised?
  • Oxford Guy - Thursday, January 29, 2015 - link

    Ding ding ding.
  • MrWhtie - Wednesday, January 28, 2015 - link

    I can run 4 games at 100+ fps on 1080p simultaneously (MSI GTX 970). Power like this used to always cost $500+. I have no complaints; I didn't have $500 to spend on a GTX 980.

    I feel Nvidia is doing us a favor by significantly undercutting AMD.
  • mudz78 - Wednesday, January 28, 2015 - link

    Yeah, a huge favour. By lying about their product specs, undercutting the competition and concreting market share, they set themselves up to hike prices in the future.
