Memory Subsystem

With the same underlying CPU and GPU architectures, porting games between the two should be much easier than ever before. Making the situation even better is the fact that both systems ship with 8GB of total system memory and Blu-ray disc support. Game developers can look forward to the same amount of storage per disc, and relatively similar amounts of storage in main memory. That’s the good news.

The bad news is that the two take wildly different approaches to their memory subsystems. Sony’s approach with the PS4 SoC was to use a 256-bit wide GDDR5 memory interface running at around a 5.5GHz data rate, delivering peak memory bandwidth of 176GB/s. That’s roughly the amount of memory bandwidth we’ve come to expect from a $300 GPU, and great news for the console.
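
That 176GB/s figure follows directly from the bus width and data rate; here’s the back-of-the-envelope arithmetic (a minimal sketch using the 5500MT/s effective data rate cited above):

```python
# Peak GDDR5 bandwidth on the PS4: 256-bit bus at an effective 5500 MT/s
bus_width_bits = 256
data_rate_mtps = 5500                     # effective mega-transfers per second

bytes_per_transfer = bus_width_bits // 8  # 32 bytes move per transfer
peak_gbps = bytes_per_transfer * data_rate_mtps / 1000
print(f"{peak_gbps:.1f} GB/s")            # 176.0 GB/s
```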

Xbox One Motherboard, courtesy Wired

Die size dictates memory interface width, so the 256-bit interface remains, but Microsoft chose DDR3 memory instead. A look at Wired’s excellent high-res teardown photo of the motherboard reveals Micron DDR3-2133 DRAM on board (16 x 16-bit DDR3 devices to be exact). A little math gives us 68.3GB/s of peak bandwidth to system memory.
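
The same arithmetic covers the Xbox One’s DDR3-2133 and, for reference, the Xbox 360’s GDDR3 from the comparison table below (a quick sketch of peak figures, not a measurement of sustained throughput):

```python
# Peak DRAM bandwidth = (bus width in bytes) x (effective transfer rate)
def peak_bandwidth_gbps(bus_width_bits: int, data_rate_mtps: int) -> float:
    return (bus_width_bits / 8) * data_rate_mtps / 1000

print(peak_bandwidth_gbps(256, 2133))  # Xbox One, 256-bit DDR3-2133:  ~68.3 GB/s
print(peak_bandwidth_gbps(128, 1400))  # Xbox 360, 128-bit GDDR3-1400:  22.4 GB/s
```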

To make up for the gap, Microsoft added embedded SRAM on die (not eDRAM; SRAM is less area efficient, but it offers lower latency and doesn’t need refreshing). All information points to 32MB of 6T-SRAM, or roughly 1.6 billion transistors for this memory. It’s not immediately clear whether this is a true cache or software-managed memory. I’d hope for the former, but it’s quite possible that it isn’t. At 32MB the eSRAM is more than enough for frame buffer storage, indicating that Microsoft expects developers to use it to offload requests from the system memory bus. Game console makers (Microsoft included) have often used large, high-speed memories to get around memory bandwidth limitations, so this is no different. Although 32MB doesn’t sound like much, if it is indeed used as a cache (with the frame buffer kept in main memory) it’s actually enough to deliver a substantial hit rate in current workloads (although there’s not much room for growth).
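
Both the transistor count and the frame buffer claim check out with some quick math (a rough sketch; the 4 bytes per pixel figure assumes a conventional 32-bit RGBA render target):

```python
# 6T-SRAM stores each bit in six transistors
esram_bytes = 32 * 1024 * 1024
transistors = esram_bytes * 8 * 6
print(f"{transistors / 1e9:.2f} billion transistors")  # ~1.61 billion

# A single 1080p render target at an assumed 4 bytes per pixel
frame_buffer_mb = 1920 * 1080 * 4 / (1024 * 1024)
print(f"{frame_buffer_mb:.1f} MB per 1080p buffer")    # ~7.9 MB, several fit in 32MB
```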

Vgleaks has a wealth of info, likely supplied by game developers with direct access to Xbox One specs, that looks to be very accurate at this point. According to their data, there’s roughly 50GB/s of bandwidth in each direction to the SoC’s embedded SRAM (102GB/s total bandwidth). Adding that to the DDR3 interface and the 30GB/s CPU-GPU connection is how Microsoft arrives at its 200GB/s bandwidth figure, although in reality that’s not how any of this works. If it’s used as a cache, the embedded SRAM should significantly cut down on the GPU’s requests to main memory, which would give the GPU much more effective bandwidth than the 256-bit DDR3-2133 memory interface would otherwise imply. Depending on how the eSRAM is managed, it’s very possible that the Xbox One could have effective memory bandwidth comparable to the PlayStation 4’s. If the eSRAM isn’t managed as a cache, however, this all gets much more complicated.
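
To make both the marketing math and the cache argument concrete, here’s a deliberately simplified model of my own (an illustration only, not a description of how the memory controllers actually arbitrate traffic): the 200GB/s headline is just the sum of the three links, while effective bandwidth under a cache depends on what fraction of GPU traffic the eSRAM can service.

```python
ESRAM_BW = 102.0    # GB/s, total embedded SRAM bandwidth cited above
DDR3_BW = 68.3      # GB/s, 256-bit DDR3-2133 system memory
CPU_GPU_BW = 30.0   # GB/s, CPU-GPU connection

# Microsoft's headline number is simply the sum of the links
print(ESRAM_BW + DDR3_BW + CPU_GPU_BW)   # ~200 GB/s

# Illustrative only: if a fraction `hit_rate` of GPU memory traffic is served
# from eSRAM and the rest from DDR3, aggregate throughput is capped by
# whichever pool saturates first.
def aggregate_bandwidth(hit_rate: float) -> float:
    if hit_rate <= 0.0:
        return DDR3_BW
    if hit_rate >= 1.0:
        return ESRAM_BW
    return min(ESRAM_BW / hit_rate, DDR3_BW / (1.0 - hit_rate))

for h in (0.3, 0.5, 0.6, 0.7):
    print(f"hit rate {h:.0%}: ~{aggregate_bandwidth(h):.0f} GB/s")
```

Under this toy model, a hit rate in the neighborhood of 60% would put aggregate throughput close to the PS4’s 176GB/s, which is roughly the scenario described above.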

Microsoft Xbox One vs. Sony PlayStation 4 Memory Subsystem Comparison
                            Xbox 360              Xbox One           PlayStation 4
Embedded Memory             10MB eDRAM            32MB eSRAM         -
Embedded Memory Bandwidth   32GB/s                102GB/s            -
System Memory               512MB 1400MHz GDDR3   8GB 2133MHz DDR3   8GB 5500MHz GDDR5
System Memory Bus           128-bit               256-bit            256-bit
System Memory Bandwidth     22.4GB/s              68.3GB/s           176.0GB/s

There are merits to both approaches. Sony has the most present-day-GPU-centric approach to its memory subsystem: give the GPU a wide and fast GDDR5 interface and call it a day. It’s well understood and simple to manage. The downsides? High speed GDDR5 isn’t the most power efficient, and Sony is now married to a more costly memory technology for the life of the PlayStation 4.

Microsoft’s approach leaves some questions about implementation, and is potentially more complex to deal with depending on that implementation. Microsoft specifically called out its 8GB of memory as being “power friendly”, a nod to the lower power operation of DDR3-2133 compared to 5.5GHz GDDR5 used in the PS4. There are also cost benefits. DDR3 is presently cheaper than GDDR5 and that gap should remain over time (although 2133MHz DDR3 is by no means the cheapest available). The 32MB of embedded SRAM is costly, but SRAM scales well with smaller processes. Microsoft probably figures it can significantly cut down the die area of the eSRAM at 20nm and by 14/16nm it shouldn’t be a problem at all.

Even if Microsoft can’t deliver the same effective memory bandwidth as Sony, the Xbox One also has fewer GPU execution resources, so it’s entirely possible that its memory bandwidth demands will be inherently lower to begin with.

Comments

  • JDG1980 - Wednesday, May 22, 2013 - link

    In terms of single-threaded performance *per clock*, Thuban > Piledriver. Sure, if you crank up the clock rate *and the heat and power consumption* on Piledriver, you can barely edge out Deneb and Thuban on single-threaded benchmarks. But if you clock them the same, the Thuban uses less power, generates less heat, and performs better. Tom's Hardware once ran a similar test with Netburst vs Pentium M, and their conclusion was quite blunt: the test called into question the P4's "right to exist". The same is true of the Bulldozer/Piledriver line.
    And I don't buy the argument that K10 is too old to be fixable. Remember that Ivy Bridge and Haswell are part of a line stretching all the way back to the original Pentium Pro. The one time Intel tried a clean break with the past (Netburst) it was an utter fail. The same is true of AMD's excavation equipment line and for the same reason - IPC is terrible so the only way to get acceptable performance is to crank up clock rate, power, noise, and thermals.
  • silverblue - Wednesday, May 22, 2013 - link

    It's true that K10 is generally more effective per clock, but look at it this way - AMD believed that the third AGU was unnecessary as it was barely used, much like when VLIW4 took over from VLIW5 as the average slot utilisation within a streaming processor was 3.4 at any given time. Put simply, they made trade-offs where it made sense to make them. Additionally, K10 was most likely hampered by its 3-issue front end, but it also lacked a whole load of ISA extensions - SSE4.1 and 4.2 are good examples.

    Thuban compares well with the FX-8150 in most cases and favourably so when we're considering lighter workloads. The work done to rectify some of Bulldozer's ills shows that Piledriver is not only about 7% faster per clock, but can clock higher within the same power envelope. AMD was obviously aiming for more performance within a given TDP. The FX-83xx series is out of reach of Thuban in terms of performance.

    The 6300 compares with the 1100T BE as such:

    http://www.cpu-world.com/Compare/316/AMD_FX-Series...

    Oddly, one of your arguments for having a Thuban in the first place was power consumption. The very reason a Thuban isn't clocked as high as the top X4s is to keep power consumption in check. Those six cores perform very admirably against even a 2600K in some circumstances, and generally with Bulldozer and Piledriver you'd look to the FX-8xxx CPUs if comparing with Thuban; however, I expect the FX-6350 will be just enough to edge the 1100T BE in pretty much any area:

    http://www.cpu-world.com/Compare/321/AMD_FX-Series...

    The two main issues with the current "excavation equipment line" as you put it are a lack of single-threaded power, plus the inherent inability to switch between threads more than once per clock - clocking Bulldozer high may offset the latter in some way but at the expense of power usage. The very idea that Steamroller fixes the latter with some work done to help the former, and that Excavator improves IPC whilst (supposedly) significantly reducing power consumption should be evidence enough that whilst it started off bad, AMD truly believes it will get better. In any case, how much juice does anybody expect eight cores to use at 4GHz with a shedload of cache? Does anybody remember how hungry Nehalem was, let alone P4?

    I doubt that Jaguar could come anywhere near even a downclocked A10-4600M. The latter has a high-speed dual channel architecture and a 4-issue front end; to be perfectly honest, I think that even with its faults, it would easily beat Jaguar at the same clock speed.

    Tacking bits onto K10 is a lost cause. AMD doesn't have the money, and even if it did, Bulldozer isn't actually a bad idea. Give them a chance - how much faster was Phenom II over the original Phenom once AMD worked on the problem for a year?
  • Shadowmaster625 - Wednesday, May 22, 2013 - link

    Yeah but AMD would not have stood still with K10. Look at how much faster Regor is compared to the previous Athlon:

    http://www.anandtech.com/bench/Product/121?vs=27

    The previous Athlon had a higher clock speed and the same amount of cache, but Regor crushes it by almost 30% in Far Cry 2. It is 10% faster across the board despite being lower clocked and consuming far less power. Had they continued with Thuban it is possible they would have continued to squeeze 10% per year out of it as well as reduce power consumption by 15%, which, if you do the math, leaves us with something relatively competitive today. Not to mention they would have saved a LOT of money. They could have easily added AVX or any other extensions to it.
  • Hubb1e - Wednesday, May 22, 2013 - link

    Per clock Thuban > Piledriver, but power consumption favors Piledriver. Compare two chips of similar performance. The PhII 965 is a 125W CPU and the FX4300 is a 95W CPU and they perform similarly with the FX4300 actually beating the PhII by a small margin.
  • kyuu - Wednesday, May 22, 2013 - link

    ... Lol? You can't simply clock a low-power architecture up to 4GHz. Even if you could, a 4GHz Jaguar-based CPU would still be slower than a 4GHz Piledriver-based one.

    Jaguar is a low-power architecture. It's not able (or meant to) compete with full-power CPUs in raw processing power. It's being used in the Xbox One and PS4 for two reasons: power efficiency, and cost. It's not because of its processing power (although it's still a big step up from the CPUs in the 360/PS3).
  • plcn - Wednesday, May 22, 2013 - link

    BD/PD have plenty of viability in big power envelope, big/liquid cooler, desktop PC arrangements. consoles aspire to be much quieter, cooler, energy efficient - thus the sensible jaguar selection. even the best ITX gaming builds out there are still quite massive and relatively unsightly vs what seems achievable with jaguar... now for laptops on the other hand, a dual jaguar 'netbook' could be very very interesting. you can probably cook your eggs on it, too, but still interesting..
  • lmcd - Wednesday, May 22, 2013 - link

    It isn't a step in the right direction in IPC. Piledriver is 40% faster than Jaguar at the same clocks and also clocks higher.

    Stop spreading the FUD about Piledriver -- my A8-4500m is a very solid processor with very strong graphics performance and excellent CPU performance for all but the most taxing tasks.
  • lightsout565 - Wednesday, May 22, 2013 - link

    Pardon my ignorance, but what is the "Embedded Memory" used for?
  • tipoo - Wednesday, May 22, 2013 - link

    It's a fast memory pool for the GPU. It could help by holding the framebuffer or caching textures etc.
  • BSMonitor - Wednesday, May 22, 2013 - link

    Embedded memory latency is MUCH closer to L1/L2 cache latency than system memory's. System memory is Brian and Stewie taking the airline to Vegas; cache/embedded memory is the teleporter to Vegas...
