The Xbox 360 GPU: ATI's Xenos

On a purely hardware level, ATI's Xbox 360 GPU (codenamed Xenos) is quite interesting. The part itself is made up of two physically distinct silicon ICs. One IC is the GPU itself, which houses all the shader hardware and most of the processing power. The second IC (which ATI refers to as the "daughter die") is a 10MB block of embedded DRAM (eDRAM) combined with the hardware necessary for z and stencil operations, color and alpha processing, and anti aliasing. This daughter die is connected to the GPU proper via a 32GB/sec interconnect. Data sent over this bus will be compressed, so usable bandwidth will be higher than 32GB/sec. In side the daughter die, between the processing hardware and the eDRAM itself, bandwidth is 256GB/sec.

At this point in time, much of the bandwidth generated by graphics hardware is required to handle color and z data moving to the framebuffer. ATI hopes to eliminate this as a bottleneck by moving this processing and the back framebuffer off the main memory bus. The bus to main memory is 512MB of 128-bit 700MHz GDDR3 (which results in just over 22GB/sec of bandwidth). This is less bandwidth than current desktop graphics cards have available, but by offloading work and bandwidth for color and z to the daughter die, ATI saves themselves a good deal of bandwidth. The 22GB/sec is left for textures and the rest of the system (the Xbox implements a single pool of unified memory).

The GPU essentially acts as the Northbridge for the system, and sits in the middle of everything. From the graphics hardware, there is 10.8GB/sec of bandwidth up and down to the CPU itself. The rest of the system is hooked in with 500MB/sec of bandwidth up and down. The high bandwidth to the CPU is quite useful as the GPU is able to directly read from the L2 cache. In the console world, the CPU and GPU are quite tightly linked and the Xbox 360 stands to continue that tradition.

Weighing in at 332M transistors, the Xbox 360 GPU is quite a powerful part, but its architecture differs from that of current desktop graphics hardware. For years, vertex and pixel shader hardware have been implemented separately, but ATI has sought to combine their functionality in a unified shader architecture.

What's A Unified Shader Architecture?

The GPU in the Xbox 360 uses a different architecture than we are used to seeing. To be sure, vertex and pixel shader programs will run on the part, but not on separate segments of the hardware. Vertex and pixel processing differ in purpose, but there is quite a bit of overlap in the type of hardware needed to do both. The unified shader architecture that ATI chose to use in their Xbox 360 GPU allows them to pack more functionality onto fewer transistors as less hardware needs to be duplicated for use in different parts of the chip and will run both vertex and shader programs on the same hardware.

There are 3 parallel groups of 16 shader units each. Each of the three groups can either operate on vertex or pixel data. Each shader unit is able to perform one 4 wide vector operation and 1 scalar operation per clock cycle. Current ATI hardware is able to perform two 3 wide vector and two scalar operations per cycle in the pixel pipe alone. The vertex pipeline of R420 is 6 wide and can do one vector 4 and one scalar op per cycle. If we look at straight up processing power, this gives R420 the ability to crunch 158 components (30 of which are 32bit and 128 are limited to 24bit precision). The Xbox GPU is able to crunch 240 32bit components in its shader units per clock cycle. Where this is a 51% increase in the number of ops that can be done per cycle (as well as a general increase in precision), we can't expect these 48 piplines to act like 3 sets of R420 pipelines. All things being equal, this increase (when only looking at ops/cycle) would be only as powerful as a 24 piped R420.

What will make or break the difference between something like a 24 piped R420 and the unified shaders of the Xbox GPU is how well applications will lend themselves to the adaptive nature of the hardware. Current configurations don't have nearly the same vertex processing power as they do pixel processing power. This is quite logical when we consider the fact that games have many more pixels displayed than vertices. For each geometry primitive, there are likely a good number of pixels involved. Of course, not all titles will need the same ratio of geometry to pixel power. This means that all the ops per clock could either be dedicated to geometry processing in truly polygon intense scenes. On the flip side (and more likely), any given clock cycle could see all 240 ops being used for pixel processing. If game designers realize this and code their shaders accordingly, we could see much more focused processing power dedicated to a single type of problem than on current hardware.

ATI is predicting that developers will use lots of very small triangles in Xbox 360 games. As engines like Epic's Unreal Engine 3 have shown incredible results using pixel shaders and normal maps to augment low geometric detail, we can't tell if ATI is trying to provide the chicken or the egg. In other words, will we see many small triangles on Xbox 360 because console developers are moving in that direction or because that is what will run well on ATI's hardware?

Regardless of the paths that lead to this road, it is obvious that the Xbox 360 will be a geometry power house. Not only are all 3 blocks of 16 shaders able to become vertex shaders, but ATI's GPU will be able to handle twice as many z operations if a z only pass is performed. The same is true of current ATI and NVIDIA hardware, but the fact that a geometry only pass can now make use of shader hardware to perform 48 vector and 48 scalar operations in any given clock cycle while doing twice the z operations is quite intriguing. This could allow some very geometrically complicated scenes.

How Many Threads? Inside the Xenos GPU
POST A COMMENT

93 Comments

View All Comments

  • BenSkywalker - Sunday, June 26, 2005 - link

    ""One thing is for sure, support for two 1080p outputs in spanning mode (3840 x 1080) on the PS3 is highly unrealistic. At that resolution, the RSX would be required to render over 4 megapixels per frame, without a seriously computation bound game it’s just not going to happen at 60 fps." -- Quote from page 10"

    First off 1080p doesn't support 60FPS as of this moment anyway, and there are an awful lot of games on consoles that aren't remotely close to being GPU bound anyway. Remember that the XBox has titles now that are pushing out 1080i and the RSX is easily far more then four times the speed of the GPU in the XBox.
    Reply
  • tipoo - Wednesday, August 6, 2014 - link

    "RSX is easily far more then four times the speed of the GPU in the XBox."

    It's funny reading these comments years later, and seeing how crazy the PS3 hype machine was. I assume this insane comment reffered to the 1 terraflop RSX thing, which was a massive joke. RSX was worse than Xenon not only in raw gflops (180 vs over 200 I think), but since it didn't have unified shaders it could be bottlenecked by a scene having too much vertex or pixel effects and leaving shaders underused.
    Reply
  • calimero - Sunday, June 26, 2005 - link

    Here is one tip about Cell:
    to play MP3 files (stereo) on PC you need 100MHz 486 CPU. Atari Falcon030 with MC68030 (16MHz) and DSP (32MHz) can do same thing!
    Everyone who know to program will find Cell outstanding and thrilling everyone else who pretend to be a programer please continue to waste CPU cycles with your shity code!
    Reply
  • coolme - Sunday, June 26, 2005 - link

    "Supporting 1080p x2 may seem like overkill,"

    It's not gonna support 1080p x2

    "One thing is for sure, support for two 1080p outputs in spanning mode (3840 x 1080) on the PS3 is highly unrealistic. At that resolution, the RSX would be required to render over 4 megapixels per frame, without a seriously computation bound game it’s just not going to happen at 60 fps." -- Quote from page 10
    Reply
  • nevermind4711 - Sunday, June 26, 2005 - link

    People have different ways of expressing the frequency of DDRAM. The correct memory frequency of 7800GTX is 256MB/256-bit GDDR3 at 600MHz, but as it is double rate some people say 1200 MHz.

    In the same way you can say the RSX memory is operating at 1400 MHz. How else could 128 bit result in a memory bandwidth of 22 GB/s for the RTX?

    #64 knitecrow, who is your source that the RSX does not contain e-dram, or is it just speculation?

    Besides, your conclusion from extrapolating the transistor count may be correct, but assuming the transistor count is proportional to the number of pixel pipelines is a rather big simplification, there is quite a lot of other stuff inside a GPU as well, stuff that does not scale proportionally to the pixel pipelines.
    Reply
  • Furen - Sunday, June 26, 2005 - link

    The RSX is supposed to be clocked higher but will only have a 700MHz, 128bit memory bus (as opposed to the 1200MHz, 256bit memory bus on the 7800gtx). Reply
  • knitecrow - Saturday, June 25, 2005 - link

    #61
    too bad you don't speak marketing.
    When they say near.. it means very close. Could be slightly under or over. If it was something like 320M... they will be hyp3ing 320M.


    #62 too bad you are wrong

    with 300M transistors, the RSX is a native 24 pixel pipeline card

    You can extrapolate the number by looking at:
    6800ultra - 16 - 222M
    6600GT - 8 - 144M

    it has no eDRAM.

    The features remain to be seen, but its going to be a G70 derivate -- just like XGPU for the xbox was a geforce3 derivative.

    There is absolutely no evidence to suggest that the RSX is going to be more powerful than 7800GTX.

    Just because a product comes out later doesn't make it better

    Exhibit A:
    Radeon 9700pro vs. 5800ultra

    Reply
  • Darkon - Saturday, June 25, 2005 - link

    http://www.psinext.com/index.php?categoryid=3&... Reply
  • Dukemaster - Saturday, June 25, 2005 - link

    I think it is very clear why the RSX gpu has the same number of transistors but still is more powerfull then the 7800GTX: the 7800GTX is a chip with 32 pipelines with 8 of them turned off. Reply
  • nevermind4711 - Saturday, June 25, 2005 - link

    Interesting article. However, I find it strange that Anand and Derek do not comment on the difference in floating point capacity between the combatants. 1 TFlops for X360 vs. 2 TFlops for PS3. For X360 we know that the majority of flops come from the GPU, where probably the big part consists of massively paralell compare ops and such coming from the AA- and filtering circuitry integrated with the e-DRAM.
    It would be very interesting to know how the RSX provides 1.8 TFlops. I do not think the G70 has a capacity anything near that. Could it be possible that Sony will bring some e-DRAM to the party together with AA and filtering circuitry similar to X360. After all Sony has quite some experience of e-DRAM from PS2 and PSP.
    Anand and Derek wrote "Both the G70 and the RSX share the same estimated transistor count, of approximately 300.4 million transistors." Where do this information come from? Sony only said in its presentation the RSX will have 300+ mil t:s. G70 we now know contains 302 mil t:s.
    #48: Sony may very well have replaced some video en/de-coding circuitry of the G70 with some e-dram circuitry.
    Reply

Log in

Don't have an account? Sign up now