The Xbox 360 GPU: ATI's Xenos

On a purely hardware level, ATI's Xbox 360 GPU (codenamed Xenos) is quite interesting. The part itself is made up of two physically distinct silicon ICs. One IC is the GPU itself, which houses all the shader hardware and most of the processing power. The second IC (which ATI refers to as the "daughter die") is a 10MB block of embedded DRAM (eDRAM) combined with the hardware necessary for z and stencil operations, color and alpha processing, and antialiasing. This daughter die is connected to the GPU proper via a 32GB/sec interconnect. Data sent over this bus will be compressed, so effective bandwidth will be higher than 32GB/sec. Inside the daughter die, between the processing hardware and the eDRAM itself, bandwidth is 256GB/sec.

At this point in time, much of the memory bandwidth consumed by graphics hardware goes to moving color and z data to and from the framebuffer. ATI hopes to eliminate this as a bottleneck by moving this processing and the back framebuffer off the main memory bus. Main memory is 512MB of GDDR3 on a 128-bit bus running at 700MHz (which results in just over 22GB/sec of bandwidth). This is less bandwidth than current desktop graphics cards have available, but by offloading color and z work to the daughter die, ATI saves a good deal of main memory bandwidth. The 22GB/sec is left for textures and the rest of the system (the Xbox 360 implements a single pool of unified memory).
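The "just over 22GB/sec" figure follows directly from the bus width and clock. The sketch below is only a back-of-the-envelope check, assuming that GDDR3 transfers data twice per clock (double data rate) and that 1GB means 10^9 bytes; the function and variable names are made up for illustration.

```python
def peak_bandwidth_gb_per_s(bus_width_bits, clock_mhz, transfers_per_clock=2):
    """Peak theoretical bandwidth in GB/s, with 1 GB taken as 10**9 bytes.

    transfers_per_clock=2 is an assumption that models double data rate
    memory such as GDDR3.
    """
    bytes_per_transfer = bus_width_bits / 8
    transfers_per_second = clock_mhz * 1e6 * transfers_per_clock
    return bytes_per_transfer * transfers_per_second / 1e9


# 128-bit bus at 700MHz, as quoted in the text above.
main_memory_bw = peak_bandwidth_gb_per_s(128, 700)
print(f"Main memory: {main_memory_bw:.1f} GB/s")
```

This is a peak number only. Sustained bandwidth in a real system is lower, and the CPU and the rest of the system share this same pool with the GPU.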

The GPU essentially acts as the Northbridge for the system, and sits in the middle of everything. From the graphics hardware, there is 10.8GB/sec of bandwidth up and down to the CPU itself. The rest of the system is hooked in with 500MB/sec of bandwidth up and down. The high bandwidth to the CPU is quite useful as the GPU is able to directly read from the L2 cache. In the console world, the CPU and GPU are quite tightly linked and the Xbox 360 stands to continue that tradition.

Weighing in at 332M transistors, the Xbox 360 GPU is quite a powerful part, but its architecture differs from that of current desktop graphics hardware. For years, vertex and pixel shader hardware have been implemented separately, but ATI has sought to combine their functionality in a unified shader architecture.

What's A Unified Shader Architecture?

The GPU in the Xbox 360 uses a different architecture than we are used to seeing. To be sure, vertex and pixel shader programs will run on the part, but not on separate segments of the hardware. Vertex and pixel processing differ in purpose, but there is quite a bit of overlap in the type of hardware needed to do both. The unified shader architecture that ATI chose for the Xbox 360 GPU runs both vertex and pixel shader programs on the same hardware. This allows ATI to pack more functionality into fewer transistors, as less hardware needs to be duplicated for use in different parts of the chip.

There are 3 parallel groups of 16 shader units each. Each of the three groups can operate on either vertex or pixel data. Each shader unit is able to perform one 4 wide vector operation and 1 scalar operation per clock cycle. Current ATI hardware is able to perform two 3 wide vector and two scalar operations per cycle in the pixel pipe alone. The vertex pipeline of R420 is 6 wide and can do one vector 4 and one scalar op per cycle. If we look at straight up processing power, this gives R420 the ability to crunch 158 components per clock cycle (30 of which are 32bit and 128 of which are limited to 24bit precision). The Xbox GPU is able to crunch 240 32bit components in its shader units per clock cycle. While this is a roughly 52% increase in the number of ops that can be done per cycle (as well as a general increase in precision), we can't expect these 48 pipelines to act like 3 sets of R420 pipelines. All things being equal, this increase (when only looking at ops/cycle) would make the part only about as powerful as a 24 piped R420.
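The per-clock totals quoted above can be reproduced with a few lines of arithmetic. This is just a sketch of the counting; the R420 pixel pipe count of 16 is an assumption inferred from the 128 component figure rather than something stated in the text, and the helper name is hypothetical.

```python
def components_per_clock(units, vector_width, vector_ops, scalar_ops):
    """Components processed per clock by a group of identical shader units."""
    return units * (vector_width * vector_ops + scalar_ops)


# R420 pixel pipes: two 3 wide vector ops plus two scalar ops each.
# The count of 16 pipes is an assumption inferred from the 128 total.
r420_pixel = components_per_clock(16, 3, 2, 2)

# R420 vertex pipes: 6 wide, one vector 4 op plus one scalar op each.
r420_vertex = components_per_clock(6, 4, 1, 1)
r420_total = r420_pixel + r420_vertex

# Xenos: 3 groups of 16 units, one 4 wide vector op plus one scalar op each.
xenos_total = components_per_clock(3 * 16, 4, 1, 1)

increase_pct = (xenos_total / r420_total - 1) * 100
print(r420_pixel, r420_vertex, r420_total, xenos_total)
print(f"Increase: {increase_pct:.1f}%")
```

The result is an increase of a little under 52 percent. Counting components per clock says nothing about clock speed, scheduling, or texture throughput, so it is only a rough upper bound on relative shader power.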

What will make or break the difference between something like a 24 piped R420 and the unified shaders of the Xbox GPU is how well applications will lend themselves to the adaptive nature of the hardware. Current configurations don't have nearly the same vertex processing power as they do pixel processing power. This is quite logical when we consider the fact that games have many more pixels displayed than vertices. For each geometry primitive, there are likely a good number of pixels involved. Of course, not all titles will need the same ratio of geometry to pixel power. This means that all the ops per clock could be dedicated to geometry processing in truly polygon intense scenes. On the flip side (and more likely), any given clock cycle could see all 240 ops being used for pixel processing. If game designers realize this and code their shaders accordingly, we could see much more focused processing power dedicated to a single type of problem than on current hardware.
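To make the trade-off concrete, here is a toy model comparing a fixed vertex/pixel split with a unified pool of the same total size. It is a sketch under heavy assumptions, not a description of how Xenos actually schedules work: the 30/210 fixed split and the workload sizes are invented for illustration, and the model ignores scheduling overhead, texture latency, and the fact that Xenos assigns work per group of 16 units.

```python
import math


def cycles_fixed(vertex_ops, pixel_ops, vertex_cap, pixel_cap):
    """Cycles needed when vertex and pixel hardware are separate.

    The slower of the two sides determines the total, and idle capacity
    on the other side cannot be reused.
    """
    return max(math.ceil(vertex_ops / vertex_cap),
               math.ceil(pixel_ops / pixel_cap))


def cycles_unified(vertex_ops, pixel_ops, total_cap):
    """Cycles needed when any unit can take either kind of work."""
    return math.ceil((vertex_ops + pixel_ops) / total_cap)


# Hypothetical workloads as (vertex ops, pixel ops) per frame.
workloads = {
    "balanced for the fixed split": (3000, 21000),
    "geometry heavy": (12000, 12000),
    "pixel only": (0, 24000),
}

for name, (v_ops, p_ops) in workloads.items():
    fixed = cycles_fixed(v_ops, p_ops, vertex_cap=30, pixel_cap=210)
    unified = cycles_unified(v_ops, p_ops, total_cap=240)
    print(f"{name}: fixed={fixed} cycles, unified={unified} cycles")
```

In this model the fixed design only matches the unified one when the workload happens to fit its built-in ratio. Any other mix leaves part of the fixed hardware idle, which is the case the unified design is meant to address.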

ATI is predicting that developers will use lots of very small triangles in Xbox 360 games. As engines like Epic's Unreal Engine 3 have shown incredible results using pixel shaders and normal maps to augment low geometric detail, we can't tell if ATI is trying to provide the chicken or the egg. In other words, will we see many small triangles on Xbox 360 because console developers are moving in that direction or because that is what will run well on ATI's hardware?

Regardless of how we got here, it is obvious that the Xbox 360 will be a geometry powerhouse. Not only are all 3 blocks of 16 shaders able to become vertex shaders, but ATI's GPU will be able to handle twice as many z operations if a z only pass is performed. The same is true of current ATI and NVIDIA hardware, but the fact that a geometry only pass can now make use of shader hardware to perform 48 vector and 48 scalar operations in any given clock cycle while doing twice the z operations is quite intriguing. This could allow some very geometrically complicated scenes.

93 Comments

  • jotch - Friday, June 24, 2005

    #20 well that can't be right for the whole consumer base, as I'm 24 and only know other adults that have consoles and a lot of them have flashy TVs for them as well, I do. I think if you look at the market for consoles it is mainly teens and adults that have consoles - not kids. A lot of people I know started with a NES or an Atari 2500, etc and have continued to like games as they have grown up. Why is it that the best selling game has an 18 rating?? (GTA: San Andreas)

    The burning of the screen would be minimal unless you have a game paused for hours and the tv left on - TV technology is moving on and they often turn themselves off if a static image is displayed for an amount of time. So burning shouldn't occur.
  • nserra - Friday, June 24, 2005

    All the people I know who have consoles are kids (80%), and their parents have bought a TV just for the console, a 70€ TV.....

    What parent will let kids play games on an LCD or plasma (3000€) and burn it in?

    Either there will be good 480i "compatibility" in games, or forget it....

    #17 I agree.
  • fitten - Friday, June 24, 2005

    #14 There are a number of issues being discussed.

    For example, given the nature of current AI code, making that code parallel (as in more than one thread executing AI code working together) seems non-trivial. Data dependencies and the very branch heavy code making data dependencies less predictable probably cause headaches here. Sure, one could probably take the simple approach and say one thread for AI, one for physics, one for blah but that has already been discussed by numerous people as a possibility.

    Parallel code comes in many flavors. The parallelism in the graphics card, for instance, is sometimes classified as "embarrassingly parallel" which means it's trivial to do. Then there are pipelines (dataflow) which CPUs and GPUs also use. These are usually fairly easy too because the data partitioning is pretty easy. You break out a thread for each overall task that you want to do. You want to do OpA on the data, then OpB, then OpC. All OpB depends on is the output data of OpA and OpC just depends on OpB's final product. Three threads, each one doing an Op on the output of the previous.

    Then there are codes that are quite a bit more complex where, for example, there are numerous threads that all execute on parts of the whole data instead of all of it at once but the solution they are solving for requires many iterations on the data and at the end of each iteration, all the threads exchange data with each other (or just their 'neighbors') so that the next iteration can be performed. These are a bit more work to develop.

    Anyway, I got long-winded. Basically... there are *many* kinds of parallelism and many kinds of algorithms and implementations of parallelism. Some are low hanging fruit and some are non-trivial. Since I've already read that numerous developers for each platform already see low hanging fruit (run one thread for AI, another for physics, etc.) I can only believe they are talking about things that are non-trivial, such as a multithreaded AI engine, for example (again, as opposed to just breaking out the AI engine into one thread separate from the rest of game play).
  • probedb - Friday, June 24, 2005

    Nice article! I'll wait till they're both out and have a play before I buy either. Last console I bought was an original PlayStation :) But gotta love that hi-def loveliness at last!

    #3 yeah 1080i is interlaced and at such a high res and low refresh the text is really difficult to read, it'd be far better at 1080p I think since that would effectively be the same as 1920x1080 on a normal monitor. 1080i is flickery as hell for me for desktop use but fine for any video and media centre type interfaces on the PC.
  • A5 - Friday, June 24, 2005

    You know, the vast majority of the TVs these systems will be hooked up to will only do 480i (standard TV)...
  • jotch - Friday, June 24, 2005

    #14 - hear, hear!
  • jotch - Friday, June 24, 2005

    #10 - sounds to me like they're way ahead of their time, future-proofing is good as they'll need another 6 years to develop the PS4 - but the Cell and Xenon will force developers to change their ways and will prepare them for the future of developing on PCs that eventually have this kind of CPU chip design (ref Intel's chip design future pic on the first page of the article), like the article says the initial round of games will be single threaded etc etc...

    You might get a lot of mediocre games but then you should get ones that really shine bright on the PS3, noticeably Unreal 3 and I bet the Gran Turismo (Polyphony) guys will put in the effort.
  • Pannenkoek - Friday, June 24, 2005

    I'm quite tired of hearing how difficult it is to develop a multithreaded game. Only pathetic programmers can not grasp the concept of parallel code execution, it's not as if the current CPU/GPU duality does not qualify as one.
  • knitecrow - Friday, June 24, 2005

    you'll need HDD for online service and MMOP

    how many people are going to buy a $100 HDD if they don't have to?
  • LanceVance - Friday, June 24, 2005

    "the PS3 won’t ship with a hard drive"

    If that's true, then will it be like:

    - PS2 Memory Card; non-included but standard equipment required by all games.
    - PS2 Hard Drive; non-included and considered exotic unusual equipment and used by very few games.