Why In-Order?

Ever since the Pentium Pro, desktop PC microprocessors have implemented out-of-order (OoO) execution in order to improve performance.  We’ve explained the concept in great detail before, but the gist is that an out-of-order microprocessor can reorder its instruction stream to make the best use of its execution resources.  Simple as that sounds, implementing OoO dramatically increases the complexity of a microprocessor and drives up power consumption.

In a perfect world, you could group a bunch of OoO cores on a single die and offer both excellent single-threaded performance and great multi-threaded performance.  However, the world isn’t so perfect, and there are limits to how big a processor’s die can be.  Intel and AMD can only fit two of their OoO cores on a 90nm die, yet the Xbox 360 and PlayStation 3 targeted 3 and 9 cores, respectively, on a 90nm die; clearly something has to give, and that something happened to be the complexity of each individual core.

Given a game console’s expected five-year lifespan, both MS and Sony decided to favor a multi-core platform over a faster single-core CPU in order to remain competitive in the latter half of the consoles’ lifetime.

So with the Xbox 360, Microsoft used three fairly simple IBM PowerPC cores, while Sony has the much publicized Cell processor in its PlayStation 3.  Both will be considerably slower than even mainstream desktop processors in single-threaded game code, but the majority of games these days are far more GPU bound than CPU bound, so the performance decrease isn’t a huge deal.  In the long run, with a bit of optimization and multi-threaded game engines, these collections of simple in-order cores should be able to put out some fairly good performance.

Does In-Order Matter?

As we discussed in our Cell article, in-order execution makes a lot of sense for the SPEs.  With in-order execution as well as a small amount of high speed local memory, memory access becomes quite predictable and code is very easily scheduled by the compiler for the SPEs.  However, for the PPE in Cell, and the PowerPC cores in Xenon, the in-order approach doesn’t necessarily make a whole lot of sense.  You don’t have the advantage of a cacheless architecture, even though you do have the ability to force certain items to remain untouched by the cache.  More than anything having an in-order general purpose core just works to simplify the core, at the expense of depending quite a bit on the compiler, and the programmer, to optimize performance. 

Very little of a modern game is written in assembly; most of it is written in a high-level language like C or C++, and the compiler does the dirty work of optimizing the code and translating it into low-level assembly.  Compilers are horrendously difficult to write; getting a compiler to work is a pretty difficult job in itself, but getting one to work well, regardless of what the input code is, is nearly impossible.

However, with a properly designed ISA and a good compiler, having an in-order core to work on is not the end of the world.  The performance you lose by not being able to extract the last bit of instruction level parallelism is made up for by the fact that you can execute far more threads at once, since the simplicity of the in-order cores allows more of them to be packed on a die.  Unfortunately, as we’ve already discussed, on day one that’s not going to be much of an advantage.

The Cell processor’s SPEs are even more of a challenge, as they are more specialized hardware suited only to executing certain types of code.  Keeping in mind that the SPEs are not well suited to running branch-heavy code, loop unrolling will do a lot to improve performance, as it can significantly reduce the number of branches that must be executed.  In order to squeeze the absolute maximum amount of performance out of the SPEs, developers may be forced to hand code some routines, as initial performance numbers for optimized, compiled SPE code appear to be far below the SPEs’ peak throughput.

While the move to in-order architectures won’t cause game developers too much pain with good compilers at their disposal, the move to multi-threaded game development and optimizing for the Cell in general will be much more challenging. 

93 Comments

  • BenSkywalker - Sunday, June 26, 2005 - link

    "One thing is for sure, support for two 1080p outputs in spanning mode (3840 x 1080) on the PS3 is highly unrealistic. At that resolution, the RSX would be required to render over 4 megapixels per frame, without a seriously computation bound game it’s just not going to happen at 60 fps." -- Quote from page 10

    First off, 1080p doesn't support 60FPS as of this moment anyway, and there are an awful lot of games on consoles that aren't remotely close to being GPU bound. Remember that the XBox has titles now that are pushing out 1080i, and the RSX is easily far more than four times the speed of the GPU in the XBox.
  • tipoo - Wednesday, August 6, 2014 - link

    "RSX is easily far more than four times the speed of the GPU in the XBox."

    It's funny reading these comments years later, and seeing how crazy the PS3 hype machine was. I assume this insane comment referred to the 1 teraflop RSX thing, which was a massive joke. RSX was worse than Xenos not only in raw gflops (180 vs over 200 I think), but since it didn't have unified shaders it could be bottlenecked by a scene having too many vertex or pixel effects, leaving shaders underused.
  • calimero - Sunday, June 26, 2005 - link

    Here is one tip about Cell:
    to play MP3 files (stereo) on a PC you need a 100MHz 486 CPU. An Atari Falcon030 with an MC68030 (16MHz) and DSP (32MHz) can do the same thing!
    Everyone who knows how to program will find Cell outstanding and thrilling; everyone else who pretends to be a programmer, please continue to waste CPU cycles with your shitty code!
  • coolme - Sunday, June 26, 2005 - link

    "Supporting 1080p x2 may seem like overkill,"

    It's not gonna support 1080p x2

    "One thing is for sure, support for two 1080p outputs in spanning mode (3840 x 1080) on the PS3 is highly unrealistic. At that resolution, the RSX would be required to render over 4 megapixels per frame, without a seriously computation bound game it’s just not going to happen at 60 fps." -- Quote from page 10
  • nevermind4711 - Sunday, June 26, 2005 - link

    People have different ways of expressing the frequency of DDR memory. The correct memory frequency of the 7800GTX is 256MB/256-bit GDDR3 at 600MHz, but as it is double data rate some people say 1200 MHz.

    In the same way you can say the RSX memory is operating at 1400 MHz. How else could 128 bit result in a memory bandwidth of 22 GB/s for the RSX?

    #64 knitecrow, who is your source that the RSX does not contain e-dram, or is it just speculation?

    Besides, your conclusion from extrapolating the transistor count may be correct, but assuming the transistor count is proportional to the number of pixel pipelines is a rather big simplification. There is quite a lot of other stuff inside a GPU as well, stuff that does not scale proportionally to the pixel pipelines.
  • Furen - Sunday, June 26, 2005 - link

    The RSX is supposed to be clocked higher but will only have a 700MHz, 128bit memory bus (as opposed to the 1200MHz, 256bit memory bus on the 7800gtx).
  • knitecrow - Saturday, June 25, 2005 - link

    #61
    too bad you don't speak marketing.
    When they say near... it means very close. Could be slightly under or over. If it was something like 320M... they would be hyping 320M.


    #62 too bad you are wrong

    with 300M transistors, the RSX is a native 24 pixel pipeline card

    You can extrapolate the number by looking at:
    6800ultra - 16 - 222M
    6600GT - 8 - 144M

    it has no eDRAM.

    The features remain to be seen, but it's going to be a G70 derivative -- just like XGPU for the xbox was a geforce3 derivative.

    There is absolutely no evidence to suggest that the RSX is going to be more powerful than 7800GTX.

    Just because a product comes out later doesn't make it better

    Exhibit A:
    Radeon 9700pro vs. 5800ultra

  • Darkon - Saturday, June 25, 2005 - link

    http://www.psinext.com/index.php?categoryid=3&...
  • Dukemaster - Saturday, June 25, 2005 - link

    I think it is very clear why the RSX gpu has the same number of transistors but still is more powerful than the 7800GTX: the 7800GTX is a chip with 32 pipelines with 8 of them turned off.
  • nevermind4711 - Saturday, June 25, 2005 - link

    Interesting article. However, I find it strange that Anand and Derek do not comment on the difference in floating point capacity between the combatants. 1 TFlops for X360 vs. 2 TFlops for PS3. For X360 we know that the majority of flops come from the GPU, where probably the big part consists of massively parallel compare ops and such coming from the AA- and filtering circuitry integrated with the e-DRAM.
    It would be very interesting to know how the RSX provides 1.8 TFlops. I do not think the G70 has a capacity anything near that. Could it be possible that Sony will bring some e-DRAM to the party together with AA and filtering circuitry similar to X360? After all, Sony has quite some experience of e-DRAM from PS2 and PSP.
    Anand and Derek wrote "Both the G70 and the RSX share the same estimated transistor count, of approximately 300.4 million transistors." Where does this information come from? Sony only said in its presentation the RSX will have 300+ mil t:s. G70 we now know contains 302 mil t:s.
    #48: Sony may very well have replaced some video en/de-coding circuitry of the G70 with some e-dram circuitry.