IBM PowerPC 970FX: Superscalar monster

Meet the G5 processor, which is in fact IBM's PowerPC 970FX processor. The RISC ISA, which is quite complex and can hardly be called "Reduced" (the R in RISC), provides 32 architectural registers. Architectural registers are the registers that are visible to the programmer, which in practice mostly means the compiler writer. These are the registers that binary (and assembler) code can name in its calculations.

Compilers for the PowerPC 970FX should thus be able to produce code that is cleaner, with less shuffling of data between the L1-cache, "secret" rename registers and architectural registers. The end result is better performance, thanks to less "bookkeeping". Insiders say that the PowerPC 970FX has less "register pressure" than, for example, the EM64T and AMD64 CPUs (16 registers), which in turn have less register pressure than the "older" 32-bit x86 CPUs with only 8 architectural registers. Remember that most of the performance boost (10-30%) noticed in 64-bit x86 programs came from the 8 extra registers available in "pure" 64-bit mode.

The 970FX is deeply pipelined, quite a bit deeper than the Athlon 64 or Opteron. While the Opteron has a 12-stage pipeline for integer calculations, the 970FX goes deeper and ends up with 16 stages. Floating point is handled through 21 stages, whereas the Opteron needs only 17. Those 21 stages might make you think that the 970FX is close to a Pentium 4 Northwood, but remember that the Pentium 4 also has 8 stages in front of the trace cache; its 20 stages are counted from the trace cache onward. So, the Pentium 4 has to do less work in those 20 stages than the 970FX performs in its 16 or 21 stages. When it comes to branch misprediction penalties, the 970FX will be closer to the Pentium 4 (Northwood). But when it comes to frequency headroom, the 970FX should, in theory, do better than the Opteron, though it does not come close to the "old" Pentium 4.
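
A rough back-of-the-envelope model shows why pipeline depth matters for mispredicted branches. The sketch below assumes that a misprediction costs about as many cycles as the integer pipeline is deep, and the branch fraction and misprediction rate are made-up illustrative values, not measurements of any of these CPUs.

```python
def effective_cpi(base_cpi, branch_fraction, mispredict_rate, penalty_cycles):
    """Average cycles per instruction once misprediction flushes are added."""
    return base_cpi + branch_fraction * mispredict_rate * penalty_cycles

# Assumed workload (illustrative only): 20% branches, 5% of them mispredicted.
BRANCH_FRACTION = 0.20
MISPREDICT_RATE = 0.05

# Integer pipeline depths from the text, used as a stand-in for the flush penalty.
for name, depth in [("Opteron", 12), ("970FX", 16)]:
    cpi = effective_cpi(1.0, BRANCH_FRACTION, MISPREDICT_RATE, depth)
    print(f"{name}: {cpi:.2f} cycles per instruction")
```

Under these assumptions the deeper pipeline pays a few percent more per instruction, which is why a deep design leans so heavily on good branch prediction.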

The design philosophy of the 970FX is very aggressive. It is not only a deeply pipelined processor, but also a very wide superscalar CPU that can theoretically sustain up to 5 instructions (4 + 1 branch) per clock cycle. The Opteron can sustain 3 at most; the Pentium 4's trace cache bandwidth "limits" the P4 to about 2 x86 instructions per clock cycle.

The 970FX works out of order and up to 200 instructions can be kept in flight, compared to 126 in the Pentium 4. The rate at which instructions are fetched will not limit the issue rate either. The PowerPC 970FX fetches up to 8 instructions per cycle from the L1 and can decode at the same rate of 8 instructions per cycle. So, is the 970FX the ultimate out-of-order CPU?

While 200 instructions in flight are impressive, there is a catch. If there were no limitation except die size, CPUs would probably keep thousands of instructions in flight. However, the scheduler has to be able to pick independent instructions (instructions that do not rely on the outcome of a previous one) out of those buffers. Searching and analysing the buffers takes time, and time is very limited at clock speeds of 2.5 GHz and more. Although it is true that the bigger the buffers, the better, the number of instructions that can be tracked and analysed per clock cycle is very limited. The buffer in front of the execution units holds about 100 instructions, still respectable compared to the Athlon 64's reorder buffer of 72 instructions, divided into 24 groups of 3 instructions.
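
To make "picking out independent instructions" concrete, here is a minimal sketch of the check a scheduler has to perform on its window. It is a software illustration only, not how the 970FX hardware is built: the `Instr` record and register names are hypothetical, and it assumes register renaming has already removed every dependency except true read-after-write ones.

```python
from typing import NamedTuple, Tuple

class Instr(NamedTuple):
    name: str               # mnemonic, for display only
    dest: str               # register written
    srcs: Tuple[str, ...]   # registers read

def ready_instructions(window):
    """Return the instructions that do not read a result still being
    produced by an earlier, not yet executed instruction in the window."""
    ready = []
    pending_dests = set()
    for ins in window:
        if not any(src in pending_dests for src in ins.srcs):
            ready.append(ins)
        # The result is unavailable until the instruction has executed,
        # so later readers of this register must wait either way.
        pending_dests.add(ins.dest)
    return ready

window = [
    Instr("add", "r3", ("r1", "r2")),
    Instr("mul", "r4", ("r3", "r1")),  # waits for add
    Instr("sub", "r5", ("r6", "r7")),
    Instr("or",  "r8", ("r4", "r5")),  # waits for mul and sub
]
print([ins.name for ins in ready_instructions(window)])
```

Every extra entry in the window adds more comparisons to this check, and in hardware they all have to finish within one clock cycle.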

The same grouping also happens on the 970FX or G5. But the grouping is a little coarser here, with 5 instructions in one group. This grouping makes reordering and tracking a little easier than if the scheduler had to deal with 100 separate instructions.

The grouping is, at the same time, one of the biggest disadvantages. Yes, the Itanium also works with groups, but there the compilers should help the CPU with getting the slots filled. In the 970FX, the group must be assembled under pretty strict limitations, such as at most one branch per group. Many other restrictions apply, but they are outside the scope of this article. Suffice it to say that quite often a few of the slots in a group are filled with NOOPs: no-operation, or useless "do nothing" instructions. Or a group cannot be issued because some of the resources that one member of the group needs (registers, execution slots) are not available. You could say that the whole grouping thing makes the Superscalar monster less flexible.
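
The effect of these limitations can be sketched with a toy group builder. It uses only the two rules mentioned in the text (5 slots per group, at most one branch per group) plus two assumptions of this sketch: the branch always sits in the last slot and closes the group, and empty slots are padded with no-ops. The real 970FX rules are more involved, and the instruction stream and `BRANCHES` set are made up for illustration.

```python
GROUP_SIZE = 5            # slots per group (from the text)
BRANCHES = {"b", "bne"}   # hypothetical set of branch mnemonics

def pad(group, branches):
    """Fill unused slots with no-ops, keeping a branch in the last slot."""
    nops = ["nop"] * (GROUP_SIZE - len(group))
    if group and group[-1] in branches:
        return group[:-1] + nops + [group[-1]]
    return group + nops

def form_groups(instructions, branches=BRANCHES):
    groups, current = [], []
    for ins in instructions:
        # Assumption: the last slot is reserved for a branch, so a fifth
        # non-branch instruction has to start a new group.
        if ins not in branches and len(current) == GROUP_SIZE - 1:
            groups.append(pad(current, branches))
            current = []
        current.append(ins)
        if ins in branches:   # a branch closes the group
            groups.append(pad(current, branches))
            current = []
    if current:
        groups.append(pad(current, branches))
    return groups

stream = ["add", "lwz", "bne", "mul", "stw", "add", "sub", "or", "b"]
for group in form_groups(stream):
    print(group)
```

In this made-up stream, nine useful instructions occupy three groups, so 6 of the 15 slots carry no-ops.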

Branch prediction is done by two different methods, each with a gigantic 16K entry history table. A third "selector" keeps another 16K entry history to track which of the two methods has done the best job so far. Branch prediction seems to be a prime concern for the IBM designers.
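
The two-methods-plus-selector scheme is generally known as a tournament predictor. The sketch below is a textbook-style version, not IBM's design: only the 16K table size comes from the text, while the 2-bit saturating counters, the 11-bit global history and the XOR indexing are assumptions made for illustration.

```python
def bump(counter, up):
    """Move a 2-bit saturating counter one step up or down."""
    return min(3, counter + 1) if up else max(0, counter - 1)

class TournamentPredictor:
    def __init__(self, entries=16 * 1024, history_bits=11):
        self.entries = entries
        self.local = [1] * entries      # indexed by branch address
        self.global_ = [1] * entries    # indexed by address XOR history
        self.selector = [1] * entries   # >= 2 means "trust the global table"
        self.history = 0
        self.history_mask = (1 << history_bits) - 1

    def _indices(self, pc):
        return pc % self.entries, (pc ^ self.history) % self.entries

    def predict(self, pc):
        li, gi = self._indices(pc)
        use_global = self.selector[gi] >= 2
        counter = self.global_[gi] if use_global else self.local[li]
        return counter >= 2             # True means "predict taken"

    def update(self, pc, taken):
        li, gi = self._indices(pc)
        local_ok = (self.local[li] >= 2) == taken
        global_ok = (self.global_[gi] >= 2) == taken
        if local_ok != global_ok:       # train the selector only on disagreement
            self.selector[gi] = bump(self.selector[gi], global_ok)
        self.local[li] = bump(self.local[li], taken)
        self.global_[gi] = bump(self.global_[gi], taken)
        self.history = ((self.history << 1) | int(taken)) & self.history_mask

predictor = TournamentPredictor()
for _ in range(20):
    predictor.update(0x1000, True)      # a loop branch that is always taken
print(predictor.predict(0x1000))
```

The selector is what makes the combination worthwhile: branches that follow their own past behaviour end up on the address-indexed table, while branches that correlate with recent outcomes of other branches end up on the history-indexed one.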

Memory Subsystem

The caches are relatively small compared to the x86 competition. A 64 KB I-cache and 32 KB D-cache are relatively "normal", while the 512 KB L2-cache is a little small by today's standards. But, no complaints here. A real complaint can be lodged against the latency to the memory. Apple's own webpage talks about 135 ns access time to the RAM. Now, compare this to the 60 ns access time that the Opteron needs to access the RAM, and about 100-115 ns in the case of the Pentium 4 (with 875 chipset).

A quick test with LMbench 2.04 confirms this:

Host           OS          Mem read (MB/s)  Mem write (MB/s)  L2-cache latency (ns)  RAM random access (ns)
Xeon 3.06 GHz  Linux 2.4   1937             990               59.940                 152.7
G5 2.7 GHz     Darwin 8.1  2799             1575              49.190                 303.4
Xeon 3.6 GHz   Linux 2.6   3881             1669              78.380                 153.4
Opteron 850    Linux 2.6   1920             1468              50.530                 133.2

Memory latency is definitely a problem on the G5.
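
Expressed in clock cycles, the gap looks even larger, because the G5 also has to wait at a high clock speed. The small calculation below converts the RAM random access figures from the table above into core cycles. The 2.4 GHz clock for the Opteron 850 is not stated in the text and is filled in here as an assumption.

```python
# name: (core clock in GHz, RAM random access latency in ns)
SYSTEMS = {
    "Xeon 3.06 GHz": (3.06, 152.7),
    "G5 2.7 GHz":    (2.7, 303.4),
    "Xeon 3.6 GHz":  (3.6, 153.4),
    "Opteron 850":   (2.4, 133.2),   # clock speed assumed, not from the text
}

def latency_cycles(clock_ghz, latency_ns):
    """GHz is cycles per nanosecond, so the product is a cycle count."""
    return clock_ghz * latency_ns

for name, (clock, latency) in SYSTEMS.items():
    print(f"{name}: about {latency_cycles(clock, latency):.0f} cycles per RAM access")
```

For the G5 this works out to roughly 819 cycles of waiting per random RAM access.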

On the flip side of the coin is the excellent FSB bandwidth. The G5/PowerPC 970FX 2.7 GHz has a 1.35 GHz FSB (Full Duplex), capable of sending 10.8 GB/s in each direction. Of course, the (half duplex) dual channel DDR400 bus can only use 6.4 GB/s at most. Still, all this bandwidth can be put to good use with up to 8 data prefetch streams.
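
The 6.4 GB/s memory ceiling follows from simple arithmetic, sketched below. The 64-bit channel width and 400 million transfers per second are the standard DDR400 figures, assumed here rather than taken from the text.

```python
def peak_bandwidth_gb_s(channels, bus_width_bits, transfers_per_second):
    """Theoretical peak bandwidth in GB/s (1 GB = 10^9 bytes)."""
    bytes_per_transfer = bus_width_bits / 8
    return channels * bytes_per_transfer * transfers_per_second / 1e9

# Dual channel DDR400: 2 channels, 64 bits wide, 400 million transfers/s each.
ddr400_dual = peak_bandwidth_gb_s(2, 64, 400e6)
print(f"Dual channel DDR400 peak: {ddr400_dual:.1f} GB/s")
```

This is a theoretical peak; sustained figures, such as the read and write numbers LMbench reports, are considerably lower.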


  • jhagman - Tuesday, June 07, 2005 - link

    OK, this clears it up, thanks.

    One little thing still, what is the number you are giving in the ab results table? Is it requests per second or perhaps the transfer rate?

  • demuynckr - Tuesday, June 07, 2005 - link

    As I mentioned before, we used gcc 3.3.3 on all the Linux systems, and the gcc 3.3 Mac compiler on Apple, because that was the standard one.
    I did a second flops test with the gcc 4.0 compiler included on the Tiger CD, and the flops are much better when compiled with the -mcpu=g5 option, which did not seem to be available when using the gcc 3.3 Apple compiler.
    As for ab, I used these settings:
    ab -n 100000 -c x http://localhost/

    x for the various concurrencies: 5,20,50,100,150.
  • spinportal - Monday, June 06, 2005 - link

    Guess there's no one arguing that the PPC is not keeping pace with the current market, but rather whether OS X is able to do Big Iron computing. And if the rumors are true, where will you be able to get a PPC built once Apple drops IBM for Intel?
    In a Usenet debate in 93, Torvalds and Tanenbaum went roasting the Mach microkernel vs. the death of Linux. Seems Linus' work will be seeing more light of day, and Mach will go the way of the dodo. Will Apple rewrite OS X for Intel x86/64? As far as practical business sense goes, that's like shooting oneself in the foot.
  • jhagman - Monday, June 06, 2005 - link

    Could you please give the exact method of testing apache with ab? It is really hard to try to redo the tests when one does not know which methodology was used. The amount of clients and switches of ab would be appreciated.

    Also an answer to why Apple's newest gcc (4.0) was not used would be an interesting one and did you _really_ use gcc 3.3.3 and not Apple's gcc?

    Other than these omissions I found the article very interesting, thanks.
  • demuynckr - Monday, June 06, 2005 - link

    Yes, I have read the article. I also personally compiled the microbenchmarks on Linux as well as on the PPC, and I can tell you I used gcc 3.3 on Mac for all compilation needs :).
  • webflits - Monday, June 06, 2005 - link

    demuynckr, did you read the article?

    "So, before we start with application benchmarks, we performed a few micro benchmarks compiled on all platforms with the SAME gcc 3.3.3 compiler. "

    BTW, I ran the same tests using Apple's version of gcc 3.3.
    As you can see, my 2.0 GHz now beats the 2.5 GHz on 5 of the 8 tests, and a 2.7 GHz G5 would be on par with the Opteron 250 when you extrapolate the results.

    Let's face it, Anandtech screwed up by using a crippled compiler for the G5 tests.

    GCC 3.3/OSX 10.4.1/2.0GHz G5

    FLOPS C Program (Double Precision), V2.0 18 Dec 1992

    Module Error RunTime MFLOPS
    1 4.0146e-13 0.0140 997.2971
    2 -1.4166e-13 0.0108 648.4622
    3 4.7184e-14 0.0089 1918.5122
    4 -1.2546e-13 0.0139 1076.8597
    5 -1.3800e-13 0.0312 928.9079
    6 3.2374e-13 0.0182 1596.1407
    7 -8.4583e-11 0.0348 344.3954
    8 3.4855e-13 0.0196 1527.6638

    Iterations = 512000000
    NullTime (usec) = 0.0004
    MFLOPS(1) = 827.5658
    MFLOPS(2) = 673.7847
    MFLOPS(3) = 1037.6825
    MFLOPS(4) = 1501.7226
  • demuynckr - Monday, June 06, 2005 - link

    Just to clear things up: on Linux, gcc 3.3.3 was used; on the Macintosh, gcc 3.3 was used (the one that was included with the OS).
  • Joepublic2 - Monday, June 06, 2005 - link

    Wow, pixelglow, that's an awesome way to advertise your product. No marketing BS, just numbers!
  • pixelglow - Sunday, June 05, 2005 - link

    I've done a direct comparison of G5 vs. Pentium 4 here. The benchmark is cache-bound, minimal branching, maximal floating point and designed to minimize use of the underlying operating system. It is also single-threaded so there's no significant advantage to dual procs. More importantly it uses Altivec on G5 and SSE/SSE2 on the Pentium 4, and also compares against different compilers including the autovectorizing Intel ICC.

    Let the results speak for themselves.
