An Early Christmas present from AMD: More Registers

In our coverage of the Opteron we focused primarily on the major architectural enhancements the K8 core enjoyed over the K7 (Athlon XP) - the on-die memory controller, improved branch predictor and more robust TLBs. For details on what these improvements do and why they matter, we'll direct you back to our Opteron coverage; the same information applies to the Athlon 64, as we are talking about the same fundamental core.

What we didn't spend much time talking about in our Opteron coverage was the benefit of additional registers, a benefit that is enabled in 64-bit mode. To understand why this is a benefit let's first discuss the role registers play in a microprocessor.

Although we think of main memory and cache as a CPU's storage areas, the often overlooked yet very important storage areas are its registers. Registers are individual storage locations that can hold numbers; these numbers can be values to add together, they can be memory addresses where the CPU can find the next piece of information it will need, or they can be temporary storage for the outcome of an operation. For example, in the following equation:

A = 2 + 4

The number 2, the number 4 and the resulting number 6 will all be stored in registers, with each number taking up one register. These high speed storage locations are located very close to the processor's functional units (the ALUs, FPUs, etc.) and are fixed in size. In a 32-bit x86 processor like the Athlon XP or Pentium 4, the majority of registers are 32 bits wide, meaning they can store a single 32-bit value. In 32-bit mode, the Athlon 64's general purpose registers are treated as being 32 bits wide, just like in its predecessor. However, in 64-bit mode all of the general purpose registers (GPRs) become 64 bits wide, and we gain twice as many GPRs. Why are more registers important, and why haven't AMD or Intel added more registers in the past? Let's answer these two questions next.

Take the example of A = 2 + 4 from before; in a microprocessor with at least 3 registers, this operation could be carried out without ever running out of registers. Internal to the microprocessor, the operation would be carried out something like this:

Store "2" in Register 1
Store "4" in Register 2
Store Register 1 + Register 2 in Register 3

After the operation has been carried out, all three values remain available in registers, so if we wanted to add 2 to the answer, the processor would simply add register 1 and register 3.
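
To make the steps concrete, here is a minimal Python sketch of the three-register case. The names R1 through R3 and the dictionary standing in for the register file are our own illustration, not how a real CPU is programmed:

```python
# Toy model of a three-register machine; the register names are hypothetical.
registers = {"R1": 0, "R2": 0, "R3": 0}

registers["R1"] = 2                                  # Store "2" in Register 1
registers["R2"] = 4                                  # Store "4" in Register 2
registers["R3"] = registers["R1"] + registers["R2"]  # A = 2 + 4

# All three values are still held in registers, so adding 2 to the answer
# is just Register 1 + Register 3, with no trip to main memory.
follow_up = registers["R1"] + registers["R3"]
print(registers["R3"], follow_up)
```

Nothing here touches memory: every operand the second addition needs is already sitting in a register.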

If the microprocessor only had 2 registers, however, and we ever needed to use the values 2 or 4 again, they would have to be stored in main memory before being overwritten by the resulting value of A. Alternatively, the result itself could be sent to main memory, and things would change in the following manner:

Store "2" in Register 1
Store "4" in Register 2
Store Register 1 + Register 2 in a location in main memory

Here you can see that there is now an additional memory access that wasn't there before. What we haven't even taken into account is that the address of that location in main memory will also have to be placed in a register, so that the CPU can tell the load/store unit where to send the data. If we wanted to use that result for anything, the CPU would first have to evict a piece of data from one of the occupied registers and put it in main memory, and then go to main memory to retrieve the result and load it into the freed register. As you can see, the number of memory accesses increases tremendously, and the more memory accesses you have, the longer your CPU has to wait in order to get work done; thus you lose performance. Simple enough? Now here's where things get a little more complicated: why don't we just keep on adding more registers?
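
The effect of running out of registers can be sketched with a small Python counter. This is a simplified model under our own assumptions, not a description of how any real CPU or compiler behaves: the function name count_memory_accesses is made up for this example, the first use of a value is free, the least recently used register is the one evicted, and an evicted value only costs a store if it is needed again later.

```python
from collections import OrderedDict

def count_memory_accesses(trace, num_registers):
    """Count spill stores and reloads for a sequence of value uses.

    trace: names of values in the order they are touched.
    Assumptions of this sketch: a value's first touch creates it for free,
    and the least recently used register is evicted when all are occupied.
    """
    registers = OrderedDict()  # value name -> None, oldest use first
    spilled = set()            # values currently held only in main memory
    accesses = 0

    for position, value in enumerate(trace):
        if value in registers:
            registers.move_to_end(value)               # most recently used
            continue
        if len(registers) == num_registers:
            victim, _ = registers.popitem(last=False)  # evict the oldest value
            if victim in trace[position + 1:]:         # still needed later?
                spilled.add(victim)
                accesses += 1                          # store to main memory
        if value in spilled:
            spilled.discard(value)
            accesses += 1                              # load back from memory
        registers[value] = None
    return accesses

# A = 2 + 4, followed by B = A + 2
trace = ["two", "four", "A", "A", "two", "B"]
two_regs = count_memory_accesses(trace, 2)
three_regs = count_memory_accesses(trace, 3)
print("2 registers:", two_regs, "memory accesses")
print("3 registers:", three_regs, "memory accesses")
```

The exact counts depend on the eviction policy chosen, so they will not match the walkthrough above step for step, but the trend is the same: the smaller register file forces trips to memory that the larger one avoids.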

The beauty of the x86 Instruction Set Architecture (ISA) is that there are close to two decades of software that will run on even today's x86 microprocessors. One way this sort of backwards compatibility is maintained is by keeping the ISA the same from one microprocessor generation to the next; while this doesn't include things like functional units, cache sizes, or anything of that nature, it does include the number and names of registers. When a program is compiled to be run on an x86 CPU, the compiler knows that the architecture has 8 general purpose registers, and when translating the programmer's code into machine code that the CPU can understand, it references only those 8 general purpose registers. If Intel were to have 10 general purpose registers, anything that was compiled to use them on an Intel CPU would not be able to run on an AMD CPU, as the extra 2 general purpose registers would not be found on the AMD processor.

Microprocessor designers have gotten around this by introducing a technique known as register renaming, which makes only the allowed number of registers visible to software while letting the hardware rename other internal registers to juggle data around without going to main memory. Register renaming does fix a large percentage of the issues associated with register conflicts, where a CPU simply runs out of registers and must start swapping to main memory. However, there are some cases where we simply need more registers.
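
The idea behind renaming can be sketched in a few lines of Python. This is only an illustration under our own assumptions: the RenameTable class, the pool of 8 internal registers and the two register names used are hypothetical, and a real design would also return internal registers to the free pool once their values are no longer needed.

```python
class RenameTable:
    """Maps software-visible register names onto a larger internal pool."""

    def __init__(self, visible_names, internal_count):
        self.free = list(range(internal_count))  # unused internal registers
        # Each visible name starts out bound to its own internal register.
        self.mapping = {name: self.free.pop(0) for name in visible_names}

    def read(self, name):
        # A read uses whichever internal register currently holds the name.
        return self.mapping[name]

    def write(self, name):
        # A write gets a fresh internal register, so the old value stays
        # intact for any earlier instruction that still needs it.
        self.mapping[name] = self.free.pop(0)
        return self.mapping[name]

table = RenameTable(["EAX", "EBX"], internal_count=8)
first = table.write("EAX")   # first value written to EAX
second = table.write("EAX")  # EAX reused for an unrelated value
print(first, second, table.read("EAX"))
```

Software only ever sees EAX, but the two writes land in different internal registers, so the hardware does not have to wait for the first value to be finished with before reusing the name.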

When AMD introduced their AMD64 architecture, they had a unique opportunity on their hands. Because no other x86 processor would be able to run 64-bit code anyway, they decided to double the number of general purpose and SSE/SSE2 registers that were made available in 64-bit mode. Since AMD didn't have to worry about compatibility, doubling the register count in 64-bit mode wasn't really a problem, and the majority of the performance increases you will see for 64-bit applications on the desktop will be due to the additional registers.
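
For reference, the doubling can be written out as plain Python lists. The register names are the architectural ones; the constant names are our own. Keep in mind that not every general purpose register is freely usable (ESP/RSP, for instance, serves as the stack pointer), so the number available to a compiler is somewhat lower in both modes.

```python
# Software-visible registers in each mode; the constant names are our own.
X86_GPRS = ["EAX", "ECX", "EDX", "EBX", "ESP", "EBP", "ESI", "EDI"]
AMD64_GPRS = (["RAX", "RCX", "RDX", "RBX", "RSP", "RBP", "RSI", "RDI"]
              + [f"R{n}" for n in range(8, 16)])   # 64-bit mode adds R8..R15

X86_SSE = [f"XMM{n}" for n in range(8)]      # XMM0..XMM7
AMD64_SSE = [f"XMM{n}" for n in range(16)]   # XMM0..XMM15

print(len(X86_GPRS), "->", len(AMD64_GPRS), "general purpose registers")
print(len(X86_SSE), "->", len(AMD64_SSE), "SSE/SSE2 registers")
```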

What is important to note is that although AMD has increased the number of visible registers in 64-bit mode, the number of internal registers for renaming has not increased, most likely because of cost/performance constraints.


121 Comments


  • Anonymous User - Friday, October 17, 2003

    In response to anonymous "Intel Boy" (biased, biased, biased): you can be in love with Intel if you choose. My experience has been that AMD processors have always been smoother running, and they run cooler than Intel, which increases processor life. The AMD64 is in its infancy. It will get better in the months to come.
  • Anonymous User - Wednesday, October 08, 2003

    Sorry, I mean #107.
  • Anonymous User - Wednesday, October 08, 2003

    To #117: what you wrote is totally true, but do you think many people understand it? Thanks anyway :))
  • Anonymous User - Monday, October 06, 2003

    For #4 and other Intel fanboys.
    I understand that you are furious. You think that the more a chip costs the better it is, and you paid much more money for Intel, and for what? It is usually defeated by AMD again, and you feel sorry, especially after the scandal with BAPCo, where it became clear that BAPCo is writing benchmarks for Intel to show them in a better light. Heh, even in SYSmark 2002, which is "broken" and which AMD doesn't recognize, a test that should not be used by Anand, the Athlon 64 FX-51 is better than Intel's 3.2GHz EE. And I can't understand how you can defend Intel when this processor runs at 3.2GHz and is DEFEATED BY 2.2GHz, a 1GHz handicap. I'll never buy Intel because of this, since it is clear even to the dumbest donkey which technology is better. That's why real computer specialists always prefer AMD and love them.
  • Anonymous User - Friday, October 03, 2003

    These benchmark figures appear as if the P4 was used in a single channel setup. Does anybody know if this is correct? Also, ECC DDR-400 chips are very hard to come by, prohibitively expensive, and aren't available with low latencies. I don't think FX systems will be price competitive. What good is the high memory limit when you can only afford 512MB, or a fast CPU with C3 memory? Too bad.
  • Anonymous User - Friday, October 03, 2003

    Hi, this is about your Athlon 64 Vs. Pentium 4 article, specifically the use of Quake3 as a CPU benchmark when comparing AMD vs. Intel cpus, as shown on this page

    http://www.anandtech.com/cpu/showdoc.html?i=1884&a...
    http://www.hardocp.com/article.html?art=NTI0LDU=
    http://www.tomshardware.com/cpu/20030923/athlon_64...

    Let me say the article is great, no complaints there. I know it takes a lot of work to produce these articles.

    Now, I see two reasons for using a game as a cpu benchmark:
    1) It presents a fair (emphasis on the word 'fair') comparison of the competing cpu architectures and scaling issues.
    2) The game itself is of current interest to the community.

    In your article you already concede 2). Quake3 itself is not relevant as a game to anybody. Quake3-derived games are another matter, and are still popular and certainly relevant. More on these later.

    I believe there is strong evidence that Quake3 does not provide a fair benchmark for comparing *modern* (AthlonXP and possibly Athlon64 as well) AMD cpus vs Intel cpus. The reason being (and let me emphasize that I don't know this as a verified fact, I'm going on what a couple of programmers involved with helping AMD produce optimized game code have told me) that the Quake3 cpu recognition code does not recognize the AthlonXP as an SSE-capable cpu. Not only that, but the 3DNow code in Quake3 is apparently non-functional for this cpu.

    The politics and history behind this are interesting, but probably boil down to the AthlonXP being released well after Quake3, and Carmack being rightly uninterested in patching an old game.

    If this is true, you are benchmarking two equally SSE-capable cpus against each other, using a game engine which enables SSE for the Intel cpu and *disables* SSE for the AMD cpu (apparently there's no simple way to force SSE recognition either), for no valid reason, other than the game is too old to know about the AMD cpu's capabilities. What would be even worse is if this same recognition problem carries over to the Athlon64 (I have no word on this) and to newer Quake3-based games.

    Again, assuming this is true, it removes any rationale for using a 3-year old game that a) few people play, b) gives ridiculously high scores, and c) unfairly handicaps AMD cpus as a benchmark specifically for comparing AMD cpus vs their Intel competitors in articles such as this one.

    So. Here are the recommendations I, as an interested Hardocp/Anand/Toms reader (and admitted AMD fan) am making to you and your site:

    1) Investigate this matter further, and write an article discussing it. And in particular discuss the relevance of this cpu issue to current Quake3-based games. Assuming there is in fact an Intel bias to Quake3-based benchmarking I think people would be very interested to learn about it. Apparently the SSE issue does indeed carry over to later games.

    2) Assuming there is a bias, discontinue using Quake3 as a cpu benchmark, and especially discontinue its use when comparing AMD vs Intel cpus. The game will never be patched to fix this issue, and using 3rd party fixes no one cares about is more or less pointless too. I'm referring to the dlls on this page:
    http://speedycpu.dyndns.org/opt/

    This guy is one of the programmers I referred to earlier, and he tells me the dlls do not enable SSE where it really matters anyway. The other was a student working at AMD writing assembly 3DNow code. The best solution is simply to retire this benchmark, just as Q1 and Q2 were retired.

    rms
  • Anonymous User - Thursday, October 02, 2003

    Not to be a ball buster, but in your paragraph:

    "For starters, at a 192mm^2, the Athlon 64 and Athlon 64 FX are well above AMD's "sweet spot" for manufacturing. When we last talked with AMD's Fred Weber, 100 - 120mm^2 die size is ideal for mass production given AMD's wafer size, yields and other manufacturing characteristics - and the Athlon 64 is close to twice that size"

    If you calculate it out, the 64FX is closer to 4x the die size of the "sweet spot". 192mm x 192mm = 36864 sq mm. The "sweet spot" is 100mm x 100mm = 10000 sq mm. Sorry, just figured I'd point that out.


    -Kooldino
  • Anonymous User - Wednesday, October 01, 2003

    Don't hold your breath! As far as MS is concerned, the Visual Studio compilers are still not truly 32-bit, let alone 64-bit. Without such compilers you cannot get 64-bit apps.

    Even the claim that WinXP was redesigned from the bottom up is not true. Well, it's designed from broken pieces on the ground hurriedly glued together. How come you still have System and System32 folders in c:\Windows??? Those are the 16-bit and 32-bit DLLs. Why the sudden Blue Screen of Death? Same old problem - conflicts between DLLs.

    Try writing code in Visual Studio and querying the WinOS version - for WinXP you will get WinNT as the response. How can a truly ground-up redesigned OS behave as such? Beats me.

    Until such time that the WinXX OS is truly 32-bit or 64-bit, you cannot have any true 64-bit apps running.

    The BIOS also has problems. nForce2 is still buggy and not properly fixed - can you trust nForce3? If those guys cannot fix up nForce2, then nForce3 is gonna have lots more problems.
  • Locutus4657 - Tuesday, September 30, 2003

    #32,

    On what exactly are you basing your arguments? You obviously have no experience or knowledge of Win64... If you did you'd realize 64 bit versions of Windows NT date back to NT4 on DEC Alpha hardware... You obviously have no clue whatsoever... Try posting a relevant argument next time... Try something based on benchmarks, and heck, next try even putting it into context as to how you use your computer...
  • Anonymous User - Monday, September 29, 2003

    all i know is i bought amd stock for less than $5 a few months ago and it's on the way to tripling in value. perhaps i'll use the profits to buy another one of their chips.
