An Early Christmas present from AMD: More Registers

In our coverage of the Opteron we focused primarily on the major architectural enhancements the K8 core enjoyed over the K7 (Athlon XP): the on-die memory controller, improved branch predictor and more robust TLBs. For the details of exactly what these improvements do and why they matter, we'll direct you back to our Opteron coverage; the same information applies to the Athlon 64, as we are talking about the same fundamental core.

What we didn't spend much time talking about in our Opteron coverage was the benefit of additional registers, a benefit that is enabled in 64-bit mode. To understand why, let's first discuss the role registers play in a microprocessor.

Although we think of main memory and cache as a CPU's storage areas, an often overlooked yet very important set of storage locations are the registers. Registers are individual storage locations that can hold numbers; these numbers can be values to add together, memory addresses where the CPU can find the next piece of information it will need, or temporary storage for the outcome of one operation. For example, in the following equation:

A = 2 + 4

The number 2, the number 4 and the resulting number 6 will all be stored in registers, with each number taking up one register. These high-speed storage locations are located very close to the processor's functional units (the ALUs, FPUs, etc.) and are fixed in size. In a 32-bit x86 processor like the Athlon XP or Pentium 4, the majority of registers are 32 bits wide, meaning each can store a single 32-bit value. In 32-bit mode, the Athlon 64's general purpose registers are treated as being 32 bits wide, just like in its predecessor. However, in 64-bit mode all of the general purpose registers (GPRs) become 64 bits wide, and we gain twice as many GPRs. Why are more registers important, and why haven't AMD or Intel added more registers in the past? Let's answer these two questions next.

Take the example of A = 2 + 4 from before; in a microprocessor with three or more registers, this operation could be carried out successfully without ever running out of registers. Internal to the microprocessor, the operation would be carried out something like this:

Store "2" in Register 1
Store "4" in Register 2
Store Register 1 + Register 2 in Register 3

After the operation has been carried out, all three values remain available for use, so if we wanted to add 2 to the answer, the processor would simply add register 1 and register 3.
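
To make this a bit more concrete, here is the same sequence written as a small C function (a minimal sketch; the function and variable names are ours, not taken from any real program). With three or more registers free, an optimizing compiler can keep every value in a register and never touch main memory:

/* A = 2 + 4, arranged so each value maps onto one register, just like
   the pseudocode above. With enough registers, x, y and a can all live
   in registers for their entire lifetime - no memory accesses needed. */
int sum_example(void)
{
    int x = 2;      /* "Store 2 in Register 1" */
    int y = 4;      /* "Store 4 in Register 2" */
    int a = x + y;  /* "Register 1 + Register 2 -> Register 3" */
    return a + x;   /* reuse the 2 still sitting in Register 1 */
}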

If the microprocessor only had 2 registers, however, then whenever we needed to use the values 2 or 4 again, something would have to be stored in main memory rather than kept in a register. Things would change in the following manner:

Store "2" in Register 1
Store "4" in Register 2
Store Register 1 + Register 2 in a location in main memory

Here you can see that there is now an additional memory access that wasn't there before, and we haven't even taken into account that the location in main memory where the CPU will store the result also has to be placed in a register, so that the CPU knows where to tell the load/store unit to send the data. If we wanted to use that result for anything, the CPU would first have to evict a piece of data from one of the occupied registers out to main memory, then go to main memory to retrieve the result and store it in the freed register. As you can see, the number of memory accesses increases tremendously, and the more memory accesses you have, the longer your CPU has to wait to get work done - thus you lose performance. Simple enough? Now here's where things get a little more complicated: why don't we just keep on adding more registers?
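
To see the same problem in real code, consider a routine that keeps more values alive at once than the register file can hold (a hypothetical function written purely for illustration). The compiler has no choice but to spill some of those values to memory and reload them later, which is exactly the extra traffic described above:

/* Hypothetical example of register pressure. A 32-bit x86 compiler has
   only 8 general purpose registers to work with (and a few of those are
   spoken for), so keeping this many values live at once forces it to
   spill some of them to the stack and read them back later. */
int many_live_values(const int *v)
{
    int a = v[0], b = v[1], c = v[2], d = v[3];
    int e = v[4], f = v[5], g = v[6], h = v[7];
    int i = v[8], j = v[9], k = v[10], l = v[11];

    /* All twelve values are still needed here; on a 32-bit build some of
       them will have been spilled to memory and must be reloaded. With
       16 GPRs in 64-bit mode, far fewer (if any) spills are required. */
    return (a + b) * (c + d) + (e + f) * (g + h) + (i + j) * (k + l);
}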

The beauty of the x86 Instruction Set Architecture (ISA) is that close to two decades of software will run on even today's x86 microprocessors. One way this sort of backwards compatibility is maintained is by keeping the ISA the same from one microprocessor generation to the next; while this doesn't cover things like functional units or cache sizes, it does cover the number and names of registers. When a program is compiled to run on an x86 CPU, the compiler knows that the architecture has 8 general purpose registers, and when translating the programmer's code into machine code that the CPU can understand, it references only those 8 general purpose registers. If Intel were to add 2 more general purpose registers, anything compiled to use them would not be able to run on an AMD CPU, as the extra 2 general purpose registers would not be found on the AMD processor.

Microprocessor designers have gotten around this by introducing a technique known as register renaming, which makes only the allowed number of registers visible to software while the hardware renames a larger pool of internal registers to juggle data around without going to main memory. Register renaming does fix a large percentage of the issues associated with register conflicts, where a CPU simply runs out of registers and must start swapping to main memory; however, there are some cases where we simply need more registers.
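
To give a feel for how this works (a toy sketch only; the register counts are illustrative and do not correspond to any real CPU), register renaming boils down to a table that maps the few architectural register names software is allowed to use onto a larger pool of physical registers inside the chip:

#include <stdio.h>

#define ARCH_REGS 8    /* registers x86 software is allowed to name */
#define PHYS_REGS 40   /* internal rename registers (illustrative)  */

static int rename_table[ARCH_REGS]; /* architectural -> physical register */
static int next_phys = ARCH_REGS;   /* naive allocator for this sketch    */

/* Give a new result written to architectural register 'arch' its own
   physical register, so it doesn't clobber an older value that other
   in-flight instructions may still need to read. */
static int rename_dest(int arch)
{
    rename_table[arch] = next_phys++ % PHYS_REGS;
    return rename_table[arch];
}

int main(void)
{
    for (int r = 0; r < ARCH_REGS; r++)
        rename_table[r] = r;                 /* initial 1:1 mapping */

    /* Two back-to-back writes to the same architectural register
       (say, EAX) land in different physical registers. */
    printf("EAX write #1 -> physical register %d\n", rename_dest(0));
    printf("EAX write #2 -> physical register %d\n", rename_dest(0));
    return 0;
}

A real CPU does this mapping on the fly for every instruction and reclaims physical registers once their values are no longer needed; the point of the sketch is simply that software never sees more than the 8 (or, in 64-bit mode, 16) architectural names.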

When AMD introduced their AMD64 architecture, they had a unique opportunity on their hands. Because no other x86 processor would be able to run 64-bit code anyway, they decided to double the number of general purpose and SSE/SSE2 registers made available in 64-bit mode. Since AMD didn't have to worry about compatibility, doubling the register count in 64-bit mode wasn't really a problem, and the majority of the performance increases you will see for 64-bit applications on the desktop will be due to the additional registers.

What is important to note is that although AMD has increased the number of visible registers in 64-bit mode, the number of internal registers available for renaming has not increased - most likely due to cost/performance constraints.

122 Comments

  • AgaBooga - Tuesday, September 23, 2003 - link

    Where is the P4EE in the memory tests?
  • Anonymous User - Tuesday, September 23, 2003 - link

Personally this was rather anti-climactic for me. It's certainly not the Intel killer that all the hype proclaimed. AMD for business, Intel for content, and a toss-up for gaming. Same as it has been for a while.
  • Anonymous User - Tuesday, September 23, 2003 - link

    #27 & #28 (amd fanboy double post)

    It SHOULD be up there with the P4EE because the PRESCOTT will be coming right around the corner! Face it, AMD did not put out a killer and Intel is sitting pretty in 2004.
  • Anonymous User - Tuesday, September 23, 2003 - link

#20 are you serious? Did you just comment in the forum without looking at the review, or did you actually look at it? AMD is not "lagging" behind Intel. They are right up there with them. Look at the benchmarks and you will see the CURRENTLY AVAILABLE Athlon64 easily matches a NOT CURRENTLY AVAILABLE P4EE.
  • Anonymous User - Tuesday, September 23, 2003 - link

AMD, Pamela Anderson called. She wants to know how she can get a bust as big as yours. I have two words for AMD: "Segway" and "Scooter."
  • Anonymous User - Tuesday, September 23, 2003 - link

    nForce3 performance bug

    Time to re-do the benchmarks, Anand.

    Your FX-51 benchmarks are inaccurate.

    http://www20.tomshardware.com/cpu/20030923/athlon_...

    Nvidia: NForce-3 Bug

    The extremely low AGP performance of the NForce3 can be clearly attributed to problems with the HyperTransport channel interface to the Northbridge. That is proven by the benchmark results and the performance differences of up to 33.2 percent. Details about this can be found in the benchmark section of this article.

    Originally, Nvidia had planned to also integrate a SATA RAID controller in the Southbridge. Although the controller is included in the current NForce 3, Nvidia deactivated this feature. The reason was that error-free operation was not possible. For this reason, we decided to use additional boards based on the VIA K8T800 chipset.

Nvidia (with its Athlon 64 FX and GeForce FX related naming) may be a more high-profile partner for AMD than VIA. However, we would point out that VIA, with the K8T800 chipset, currently offers a clearly better solution for the Athlon 64.

  • Anonymous User - Tuesday, September 23, 2003 - link

    What is that smell?

    AMD just let loose with a huge turd!
  • Anonymous User - Tuesday, September 23, 2003 - link

#4 You may be right (I don't think so, but let's say you are), but then ask yourself - where is the Pentium 4 Extreme Edition? There is no mention of this CPU on Intel's web site at all; there is no datasheet and there are no batch numbers. Today it is only a prototype CPU, just as Prescott is. They managed to build a few Gallatin B1 cores that are able to work at this frequency and then remarked them. This CPU is not a reality; only OEMs can buy it in very limited quantities, end users can't. I think a 3 GHz Athlon 64 FX prototype on 90nm would be far and away the best performer in this review... and it would be the same policy as with this Pentium 4 Extreme Edition.
  • Anonymous User - Tuesday, September 23, 2003 - link

    #17 Answers:

1. The Athlon 64's memory controller is very fast, as you can see from the benchmarks; dual channel is only needed in some situations to give decent performance. HT operates at 800 MHz with DDR and a 16-bit link, thus giving 3.2 GB/s each way (6.4 GB/s total). Not so bad for a bus that only handles I/O and AGP.

3. S754 is a lower-end platform while S940 is an Opteron platform. AMD will introduce S939 early next year and will continue to produce CPUs for all of those sockets. The S940 A64 FX will, however, disappear at the end of next year.

    6. HyperTransport "Tunnel" system allows for practically unlimited number of chipset combinations, thus a PCI Express will only require to add another Tunnel or integrate it into current chipsets.
