An Early Christmas Present from AMD: More Registers

In our coverage of the Opteron we focused primarily on the major architectural enhancements the K8 core enjoys over the K7 (Athlon XP): the on-die memory controller, improved branch predictor and more robust TLBs. For details on what these improvements do and why they matter, we'll direct you back to our Opteron coverage; the same information applies to the Athlon 64, as we are talking about the same fundamental core.

What we didn't spend much time talking about in our Opteron coverage was the benefit of additional registers, a benefit that is enabled in 64-bit mode. To understand why this is a benefit let's first discuss the role registers play in a microprocessor.

Although we think of main memory and cache as a CPU's storage areas, there is an often overlooked yet very important set of storage areas: the registers. Registers are individual storage locations that can hold numbers; these numbers can be values to add together, memory addresses where the CPU can find the next piece of information it will need, or temporary storage for the outcome of an operation. For example, in the following equation:

A = 2 + 4

The number 2, the number 4 and the resulting number 6 will all be stored in registers, with each number taking up one register. These high speed storage locations are located very close to the processor's functional units (the ALUs, FPUs, etc.) and are fixed in size. In a 32-bit x86 processor like the Athlon XP or Pentium 4, the majority of registers are 32 bits wide, meaning each can store a single 32-bit value. In 32-bit mode, the Athlon 64's general purpose registers are treated as being 32 bits wide, just like in its predecessor. However, in 64-bit mode all of the general purpose registers (GPRs) become 64 bits wide, and we gain twice as many GPRs. Why are more registers important and why haven't AMD or Intel added more registers in the past? Let's answer these two questions next.

Take the example of A = 2 + 4 from before; in a microprocessor with at least 3 registers, this operation could be carried out without ever running out of registers. Internal to the microprocessor, the operation would be carried out something like this:

Store "2" in Register 1
Store "4" in Register 2
Store Register 1 + Register 2 in Register 3

After the operation has been carried out, all three values remain available in registers, so if we wanted to add 2 to the answer, the processor would simply add register 1 and register 3.
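As a rough illustration of the three-register case, the steps can be written out as a toy model. The register names R1 to R3 are made up for this sketch and are not real x86 register names.

```python
# A toy register file with three slots, mirroring the steps above.
# R1..R3 are illustrative names, not real x86 registers.
registers = {"R1": 0, "R2": 0, "R3": 0}

registers["R1"] = 2                                   # Store "2" in Register 1
registers["R2"] = 4                                   # Store "4" in Register 2
registers["R3"] = registers["R1"] + registers["R2"]   # Register 3 now holds A

# All three values are still in registers, so adding 2 to the answer
# never has to touch main memory.
a_plus_two = registers["R1"] + registers["R3"]
print(registers["R3"], a_plus_two)  # prints: 6 8
```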

If the microprocessor only had 2 registers, however, there would be no free register left for the result. If we ever needed to use the values 2 or 4 again, one of them would have to be saved to main memory before being overwritten by the resulting value of A; otherwise the result itself has to go to main memory. Things would change in the following manner:

Store "2" in Register 1
Store "4" in Register 2
Store Register 1 + Register 2 in a location in main memory
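The two instruction sequences can be compared with a small simulation that counts trips to main memory. This is only a sketch under simplifying assumptions: `ToyMachine` and its methods are hypothetical, every memory read or write counts as one access, and reusing the result in the two-register case is assumed to cost one eviction plus one reload.

```python
class ToyMachine:
    """Hypothetical machine that counts accesses to main memory."""

    def __init__(self, num_registers):
        self.regs = [None] * num_registers
        self.memory = {}
        self.memory_accesses = 0

    def write(self, address, value):
        # Send a value out to main memory.
        self.memory[address] = value
        self.memory_accesses += 1

    def read(self, reg, address):
        # Bring a value from main memory into a register.
        self.regs[reg] = self.memory[address]
        self.memory_accesses += 1


def with_three_registers():
    m = ToyMachine(3)
    m.regs[0] = 2
    m.regs[1] = 4
    m.regs[2] = m.regs[0] + m.regs[1]    # A stays in a register
    result = m.regs[0] + m.regs[2]       # A + 2, straight from registers
    return result, m.memory_accesses


def with_two_registers():
    m = ToyMachine(2)
    m.regs[0] = 2
    m.regs[1] = 4
    m.write("A", m.regs[0] + m.regs[1])  # no free register: A goes to memory
    m.write("spill", m.regs[1])          # evict the 4 to make room for A
    m.read(1, "A")                       # reload A into the freed register
    result = m.regs[0] + m.regs[1]       # A + 2
    return result, m.memory_accesses


print(with_three_registers())  # (8, 0)
print(with_two_registers())    # (8, 3)
```

Both versions compute the same answer; the only difference in this model is the three extra memory accesses when registers run out.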

Here you can see that there is now an additional memory access that wasn't there before, and we haven't even taken into account that the address of that memory location must itself be held in a register so that the CPU can tell the load/store unit where to send the data. If we then wanted to use the result for anything, the CPU would first have to evict a piece of data from one of the occupied registers to main memory, and then load the result from main memory into the freed register. As you can see, the number of memory accesses increases tremendously; and the more memory accesses you have, the longer your CPU has to wait in order to get work done - thus you lose performance. Simple enough? Now here's where things get a little more complicated: why don't we just keep on adding more registers?

The beauty of the x86 Instruction Set Architecture (ISA) is that there are close to two decades of software that will run on even today's x86 microprocessors. One way this sort of backwards compatibility is maintained is by keeping the ISA the same from one microprocessor generation to the next; while the ISA doesn't cover things like functional units or cache sizes, it does include the number and names of registers. When a program is compiled to run on an x86 CPU, the compiler knows that the architecture has 8 general purpose registers, and when translating the programmer's code into machine code that the CPU can understand, it references only those 8 registers. If Intel were to offer 10 general purpose registers, software compiled to use all 10 would not run on an AMD CPU, as the extra 2 registers would not be found on the AMD processor.

Microprocessor designers have gotten around this by introducing a technique known as register renaming. Only the allowed number of registers is visible to software, but the hardware can map those names onto a larger pool of internal registers to juggle data around without going to main memory. Register renaming fixes a large percentage of the issues associated with register conflicts, where a CPU simply runs out of registers and must start swapping to main memory. There are, however, some cases where we simply need more registers.
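The idea behind renaming can be sketched as a lookup table from the few register names software sees to a larger pool of internal registers. This is a heavily simplified illustration, not a description of how the K8 implements it: the `Renamer` class is hypothetical, the pool size of 6 is arbitrary, and the sketch never returns internal registers to the free pool.

```python
class Renamer:
    """Toy rename table: few visible names, more internal (physical) registers."""

    def __init__(self, visible_names, num_physical):
        self.free = list(range(num_physical))
        # Each visible register starts out mapped to one internal register.
        self.table = {name: self.free.pop(0) for name in visible_names}

    def rename(self, dest, sources):
        # Sources are read through the current mapping.
        src_phys = [self.table[s] for s in sources]
        # Every new write to a visible name gets a fresh internal register,
        # so an older value under the same name is not overwritten.
        new_phys = self.free.pop(0)
        self.table[dest] = new_phys
        return new_phys, src_phys


r = Renamer(["eax", "ebx"], num_physical=6)
first = r.rename("eax", ["ebx"])    # eax = something computed from ebx
second = r.rename("eax", ["ebx"])   # an unrelated reuse of the name eax
print(first, second)                # (2, [1]) (3, [1])
```

Software only ever named `eax` and `ebx`, yet the two writes to `eax` landed in different internal registers, so the second does not have to wait for the first to be finished with.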

When AMD introduced their AMD64 architecture, they had a unique opportunity on their hands. Because no other x86 processor would be able to run 64-bit code anyway, they decided to double the number of general purpose and SSE/SSE2 registers made available in 64-bit mode. Since AMD didn't have to worry about compatibility, doubling the register count in 64-bit mode wasn't really a problem, and the majority of the performance increases you will see for 64-bit applications on the desktop will be due to the additional registers.
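To see why the register count is baked into compiled code, it helps to look at how a register-to-register instruction is encoded. The sketch below goes beyond what we covered above and relies on the standard x86 encoding as we understand it: each register is named by a 3-bit field in the ModRM byte (hence 8 registers), and in 64-bit mode a REX prefix byte supplies one extra bit per field (hence 16). The helper function names are our own, and only the plain `ADD` register form (opcode 0x01) is modeled.

```python
# Register numbers as used in x86 instruction encoding.
LEGACY = ["eax", "ecx", "edx", "ebx", "esp", "ebp", "esi", "edi"]


def modrm(reg, rm):
    """Register-to-register ModRM byte: mod=11, then two 3-bit register fields."""
    assert 0 <= reg < 8 and 0 <= rm < 8   # 3 bits can only name 8 registers
    return 0b11000000 | (reg << 3) | rm


def encode_add32(dst, src):
    """ADD dst, src between two of the 8 legacy 32-bit registers."""
    return bytes([0x01, modrm(LEGACY.index(src), LEGACY.index(dst))])


def encode_add64(dst_num, src_num):
    """ADD between any two of the 16 64-bit registers, given numbers 0-15."""
    # REX prefix: 0100WRXB. W selects 64-bit operands; R and B carry the
    # fourth bit of the source and destination register numbers.
    rex = 0x48 | ((src_num >> 3) << 2) | (dst_num >> 3)
    return bytes([rex, 0x01, modrm(src_num & 7, dst_num & 7)])


print(encode_add32("eax", "ecx").hex())  # 01c8    (add eax, ecx)
print(encode_add64(8, 9).hex())          # 4d01c8  (add r8, r9)
```

A 32-bit processor has no way to interpret the extra prefix bits, which is why the new registers could only appear alongside a new mode.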

What is important to note is that although AMD has increased the number of visible registers in 64-bit mode, the number of internal registers for renaming has not increased, most likely because of cost/performance constraints.

122 Comments

  • Anonymous User - Thursday, September 25, 2003 - link

    The Athlon64 FX doesn't have a multiplier lock either, but we never saw any results from that. Also I don't think a chip overclocking well means it's designed for "higher clock speeds".
  • Anonymous User - Thursday, September 25, 2003 - link

    toms just revised their review, "Update Sept 24,2003: Unfortunately we have made a mistake in the original article: In addition to the official P4 EE 3.2GHz we had included benchmark scores of the P4 Extreme 3.4GHz and 3.6GHz. These values were planned for a future THG article and were not intended to be included here. We would like to apologize especially to those readers who misinterpreted our charts. The two bars of the P4 Extreme 3.4GHz and 3.6GHz have now been removed. However, this issue does not affect our conclusion as we have only compared the official P4 3.2GHz EE to all other test candidates in our original article. For your information: The press sample of the P4 Extreme provided by Intel does not have a multiplier lock and is already designed for higher clock speeds. "
  • Anonymous User - Thursday, September 25, 2003 - link

    #81
    I also question why toms have a review to overclock P4 3.2 EE to 3.6 to win every performance chart. Is it fair to AMD? I like Intel CPU but I also like fair review.
  • Anonymous User - Thursday, September 25, 2003 - link

    AMD needs to almost give this thing away so that it can sell well thus attracting a flood of 64 big developers. I think they should even do this to the detriment of their profit margins because if this doesnt sell well then all the software wont be developed. Its kinda like the chicken and the egg here and I think AMD should take a beating now in terms of $ to get this thing out and get 64 bit in the hands of the people. If everyone has it the software will follow.
  • Anonymous User - Thursday, September 25, 2003 - link

    Logic dictates that people whom use the term "fanboy" are mentally disturbed persons whom feel the need to categorize others into a certain group to make themselves feel better. On a side note though I think the Athlon64 3200+ is winner given its current availability, price, and performance. I’m just curious as to how far AMD hopes to scale the processor for the remainder of the year as though I already know there will be a 3400+ release in short time, I am wondering if there will be a 3600+ release in anticipation of Prescott. I’m also curious as to how quickly AMD will transition it to 90nm as I’m thinking one of the main reasons AMD hasn’t really made full effort in mass producing K8 processors are the manufacturing costs at 130nm. Either way it’s nice to see such a chip out, especially at the price it is being quoted for (though it seems some people are having fits that they can’t buy A64s for $100).
  • Anonymous User - Thursday, September 25, 2003 - link

    I think Intel is faring pretty well considering that AMD has reduced latency four fold with its integrated memory controller, incresed transistor performance by %30 with SOI, and doubled cache to 1MB. I think Intel will only close the gap with the upcomng Prescott but will pull ahead with LGA 775 Prescott and Grantsdale with PCI Express. Fanboys, save your speeches. Argue with logic.
  • Anonymous User - Wednesday, September 24, 2003 - link

    When is somebody going to come up with "folding" for people. We could use all the extra time people have on their hands debating what chip is better, to access their brain power to come up with cures for world hunger, A.I.D.S and introducing fanboys to fangirls. That being said, I appreciate all your opinions in helping me decide what chip to buy. Taking in to account the proccesing power I need for work and play, I have decided to buy an Xbox and a typewriter and forgo the 64 or P4EE.
  • Anonymous User - Wednesday, September 24, 2003 - link

    THIS FANBOY CRAP HAS TO STOP HOW NERDY CAN U BE??i am glad i am not so much into computers as most of u ;)...watch if one of these companies go out of business u see the survivor amd or intel making poor performing cpu's sold for $$$$ with a "take it of leave it" attitude...QUIT THE FANBOY CRAP truth is these companies don't give a shite about you only that little friend in your pocket that holds ur money
  • sprockkets - Wednesday, September 24, 2003 - link

    The PM people believe that since they see the current situation in that Intel pays everyone not to use AMD, and that makes them a niche market. It's not due to AMD being slower or more error prone. Let's face it, Intel is bigger and has more to deal with, but as I've said before, they also can waste millions, perhaps a billion or so on Itanium and it's going nowhere. Perhaps it will now, but it's pretty stupid to see why. Sure it doesn't suffer from x86 legacy code. But look at what it took to get there, redoing software, apps, hardware, and a huge 400mm die. The Alpha people look to turn it into something, but that's alpha that made it something, otherwise it sucks.

    It's pretty stupid to argue here that the P4 3.2 ghz is faster or the emergency (good one :) ) edition is, the Xenon or even Itanium architecture with the cpus sharing a FSB and memory via a hub or northbridge architecture sucks compared to the hyper transport architecture the Opteron uses, and no amount of clock speed or memory speed is going to change that.

    I wonder if Intel can now use it's own Itaniums instead of Alphas to run it's chip production line.
  • Anonymous User - Wednesday, September 24, 2003 - link

    #91, That would be an expected outcome when half the tests are media/encoding benchmarks which are optimized for HT/SSE2. Not that there is anything wrong with that, just a simple note.
