An Early Christmas Present from AMD: More Registers

In our coverage of the Opteron we focused primarily on the major architectural enhancements the K8 core enjoyed over the K7 (Athlon XP): the on-die memory controller, improved branch predictor and more robust TLBs. For details on what these improvements do and why they matter, we'll direct you back to our Opteron coverage; the same information applies to the Athlon 64, as we are talking about the same fundamental core.

What we didn't spend much time talking about in our Opteron coverage was the benefit of additional registers, a benefit that is enabled in 64-bit mode. To understand why this is a benefit let's first discuss the role registers play in a microprocessor.

Although we think of main memory and cache as a CPU's storage areas, the often overlooked yet very important storage areas are the registers. Registers are individual storage locations that can hold numbers; these numbers can be values to add together, memory addresses where the CPU can find the next piece of information it will need, or temporary storage for the outcome of an operation. For example, in the following equation:

A = 2 + 4

The number 2, the number 4 and the resulting number 6 will all be stored in registers, with each number taking up one register. These high speed storage locations are located very close to the processor's functional units (the ALUs, FPUs, etc.) and are fixed in size. In a 32-bit x86 processor like the Athlon XP or Pentium 4, the majority of registers are 32 bits wide, meaning they can store a single 32-bit value. In 32-bit mode, the Athlon 64's general purpose registers are treated as being 32 bits wide, just like in its predecessor. However, in 64-bit mode all of the general purpose registers (GPRs) become 64 bits wide, and we gain twice as many GPRs. Why are more registers important, and why haven't AMD or Intel added more registers in the past? Let's answer these two questions next.
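As a rough illustration of what "fixed in size" means, the sketch below models a register as a slot that keeps only its lowest 32 or 64 bits. The `store` helper is a hypothetical name used purely for this example and does not correspond to any real instruction.

```python
def store(value, width_bits):
    """Keep only the low `width_bits` bits, as a fixed-width register would."""
    return value & ((1 << width_bits) - 1)

big = 2**32 + 5  # needs 33 bits, so it does not fit in a 32-bit register

in_32_bit_register = store(big, 32)  # the upper bit is lost
in_64_bit_register = store(big, 64)  # the whole value fits

print(in_32_bit_register, in_64_bit_register)
```

In 32-bit code a value like this has to be split across two registers, which is one more way wider registers reduce pressure on the register file.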

Take the example of A = 2 + 4 from before; in a microprocessor with at least 3 registers, this operation could be carried out without ever running out of registers. Internal to the microprocessor, the operation would be carried out something like this:

Store "2" in Register 1
Store "4" in Register 2
Store Register 1 + Register 2 in Register 3

After the operation has been carried out, all three values remain available in registers, so if we wanted to add 2 to the answer, the processor would simply add register 1 and register 3.
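The three steps above can be sketched in a few lines of Python. This is only a toy model: the dictionary and the names `r1`, `r2` and `r3` are assumptions made for illustration, not real x86 register names.

```python
# A toy register file with three slots (hypothetical names).
registers = {"r1": 0, "r2": 0, "r3": 0}

registers["r1"] = 2                                  # Store "2" in Register 1
registers["r2"] = 4                                  # Store "4" in Register 2
registers["r3"] = registers["r1"] + registers["r2"]  # Store Register 1 + Register 2 in Register 3

# All three values are still in registers, so adding 2 to the answer
# only needs register 1 and register 3; main memory is never touched.
answer_plus_two = registers["r1"] + registers["r3"]

print(registers["r3"], answer_plus_two)
```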

If the microprocessor had only 2 registers, however, and we ever needed to use the values 2 or 4 again, the result of A could not simply overwrite one of them; something would have to go to main memory. Things would change in the following manner:

Store "2" in Register 1
Store "4" in Register 2
Store Register 1 + Register 2 in a location in main memory

Here you can see that there is now an additional memory access that wasn't there before. What we haven't even taken into account is that the address of the location in main memory where the CPU will store the result also has to be placed in a register, so that the CPU knows where to tell the load/store unit to send the data. If we wanted to use that result for anything, the CPU would first have to evict a piece of data from one of the occupied registers and put it in main memory, and then go to main memory to retrieve the result and load it into the freed register. As you can see, the number of memory accesses increases tremendously, and the more memory accesses you have, the longer your CPU has to wait in order to get work done; thus you lose performance. Simple enough? Now here's where things get a little more complicated: why don't we just keep on adding more registers?
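To put rough numbers on the difference between the two cases, here is a minimal simulation that runs A = 2 + 4 followed by A = A + 2 and counts trips to main memory. It is a sketch under stated assumptions, not a model of any real CPU: the `run` function, the tuple-based instruction format and the least-recently-used eviction rule are all invented for this example, and the model knows nothing about which values are still needed later.

```python
from collections import OrderedDict

def run(program, num_registers):
    """Execute a tiny program; return (last_result, memory_accesses)."""
    regs = OrderedDict()  # variable -> value, least recently used first
    memory = {}           # values that have been spilled to main memory
    accesses = 0
    last = None

    def free_a_register(protected=()):
        nonlocal accesses
        if len(regs) < num_registers:
            return  # a register is still free, nothing to evict
        victim = next(name for name in regs if name not in protected)
        if victim not in memory:
            memory[victim] = regs[victim]  # spill: one store to memory
            accesses += 1
        del regs[victim]

    def read(name, protected=()):
        nonlocal accesses
        if name not in regs:
            free_a_register(protected)
            regs[name] = memory[name]      # reload: one load from memory
            accesses += 1
        regs.move_to_end(name)
        return regs[name]

    def write(name, value):
        if name not in regs:
            free_a_register()
        regs[name] = value
        regs.move_to_end(name)
        memory.pop(name, None)             # any copy in memory is now stale

    for op in program:
        if op[0] == "const":
            _, dest, value = op
        else:  # "add"
            _, dest, left, right = op
            left_value = read(left)
            value = left_value + read(right, protected=(left,))
        write(dest, value)
        last = value
    return last, accesses

PROGRAM = [
    ("const", "x", 2),        # Store "2"
    ("const", "y", 4),        # Store "4"
    ("add", "a", "x", "y"),   # A = 2 + 4
    ("add", "a", "a", "x"),   # add 2 to the answer
]

print(run(PROGRAM, 3))  # three registers
print(run(PROGRAM, 2))  # two registers
```

With three registers the toy machine never touches memory, while with two it needs three accesses (two spills and one reload) to reach the same answer.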

The beauty of the x86 Instruction Set Architecture (ISA) is that there are close to two decades of software that will run on even today's x86 microprocessors. One way this sort of backwards compatibility is maintained is by keeping the ISA the same from one microprocessor generation to the next; while this doesn't include things like functional units, cache sizes, or anything of that nature, it does include the number and names of registers. When a program is compiled to be run on an x86 CPU, the compiler knows that the architecture has 8 general purpose registers, and when translating the programmer's code into machine code that the CPU can understand, it references only those 8 general purpose registers. If Intel were to have 10 general purpose registers, anything compiled to use them on an Intel CPU would not be able to run on an AMD CPU, as the extra 2 general purpose registers would not be found on the AMD processor.

Microprocessor designers have gotten around this by introducing a technique known as register renaming. Only the allowed number of registers is visible to software, but the hardware can rename other internal registers to juggle data around without going to main memory. Register renaming does fix a large percentage of the issues associated with register conflicts, where a CPU simply runs out of registers and must start swapping to main memory. However, there are some cases where we simply need more registers.
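The idea behind register renaming can be sketched as follows. This is a heavily simplified illustration, not how the K8 or any other core implements it: the `rename` function, the `p0`, `p1`, ... physical register names and the (destination, sources) instruction format are all assumptions, and the sketch never returns physical registers to the free list as real hardware does.

```python
def rename(instructions, num_physical):
    """Give every write to a software-visible register a fresh internal register."""
    free = [f"p{i}" for i in range(num_physical)]  # internal (physical) registers
    mapping = {}                                   # visible name -> current physical register
    renamed = []
    for dest, sources in instructions:
        # Sources read whichever physical register currently holds the value.
        physical_sources = tuple(mapping[s] for s in sources)
        # The destination gets a fresh physical register, so reusing a visible
        # name does not force a wait on, or overwrite, the older value.
        mapping[dest] = free.pop(0)
        renamed.append((mapping[dest], physical_sources))
    return renamed

# Software reuses "eax" for two unrelated values.
code = [
    ("eax", ()),        # eax = first value
    ("ebx", ("eax",)),  # ebx = something computed from eax
    ("eax", ()),        # eax = second, unrelated value
    ("ecx", ("eax",)),  # ecx = something computed from the new eax
]

for line in rename(code, num_physical=8):
    print(line)
```

After renaming, the two uses of `eax` live in different internal registers, so the second pair of instructions no longer conflicts with the first pair even though software only ever named three registers.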

When AMD introduced their AMD64 architecture, they had a unique opportunity on their hands. Because no other x86 processor would be able to run 64-bit code anyway, they decided to double the number of general purpose and SSE/SSE2 registers made available in 64-bit mode. Since AMD didn't have to worry about compatibility, doubling the register count in 64-bit mode wasn't really a problem, and the majority of the performance increases you will see for 64-bit applications on the desktop will be due to the additional registers.
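For reference, the doubling can be written out by name. The register names below are the standard x86 and AMD64 ones; the Python list names themselves are just labels chosen for this illustration.

```python
# General purpose registers visible to 32-bit x86 code.
GPR_32 = ["eax", "ecx", "edx", "ebx", "esp", "ebp", "esi", "edi"]

# In 64-bit mode the original eight are widened (rax, rcx, ...) and
# eight new ones, r8 through r15, are added.
GPR_64 = ["rax", "rcx", "rdx", "rbx", "rsp", "rbp", "rsi", "rdi"] + [
    f"r{i}" for i in range(8, 16)
]

# SSE/SSE2 registers: xmm0-xmm7 in 32-bit mode, xmm0-xmm15 in 64-bit mode.
XMM_32 = [f"xmm{i}" for i in range(8)]
XMM_64 = [f"xmm{i}" for i in range(16)]

print(len(GPR_32), len(GPR_64), len(XMM_32), len(XMM_64))
```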

What is important to note is that although AMD has increased the number of visible registers in 64-bit mode, the number of internal registers for renaming has not increased, most likely because of cost/performance constraints.


121 Comments


  • Anonymous User - Wednesday, September 24, 2003 - link

    Nice review anand, however I am missing the P4 EE in a number of the tests, as previous post (#67) suggested.

    The Athlon 64/A64 FX appears to be a nice processor, for a shiny new design cpu the advantages were expectable.

    Some more 64bit tests, maybe a divx codec pre-compiled for 64bit in a test?

    As for the amd vs intel combat:
    The A64 and A64FX match up a lot better against the latest P4/P4 EE. I wouldn't have expected anything else.
    While the Prescott still lurks in the dark and I have a feeling Intel has something up their sleeve, I wouldn't call the Prescott a failure yet.
    If Intel plays nicely along, maybe they can create a cpu that beats the A64/A64FX in 32 (and just maybe in 64bit http://www.theinquirer.net/?article=11668).

    Either way, the more AMD and Intel compete with each other, the happier I am; after all, I end up paying less for either CPU.
  • Anonymous User - Wednesday, September 24, 2003 - link

    There is already a 64 bit port of America's Army available that doesn't need a 64 bit OS! http://www.amd.com/us-en/Corporate/VirtualPressRoo...
  • Anonymous User - Wednesday, September 24, 2003 - link

    Guys, if the AMD Athlon 64s don't succeed, I believe AMD would go under or be in financial trouble. So these new processors MUST sell well.
  • Anonymous User - Wednesday, September 24, 2003 - link

    I'm curious why RedHat Taroon, which is an enterprise-focused Linux distribution, was used for the 64-bit benchmarks and not RedHat GinGin64, which is more consumer-focused. Both are available from the RedHat FTP site.
  • Anonymous User - Wednesday, September 24, 2003 - link

    Several of the benchmarks left out the P4 Extreme scores (memory bandwidth, Content Creation 2003, and Virtual Studio 6.0 Compile) - was that a mistake or benchmarking per AMD's new guidelines? It's also funny that this is one of the first Anandtech CPU reviews without the full system specs documented (i.e. ECC memory vs non ECC, Intel branded motherboard vs. ASUS enthusiast motherboard, memory latency settings, etc.) - more AMD review guidelines?

    The way I see it, you can either spend the money on a whole new Athlon 64/FX CPU, motherboard, and memory outfit, or buy a P4 EE and stick it into any motherboard today that accepts a 3.2 GHz CPU - that sure beats having to buy a new motherboard and memory to get about the same level of performance average across the board.

    Even when 64-bit Windows comes out, does everyone really think that Microsoft and Bill Gates will really make it priced at mainstream levels and reduce the cost of the current 32-bit Windows XP so soon? I have my doubts but I guess we'll just have to wait and see.

    Another interesting thing to note from Tom's Hardware review is that the 64-bit code for AMD64 does run faster on 64-bit OS but if you read carefully, he says that the same program optimized for the P4 runs even faster on 32-bit OS. So, software companies will probably have to make a choice (unless they are big enough and make enough money to serve all markets): A - optimize 32-bit software to take advantage of the P4/Prescott and Hyperthreading using compilers that Intel provides, or B - compile 64-bit software for which there is still no mainstream OS and there are hardly any standard compilers available for and market them to the 500,000 or so people who will have the opportunity to own AMD64 desktop chips this year.

    Sure Intel has a problem with the Prescott heat dissipation right now but I don't think they will be sitting idle. Thermal interface technology is getting better all of the time and I wouldn't doubt if Intel isn't already making process improvements and/or implementing newer cooling methods. After all, it was Intel who came up with the heatspreader design for the current generation P4 that is now being used by Hammer chips.

    Once the Prescott on .09-micron technology hits the streets it will continue to be refined and improved upon so the clock speed will continue to increase. Imagine a Prescott EE CPU with 1MB L2 and 2MB L3 or more. What would be a real thorn in AMD's side would be if Intel makes a shrink of the current P4 onto the new .09-micron technology and increases the clock speed to the 4 GHz level (already achievable by some CPUs on the current .13-micron process) to keep pace with the Athlon 64/FX which is supposed to be AMD's next generation CPU. They could put a whole bunch of P4 die (even P4 EE die) on a 300 mm wafer and put a hurting on AMD until they can get their 90nm process and 300mm wafer process going. It is a scary possibility for AMD but could be reality for Intel - meanwhile, AMD still has to face the daunting task of converting to 300mm wafers and 90nm process at the same time to keep up. AMD says that they will start 90nm production in the first half of 2004, but then again, they've been promising hammer since 2001. But they have to do something because with their current situation of roughly 192 square millimeters per Athlon 64/FX die on a 200mm wafer yields a theoretical 73 die per wafer (per Tom's Hardware review). And I believe that AMD wants to put all of their products on the same line and differentiate them at the end - similar to the way Intel does with their Northwood/Celeron products (same die with certain cache and other things disabled) - so even the 256K L2 cache mainstream Athlon 64 comes out, it may still be the same size as all of the other Athlon 64/FX/Opterons.

    Hector Ruiz, Jerry Sanders and AMD as a whole have a very steep mountain in front of them to climb. Time will tell if they have what it takes to get up and over it. The first checkpoint for them will come in about 3 weeks in the form of Q3 earnings. By then we'll see how sales of their new CPUs are going and if their joint venture in FLASH pays off. (It didn't really make sense to me for them to lay off 2000 people at the beginning of the year to reduce costs and then turn around 2 quarters later and pick up 7000 people in the FLASH venture with Fujitsu which comes with more debt than earnings.) I'm not a betting man, but if I were, my money would be on AMD making 9 straight quarters of losses in a row. When Hector Ruiz came to office he vowed that AMD restructuring would make them hit break-even sometime around Q2 2003 but that never happened. There seems to be a pattern with promises made by AMD. I guess it's why his 3 million stock options which were granted at $16 are still under water.
  • Anonymous User - Wednesday, September 24, 2003 - link

    Apparently the FX series is unlocked multiplier, and mobos will be coming shortly that have multiplier selection options (read Anand's "weblog" entry)... I can't wait to see the results of a 13x 220 FX-51; now, THAT I might part with $800 to play with... Talk about insanely fast processors, the higher FX goes, the smaller Intel's lead gets.
    Whoever said that throwing more cache on the P4 core beat Hammer is just deluded; P4 is at the end of its line, AMD64 is just beginning. And once again, Intel supporters seem to grow rather silent when you point out that the on-die memory controller becomes significantly more powerful when the clock speeds ramp up; a 3.3GHz FX chip would be more than a match for a 3.4GHz Prescott, I'd think. Memory bandwidth advantage is a thing of the past for Intel, now it's up to AMD to shore up their lacking SSE/SSE2 performance and work on speeding up the core, as well as meeting the demand for such upgraded processors.

    I'm not normally so pro-AMD (though I support their products more than Intel's, just from a cost efficiency standpoint), but it's kind of hard to not be wowed by the muscle this chip can flex. I mean, this is the Day One marketed "prototype" and it's capable of matching its most recent and mature rivals, can you imagine what next year is going to look like?
  • sprockkets - Wednesday, September 24, 2003 - link

    The fact that Intel HAD to release an EE edition shows desperation at falling behind. Yeah, so the Prescott does look good. That and the 103W dissipation; it's unconfirmed whether that is on the 90nm process, but if it is, then that's pathetic.

    I can buy a Athlon 64 or FX, where is the EE like others said? And why was the NDA lifted on the same day as the AMD Athlon 64?

    Complaining about price? The 1.5GHz P4 cost around $1000 when it came out and was slower than a 1.0GHz P3, while the new Athlon 64 is always faster than the XP.
  • Anonymous User - Wednesday, September 24, 2003 - link

    lmfao

    AMD is still the underdog :P Always will be. You get what you pay for.

    AMD is for the guys who love to root for the underdog; in other words, fanboys. If you want solid, no hassle performance with top support .. you know where to put your money.

    I mean christ, Intel doesn't even have to design a next generation core to outmatch AMD's next core evolution -K8- they just simply tack on more cache ;) How sad is that?

    AMD fans, take a hint ... AMD's STILL playing catch-up with Intel. Read the article closely, and you'll see what I mean -the author hints at it so clearly as well- It's so easy to see. AMD has never had the advantage ;) Only clever marketing which most people pin as bad marketing on AMD's part. Quite the contrary, Stupid kids!
  • Anonymous User - Tuesday, September 23, 2003 - link

    #61 ya I know how they compare...the g5 is a mac no software support and the cartoons might pop out of the screen and eat you.....no comparison....
  • Anonymous User - Tuesday, September 23, 2003 - link

    Which chip is faster at Divx encoding?

    http://www.anandtech.com/cpu/showdoc.html?i=1884&a...
    OR
    http://www.hardocp.com/article.html?art=NTI0LDM=
    OR
    http://www.aceshardware.com/read.jsp?id=60000256
    OR
    http://www4.tomshardware.com/cpu/20030923/athlon_6...

