Final Words

While AMD touched on an incredibly vast amount of technology and data over the course of their three-hour webcast, the depth of each branch was not nearly enough to satisfy our tastes. We are in the process of scheduling briefings with as many AMD engineers as possible in order to get our questions answered, and we will certainly report on the details of our research as soon as we are able. Hopefully next week's Computex will be very fruitful on the AMD front.

We can't be too upset over the lack of detail though. In fact, for a day designed around presenting technology to analysts, AMD was pretty heavy on the technology and architecture. Now that they've officially confirmed some of the key features of their next gen processor and platform technology, we certainly hope they will be able to back up their claims with real architectural data on the hardware.

In the meantime, we can all dream sweet dreams over the possibilities AMD's Torrenza presents. Giving expansion cards the bandwidth and low latency of an HTX connection with the ability to support coherent HyperTransport will enable hardware vendors to create a new class of expansion card. Though AMD likes to call these "accelerators," we'll try our best to steer clear of buzzwords and marketing speak. Suffice it to say that giving hardware vendors the capability of accessing any CPU or memory in the system directly with cache coherency should really shake things up. The advantages are probably most apparent to the HPC market, where HTX can offer an easy and standard way to add custom FPGAs or very specialized hardware to a massive system. However, there are absolutely advantages out there for those who want to build hardware to really work in lock-step with the CPU.

This applies directly to companies like AGEIA with their PhysX card which, when used in a game, must communicate bi-directionally with the CPU before a frame can be sent to the GPU for rendering. Additionally, GPU makers could easily take advantage of this technology to tie the graphics card even more tightly to the CPU and system memory. In fact, this would serve to eliminate one of the largest differences between PCs and game consoles. The major advantage that still remains on console systems (aside from their limited need for backwards compatibility compared to the PC) is the distance from the CPU to the GPU. There is huge bandwidth and low latency between these two subsystems in a console, and many games are written to take advantage of (or even depend on) the ability to actively share the rendering workload between the CPU and GPU on a very low level. Won't it be ironic if we start seeing high performance Xbox 360 and PS3 emulators only a couple years after their release? This is the kind of thing that could make it possible.

With Torrenza and the introduction of 4x4 in the consumer space, it seems clear that AMD will be offering consumer level CPUs with multiple external coherent HyperTransport channels. As the lack thereof has been the only limitation keeping us from building multiple processor systems with consumer products, we have to wonder how AMD will really differentiate its server and workstation parts this time around. Out of the gate, the K8L Opteron will be a 4-core part, while the desktop chip will only have 2 cores, but eventually the desktop will support 4 cores as well. Will we start to see more specialized hardware "accelerators" on Opteron chips, or will we see more I/O oriented modules? Will HT-3's link unganging, which allows two 8-bit links for every 16-bit link, only be available on the high end parts? AMD's leadership in performance in the 2P and 4P workstation market has been very solid since the beginning of Opteron, and we are excited to see the ways AMD will attempt to continue this trend.

The final word on AMD's Analyst Day? Performance. It's pure and simple, and AMD is all about it. On the high end it's 4x4 or 8 coherent HT links, and on the mobile side, it's performance per watt. By 2008, AMD hopes that 1/3 of the marketplace will let the world know that they've still got solid performance for the mainstream at good prices as well. The next gen CPU market will certainly be exciting to watch.


40 Comments


  • peternelson - Saturday, June 3, 2006


    High end pcie cards are available if you look for them

    eg Areca 8 sata II onto 8x pci express
    eg Myrinet 10 gigabit ethernet onto 8x pci express
    Plenty of other examples.

    Also, witness the highend server boards many are now offering pcie as an option to the former server standard pci-x.

    PCIE is here to stay and is a must for anyone interested in a performance system.

    There is a direct mapping of pcie onto Hypertransport.

    There is already fast networking available on an HTX card.
  • lopri - Friday, June 2, 2006

    Correction: ..video card.. transfers data via PCI Express..
    ;)
  • saratoga - Friday, June 2, 2006

    quote:

    At a lower level, we have a block diagram of the compute core for K8L CPUs. Again, this diagram is a bit oversimplified, but we can see a few key features of the architecture. On the FP side, the CPU is able to handle 2x128-bit floating point or SSE operations per clock. While this isn't quite as flexible as Intel's Core with its 3 SSE units, AMD's K8L will be able to handle 4 double precision floating point operations per clock. (Current K8 chips can only do 1x128/2x64-bit SSE instructions per clock.)


    I'm a little confused. IIRC Core2 can do 2x 128 bit operations, each of which can be an add or multiply, but only one of which can be a load. AMD is restricting the actual operations to just 1 add and 1 multiply, but is removing the restriction on loads? So they'll be better able to feed the vector units than Intel, but have less flexibility once they've loaded?

    That doesn't make a whole lot of sense to me. I'd think if their SSE implementation was less aggressive, they would not have added more load units to feed it. Has AMD confirmed that there are only 1 add and 1 mult unit? Or is this a case of Intel designing a nice backend and not providing the front end resources to keep it fed?
  • mino - Friday, June 2, 2006

    Well, you're kinda right and wrong at the same time:)

    However intel's C2 frontend(from L2 up) is far superior to AMD's. And was such since Banias. Also intel's backend(execution units) is now on par but only recently Yonah and older were inferior to AMD's brute force 3-issue backend.

    AMD has kinda ingeniously hidden poor backend by IMC however for streamed(desktop) pseudo-random loads intel's huge cache structures mitigated this so they are forced to improve frontend(hard to do) and do some backend optimizations(easy) on the way. Well, they kinda knew they will have to do this since the 90's, they have just chosen to implement IMC and cater to the core itself in the next iteration.

    On the 2load units - without them the maxFLOPS would be n, real one x. With them(load units are relatively simple and low power compared to FPU's) they've got MmaxFLOPS around 2n AND real achievable one(IMHO) in the 1.2x~1.5x range. Pretty good ROI for the one added load unit.
  • saratoga - Saturday, June 3, 2006

    quote:

    However intel's C2 frontend(from L2 up) is far superior to AMD's. And was such since Banias. Also intel's backend(execution units) is now on par but only recently Yonah and older were inferior to AMD's brute force 3-issue backend.


    Could you explain how?

    quote:

    On the 2load units - without them the maxFLOPS would be n, real one x.


    No it wouldn't. CPUs have registers, so the number of load units has nothing to do with FLOPs. You could have just one load unit and still sustain an arbitrary number of FLOPs, provided you didn't mind using the same registers over and over again, which I suppose could be the case if you're doing an iterative approximation of a value.

    quote:

    With them(load units are relatively simple and low power compared to FPU's) they've got MmaxFLOPS around 2n AND real achievable one(IMHO) in the 1.2x~1.5x range. Pretty good ROI for the one added load unit.


    I don't think loads count as FLOPs, even if you're loading things to be used in FP operations, so having more load units doesn't increase max FLOPs.
  • mino - Friday, June 2, 2006

    Sory for the english, grammar wasn't my friend :)
  • DigitalFreak - Friday, June 2, 2006

    The smartest thing AMD ever did was create HyperTransport. There are so many cool uses for it! Intel, on the other hand, still insists on using their proprietary solutions.
  • DerekWilson - Friday, June 2, 2006

    HyperTransport was created by an open consortium.

    But you do have to remember that AMD implemented a proprietary coherent HT for use in SMP systems. They haven't always been open, even if their method was implemented on top of an open standard.

    I do agree that general use of HyperTransport makes I/O much easier on many levels, and was a very good move for AMD. And now that they are opening up cHT, some really cool things can happen -- if the industry is ready. :-)
  • Viditor - Saturday, June 3, 2006

    quote:

    HyperTransport was created by an open consortium

    Actually, it was created by AMD, it was developed by an open consortium.
    However coherent HT is still (at least until now) proprietary AMD...
  • Viditor - Saturday, June 3, 2006

    Doh! I need to read first, post second...already asked and answered. Sorry...
