Original Link: http://www.anandtech.com/show/1655

Along with Scott Wasson of The Tech Report and Kyle Bennett of HardOCP, we recently had some time to sit down and talk with AMD's CTO Fred Weber about his vision of the future of microprocessors.  We took the opportunity to compare and contrast his vision with our discussions from this year's Spring IDF that we had written about. 

Be sure to read our IDF Spring 2005 - Predicting Future CPU Architecture Trends article before continuing as it provides a lot of background information necessary for this piece. 

The ILP/TLP Debate in AMD's Shoes

When we talked to Intel at IDF, we had the distinct impression that the focus on improving microprocessor performance as a whole had shifted pretty significantly from ILP to TLP. To put it plain and simple, making individual cores faster was no longer top priority; rather, getting multiple cores to work together was the new focus. 

Weber's stance on ILP vs. TLP tended to agree with what we had heard from Intel; TLP is the future and using ILP to increase performance is at a point of extremely diminished returns.  That being said, we asked Fred where he thought the improvements in ILP would be going forward and he responded with the following four areas:
  1. Frequency
  2. Reducing Memory Latency
  3. Instruction Combining
  4. Branch Prediction Latency
Fred's number one source of single core, single thread performance gains was clock frequency, so we will inevitably see clock speeds go up as time goes on.  It is quite possible that, combined with a reduction in branch prediction latency, future versions of the Athlon 64 will use a lengthened pipeline to reach higher operating frequencies.  If paired with Prescott-caliber branch predictors, a somewhat deeper pipelined K8 would provide additional frequency headroom without too much worry. 

Behind clock frequency, Weber saw reducing memory latency as the other major way of increasing single core performance.  Reducing memory latency in this sense basically means two things:
  • higher levels of cache hierarchy, and
  • better prefetching. 
More than once during our conversations with Weber, it became clear that future multi-core AMD processors will continue to have their L1 and L2 caches separate, but a shared L3 cache will eventually be introduced to help reduce memory latency and keep those cores fed. 
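
To see why an extra cache level can lower effective memory latency, here is a minimal sketch of the standard average memory access time calculation. Every latency and hit rate below is a made-up illustrative number, not an AMD figure, and the function name `amat` is our own.

```python
def amat(levels, memory_latency):
    """Average memory access time, in cycles.

    levels: list of (hit_latency_cycles, hit_rate) tuples, L1 outward.
    memory_latency: cycles to reach main memory after the last cache misses.
    Each level's latency is paid by every access that reaches that level.
    """
    total = 0.0
    reach_probability = 1.0  # fraction of accesses that get this far
    for hit_latency, hit_rate in levels:
        total += reach_probability * hit_latency
        reach_probability *= 1.0 - hit_rate
    return total + reach_probability * memory_latency


# Hypothetical values chosen only for illustration.
L1 = (3, 0.95)
L2 = (12, 0.80)
L3 = (40, 0.50)   # assumed shared L3
MEMORY = 200

without_l3 = amat([L1, L2], MEMORY)
with_l3 = amat([L1, L2, L3], MEMORY)
print(f"without L3: {without_l3:.2f} cycles, with L3: {with_l3:.2f} cycles")
```

With these assumed inputs, the average drops from about 5.6 to about 5.0 cycles, because half of the accesses that miss in L2 no longer pay the full trip to main memory. Real gains depend entirely on the actual hit rates and latencies.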

To Weber's second point, the use of helper threads (compiler or application generated threads that go out and work on prefetching useful data into cache before it's requested) will also improve single core performance.  Intel has been talking about using helper threads since before Hyper Threading, but there is still no indication of when we can expect a real world implementation of helper threads. 

The topic of instruction combining was also interesting because it is something that we have only seen used in the Pentium M (Micro-Ops Fusion).  Weber couldn't elaborate on an AMD implementation of some form of instruction combining, but we did get the distinct impression that it's something that's in the cards going forward.  It looks as if elements from both AMD's and Intel's present day architectures will shape tomorrow's designs. 

In the end, Fred left us with the following: if you used to see single core performance improving at a rate of 40% every 12 - 18 months, it will now improve at about half that rate for the foreseeable future.
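
To get a feel for what halving the rate means over time, here is a small compounding sketch. It assumes "half that rate" means 20% per period instead of 40%, and it uses an 18-month period and a six-year horizon purely as example inputs; none of these specifics came from Weber.

```python
def cumulative_speedup(gain_per_period, period_months, horizon_months):
    """Compound a per-period performance gain over a time horizon."""
    periods = horizon_months / period_months
    return (1.0 + gain_per_period) ** periods


HORIZON = 72  # six years, an arbitrary example horizon
PERIOD = 18   # upper end of the 12 - 18 month range

old_pace = cumulative_speedup(0.40, PERIOD, HORIZON)
new_pace = cumulative_speedup(0.20, PERIOD, HORIZON)  # assumed "half that rate"
print(f"old pace: {old_pace:.2f}x, new pace: {new_pace:.2f}x")
```

Under those assumptions, six years at the old pace yields roughly a 3.8x faster core, while the new pace yields roughly 2.1x, which is why the extra performance has to come from additional cores.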

Weber's Thoughts on Cell

Ever since its official introduction, we've been going around asking everyone we ran into about their thoughts on IBM/Sony/Toshiba's Cell microprocessor, and Fred Weber was no different.  Surprisingly enough, Weber's response to the Cell question was quite similar to Justin Rattner's take on Cell.  Weber saw two problems with Cell:
  1. Cell is too far ahead of its time in terms of manufacturing, and
  2. Cell is a bit too heterogeneous in its programming model, referring to Cell's approach as both asymmetric and heterogeneous (we'll explain this in a bit).
As we concluded in our Cell investigation, the approach to microprocessor design of having one general purpose core surrounded by several smaller cores is not one that is unique to Cell.  Intel has now publicly stated that this heterogeneous multi-core approach is, at a high level, something that they will be pursuing in the next decade.  The problem is that to be produced on a 90nm process, the individual cores that make up Cell have to be significantly reduced in complexity, which Weber saw as an unreasonable sacrifice at the current stage. 

The next problem that Weber touched on was the Cell approach to a heterogeneous multi-core microprocessor.  To Fred Weber, a heterogeneous multi-core microprocessor is one that has a collection of cores, each one of which can execute the same code, but some can do so better than others - the decision of which to use being determined by the compiler.  Weber referred to his version of heterogeneous multi-core as symmetric in this sense.  Cell does not have this symmetric luxury; instead, not all of its cores are equally capable and thus, in Weber's opinion, Cell requires the software to know too much about its architecture to perform well.  The move to a more general purpose, symmetric yet heterogeneous array of cores would require each core on Cell to get bigger and more complex, which directly relates back to Weber's (and our) first problem with Cell: it is too far ahead of its time from a manufacturing standpoint. 
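
Weber's distinction can be illustrated with a toy model. The sketch below is our own simplification, not a description of Cell or of any AMD design: the core names, task kinds, and throughput numbers are all hypothetical, and `pick_core` merely stands in for whatever the compiler or runtime would decide.

```python
from dataclasses import dataclass


@dataclass
class Core:
    name: str
    # Task kind -> relative throughput. A missing kind means the core
    # cannot run that kind of code at all.
    speed: dict


def is_symmetric(cores, kinds):
    """True when every core can run every task kind (Weber's 'symmetric')."""
    return all(kind in core.speed for core in cores for kind in kinds)


def pick_core(cores, kind):
    """Return the fastest core able to run `kind`, or None if none can."""
    able = [core for core in cores if kind in core.speed]
    return max(able, key=lambda core: core.speed[kind]) if able else None


KINDS = ("scalar", "vector")

# Symmetric yet heterogeneous: both cores run everything, at different speeds.
symmetric_het = [
    Core("big", {"scalar": 1.0, "vector": 1.0}),
    Core("vec", {"scalar": 0.4, "vector": 3.0}),
]

# Asymmetric and heterogeneous: one core cannot run scalar code in this toy model.
asymmetric_het = [
    Core("general", {"scalar": 1.0, "vector": 1.0}),
    Core("stream", {"vector": 3.0}),
]

print(is_symmetric(symmetric_het, KINDS), is_symmetric(asymmetric_het, KINDS))
print(pick_core(symmetric_het, "vector").name)
```

In the symmetric case, a poor placement only costs speed, since any core can still run the code. In the asymmetric case, the software has to know which cores can run which code, which is the burden Weber objects to.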

The K8 is here to stay

One of the most interesting points that we came away with from our discussion of future AMD architectures was Weber's stance that the K8 execution core is as wide as they are going to go for quite some time.  Remember that the K8 execution core was taken from the K7, so it looks like the execution core that was originally introduced in the first Athlon will be with us even after the Athlon 64. 

What's even more interesting is that Intel's strategy appears to confirm that AMD's decision was indeed the right one. After all, it looks like the Pentium M architecture is eventually going to be adapted for the desktop in the coming years.  Based on the P6 execution core, the Pentium M is inherently quite similar (although also inferior) to the K7/K8 execution core that AMD has developed.  Given that Intel is slowly but surely adopting architectural features that AMD has implemented over the past few years, we wouldn't be too shocked to see an updated Pentium M execution core that is more competitive with the K7/K8 by the time that the Pentium M hits the desktop. 

Fred went on to say that for future microprocessors, he's not sure if the K8 core necessarily disappears and that in the long run, it could be that future microprocessors feature one or more K8 cores complemented by other cores.  Weber's comments outline a fundamental shift in the way that microprocessor generations are looked at.  In the past, the advent of a new microprocessor architecture meant that the outgoing architecture was retired - but now it looks as if outgoing architectures will be incorporated and complemented rather than put out to pasture.  The reason for this reuse instead of retire approach is simple - with less of a focus on increasing ILP, the role of optimizing the individual core decreases, and the problems turn into things like: how many cores can you stick on a die and what sort of resources do they share? 

In the past, new microprocessor architectures were sort of decoupled from new manufacturing processes.  You'd generally see a new architecture debut on whatever manufacturing process was out at the time and eventually scale down to smaller and smaller processes, allowing for more features (e.g. cache) and higher clock speeds.  In the era of multi-core, it's the manufacturing process that really determines how many cores you can fit on a die and thus, the introduction of "new architectures" is very tightly coupled with smaller manufacturing processes.  We put new architectures in quotes because oftentimes, the architectures won't be all that different on an individual core basis, but as an aggregate, we may see significant changes. 
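
The coupling between process node and core count can be sketched with back-of-the-envelope arithmetic. The example below assumes ideal scaling, where the area of a core shrinks with the square of the feature size, and it ignores cache, I/O, yield, and power. The starting point of two cores at 90nm is just an example input, and `ideal_core_count` is a name we made up.

```python
def ideal_core_count(cores_now, node_now_nm, node_next_nm):
    """Cores that fit in the same die area under ideal quadratic area scaling."""
    area_shrink = (node_now_nm / node_next_nm) ** 2
    return int(cores_now * area_shrink)  # round down to whole cores


START_CORES = 2   # example starting point, not a roadmap figure
START_NODE = 90

for node in (65, 45):
    cores = ideal_core_count(START_CORES, START_NODE, node)
    print(f"{START_NODE}nm -> {node}nm: up to {cores} cores in the same area")
```

Under these idealized assumptions, the same die area that holds two cores at 90nm holds three at 65nm and eight at 45nm. Real designs will land below that, but it shows why each "new architecture" in the multi-core era tends to arrive with a process shrink.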

How about a Hyper Threaded Athlon?

When Intel announced Hyper Threading, AMD wasn't (publicly) paying any attention at all to TLP as a means to increase overall performance.  But now that AMD is much more interested and more public about their TLP direction, we wondered if there was any room for SMT a la Hyper Threading in future AMD processors, potentially working within multi-core designs. 

Fred's response to this question was thankfully straightforward; he isn't a fan of Intel's Hyper Threading in the sense that the entire pipeline is shared between multiple threads. In Fred's words, "it's a misuse of resources."  However, Weber did mention that there's interest in sharing parts of multiple cores, such as two cores sharing a FPU to improve efficiency and reduce design complexity.  But things like sharing simple units just didn't make sense in Weber's world, and given the architecture with which he's working, we tend to agree. 

An Update on Turion

We also managed to corner some AMD folks about their new "mobile technology", the Turion 64. Here's what we were able to get out of them:

Much as we suspected, the power optimizations that went "into" Turion 64 are all transistor level optimizations.  Basically, AMD selected transistors that provide better thermal and power characteristics at the expense of lower switching frequencies.  Given that the Turion 64 runs at multiple speed grades lower than the fastest desktop Athlon 64s, this trade-off makes sense, but it also means that Turion 64 is no Pentium M killer.  There was one logic level optimization that went into Turion 64, support for a deeper C3 sleep state, but other than that, the Turion 64 is architecturally identical to a Socket-754 Athlon 64. 

The similarity between mobile and desktop goes one step further as we just confirmed that the packaging of the Turion 64 is no different than the Socket-754 desktop Athlon 64, except for the fact that the heatspreader is removed.  AMD did mention that they are looking at different packaging options that would surface in the second revision of the Turion 64 microprocessor. 

The Turion 64 notebooks that are going to be released will all be in the 1" - 1.4" thickness range, and weigh around 5 to 6.5 lbs.  The Turion is specifically targeted at what AMD is referring to as the mainstream thin and light segment, which also means that AMD will continue to remain non-competitive in the smaller form factor notebooks in which Centrino is available. 

AMD did mention that there is "focus" on a new mobile platform architecture, presumably similar in approach to the Centrino platform, designed from the ground up to be specifically for mobile applications rather than just down-scaling desktop technologies.  AMD was extremely quiet about details on this front other than the fact that it was something that their new Japan engineering lab is playing a key role in defining.  Whenever this new architecture does surface, it will carry the Turion brand.

Final Words

From talking to people like Justin Rattner and Fred Weber, the future of the CPU industry is looking to be particularly bright.  For the first time in recent history, we have both AMD and Intel agreeing on major points of future microprocessor architectures, and to AMD's credit, it looks like a lot of the decisions they made with the Athlon 64 were, in fact, the right ones.  What can we expect from AMD going forward?

We can expect the K8 execution core to remain relatively unchanged. Its successor may be deeper pipelined, but for the most part, the core itself appears to be mostly done evolving. 

We can expect future AMD chips, beyond 65nm, to be large groupings of cores, but the focus will continue to be on making them all general purpose, albeit with varying individual strengths (symmetric and heterogeneous). 

The Cell approach appears to be one supported by both AMD and Intel, but also appears to be too early in both their eyes.  It's clear that giving up Weber's symmetric heterogeneous approach isn't a sacrifice that either AMD or Intel is willing to make; they both appear to be waiting for smaller manufacturing processes before approaching architectures similar in nature to Cell, so that they don't sacrifice present day performance or hardware transparency. 

We also asked Weber about his thoughts on wafer and die stacking; he sounded particularly interested in them, but added that for a microprocessor, it's far too early to count on die stacking because of yield concerns.  He said that the time for the technology to be used on microprocessors would only exist once there's mass market use of it in memory manufacturing. Then, and only then, would it be mature enough to migrate to microprocessors. 
