What about Hyper-Threading and IMC?

Core's impressive execution resources and massive shared cache seem to make it the ideal CPU design for SMT. However, there is no Simultaneous Multi-Threading anywhere in the Core architecture. The reason is not that SMT can't give good results (see our elaborate discussion here), but that the engineers were given the task of developing a CPU with a great performance-per-watt ratio that could be used for the server, desktop and mobile markets. So the designers in Israel decided against using SMT (Hyper-Threading). While SMT can offer up to a 40% performance boost, these benefits will only be seen in server applications. SMT also makes the hotspots even hotter, so it didn't fit very well in Core's "One Micro-Architecture to Rule them All" design philosophy.

As for including an Integrated Memory Controller (IMC), we were told that the transistors which could have been spent on the IMC were better spent on the 4 MB shared cache. This is of course highly debatable, but it is a fact that cache consumes less power. The standard party line from Intel is that keeping the memory controller on the chipset allows them to support additional memory types without having to re-spin the CPU core. That is certainly true, and with the desktop/mobile sectors using standard DDR2 while servers are set to move to FB-DIMM designs, the added flexibility isn't terrible. Techniques such as memory disambiguation and improved prefetch logic can also help to eliminate any advantage an IMC might offer. Would an IMC improve Core's performance? Almost certainly, but Intel will for the time being pursue other options.
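
The trade-off between a lower miss penalty (IMC) and a lower miss rate (larger cache, better prefetching) can be illustrated with the textbook average-memory-access-time formula. This is only a sketch: every latency and miss rate below is a hypothetical round number rather than a measurement of Core or of any chipset, and `amat` is a helper name invented for this example.

```python
def amat(hit_cycles, miss_rate, miss_penalty_cycles):
    """Average memory access time = hit time + miss rate * miss penalty."""
    return hit_cycles + miss_rate * miss_penalty_cycles


# Hypothetical numbers, chosen only to illustrate the trade-off.
L2_HIT = 14         # cycles for an L2 hit
FSB_PENALTY = 200   # cycles to reach RAM through a chipset memory controller
IMC_PENALTY = 140   # cycles to reach RAM through an on-die memory controller

baseline = amat(L2_HIT, 0.050, FSB_PENALTY)      # smaller cache, no IMC
with_imc = amat(L2_HIT, 0.050, IMC_PENALTY)      # same cache, lower miss penalty
bigger_cache = amat(L2_HIT, 0.035, FSB_PENALTY)  # lower miss rate, no IMC

for name, cycles in [("baseline", baseline),
                     ("with IMC", with_imc),
                     ("bigger cache", bigger_cache)]:
    print(f"{name:>12}: {cycles:.1f} cycles")
```

With these made-up inputs, cutting the miss rate from 5% to 3.5% buys the same average latency as cutting the miss penalty by 30%. Which of the two is cheaper in transistors and power is exactly the part that remains debatable.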

Conclusion 1 : AMD K8 versus Intel P8

The Intel Core architecture is clearly the heir and descendant of the hugely successful P6 architecture. However, it has state-of-the-art technology on board, such as micro-op/macro-op fusion, memory disambiguation and massive SIMD/FP power.

Compared to the excellent AMD K8/Hammer architecture, the Core CPU is simply a wider, more efficient and more out of order CPU. When I suggested to Jack Doweck that the massive execution resources may not be fully used until SMT is applied, he disagreed completely. Memory disambiguation should push the current limits of ILP in integer loads a lot higher, and the massive bandwidth that the L1 and L2 can deliver should help Core to come close to the execution utilization percentages of the current P-M. 33% more execution potential could thus come very close to 33% more performance, clock-for-clock.
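
The 33% claim is simple arithmetic once utilization is held constant. The sketch below assumes that the figure refers to Core's 4-wide design versus the 3-wide P-M; the utilization and clock values are hypothetical placeholders, and `sustained_throughput` is a name made up for this illustration.

```python
def sustained_throughput(issue_width, utilization, clock_ghz):
    """Instructions per nanosecond in a deliberately simple model."""
    return issue_width * utilization * clock_ghz


UTILIZATION = 0.6  # hypothetical, assumed identical for both designs
CLOCK_GHZ = 2.0    # hypothetical, identical clocks ("clock-for-clock")

p_m = sustained_throughput(3, UTILIZATION, CLOCK_GHZ)   # 3-wide P-M
core = sustained_throughput(4, UTILIZATION, CLOCK_GHZ)  # 4-wide Core

gain = core / p_m - 1
print(f"clock-for-clock gain: {gain:.0%}")
```

The utilization and clock terms cancel out of the ratio, so the gain is one third whatever values are plugged in. The real question is whether Core can keep its utilization as high as the P-M's, which is what Doweck's argument about memory disambiguation and cache bandwidth is about.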

So is it game over for AMD? Well, if you read the previous pages, it is pretty clear that there are some obvious improvements that should happen in AMD's next generation. However, there is no reason at all to assume that the current K8 architecture is at the end of its life. One obvious upgrade possibility is to enhance the SSE/SIMD power by widening each unit or by simply implementing more of them in the out of order FP pipeline.

To sustain the extra (SIMD) FP power, AMD should definitely improve the bandwidth of the two caches further. The K7 had a pretty slow L2 cache, and the K8, for example, doubled the amount of bandwidth that the L2 could deliver. It's not unreasonable to think a 256-bit wide cache bus could be added to a near-future AMD design.
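
To put a number on what a wider cache bus buys, peak bandwidth is just bus width times clock. The sketch below assumes one full-width transfer per cycle and a 128-bit bus as the starting point; the 2.4 GHz clock is a hypothetical example value and `cache_bandwidth_gb_s` is an invented helper.

```python
def cache_bandwidth_gb_s(bus_width_bits, clock_ghz):
    """Peak GB/s for a bus that moves one full-width transfer per cycle."""
    bytes_per_cycle = bus_width_bits / 8
    return bytes_per_cycle * clock_ghz


CLOCK_GHZ = 2.4  # hypothetical core clock

for width in (128, 256):  # assumed current width vs. the suggested 256-bit bus
    gb_s = cache_bandwidth_gb_s(width, CLOCK_GHZ)
    print(f"{width}-bit bus at {CLOCK_GHZ} GHz: {gb_s:.1f} GB/s peak")
```

Doubling the width doubles the peak figure at the same clock, which is why it is such an obvious companion to wider SIMD units.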

Finally, there is also a lot of headroom for increasing integer performance. The fact that Loads can hardly be reordered has been a known weak point since the early K7 days. In fact, we know that engineers at AMD were well aware of it then, and it is surprising that AMD didn't really fix this in the K8 architecture. Allowing a much more flexible reordering of Loads - even without memory disambiguation - would give a very healthy boost to IPC (5% and more). It is one of the main reasons why the P-M can beat the Athlon 64 clock-for-clock in certain applications.
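
Why load reordering matters can be shown with a toy model. It is not a description of how either K8 or Core actually schedules loads: the "conservative" policy is a caricature of a design that holds loads behind earlier stores, and the "disambiguating" policy assumes a perfect predictor that never has to recover from a wrong guess. `Op` and `earliest_load_issue` are names invented here.

```python
from dataclasses import dataclass


@dataclass
class Op:
    kind: str        # "store" or "load"
    addr: int        # memory address the operation touches
    addr_ready: int  # cycle at which that address has been computed


def earliest_load_issue(ops, index, policy):
    """Earliest cycle at which the load at ops[index] may issue."""
    load = ops[index]
    if load.kind != "load":
        raise ValueError("ops[index] must be a load")
    earlier_stores = [op for op in ops[:index] if op.kind == "store"]
    if policy == "conservative":
        # Wait until the address of every earlier store is known.
        blockers = earlier_stores
    elif policy == "disambiguating":
        # Oracle prediction: wait only for stores that really alias the load.
        blockers = [s for s in earlier_stores if s.addr == load.addr]
    else:
        raise ValueError(f"unknown policy: {policy}")
    return max([load.addr_ready] + [s.addr_ready for s in blockers])


# A load that is independent of a slow store ahead of it in program order.
independent = [Op("store", 0x100, 10), Op("load", 0x200, 2)]
# The same load, now also preceded by a store to the same address.
aliased = [Op("store", 0x200, 6), Op("store", 0x100, 10), Op("load", 0x200, 2)]

for name, program in [("independent", independent), ("aliased", aliased)]:
    load_index = len(program) - 1
    for policy in ("conservative", "disambiguating"):
        cycle = earliest_load_issue(program, load_index, policy)
        print(f"{name:>11} / {policy:<14}: load issues at cycle {cycle}")
```

In the independent case the load can start eight cycles earlier once it is allowed to pass the unrelated store; in the aliased case it still has to wait, but only for the store it truly depends on. Real hardware has to guess and occasionally replay, which this sketch ignores.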

Those are just a few examples that are well known. It is very likely that there are numerous other possible improvements that could take the K8 architecture much further.

Looking at the server version of Core ("Woodcrest") and considering that it is very hard to find a lot of ILP in server applications, the only weakness of Core is that there is no multi-threading within each core. This small disadvantage is a logical result of the design goal of Core, an architecture which is an all-around compromise for the server, desktop and mobile markets. The lack of Hyper-Threading in Xeon Core products might give Sun and IBM a window of opportunity in the heavily threaded server application benchmarks, but since Tigerton (65 nm, two Woodcrests in one package, 4 cores) will come quickly, the disadvantage of not being able to extract more TLP might never be seen. Our astute readers will have understood by now that it is pretty hard to find a weakness in the new Core architecture.

Conclusion 2 : The free lunch is back!

It is ironic that just a year ago, Intel and others were downplaying the importance of increasing IPC and extracting more ILP. Multi-core was the future; single thread performance was a minor consideration. The result was that the reputed Dr. Dobb's Journal headlined "The Free Lunch Is Over" [1], claiming that only larger caches would increase IPC a little bit and that the days when developers could count on the ever increasing clockspeeds and IPC efficiency of newer CPUs to run code faster were numbered. Some analysts went even further and felt that CPU packages with many relatively simple, small in-order CPUs were the future.

At AnandTech, we were pretty skeptical about the "threading is our only savior" future, as Tim Sweeney, the leading developer behind the Unreal 3 engine, explained the challenges of multi-threaded development of the next generation of games. The fat, wide OoO core running at high clockspeeds was buried a little too soon. Yes, Intel's Core does not use the aggressive domino and LVS circuit-design strategy that NetBurst designs used to achieve stunning clockspeeds. At the same time, it is a fat, massively reordering CPU which gives a free lunch to developers who don't want to spend too much time on debugging heavily threaded applications. Multi-core is here to stay, but getting better performance is once again the shared responsibility of both the developer and the CPU designer. Yes, dual-core is nice, but single threaded performance is still important!

I would like to express my thanks to the following people who helped to make this article possible:
Jack Doweck, "Foo", "Redpriest", Jarred and Anand

References

[1] The Free Lunch Is Over: A Fundamental Turn Toward Concurrency in Software, by Herb Sutter, Dr. Dobb's Journal

Intel's Next Generation Microarchitecture Unveiled, by David Kanter, Real World Technologies

85 Comments

  • JarredWalton - Monday, May 01, 2006 - link

    I won't say it's a quick fix, but just as Core is a derivative of P6, AMD could potentially just get some better OoO capabilities into K8 and get some serious performance improvements. Their current inability to move loads forward much (if at all) makes them even more dependent on RAM latency. You could even say that they *needed* the IMC to improve performance, but still L2 latency is far better than RAM latency, and cutting down L2 latency hit from 12 cycles to say 6 cycles (if you can do the load 6 cycles early) would have to improve performance. Loads happen "all the time" in ASM, so optimizing their performance can pay huge dividends.

    I'm going to catch flak for this, but basically Intel has more elegant designs than AMD in several areas. It comes from throwing billions of dollars at the problems. Better L1 cache? Yup - 8-way vs. 2-way is pretty substantial; 256-bit vs. 128-bit is also substantial. Better specialization of hardware? Yeah, I have to give them that as well: rather than just using three ALUs, they often take the path of having a few faster ALUs to handle the common cases.

    Really, the only reason AMD was able to catch (and exceed) Intel performance was because Intel got hung up on clock speeds. They basically let marketing dictate chip design to engineering - which is never good, IMO, at least not in the long run. Even NetBurst still has some very interesting design features (double-pumped ALUs, specialization of functional units, trace cache), and if nothing else it served as a good lesson on how far you can push clock speeds and pipeline lengths before you start encountering some serious problems. I would have loved to see Northwood tweaked for 90nm and 65nm, personally - 31+8 pipeline stages was just hubris, but 20+8 with some other tweaks could have been interesting.

    Here's hoping AMD can make some real improvements to their chips sooner rather than later. Intel Core is looking very strong right now, and I would rather have close competition than a 20% margin of victory like we've been seeing lately. (First with AMD K8 beating NetBurst, and now it looks like Conroe is going to turn the tables.)
  • Regs - Wednesday, June 07, 2006 - link

    Of course you will catch flak. It's an opinion.

    My opinion is Intel is more innovative or even compromising
    while AMD is more intuitive.
  • Spoonbender - Monday, May 01, 2006 - link

    "8-way 32kb * 2 L1 should in theory exceed the hitrate of K8's 2-way 64kb * 2 L1."
    It should? I'd like to see some sources on that. From what I've seen, the 64KB cache still has an advantage there, with a hitrate not much below that of a 64KB/8-way.

    Also, I disagree that Intel's CPUs are generally more elegant. First, their L1 cache isn't necessarily "better" (see above). Of course, the 256 vs 128 bit bandwidth is a big factor, however.

    Specialization in hardware? Is that elegant? I'd say there's a certain elegance in making a general solution as well, as opposed to specializing everything to the point where you're screwed if the code you have to execute isn't 100% optimal.

    And I definitely think AMD's distributed reservation stations are more elegant than the central one used by Intel. Same goes with the usual HyperTransport vs FSB story.
    There are a few other really elegant features of the K8 that I haven't seen duplicated in Core.

    So overall, I don't see the big deal with "elegance". Both architectures have plenty of elegant features. However, the K8 is definitely aging, and will have problems keeping up with Conroe.

    But then again, the K8 die is tiny in comparison. They've got plenty of space for improvements.

    Really looking forward to
    1) Being able to get a Merom-powered laptop, and
    2) Seeing what AMD comes up with next year.
  • IntelUser2000 - Monday, May 01, 2006 - link

    quote:

    But then again, the K8 die is tiny in comparison. They've got plenty of space for improvements.


    Tiny?? 199mm2 for 2x1MB cache K8 at 90nm is tiny?? Ok there. Conroe with 4MB cache is around 140mm2 die size.

    http://www.aceshardware.com/forums/read_post.jsp?i...

    Even comparing against Intel's SRAM size for 90nm and 65nm, at 90nm, Conroe would be 250mm2. In Prescott, 1MB L2 cache takes 16-17mm2. 250mm2-32mm2=218mm2.

    And Intel didn't do shrinks that are relative to SRAM sizes. Compare Cedarmill and Prescott 2M, which are essentially the same cores.

    Prescott 2M: 135mm2
    Cedarmill: 81mm2

    The only difference being process size, the ratio is 0.6.

    Conroe at 90nm would have been 233mm2, which is compact as X2 per core.
  • Spoonbender - Tuesday, May 02, 2006 - link

    Okay, I guess I should have said that at the same process size, it is tiny. I meant that when AMD gets around to migrating to 65nm as well, they'll have a smaller core (assuming no big changes to the chip), which gives them plenty (some?) of room for improvement.

    [quote]
    Conroe at 90nm would have been 233mm2, which is compact as X2 per core.
    [/quote]
    But in absolute terms, still bigger than an Athlon X2. Which means AMD has some space for improvement. That was my only point. I guess I should have been more clear. ;)
  • coldpower27 - Wednesday, May 03, 2006 - link

    Yes, if using the 0.6 factor for Brisbane, the 65nm Athlon 64 X2, it will be around 132mm2 assuming no changes over the 220mm2 Windsor DDR2 Athlon 64 X2. 199mm2 is only for Toledo, which is reaching end of life and can no longer be used as a comparison. And it's irrelevant to compare to Conroe on what it would be on 90nm, as it never was built on 90nm technology to begin with.

    Conroe is looking to be ~14x mm2 with the x=0-9. Yes, you can compare them at the same process node, considering that Conroe will only be competing with the 65nm dual-core Athlon 64 X2 in the second half of its lifetime.
  • JumpingJack - Tuesday, May 02, 2006 - link

    Nice analysis. The current AMD X2 dual cores are about 1.5 to 2.0 times the size of Intel dual cores (on 65 nm); this is where 65 nm adds such a benefit. Conroe will come in around 140 mm^2 as you said. Yonah at 2 MB shared is 90 mm^2, less than 1/2 the X2.

    Right now, cost-wise in Si real estate, AMD is more expensive.
  • BitByBit - Monday, May 01, 2006 - link

    quote:

    From what I've seen, the 64KB cache still has an advantage there, with a hitrate not much below that of a 64KB/8-way.


    Here is a good article on processor cache:
    http://en.wikipedia.org/wiki/CPU_cache

    If you scroll down to the miss-rate vs. cache size graph, you can see that an 8-way 64Kb cache has a miss-rate less than one-tenth the miss-rate of a 2-way 64Kb cache.

    An 8-way 64Kb * 2 L1 would probably be too difficult to implement, given the time it would take to search. However, according to the relationships shown by that graph, increasing the Athlon's L1 associativity to 4-ways could yield a nice boost in hitrate and consequently performance.


  • Spoonbender - Monday, May 01, 2006 - link

    And of course, we all know wikipedia is the ultimate source of all truth and knowledge... ;)

    Keep in mind that this graph only shows the Spec2000 benchmark (and only the integer section, at that). That's far from being representative of all code.

    According to http://www.amazon.com/gp/product/1558605967/002-28... which, in my experience, is pretty damn good, the missrates are as follows *in general*:

    32KB, 8-way: 0.037
    64KB, 2-way: 0.031
    64KB 8-way: 0.029

    But yeah, of course improving the Athlon's cache would help. But it's not the first place I'd look to optimize. For one thing, making it more complex would, as seen above, not yield a significantly lower missrate, but it would slow the cache down, either forcing them to increase its latency, or limiting the frequency potential of the cpu as a whole. The cache bandwidth might be a better candidate for improvement. Or some of the actual cpu logic. Or the L2 cache size. I think the L1 cache is pretty healthy on the K8 already.
  • Betwon - Tuesday, May 02, 2006 - link

    The data cannot be used for Core, because it did not use the smart prefetcher.

    The advanced smart prefetchers of Core's L1D have decreased the miss-rates very much. In fact, the data cache of Core is much more efficient than K8's.
    Compared with Core's smart cache, K8's 64KB L1D is like an idiot.
