Introduction

The age of multi-core is upon us, and the game of who has the highest clock speed has turned into a contest of who has the most cores (at least for now). Intel released Clovertown in Q4 2006, a bit ahead of its originally scheduled 2007 launch date. The early launch was clearly intended, at least in part, to ensure Intel was first to market with quad core, ahead of rival AMD.

Clovertown is targeted at dual-socket servers, typically in a 1U-2U form factor. It launched at speeds up to 2.66 GHz, with 3.0 GHz on the horizon. Intel has also recently launched low-voltage parts, which are rated at 50W and clocked at 1.86 GHz and 1.60 GHz.

So, what applications could benefit from eight cores? Today, the obvious answer is virtualization, although database servers, Exchange servers, and compute clusters would also be good candidates. Virtualization is the primary target for Clovertown: a rack of ESX servers running on 2U Clovertown boxes could consolidate a significant number of business applications into a relatively small footprint.

Last year, at an IBM technical conference, one of their senior technical representatives said the following: "In the coming years, the operating systems we use today will be merely applications running in a single operating system". Although you could say that's true today, it's only the beginning of what is going to be a complete shift in the traditional way we approach and think about "servers". Virtualization is growing at an exponential rate, and the shift to multi-core is only going to accelerate that growth.

Although a significant portion of Clovertown systems will be deployed in virtualized environments, some will be used in more traditional single-purpose server scenarios. However, there is something to keep in mind if you plan to throw eight cores at a database server or any other I/O-intensive workload. You have now increased your processing power at least twofold relative to a dual-core configuration, and ensuring that your I/O subsystem can keep up with that extra processing power may be difficult. As you will read later in the article, our own test suite ran into significant I/O bottlenecks once all eight cores were in play.
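
To make that concrete, here is a rough sketch of the kind of multi-threaded random-read load an eight-core database box generates (a toy program, not the test suite used in this article; the file name and thread count are whatever you pass in). Run it with 1, 2, 4, and 8 threads against a file much larger than RAM and watch how quickly the aggregate MB/s stops climbing:

    /* A toy multi-threaded random-read load (not the benchmark used in this
     * article): each thread issues 8KB reads at random offsets in <file>.
     * Build: gcc -O2 -std=gnu99 -o ioscale ioscale.c -lpthread
     * Usage: ./ioscale <file> <threads>
     */
    #define _GNU_SOURCE
    #include <fcntl.h>
    #include <pthread.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <unistd.h>
    #include <sys/time.h>

    #define BLOCK 8192            /* bytes per read request   */
    #define READS 20000           /* read requests per thread */

    static const char *path;
    static off_t file_size;

    static void *reader(void *arg)
    {
        int fd = open(path, O_RDONLY);
        char *buf = malloc(BLOCK);
        unsigned seed = (unsigned)(long)arg;

        if (fd < 0 || buf == NULL) { perror("reader setup"); return NULL; }
        for (int i = 0; i < READS; i++) {
            off_t off = ((off_t)rand_r(&seed) % (file_size / BLOCK)) * BLOCK;
            if (pread(fd, buf, BLOCK, off) < 0)
                perror("pread");
        }
        free(buf);
        close(fd);
        return NULL;
    }

    int main(int argc, char **argv)
    {
        if (argc != 3) { fprintf(stderr, "usage: %s <file> <threads>\n", argv[0]); return 1; }
        path = argv[1];
        int nthreads = atoi(argv[2]);
        if (nthreads < 1 || nthreads > 64) { fprintf(stderr, "1-64 threads\n"); return 1; }

        int fd = open(path, O_RDONLY);
        if (fd < 0) { perror("open"); return 1; }
        file_size = lseek(fd, 0, SEEK_END);
        close(fd);
        if (file_size < BLOCK) { fprintf(stderr, "file too small\n"); return 1; }

        pthread_t tid[64];
        struct timeval t0, t1;
        gettimeofday(&t0, NULL);
        for (long i = 0; i < nthreads; i++)
            pthread_create(&tid[i], NULL, reader, (void *)i);
        for (int i = 0; i < nthreads; i++)
            pthread_join(tid[i], NULL);
        gettimeofday(&t1, NULL);

        double secs = (t1.tv_sec - t0.tv_sec) + (t1.tv_usec - t0.tv_usec) / 1e6;
        double mb   = (double)nthreads * READS * BLOCK / (1024.0 * 1024.0);
        printf("%d threads: %.0f MB read in %.1f s = %.1f MB/s aggregate\n",
               nthreads, mb, secs, mb / secs);
        return 0;
    }

If the curve flattens well before eight threads, adding cores to that box buys nothing until the storage subsystem is upgraded.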

Comments

  • timelag - Wednesday, April 4, 2007 - link

    Authors--

    Er, gosh. Dunno what to make of the preceding discussion. Eh, they don't scare me--I'll post anyway.

    Even though the title of this article is "Quad core for the masses", the benchmark is for enterprise database applications. Because of the title, I had expected some workstation benchmarking. Any plans for doing benchmarks for scientific and visualization applications? From bio-tech (BLAST, etc.), to fluid dynamics, to 3D rendering. That sort of thing.
  • Viditor - Wednesday, April 4, 2007 - link

    Didn't mean to put you off there timelag...:)
    My apologies...

    Some of what you're asking for was done in a previous article by Johan: http://www.anandtech.com/showdoc.aspx?i=2897&p...
  • Beenthere - Saturday, March 31, 2007 - link

    Intel's attempt to use two dual cores on a slice of silicon and call it a quad core shows how easily they can manipulate the media with foolishness. Only a fool would buy Intel's inferior 2+2 design when they can have Barcelona and its many superior derivatives.
  • JarredWalton - Saturday, March 31, 2007 - link

    Riiight... only a fool would get a QX6700 right now when Barcelona isn't out. Putting two chips in a package has disadvantages, but there are certainly instances where it will easily outperform the 2x2 Opteron, even in eight-way configurations. There are applications that are not entirely I/O-bound or bandwidth-bound. When it comes down to the CPU cores, Core 2 is significantly faster than any Opteron right now.

    As an example, a 2.66 GHz Clovertown (let alone a 3.0 GHz Xeon) as part of a 3D rendering farm is going to be a lot better than two 2.8 GHz (or 3.0 GHz...) Opteron parts. Two Xeon 5355 will also be better than four Opteron 8220 in that specific instance, I'm quite sure. The reason is that the 4MB of L2 per chip is generally enough for 3D rendering. There are certainly other applications where this is the case, but whether they occur more often than the other way around (i.e. 4x2 Opteron being faster than 2x4 Xeon) I couldn't say.

    AMD isn't really going to have a huge advantage because of native quad core with Barcelona, and Intel wouldn't get a huge boost by having native quad core either. If you thought about it more, you would realize that the real reason Intel's quad core chips have issues with some applications is that all four cores are pulling data over a single FSB connection - one connection per socket. Intel has to use that single FSB link for RAM, Northbridge, and inter-CPU communications.

    In contrast, AMD's "native quad core" will still have all four cores going over the same link for RAM access (a potential bottleneck). They can use another HT link to talk to another socket (actually two links), and they can use the third HT link to talk to the Northbridge. The inter-CPU communication generally isn't a big deal, and Northbridge I/O is also a much smaller piece of the bandwidth pie than RAM accesses. It's just that AMD gets all the RAM bandwidth possible. AMD could have done a "two die in one package" design and likely had better scaling than Intel, but they chose not to.

    And of course Intel will be going to something similar to HyperTransport with Nehalem in 2008. Even they recognize that the single FSB solution is getting to be severely inadequate for many applications.
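
    To put a rough number on that FSB-sharing argument, here's a quick pthreads sketch (just a toy, not an official benchmark) where every thread streams through its own private buffer; since nothing is shared between threads, any failure to scale as you add threads comes from the shared path to memory rather than from the software:

        /* A toy streaming-bandwidth test: every thread walks its own private
         * buffer, so nothing is shared and any failure to scale comes from the
         * path to memory, not from the application.
         * Build: gcc -O2 -std=gnu99 -o memscale memscale.c -lpthread
         * Usage: ./memscale <threads>
         */
        #include <pthread.h>
        #include <stdio.h>
        #include <stdlib.h>
        #include <string.h>
        #include <sys/time.h>

        #define BUF_BYTES (256u * 1024 * 1024)   /* 256 MB per thread, far beyond L2 */
        #define PASSES    4

        static void *stream(void *arg)
        {
            long  *buf = malloc(BUF_BYTES);
            size_t n   = BUF_BYTES / sizeof(long);
            long   sum = 0;

            memset(buf, 1, BUF_BYTES);          /* fault every page in so reads hit RAM */
            for (int p = 0; p < PASSES; p++)
                for (size_t i = 0; i < n; i++)
                    sum += buf[i];              /* pure read traffic once past the caches */

            *(long *)arg = sum;                 /* keep the compiler from deleting the loop */
            free(buf);
            return NULL;
        }

        int main(int argc, char **argv)
        {
            int nthreads = (argc > 1) ? atoi(argv[1]) : 1;
            if (nthreads < 1 || nthreads > 64) nthreads = 1;

            pthread_t tid[64];
            long      sink[64];
            struct timeval t0, t1;

            gettimeofday(&t0, NULL);
            for (int i = 0; i < nthreads; i++)
                pthread_create(&tid[i], NULL, stream, &sink[i]);
            for (int i = 0; i < nthreads; i++)
                pthread_join(tid[i], NULL);
            gettimeofday(&t1, NULL);

            double secs = (t1.tv_sec - t0.tv_sec) + (t1.tv_usec - t0.tv_usec) / 1e6;
            double gb   = (double)nthreads * PASSES * BUF_BYTES / 1e9;
            printf("%d threads: %.1f GB read in %.2f s = %.2f GB/s aggregate\n",
                   nthreads, gb, secs, gb / secs);
            return 0;
        }

    The initial touch pass is included in the timing, so treat the numbers as relative rather than absolute; the interesting part is how much (or how little) the aggregate grows from one thread to four on a single socket.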
  • Viditor - Saturday, March 31, 2007 - link

    quote:

    As an example, a 2.66 GHz Clovertown (let alone a 3.0 GHz Xeon) as part of a 3D rendering farm is going to be a lot better than two 2.8 GHz (or 3.0 GHz...) Opteron parts. Two Xeon 5355 will also be better than four Opteron 8220 in that specific instance, I'm quite sure


    Actually, that's not true Jarred.
    Johan's test (http://www.anandtech.com/showdoc.aspx?i=2897&p...) benchmarked exactly that scenario, and C2D was equal at 4 cores and slightly slower at 8 cores. This was a 2.33 GHz Clovertown vs the 2.4 GHz Opterons...
  • Viditor - Saturday, March 31, 2007 - link

    Let me add that there are cases where it could be true, but only when the apps don't scale at all...and in that case, even a single or dual core sometimes beats the Clovertowns.
  • JarredWalton - Sunday, April 1, 2007 - link

    Okay, wrong example then. Heh. The point is I am sure there are benchmarks where the FSB bottleneck isn't as pronounced. Anything that can stay mostly within the CPU cache will be very happy with the current Xeon 53xx chips. Obviously, which workloads matter will be the deciding factor, so companies should research their application needs first and foremost.

    Getting back to the main point of the whole article, clearly there are areas where Opteron can outperform Xeon with an equal number of cores. Frankly, I doubt 1600 FSB is going to really help, hence the need for the new high speed link with Nehalem on the part of Intel. K10 could very well end up substantially ahead in dual and quad socket configurations later this year, even if it only runs at 2.3 GHz. I guess we'll have to wait and see... for all we know, the current memory interface on AMD might not actually be able to manage feeding quad cores any better than Intel's FSB does.
  • Viditor - Sunday, April 1, 2007 - link

    quote:

    The point is I am sure there are benchmarks where the FSB bottleneck isn't as pronounced. Anything that can stay mostly within the CPU cache will be very happy with the current Xeon 53xx chips

    Actually, it appears (at least from the stuff I've seen so far) that the only apps that aren't affected by the bottleneck are the ones that are just as good on a dual core...in other words, they don't scale well.
    I agree with the AMD exec who intimated that AMD made a HUGE mistake in not coming out with an MCM quad chip in November...I think the benchmarks would have tilted nicely toward the Opteron side of things well before Barcelona, but of course only on the quad chip.
    quote:

    I doubt 1600 FSB is going to really help, hence the need for the new high speed link with Nehalem on the part of Intel

    I absolutely agree...I've been saying for the last year that AMD will most likely retake the lead (even against Penryn), but that Nehalem is a whole 'nother ballgame...
    quote:

    for all we know, the current memory interface on AMD might not actually be able to manage feeding quad cores any better than Intel's FSB does

    I suppose that's possible, but if it were true then I think every executive at AMD would have dumped all of their shares by now. :)
    That's just as valid as saying it's possible that there's a flaw in Penryn when it gets over 2.8 GHz...possible, but I strongly doubt it.
  • TA152H - Monday, April 2, 2007 - link

    I'm not sure why you guys don't think an increase in FSB and memory bandwidth (i.e. 1600) is going to help. It seems beyond obvious it will. Whether it will help enough is the only question.

    With regards to the 2+2 from Intel, why does anyone really care? In some ways it's better than a monolithic quad core in that you can clock them higher, because you can pick pairs that make the grade instead of hoping that all four cores can clock really high. If one of the four can't, well, the whole thing has to be degraded. With Intel's approach, if one set of the cores is not capable at a certain speed, you just match it with one that is fairly close to it and sell it like that. It allows them to clock higher, and sell them less expensively than they would if they made a big quad-core die. The performance is excellent too, so it's a pretty good solution.

    Why would AMD not have problems with Quad-Cores similar to Intel? You still have four cores sucking data through one memory bus, right? Or am I missing something? Is AMD going to have a memory bus for each core? That seems strange to me, so I'm going to assume they are not. The memory controller and point-to-point bus don't fundamentally change that problem. This comparison was fairly skewed in that it made the Opteron's memory subsystem look better than it is: you had eight cores, yes, but with four sockets and point-to-point links, only two cores were ever fighting for the same memory bus. That's the advantage. If you add more sockets, the AMD solution will scale better, although NUMA has horrible penalties when you leave a processor's own memory. If you add more cores to the same socket, you still have fundamentally the same problem, and point-to-point really isn't going to change that. You have four cores hitting the same bus either way.
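
    If you want to see that NUMA penalty rather than take my word for it, here's a rough libnuma sketch (a toy, assuming a Linux box with libnuma installed) that pins execution to node 0 and then times walking a buffer allocated locally versus one allocated on another node:

        /* A rough local-vs-remote NUMA probe (assumes Linux with libnuma).
         * Pins execution to node 0, then times walking a buffer allocated on
         * node 0 versus one allocated on the highest-numbered node.
         * Build: gcc -O2 -std=gnu99 -o numaprobe numaprobe.c -lnuma
         */
        #include <numa.h>
        #include <stdio.h>
        #include <stdlib.h>
        #include <string.h>
        #include <sys/time.h>

        #define BUF_BYTES (128u * 1024 * 1024)

        static double walk(long *buf)
        {
            size_t n   = BUF_BYTES / sizeof(long);
            long   sum = 0;
            struct timeval t0, t1;

            gettimeofday(&t0, NULL);
            for (int pass = 0; pass < 4; pass++)
                for (size_t i = 0; i < n; i++)
                    sum += buf[i];
            gettimeofday(&t1, NULL);

            if (sum == 42) puts("");            /* defeat dead-code elimination */
            return (t1.tv_sec - t0.tv_sec) + (t1.tv_usec - t0.tv_usec) / 1e6;
        }

        int main(void)
        {
            if (numa_available() < 0 || numa_max_node() < 1) {
                fprintf(stderr, "need a NUMA system with at least two nodes\n");
                return 1;
            }

            numa_run_on_node(0);                /* execute on node 0 only */
            long *local  = numa_alloc_onnode(BUF_BYTES, 0);
            long *remote = numa_alloc_onnode(BUF_BYTES, numa_max_node());
            if (!local || !remote) { fprintf(stderr, "allocation failed\n"); return 1; }

            memset(local,  1, BUF_BYTES);       /* fault pages in on their home nodes */
            memset(remote, 1, BUF_BYTES);

            printf("local node:  %.2f s\n", walk(local));
            printf("remote node: %.2f s\n", walk(remote));

            numa_free(local,  BUF_BYTES);
            numa_free(remote, BUF_BYTES);
            return 0;
        }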

    With regards to the FSB, remember it's also part of the reason why Intel processors have more cache. It's not a coincidence Intel processors have more cache; it's because AMD uses so much room on the processor for the memory controller. Intel decided they'd rather use the transistors for other things. I'm not speculating either; Intel has actually said this. Intel could have added a memory controller a long time ago, but they didn't. In fact, in the mid-1990s there was a company called NexGen (which AMD bought because they couldn't design a decent processor from scratch at the time, and had a lot of problems with the K5 that alienated companies like Compaq) which had an onboard memory controller with the NX586. Jerry Sanders decided to can it for the NX686 and use a standard Socket 7 platform instead of NexGen's proprietary one for what became the K6. The K6-III+ is a really interesting chip; you can actually change the multiplier on the fly without rebooting (I still use it for some servers, for exactly that reason).
  • Viditor - Monday, April 2, 2007 - link

    quote:

    I'm not sure why you guys don't think an increase in FSB and memory bandwidth (i.e. 1600) is going to help. It seems beyond obvious it will. Whether it will help enough is the only question


    Certainly it will help...but keep this in mind (going towards your question at the end):
    1. Both this review and the one Johan did show the old K8 clearly doing as well or better than C2D across the board already (with 4 cores or more)...and Johan's numbers were on an Opteron using very old PC2700 memory as well (Jason and Ross didn't list their memory type).
    2. While Barcelona will be HT 2.0, it will be the last one at this speed...the rest of the K10s (the ones that Penryn will be competing with) will be HT 3.0. In other words, while the FSB of Penryn systems will be raised from 1333 to 1600, the K10s will be going from 1 GHz to between 1.8 and 2.6 GHz...

    quote:

    With regards to the 2+2 from Intel, why does anyone really care?

    Mainly because of the way it affects Intel's interchip communication. Remember that as apps become more parallel, they also require more communication between the cores. One of the great advances with C2D was the shared cache; the other was the Bensley platform (individual FSB connections to the MCH). However, with an MCM quad core, the only path for one half of the chip to talk to the other half is through the FSB (MCH). In essence, you have 2 caches (each dual-core die has a single shared cache, and there are 2 dies per CPU) per MCH connection, so we are back to a shared FSB again (in fact 2 shared FSBs). This recreates the bottleneck that the shared cache and Bensley were designed to get rid of...
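
    One way to see that interchip cost for yourself is a simple cache-line ping-pong (a rough sketch, not anything Intel or AnandTech published): pin two threads to two cores and time how long each handoff of a shared line takes, first for two cores on the same die and then for two cores that have to talk through the FSB/MCH:

        /* Cache-line ping-pong between two pinned cores (assumes Linux/glibc).
         * Run it for a pair of cores on the same die and then a pair on
         * different dies/sockets; check /proc/cpuinfo ("physical id"/"core id")
         * to pick the pairs, since core numbering is system-specific.
         * Build: gcc -O2 -std=gnu99 -o pingpong pingpong.c -lpthread
         * Usage: ./pingpong <cpuA> <cpuB>
         */
        #define _GNU_SOURCE
        #include <pthread.h>
        #include <sched.h>
        #include <stdio.h>
        #include <stdlib.h>
        #include <sys/time.h>

        #define ROUNDS 2000000

        static volatile int token;              /* the line the two cores fight over */

        struct cfg { int cpu; int parity; };

        static void pin(int cpu)
        {
            cpu_set_t set;
            CPU_ZERO(&set);
            CPU_SET(cpu, &set);
            pthread_setaffinity_np(pthread_self(), sizeof(set), &set);
        }

        static void *bounce(void *p)
        {
            struct cfg *c = p;
            pin(c->cpu);
            for (int i = 0; i < ROUNDS; i++) {
                while ((token & 1) != c->parity)
                    ;                           /* spin until it's our turn */
                token++;                        /* hand the line to the other core */
            }
            return NULL;
        }

        int main(int argc, char **argv)
        {
            if (argc != 3) { fprintf(stderr, "usage: %s <cpuA> <cpuB>\n", argv[0]); return 1; }
            struct cfg a = { atoi(argv[1]), 0 }, b = { atoi(argv[2]), 1 };
            pthread_t ta, tb;
            struct timeval t0, t1;

            gettimeofday(&t0, NULL);
            pthread_create(&ta, NULL, bounce, &a);
            pthread_create(&tb, NULL, bounce, &b);
            pthread_join(ta, NULL);
            pthread_join(tb, NULL);
            gettimeofday(&t1, NULL);

            double secs = (t1.tv_sec - t0.tv_sec) + (t1.tv_usec - t0.tv_usec) / 1e6;
            printf("cpu %d <-> cpu %d: %.0f ns per handoff\n",
                   a.cpu, b.cpu, secs * 1e9 / (2.0 * ROUNDS));
            return 0;
        }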
    quote:

    if one set of the cores is not capable at a certain speed, you just match it with one that is fairly close to it and sell it like that

    Ummm...that's not how they manufacture their chips (and it would be outrageously expensive to do so!). The testing occurs after the cores have been placed in the package...
    quote:

    Why would AMD not have problems with Quad-Cores similar to Intel? You still have four cores sucking data through one memory bus, right? Or am I missing something?

    Yes, you are...
    First is the interchip communication I spoke of. HT allows for direct connections between the caches of different chips, and each chip has its cache directly connected on-die through a dedicated internal bus. That bus has 2 memory controllers connected directly to system memory as well as its own dedicated HT connection (called cHT) to other caches. Remember that, by contrast, Intel must route everything through the single MCH...
    quote:

    It's not a coincidence Intel processors have more cache; it's because AMD uses so much room on the processor for the memory controller. Intel decided they'd rather use the transistors for other things

    Actually, the reason Intel gave for not having an on-die memory controller is that memory standards change too quickly. But what they didn't say is that it takes many years (about 5 on average) to design and release a new chip, and an on-die memory controller is a major architectural change. That's why we don't see it on C2D, but we will see it on Nehalem...
