Introduction

The age of multi-core is upon us, and the game of who has the highest clock speed has turned into who has the most cores (at least for now). Intel released Clovertown in Q4 of 2006, a bit ahead of its originally scheduled 2007 launch date. Obviously, the reason for the early launch was at least partially to ensure they were the first to market with quad core, ahead of rival AMD.

Clovertown is targeted at dual socket servers, typically in a 1-2U form factor. It launched with speeds up to 2.66 GHz, with 3.0 GHz on the horizon. Intel has also recently launched low voltage parts, which are rated at 50W and are clocked at 1.86 and 1.60 GHz.

So, what applications could benefit from eight cores? Today, the obvious choice is virtualization, although database servers, Exchange servers, and compute clusters would also be good candidates. Virtualization is the primary target for Clovertown; a rack of ESX servers running on 2U Clovertown boxes could consolidate a significant number of business applications in a relatively small footprint.

Last year, at an IBM technical conference, one of their senior technical representatives said the following: "In the coming years, the operating systems we use today will be merely applications running in a single operating system". Although you could say that's true today, it's only the beginning of what is going to be a complete shift in the traditional way we approach and think about "servers". Virtualization is growing at an exponential rate, and the shift to multi-core is only going to accelerate that growth.

Although a significant portion of Clovertown systems will be deployed in virtualized environments, some will be used in more traditional single-purpose server scenarios. There is something to keep in mind, however, if you plan to throw eight cores at your database server or any other I/O-intensive server: you have now at least doubled your processing power relative to a dual core configuration, and ensuring that your I/O subsystem can keep up with that extra processing power may be difficult. As you will read later in the article, we ran into significant issues with our test suite when pairing eight cores with our I/O subsystem.

Architecture & Roadmap
Comments

  • TA152H - Monday, April 02, 2007 - link

    Viditor,

    Are you making this stuff up, or going by what Intel has said?

Intel has said that the reason they haven't gone with an on-board memory controller for the Core 2 is that they preferred to use the silicon for the cache and other things. I think a lot of it is because they sell a lot of IGPs, and didn't want the awkward arrangement of either adding another memory controller outside of the processor, or having to use the processor's memory controller since the IGP doesn't have its own memory. The last part is speculation on my part; Intel said they preferred to use the transistors differently, and used cache as an example.

Your argument has now become comparative, rather than absolute, going back to what I am saying about it helping enough. Also remember that Penryn will have larger caches, which helps mitigate this problem since you will have less contention. Both together should make a reasonably large impact on bandwidth-restricted situations.

    With regards to 2+2, actually, you're wrong on that. That's exactly what Intel said. They commented that they are able to run them at higher clock speeds than they could if they went native four, since they can test before they are all together rather than have to downbin, or throw away, a whole part if one of the dual cores is a failure or can't clock high. It's not speculation on my part.

    Apps becoming more parallel is kind of a bad joke that people who are clueless talk about. Multithreading has been around since 1988 with OS/2, and back then I was writing them. Even for single processors, you did this, because good programmers wanted their application to always be responsive to the user even when you were doing things for them. Admittedly, Windows was quite a bit behind, but multithreading is nothing new, and there are limitations to it that simply can't be overcome. For some applications, it works great, for others you can't use it. Multiple cores are fairly new mainly because AMD and Intel can't think of anything better to do with the transistors, but multiprocessor computers are not, and people have been writing applications for them for many, many years (myself included). ILP applies to everything, TLP does not, and is essentially an admission from CPU makers that they are in a very, very diminishing returns situation with regards to transistors and performance.
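The point above, that some algorithms multithread well and others simply can't, can be sketched concretely. This is my own illustration in Python, not code from the discussion: element-wise work splits cleanly across workers, while a recurrence has a serial dependency chain that no number of cores removes.

```python
# Sketch: a parallelizable workload vs. an inherently serial one.
from concurrent.futures import ThreadPoolExecutor

def square_all(values):
    # Each element is independent of the others, so the work
    # divides cleanly across a pool of workers.
    with ThreadPoolExecutor(max_workers=4) as pool:
        return list(pool.map(lambda v: v * v, values))

def recurrence(n, seed=1):
    # x[i] depends on x[i-1]; every step must wait for the previous
    # one, so extra cores cannot speed this loop up.
    x = seed
    for _ in range(n):
        x = (x * 3 + 1) % 1000
    return x
```

The first function gains from TLP; the second is stuck at single-thread speed regardless of core count, which is the limitation the post describes.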

    With regards to the shared cache, you are also incorrect in saying it is why the Core 2 is so fast. It's a tradeoff, and you seem to ignore the L2 now has four more wait states because it is shared by two processors. I'm not sure how many more they'd have to add if it were shared among four cores, but it wouldn't be a free lunch.

    Also keep in mind, theory sounds great, but where the rubber meets the road, the Clovertown does really well, and the main limitations have nothing to do with the trivialities of having a 2+2. In apps that can use it, the quad core shows a dramatic improvement over the dual. The FSB problems show up in these benchmarks rather vividly though, not a percentage or two that aren't that easily noticed.


    Reply
  • Viditor - Monday, April 02, 2007 - link

    TA152H

    I don't "make stuff up", mate...
    "Intel does not integrate the memory controller. One reason is that memory standards change. Current Athlon computers, for instance, don't come with DDR II memory because the integrated memory controller connects to DDR I. Intel once tried to come out with a chip, Timna, that had an integrated memory controller that hooked up to Rambus. The flop of Rambus in the market led to the untimely demise of the chip"
News.com story: http://news.com.com/2061-10791_3-6047412.html

    While they also listed the large cache and space as a "reason", this was the reason they mentioned most often in interviews.
    If by your insinuation you were questioning how long it takes to build a chip, I'm afraid that is just a result of many years of industry knowledge on my part (though if you ask anybody who works in the semi industry, they will confirm this for you).
Nehalem, for example, began its design almost 6 years ago, and has been delayed because of necessary architectural changes (similar to the way Itanium was).

    quote:

    Also remember that the Penryn will have larger caches, which helps mediate this problem since you will have less contention

    Actually, the large cache doesn't help at all with the MCH bottleneck problem...in fact it makes it slightly worse. Remember that the data path for interchip communication is from cache to cache, not from system memory to cache. The larger cache (with the help of a good prefetcher) certainly helps reduce memory latency (though not as much as an on-die controller)...
    quote:

    but multithreading is nothing new, and there are limitations to it that simply can't be overcome...For some applications, it works great, for others you can't use it. Multiple cores are fairly new mainly because AMD and Intel can't think of anything better to do with the transistors

Actually, multi-cores have been around for a while...the Power4 was dual core back in 2000. What's new is that mainstream consumer-level apps are being written for TLP because single cores are being phased out...
    quote:

    ILP applies to everything, TLP does not, and is essentially an admission from CPU makers that they are in a very, very diminishing returns situation with regards to transistors and performance

    Not true...Intel tried to convert everything to ILP with Itanium and EPIC, but it was the market (and in many cases the software companies) that decided that it was too hard and too expensive for not enough gain. Most (if not all) software companies are now developing for greater TLP efficiency, as this allows a much smoother transition (evolutionary vs revolutionary).
    Sure multithreading has been around for a long time, I used many programs on my old Amiga that were multi-threaded...but it's a matter of degree.
To use an analogy: when I was a kid, the best TV set you could buy was a 6" black and white set; today I have a 50" plasma that displays native 1080p. The degree to which software is optimized for TLP is increasing every day.
    quote:

    With regards to the shared cache, you are also incorrect in saying it is why the Core 2 is so fast

    I said "one of the reasons"...
    quote:

    theory sounds great, but where the rubber meets the road, the Clovertown does really well, and the main limitations have nothing to do with the trivialities of having a 2+2. In apps that can use it, the quad core shows a dramatic improvement over the dual

Actually, Clovertown is at the bottom when you're talking 4 cores...
For example, a 2P Woodcrest is significantly faster than a similarly clocked Clovertown, even though they are essentially the same thing. The reason for this is that the two Woodcrest dies in the Clovertown must share one connection to the MCH, while the 2x2P Woodcrests each have their own connection.
    Reply
  • TA152H - Tuesday, April 03, 2007 - link

Actually, if you read the article, it says much more what I am saying. It talks mostly about cache, and in the interviews I have seen, that's what Intel touts. Even this article you present as proof shows the opposite: it mentions the memory changes, and then goes on and on about the extra cache and the performance of Core 2, not how quickly it can change with memory standards. Your whole premise is illogical; you are saying that with Nehalem, all of a sudden, memory changes will happen slower. That's plain wrong. I am saying that with Nehalem and 45 nm lithography, and the diminishing returns from adding more cache, it makes more sense for Intel to add the controller. Which is more logical to you?

The larger cache makes it unnecessary for the cores to use the FSB as often, which removes a bottleneck and causes fewer collisions. This has always been the case with multiprocessor configurations. If we have a 2+2, and one set needs to access main memory while the other can access its cache, you'll have fewer collisions than if they both needed to access main memory through the FSB. With a larger cache, you'll have fewer reads to main memory from each set of cores, and thus less contention.

I disagree with your remarks about TLP becoming suddenly important. Have you already forgotten about Hyper-Threading? Also, as I mentioned, there were ALWAYS advantages to writing multithreaded apps, even with one processor. I gave you one example: you always want your application to respond to a user, even if only to tell them that you are doing something in the background for them. Another reason is that it is a lot more efficient, yes, even with a single processor. Even with the mighty 286 (an amazing processor for its day) the processor spent way too much time waiting for the I/O subsystems, and a multithreaded application kept the processor busy while one thread waited on the leisurely hard disk. Yes, most programmers are hackers (a term misused now to mean someone who does bad things with code, whereas it meant someone who just sucked and couldn't write elegant code and hacked his way through it with badly written rubbish), but they still knew to write multithreaded stuff before dual cores, particularly with multiprocessing configurations becoming much more common with the P6. I'm not saying you won't see more of an effort, but the way things are being spoken about in the press is that it just takes some effort and these multi-cores will become absolutely fantastic when the software catches up. It ain't so; it's way overblown, and there are a lot of things that will never be multithreaded because they can't be, and others that only benefit somewhat from it. Others will do great with it; it all depends on the type of application. Not every algorithm can be multithreaded effectively, and anyone who tells you otherwise reads too much press and hasn't coded a day in his or her life.
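The responsiveness pattern described above, where one thread waits on slow I/O while the main thread stays free, can be sketched in a few lines. This is my own minimal illustration (the sleep stands in for a slow disk read; the names are mine, not from the discussion):

```python
# Sketch: keep the main thread responsive while a worker waits on I/O.
import threading
import time
import queue

results = queue.Queue()

def slow_io(out):
    time.sleep(0.05)          # stand-in for a leisurely hard disk
    out.put("data loaded")

worker = threading.Thread(target=slow_io, args=(results,))
worker.start()

# The main thread polls instead of blocking, so it could still
# repaint the screen or acknowledge user input here.
while results.empty():
    time.sleep(0.01)

worker.join()
msg = results.get()
print(msg)
```

Even on a single core, the CPU does useful work during the I/O wait, which is the efficiency argument the post makes for the 286 era.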

Your remarks about the Itanium are so bad I'm surprised you made them. Are you really this uninformed, or arguing just to argue? I think the latter. The problems with Itanium have nothing to do with ILP, although that was one of Intel's goals with it. The problem is, it remained a goal and has not been realized yet. Are you implying that the Itanium 2 has higher single-threaded performance than the Core 2? I hope not.

If it had, say, 30% higher integer performance per core on a wide list of applications, you'd have a big point there. It doesn't. It trails, in fact. First of all, I wouldn't call the Itanium a failure, because it's still premature to, and I don't like counting out anything that gains market share year after year (albeit at a lower than expected rate). However, to the extent it has failed to gain the anticipated acceptance, that has a lot to do with cost, failures to meet schedules on Intel's part, the weird VLIW instruction set that people tend to dislike even as much as x86, and the fact it didn't run mainstream software well. Compatibility is so important, and that's why arguably the worst instruction set (aside from Intel's 432) is still king. Motorola's 68K line was much more elegant. Alpha even ran NT and couldn't dethrone it. It's hard to move people from x86, nearly (or possibly) impossible, and if you think this is some indictment against ILP, you're not even with reality.

    Six years to design a processor is absurd, and you should know better. If you want to screw around with numbers why not start around 1991 or so when Intel started work on the P6 and say the Nehalem took 17 years, since some of it will come from there. People love throwing around BS numbers like that because it sounds impressive, but you only need to look at how quickly AMD and Intel add technology to their products to see it doesn't take six years. Look at AMD copying SSE, and Intel copying x86-64. Products now are derivative of earlier generations anyway, so you can't go six years back. The Nehalem will build on the Merced, it's not a totally from scratch processor. The Pentium 4 was pretty close, and the Prescott was a massive overhaul of the processor (much more than the Athlon 64 was vis-a-vis the Athlon), and it didn't take them even close to six years.
    Reply
  • Viditor - Tuesday, April 03, 2007 - link

    quote:

    Your whole premise is illogical, you are saying that with the Nehalem all the sudden memory changes will happen slower

???...sigh...I never said anything of the sort. I can see that you are just trying to read into anything published or said just what you want it to say, so I'll stop there. Everyone else can just read the article (and the CC, the other articles Intel published on the subject, etc...). But your misunderstanding becomes clear with the following:

    quote:

    Six years to design a processor is absurd, and you should know better...


Just to pull from a Google search at random (this one from Wikipedia: http://en.wikipedia.org/wiki/CPU_design)
    "The design cost of a high-end CPU will be on the order of US $100 million. Since the design of such high-end chips nominally take about five years to complete, to stay competitive a company has to fund at least two of these large design teams to release products at the rate of 2.5 years per product generation"

    It's my mistake really...I thought that since you used all of these buzz words, you actually knew the industry. I was wrong...

    quote:

    Look at AMD copying SSE, and Intel copying x86-64. Products now are derivative of earlier generations anyway, so you can't go six years back

This is another misconception of the novice...

    1. Things like x86-64 and SSE are published many years before they are built. For example, x86-64 was first published for the public in 2001 (and in fact AMD had started work on it in 1998/9) under the name LDT. In fact, it was released to the open Consortium as freely distributable in April of 2001. The first K8 chip wasn't released until 2003.
Likewise, Intel's Yamhill team began work on x86-64 in 2000/1, though they didn't admit its existence until much later because they wanted to foster support for IA64. The first EM64T chip was released in Q1 2005...

    2. Intel and AMD have a comprehensive cross-licensing deal for their patents, and the patents are filed well before development begins...so even before it becomes public, they each know what technology the other is working on many years before release.


    There are so many inaccuracies and misunderstandings in your posts that I suggest the following:
    1. Use the quote feature so that I can understand just what it is you're responding to. Several of your points have nothing to do with what I said...

    2. Try actually posting a link now and then so that we can see that what you're saying isn't just something else you've misunderstood...
    Reply
  • TA152H - Wednesday, April 04, 2007 - link

    I think you have a problem connecting things you say with their logical foundations, and I'll help you with that defect.

You said that Intel's main reason for not putting a memory controller on the chip was that changes in memory happen too quickly. Intel is putting a memory controller on-chip for Nehalem. Therefore, the logical conclusion is that this problem will not be as big a one with Nehalem, since it no longer prevents Intel from doing it. You really didn't understand that? Why am I even arguing with you when you have such gaps in reasoning? I said it was mainly for the real estate savings, and that becomes less of a problem at 45 nm since you have more transistors, so it's a logical premise, unlike yours.

It's kind of interesting that you read things, but don't really understand much. First of all, you said six years, now you're down to five. You also assume a completely new design, which isn't the case anymore. They are derivatives of previous designs. How long do you think it took to do the original Alpha? Mind you, this is from brainstorming the requirements and what they wanted to do, designing the instruction set, etc... This is when superscalar was extremely unusual, superpipelining was unheard of, and a lot of the features on this processor were very new. Even then, it took less than five years. Byte magazine has a good story on it from August 1992.

If you could remember anything, you'd know that AMD was against using SSE and was touting 3DNow! instead. Companies get patents, but they don't tell the whole story, or for the purpose of designing a processor, any meaningful story. To make the transistor designs, you need to know specifics about how things will act under every situation and the necessary behavior. You are clueless if you think that's in the patents. You also need an actual processor on hand so you can test. You wouldn't want to be AMD and implement just based on specs, because inevitably there would be incompatibilities.

    You are also using your pretzel logic with regards to Yamhill. The processors had this logic in them way before they were released, and the design was done well before that. You really don't understand that? The only positive from this is you at least admit it's not six years, but is five. You'll slowly worm your way down to a realistic number, but five isn't so bad.

    With regards to what I'm responding to, I could paste your stuff, but you have logical deficiencies. You are talking about multi-core, and can't make the connection to me saying multithreading has been going on forever. Even in 1992 (I got a nice batch of Byte Magazines off of eBay, and I am rereading a few of them), they were talking about how multiple cores were the future, in MIMD or SIMD configurations. How multithreading was going to take over the world, and how programmers were working on it, etc... It's funny, people are so clueless, and they just read articles and repeat them (hey, that's what I'm doing!).

    My suggestion to you is to go back and get a nice batch of Byte magazines on eBay, and read them and really try to understand what they're saying, instead of being a parrot that repeats stuff you don't understand and try to sound impressive with it.

    I'm done arguing with you, you're not informed enough to even interest me, and I won't even waste my time to read your responses.
    Reply
  • Viditor - Wednesday, April 04, 2007 - link

    quote:

    You are said that Intel's main reason for not putting a memory controller on the chip was because changes in memory happen too quickly

    You see? That's why I asked you to actually quote (I really was being quite sincere, it will help you)...that's NOT what I said.

    What I said was that this was the reason Intel gave publicly, but that the real reason was that redesign of an architecture takes years not months. This is why they couldn't fit it on to C2D but will be able to on Nehalem...

    quote:

    First of all, you said six years, now you're down to five

    I said Nehalem was six years and that the average was five (please go back and reread my posts...or maybe use quote?). I also said that the reason was that Nehalem was changed which is WHY it took 6 years.

    quote:

    You also assume a completely new design, which isn't the case anymore. They are derivative from previous designs

They are all derivatives of a previous design...for example, the C2D is a derivative of the P3. Did you think that Intel was just twiddling its thumbs? AMD had several years of advantage over the Netburst architecture...don't you think they would have released the C2D many years earlier if they could have?

    quote:

    AMD was against using SSE and was touting 3D Now!

    They use both (even now), but of course they would have preferred just 3D Now (just as Intel would have preferred everyone using just IA64). What's your point?

    quote:

    To make the transistor designs, you need to know specifics about how things will act under every situation and the necessary behavior...You also need an actual processor to have so you can test

    Sigh...
    1. You need to learn the difference between "transistor design" and microarchitectural design. Both take a long time, but they are entirely different things (transistor design is part of manufacturing).
2. There are certainly ways to test as the product is being developed. For example, AMD released an AMD64 simulator and debugger to the public in 2000 (http://www.theregister.co.uk/2000/10/14/amd_ships_...)...
    3. Even before initial tape-out (this is the first complete mask set), many sets of hand tooled silicon are made to test the individual circuits. This is the reason it takes so long...Each team works on their own specific area, then when the chip is first taped out they work on the processor as a whole unit.
    4. Patents are often what initiate parts of the design...but I fail to see your point.

    quote:

    with regards to Yamhill. The processors had this logic in them way before they were released, and the design was done well before that

    The first Intel processors to actually have the circuits in them (not activated) were the initial Prescotts. But saying the design was done is ludicrous...can you give a single reason why Intel included the circuits (and remember that it's expensive to add those transistors) without being able to use them other than the design not quite being finished??

    quote:

    I could paste your stuff, but you have logical deficiencies

    I see...so instead of actually responding to what I've said, you deem it illogical and make up what I said instead?

    quote:

    I'm done arguing with you

    Great idea...best one you've had. And my apologies to everyone for the length of the thread...
    Reply
  • TA152H - Tuesday, April 03, 2007 - link

    Yikes, holy typos Batman.

    I meant to say the Nehalem will build on the Merom. If it built on the Merced, maybe it does take six years, and I'm thinking AMD would have a real good chance of gaining market share.
    Reply
  • yyrkoon - Monday, April 02, 2007 - link

    quote:

    I'm not sure why you guys don't think an increase in FSB and memory bandwidth (i.e. 1600) isn't going to help. It's seems beyond obvious it will. Will it help enough is the only question.


You really haven't been following processors for the last 12-14 years, have you? It has been proven, time and time again, that a faster FSB matters more than almost anything else (aside from processor core speed) for performance. Faster FSB == faster CPU->L1->L2. Memory bandwidth not so much (this is only because nothing takes advantage of memory bandwidth currently, and to be honest, I am not sure anything can at this point), but DEFINITELY FSB. Since I do not see a faster core speed in the near future, the only other option for faster processors, aside from 'smarter' branch prediction, HAS to be FSB.

    Now, since I have spoken against you, I suppose I am a 'dolt', or a 'moron', right ?
    Reply
  • TA152H - Monday, April 02, 2007 - link

Is English your first language? I keep reading your subliterate drivel and I'm not even sure what you're saying. I think you're agreeing with me that FSB does make a difference, but your writing ability is so poor it's hard to tell.

    Either way, you're a moron or dolt, or whatever you choose :P.
    Reply
  • yyrkoon - Tuesday, April 03, 2007 - link

    quote:

    I'm not sure why you guys don't think an increase in FSB and memory bandwidth (i.e. 1600) isn't going to help.


    Yeah, ok, I am agreeing with you. Your triple negative threw me off there . . .
    Reply
