Cell’s Dynamic Logic

Although it’s beyond the scope of this article, one of the major problems with static CMOS circuits are the p-type transistors, and the fact that for every n-type transistor, you also must use a p-type transistor.

There is an alternative known as dynamic or pseudo-NMOS logic, which gets around the problems of static CMOS while achieving the same functionality.   Let’s take a look at that static CMOS NOR gate again:

The two transistors at the top of the diagram are p-type transistors.   When either A or B are high (i.e. have a logical 1 value), then the p-type transistor gates remain open, with no current flowing.   In that case, the output of the circuit is ground, or 0 since the complementary n-type transistors at the bottom function oppositely from their p-type counterparts (e.g. current can flow when the input is high).

Thus, the NOR gate outputs a 1 only if all inputs are 0, which is exactly how a NOR gate should function.

Now, let’s take a look at a pseudo-NMOS implementation of the same NOR gate:

There are a few things to notice here.  First and foremost, the clock signal is tied to two transistors (a p-type at the top, and an n-type at the bottom) whereas there was no clock signal directly to the NOR gate in our static CMOS example.  There is a clock signal fed to the gate here.

Cell’s implementation goes one step further.  The p-type transistor at the top of the circuit and the n-type transistor at the bottom are clocked on non-overlapping phases, meaning that the two clocks aren’t high/low at the same time.

The way in which the gate here works is as follows: inputs are first applied to the logic in between the clock fed transistors.   The top transistor’s gate is closed allowing the logic transistors to charge up.  The gate is then opened and the lower transistor’s gate is closed to drain the logic transistors to ground.   The charge that remains is the output of the circuit.

What’s important about this is that since power is only consumed during two non-overlapping phases, overall power consumption is lower than static CMOS.   The downside is that clock signal routing becomes much more difficult.

The other benefit is lower transistor count.  In the example of the 2-input NOR gate, our static CMOS design used 4 transistors, while our pseudo-NMOS implementation used 4 transistors as well.   But for a 3-input NOR gate, the static CMOS implementation requires 6 transistors, while the pseudo-NMOS implementation requires 5.  The reasoning is that for a CMOS circuit, you have 1 p-type transistor for every n-type, while in a pseudo-NMOS circuit you only have two additional transistors beyond the bare minimum required to implement the logic function.   For a 100-input NOR gate (unrealistic, but a good example), a static CMOS implementation would require 200 transistors, while a pseudo-NMOS implementation would only require 102.

By making more efficient use of transistors and lowering power consumption, Cell’s pseudo-NMOS logic design enables higher clock frequencies.   The added cost is in the manufacturing and design stages:
  1. As we mentioned before, clock routing becomes increasingly difficult with pseudo-NMOS designs similar to that used in Cell.   The clock trees required for Cell are probably fairly complex, but given IBM’s expertise in the field, it’s not an insurmountable problem.
  2. Designing pseudo-NMOS logic isn’t easy, and there are no widely available libraries from which to pull circuit designs.   Once again, given IBM’s size and expertise, this isn’t much of an issue, but it does act as a barrier for entry of smaller chip manufacturers.
  3. Manufacturing such high speed dynamic logic circuits often requires techniques like SOI, but once again, not a problem for IBM given that they have been working on SOI for quite some time now.   There’s no surprise that Cell is manufactured on a 90nm SOI process.
On the gate level, just like at the logic level, Cell is architected for high speeds and efficient use of transistors.

Understanding Gates Blueprint for a High Performance per Transistor CPU
Comments Locked

70 Comments

View All Comments

  • Poser - Thursday, March 17, 2005 - link

    There were moments while reading this article that I expected there to be a "Test Yourself" quiz at the end of the chapter ... er, article. Which isn't to say that articles like this are too textbookish, it's to say that they're wonderfully educational. And very, very cool for being so.

    I'm half joking when I say this (but only half) -- a real "test" at the end of the article would be fun. I could see if I really understood what I read, and even get to compare my score to the rest of the, uhm, class.
  • drinkmorejava - Thursday, March 17, 2005 - link

    very nice, how long did it take to write that thing?
  • Eug - Thursday, March 17, 2005 - link

    #42,

    That's an interesting page, cuz everyone on OS X already knows that Word is slow on the Mac. It brings us back to the original statement that some ported software may be problematic performance-wise.

    And the generic comment on the Mac side about Premiere is, well... use Final Cut Pro. :) Here is a test that seems a bit more useful, since it tests Cinema4D and After Effects, two apps that people use on the Mac and both of which are reasonably well optimized:

    http://digitalvideoediting.com/articles/viewarticl...

    That's a good point about the memory scaling though. The IMC with AMD's chips is a definite advantage. I'm sure the G5 970MP dual-core won't get an IMC either.

    Anyways, as far as this article is concerned, the G5 is kinda irrelevant. The interesting part for Apple in Cell is the PPE unit. It's also interesting that Anand says the original SPE was supposed to be VMX/Altivec. But the current SPE is not Altivec so it's less applicable for Apple, at least in the near term.

    It would be interesting to know how fast a dual-core 3 GHz PPE would be in general laptop-type code, and how much power it would put out.
  • MDme - Thursday, March 17, 2005 - link

    #39, 40, 41

    http://www.pcworld.com/news/article/0,aid,112749,p...

    remember that the athlon 64 chips scale better at higher clock speeds due to the mem controller scaling as well.

  • Eug - Thursday, March 17, 2005 - link

    Well, one example is Cinebench 2003:

    The dual G5 2.0 GHz is about the same speed as a dual 0pteron 246 2.0 GHz, with a score at around 500ish.

    http://www.aceshardware.com/read.jsp?id=60000284

    BTW, a dual G5 2.5 GHz scores 633.
  • suryad - Thursday, March 17, 2005 - link

    Hmm that is interesting what you say Eug. I see your point do you have any links on straight comparos between an FX and a top of the line Mac? Or from personal experience folding and such...
  • Eug - Thursday, March 17, 2005 - link

    #38. It's a mistake to say an AMD FX 55 smokes a dual G5 2.5. For instance, if you like scientific dual-threaded stuff, the G5 does very well. However, the AMD FX 55 IS faster than a single G5 2.5. It's got a slight edge clock-for-clock, and it's clocked slightly higher too.

    The real problem is when you have stuff built for x86 ported over to PPC. It just isn't great on the Mac side performance-wise in that situation. And Macs aren't tweaked for gaming either. The AMD is going to smoke the Mac in Doom 3 of course.

    I think with the performance advantage of the Opteron, I'd put a single G5 2.5 in the range of performance of a single Opteron 2.2-2.4 GHz, depending on the app. The real interesting part though will be the coming quarter, when the new G5s are released. They should get a significant clock speed bump (20%?) and information on dual-core G5s are already out there (like with AMD and their dual-core Athlons). They also get a cache boost. Right now they only have 512 KB, but are expected to get 1 MB L2.
  • suryad - Thursday, March 17, 2005 - link

    Well scrotemaninov I am not disputing that the POWER architecture by IBM is brilliantly done. IBM is definitely one of those companies churning out brilliant and elegant technology always in the background.

    But my problem with the POWER technology is from what I understand very limitedly, is that the POWER processors in the Mac machines are a derivative of that architecture right? Why the heck are they so damn slow then?

    I mean you can buy an AMD FX 55 based on the crappy legacy x86 arch and it smokes the dual 2.5 GHz Macs easily!! Is it cause of the OS? Because so far from what I have seen, if the Macs are any indication of the performance capabilities of the POWER architecture, the Cell will not be a big hit.

    I did read though at www.aceshardware.com benchmark reviews of the POWER5 architecture with some insane number of cores if I recall correctly and the benchmarks were of the charts. They are definitely not what the Macs have installed in them...
  • scrotemaninov - Thursday, March 17, 2005 - link

    #35: different approaches to solving the same problem.

    Intel came up with x86 a long time ago and it's complete rubbish but they maintain it for backwards compatibility (here's an argument for Open Source Software if ever there was one...). They have huge amounts of logic to effectively translate x86 into RISC instructions - look at the L1I Trace Cache in the P4 for example.

    IBM aren't bound by the same constraints - their PowerPC ISA is really quite nice and so there's no where near the same amount of pain suffered trying to deal with the same problem. It does seem however, that IBM are almost at the point that Intel want to be in 10 years time...
  • Verdant - Thursday, March 17, 2005 - link

    here is a question...

    it mentions (or alludes) in the article that having no cache means that knowing exactly when an instruction would be executed is possible, is the memory interface therefore a strict "real time system" ?

Log in

Don't have an account? Sign up now