An Unbalanced L1 Cache: We Know Why

The Atom processor is outfitted with fairly large caches, which are quite necessary given its in-order architecture that's very sensitive to high memory latencies. We wrote the following in our initial Atom (Silverthorne) architecture discussion:

"The L1 cache is unusually asymmetric with a 32KB instruction and 24KB data cache, a decision made to optimize for performance, die size, and cost. The L2 cache is an 8-way 512KB design, very similar to what was used in the Core architecture.

While Silverthorne is built entirely on Intel's high-k/metal gate 45nm process, there is one major difference: SRAM cell size. Intel uses a 0.382 um^2 SRAM cell in Silverthorne compared to 0.346 um^2 in Core 2. Each SRAM cell is an 8 transistor design compared to 6 transistors in Core 2. The larger cell size increases the die size of Silverthorne but it draws less power and runs at a lower voltage."

At the time we didn't have a good explanation for why the Atom's L1 cache wasn't made of equal-sized instruction and data caches, which is how Intel usually designs its processors. Since then we have gotten some more insight into the design decision:

Historically, Intel would design a microprocessor for a particular manufacturing process (e.g. 65nm) and shoot for a target voltage, later attempting to lower that voltage when possible. Atom was designed around the absolute minimum voltage the manufacturing process (45nm) was capable of running at and the engineers were left with the task of figuring out what they could do, architecturally, given that requirement.

The perfect example of this approach to design is Atom's L1 instruction and data caches. Originally these two caches were small signal arrays (6 transistors per cell); they were very compact and delivered the performance Intel desired. However, while modeling the chip, Intel noticed that these arrays limited how far the operating voltage of the chip could be scaled down.

Instead of bumping up the voltage and sticking with a small signal array, Intel switched to a register file (1 read/1 write port). The cache now had a larger cell size (8 transistors per cell) which increased the area and footprint of the L1 instruction and data caches. The Atom floorplan had issues accommodating the larger sizes so the data cache had to be cut down from 32KB to 24KB in favor of the power benefits. We wondered why Atom had an asymmetrical L1 data and instruction cache (24KB and 32KB respectively, instead of 32KB/32KB) and it turns out that the cause was voltage.
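As a rough sanity check on that trade-off, the sketch below counts only the storage transistors in the data array. It is back-of-the-envelope arithmetic, not Intel's floorplan math: the helper name is made up for illustration, one cell per bit is assumed, and tags, parity, and peripheral logic are ignored. Transistor count is also only a loose proxy for area.

```python
# Back-of-the-envelope count of storage transistors in an L1 data array.
# Assumption: one cell per data bit; tags, parity and peripheral logic ignored.

def storage_transistors(size_kb: int, transistors_per_cell: int) -> int:
    """Transistors needed to hold size_kb kilobytes of data bits."""
    bits = size_kb * 1024 * 8
    return bits * transistors_per_cell

original_6t = storage_transistors(32, 6)   # 32KB small signal array (6T)
same_size_8t = storage_transistors(32, 8)  # 32KB rebuilt with 8T cells
shipped_8t = storage_transistors(24, 8)    # 24KB with 8T cells

print("32KB @ 6T:", original_6t)
print("32KB @ 8T:", same_size_8t)
print("24KB @ 8T:", shipped_8t)
print("8T/6T growth at equal capacity:", same_size_8t / original_6t)
```

Under these assumptions a 24KB array of 8T cells needs exactly as many storage transistors as a 32KB array of 6T cells, which is at least consistent with the data cache shrinking by a quarter when the cell grew by a third.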

A small signal array design based on a 6T cell has a certain minimum operating voltage; in other words, it can retain state only down to a certain Vmin. In the L2 cache, Intel was able to use a 6T small signal array design because the L2 has inline ECC. There were other design decisions at work that prevented Intel from equipping the L1 cache with inline ECC, so the architects needed to go to a larger cell size in order to keep the operating voltage low.

The end result of this sort of a design approach is that the Atom processor is able to operate at its highest performance state (C0) at its minimum operating voltage.

Hardware Prefetchers: So Necessary

Atom features two hardware prefetchers, one that prefetches from the L2 cache into the L1 data cache and one from memory into the L2 cache.

Hardware prefetching is unbelievably important when dealing with an in-order core because, as we've mentioned time and time again, not having data available in cache means that the pipelines will stall until that data is available.
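To put a rough shape on that claim, here is a toy model of an in-order core walking sequentially through memory. It is only a sketch: the line size, the latencies, and the idealized next-line prefetcher are all made-up assumptions for illustration and do not describe how Atom's actual prefetchers work.

```python
# Toy model of an in-order core reading sequentially through memory.
# All sizes and latencies are illustrative assumptions, not Atom's real values.

LINE_BYTES = 64     # assumed cache line size
HIT_CYCLES = 3      # assumed cost of a load that hits in cache
MISS_CYCLES = 200   # assumed cost of a load that has to go to memory

def run(num_loads: int, stride: int, prefetch_next_line: bool) -> int:
    """Return total cycles spent on num_loads loads spaced stride bytes apart.

    An in-order core cannot work past a missing load, so every miss
    adds its full latency to the total.
    """
    cached = set()  # line numbers currently in cache (no eviction in this toy)
    cycles = 0
    for i in range(num_loads):
        line = (i * stride) // LINE_BYTES
        if line in cached:
            cycles += HIT_CYCLES
        else:
            cycles += MISS_CYCLES
            cached.add(line)
        if prefetch_next_line:
            # Idealized prefetcher: the next line arrives for free and on time.
            cached.add(line + 1)
    return cycles

without_pf = run(1024, 8, prefetch_next_line=False)
with_pf = run(1024, 8, prefetch_next_line=True)
print("cycles without prefetch:", without_pf)
print("cycles with prefetch:   ", with_pf)
```

In this model the only miss left with prefetching enabled is the very first load. A real prefetcher is neither free nor perfectly timed, so the gap here is an upper bound on the benefit rather than a prediction.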

The obvious long term solution to the problem of data starvation is to integrate the memory controller on die. With no 45nm MCH design ready by the time the Atom design was complete, Intel has to wait until the second generation Atom (codename: Moorestown) to gain an on-die memory controller.


46 Comments


  • adntaylor - Tuesday, April 8, 2008 - link

    On that chart with price / power, you need to be clearer...

    For price, you show the combined price for CPU + Chipset. For power, you say just the CPU... so 0.65W for the CPU... but you're conveniently ignoring the >2W figure for the chipset!!! This absolutely flatters Intel wherever possible.

    AMD are just as misleading - they describe the Geode LX as "1W" which excludes the non-CPU core parts of the chip (which is an integrated CPU + GMCH)

    Just please be honest - the figures are out there in the Intel datasheets... it takes 10 minutes to check.
  • Clauzii - Friday, April 4, 2008 - link

    I still have a PowerVR 4MB add-on card, running in tandem with a Rage128Pro. Quite a combination with 15 FPS in Tomb Raider. Constant(!) 15 FPS, that is.

    Amazing what they actually achieved back in 95!
  • Clauzii - Friday, April 4, 2008 - link

    Ooops!

    Totally misplaced that. Sorry.
  • wimaxltepro - Friday, April 4, 2008 - link

    The Atom represents a shift in processor architecture that is the most dramatic departure for Intel since introduction of x86 processors... the philosophy of how computing itself occurs from centralized processors to distributed processing based on an extension of the popular x86 instruction set.

    The Atom is not about the immediate prospects for the Atom or Nehalem products: we will likely see members of Intel's new product family be used in embedded applications in consumer products and in areas where specialized communications processors are more the rule. While not optimized for use in specific networking applications, the products capitalize on the wide range of support available in IT/Networking to develop common functions that leverage the low cost, low power/processing capability to be used as a common denominator for a wide range of applications.

    Intel has been built on the 'Wintel' architecture: massively integrated chips needed to handle the massively integrated operating systems and applications of Windows (and Apple) environments. The Atom allows migration and broadening out from that architectural motif to a very highly distributed architecture. So, the increased parallelism found in the internal chip architecture is enabling of changes in external system architectures and device applications that go well beyond the typical domain of Intel.. and right into the domain of 'personal wireless broadband' and SDWN, Smart Distributed Wireless broadband Network.

    The decisions about in-order vs. out of-order instruction streams, memory architecture, I/O architecture have been made in light of the broad vision for how computing, networking and, out of hand, how wireless enabled broadband networking including WiMAX will occur. This should be understood for what it represents as a shift in direction for Intel both in response to broad industry shifts and as a trend setting development.
  • jtleon - Friday, April 4, 2008 - link

    Thanks to all the Flash player ads, etc., a mobile web device will continuously avoid switching to low power states. Thus one could argue that advertising will be the carbon footprint enemy of the internet's future. This is already becoming the case for desktop/laptop machines.

    Without such continuous (arguably wasted) consumption of CPU power, then Intel's engineered power management might have a significant impact on the value of the Atom.

    Regards,
    jtleon
  • 0WaxMan0 - Friday, April 4, 2008 - link

    I am definitely much impressed and enthused by Intel's work here; the future looks interesting, especially for those of us who like low power cross-compatible computing products.

    However, I have to point out that a low power modern x86 CPU has already been done, in fact 4 years ago, with AMD's Geode. While technically much weaker than the Atom and without anywhere near the scalability (single core design etc.), the Geode has been available in the same TDP ranges for a good long while. Take a look here http://www.amdboard.com/geode.html for some old stuff.

    I do hope that the Intel name and hype makes more of an impact than AMD managed.
  • whycode - Thursday, April 3, 2008 - link

    Does the TDP quoted include the chipset? Or is that CPU only?
  • IntelUser2000 - Thursday, April 3, 2008 - link

    Anand, the Pentium M does not feature Macro Ops Fusion. It's Core 2 Duo that started Macro Ops Fusion.
  • Anand Lal Shimpi - Thursday, April 3, 2008 - link

    You're correct, I was referencing micro-op fusion. I've made the appropriate correction :)

    Take care,
    Anand
  • squito - Wednesday, April 2, 2008 - link

    Am I the only one shocked to see that Poulsbo is a 130nm part...
