An Unbalanced L1 Cache: We Know Why

The Atom processor is outfitted with fairly large caches, which are quite necessary given its in-order architecture that's very sensitive to high memory latencies. We wrote the following in our initial Atom (Silverthorne) architecture discussion:

"The L1 cache is unusually asymmetric with a 32KB instruction and 24KB data cache, a decision made to optimize for performance, die size, and cost. The L2 cache is an 8-way 512KB design, very similar to what was used in the Core architecture.

While Silverthorne is built entirely on Intel's high-k/metal gate 45nm process, there is one major difference: SRAM cell size. Intel uses a 0.382 um^2 SRAM cell in Silverthorne compared to 0.346 um^2 in Core 2. Each SRAM cell is an 8 transistor design compared to 6 transistors in Core 2. The larger cell size increases the die size of Silverthorne but it draws less power and runs at a lower voltage."

At the time we didn't have a good explanation as to why the Atom's L1 cache wasn't made of equal sized instruction and data caches, which is usually how Intel designs its processors. Since then we have gotten some more insight into the design decision:

Historically, Intel would design a microprocessor for a particular manufacturing process (e.g. 65nm) and shoot for a target voltage, later attempting to lower that voltage when possible. Atom was designed around the absolute minimum voltage the manufacturing process (45nm) was capable of running at and the engineers were left with the task of figuring out what they could do, architecturally, given that requirement.

The perfect example of this approach to design is Atom's L1 instruction and data caches. Originally these two caches were small signal arrays (6 transistors per cell), they were very compact and delivered the performance Intel desired. However during the modeling of the chip Intel noticed that it was a limiter to being able to scale down the operating voltage of the chip.

Instead of bumping up the voltage and sticking with a small signal array, Intel switched to a register file (1 read/1 write port). The cache now had a larger cell size (8 transistors per cell) which increased the area and footprint of the L1 instruction and data caches. The Atom floorplan had issues accommodating the larger sizes so the data cache had to be cut down from 32KB to 24KB in favor of the power benefits. We wondered why Atom had an asymmetrical L1 data and instruction cache (24KB and 32KB respectively, instead of 32KB/32KB) and it turns out that the cause was voltage.

A small signal array design based on a 6T cell has a certain minimum operating voltage, in other words it can retain state until a certain Vmin. In the L2 cache, Intel was able to use a 6T signal array design since it had inline ECC. There were other design decisions at work that prevented Intel from equipping the L1 cache with inline ECC, so the architects needed to go to a larger cell size in order to keep the operating voltage low.

The end result of this sort of a design approach is that the Atom processor is able to operate at its highest performance state (C0) at its minimum operating voltage.

Hardware Prefetchers: So Necessary

Atom features two hardware prefetchers, one that prefetches from the L2 cache into the L1 data cache and one from memory into the L2 cache.

Hardware prefetching is unbelievably important when dealing with an in-order core because as we've mentioned time and time again, not having data available in cache means that the pipelines will stall until that data is available.

The obvious long term solution to the problem of data starvation is to integrate the memory controller on die. With no 45nm MCH design ready by the time the Atom design was complete, Intel has to wait until the second generation Atom (codename: Moorestown) to gain an on-die memory controller.

Fighting Power Consumption...with a Longer Pipeline? Building by FUBs
POST A COMMENT

46 Comments

View All Comments

  • highlandsun - Thursday, April 03, 2008 - link

    With all due respect to Fred Weber, with Atom at 47 million transistors, it's pretty obvious that the 10% figure for X86 ISA compatibility is not negligible, particularly in this performance-at-absolute-minimum-power space. Anybody using X86 in tiny embedded systems is automatically giving up a chunk of their power budget that someone using a cleaner instruction set encoding can apply directly to useful work. And as the previous poster already pointed out - source code portability is the only thing that matters to application developers, and that's a non-problem these days. Using the X86 instruction set encoding is stupid. Using it on a low-power-budget device is suicide. Reply
  • Jovec - Thursday, April 03, 2008 - link

    I don't think the 10% reference meant 10% of all chips, but rather 10% of the current chip at the time the statement was made. In other words, x86 instruction decoding requires (roughly) a fixed amount of transistors for any chip, so the smaller the die size and larger the transistor count, less and less space is devoted to it. Reply
  • highlandsun - Thursday, April 03, 2008 - link

    Yes, that's obvious. And it's also obvious that Atom at 47 million transistors is paying a greater proportionate cost than Core2 Duo at 410 million transistors. In 2002 when Fred made that statement, AMD's current chip was the AthlonXP Thoroughbred, with about 37 million transistors. At the same time the Pentium 4 had 55 million. Put in context, I'd guess that the Atom at 47M vs P4 at 55M has more than 10% of its resources devoted to X86 decoding.

    Also, Fred's statement in 2002 didn't take into account the additional complexity introduced by the AMD64 instruction extensions, where now a single instruction may be anywhere from 1 to 16 bytes long. Given that you're doing a completely clean ground-up chip design in the first place, it would have made more sense (from both a power budget and real estate perspective) to design a clean, orthogonal, uniform-length encoding at the same time.

    Cross-platform ABI compatibility is stupid in the context they're aiming for; nobody is going to run their PC version of Crysis or MSWord on their cellphone. All that matters is API compatibility. With a consistent API, you can still run a separate binary translator if you really really want to move a desktop app to your mobile device but in most cases it would be a bad idea because a desktop app is unlikely to take advantage of power-saving APIs that would be important on a mobile. I.e., most of the time you're going to want purpose-built mobile apps anyway.
    Reply
  • floxem - Tuesday, April 15, 2008 - link

    I agree. But it's Intel. What do you expect? Reply
  • maree - Thursday, April 03, 2008 - link

    I dont think MS will be ready before Windows 7 is released, which is another 3-5 years... and might coincide with Moorestown. Microsoft started work on WindowsLite only after releasing Vista. Vista is bloatware as of now. As of now MS has to rely on crippled versions of XP and Vista like starter and home, which is not very ideal.

    Apple and Linux are going to have a free run till then...
    Reply
  • TA152H - Wednesday, April 02, 2008 - link

    Bringing up the Pentium is a little strange, because the whole market is completely different.

    The Pentium wasn't supposed to be for everyone when it came out. The processor market was different back then where previous generations lasted a long, long time. The Pentium wasn't supposed to replace the 486 right away, or even quickly, and being huge and a terrible power hog was acceptable because the initial iteration was just for a very small group of people who absolutely needed it. The original Pentium had a lot of problems, and struggled badly to reach 66 MHz, so they sold most of their processors at 60 MHz. The second generation was intended more for mainstream.

    Nowadays the latest generation replaces the earlier much more quickly, and has to cover more market segments more quickly. I still remember IBM releasing new machines for the 8086 in 1987. That's 9 years after the chip was made. It's just a different market.

    The Pentium is nothing like the Silverthorne though, and it's a strange comparison. The Pentium executed x86 instructions, it wasn't decoupled. It also had both pipes, the U and V, lockstepped, which is limitation the Silverthorne doesn't have.

    Saying the Pentium Pro was the first processor that allowed out of order processing is strange indeed. The only other processor this would have made sense with was the Pentium, since it was the only previous processor that was superscalar. So, they only made one in order processor, and then went to out of order with the next. It's difficult to see the extrapolation from this that it will be five years or more before Silverthorne goes out of order. It might be that long, but the backwards reference shouldn't be used to back that; it does more to contradict it.
    Reply
  • Anand Lal Shimpi - Wednesday, April 02, 2008 - link

    The Pentium reference was merely to show that what was once a huge, 300mm^2 design could now be built on a much, much smaller scale. And starting from scratch it's now possible to build something in-order that's significantly faster.

    The Pentium was an obvious comparison given that it was Intel's last two-issue in-order design, but I didn't mean to imply anything beyond that.

    It won't be too long before we'll be able to have something the speed of a Core 2 in a similarly small/cool running package as well :)

    Take care,
    Anand
    Reply
  • fitten - Wednesday, April 02, 2008 - link

    I remember back in the days of the Mac FX we talked about 'what ifs' like making a 6502 with the (then) modern process technologies and how fast would it run. I wonder what about now :) Reply
  • crimson117 - Wednesday, April 02, 2008 - link

    quote:

    It won't be too long before we'll be able to have something the speed of a Core 2 in a similarly small/cool running package as well :)


    I am SO going to hold you to that! But I can only hope "won't be long" will mean within 12 months rather than within 12 years :P

    Especially after my fiasco mounting a Freezer 7 Pro on an Abit IP35-E, I'd love if a heatsink weren't even necessary.
    Reply
  • Anand Lal Shimpi - Wednesday, April 02, 2008 - link

    12 months won't be a reality unfortunately :) But look at it this way, the first Pentium M came out in 2003? And 5 years later we're able to have somewhat comparable performance with the Atom processor.

    I'm really curious to see what happens with Atom on 32nm...
    Reply

Log in

Don't have an account? Sign up now