The CPU

Medfield is the platform, Penwell is the SoC and the CPU inside Penwell is codenamed Saltwell. It's honestly not much different than the Bonnell core used in the original Atom, although it does have some tweaks for both power and performance.

Almost four years ago I wrote a piece on the architecture of Intel's Atom. Luckily (for me, not Intel), Atom's architecture hasn't really changed over the years, so you can still look back at that article and get a good idea of what's at the core of Medfield/Penwell. Atom is still a dual-issue, in-order architecture with Hyper Threading support. The integer pipeline is sixteen stages long, significantly deeper than the Cortex A9's. The deeper pipeline was introduced largely in the name of power: some of the decode stages were lengthened and cache latency increased rather than letting those operations burn through the core's power budget in a single fast stage. Atom's architects, like those who worked on Nehalem, operated under a 2:1 mandate: every new feature added to the processor's design had to deliver at least a 2% increase in performance for every 1% increase in power consumption.
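To make the 2:1 mandate concrete, here's a minimal sketch of the kind of gate a proposed feature would have to clear. The function and the example numbers are purely illustrative; this isn't Intel's actual evaluation process.

    # Toy illustration of the 2:1 rule: a feature is only worth adding if it
    # buys at least 2% of performance for every 1% of power it costs.
    def passes_2_to_1_rule(perf_gain_pct: float, power_cost_pct: float) -> bool:
        """Return True if the feature delivers >= 2% perf per 1% power."""
        if power_cost_pct <= 0:            # free (or power-saving) features always pass
            return perf_gain_pct >= 0
        return perf_gain_pct / power_cost_pct >= 2.0

    print(passes_2_to_1_rule(6.0, 2.0))    # True: +6% perf for +2% power
    print(passes_2_to_1_rule(3.0, 2.0))    # False: +3% perf for +2% power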

Atom is a very narrow core as the diagram below will show:

[Diagram: Atom core block diagram]

There are no dedicated integer multiply or divide units; that work is all shared with the FP hardware. Intel duplicated some resources (e.g. register files, queues) to enable Hyper Threading support, but stopped short of adding execution hardware, relying on the second thread to drive up utilization of what's already there. The tradeoff seems to have worked: Intel is able to deliver better performance than a dual-core Cortex A9 from a single HT enabled core. Intel also lucks out because, while Android is very well threaded, not all tasks will continually peg both cores in a dual-core A9 machine. At higher clock speeds (1.5GHz+) and with heavy multi-threaded workloads it's possible that a dual-core Cortex A9 could outperform (or at least equal) Medfield, but I don't believe that's a realistic scenario.
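A rough way to see why a single HT core can hold its own here is a back-of-the-envelope Amdahl-style model. The ~30% Hyper Threading uplift and the assumed per-core performance advantage below are illustrative numbers, not measurements:

    # Back-of-the-envelope throughput comparison: one faster core with Hyper
    # Threading vs. two slower physical cores. The 1.4x per-core advantage and
    # the 1.3x HT uplift are assumptions for illustration only.
    def speedup(parallel_fraction: float, effective_cores: float) -> float:
        """Amdahl's law: speedup over one core for a given parallel fraction."""
        serial = 1.0 - parallel_fraction
        return 1.0 / (serial + parallel_fraction / effective_cores)

    for p in (0.3, 0.6, 0.9):
        one_ht_core = 1.4 * speedup(p, 1.3)   # assumed 40% higher per-core perf, ~30% from HT
        dual_core   = 1.0 * speedup(p, 2.0)   # two full cores, lower per-core perf
        print(f"parallel={p:.0%}  1C+HT={one_ht_core:.2f}x  2C={dual_core:.2f}x")

Under these assumptions the dual-core part only pulls ahead when the workload is almost entirely parallel, which lines up with the scenario described above.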

Architecturally the Cortex A9 doesn't look very different from Atom:

[Diagram: ARM Cortex A9 block diagram]

Here we see a dedicated integer multiply unit (shared with one of the ALU ports) but only a single port for FP/NEON. The difference between Atom and the Cortex A9 clearly isn't obvious at this high level; it's the lower level architectural decisions that give Intel a performance advantage.

Where Intel is in trouble is if you look at the Cortex A15:

[Diagram: ARM Cortex A15 block diagram]

The A15 is a far more modern design: like the A9 it's out of order, but it's much wider. I fully expect that something A15-class can outperform Medfield, especially if the former is in a dual-core configuration. Krait falls under the A15-class umbrella as well, so I believe Medfield has the potential to lose its CPU performance advantage within a couple of quarters.

Enhancements in Saltwell

Although the CPU core is mated to a 512KB L2 cache, there's a separate 256KB low power SRAM that runs on its own voltage plane. This ULP SRAM holds CPU state and data from the L2 cache when the CPU is power gated in its deepest sleep state. The reasoning for the separate voltage plane is simple: Intel's architects found that the minimum voltage for the core was limited by the Vmin of the SRAM, which is higher than what the CPU core logic itself requires. Putting the two on separate voltage planes allows Intel to bring the CPU core down to a lower minimum power state. The downside to multiple power islands is an increase in die area. Since Medfield is built on Intel's 32nm LP process while the company transitions to 22nm, spending a little extra die area to build a more power efficient SoC isn't such a big deal. Furthermore, Intel is used to building much larger chips, making Medfield's size a relative nonissue for the company.

The die size is actually very telling: despite featuring only a single core, Penwell is a larger SoC than a Tegra 2 with two Cortex A9s. Granted, the rest of the blocks around the core are different, but it goes to show that the CPU core itself (or the number of cores) isn't the only determinant of an SoC's die size.

The performance tweaks come from the usual learnings that accumulate over the course of any architecture's lifespan: some instruction scheduling restrictions have been lifted, memory copy performance is up, the branch predictors are larger, and some microcode flows now run faster on Saltwell.

Clock Speeds & Turbo

Medfield's CPU core supports several different operating frequencies and power modes. At the lowest level is its C6 state. Here the core and L2 cache are both power gated, with their state saved off to the low power on-die SRAM described earlier. Total power consumption of the processor island in C6 is effectively zero. This isn't anything new; Intel has implemented similar technologies in desktops since 2008 (Nehalem) and notebooks since 2010 (Arrandale).
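On a Linux-based Android build, the idle states the kernel exposes (and how often the deepest one is actually entered) can be inspected through the standard cpuidle sysfs interface. The paths below are the generic kernel interface; whether a particular Medfield build labels its deepest state "C6" depends on the idle driver, so treat this as a sketch:

    # List the cpuidle states the kernel exposes for cpu0, along with how often
    # each state has been entered and how long was spent there. These are the
    # stock Linux cpuidle sysfs files; state names vary by platform/driver.
    import glob, os

    def read(path):
        with open(path) as fh:
            return fh.read().strip()

    for state_dir in sorted(glob.glob("/sys/devices/system/cpu/cpu0/cpuidle/state*")):
        name = read(os.path.join(state_dir, "name"))      # e.g. "C1", "C6"
        usage = read(os.path.join(state_dir, "usage"))    # times the state was entered
        time_us = read(os.path.join(state_dir, "time"))   # total residency in microseconds
        print(f"{name}: entered {usage} times, {time_us} us total")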

When the CPU is actually awake and doing something, however, it has a range of available frequencies: 100MHz all the way up to 1.6GHz, in 100MHz increments.

The 1.6GHz state is a burst state and shouldn't be sustained for long periods of time, similar to how Turbo Boost works on Sandy Bridge desktop/notebook CPUs. The default maximum clock speed is 1.3GHz, although just as is the case with Turbo enabled desktop chips, you can expect to see frequencies greater than 1.3GHz on a fairly regular basis.
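The available operating points and the current frequency are visible through the standard Linux cpufreq sysfs files. Again, these paths are the generic kernel interface rather than anything Medfield-specific, and whether the 1.6GHz burst bin shows up in the available list depends on the frequency driver:

    # Query the generic Linux cpufreq interface for cpu0. Frequencies are
    # reported in kHz; availability of these files depends on the driver.
    base = "/sys/devices/system/cpu/cpu0/cpufreq/"

    def read(name):
        with open(base + name) as fh:
            return fh.read().split()

    available_khz = sorted(int(f) for f in read("scaling_available_frequencies"))
    current_khz = int(read("scaling_cur_freq")[0])

    print("available:", [f"{khz / 1e6:.1f}GHz" for khz in available_khz])
    print("current:  ", f"{current_khz / 1e6:.2f}GHz")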

Power consumption along the curve is all very reasonable:

Medfield CPU Frequency vs. Power

  Frequency                100MHz    600MHz    1.3GHz    1.6GHz
  SoC Power Consumption    ~50mW     ~175mW    ~500mW    ~750mW

Since most ARM based SoCs draw somewhere below 1W under full load, these numbers seem to put Medfield in line with its ARM competitors - at least on the CPU side.
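The frequency/power curve also determines how much energy a fixed amount of work costs. A quick back-of-the-envelope calculation using the table's numbers, assuming runtime scales perfectly with frequency (which memory-bound workloads won't), looks like this:

    # Rough energy-per-task estimate from the table above, assuming the task's
    # runtime scales perfectly with frequency (a best case). Energy = power x time.
    points = {       # frequency (MHz) -> approximate SoC power (mW)
        100: 50,
        600: 175,
        1300: 500,
        1600: 750,
    }

    baseline_seconds_at_100mhz = 16.0    # arbitrary task length for illustration

    for mhz, mw in sorted(points.items()):
        runtime_s = baseline_seconds_at_100mhz * 100 / mhz
        energy_mj = mw * runtime_s       # mW x s = mJ
        print(f"{mhz:>4}MHz: {runtime_s:5.2f}s, ~{energy_mj:5.0f} mJ")

Under these idealized assumptions, the mid-range frequencies come out most efficient per unit of work.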

It's important to note that these clock frequencies are similar to what other Cortex A9 vendors are currently shipping. Any performance advantage will come from Medfield bursting up to 1.6GHz for short periods of time, inherently higher IPC, and/or a superior cache/memory interface.

164 Comments


  • janderk - Wednesday, January 11, 2012 - link

    The numbers are still impressive, but there isn't such a thing as stock Gingerbread browser performance.

    The Intel phone currently runs Android 2.3.7, in which browser performance is on par with Honeycomb/ICS. You can't compare those numbers with an S2 or Sensation running old Honeycomb versions. If you do, you are comparing Android versions more than hardware.

    Google seems to have backported some browser code pieces to Gingerbread. A Galaxy S2 on 2.3.6 with a stock ROM/browser scores around 90,000 in Browsermark and around 2200 in Spidermark. Ask Brian. He double checked and got even better numbers than I got.
  • Lucian Armasu - Wednesday, January 11, 2012 - link

    That's a very good point. I wouldn't put it past Intel to "hype up" their marketing a little too much. I've been watching them very closely regarding this lately, and a lot of what they are saying is simply BS.

    Let's wait until we actually have the product in the market before we evangelize their yet to be seen chips.
  • Wilco1 - Wednesday, January 11, 2012 - link

    While I like your article, you can't really conclude anything about microarchitectures based on 2 micro benchmarks which have likely been highly tuned by Intel. Also note the Atom runs at 1.6GHz while the Nexus runs at 1.2GHz, so much of the performance difference is simply due to frequency.

    For a recent comparison between Cortex-A9 and Atom, check out these:

    http://www.phoronix.com/scan.php?page=article&...
    http://openbenchmarking.org/result/1201051-AR-1112...

    In these, 1.0 and 1.2GHz Cortex-A9 SoCs completely obliterate 1.6GHz netbook Atoms on mostly single-threaded benchmarks. So in terms of microarchitecture comparison, your article is dead wrong. When compared using the same compiler and OS, the A9 beats Atom at a much lower frequency due to having higher IPC as a result of out-of-order execution. Note how it scores much higher on most memory benchmarks.
  • milli - Wednesday, January 11, 2012 - link

    Actually the Z2460 runs at 1.3GHz but can turbo to 1.6GHz.

    Something might be up with the PandaBoard ES. Phoronix also has a Tegra 2 based review and that one also scores better on some tests than the PB-ES (just like the Exynos). The problem is that the scores are not really comparable because all three (PB-ES, T2, Exynos) use different compilers and kernels. Only the PB-ES uses the same compiler (and probably parameters) as the x86 systems. So you'll need to wait for real Medfield reviews when the time comes (or for Phoronix to do a better comparison). Especially the Exynos results need to be taken with a grain of salt since they used a total of three compilers there.

    It's known that Atom's single threaded performance is bad. It has HyperThreading to cover that up. Since Android's JavaScript engine is multi-threaded, Atom performs well.
  • Wilco1 - Wednesday, January 11, 2012 - link

    True, but you can bet Intel ensured the benchmarks were run at 1.6GHz, even if that wouldn't be feasible in a real phone due to cooling. So we have to wait for an actual phone with a standard Android version for the real comparison.

    There are indeed issues with the Panda board: the Ubuntu version used isn't compatible with the OMAP4460, so it isn't set up correctly. There are also compiler option issues and use of a slow flash card which reduces the scores. In terms of compilers used, GCC 4.5 or 4.6 doesn't make a major difference, so these benchmarks give a reasonable indication of how Cortex-A9 would do vs Medfield.

    If Android JavaScript is multithreaded, you'd expect a dual core A9 to do much better than Atom as you get a 100% speedup from the second core, not just 30% from hyperthreading. I suppose we'll see when the Intel improvements are added to the mainstream Android version.
  • BSMonitor - Wednesday, January 11, 2012 - link

    "If Android JavaScript is multithreaded, you'd expect a dual core A9 to do much better than Atom as you get a 100% speedup from the second core, not just 30% from hyperthreading. I suppose we'll see when the Intel improvements are added to the mainstream Android version. "

    No, you wouldn't get a 100% speed bump because there are many more factors besides CPU resources that ultimately affect performance.

    You are clearly a noob fanboy on a rant.
  • Wilco1 - Wednesday, January 11, 2012 - link

    How much speedup you get obviously depends on lots of factors. However the fact remains that 2 cores have much more raw performance than 1 core with hyperthreading, so if JS is really multithreaded then the advantage would be to ARM, not Atom.
  • virtual void - Thursday, January 12, 2012 - link

    You have to keep at least two things in mind here:

    1. The efficiency of the second HT thread in Atom is much higher than the 20-30% you see on Sandy Bridge. On Atom, 50-60% is probably a more accurate number based on a number of tests I've done myself. And this is not because Atom is in any way better than Sandy Bridge; it is quite the opposite. The in-order design and simple execution units in Atom will cause a lot more pipeline stalls, which means that the other thread will get access to all the power (or lack of power) in the CPU.

    2. You are right that two physical cores have more raw power than two HT threads on the same core. But when you run a single program on two threads that work on the same data, HT has a huge benefit in that the two program threads will communicate via the L1 cache (shared between the HT threads) while two threads running on different physical cores will communicate via the L2 cache. The L1 cache has a much lower latency and much higher bandwidth compared to the L2 cache.

    So HT can be very efficient in accelerating things where two threads are working on the same data-set. But two physical cores are probably always better when you have two threads running different programs, or at least working on a data-set that is completely thread local.
  • Wilco1 - Thursday, January 12, 2012 - link

    You're right: if 2 threads belong to the same process and communicate a lot, then HT has lower overheads, but the downside is that you quickly start thrashing the small L1 caches. HT works better on Atom indeed, but 50-60% on average sounds a bit high, especially since Atom stalls on cache misses.
