The Memory Interface

Most SoCs deployed in smartphone designs implement a package-on-package (PoP) stack of DRAM on top of the SoC package. As its name implies, PoP refers to the physical stacking of multiple packages and not the layering of raw die. The SoC is typically the lowest level, with its memory bus routed to pads on the top of the package. A DRAM package is then stacked on top of the SoC. Not having to route high-speed DRAM lines across the PCB saves space and also reduces memory latency.


An example of a PoP stack

The iPhone has always used a PoP configuration for its SoCs and Apple has always been kind enough to silkscreen the part number of the DRAM on the outer package of the SoC. In the past we've seen part numbers from both Samsung and Elpida on Apple SoCs. As both companies can provide similarly spec'd DRAM it makes sense for Apple to source from two suppliers in the event that one is unable to meet demand for a given period.


iPhone 4 mainboard, courtesy iFixit

If we look at iFixit's teardown of the iPhone 4 we see the following DRAM part number: K4X4G643G8-1GC8. Most DRAM vendors do a pretty bad job of providing public data about their part numbers used in chip stacks, so we have to do a little bit of inferring to figure out exactly what Apple used last generation.

The first three characters tell us a bit about the type of DRAM. The K means it's memory, the 4 tells us that it's DRAM and the X tells us that it's mobile DDR (aka LPDDR). The next two characters tell us the density of the DRAM; in this case 4G translates literally to 4Gbit, or 512MB. Characters 6 and 7 are also important: they tell us the DRAM organization. Samsung's public documentation only covers 16 (a 16-bit interface) and 32 (a 32-bit interface). Based on that we can safely assume that the 64 here means the 4Gbit DRAM on the A4 is 64 bits wide. In the mobile world a 32-bit interface typically refers to a single channel, which means the A4's DRAM interface is two 32-bit channels wide.
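To make the decoding concrete, here's a rough Python sketch of the lookup we just walked through. It only knows the codes mentioned above, the function and field names are my own, and the 32-bits-per-channel rule is the rule of thumb from the text rather than anything from Samsung's documentation. Treat it as an illustration, not a complete decoder.

```python
# Illustrative decoder for the part number discussed above.
# Only the codes named in the text are mapped; everything else is "unknown".
# Field names and the decode() helper are hypothetical, not a Samsung scheme.

PART = "K4X4G643G8-1GC8"

FAMILY = {"K": "memory"}              # character 1
DEVICE = {"4": "DRAM"}                # character 2
TYPE = {"X": "mobile DDR (LPDDR)"}    # character 3


def decode(part: str) -> dict:
    body, _, suffix = part.partition("-")
    density_code = body[3:5]          # characters 4-5, e.g. "4G"
    org_code = body[5:7]              # characters 6-7, e.g. "64"

    density_gbit = None
    if density_code[0].isdigit() and density_code.endswith("G"):
        density_gbit = int(density_code[0])

    width_bits = int(org_code) if org_code.isdigit() else None

    return {
        "family": FAMILY.get(body[0], "unknown"),
        "device": DEVICE.get(body[1], "unknown"),
        "type": TYPE.get(body[2], "unknown"),
        "density_gbit": density_gbit,
        # 1 Gbit = 128 MB
        "capacity_mb": density_gbit * 1024 // 8 if density_gbit else None,
        "width_bits": width_bits,
        # Assumption from the text: one mobile channel is typically 32 bits.
        "channels": width_bits // 32 if width_bits else None,
        "speed_code": suffix[-2:],    # last two characters, e.g. "C8"
    }


info = decode(PART)
for key, value in info.items():
    print(f"{key}: {value}")
```

Note that this layout only fits the iPhone 4's part number; other Samsung part families arrange their fields differently.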

The last two characters in the part number, C8, tell us the source clock frequency of the DRAM. Samsung's datasheets tell us that C8 corresponds to a 5ns cycle time with a CAS latency of 3 clocks. Taking the inverse of that gives us 200MHz (frequency = 1 / clock period). Remember, we're talking about double data rate (DDR) SDRAM so data is transferred at both the rising and falling edges of the clock, making the effective data rate 400MHz.

All of this tells us that the iPhone 4's A4 SoC has a 64-bit wide LPDDR1 memory interface with a 400MHz data rate. Multiply all of that out and you get peak theoretical bandwidth of 3.2GB/s. DDR memory interfaces are generally 80% efficient at best so you're looking at a limit of around 2.5GB/s. To put this in perspective, the A4 has as much memory bandwidth as the original AMD Athlon 64 released in 2003.
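If you want to run the numbers yourself, the arithmetic is simple enough to sketch in a few lines of Python. The function name and parameters below are my own; the 80% efficiency figure is the rule of thumb mentioned above, not a measured value.

```python
def peak_bandwidth_gbs(cycle_time_ns: float,
                       bus_width_bits: int,
                       transfers_per_clock: int = 2) -> float:
    """Peak theoretical bandwidth in GB/s (decimal) for a DDR-style interface."""
    clock_mhz = 1000.0 / cycle_time_ns               # frequency = 1 / clock period
    data_rate_mhz = clock_mhz * transfers_per_clock  # DDR: rising and falling edge
    bytes_per_transfer = bus_width_bits / 8
    return data_rate_mhz * bytes_per_transfer / 1000.0  # MB/s -> GB/s


# A4: 5ns cycle time, 64-bit interface
a4_peak = peak_bandwidth_gbs(cycle_time_ns=5.0, bus_width_bits=64)
a4_sustained = a4_peak * 0.8  # rule-of-thumb best-case efficiency

print(f"{a4_peak:.1f} GB/s peak, {a4_sustained:.2f} GB/s at 80% efficiency")
# prints "3.2 GB/s peak, 2.56 GB/s at 80% efficiency"
```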

iPhone 4S mainboard, courtesy iFixit

With the A5 Apple definitely stepped up the memory interface. Once again we turn to iFixit's teardown of the iPhone 4S to lift that oh-so-precious part number: K3PE4E400B-XGC1.

The K once again tells us we're dealing with Samsung memory, while the 3P reveals there are two mobile DDR2 with 4n prefetch (aka LPDDR2-S4) DRAM die on the package. Why not a 4 this time? Technically the 4 refers to a discrete DRAM while the 3 implies a DRAM stack. Both parts are obviously stacked DRAM, so I'm not entirely sure why there's a difference here. Each of the next two E4s tells us the density of one of the two DRAM die. Samsung's public documentation only goes up to E3, which corresponds to a 1Gbit x32 device. Given that we know the A5 has 512MB on-package, E4 likely means 2Gbit x32 (256MB, 32-bit). There are two E4 die on package, which together make up the 512MB 64-bit DRAM stack.
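Here's a quick sanity check of that inference in Python. The Die type and stack_totals helper are hypothetical names of mine, the E4 values are the guess from above (not something Samsung documents), and summing the widths assumes each die sits on its own 32-bit channel.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Die:
    density_gbit: int   # capacity of one die in Gbit
    width_bits: int     # interface width of one die


def stack_totals(dies):
    """Total capacity (MB) and interface width (bits) of a DRAM stack.

    Assumes every die has its own channel, so the widths add up.
    """
    capacity_mb = sum(d.density_gbit * 1024 // 8 for d in dies)  # 1 Gbit = 128 MB
    width_bits = sum(d.width_bits for d in dies)
    return capacity_mb, width_bits


# Inferred meaning of "E4": 2Gbit x32 (an assumption, see text)
E4 = Die(density_gbit=2, width_bits=32)

capacity_mb, width_bits = stack_totals([E4, E4])
print(f"{capacity_mb}MB, {width_bits}-bit")  # prints "512MB, 64-bit"
```

If E4 instead meant 1Gbit like the documented E3, the stack would only add up to 256MB, which doesn't match what we know about the A5.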

Once again the final two characters reveal the cycle time of the DRAM: 2.5ns. The inverse of 2.5ns gives us a 400MHz clock frequency, or an 800MHz data rate (source clock frequency is actually 200MHz, but with a 4n prefetch we can transfer at effectively 800MHz). Peak bandwidth to the A5 is roughly double that of the A4: 6.4GB/s. This is as much memory bandwidth as AMD's Athlon 64 platform offered in late 2004, just 7 years later and in a much smaller form factor.
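The two ways of arriving at 800MHz can be checked against each other with a short Python sketch. Variable names are mine, and the 3.2GB/s A4 figure is the one derived earlier in this article.

```python
# A5 (LPDDR2-S4): derive the data rate two ways and check that they agree.

cycle_time_ns = 2.5
bus_width_bits = 64

# View 1: I/O clock from the cycle time, two transfers per clock (DDR)
io_clock_mhz = 1000.0 / cycle_time_ns          # 400 MHz
data_rate_from_io = io_clock_mhz * 2           # 800 MHz effective

# View 2: 200MHz source clock with a 4n prefetch
source_clock_mhz = 200.0
prefetch = 4
data_rate_from_prefetch = source_clock_mhz * prefetch  # 800 MHz effective

assert data_rate_from_io == data_rate_from_prefetch

a5_peak_gbs = data_rate_from_io * (bus_width_bits / 8) / 1000.0
a4_peak_gbs = 3.2  # from the iPhone 4 calculation earlier in the article

print(f"A5 peak: {a5_peak_gbs:.1f} GB/s ({a5_peak_gbs / a4_peak_gbs:.1f}x the A4)")
# prints "A5 peak: 6.4 GB/s (2.0x the A4)"
```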

The doubling of memory bandwidth requires a sufficiently large workload to really show it. We see this in Geekbench's memory bandwidth results, where the A5 appears to offer meaningfully more bandwidth than the A4 in only one of the tests:

Memory Bandwidth Comparison - Geekbench 2

                        Apple iPhone 4       Apple iPhone 4S
Overall Memory Score    593                  700
Read Sequential         318.7 MB/s           302.3 MB/s
Write Sequential        704.9 MB/s           809.2 MB/s
Stdlib Allocate         1.55 Mallocs/sec     1.55 Mallocs/sec
Stdlib Write            1.25 GB/s            2.54 GB/s
Stdlib Copy             724.5 MB/s           490.1 MB/s
Overall Stream Score    280                  281
Stream Copy             413.5 MB/s           396.4 MB/s
Stream Scale            313.3 MB/s           317.4 MB/s
Stream Add              518.0 MB/s           527.1 MB/s
Stream Triad            363.6 MB/s           373.9 MB/s

Memory bandwidth tests are extremely sensitive to architecture optimizations, particularly for single-threaded tests like these, so I wouldn't read too much into the cases where you see no gains or a drop.

The increase in raw memory bandwidth makes a lot of sense. Apple doubled the number of CPU cores on the A5, with each one even more bandwidth hungry than the single A4 core. The 4x increase in GPU compute combined with an increase in clock speeds gives the A5 another big consumer of bandwidth. Add things like 1080p video capture and the memory bandwidth increase seems justified.

Looking back at the evolution of the iPhone's memory interface gives us an idea of just how quickly this industry has been evolving. Back in 2007 the original iPhone debuted with a 16-bit wide LPDDR-266 memory interface connected to a meager 128MB of DRAM. The 3GS delivered a huge increase in memory bandwidth by doubling the interface width and increasing the data rate to 400MHz. Scaling since then has been even more dramatic:

Memory capacity on the other hand has seen more of a step-function growth:

By using a mobile-optimized OS Apple has been able to get around large memory requirements. The growth pattern in memory size partially illustrates the lag between the introduction of faster hardware and developers building truly demanding applications that require that sort of performance. Apple was able to leave the iPhone 4S at 512MB of RAM because the target for many iOS apps is still the iPhone 3GS generation. Don't be surprised to see a move to 1GB in the next iPhone release (we won't see 768MB due to the dual-channel memory requirement) as the app developer target moves to 512MB.

199 Comments

  • robco - Monday, October 31, 2011 - link

    I've been using the 4S from launch day and agree that Siri needs some work. That being said, it's pretty good for beta software. I would imagine Apple released it as a bonus for 4S buyers, but also to keep the load on their servers small while they get some real-world data before the final version comes in an update.

    The new camera is great. As for me, I'm glad Apple is resisting the urge to make the screen larger. The Galaxy Nexus looks nice, but the screen will be 4.65". I want a smartphone, not a tablet that makes phone calls. I honestly wouldn't want to carry something much larger than the iPhone and I would imagine I'm not the only one.

    Great review as always.
  • TrackSmart - Monday, October 31, 2011 - link

    I'm torn on screen size myself. Pocketable is nice. But I'm intrigued by the idea of a "mini-tablet" form factor, like the Samsung Galaxy Note with its 5.3" screen (1280x800 resolution) and almost no bezel. That's HUGE for a phone, but if it replaces a tablet and a phone, and fits my normal pants pockets, it would be an interesting alternative. The pen/stylus is also intriguing. I will be torn between small form factor vs mini-tablet when I make my phone upgrade in the near future.

    To Anand and Brian: I'd love to see a review of the Samsung Galaxy Note. Maybe Samsung can send you a demo unit. It looks like a refined Dell Streak with a super-high resolution display and Wacom digitizer built in. Intriguing.
  • Rick83 - Wednesday, November 2, 2011 - link

    That's why I got an Archos 5 two years ago. And what can I say? It works.

    Sadly the Note is A) three times as expensive as the Archos
    and B) not yet on Android 4

    there's also C) Codec support will suck compared to the Archos, and I'm pretty sure Samsung won't release an open bootloader, like Archos does.

    I'm hoping that Archos will soon release a refresh of their smaller size tablets based on OMAP 4 and Android 4.
    Alternatively, and equally as expensive as the Note, is the Sony dual-screen tablet. Looks interesting, but same caveats apply....
  • kylecronin - Monday, October 31, 2011 - link

    > It’s going to be a case by case basis to determine which 4 cases that cover the front of the display work with the 4S.

    Clever
  • metafor - Monday, October 31, 2011 - link

    "Here we have two hypothetical CPUs, one with a max power draw of 1W and another with a max power draw of 1.3W. The 1.3W chip is faster under load but it draws 30% more power. Running this completely made-up workload, the 1.3W chip completes the task in 4 seconds vs. 6 for its lower power predecessor and thus overall power consumed is lower. Another way of quantifying this is to say that in the example above, CPU A does 5.5 Joules of work vs. 6.2J for CPU B."

    The numbers are off. 4 seconds vs 6 seconds isn't 30% faster. Time-to-complete is the inverse of clockspeed.

    Say a task takes 100 cycles. It would take 1 second on a 100Hz, 1 IPC CPU and 0.77 seconds on a 130Hz, 1 IPC CPU. This translates to 4.62 sec if given a task that takes 600 cycles of work (6 sec on the 100Hz, 1 IPC CPU).

    Or 1W * 6s = 6J = 1.3W * 4.62s

    Exactly the same amount of energy used for the task.
  • Anand Lal Shimpi - Monday, October 31, 2011 - link

    Err sorry, I should've clarified. For the energy calculations I was looking at the entire period of time (10 seconds) and assumed CPU A & B have the same 0.05W idle power consumption.

    Doing the math that way you get 1W * 6s + 0.05W * 4s = 6.2J (CPU B)

    and

    1.3W * 4s + 0.05W * 6s = 5.5J (CPU A)
  • metafor - Monday, October 31, 2011 - link

    Erm, that still presents the same problem. That is, a processor running at 130% the clockspeed will not finish in 4 seconds, it will finish in 4.62s.

    So the result is:

    1W * 6s + 0.05W * 4s = 6.2J (CPU B)
    1.3W * 4.62s + 0.05W * 5.38s = 6.275J (CPU A)

    There's some rounding error there. If you use whole numbers, say 200Hz vs 100Hz:

    1W * 10s + 0.05W * 10s = 10.5J (CPU B running for 20s with a task that takes 1000 cycles)

    2W * 5s + 0.05W * 15s = 10.75J (CPU A running for 10s with a task that takes 1000 cycles)
  • Anand Lal Shimpi - Monday, October 31, 2011 - link

    I wasn't comparing clock speeds, you have two separate processors - architectures unknown, 100% hypothetical. One draws 1.3W and completes the task in 4s, the other draws 1W and completes in 6s. For the sake of drawing a parallel to the 4S vs 4 you could assume that both chips run at the same clock. The improvements are entirely architectural, similar to A5 vs. A4.

    Take care,
    Anand
  • metafor - Tuesday, November 1, 2011 - link

    In that case, the CPU that draws 1.3W is more power efficient, as it managed to gain a 30% power draw for *more* than a 30% performance increase.

    I absolutely agree that this is the situation with the A5 compared to the A4, but that has nothing to do with the "race to sleep" problem.

    That is to say, if CPU A finishes a task in 4s and CPU B finishes a task in 6s, CPU A is more than 30% faster than CPU B; it has higher perf/W.
  • Anand Lal Shimpi - Tuesday, November 1, 2011 - link

    It is race to sleep though. The more power efficient CPU can get to sleep quicker (hurry up and wait is what Intel used to call it), which offsets any increases in peak power consumption. However, given the right workload, the more power efficient CPU can still use more power.

    Take care,
    Anand
