The Memory Interface

Most SoCs deployed in smartphone designs implement a package-on-package (PoP) stack of DRAM on top of the SoC package. As its name implies, PoP refers to the physical stacking of multiple packages, not the layering of raw die. The SoC typically sits at the bottom of the stack with its memory bus routed to pads on the top of its package; a DRAM package is then stacked on top of the SoC. Avoiding routing high-speed DRAM lines on the PCB itself not only saves board space, it also reduces memory latency by keeping the traces short.


An example of a PoP stack

The iPhone has always used a PoP configuration for its SoCs, and Apple has always been kind enough to silkscreen the part number of the DRAM on the outer package of the SoC. In the past we've seen part numbers from both Samsung and Elpida on Apple SoCs. As both companies can provide similarly spec'd DRAM, it makes sense for Apple to source from two suppliers in the event that one is unable to meet demand for a given period.


iPhone 4 mainboard, courtesy iFixit

If we look at iFixit's teardown of the iPhone 4 we see the following DRAM part number: K4X4G643G8-1GC8. Most DRAM vendors do a pretty bad job of providing public data about their part numbers used in chip stacks, so we have to do a little bit of inferring to figure out exactly what Apple used last generation.

The first three characters tell us a bit about the type of DRAM: the K identifies Samsung memory, the 4 tells us it's DRAM and the X tells us it's mobile DDR (aka LPDDR). The next two characters give the density of the DRAM; in this case 4G translates literally to 4Gbit, or 512MB. Characters 6 and 7 are also of importance - they describe the DRAM organization, and here they read 64. Samsung's public documentation only tells us that 16 refers to a 16-bit interface and 32 would mean a 32-bit interface, but extrapolating from that we can safely assume the 4Gbit DRAM on the A4 is 64 bits wide. In the mobile world a 32-bit interface typically refers to a single channel, which confirms the A4's DRAM interface is two 32-bit channels wide.
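
To make the decoding concrete, here's a minimal sketch of the field extraction described above. The lookup tables contain only the codes discussed here, and the 64-bit organization entry is the same inference made in the text, not a documented Samsung value:

```python
# Decode the Samsung mobile DDR part number fields described above.
DENSITY_MBIT = {"4G": 4 * 1024}                # "4G" = 4Gbit
ORGANIZATION = {"16": 16, "32": 32, "64": 64}  # interface width in bits

def decode(part_number: str) -> dict:
    """Decode a part number like K4X4G643G8-1GC8."""
    # Characters 1-3: K = Samsung memory, 4 = DRAM, X = mobile DDR (LPDDR)
    prefix = part_number[:3]
    density_code = part_number[3:5]   # characters 4-5: density
    org_code = part_number[5:7]       # characters 6-7: organization
    return {
        "type": prefix,                                 # "K4X"
        "density_MB": DENSITY_MBIT[density_code] // 8,  # 4Gbit -> 512MB
        "interface_bits": ORGANIZATION[org_code],       # 64-bit wide
    }

print(decode("K4X4G643G8-1GC8"))
# {'type': 'K4X', 'density_MB': 512, 'interface_bits': 64}
```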

The last two characters in the part number, C8, tell us the source clock frequency of the DRAM. Samsung's datasheets tell us that C8 corresponds to a 5ns cycle time with a CAS latency of 3 clocks. Taking the inverse of that gives us 200MHz (frequency = 1 / clock period). Remember, we're talking about double data rate (DDR) SDRAM so data is transferred at both the rising and falling edges of the clock, making the effective data rate 400MHz.
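
The conversion is easy to sanity check; a quick sketch of the arithmetic using the 5ns datasheet figure quoted above:

```python
cycle_time_ns = 5.0                  # "C8" speed grade: 5ns cycle time
clock_mhz = 1000.0 / cycle_time_ns   # frequency = 1 / period -> 200MHz
data_rate_mhz = 2 * clock_mhz        # DDR: both clock edges -> 400MHz effective
print(clock_mhz, data_rate_mhz)      # 200.0 400.0
```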

All of this tells us that the iPhone 4's A4 SoC has a 64-bit wide LPDDR1 memory interface with a 400MHz data rate. Multiply all of that out and you get a peak theoretical bandwidth of 3.2GB/s. DDR memory interfaces are generally 80% efficient at best, so you're looking at a limit of around 2.5GB/s. To put this in perspective, the A4 has as much memory bandwidth as the original AMD Athlon 64 released in 2003.
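
Spelled out, with the ~80% efficiency rule of thumb from above:

```python
bus_width_bytes = 64 // 8      # 64-bit interface = 8 bytes per transfer
data_rate_hz = 400e6           # 400MHz effective data rate
peak_gbs = bus_width_bytes * data_rate_hz / 1e9  # theoretical peak
usable_gbs = 0.80 * peak_gbs   # ~80% efficiency at best
print(peak_gbs, usable_gbs)    # 3.2 2.56
```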

iPhone 4S mainboard, courtesy iFixit

With the A5 Apple definitely stepped up the memory interface. Once again we turn to iFixit's teardown of the iPhone 4S to lift that oh-so-precious part number: K3PE4E400B-XGC1.

The K once again tells us we're dealing with Samsung memory, while the 3P reveals that there are two mobile DDR2 with 4n prefetch (aka LPDDR2-S4) DRAM die on the package. Why not a 4 this time? Technically the 4 refers to a discrete DRAM while the 3 implies a DRAM stack; both parts are ultimately stacked DRAM, so I'm not entirely sure why there's a difference here. Each of the next two E4s tells us the density of one of the two DRAM die. Samsung's public documentation only goes up to E3, which corresponds to a 1Gbit x32 device. Given that we know the A5 has 512MB on-package, E4 likely means 2Gbit x32 (256MB, 32-bit). There are two E4 die on the package, which together make up the 512MB, 64-bit DRAM stack.

Once again the final two characters reveal the cycle time of the DRAM: 2.5ns. The inverse of 2.5ns gives us a 400MHz clock frequency, or an 800MHz data rate (the source clock frequency is actually 200MHz, but with a 4n prefetch we can transfer at an effective 800MHz). Peak bandwidth to the A5 is roughly double that of the A4: 6.4GB/s. That's as much memory bandwidth as AMD's Athlon 64 platform offered in late 2004, matched just 7 years later in a much smaller form factor.
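
The same arithmetic applies, with the prefetch factor added. The 2Gbit-per-die density below is the inference made above, not a documented value:

```python
# Density: two die, each inferred to be 2Gbit x32
dram_mb = 2 * (2 * 1024) // 8          # -> 512MB total, 64 bits wide

# Data rate: 200MHz source clock with 4n prefetch, DDR signaling at the pins
source_clock_mhz = 200
data_rate_mhz = source_clock_mhz * 4   # 800MHz effective
io_clock_mhz = data_rate_mhz / 2       # 400MHz I/O clock = 1 / 2.5ns

peak_gbs = (64 // 8) * data_rate_mhz * 1e6 / 1e9  # 6.4 GB/s peak
print(dram_mb, io_clock_mhz, peak_gbs)            # 512 400.0 6.4
```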

A doubling of memory bandwidth requires a sufficiently large workload to really show itself. We see this in Geekbench's memory bandwidth results, where the A5 doesn't appear to offer any more bandwidth than the A4 in all but one of the tests:

Memory Bandwidth Comparison - Geekbench 2

                       Apple iPhone 4      Apple iPhone 4S
Overall Memory Score   593                 700
Read Sequential        318.7 MB/s          302.3 MB/s
Write Sequential       704.9 MB/s          809.2 MB/s
Stdlib Allocate        1.55 Mallocs/sec    1.55 Mallocs/sec
Stdlib Write           1.25 GB/s           2.54 GB/s
Stdlib Copy            724.5 MB/s          490.1 MB/s
Overall Stream Score   280                 281
Stream Copy            413.5 MB/s          396.4 MB/s
Stream Scale           313.3 MB/s          317.4 MB/s
Stream Add             518.0 MB/s          527.1 MB/s
Stream Triad           363.6 MB/s          373.9 MB/s

Memory bandwidth tests are extremely sensitive to architecture optimizations, particularly single-threaded tests like these, so I wouldn't read too much into the cases where you see no gains or even a drop.

The increase in raw memory bandwidth makes a lot of sense. Apple doubled the number of CPU cores on the A5, with each one even more bandwidth-hungry than the single A4 core. The 4x increase in GPU compute combined with an increase in clock speeds gives the A5 another big consumer of bandwidth. Add things like 1080p video capture and the memory bandwidth increase seems justified.

Looking back at the evolution of the iPhone's memory interface gives us an idea of just how quickly this industry has been evolving. Back in 2007, the original iPhone debuted with a 16-bit wide LPDDR-266 memory interface connected to a meager 128MB of DRAM. The 3GS delivered a huge increase in memory bandwidth by doubling the interface width and increasing the data rate to 400MHz. Scaling since then has been even more dramatic.

Memory capacity, on the other hand, has seen more of a step-function growth.

By using a mobile-optimized OS, Apple has been able to get around large memory requirements. The growth pattern in memory size partially illustrates the lag between introducing faster hardware and developers building truly demanding applications that require that sort of performance. Apple was able to leave the iPhone 4S at 512MB of RAM because the target for many iOS apps is still the iPhone 3GS generation. Don't be surprised to see a move to 1GB in the next iPhone (we won't see 768MB due to the dual-channel memory requirement) as the app developer target moves to 512MB.

Comments

  • metafor - Tuesday, November 1, 2011

    When you say power efficiency, don't you mean perf/W?

    I agree that perf/W varies depending on the workload, exactly as you explained in the article. However, the perf/W is what makes the difference in terms of total energy used.

    It has nothing to do with race-to-sleep.

    That is to say, if CPU B takes longer to go to sleep but has better perf/W, it will use less energy. In fact, I think this is what you demonstrated with your second example :)

    The total energy consumption is directly related to how power-efficient a CPU is. Whether it's a slow processor that runs for a long time or a fast processor that runs for a short amount of time, whichever one can process more instructions per second vs. joules per second wins.

    Or, when you take seconds out of the equation, whichever can process more instructions/joule wins.

    Now, I assume you got this idea from one of Intel's people. The thing their marketing team usually forgets to mention is that when they say race-to-sleep is more power efficient, they're not talking about the processor, they're talking about the *system*.

    Take the example of a high-performance server. The DRAM array and storage can easily make up 40-50% of the total system power consumption.
    Let's then say we had two hypothetical CPUs with different efficiencies: CPU A being faster but less power efficient, and CPU B being slower but more power efficient.

    The total power draw of DRAM and the rest of the system remains the same. And on top of that, the DRAM and storage can be shut down once the CPU is done with its processing job but must remain active (DRAM refreshed, storage controllers powered) while the CPU is active.

    In this scenario, even if CPU A draws more power processing the job compared to CPU B, the system with CPU B has to keep the DRAM and storage systems powered for longer. Thus, under the right circumstances, the system containing CPU A actually uses less overall power because it keeps those power-hungry subsystems active for a shorter amount of time.

    However, how well this scenario translates into a smartphone system, I can't say. I suspect not as well.
  • Anand Lal Shimpi - Tuesday, November 1, 2011

    I believe we're talking about the same thing here :)

    The basic premise is that you're able to guarantee similar battery life, even if you double core count and move to a power hungry OoO architecture without a die shrink. If your performance gains allow your CPU/SoC to remain in an ultra low power idle state for longer during those workloads, the theoretically more power hungry architecture can come out equal or ahead in some cases.

    You are also right about platform power consumption as a whole coming into play. Although with the shift from LPDDR1 to LPDDR2, an increase in effective bandwidth, and a number of other changes, it's difficult to deal with each factor independently.

    Take care,
    Anand
  • metafor - Tuesday, November 1, 2011

    "If your performance gains allow your CPU/SoC to remain in an ultra low power idle state for longer during those workloads, the theoretically more power hungry architecture can come out equal or ahead in some cases."

    Not exactly :) The OoOE architecture has to perform more tasks per joule. That is, it has to have better perf/W. If it had worse perf/W, it doesn't matter how much longer it remains idle compared to the slower processor. It will still use more net energy.

    It's total platform power that may see savings, despite a less power-efficient and more power-hungry CPU. That's why I suspect that this "race to sleep" situation won't translate to the smartphone system.

    The entire crux relies on the fact that although the CPU itself uses more power per task, it saves power by allowing the rest of the system to go to sleep faster.

    But smartphone subsystems aren't that power hungry, and CPU power consumption generally increases with the *square* of performance. (Granted, this wasn't the case from A8 -> A9, but you can bet it will be from A9 -> A15.)

    If the increase in CPU power per task is greater than the savings of having the rest of the system active for shorter amounts of time, it will still be a net loss in power efficiency.

    Put it another way. A9 may be a general power gain over A8, but don't expect A15 to be so compared to A9, no matter how fast it finishes a task :)
  • doobydoo - Tuesday, November 1, 2011

    You are both correct, and you are also both wrong.

    Metafor is correct because any chip, given a set number of tasks to do over a fixed number of seconds, regardless of how much faster it can perform, will consume more energy than an equally power efficient but slower chip. In other words, being able to go to sleep quicker never means a chip becomes more power efficient than it was before. It actually becomes less.

    This is easily provable by splitting the energy into two parts. If two chips are equally power efficient (as in they can both perform the same number of 'tasks' per joule) and one is twice as fast, the faster chip will draw twice the power while active but complete in half the time, so the active energy will ALWAYS be equal for both chips. However, the chip which finished sooner will then sit idle for LONGER because it finished quicker, so the idle energy expense will always be higher for the faster chip. This assumes, as I said, that the idle power draw of both chips is equal.

    Anand is correct, because if you DO have a more power-efficient chip with a higher maximum wattage, race-to-sleep is OFTEN (assuming reasonable idle times) the reason it can actually use less energy. Consider two chips: one draws 1.3W (max) and can carry out '2' tasks per second; a second chip draws 1W (max) and can carry out '1' task per second (so it is less power efficient). Now consider a world without race-to-sleep. To carry out '10' tasks over a 10 second period, chip one would take 5 seconds but would remain at full power for the full 10 seconds, thereby using 13 joules. Chip two would take 10 seconds and would use a total of 10 joules over that period. Thus, the more power-efficient chip actually consumed more energy.

    Now if we factor in race-to-sleep, the first chip can draw 1.3W for the first 5 seconds, then drop to 0.05W for the last 5, consuming 6.75 joules. The second chip would still consume the same 10 joules.

    Conclusion:

    If the chip is not more power efficient, it can never consume less energy, with or without race-to-sleep. If the chip IS more power efficient, but doesn't have the sleep facility, it may not use less energy in all scenarios.

    In other words, for a higher-powered chip to reduce energy in ALL situations, it needs to a) be fundamentally more power efficient, and b) be able to sleep when idle (race-to-sleep).
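
    The arithmetic above is easy to verify; a minimal sketch using the same hypothetical numbers (1.3W and 1W active draw, 0.05W idle, all illustrative rather than measured):

    ```python
    def energy_joules(active_w, idle_w, tasks_per_s, tasks, window_s):
        busy_s = tasks / tasks_per_s            # time spent working
        idle_s = max(window_s - busy_s, 0.0)    # time left for sleep
        return active_w * busy_s + idle_w * idle_s

    # Chip one: 1.3W active, 2 tasks/s; chip two: 1W active, 1 task/s
    no_sleep   = energy_joules(1.3, 1.3, 2, 10, 10)   # 13.0 J: no race-to-sleep
    with_sleep = energy_joules(1.3, 0.05, 2, 10, 10)  # 6.75 J: race-to-sleep
    slower     = energy_joules(1.0, 1.0, 1, 10, 10)   # 10.0 J: never idles
    print(no_sleep, with_sleep, slower)
    ```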
  • djboxbaba - Monday, October 31, 2011

    Well done on the review Brian and Anand, excellent job as always. I was resisting the urge to tweet you about the ETA of the review, and of course I end up doing it the same day as you release the review :).
  • Mitch89 - Monday, October 31, 2011

    "This same confidence continues with the 4S, which is in practice completely usable without a case, unlike the GSM/UMTS iPhone 4. "

    Every time I read something like this, I can't help but compare it to my experience with iPhone 4 reception, which was never a problem. I'm on a very good network here in Australia (Telstra), and never did I have any issues with reception when using the phone naked. Calls in lifts? No problem. Way outside the suburbs and cities? Signal all the way.

    I never found the iPhone 4 to be any worse than other phones when I used it on a crappy network either.

    Worth noting, battery life is noticeably better on a strong network too...
  • wonderfield - Tuesday, November 1, 2011

    Same here. It's certainly possible to "death grip" the GSM iPhone 4 to the point where it's rendered unusable, but this certainly isn't the typical use case. For Brian to make the (sideways) claim that the 4 is unusable without a case is fairly disingenuous. Certainly handedness has an impact here, but considering 70-90% of the world is right-handed, it's safe to assume that 70-90% of the world's population will have few to no issues with the iPhone 4, given it's being used in an area with ample wireless coverage.
  • doobydoo - Tuesday, November 1, 2011

    I agree with both of these. I am in a major capital city which may make a difference, but no amount or technique of gripping my iPhone 4 ever caused dropped calls or stopped it working.

    Very much an over-stated issue in the press, I think.
  • ados_cz - Tuesday, November 1, 2011

    It was not over-stated at all, and the argument that most people are right handed does not hold ground. I live in a small town in Scotland and my usual signal strength is 2-3 bars. If I'm browsing the net on 3G without a case and holding the iPhone 4 naturally with my left hand (using the right hand for touch commands), I lose signal completely.
  • doobydoo - Tuesday, November 1, 2011

    Well the majority of people don't lose signal.

    I have hundreds of friends who have iPhone 4's who've never had any issue with signal loss at all.

    The point is you DON'T have to be 'right handed' for them to work, I have left handed friends who also have no issues.

    You're the exception, rather than the rule - which is why the issue was overstated.

    For what it's worth, I don't believe you anyway.
