SoC Analysis: On x86 vs ARMv8

Before we get to the benchmarks, I want to spend some time discussing the impact of CPU architecture at a moderate level of technical depth. At a high level, there are a number of peripheral issues when comparing these two SoCs, such as the quality of their fixed-function blocks. But when you look at what consumes the vast majority of the power, the CPU is competing with blocks like the modem/RF front-end and the GPU.


x86-64 ISA Registers

Probably the easiest place to start when comparing designs like Skylake and Twister is the ISA (instruction set architecture). This subject alone is probably worthy of its own article, but the short version for those who aren't familiar with the topic is that an ISA defines how a processor should behave in response to certain instructions, and how those instructions should be encoded. For example, if you were to add together the two integers in the EAX and EDX registers, x86-32 dictates that this is encoded as 01 d0 in hexadecimal. In response to this instruction, the CPU would add whatever value was in the EDX register to the value in the EAX register and leave the result in the EAX register.
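To make that encoding concrete, here's a minimal Python sketch that decodes the two bytes 01 d0 into the operation described above. It handles only the register-to-register form of this one opcode; a real decoder is vastly more involved.

```python
# Minimal sketch: decode the two-byte x86-32 instruction 01 d0.
# Opcode 0x01 is "ADD r/m32, r32"; the second byte is the ModRM byte.

REG32 = ["EAX", "ECX", "EDX", "EBX", "ESP", "EBP", "ESI", "EDI"]

def decode_add(opcode, modrm):
    assert opcode == 0x01          # ADD r/m32, r32
    mod = (modrm >> 6) & 0b11      # addressing mode (0b11 = register-direct)
    reg = (modrm >> 3) & 0b111     # source register field
    rm  = modrm & 0b111            # destination register field (in this mode)
    assert mod == 0b11             # handle the register-to-register form only
    return f"add {REG32[rm]}, {REG32[reg]}"

print(decode_add(0x01, 0xD0))      # the bytes 01 d0 -> "add EAX, EDX"
```

In Intel syntax the destination comes first, so this confirms the result lands in EAX.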


ARMv8 A64 ISA Registers

The fundamental difference between x86 and ARM is that x86 is a relatively complex ISA, while ARM is comparatively simple. One key difference is that ARM dictates that every instruction is a fixed number of bits. In the case of ARMv8-A's A64 instruction set and ARMv7-A's ARM instruction set, all instructions are 32 bits long. ARMv7-A also offers Thumb mode, in which instructions are 16 bits long, while Thumb-2 mixes 16- and 32-bit instructions and is therefore a variable-length encoding, so in some sense the same trade-offs apply there. It's important to distinguish between instructions and data here: even though AArch64 uses 32-bit instructions, the register width is 64 bits, which is what determines things like how much memory can be addressed and the range of values a single register can hold. By comparison, Intel's x86 ISA uses variable-length instructions. In both x86-32 and x86-64/AMD64, each instruction can be anywhere from 8 to 120 bits (1 to 15 bytes) long depending upon how it is encoded.

At this point it should be evident that, on the implementation side, a decoder for x86 instructions is going to be more complex. Because ARM instructions are a fixed length, a CPU implementing the ARM ISA can simply read instructions 2 or 4 bytes at a time. A CPU implementing the x86 ISA, on the other hand, has to determine how many bytes to pull in for each instruction by examining its leading bytes before it even knows where the next instruction begins.
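As a rough sketch of that difference, the loop below splits a byte stream two ways: the fixed-length case can slice blindly at 4-byte boundaries, while the variable-length case must consult the leading byte of each instruction (reduced here to a made-up length table) before it knows where the next one starts.

```python
# Sketch: why fixed-length decode is simpler. The length table in the
# "variable" case is a toy stand-in; real x86 length decoding inspects
# prefixes, opcode, ModRM, SIB, displacement, and immediate bytes.

def split_fixed(stream, width=4):
    # A64-style: every instruction is `width` bytes, so slicing is trivial.
    return [stream[i:i + width] for i in range(0, len(stream), width)]

def split_variable(stream, length_of_first_byte):
    # x86-style: the leading byte must be decoded before the
    # instruction boundary is known.
    out, i = [], 0
    while i < len(stream):
        n = length_of_first_byte[stream[i]]
        out.append(stream[i:i + n])
        i += n
    return out

print(split_fixed(bytes(range(8))))   # two 4-byte instructions
lengths = {0x90: 1, 0x01: 2}          # nop is 1 byte; the add above is 2
print(split_variable(bytes([0x90, 0x01, 0xD0, 0x90]), lengths))
```

Note how the second loop is inherently serial: instruction N+1's position depends on decoding instruction N, which is exactly the dependency that makes wide parallel decode of x86 hard.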


A57 Front-End Decode: Note the Lack of a uop Cache

While it might sound like the x86 ISA is simply at a disadvantage here, it's important not to oversimplify. Although the decoder of an ARM CPU always knows how many bytes to pull in, fixed-length encoding inherently means that unless every bit of a 2- or 4-byte instruction is used, each instruction carries wasted bits. While it may not seem like a big deal to "waste" a byte here and there, lower code density means a given amount of work occupies more space in the L1 instruction cache, and this can become a significant bottleneck in how quickly instructions get from the L1 instruction cache to the front-end instruction decoder of the CPU. The major issue is that due to RC delay in the metal wire interconnects of a chip, increasing the size of an instruction cache inherently increases the number of cycles it takes for an instruction to travel from the L1 cache to the instruction decoder. And if a cache doesn't have the instruction you need at all, it can take hundreds of cycles for it to arrive from main memory.
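A back-of-envelope way to see why density matters: with a fixed fetch window per cycle, the number of instructions delivered to the decoder depends directly on average instruction size. The numbers below are purely illustrative, not measurements of any real core.

```python
# Back-of-envelope sketch (illustrative numbers, not measurements):
# how many instructions fit in one fetch window?

FETCH_WINDOW_BYTES = 16  # a common fetch width; hypothetical here

def instructions_per_fetch(avg_instruction_bytes):
    # Instructions delivered to the decoder per fetch.
    return FETCH_WINDOW_BYTES / avg_instruction_bytes

print(instructions_per_fetch(4.0))   # fixed 32-bit encoding -> 4.0
print(instructions_per_fetch(3.0))   # a denser variable-length mix -> ~5.3
```

The same effect applies to cache capacity: denser code means more of the working set fits in the same number of L1 bytes, which is the flip side of the variable-length decoding cost.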


x86 Instruction Encoding

Of course, there are other issues worth considering. In the case of x86, individual instructions can be incredibly complex. One of the simplest illustrations is the add instruction, where either the source or the destination can be in memory, although both cannot be. An example might be addq (%rax,%rbx,2), %rdx, which could take around 5 cycles of latency on something like Skylake, since it folds an address calculation and a load into the add. Of course, pipelining and other tricks can make the throughput of such instructions much higher, but that's another topic that can't be properly addressed within the scope of this article.
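To unpack what that single instruction implies, the sketch below models the work packed into addq (%rax,%rbx,2), %rdx: compute the effective address base + index*scale, load 8 bytes from it, and add them into the destination. The register values, the dictionary standing in for memory, and the stored value are all invented for illustration.

```python
# Sketch of the work folded into `addq (%rax,%rbx,2), %rdx`:
# one effective-address computation, one 64-bit load, one add.
# `memory` is a stand-in dict, not a real address space.

def addq_mem(regs, memory):
    ea = regs["rax"] + regs["rbx"] * 2                      # base + index*scale
    regs["rdx"] = (regs["rdx"] + memory[ea]) & (2**64 - 1)  # 64-bit wraparound
    return regs

regs = {"rax": 0x1000, "rbx": 0x8, "rdx": 5}
memory = {0x1010: 37}                    # 0x1000 + 0x8*2 = 0x1010
print(addq_mem(regs, memory)["rdx"])     # 5 + 37 -> 42
```

Three conceptually separate operations, one instruction: that is the CISC trade in miniature.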


ARMv8 Instruction Encoding

By comparison, the ARM ISA has no direct equivalent to this instruction. For our add example, ARM would require a separate load instruction before the add. This has two notable implications. The first is that this is once again an instruction-density advantage for x86, as fewer instructions (and often fewer total bytes) are needed to express the same work. The second is that for a "pure" CISC CPU that executes such an instruction as a single indivisible unit, you now have a barrier to a number of performance and power optimizations, as any instruction dependent upon the result of the current one can't be pipelined or executed in parallel with it.
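As a sketch (in Python rather than assembly), the snippet below models the RISC-style two-step version of the addq (%rax,%rbx,2), %rdx example: an explicit load, then a register-register add. The register names, memory dict, and values are all invented for illustration, and real A64 scaled addressing restricts the scale factor to the access size, which this toy model ignores.

```python
# Sketch: the same work split RISC-style into a load instruction followed
# by a register-register add. `memory` is again a stand-in dict.

def ldr(regs, memory, dst, base, index, scale):
    # Load: dst <- memory[base + index*scale]
    regs[dst] = memory[regs[base] + regs[index] * scale]

def add(regs, dst, a, b):
    # Add: dst <- a + b (64-bit wraparound); depends on the load's result
    regs[dst] = (regs[a] + regs[b]) & (2**64 - 1)

regs = {"x0": 0x1000, "x1": 0x8, "x2": 5, "x3": 0}
memory = {0x1010: 37}
ldr(regs, memory, "x3", "x0", "x1", 2)   # x3 <- [x0 + x1*2]
add(regs, "x2", "x2", "x3")              # x2 <- x2 + x3
print(regs["x2"])                        # 5 + 37 -> 42
```

The same result takes two instructions, but the dependency between them is now explicit in the architectural registers, which is exactly what gives an out-of-order scheduler something to work with.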

The final issue here is that x86 simply has an enormous number of instructions that must be supported for backwards compatibility. Part of the reason x86 became so dominant in the market is that code compiled for the original Intel 8086 would run on any future x86 CPU, but the original 8086 didn't even have memory protection. As a result, every x86 CPU made today still has to start in real mode and support the original 16-bit registers and instructions, in addition to the 32-bit and 64-bit registers and instructions. Of course, actually running an 8086-era program on a modern system is a non-trivial task, but even in the x86-64 ISA it isn't unusual to see instructions that are encoded identically to their x86-32 equivalents. By comparison, ARMv8 is designed such that a core can only switch between AArch64 and AArch32 (ARMv7-compatible) execution at exception boundaries, so in practice a program runs one type of code or the other.

From the 1980s into the 1990s, this became one of the major reasons why RISC looked set to dominate, as CISC ISAs like x86 tended to produce CPUs that used more power and die area for the same performance. Today, however, the ISA is largely irrelevant to this discussion for a number of reasons. The first is that beginning with the Intel Pentium Pro and AMD K5, x86 CPUs have really been RISC-like cores with microcode or other logic to translate x86 instructions into internal micro-operations. The second is that instruction decoding has been increasingly optimized around the few instructions commonly emitted by compilers, which makes the x86 ISA practically less complex than the standard might suggest. The final change is that ARM and other RISC ISAs have grown increasingly complex as well, as it became necessary to add instructions supporting floating point math, SIMD operations, CPU virtualization, and cryptography. As a result, the RISC/CISC distinction is mostly irrelevant to discussions of power efficiency and performance; microarchitecture is the main factor at play now.

Comments

  • Jumangi - Saturday, January 30, 2016 - link

    Why wouldn't it? It's in a similar price range and is pushed as a "professional" device for use in business.
  • eNT1TY - Wednesday, January 27, 2016 - link

    I only owned the device for 3 weeks before returning it, but I must say the Apple Pencil was fantastic. For my needs the iPad Pro wasn't particularly any more "pro" than an iPad Air 2, but combined with the Pencil it comes pretty damn close to being something special for graphics work, though you are ultimately still not going to finalize/complete any work on it. You can get a hell of a start, though. File management sucks, like going around your ass to get to your elbow.

    But back to the Pencil, it is amazing when the app takes full advantage of it. Adobe Sketch is not that great even when pen-optimized, but Procreate is a different beast. The Pencil has no perceptible lag, something even my Wacom Pro Pen on my Cintiq 27QHD can't claim, and it has more accurate angle recognition and doesn't distort drawing at the edges of the screen. Procreate is the real deal and much better at exporting complex PSDs than Adobe's own apps. Adobe Draw fared a bit better than Sketch as far as responsiveness to the Pencil. uMake is no SolidWorks and is too basic and weak for a $15 monthly subscription app, but it felt intuitive with the Pencil.

    I can wait for the Pro 2; it will have a mature selection of apps by then, and hopefully the newer version of iOS will have better file management solutions. Man, Apple just needs to make a Pencil-compatible iMac as well and stick it to Wacom.
  • jjpcat@hotmail.com - Wednesday, January 27, 2016 - link

    It's interesting to compare A9X and Intel M. I am wondering if Apple has any data to back up its claim that A9X is faster than 80% of portable PCs released in the past year.

    I would like to see more info:

    1. Die size: A9X is 147 mm^2 while Core M is 99 mm^2. So Intel may have an advantage here. But I am not sure we can conclude that Intel has a cost advantage.
    2. Where's the GPU comparison?
    3. I don't trust Intel's TDP claim. It's better to include that in your power consumption test.
  • Constructor - Wednesday, January 27, 2016 - link

    1. Processes are different, as are the respective chip designs on the whole (including what's on the chips), so the physical size doesn't say that much.

    2. In other tests. The A9X looks quite good in these.

    3. TDP doesn't say much about actual consumption in real life anyway. It only says how much heat the cooling solution will have to move away at maximum. Battery usage can still vary substantially even at the same nominal TDP if – for instance – one of the chips can do "regular work" at lower power than the other. TDP only really comes into play when the chips ramp up to maximum performance and try to stay there.

    The CPU comparison part of this test is pretty sketchy. Not necessarily wrong, but likely disregarding crucial influences on the particular benchmarks (vectorization by the compilers being part of it).
  • rightbrain - Friday, January 29, 2016 - link

    Another useful comparison would be die size, since it gives a rough but real cost comparison.
  • Constructor - Friday, January 29, 2016 - link

    Not really, because densities are different and so are yields as well as process and SoC development costs.
  • ads2015 - Monday, February 1, 2016 - link

    Apple's SPEC06 builds use the options "-O3 -flto", not "-Ofast", and all cases pass:
    http://llvm.org/devmtg/2015-10/slides/Gerolf-Perfo...
    LLVM also has 30+% performance headroom left on SPEC06.
  • Delton Esteves - Wednesday, February 3, 2016 - link

    Biased review.

    iPad Pro:

    No USB ports
    No DisplayPort or HDMI
    No memory card slot
    No kickstand
    No pen included

    Keyboard:
    Is expensive
    No backlight
    No trackpad
    No function keys
    There is no place to rest your hands
    Very complicated to set up

    iPad Pro runs a mobile OS

    Summing up, the iPad Pro cannot be considered a Pro device, so stop being a fanboy. Surface Pro 4 wins
  • Crisisis - Thursday, February 4, 2016 - link

    Just.in.the.same.paragraph: "stop being a Fanboy" and "Surface Pro 4 wins". A new definition of irony.
  • Delton Esteves - Wednesday, February 10, 2016 - link

    "A new definition of irony." Why do you think the iPad Pro is better? Justify it.
