
  • xinthius - Sunday, June 02, 2013 - link

    It is a shame that Swift isn't included. Reply
  • shompa - Sunday, June 02, 2013 - link

Swift won't be licensed for or used on non-Apple hardware. So even if Swift were 1,000,000x faster, it would not help any of us. Reply
  • melgross - Monday, June 03, 2013 - link

    Except for the large percentage of us here who do use Apple hardware. Reply
  • codedivine - Sunday, June 02, 2013 - link

    Author here. Unfortunately I don't have an iPhone, nor a Mac to be able to develop the app for the iPhone. So I couldn't test it. Might look at Swift in the future. Reply
  • bersl2 - Sunday, June 02, 2013 - link

    Can you imagine if Intel had never openly published the x86 instruction set architecture and rarely talked about its microarchitecture?

    Now consider the fact that extreme secrecy as to how to best interact with a big chip or with chipsets is the norm in the hardware world. It's horrible, and computing would be so much better if hardware companies actually talked with software developers openly. Because frankly, they suck at software, including firmware.
  • dishayu - Sunday, June 02, 2013 - link

Wow, Krait 400 and A15 are really quite close... No wonder the 2 GS4 variants (1.6GHz A15 vs 1.9GHz Krait 400) have similar performance. Reply
  • dishayu - Sunday, June 02, 2013 - link

    I meant krait *300

    krait 400 comes with the snapdragon 800 processors of course.
  • phoenix_rizzen - Sunday, June 02, 2013 - link

    Yeah, looks like Krait 300+ hit Qualcomm's targets of "nearly the performance of an A15 at the power levels of an A9" (give or take a bit). I'm very impressed by the S4 Pro SoC in my Optimus G. Reply
  • ifIhateOnAppleCanIbeCoolToo - Sunday, June 02, 2013 - link

Krait 300s had very limited memory read/write performance, which makes them firmly last-gen in non-synthetic benchmarks compared to A15s. Reply
  • npp - Sunday, June 02, 2013 - link

    Very nice article, would love to see more like this one. I really feel Anandtech should maintain its focus on low-level architecture details alongside the more consumer oriented reviews. Reply
  • codedivine - Sunday, June 02, 2013 - link

    Author here. Thanks for your kind words :) Reply
  • tipoo - Sunday, June 02, 2013 - link

    Thanks for this, I find this very interesting as the floating point performance of ARM chips is now very relevant since so many games are starting to run on ARM platforms, and floating point is the predominant type of math done in games (vs integer).

    I'd be curious to see where a Jaguar core would fall in this (to estimate the XBone and PS4), as well as a PowerPC 750 (wii u) although the latter would be harder to find. ARM cores seem to be closing in on the performance of the low end x86 cores, even if Jaguar is still quite a ways ahead, I wonder how different the FP performance is.
  • codedivine - Sunday, June 02, 2013 - link

    Author here. Jaguar throughput is discussed in the article discussion. Summary: 3 fp64 flops/cycle, 8 fp32 flops/cycle. Reply
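As a sketch of what those per-cycle figures mean at the chip level (the 1.6GHz clock and 4-core count below are assumed example values, not numbers from the article):

```python
# Peak throughput = flops/cycle x clock (GHz) x cores.
# The 3 fp64 and 8 fp32 flops/cycle come from the comment above;
# 1.6 GHz and 4 cores are assumed, illustrative values.
def peak_gflops(flops_per_cycle, ghz, cores=1):
    return flops_per_cycle * ghz * cores

fp32_peak = peak_gflops(8, 1.6, cores=4)  # ~51.2 GFLOPS
fp64_peak = peak_gflops(3, 1.6, cores=4)  # ~19.2 GFLOPS
print(fp32_peak, fp64_peak)
```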
  • Wilco1 - Sunday, June 02, 2013 - link

    Here are the Geekbench results of Jaguar vs A15:

    On FP A15 wins by a good margin. On integer Jaguar is slightly faster.
  • tipoo - Sunday, June 02, 2013 - link

    That's unexpected. I would have thought the Jaguar would lead in almost every situation, being higher power. Reply
  • Wilco1 - Monday, June 03, 2013 - link

    Remember A15 is 3-way OoO, supports 1 load and 1 store per cycle and has very wide issue, so it can easily leave Jaguar behind on compute intensive code as the results show. However Jaguar wins on memory intensive code due to its larger L2 and faster memory system. Reply
  • aliasfox - Monday, June 03, 2013 - link

    If historical Mac G3 benchmarks are anything to go by, I don't think the PPC 750 will be much faster at floating point than the best of ARM.

Apple used the PPC750 and called it the G3 back in the day. New ones are higher clocked, more power efficient, and may have more/faster cache, but should be fundamentally the same. Assuming this, one should be able to extrapolate synthetic benchmarks by scaling for cores and frequency, no?
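The extrapolation suggested above could be sketched like this; the base score, clocks, and scaling factor are all hypothetical, and real workloads rarely scale perfectly linearly:

```python
# Naive scaling model: score grows linearly with clock and core count.
# base_score/base_ghz would come from whatever historical G3 benchmark you
# start from; the 'scaling' factor (<= 1.0) hedges against imperfect
# multi-core scaling. All numbers here are made-up illustrations.
def extrapolate_score(base_score, base_ghz, target_ghz, target_cores=1, scaling=1.0):
    return base_score * (target_ghz / base_ghz) * target_cores * scaling

# Hypothetical example: a 400MHz single-core G3 score of 100, projected
# to a 1.2GHz tri-core part with 80% multi-core scaling. ~720
print(extrapolate_score(100, 0.4, 1.2, target_cores=3, scaling=0.8))
```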
  • DanNeely - Sunday, June 02, 2013 - link

Where does Atom stand in the mix? I think it would be a useful datapoint since Intel is positioning the Atom against ARM-based systems. Reply
  • Wilco1 - Monday, June 03, 2013 - link

IIRC Atom has similar peak FP capabilities to the Cortex-A9; however, actual performance is far lower. E.g. a 1.4GHz Cortex-A9 wins most single-threaded FP benchmarks against a 2GHz Z2480:

    This also shows how far behind Atom is compared with last-generation phones. Intel needs Silvermont desperately to try to close the gap.
  • watersb - Sunday, June 02, 2013 - link

    Excellent work!

    I wonder if GPU-based floating point will see more rapid adoption in mobile space.
  • oc3an - Sunday, June 02, 2013 - link

    How did you account for time spent not running your benchmark, i.e. when the OS is servicing interrupts or switched to a different task? Reply
  • codedivine - Sunday, June 02, 2013 - link

Well, it is difficult to measure those. But I do not think they were significant issues in this test. Reply
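One rough way to gauge that interference (a general sketch, not what the article did) is to compare wall-clock time against the CPU time actually charged to the process:

```python
import time

# Any gap between wall-clock time and process CPU time is a rough upper
# bound on time the OS spent elsewhere (interrupts, other tasks) during
# the run. This is a sketch of the idea, not the article's method.
def busy_work(n=500_000):
    acc = 0.0
    for i in range(n):
        acc += i * 0.5
    return acc

wall_start = time.perf_counter()
cpu_start = time.process_time()
result = busy_work()
wall_elapsed = time.perf_counter() - wall_start
cpu_elapsed = time.process_time() - cpu_start
print(f"wall={wall_elapsed:.4f}s cpu={cpu_elapsed:.4f}s "
      f"elsewhere~{max(wall_elapsed - cpu_elapsed, 0.0):.4f}s")
```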
  • phoenix_rizzen - Sunday, June 02, 2013 - link

If you are running your app via Android, consider installing Diagnosis Pro. It will allow you to add an overlay that shows you the exact frequency of each individual core, as polled every X seconds. Alternatively, it can just log the data to its internal database for later export.

    Works quite nicely on an Optimus G (quad-core Snapdragon S4 Pro SoC).

I've been using it to test how well different CPU governors and CPU hotplug drivers work.
  • codedivine - Sunday, June 02, 2013 - link

    Thanks for the tip! I will look into it! Reply
  • ChronoReverse - Tuesday, June 04, 2013 - link

    Yeah, the thermal throttling on the Krait devices is very aggressive (I'm currently using a hack on my GS4 to stop it because I like benchmarks).

Overhead on Android is also pretty high. With your RgbenchMM, the difference on my GS4 is 3000 vs 3400 if I go ahead and kill tasks first.
  • whyso - Sunday, June 02, 2013 - link

Would really be nice to see
1) Jaguar
2) Atom
3) Ivy Bridge
in the mix. (Though of course the test would have to be coded differently.)
  • codedivine - Sunday, June 02, 2013 - link

One needs to be careful when comparing instruction throughput across ISAs, because instructions on different ISAs are not equivalent. However, I am certainly looking into it. Reply
  • Marat Dukhan - Sunday, June 02, 2013 - link

    This is not an artificial benchmark, but it gets close to peak: Reply
  • nakul02 - Sunday, June 02, 2013 - link

    Check out this paper:
    Gaurav Mitra, Beau Johnston, Alistair P. Rendell, Eric McCreath, Jun Zhou. Use of SIMD Vector Operations to Accelerate Application Code Performance on Low-Powered ARM and Intel Platforms, Parallel and Distributed Processing Symposium Workshops & PhD Forum (IPDPSW), 2013 IEEE 27th International. IEEE, 2013.
  • ZeDestructor - Monday, June 03, 2013 - link

So they try to (IMO), but there are only so many architecture launches each generation, so you kinda have to do the more consumer-focused stuff to fill in the gap. Reply
  • skiboysteve - Monday, June 03, 2013 - link

My work is going to be using a Cortex-A9 for a project soon and that team is deciding between NEON and VFPv3. Can you comment on the precision and performance tradeoffs?

Thanks for the great article!
  • Wilco1 - Monday, June 03, 2013 - link

NEON supports 32-bit floats only, but with NEON the A9 can do 2 FMACs per cycle rather than 1 with VFP. There is no tradeoff in precision if your code already uses 32-bit floats (NEON flushes denormals to zero by default; with VFP you can choose, but either way it doesn't affect any real code). Reply
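On the denormal point, here is a small illustration (plain Python, not ARM code) of what a flushed fp32 denormal actually is:

```python
import struct

# The smallest normal float32 is 2**-126; nonzero values below that are
# denormals. NEON on the A9 flushes such values to zero in arithmetic,
# while VFP can be configured to keep them.
def fp32_bits(x):
    """Round a Python float to float32 and return its raw bit pattern."""
    return struct.unpack("<I", struct.pack("<f", x))[0]

smallest_normal = 2.0 ** -126
denormal = smallest_normal / 2  # representable only as an fp32 denormal

print(hex(fp32_bits(smallest_normal)))  # 0x800000 (normal encoding)
print(hex(fp32_bits(denormal)))         # 0x400000 (denormal encoding)
# Flush-to-zero treats the second value as 0.0, losing tiny magnitudes
# but (as noted above) rarely affecting real code.
```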
  • eiriklf - Monday, June 03, 2013 - link

Is there any chance to see the scores from a third Krait 200 device, for instance a Krait-based One X, GS3, or Optimus G? I know all of those devices have about 3x the performance of the Nexus 4 in Linpack per core, so I would love to know if you found a difference with your script. Reply
  • srihari - Monday, June 03, 2013 - link

Can you compare with Intel? I understand you have NEON instructions in your test, but x86 vs ARM would be a good comparison. Reply
  • srihari - Monday, June 03, 2013 - link

Performance is not the only criterion for comparison. I would conclude Krait 300 clearly leads considering performance + power. Reply
  • banvetor - Wednesday, June 05, 2013 - link

Great article, thanks for the work. Looking forward to more in the series... :) Reply
  • Parhelion69 - Wednesday, June 05, 2013 - link

Could you update this article with numbers from the Exynos 5 Octa in the SGS IV?

I've run some benchmarks and its A15 seems like quite a beast:
AnTuTu: 28086, CPU floating point: 5923
SunSpider: 652 ms
Kraken: 6392 ms
RIABench focus: 1468 ms

    I don't have geekbench but found these numbers:
    Geekbench score: 3598, floating point: 6168
  • Arkantus - Wednesday, June 19, 2013 - link

Hello, just a dumb question: the article says "I count NEON addition and multiply as four flops and NEON MACs are counted as eight flops.", and the A9 Add (fp32 NEON) is rated at 1/2 flop/cycle.
So does this mean that the Add (fp32 NEON) is slower than its VFP counterpart, since each cycle the NEON version only performs half an operation according to this table?
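The counting convention in question can be sketched as lanes x flops-per-lane; the per-cycle rates below are hypothetical illustrations, not the article's measured numbers:

```python
# A 128-bit NEON instruction has 4 fp32 lanes, so one vector add counts
# as 4 flops and one vector MAC (a multiply plus an add per lane) as 8,
# while a scalar VFP add counts as 1. Issue rates here are made up.
def flops_per_cycle(instr_per_cycle, lanes, flops_per_lane):
    return instr_per_cycle * lanes * flops_per_lane

neon_add = flops_per_cycle(1, 4, 1)  # 4 flops/cycle
neon_mac = flops_per_cycle(1, 4, 2)  # 8 flops/cycle
vfp_add  = flops_per_cycle(1, 1, 1)  # 1 flop/cycle
print(neon_add, neon_mac, vfp_add)
```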
  • sonsequence@HOTMAIL.COM - Friday, June 27, 2014 - link

    Hey this is good stuff. Can anybody here help explain something for me though?

    I'm a database apps and integration guy, not formally trained and just starting to get interested in this kind of low level stuff. I've just been reading up on DMips and wondering how they relate to flops.

    What I think I know so far:
A flop is a floating point calculation.
The "ip" in "Mip" is an instruction, so a broader term (is a flop a type of ip, or does it take 2 ips to make a flop drop?).
Instructions per second is about the rawest, most non-contextualised metric of computing power you can get. Flops are a close second.

Squeezing more instructions out of a single CPU cycle is the hard problem. There aren't massive variances in what can be done in this regard. The Krait 300 manages about 3.3 instructions per cycle, which on 4 cores at 1.7GHz works out to about 22 GigaIps (semi-source:

My question is, firstly, why are GPUs seemingly never measured in DMips and CPUs rarely in flops?
Secondly, would knowing the answer to the "firstly" explain why, despite no huge variance in DMips/MHz across different devices, the top GPUs manage 1000x faster performance measured in flops than these ARM chips? They get tera- not gigaflops whilst using a similar number of cores and a lower frequency.

Obviously they consume a tonne more power to do it, so I know it's not something for nothing, but what's the heart of that something when it comes to how much you can do in a cycle?

Ah. It's just occurred to me. Is it that an "instruction" refers to an item in a linear thread, but that just 1 of them on a GPU might include setting the RGB values for all pixels in a frame at once? That'd be a few million flops in parallel for one instruction?

Hmmm, well the real world numbers don't add up for that, but is that along the right lines? If so, why are these gigaflop numbers lower than gigamips?

    Sorry it was a long one. It can be very hard to find an intermediate starting point when you google an advanced subject.
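To the parallelism question above: peak flops scale with lane count x flops-per-lane-per-cycle x clock, which is why GPUs with thousands of lanes reach teraflops at lower clocks. The figures below are illustrative, not measurements of any specific chip:

```python
def peak_gflops(lanes, flops_per_lane_per_cycle, ghz):
    # Peak = parallel lanes x flops each lane can do per cycle x clock.
    return lanes * flops_per_lane_per_cycle * ghz

# Hypothetical quad-core CPU with 8 FP lanes per core, vs a GPU with
# 2048 shader lanes each doing one MAC (2 flops) per cycle.
cpu_peak = peak_gflops(lanes=4 * 8, flops_per_lane_per_cycle=1, ghz=1.7)   # ~54 GFLOPS
gpu_peak = peak_gflops(lanes=2048, flops_per_lane_per_cycle=2, ghz=1.0)    # ~4096 GFLOPS
print(cpu_peak, gpu_peak, gpu_peak / cpu_peak)
```

The gap comes almost entirely from lane count, not from any per-lane advantage, which is also why per-instruction metrics like DMips say so little about GPUs.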
