Rosetta2: x86-64 Translation Performance

Because the new Apple Silicon Macs are based on a different ISA, the hardware isn’t natively capable of running the existing x86-based software that has been developed over the past 15 years. At least, not without help.

Apple’s Rosetta2 is a new ahead-of-time binary translation system that translates existing x86-64 software to AArch64, so that the code can run on the new Apple Silicon CPUs.
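
To make "ahead-of-time binary translation" concrete, here is a deliberately toy sketch of the core idea: statically mapping x86-64 instructions to rough AArch64 equivalents before the program ever runs. This is in no way Apple's implementation – a real translator decodes machine code rather than assembly text, and must also model x86 flag semantics, the stronger x86 memory-ordering rules, and indirect branches discovered at run time. The register mapping below is purely illustrative.

```python
# Toy illustration only: a static table mapping a few x86-64 instructions
# to rough AArch64 equivalents. The rax->x0 / rbx->x1 mapping is an
# arbitrary assumption for this sketch, not Rosetta2's actual scheme.
X86_TO_A64 = {
    ("mov", "rax", "rbx"): "mov x0, x1",
    ("add", "rax", "rbx"): "adds x0, x0, x1",  # 'adds' so NZCV flags are updated,
                                               # mimicking x86's implicit flag writes
    ("ret",): "ret",
}

def translate(x86_lines):
    """Translate a list of textual x86-64 instructions ahead of time."""
    out = []
    for line in x86_lines:
        key = tuple(line.replace(",", "").split())
        out.append(X86_TO_A64[key])  # raises KeyError on anything unmapped
    return out
```

Even this toy shows one source of overhead: because x86 arithmetic implicitly updates flags, translated code often needs flag-setting instruction variants (or extra instructions) that native AArch64 compilers would avoid.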

So, what do you have to do to run Rosetta2 and x86 apps? Pretty much nothing. As long as a given application has an x86-64 code path using at most SSE4.2 instructions, Rosetta2 and the new macOS Big Sur will take care of everything in the background, and you won’t notice any difference from a native application beyond its performance.
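
While macOS hides the translation from the user, a process can at least ask at runtime whether it is itself running under Rosetta2, via Apple's documented `sysctl.proc_translated` key. A minimal sketch via `ctypes` – it returns -1 on platforms where `sysctlbyname` or the key doesn't exist, so it degrades gracefully off macOS:

```python
import ctypes
import ctypes.util

def rosetta_translated() -> int:
    """Return 1 if this process runs translated under Rosetta 2, 0 if
    native, or -1 if it cannot be determined (e.g. not on macOS)."""
    try:
        libc = ctypes.CDLL(ctypes.util.find_library("c"), use_errno=True)
        val = ctypes.c_int(0)
        size = ctypes.c_size_t(ctypes.sizeof(val))
        # sysctlbyname() and sysctl.proc_translated exist on macOS; on
        # other platforms the symbol lookup below raises AttributeError.
        rc = libc.sysctlbyname(b"sysctl.proc_translated",
                               ctypes.byref(val), ctypes.byref(size),
                               None, 0)
        return val.value if rc == 0 else -1
    except (OSError, AttributeError):
        return -1
```

On the command line, `sysctl -n sysctl.proc_translated` reports the same thing for the shell itself.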

Actually, Apple’s transparent handling of things is maybe a little too transparent, as currently there’s no way to even tell whether an application on the App Store actually supports the new Apple Silicon or not. Hopefully this is something we’ll see improved in future updates, as it would also serve as an incentive for developers to port their applications to native code. Of course, it’s now possible for developers to target both x86-64 and AArch64 via “universal binaries”, which are essentially the two architectures’ binaries glued together.
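
On a Mac, `lipo -archs <binary>` lists the slices inside such a universal binary. Under the hood these are Mach-O "fat" files, whose header is simple enough to parse in a few lines. The sketch below handles only the classic 32-bit-offset fat header (big-endian magic `0xCAFEBABE`), not the 64-bit `0xCAFEBABF` variant:

```python
import struct

# cpu_type_t values from Apple's <mach/machine.h>
CPU_TYPES = {0x01000007: "x86_64", 0x0100000C: "arm64"}

def fat_architectures(data: bytes):
    """Return architecture names found in a Mach-O universal ("fat")
    binary header: a big-endian magic and slice count, followed by one
    20-byte fat_arch record (cputype, cpusubtype, offset, size, align)
    per architecture slice."""
    if len(data) < 8:
        return []
    magic, nfat = struct.unpack_from(">II", data, 0)
    if magic != 0xCAFEBABE:  # not a classic fat binary
        return []
    archs = []
    for i in range(nfat):
        cputype, *_ = struct.unpack_from(">IIIII", data, 8 + 20 * i)
        archs.append(CPU_TYPES.get(cputype, hex(cputype)))
    return archs
```

Feeding it the first bytes of a universal app binary yields something like `["x86_64", "arm64"]`; a thin (single-architecture) Mach-O file has a different magic and returns an empty list here.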

We didn’t have time to investigate what software runs well and what doesn’t – I’m sure other publications will do a much better job across a wider variety of workloads – but I did want to post some more concrete numbers on how performance scales across different types of workloads, by running SPEC both natively and in x86-64 binary form through Rosetta2:

[Chart: SPECint2006 - Rosetta2 vs Native Score %]

In SPECint2006, there’s a wide range of performance scaling depending on the workload, with some tests doing quite well while others not so much.

The workloads that fare best under Rosetta2 primarily look to be those with a larger memory footprint that interact more with memory, scaling even above 90% of the performance of the native AArch64 binaries.

The workloads that do the worst are execution- and compute-heavy ones, with the absolute worst scaling in the L1-resident 456.hmmer test, followed by 464.h264ref.

[Chart: SPECfp2006(C/C++) - Rosetta2 vs Native Score %]

In the fp2006 workloads, things are doing relatively well, except for 470.lbm, which has a tight instruction loop.

[Chart: SPECint2017(C/C++) - Rosetta2 vs Native Score %]

In the int2017 tests, what stands out is the horrible performance of 502.gcc_r, which achieves only 49.87% of the native workload’s performance – probably due to high code complexity and just overall uncommon code patterns.

[Chart: SPECfp2017(C/C++) - Rosetta2 vs Native Score %]

Finally, in fp2017, it looks like we’re again averaging around 70-80% of native performance, depending on the workload’s code.

Generally, all of these results should be considered outstanding given the feat Apple is achieving here in terms of code-translation technology. This is not some lacklustre emulator, but a full-fledged compatibility layer that, combined with the outstanding performance of the Apple M1, allows for very real and usable performance from the existing software repertoire in Apple’s macOS ecosystem.

682 Comments

  • BushLin - Wednesday, November 18, 2020 - link

    Do you have any idea what you're talking about or is it simply that whatever Apple are doing must be ideal?
    I'm sure Intel are gutted about continuing to not meet the huge demand for their x86 chips on a fabrication process two generations out of date.
    Also, in the same power envelope AMD is beating out the M1 on an old design and fab process.
  • Spunjji - Monday, November 23, 2020 - link

    @BushLin - "I'm sure Intel are gutted about continuing to not meet the huge demand for their x86 chips on a fabrication process two generations out of date."

    This is the exact kind of nonsense that makes me disappointed every time I see a reply from you to one of my posts on this thread.
  • BushLin - Monday, November 23, 2020 - link

    I was replying to "It’s just simply that x86-64 is a hinderance to AMD and Intel"
    Maybe don't take technical/factual matters so emotionally. Also, wasn't a reply to you unless you have many accounts.
  • NetMage - Monday, November 23, 2020 - link

    Maybe take your own advice?
  • BushLin - Tuesday, November 24, 2020 - link

    How many accounts do you need?
  • markiz - Thursday, November 19, 2020 - link

    Ok, but do you imagine apple will not have advanced by then?
    I'm pretty sure they have a long pipeline ready for the next decade.
  • Steven Choi 4321 - Friday, November 20, 2020 - link

    Sure, AMD wins with a $700 chip vs a $40 M1. AMD and Intel are the Nokia and BlackBerry of the time.
  • hagjohn - Tuesday, November 24, 2020 - link

    AMD is killing it. M1 is the entry-level CPU (SoC) attempt from Apple. I think it is pretty good, considering it can go up against an i9. The way the M1 SoC is put together has some advantages and disadvantages. A big disadvantage is that everything is built on the SoC, so if you want to add memory or change out an SSD, you are out of luck. If anything in the SoC breaks, you need a new computer. A big advantage is that with everything on the SoC, Apple has removed a lot of the latency that we can see in Intel/AMD systems.

    And remember... M1 is the entry level CPU (SoC) from Apple. Wait till we get towards the more Pro versions.
  • name99 - Wednesday, November 18, 2020 - link

    Andrei, are those L1$ bandwidth numbers correct? They look off to me.
    Specifically 100 ~ 3*2*16, i.e. 3GHz times 2 loads/cycle, each 128 bits (i.e. 16B) wide. (Either a load pair of int registers, or a load of a NEON register.)
    BUT the A14 article said there were three load units...
    Are three loads/cycle only sustainable for a very short time?

    A second item of interest is does the test even try to use Load Pair or Load Pair Nontemporal of two NEON registers? Earlier A cores had a 128bit per load/store unit path to L1, so there was no bandwidth win in loading a pair of vectors, but at some point presumably this might change...
  • Frantisek - Sunday, December 20, 2020 - link

    Are you planning to review any of the M1 laptops so you can cover results in comparable laptop tests?
