SPEC - MT Performance (4xlarge 16 vCPU)

The 64-core results were quite interesting and put the Graviton2 in a very competitive position, but all this talk about performance scaling varying with the loaded core count of the system made me wonder how the EC2 instances would perform at lower vCPU counts.

I fired up the same tests, this time with only rate-16 to match the number of vCPUs. These are 4xlarge EC2 instances with 16 vCPUs, but there’s one large caveat in this comparison that we must keep in mind: the Graviton2 instances very likely have no neighbours at this point in the test preview, meaning the performance scaling we’re seeing here is very much a best-case scenario for the Amazon chip. EC2 global capacity floats around 60% active usage, and I imagine Amazon distributes this horizontally across the available sockets in their datacentres. What these performance figures will look like in the real world once Graviton2 ramps up in public availability is anybody’s guess.

The AMD system likely won’t care too much about such scenarios, as its NUMA nature means an instance is isolated from noisy neighbours anyhow: we’re just seeing use of a single 8-core chip with its own memory controllers. The Intel system, however, will possibly have neighbours doing some activity on the same socket and its shared resources. I only ran one test run here; you’d probably need a lot of data to get a representative figure across EC2 usage.

For the Intel m5n instances, using a 4xlarge instance actually means you’re only on a single socket this time around. The scaling in favour of higher per-thread performance therefore isn’t expected to be as high as on the Graviton2 system, as system DRAM bandwidth and L3 are halved compared to the 16xlarge figures on the previous page.
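To put the shared-resource argument into rough numbers, here is a minimal Python sketch. It assumes a shared resource is divided evenly among the active benchmark copies, which is a simplification of how caches and memory controllers really behave, and it uses only the Graviton2's 8 memory channels and 64 cores as inputs. The function name `per_copy_share` is made up for illustration.

```python
def per_copy_share(total: float, active_copies: int) -> float:
    """Evenly divide a shared chip resource among the active benchmark copies.

    Even division is an illustrative assumption, not a model of real hardware.
    """
    if active_copies <= 0:
        raise ValueError("active_copies must be positive")
    return total / active_copies


# Graviton2: 8 memory channels shared by up to 64 cores.
MEMORY_CHANNELS = 8

full_load = per_copy_share(MEMORY_CHANNELS, 64)  # rate-64: every core is busy
best_case = per_copy_share(MEMORY_CHANNELS, 16)  # rate-16 with no neighbours

print(f"channels per copy at rate-64: {full_load:.3f}")
print(f"channels per copy at rate-16: {best_case:.3f}")
print(f"best-case headroom per copy: {best_case / full_load:.1f}x")
```

Under this assumption a rate-16 run on an otherwise idle chip has four times the memory channels per copy of a rate-64 run. Any neighbours on the same chip would eat into that headroom.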

Also, since we’re testing 16 vCPU setups here, we can make an apples-to-apples comparison between the first- and second-generation Graviton systems, which should be fun.

SPECint2006 Rate Estimated Scores (16 vCPU)

The comparison between the two generations of Graviton processors here is also astounding. Memory-intensive workloads favour the newer Graviton2 by at least 2x, more often by 3x, 4x or 5x, and by up to 7x in libquantum.

The AMD system, as expected, doesn’t gain much from using fewer cores, as no additional shared resources become available on a per-thread basis. The Intel chip fares slightly better per thread, but doesn’t see the same per-thread gains (or should I say, reverse scaling) as achieved by the Graviton2.

SPECfp2006(C/C++) Rate Estimated Scores (16 vCPU)

In fp2006, we see more or less the same kind of results.

SPEC2006 Rate-16 Estimated Total (4xlarge)

Overall, in the 16-vCPU rate results the Graviton2 extends the performance advantage it showcased in the 64-core results, ending up with an even bigger margin.

SPECint2017 Rate Estimated Scores (16 vCPU)

SPECfp2017 Rate Estimated Scores (16 vCPU)

SPEC2017 Rate-16 Estimated Total (4xlarge)

The SPEC2017 results again lead to the same conclusion: the Graviton2 really gains a ton of per-thread performance through the ability to use more of the chip’s L3 cache and 8 memory channels. Whilst in the rate-64 results the Graviton2 and the Xeon were neck-and-neck in fp2017, here the Graviton2 ends up with a 44% performance advantage.
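For readers who want to derive such figures from their own runs, the following Python sketch shows the arithmetic behind a percentage lead and a per-copy scaling gain. The helper names are hypothetical, and the three scores are placeholders chosen only so that the ratio works out to 44%; they are not the measured results from the charts.

```python
def per_copy_score(rate_score: float, copies: int) -> float:
    """Average throughput contributed by each concurrent benchmark copy."""
    if copies <= 0:
        raise ValueError("copies must be positive")
    return rate_score / copies


def advantage_pct(score_a: float, score_b: float) -> float:
    """How far score_a leads score_b, as a percentage of score_b."""
    return (score_a / score_b - 1.0) * 100.0


# PLACEHOLDER scores for illustration only, not measurements.
chip_a_rate16 = 36.0  # hypothetical rate-16 total for chip A
chip_b_rate16 = 25.0  # hypothetical rate-16 total for chip B
chip_a_rate64 = 96.0  # hypothetical rate-64 total for chip A

lead = advantage_pct(chip_a_rate16, chip_b_rate16)
gain = per_copy_score(chip_a_rate16, 16) / per_copy_score(chip_a_rate64, 64)

print(f"lead at rate-16: {lead:.0f}%")
print(f"per-copy gain, rate-16 vs rate-64: {gain:.2f}x")
```

A per-copy gain above 1.0x is the "reverse scaling" effect: each copy runs faster when fewer copies compete for the chip's shared cache and memory bandwidth.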

Again, I can’t put enough emphasis on this: these results are a best-case scenario for the Graviton2 at 4xlarge with 16 vCPUs. Whether production instances are able to achieve such figures will depend very largely on the luck of the draw: whether you’re alone on the physical hardware or have neighbours on the chip. And even if you have neighbours, the performance figures will largely depend on what kind of workloads they are running alongside yours.

I saw a few articles out there comparing the performance of the m6g instances against the m5 generation instances (Skylake-SP hardware), but most of these tests were done only on medium (1 vCPU) to xlarge (4 vCPUs) sizes. When reading such pieces, it’s important to keep in mind the vast scaling advantage the Graviton2 chip has: the smaller your instance is, the greater the chance you’ll have noisy neighbours on the hardware, something that currently just doesn’t happen in the Graviton2’s preview phase.

96 Comments

  • jbrower - Saturday, July 24, 2021

    Well at least you have a troll -- mark of success for authors, hehe
  • ProDigit - Wednesday, March 11, 2020

    110W is very pessimistic, and would make no sense at all, considering that the ryzen 9 3900x uses 105W at 12 cores 24 threads at 4.6Ghz and 7nm, and the 3950 does the same with 4 more cores.
    Plus, regular arm based (AMLogic) boxes use 3Watt in total under load (that includes CPU+Ethernet+RAM+Emmc) for 4 CPU cores running at 1,9Ghz.
    If you ask me, 64 core arm CPUs running at 2Ghz should run at around just over 1 watt per core, making it a 65W tdp chip
  • Andrei Frumusanu - Wednesday, March 11, 2020

    There's 64 PCIe4 lanes and 8 memory controllers in there as well.
  • cdome - Wednesday, March 11, 2020

    Quick question. Does Graviton2 have support for SVE2 vector extension? if yes how wide are execution units? thank you
  • Andrei Frumusanu - Wednesday, March 11, 2020

    No, there's 2x128b v8 ASIMD/NEON pipes.
  • Soulkeeper - Wednesday, March 11, 2020

    What was used to generate the images on page 2 ?
    ie: https://images.anandtech.com/doci/15578/AMD-Epyc-6...

    Is this app/source available to download ?

    Thanks
  • sharath.naik - Wednesday, March 11, 2020

    Whats behind the name Annapurna? The name is Indian in origin but the company is Israeli.
  • nijimon - Thursday, March 12, 2020

    Judging by the logo it could be referring to the massif in the Himalayas.
    https://en.wikipedia.org/wiki/Annapurna_Massif
  • Andy Chow - Thursday, March 12, 2020

    "I recently had the time to write a new custom microbenchmark for testing synchronisation latencies of CPU cores, exhibiting some of the cache-coherency as well as physical layouts of current designs."

    Wow, and what a benchmark that turned out to be. Please consider packaging it and releasing it. Or giving us the code so we can run it. I would really love to run that test on a few of my machines. I am frustrated with current benchmarks on this area also, and you seem to have built the perfect solution.
  • ballsystemlord - Thursday, March 12, 2020

    1 Grammar error:

    "Overall, it's a bit odd to see GCC ahead in that many workloads given that LLVM the is the primary compiler for billions of Arm devices in the mobile space."
    Extra "the":
    "Overall, it's a bit odd to see GCC ahead in that many workloads given that LLVM is the primary compiler for billions of Arm devices in the mobile space."