SPEC - Multi-Core Performance Scaling

I mentioned earlier that the Graviton2’s L3 cache is shared amongst all of its cores, and we also saw that just 8-16 cores are enough to saturate the system’s memory controllers. To put those aspects into better context, I ran the SPEC suites at rate instance counts of 16, 32, 48, and the full 64 cores, and normalised the results relative to the per-thread performance shown in the rate-1 single-threaded runs.
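
For clarity, this is the normalisation being applied, expressed as a minimal sketch (the function name and the example numbers are illustrative, not actual measurement data):

```python
# Per-thread scaling = (rate-N score / N) / rate-1 score.
# A value of 1.0 means a copy at rate-N runs as fast as it does alone;
# lower values indicate contention for the shared L3 and DRAM resources.

def per_thread_scaling(rate_score: float, copies: int, rate1_score: float) -> float:
    """Normalise a SPEC rate result to per-thread performance vs. the rate-1 run."""
    return (rate_score / copies) / rate1_score

# Hypothetical example: a workload scoring 40 at rate-1 and 1280 at rate-64
# retains 50% of its single-threaded per-core performance.
print(per_thread_scaling(1280.0, 64, 40.0))  # -> 0.5
```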

What this attempts to showcase is the performance scaling of the full SoC across varying loads of the different workload types. Scaling linearly across cores might be easy for some workloads, but anything with even a modicum of memory pressure should see greater slowdowns, given that all the threads are competing for the shared L3 and DRAM resources.

The testing for all figures here was done on a 16xlarge instance with 64 vCPUs, to avoid the possibility of noisy neighbours and to give better reliability in the lower core count results.

[Chart: SPECint2006 Speed - Graviton2 Core Performance Scaling]

As expected, we’re seeing quite a wide range of results here, and it’s also a good showcase of which SPEC workloads are memory and cache intensive and which are not. Workloads such as 445.gobmk and 456.hmmer aren’t surprising in their near-linear scaling as they don’t exert much cache pressure, and the Graviton2’s 1MB of L2 per core is also more than enough for 464.h264ref.

On the other hand, well-known memory intensive workloads such as 462.libquantum absolutely crater in terms of per-thread performance. This bandwidth-demanding workload fully saturates the system’s memory bandwidth with only a few cores, meaning performance barely increases no matter how many more threads and cores we throw at it. This scaling behaviour is mirrored, to a greater or lesser degree, in other workloads with varying levels of cache and memory pressure.
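
One simple way to reason about this behaviour is a hard bandwidth-ceiling model: aggregate throughput grows until the copies collectively saturate the memory controllers (roughly 8-16 cores on this system, per the earlier bandwidth results), then flatlines, so per-thread performance falls off as 1/N. A rough sketch of that model, with the saturation point as an assumed parameter:

```python
# Illustrative model: aggregate throughput grows linearly until the copies
# collectively saturate DRAM bandwidth, then stays flat. Per-thread
# performance is aggregate throughput divided by the number of copies.

def per_thread_perf(copies: int, saturation_copies: int = 8) -> float:
    """Relative per-thread performance under a hard bandwidth ceiling.

    saturation_copies is an assumed parameter: the number of copies that
    already saturates the memory controllers (roughly 8-16 on Graviton2).
    """
    aggregate = min(copies, saturation_copies)
    return aggregate / copies

for n in (1, 16, 32, 48, 64):
    print(n, round(per_thread_perf(n), 3))
# With saturation at 8 copies, per-thread performance at rate-64
# drops to 8/64 = 12.5% of the rate-1 figure.
```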

The most worrying result, though, is 403.gcc. Code compilation should have been one of the bigger use-cases for a platform such as the Graviton2, but the chip has trouble scaling well with core count, undoubtedly a result of the higher cache pressure on the system. In a single-threaded scenario a core has access to 33MB of L2+L3, but with 64 cores doing the same thing at once you end up with only 1.5MB per core, assuming the shared cache is evenly contended.
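
The cache arithmetic behind those figures, as a quick back-of-the-envelope calculation (assuming the 32MB L3 is shared perfectly evenly):

```python
# Graviton2: 1MB private L2 per core, 32MB L3 shared across all 64 cores.
L2_PER_CORE_MB = 1.0
L3_TOTAL_MB = 32.0

def cache_per_core(active_cores: int) -> float:
    """Effective L2+L3 capacity per core, assuming even sharing of the L3."""
    return L2_PER_CORE_MB + L3_TOTAL_MB / active_cores

print(cache_per_core(1))   # 33.0 MB available to a single active core
print(cache_per_core(64))  # 1.5 MB per core with all 64 cores active
```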

[Chart: SPECfp2006 (C/C++) Speed - Graviton2 Core Performance Scaling]

In SPECfp2006, again, we see the well-known memory intensive workloads such as 433.milc and 470.lbm crater in their per-thread performance the more threads you throw at the system, while other workloads are able to scale near linearly with cores.

[Chart: SPECint2017 Rate - Graviton2 Core Performance Scaling]

In SPECint2017, we see the workload changes I referred to previously on the single-threaded page. The new gcc and mcf tests actually scale better than their 2006 counterparts thanks to the reduced memory pressure of the newer test versions. It does raise the question of which variant of each test is more representative of real-world workloads of these types.

[Chart: SPECfp2017 Rate - Graviton2 Core Performance Scaling]

Compared to the int2017 suite, the fp2017 suite scales significantly worse across a larger number of workloads. When Ampere talked about its Altra processor last week and said it was “designed for integer workloads”, that didn’t make much sense other than in the context that the N1 cores lack wider SIMD execution units. What does make sense is that SPEC’s floating-point suite is a lot more memory intensive, and SoCs like the Graviton2 don’t fare as well at high loaded core counts.

It will be interesting to see where the Arm chip designers head in regards to this general memory bottleneck. If your workload isn’t too memory intensive, then scaling up to such huge core counts is an easy way to scale performance as well. On the opposite end of the spectrum, memory-hungry workloads on these chips will simply be memory starved. Arm had envisioned 64-core Neoverse N1 systems with 64-128MB of L3 cache, and the CMN-600 scales up to 256MB total in a 128-core system, which seem like more sensible and balanced targets.
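
Put into per-core terms, under the same even-sharing assumption as before, the difference between those configurations looks like this:

```python
# Per-core share of L3 for the configurations mentioned above.
configs_mb_cores = {
    "Graviton2":              (32, 64),    # 0.5 MB of L3 per core
    "N1 reference, low end":  (64, 64),    # 1.0 MB of L3 per core
    "N1 reference, high end": (128, 64),   # 2.0 MB of L3 per core
    "CMN-600 maximum":        (256, 128),  # 2.0 MB of L3 per core
}
for name, (l3_mb, cores) in configs_mb_cores.items():
    print(f"{name}: {l3_mb / cores} MB of L3 per core")
```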

Comments

  • eastcoast_pete - Tuesday, March 10, 2020 - link

    While I am currently not in the market for such cloud computing services aside from maybe some video processing, I for one welcome the arrival of a competitive non-x86 solution! Can only make life better and cheaper when and if I do. Also, ARM N1 arch lighting a fire under the x86 makers in their easy chairs will keep AMD and Intel on their feet, and that advance will filter down to my future desktops and laptops.
  • eastcoast_pete - Tuesday, March 10, 2020 - link

    Thanks Andrei! Just out of curiosity, that "noisy neighbor" behavior you saw on the Xeon? I know it's mostly speculation, but would you expect this if someone is running AVX512 on neighboring cores? AVX512 is very powerful if applications can make use of it, but things get very toasty fast. Care to speculate?
  • willgart - Tuesday, March 10, 2020 - link

    where are the real life benchmarks???
    video encoding / decoding ?
    database performance ?
    web performance ?
    https encryption ?
    etc...
  • The_Assimilator - Thursday, March 12, 2020 - link

    Agreed 100%. Without figures of actual real-world applications compiled with actual real-world compilers handling actual real-world workloads, this essentially amounts to an advertorial for Amazon, Graviton2 and Arm.
  • Danvelopment - Wednesday, March 11, 2020 - link

    This may sound stupid as I'm just getting into AWS as backup throughput for local servers on my web project that releases April.

    "If you’re an EC2 customer today, and unless you’re tied to x86 for whatever reason, you’d be stupid not to switch over to Graviton2 instances once they become available, as the cost savings will be significant."

    How do you know whether what you're using is Intel, AMD or Graviton(1/2)? (I'm using T2s right now with no weighting, and if our release gets hit hard, I will give it weight and increase its capacity.)

    As they're not actually doing anything, then I'd have no issue switching over, but can't tell what I'm on.
  • CampGareth - Wednesday, March 11, 2020 - link

    There's a list here: https://aws.amazon.com/ec2/instance-types/

    If you're on T2 instances you're on Intel chips at the moment.
  • Quantumz0d - Wednesday, March 11, 2020 - link

    No real benchmarks. Another SPEC whiteknighting. I see the AT forums' Apple CPU thread getting creamed over this again.

    ARM is a locked-down POS. You can't even buy them in this case. The Altra CPU didn't even come to STH for comparison, where it had so many cores against x86 parts. You cannot get them running the majority of consumer workloads. One can claim Power from IBM has SMT8 and was first to Gen4 and all, but if it's not consumer centric it won't generate much profit.

    The author seems to love ARM for some reason and hate x86. It's been that way since the Apple articles, but in real life we saw how the iPhone gets decimated in speed comparisons against Android flagships running the stone age Qualcomm chips. We have seen this ARM dethroning of x86 attempted numerous times, and it failed. I hope this also fails; a non-standard CPU takes all the fun out of the equation, and it needs emulation for consumer use, which slows down performance.

    People want to see all the workloads. Not SPEC. Also, where is the EPYC Rome comparison? Nowhere. Soon Milan is going to hit. Glad that AMD is alive. This stupid ARM BGA dumpster should be dead in its infancy.
  • Wilco1 - Wednesday, March 11, 2020 - link

    LOL - someone feels extremely threatened by Arm servers...

    Mission accomplished!
  • anonomouse - Wednesday, March 11, 2020 - link

    Well that was bizarrely incoherent. What workloads would you want to see instead? Nothing else you wrote made any sense or had any facts behind it.
  • Andrei Frumusanu - Wednesday, March 11, 2020 - link

    He's been doing it for the last year or two, ignore it.
