Section by Andrei Frumusanu

CPU ST Performance: SPEC 2006, SPEC 2017

SPEC2017 and SPEC2006 are suites of standardized tests used to compare overall performance across different systems, architectures, microarchitectures, and setups. The code has to be compiled, and the results can then be submitted to an online database for comparison. The suites cover a range of integer and floating point workloads and can be heavily optimized for each CPU, so it is important to check how the benchmarks are being compiled and run.

We run the tests in a harness built through Windows Subsystem for Linux, developed by our own Andrei Frumusanu. WSL has some odd quirks, with one test not running due to WSL’s fixed stack size, but it is good enough for like-for-like testing. SPEC2006 is deprecated in favor of 2017, but remains an interesting comparison point in our data. Because our scores aren’t official submissions, as per SPEC guidelines we have to declare them as internal estimates on our part.

For compilers, we use LLVM for both the C/C++ and Fortran tests; for Fortran we’re using the Flang compiler. The rationale for using LLVM over GCC is better cross-platform comparisons to platforms that only have LLVM support, as well as future articles where we’ll investigate this aspect more. We’re not considering closed-source compilers such as MSVC or ICC.

clang version 10.0.0
clang version 7.0.1 (ssh://git@github.com/flang-compiler/flang-driver.git 24bd54da5c41af04838bbe7b68f830840d47fc03)

-Ofast -fomit-frame-pointer
-march=x86-64
-mtune=core-avx2
-mfma -mavx -mavx2

Our compiler flags are straightforward, with basic -Ofast and relevant ISA switches to allow for AVX2 instructions.
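
As a rough illustration of how the flags listed above combine into a single compiler invocation, here is a minimal Python sketch. The helper name `build_compile_command` and the source and output file names are hypothetical and are not taken from the actual test harness; only the flag list comes from the settings shown above.

```python
import shlex

# Flags exactly as listed in the article; everything else here is illustrative.
SPEC_FLAGS = [
    "-Ofast", "-fomit-frame-pointer",
    "-march=x86-64",
    "-mtune=core-avx2",
    "-mfma", "-mavx", "-mavx2",
]


def build_compile_command(compiler, source, output):
    """Return the argv list for compiling one source file (hypothetical helper)."""
    return [compiler, *SPEC_FLAGS, source, "-o", output]


if __name__ == "__main__":
    # "example.c" and "example" are placeholder names, not real SPEC sources.
    cmd = build_compile_command("clang", "example.c", "example")
    print(shlex.join(cmd))
```

Keeping the flags in one list makes it easy to confirm that every test was built with the same settings, which matters for like-for-like comparisons.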

To note, the requirements for the SPEC licence state that any benchmark results from SPEC have to be labelled ‘estimated’ until they are verified on the SPEC website as a meaningful representation of the expected performance. This is most often done by the big companies and OEMs to showcase performance to customers, but it is quite over the top for what we do as reviewers.

We start off with SPEC2006, a legacy benchmark by now, but one whose microarchitectural behaviour is still very well understood, which makes it useful for analysing the new Zen3 design:

SPECint2006 Speed Estimated Scores

In SPECint2006, we’re seeing healthy performance upticks across the board for many of the tests. Particularly standing out is the new 462.libquantum behaviour of the Ryzen 9 5950X which is posting more than double the performance of its predecessor, likely thanks to the new much larger cache, but also the overall higher load/store throughput of the new core as well as the memory improvements of the microarchitecture.

We’re also seeing very large performance increases for 429.mcf and 471.omnetpp, which are memory latency sensitive. Although the new design doesn’t actually change the structural latency to DRAM all that much, the new core’s much improved and smarter handling of memory, through new cache-line replacement algorithms and new prefetchers, seems to have a large impact on these workloads.

400.perlbench is interesting as it’s not really a memory-heavy or L3-heavy workload, but instead has a lot of instruction pressure. I think that Zen3’s large boost here might be due to the new OP-cache handling and optimisations, as that would make the most sense out of all the changes in the new design – it’s one of the tests that has a very high L1I cache miss rate.

A simpler test that’s solely integer execution bound and sits almost entirely in the L1D is 456.hmmer, and here we’re seeing only a minor uplift in performance that is essentially linear with the clock frequency increase of the new design, with just a 1% IPC uplift. Given that Zen3 doesn’t actually change its integer execution width in terms of ALUs or overall machine width, it makes sense to not see much improvement here.

SPECfp2006(C/C++) Speed Estimated Scores

In SPECfp2006, we’re seeing more healthy boosts in performance across the board, mostly due to the more memory-intensive nature of the workloads, and we’re seeing large IPC uplifts in most tests due to the larger L3 as well as the better memory capabilities of the core. 433.milc sees a smaller uplift than the other benchmarks, and that’s due to it being more DRAM bandwidth bound. 482.sphinx3 is also seeing a smaller 9% IPC uplift due to it not being that memory intensive.

SPEC2006 Speed Estimated Total

In the overall 2006 scores, the new Ryzen 5000 series parts are showcasing very large generational performance uplifts with margins well beyond that of the previous generation, as well as the nearest competition. Against the 3950X, the new 5950X is 36% faster in the integer workloads, and 29% faster in the floating-point workloads, which are both massive uplifts. AMD is also leaving Intel behind in terms of performance here with a 17% and 25% performance advantage against the 10900K.

SPEC2006 Speed Estimated PPC

In the performance per clock uplifts, measured at peak performance, we’re seeing a 20.87% median and 24.99% average improvement for the new Zen3 microarchitecture when compared to last year’s Zen2 design. AMD is still quite behind Apple’s A13 and A14 (review coming soon), but that’s natural given the almost double microarchitectural width of Apple’s design, running at lower frequencies. It’ll be interesting to get Apple Silicon Mac devices tested and compared against the new AMD parts.
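
For readers who want to reproduce this kind of performance-per-clock comparison, the calculation can be sketched as below. This is a minimal illustration that assumes PPC is simply the score divided by peak clock in GHz; the function names and the input numbers are hypothetical placeholders, not our measured data.

```python
def perf_per_clock(score, freq_ghz):
    """Score normalised by peak clock, i.e. score per GHz."""
    return score / freq_ghz


def ppc_uplift(new, old):
    """Relative per-clock gain of `new` over `old`; each is a (score, GHz) pair."""
    return perf_per_clock(*new) / perf_per_clock(*old) - 1.0


if __name__ == "__main__":
    # Hypothetical (score, peak GHz) pairs for illustration only.
    old_chip = (10.0, 4.5)
    new_chip = (13.0, 5.0)
    print(f"PPC uplift: {ppc_uplift(new_chip, old_chip):.1%}")
```

Normalising by clock separates the gains that come from the microarchitecture from those that come purely from higher frequency.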

SPECint2017 Rate-1 Estimated Scores

Moving on to the newer SPECint2017, we again see some large improvements for Zen3, depending on the various microarchitectural characteristics of the respective workloads. 500.perlbench_r again shows a massive 37% IPC uplift for the new architecture, very likely due to the new design and optimisations of the OP-cache in Zen3.

520.omnetpp also shows a 42% IPC uplift thanks to the memory technologies employed in the new design. Execution throughput limited workloads such as 525.x264 are seeing smaller IPC increases of 9.5%, again due to fewer changes overall on this aspect of the microarchitecture.

SPECfp2017 Rate-1 Estimated Scores

In SPECfp2017, we see a similar situation to the previous workloads. Execution-bound workloads such as 508.namd or 538.imagick are seeing smaller IPC increases in the 6-9% range. Similarly, DRAM bandwidth starved workloads such as 549.fotonik3d and 554.roms are also showcasing smaller IPC boosts of 2.7 – 8.6%.

The more hybrid workloads which make good use of the caches are seeing larger performance improvements across the board, up to a 35.6% IPC peak for 519.lbm.

SPEC2017 Rate-1 Estimated Total

In the SPEC2017 suite total performance figures, the new Ryzen 5000 parts also shine thanks to their frequency and IPC uplifts. Generationally, across the int2017 and fp2017 suites, we’re seeing a 32% and 25% performance boost over the 3950X, which are very impressive figures.

IPC-wise, looking at a histogram of all SPEC workloads, we’re seeing a median of 18.86%, which is very near AMD’s proclaimed 19% figure, and an average of 21.38% - although if we discount libquantum that average does go down to 19.12%. AMD’s marketing numbers are thus pretty much validated, as they’ve exactly hit their proclaimed figure with the new Zen3 microarchitecture.
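
The difference between the median, the average, and the average with one outlier discounted can be illustrated with a short Python sketch. The workload names and uplift values below are hypothetical and chosen only to show the effect; they are not the measured per-workload figures, and `summarise` is an illustrative helper.

```python
from statistics import mean, median


def summarise(uplifts, exclude=()):
    """Median and mean of per-workload uplifts (percent), optionally skipping some workloads."""
    values = [v for name, v in uplifts.items() if name not in exclude]
    return median(values), mean(values)


if __name__ == "__main__":
    # Hypothetical per-workload IPC uplifts in percent, with one large outlier.
    sample = {"workload_a": 10.0, "workload_b": 18.0, "workload_c": 20.0, "outlier": 100.0}
    print("all workloads:   ", summarise(sample))
    print("without outlier: ", summarise(sample, exclude={"outlier"}))
```

A single very large result pulls the average up far more than the median, which is why the median is the more robust figure to compare against a vendor's headline number.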

SPEC2017 Rate-1 Estimated PPC

On the competitive landscape, this now makes Zen3 the undisputed leader in the x86 space, leaving Intel’s old Skylake designs far behind and also showing more design complexity than the newer Sunny Cove and Willow Cove cores.

Overall, the new Ryzen 5000 series and the Zen3 microarchitecture seem like absolute winners, and there’s no dispute about them taking the performance crown. AMD has achieved this through both an uplift in frequency, as well as a notable 19% IPC uplift thanks to a smarter design.

What I hope to see from AMD in future designs is a more aggressive push towards a wider core design with even larger IPC jumps. In workloads that are more execution bound, Zen3 isn’t all that big of an uplift. The move from a 16MB to a 32MB L3 cache isn’t something that’ll be repeated any time soon in terms of improvement magnitude, and it’s also very doubtful we’ll see significant frequency uplifts with coming generations. As Moore’s Law is slowing, going wider and smarter seems to be the only way forward for advancing performance.

339 Comments

  • Luminar - Thursday, November 5, 2020 - link

    Cache Rules Everything Around Me
  • SIDtech - Thursday, November 5, 2020 - link

    Hi Andrei,

    Excellent work. Do you know how this performance shapes up against the Cortex A77 ?
  • t.s - Friday, November 6, 2020 - link

    Seconded. Want to know how the likes of ryzen 4 4350G or 5600 versus Cortex A77 or A78.
  • Kangal - Saturday, November 7, 2020 - link

    It's hard to say, because it really depends on the instruction/software as it is very situational. It also depends on the type of device it is powering, you can move up from Phones, to Thin Tablets, to Thick Laptops, to Large Desktops, and up to a Server. Each device offers different thermal constraints.

    The lower-thermal devices will favour the ARM chip, the mid-level will favour AMD, and the higher-thermal devices will favour Intel. That WAS the rule of thumb. In general, you could say Intel's SkyLake has the single-threaded performance crown, then AMD's Zen+ loses to it by a notable margin but beats it in multi-threaded tasks, and then going to an ARM Cortex A76 will have the lowest single-thread but the highest multi-threaded performance.

    Now?
    Well, there's the newly launched 2021 AMD Zen3 processor. And the upcoming 2021 ARM Cortex-X Overclocked Big-core using the new A78 microarchitecture. Lastly there's the 2022 Intel Rocket Lake yet to debut. So it's too early to tell, we can only make inferences.
  • Kangal - Saturday, November 7, 2020 - link

    Here is my personal (yet amateur) take on the future 2020-2022 standpoints between the three racers. Firstly I'll explain what the different keywords and attributes mean
    (from most technical to most real-world implication)

    Total efficiency: (think Full Server / Tractor) how much total calculations versus total power draw
    Multi-threaded: (think Large Desktop / Truck) how much total calculations
    Single-threaded: (think Thick Laptop / Car) how much priority calculations
    IPC performance: (think Thin Tablet / Motorbike) how much priority calculations at desirable frequency/voltage/power-draw

    *Emulating:
    Having a "simple" ARM chip running "complex" x86 instructions. Such as running 32bit or 64bit OS X or Windows programs, via new techniques of emulation using a partial-hardware and hybrid-software solutions. I think the hit to efficiency will be around x3, instead of the expected x12 degradation.

    So here are the lists (from most technical to most real-world implication)
    Simple Code > Mixed code > Recommended Solution

    Here's how they stack up when running identical new code (ie Modern Apps):
    Total efficiency: ARM >>>> AMD >> Intel
    Multi-threaded: ARM > AMD > Intel
    Single-threaded: Intel = AMD > ARM
    IPC performance: ARM >>> AMD > Intel

    Now what about them running legacy code (ie x86 Program):
    Efficiency + *emulating: AMD > Intel >> ARM
    Multi + *emulating: AMD > Intel >> ARM
    1n + *emulating: Intel = AMD >>> ARM
    IPC + *emulating: AMD > Intel > ARM

    My recommendation?
    Full Server: 60% legacy 40% new code. This makes ARM the best option by a small margin.
    Large Desktop: 80% legacy 20% new code. AMD is the best option with modest margin.
    Thick Laptop: 70% legacy 30% new code. Intel is the best. AMD is very close (tied?) second.
    Thin Tablet: 10% legacy 90% new code. ARM is the best option by huge margin.
  • Tomatotech - Monday, November 9, 2020 - link

    Excellent post, but worth pointing out that *all* modern chips now emulate x86 and x64 code. They run a front end that takes x86 / x64 machine code then convert that into RISC code and that goes through various microcode and translation layers before being processed by the backend. That black box structure has allowed swapping out and optimising the back end for decades while maintaining code compatibility on the front end.

    So it’s not as simple to differentiate between the various chips as you make it out to be.
  • Gondalf - Sunday, November 8, 2020 - link

    I don't know. Looking Spec results, we can say Anandtech is absolutely unable to set a Spec session correctly. From the review Zen 2 is slower per Ghz than old Skylake in integer, that is absolutely wrong in consumer cores (in server cores yes), even worse Ice Lake core is around fast as old Skylake per GHz.
    Basically this review is rushed and very likely they have set all AMD compiler flags on "fast" to do more contacts and a lot of hipe.
    My God, for Anandtech Zen 3 is 35% faster in the global Spec values than Zen 2. Not even AMD worst marketing slide say this. We have Zen 4 here not Zen 3. Wait wait please.
    A really crap review, the author need to go back to school about Spec.

    Obviously the article do not say that 28W Tiger Lake is unable to run at 4.8Ghz for more than a couple of seconds, after this it throttes down, so the same Willow Cove core on a desktop Cpu could destroy Zen 3 without mercy on a CB session. Not to mention the far slower memory subsystem of a mobile cpu.

    Basically looking at games results, Rocket Lake will eclipse this core forever. AMD have nothing of new in its hands, they need to wait Zen 4
  • Qasar - Sunday, November 8, 2020 - link

    yea ok gondalf, trying to find ways that your beloved intel doesnt lose at everything now ??
    accept it, amd is faster then intel across the board.
  • Spunjji - Monday, November 9, 2020 - link

    That's a strange claim about Tiger Lake performance, Gondalf, because I seem to recall Intel seeding all the reviewers with a laptop that could run TGL at 4.8Ghz boost 'til the cows come home - and that's what Anandtech used to get that number. It's literally the best they can do right now. You're right of course - in actual shipping ultrabooks, TGL is a hot PoS that cannot maintain its boost clocks. Maybe by 2022 they'll finally put Willow Cove into a shipping desktop CPU.

    "Basically looking at games results, Rocket Lake will eclipse this core forever"
    If by "eclipse" you mean gain a maximum 5% advantage at higher clock speeds and nearly double the power draw then sure, "eclipse", yeah. 🤭

    I love your posts here. Please, never stop stepping on rakes like Sideshow Bob.
  • macroboy - Saturday, December 12, 2020 - link

    LOL look at AMD's Efficiency and sustained core clocks, Intel runs too hot to stay at 5ghz for very long. meanwhile Zen3 plows along at 55C no problem, *you're the one who needs to check your facts.
