Section by Andrei Frumusanu

CPU ST Performance: SPEC 2006, SPEC 2017

SPEC2017 and SPEC2006 are suites of standardized tests used to compare overall performance across different systems, architectures, microarchitectures, and setups. The code has to be compiled, and the results can then be submitted to an online database for comparison. The suites cover a range of integer and floating point workloads, and can be heavily optimized for each CPU, so it is important to check how the benchmarks are being compiled and run.

We run the tests in a harness built through Windows Subsystem for Linux, developed by our own Andrei Frumusanu. WSL has some odd quirks, with one test not running due to WSL’s fixed stack size, but it is good enough for like-for-like testing. SPEC2006 is deprecated in favor of 2017, but remains an interesting comparison point in our data. Because our scores aren’t official submissions, as per SPEC guidelines we have to declare them as internal estimates on our part.

For compilers, we use LLVM for both the C/C++ and Fortran tests; for Fortran we’re using the Flang compiler. The rationale for using LLVM over GCC is better cross-platform comparisons with platforms that only have LLVM support, as well as future articles where we’ll investigate this aspect more. We’re not considering closed-source compilers such as MSVC or ICC.

clang version 10.0.0
clang version 7.0.1 (ssh://git@github.com/flang-compiler/flang-driver.git 24bd54da5c41af04838bbe7b68f830840d47fc03)

-Ofast -fomit-frame-pointer
-march=x86-64
-mtune=core-avx2
-mfma -mavx -mavx2

Our compiler flags are straightforward, with basic -Ofast and relevant ISA switches to allow for AVX2 instructions. We decided to build our SPEC binaries on AVX2, which makes Haswell the oldest platform we can test before the binaries fall over. This also means we don’t have AVX-512 binaries, primarily because getting the best performance requires the AVX-512 intrinsics to be written by a proper expert, as with our AVX-512 benchmark.
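As a rough illustration of how the flags listed above combine into a single compiler invocation, here is a minimal Python sketch. The flags are the ones quoted in this section; the helper name, the source file, and the output name are placeholders, and the sketch only prints the command rather than reproducing our actual harness.

```python
import shlex

# Flags exactly as listed in this section.
SPEC_FLAGS = [
    "-Ofast", "-fomit-frame-pointer",
    "-march=x86-64",
    "-mtune=core-avx2",
    "-mfma", "-mavx", "-mavx2",
]


def build_command(source, output, compiler="clang"):
    """Assemble an argument list for one compile step.

    `source` and `output` are hypothetical file names used for illustration.
    """
    return [compiler, *SPEC_FLAGS, "-o", output, source]


# Placeholder file names; nothing is compiled here, the command is only printed.
cmd = build_command("example.c", "example")
print(shlex.join(cmd))
```

Keeping the flags in one list makes it easy to see that the only ISA extensions enabled beyond baseline x86-64 are FMA, AVX and AVX2.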

To note, the requirements for the SPEC licence state that any benchmark results from SPEC have to be labelled ‘estimated’ until they are verified on the SPEC website as a meaningful representation of the expected performance. This is most often done by the big companies and OEMs to showcase performance to customers, but it is quite over the top for what we do as reviewers.

Starting off with our SPEC2006 analysis for Tiger Lake, given that we’re extremely familiar with the microarchitectural characteristics of these workloads:

SPECint2006 Speed Estimated Scores

As a note, the Tiger Lake figures published in the detailed sub-scores represent the 28W TDP configuration option of the platform, with the core mostly clocking to 4800MHz and all other aspects of the device allowing for maximum speed. This allows for a pure microarchitectural analysis.

The generational improvements of the new Willow Cove design here very much reflect the advertised characteristics of the microarchitecture.

Starting off with high-IPC and backend execution-bound workloads such as 456.hmmer we’re seeing a near linear performance increase with clock frequency. Sunny Cove here had larger IPC improvements but the Ice Lake design was rather limited in its clock frequency, most of the time still losing out to higher-clocked Skylake designs.

This time around, with the major frequency boost, the Tiger Lake chip is able to even outperform the desktop i9-10900K at 5.3GHz as long as memory doesn’t become a bottleneck.

In terms of IPC/performance per clock, things are mostly flat between generations at ±2% depending on the workload, but 473.astar does seem to like the Willow Cove architecture as we’re seeing a +10% boost. 403.gcc’s 4% IPC improvement also likely takes advantage of the larger L2 cache of the design, whilst 429.mcf’s very latency-sensitive nature sees a huge 23% IPC boost thanks to the strong memory controllers of Tiger Lake.
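The per-clock comparisons above boil down to dividing each score by the clock frequency it was achieved at and comparing the results. A minimal Python sketch, assuming each chip holds a roughly constant clock during the run; the function name and the example numbers are illustrative placeholders, not measured results:

```python
def perf_per_clock_change(score_new, clock_new_mhz, score_old, clock_old_mhz):
    """Return the relative change in performance per clock.

    A return value of 0.10 means +10% per clock; -0.05 means a 5% regression.
    """
    if min(score_new, clock_new_mhz, score_old, clock_old_mhz) <= 0:
        raise ValueError("scores and clocks must be positive")
    per_clock_new = score_new / clock_new_mhz
    per_clock_old = score_old / clock_old_mhz
    return per_clock_new / per_clock_old - 1.0


# Placeholder inputs: the score scales exactly with clock, so per-clock
# performance is unchanged.
flat = perf_per_clock_change(48.0, 4800, 39.0, 3900)

# Placeholder inputs: the score grows faster than the clock does.
gain = perf_per_clock_change(52.8, 4800, 39.0, 3900)
```

A result near zero means the score simply scaled with frequency, which is the pattern most workloads here show.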

462.libquantum doesn’t fare well at all as we’re not only seeing a 30% reduction in IPC, but absolute performance is actually outright worse than Ice Lake. This workload is bandwidth hungry. The theory is that if it has a mostly cache-resident workload footprint, then it would generally make sense to see such a perf degradation due to the L3’s overall degraded generational performance. It’s an interesting aspect we’ll also see in 470.lbm.

SPECfp2006(C/C++) Speed Estimated Scores

In the floating-point workloads, we again see the Tiger Lake chip doing extremely well, but there are some outliers. As mentioned, 470.lbm, which is also extremely bandwidth hungry, sees a generational degradation, which again could be L3 related, or something more specific to the memory subsystem.

There’s actually a wider IPC degradation in this set, with 482.sphinx3 being the only positive workload with a +2% boost, while the rest see degradations of -12%, -7%, -14%, -3% and that massive -31% for 470.lbm. Essentially, these are all workloads which have stronger memory pressure characteristics.

SPEC2006 Speed Estimated Total

Overall SPEC2006 score performance for Tiger Lake is extremely good. Here we also present the 15W vs 28W configuration figures for the single-threaded workloads, which do see a jump in performance by going to the higher TDP configuration, meaning the design is thermally constrained at 15W even in ST workloads. By the way, this is a core power consumption limitation, as even small memory footprint workloads see a performance jump.

The i7-1185G7 is on the heels of the desktop i9-10900K, trailing by only a few percentage points.

Against the x86 competition, Tiger Lake leaves AMD’s Zen2-based Renoir in the dust when it comes to single-threaded performance. Comparing it against Apple’s A13, things aren’t looking so rosy, as the Intel CPU barely outmatches it even though it uses several times more power, which doesn’t bode well for Intel once Apple releases its “Apple Silicon” MacBooks.

Even against Arm’s Cortex-A77 things aren’t looking rosy, as the x86 crowd just isn’t all that much ahead considering the Arm design only uses 2W.

SPECint2017 Rate-1 Estimated Scores

Moving on to the newer SPEC2017 suite, we’re seeing quite a similar story in the scaling between the platforms. Tiger Lake and its Willow Cove cores showcase outstanding performance as long as things are execution-bound, but fall a bit behind the desktop system when memory comes into play. There are two sets of results here: workloads which have high bandwidth or latency requirements, and those which have large memory footprint requirements.

523.xalancbmk_r seems to be of the latter kind, as it’s posting quite a nice 10% IPC jump for Willow Cove, while the rest generally fall between -4% regressions and +3-5% improvements.

SPECfp2017 Rate-1 Estimated Scores

In the FP suite, we mostly see again the same kind of characteristics, with performance most of the time scaling in line with the clock frequency of Tiger Lake, with a few outliers here and there in terms of IPC, such as 544.nab_r gaining +9%, or 549.fotonik3d_r regressing by 12%.

Much like in the 2006 suite, the memory bandwidth hungry 519.lbm_r sees a 23% IPC regression, also regressing its absolute performance below that of Ice Lake.

SPEC2017 Rate-1 Estimated Total

Overall, in the 2017 scores, Tiger Lake actually comes in as the leading CPU microarchitecture if you account for both the integer and floating-point scores together.

Although the design’s absolute performance here is exemplary, I feel a bit disappointed that in general the majority of the performance gains seen today were due to the higher clock frequencies of the new design.

IPC improvements of Willow Cove are quite mixed. In some rare workloads which can fully take advantage of the cache increases we’re seeing 9-10% improvements, but these are more of an exception rather than the rule. In other workloads we saw some quite odd performance regressions, especially in tests with high memory pressure where the design saw ~5-12% regressions. As a geometric mean across all the SPEC workloads and normalised for frequency, Tiger Lake showed 97% of the performance per clock of Ice Lake.
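For readers who want to reproduce this kind of summary figure, the aggregation can be sketched as a geometric mean over per-workload performance-per-clock ratios (new chip divided by old chip). This is a minimal Python sketch; the function name and the ratio list are hypothetical placeholders, not our measured sub-scores.

```python
import math


def geometric_mean(ratios):
    """Geometric mean of strictly positive ratios, computed in log space."""
    ratios = list(ratios)
    if not ratios:
        raise ValueError("need at least one ratio")
    if any(r <= 0 for r in ratios):
        raise ValueError("ratios must be strictly positive")
    return math.exp(sum(math.log(r) for r in ratios) / len(ratios))


# Hypothetical per-workload ratios (frequency-normalised, new / old).
# Values above 1.0 are per-clock gains, values below 1.0 are regressions.
example_ratios = [1.10, 0.98, 1.02, 0.88, 0.77]
overall = geometric_mean(example_ratios)
```

The geometric mean is the usual choice for ratios like these because a +10% gain and a matching 1/1.10 regression cancel out exactly, which an arithmetic mean would not do.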

In a competitive landscape where AMD is set to make regular +15% generational IPC improvements and Arm now has an aggressive roadmap with yearly +30% IPC upgrades, Intel’s Willow Cove, although it does deliver great performance, seems to be a rather uninspiring microarchitecture.


252 Comments


  • blppt - Saturday, September 26, 2020 - link

    Sure, the box sitting right next to my desk doesn't exist. Nor the 10 or so AMD cards I've bought over the past 20 years.

    1 5970
    2 7970s (for CFX)
    1 Sapphire 290x (BF4 edition, ridiculously loud under load)
    2 XFX 290 (much better cooler than the BF4 290x; mistakenly bought when I thought it would accept a flash to 290x, got the wrong builds; for CFX)
    2 290x 8gb sapphire custom edition (for CFX, much, much quieter than the 290x)
    1 Vega 64 watercooled (actually turned out to be useful for a Hackintosh build)
    1 5700xt stock edition

    Yeah, i just made this stuff up off the top of my head. I guarantee I've had more experience with AMD videocards than the average gamer. Remember the separate CFX CAP profiles? I sure do.

    So please, tell me again how I'm only a Nvidia owner.
  • Santoval - Sunday, September 20, 2020 - link

    If the top-end Big Navi is going to be 30-40% faster than the 2080 Ti then the 3080 (and later on the 3080 Ti, which will fit between the 3080 and the 3090) will be *way* beyond it in performance, in a continuation of the status quo of the last several graphics card generations. In fact it will be even worse this generation, since Big Navi needs to be 52% faster than the 2080 Ti to even match the 3070 in FP32 performance.

    Sure, it might have double the memory of the 3070, but how much will that matter if it's going to be 15 - 20% slower than a supposed "lower grade" Nvidia card? In other words "30-40% faster than the 2080 Ti" is not enough to compete with Ampere.

    By the way, we have no idea how well Big Navi and the rest of the RDNA2 cards will perform in ray-tracing, but I am not sure how that matters to most people. *If* the top-end Big Navi has 16 GB of RAM, it costs just as much as the 3070 and is slightly (up to 5-10%) slower than it in FP32 performance but handily outperforms it in ray-tracing performance then it might be an attractive buy. But I doubt any margins will be left for AMD if they sell a 16 GB card for $500.

    If it is 15-20% slower and costs $100 more, no one but those who absolutely want 16 GB of graphics RAM will buy it; and if the top-end card only has 12 GB of RAM, there goes the large memory incentive as well.
  • Spunjji - Sunday, September 20, 2020 - link

    @Santoval, why are you speaking as if the 3080's performance characteristics are not already known? We have the benchmarks in now.

    More importantly, why are you making the assumption that AMD need to beat Nvidia's theoretical FP32 performance when it was always obvious (and now extremely clear) that it has very little bearing on the product's actual performance in games?

    The rest of your speculation is knocked out of whack by that. The likelihood of an 80CU RDNA 2 card underperforming the 3070 is nil. The likelihood of it underperforming the 3080 (which performs like twice a 5700, non-XT) is also low.
  • Byte - Monday, September 21, 2020 - link

    Nvidia probably has a good idea how it performs with access to PS5/Xbox, they know they had to be aggressive this round with clock speeds and pricing. As we can see, the 3080 is almost maxed, with o/c headroom like that of AMD chips, and the price is reasonably decent, in line with 1080 launch prices before minepocalypse.
  • TimSyd - Saturday, September 19, 2020 - link

    Ahh don't ya just love the fresh smell of TROLL
  • evernessince - Sunday, September 20, 2020 - link

    The 5700XT is RDNA1 and it's 1/3rd the size of the 2080 Ti. 1/3rd the size and only 30% less performance. Now imagine a GPU twice the size of the 5700XT, thus having twice the performance. Now add in the node shrink and new architecture.

    I wouldn't be surprised if the 6700XT beat the 2080 Ti, let alone AMD's bigger Navi 2 GPUs.
  • Cooe - Friday, December 25, 2020 - link

    Hahahaha. "Only matching a 2080 Ti". How's it feel to be an idiot?
  • tipoo - Friday, September 18, 2020 - link

    I'd again ask you why a laptop SoC would have an answer for a big GPU. That's not what this product is.
  • dotjaz - Friday, September 18, 2020 - link

    "This Intel Tiger" doesn't need an answer for Big Navi, no laptop chip needs one at all. Big Navi is 300W+, no way it's going in a laptop.

    RDNA2+ will trickle down to mobile APU eventually, but we don't know if Van Gogh can beat TGL yet, I'm betting not because it's likely a 7-15W part with weaker Quadcore Zen2.

    Proper RDNA2+ APU won't be out until 2022/Zen4. By then Intel will have the next gen Xe.
  • Santoval - Sunday, September 20, 2020 - link

    Intel's next gen Xe (in Alder Lake) is going to be a minor upgrade to the original Xe. Not a redesign, just an optimization to target higher clocks. The optimization will largely (or only) happen at the node level, since it will be fabbed with second gen SuperFin (formerly 10nm+++), which is supposed to be (assuming no further 7nm delays) Intel's last 10nm node variant.
    How well will that work, and thus how well 2nd gen Xe will perform, will depend on how high Intel's 2nd gen SuperFin will clock. At best 150 - 200 MHz higher clocks can probably be expected.
