Intel Skylake-X Conclusion

For Skylake-X, and by extension the Skylake-SP core we will see in upcoming Skylake-SP Xeons, Intel decided to make a few changes this time around. The biggest microarchitectural change comes in three parts: the addition of an AVX-512 unit, the adjustment of the L2/L3 cache structure, and the transition to a mesh-based topology. For the consumer and the prosumer, the biggest wins are two-fold: Intel's 10-core processors are now set at around $999, undercutting the previous generation by a sizable amount, and the new X299 chipset motherboards act like big PCIe switches, offering a substantial amount of attached functionality through additional PCIe controllers.

Microarchitecture

For AVX-512, part of the mantra is that it should be easier for compilers to vectorize more elements of regular code bases and achieve acceleration, but for the most part it is still an enterprise feature with a focus on cryptography, compute, and the financial services industry. In silicon the unit itself is sizable; we are told an Atom core could almost fit inside it. This is a big change for Intel, as it noticeably increases the size of the full Skylake-SP core and the full die, which has a knock-on effect on cost. That being said, this core is targeted at the enterprise market, which should find plenty of uses for it. It is also worth noting that not all CPUs are equal: the 6- and 8-core parts only have one 512-bit FMA to play with, whereas the 10-core and above have two, as part of Intel's feature segmentation strategy.

The L2/L3 cache adjustments are just as nuanced. Moving from a 256KB/core L2 cache to a 1MB/core L2 cache, at a slightly higher latency, should help with data streams being fed into the core, especially in heavy compute workloads, keeping those AVX-512 units fed. The victim, in this case, is the L3 cache, demoted to a 1.375MB/core non-inclusive victim cache, which will have limited usefulness in a number of workloads, most notably compile tests. The overall cache adjustments just about balance each other out, on average favoring the new core by ~1% in our IPC tests, although edge cases such as compilation, Handbrake (non-AVX-512), and Corona can swing as much as -17%, -8%, and +17% respectively.

The new mesh topology for the Skylake-SP core was perhaps more of a requirement for consistency than an option: the older ring bus starts to outgrow its usefulness as more cores are added. Intel has already had success with mesh architectures in its Xeon Phi chips, so this isn't entirely new territory; essentially, the mesh turns the chip into a big 2D array of nodes for driving data around the die. As with the ring bus, core-to-core latency will vary with the locality of the cores, and those nearest the DRAM controllers will benefit most on memory accesses. As Intel grows its core counts, it will be interesting to see how the mesh scales.

Parts and Performance

The three Skylake-X CPUs launched today are the Core i9-7900X, the Core i7-7820X, and the Core i7-7800X: 10-, 8-, and 6-core parts respectively, using the updated Skylake-SP core, the new cache hierarchy, and the new mesh. With some tests benefiting from the new features and others taking a back seat, we saw a wide range of results. The most telling comparison is pitting this generation's 10-core against last generation's 10-core: the Core i9-7900X has a frequency advantage, an IPC advantage, and a significant price advantage, which should make for an easy steamrolling.

[Charts: CineBench R15 Multi-Threaded (Rendering), Blender 2.78 (Rendering), WinRAR 5.40 (Encoding), Total Package Power]

In the end, this is what we get: aside from some tests that are L3-sensitive, such as DigiCortex, WinRAR, and some of the PCMark8 tests, the Core i9-7900X wins every CPU test. For anyone who was on the fence about last generation's 10-core for compute, this new one seems to be the one to get.

The gaming story is unfortunately not quite as rosy. We had last-minute BIOS updates for a number of our boards because some of the gaming tests were severely underperforming on the new Skylake-X parts. We are told that these early BIOSes have power management issues relating to turbo behavior, as well as to Intel's Speed Shift technology when the GPU is active.

While these newer BIOSes have improved things, some performance issues remain to be resolved. Our GTX 1080 seems to be hit the hardest of our four GPUs, as are Civilization 6, the second Rise of the Tomb Raider test, and Rocket League on all GPUs. As a result, we only posted a small selection of results, most of which show good parity at 4K. The good news is that most of the issues appear at 1080p, where the CPU is more of a factor. The bad news is that when the CPU is pushed into a corner, the current BIOS situation is handicapping Skylake-SP in gaming.

I'm going to hold off on making a final recommendation for gaming for the moment, as right now there are clear platform problems. I have no doubt Intel and the motherboard vendors can fix them – this isn't the first time that we've seen a new platform struggle at launch (nor will it be the last). But with pre-orders opening up today, if you're a gamer you should probably wait for the platform to mature a bit more and for the remaining gaming issues to be fixed before ordering anything.

Itching for 18 Cores?

While today's launch covers Skylake-X CPUs up to 10 cores, a lot of talk will be around the 18-core Core i9-7980XE due later this year at $1999. Double the price of the 10-core is unlikely to equal double the performance, as we would expect lower frequencies to compensate, but users who need 18 cores' worth of AVX-512 will be rubbing their hands with glee. It will also be an interesting chip to overclock, and I suspect certain companies are already planning ahead to break some world records with it. We'll try to get a sample in.

Should I wait for the 12-core? For ThreadRipper? Or Just Go Ryzen?

Both the 12-core Core i9-7920X and AMD's ThreadRipper parts are set to launch this summer, with the Intel part confirmed for the August timeframe. By then the X299 ecosystem should be settling down, while AMD will have to navigate a new X399 ecosystem, about which I'm getting mixed messages (some motherboard vendors say they are almost ready, others say they're not even close). Both of these CPUs will be trading frequency for more cores, and cost is a big factor: we don't know how much ThreadRipper or the X399 motherboards will retail for.

Ultimately a user can decide the following:

  • To play it safe, invest in the Core i9-7900X today.
  • To play it safe and get a big GPU, save $400 and invest in the Core i7-7820X today.
  • To play it cheaper but stay competitive, invest in Ryzen 7 today.
  • To invest in PCIe connectivity, wait for ThreadRipper. 60 PCIe lanes are hard to ignore.
  • To invest in AVX-512, wait for the next Intel CPUs.

So What’s the Takeaway Here?

From an engineering perspective, Intel is doing new things. The cache, the mesh, and AVX-512 are interesting changes after several years of iterative enhancements on the prosumer side, but it will take time to see how relevant they become. For some enterprise applications, they will make perfect sense.

From a consumer/prosumer perspective, Intel breaks the mold by offering some CPUs now and some later. The hardware itself won't feel too different until software slowly takes advantage of the new features, but Intel's 10-core, at $999, suddenly got easier to recommend for users in that price bracket. At $599, meanwhile, the 8-core saves several hundred dollars for other upgrades if you don't need AVX-512 or 44 PCIe lanes.


264 Comments


  • mat9v - Tuesday, June 20, 2017 - link

    To play it safe, invest in the Core i9-7900X today.
    To play it safe and get a big GPU, save $400 and invest in the Core i7-7820X today.

    Then the conclusion should have been: wait for a fixed platform. I'm not even suggesting choosing Ryzen, as it performs slower, but why encourage buying a (for now) flawed platform?
  • mat9v - Tuesday, June 20, 2017 - link

    Please then correct the tables on the 1st page comparing Ryzen to the 7820X and 7800X to state that Intel has 24 lanes, as they leave 24 for PCIe slots and 4 are reserved for DMI 3.0.
    If you strip Ryzen's lanes to only show those available for PCIe slots, do so for Intel too.
  • Ryan Smith - Wednesday, June 21, 2017 - link

    The tables are correct. The i7 7800 series have 28 PCIe lanes from the CPU for general use, and another 4 DMI lanes for the chipset.
  • PeterCordes - Tuesday, June 20, 2017 - link

    Nice article, thanks for the details on the microarchitectural changes, especially to execution units and cache. This explains memory bandwidth vs. working-set size results I observed a couple months ago on Google Compute Engine's Skylake-Xeon VMs with ~55MB of L3: The L2-L3 transition was well beyond 256kB. I had assumed Intel wouldn't use a different L3 cache design for SKX vs. SKL, but large L2 doesn't make much sense with an inclusive L3 of 2 or 2.5MB per core.

    Anyway, some corrections for page 3: The allocation queue (IDQ) in Skylake-S is always 64 uops, with or without HT. For example, I looked at the `lsd.uops` performance counter in a loop with 97 uops on my i7-6700k. For 97 billion counts of uops_issued.any, I got exactly 0 counts of lsd.uops, with the system otherwise idle. (And I looked at cpu_clk_unhalted.one_thread_active to make sure it was really operating in non-HT mode the majority of the time it was executing.) Also, IIRC, Intel's optimization manual explicitly states that the IDQ is always 64 entries in Skylake.

    The scheduler (aka RS or Reservation Station) is 97 unfused-domain uops in Skylake, up from 60 in Haswell. The 180int / 168fp numbers you give are the int / fp register-file sizes. They are sized more like the ROB (224 fused-domain uops, up from 192 in Haswell), not the scheduler, since like the ROB, they have to hold onto values until retirement, not just until execution. See also http://blog.stuffedcow.net/2013/05/measuring-rob-c... for when the PRF size vs. the ROB is the limit on the out-of-order window. See also http://www.realworldtech.com/haswell-cpu/6/ for a nice block diagram of the whole pipeline.

    SKL-S DIVPS *latency* is 11 cycles, not 3. The *throughput* is one per 3 cycles for 128-bit vectors, or one per 5 cycles for 256b vectors, according to Agner Fog's table. I forget if I've tested that myself. So are you saying that SKL-SP has one per 5 cycle throughput for 128-bit vectors? What's the throughput for 256b and 512b vectors?

    -----

    It's really confusing the way you keep saying "AVX unit" or "AVX-512 unit" when I think you mean "512b FMA unit". It sounds like vector-integer, shuffle, and pretty much everything other than FMA will have true 512b execution units. If that's correct, then video codecs like x264/x265 should run the same on LCC vs. HCC silicon (other than differences in mesh interconnect latency), because they're integer-only, not using any vector-FP multiply/add/FMA.

    -------

    > This should allow programmers to separate control flow from data flow...

    SIMD conditional operations without AVX-512 are already done branchlessly (I think that's what you mean by separate from control flow) by masking the input and/or output, e.g. to conditionally add some elements of a vector, AND the input with a vector of all-ones or all-zero elements (as produced by CMPPS or PCMPEQD, for example). Adding all-zeros is a no-op (the additive identity).

    Mask registers and support for doing it as part of another operation makes it much more efficient, potentially making it a win to vectorize things that otherwise wouldn't be. But it's not a new capability; you can do the same thing with boolean vectors and SSE/AVX VPBLENDVPS.
  • PeterCordes - Tuesday, June 20, 2017 - link

    Speed Shift / Hardware P-State is not Windows-specific, but this article kind of reads as if it is.

    Your article doesn't mention any other OSes, so nothing it says is actually wrong: I'm sure it did require Intel's collaboration with MS to get support into Win10. The bullet-point in the image that says "Collaboration between Intel and Microsoft specifically for W10 + Skylake" may be going too far, though. That definitely implies that it only works on Win10, which is incorrect.

    Linux has supported it for a while. "HWP enabled" in your kernel log means the kernel has handed off P-state selection to the hardware. (Since Linux is open-source, Intel contributed most of the code for this through the regular channels, like they do for lots of other drivers.)

    dmesg | grep intel_pstate
    [ 1.040265] intel_pstate: Intel P-state driver initializing
    [ 1.040924] intel_pstate: HWP enabled

    The hardware exposes a knob that controls the tradeoff between power and performance, called Energy Performance Preference or EPP. Len Brown@Intel's Linux patch notes give a pretty good description of it (and how it's different from a similar knob for controlling turbo usage in previous uarches), as well as describing how to use it from Linux. https://patchwork.kernel.org/patch/9723427/.

    # CPU features related to HWP, on an i7-6700k running Linux 4.11 on bare metal
    fgrep -m1 flags /proc/cpuinfo | grep -o 'hwp[_a-z]*'
    hwp
    hwp_notify
    hwp_act_window
    hwp_epp

    I find the simplest way to see what speed your cores are running is to just `grep MHz /proc/cpuinfo`. (It does accurately reflect the current situation; Linux finds out what the hardware is actually doing).

    IDK about OS X support, but I assume Apple has got it sorted out by now, almost 2 years after SKL launch.
  • Arbie - Wednesday, June 21, 2017 - link

    There are folks for whom every last compute cycle really matters to their job. They have to buy the technical best. If that's Intel, so be it.

    For those dealing more with 'want' than 'need', a lot of this debate misses an important fact. The only reason Intel is suddenly vomiting cores, defecating feature sizes, and pre-announcing more lakes than Wisconsin is... AMD. Despite its chronic financial weakness that company has, incredibly, come from waaaay behind and given us real competition again. In this ultra-high stakes investment game, can they do that twice? Maybe not. And Intel has shown us what to expect if they have no competitor. In this limited-supplier market it's not just about who has the hottest product - it's also about whom we should reward with our money, and about keeping vital players in the game.

    I suggest - if you can, buy AMD. They have earned our support and it's in our best interests to do so. I've always gone with Intel but have lately come to see this bigger picture. It motivated me to buy an 1800X and I will also buy Vega.
  • Rabnor - Wednesday, June 21, 2017 - link

    To play it safe and get a big GPU, save $400 and invest in the Core i7-7820X today.
    You have to spend that $400+ on a good motherboard & AIO cooler.
    Are you sold by Intel, anandtech?
  • Synviks - Thursday, June 22, 2017 - link

    For some extra comparison: running Cinebench R15 on my 14c 2.7GHz Haswell Xeon, with turbo to 3GHz on all cores, my score is 2010.

    Pretty impressive performance gain if they can shave off 4 cores and end up with higher performance.
  • Pri - Thursday, June 22, 2017 - link

    On the first page you wrote this:
    Similarly, the 6-core Core i7-7820X at $599 goes up against the 8-core $499 Ryzen 7 1800X.

    The Core i7 7820X was mistakenly written as a 6-core processor when it is in-fact an 8-core processor.

    Kind Regards.
  • Gigabytes - Thursday, June 22, 2017 - link

    Okay, here is what I learned from this article. Gaming performance sucks and you will be able to cook a pizza inside your case. Did I miss anything?

    Oh, one thing missing.

    Play it SMART and wait to see the Ripper in action before buying your new Intel toaster oven.
