Topology, Memory Subsystem & Latency

The topologies of the new Ice Lake-SP parts are quite straightforward when it comes to their generational evolution from Cascade Lake-SP. The most important thing of note here is that this generation’s HCC (high core count) die employed by Intel is of the same core count as last generation’s XCC die – 28 cores. The new ICX XCC die is now at 40 cores, with the Xeon 8380 we’re testing today being of this flavour.

Unfortunately, Intel didn’t specify which SKUs use XCC dies and which use HCC, only disclosing that XCC goes up to 40 cores, and HCC goes up to 28 cores. We also have a Xeon 6330 available for testing at 28 cores, meaning that also would be of the XCC design.

At the heart, Ice Lake SP is still a monolithic mesh design, with a few differences in the composition of the blocks, such as rearranged UPI positioning, extra 16 PCIe lanes which have been upgraded to 4.0 capability, as well as most importantly the move from a 2-memory controller design with 3-channel granularity, to a 4-controller design with 2-channel granularity, which makes for an important distinction later on in the memory performance of the system.

Starting off with our core-to-core test: it consists of two threads, spawned from a main housekeeping thread, atomically altering a value on a shared cache line. In essence we’re measuring hardware core-to-cacheline-to-core latency as well as the hardware coherency round-trip time for the data to become visible from one core to another. Such core-to-core latencies are important in multi-threaded workloads which have lots of shared data operations, such as databases.

At first glance, the latencies within a socket don’t look all that different. However, we have to remind ourselves that we’re comparing boost clocks of up to 4GHz on the Cascade Lake SP based Xeon 8280, while our Ice Lake SP Xeon 8380 only boosts up to 3.4GHz. In that regard, maintaining similar core-to-core latencies whilst increasing the mesh size from 28 cores to 40 cores is actually quite impressive: there is a slight degradation of a few nanoseconds, but generally it’s not overly significant.

What’s a bit odd is the larger latencies between core N and core N+1 at +60ns. What’s even more odd is that this only ever happens when we measure from physical core N to physical core N+1 on the same enumerated logical CPU; if we instead measure from physical core N to the other logical CPU of physical core N+1, we get normal access latencies, like any other combination of physical and logical cores. I’ve got no idea what’s happening here, but the measurements seem to be consistent in their behaviour.

The major improvement on Ice Lake seems to be socket-to-socket latencies, which in our measurements have gone down from ~135ns to ~108ns on these particular SKUs at these particular frequencies. That’s a major generational improvement, and further advances Intel’s leadership position in this metric. In fact, the new Xeon 8380’s socket-to-socket latencies are now essentially the same as what AMD has to incur within a single socket between cores in different IOD quadrants, with cores within a quadrant only being slightly faster. AMD’s socket-to-socket latencies naturally fall far behind at around 190ns, even with the newer Milan based designs, which had notably improved upon this metric compared to Rome.

Memory Latency

As noted, the one big change this generation is that the CPU moves from a dual 3-channel memory controller to a quad 2-channel memory controller setup, increasing total available theoretical peak memory bandwidth by 45% through the 33% increase in memory channels, as well as the DRAM frequency increase from DDR4-2933 to DDR4-3200.
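As a quick sanity check on that 45% figure, the arithmetic can be sketched as follows. This assumes a 64-bit (8-byte) data bus per DDR4 channel, which is not stated in the text above, and the helper name is purely illustrative; the results are theoretical peaks, not measured values.

```python
# Theoretical peak DRAM bandwidth per socket: channels x transfer rate x bus width.
# Assumption: each DDR4 channel carries 64 data bits (8 bytes) per transfer, ECC excluded.

BYTES_PER_TRANSFER = 8  # assumed 64-bit DDR4 data bus per channel


def peak_bandwidth_gbs(channels: int, mega_transfers_per_s: int) -> float:
    """Return theoretical peak bandwidth in GB/s (decimal gigabytes)."""
    return channels * mega_transfers_per_s * 1e6 * BYTES_PER_TRANSFER / 1e9


# 2 controllers x 3 channels at DDR4-2933 versus 4 controllers x 2 channels at DDR4-3200
cascade_lake = peak_bandwidth_gbs(channels=6, mega_transfers_per_s=2933)
ice_lake = peak_bandwidth_gbs(channels=8, mega_transfers_per_s=3200)

channel_gain = 8 / 6 - 1
frequency_gain = 3200 / 2933 - 1
total_gain = ice_lake / cascade_lake - 1

print(f"Cascade Lake-SP: {cascade_lake:.1f} GB/s")
print(f"Ice Lake-SP:     {ice_lake:.1f} GB/s")
print(f"channels +{channel_gain:.0%}, frequency +{frequency_gain:.0%}, total +{total_gain:.0%}")
```

The two gains multiply rather than add, which is why 33% more channels and roughly 9% more frequency land at about 45% rather than 42%.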

Counteracting the new controller redesign, however, is the fact that this is simply a bigger chip, so data has to travel through more mesh nodes to reach its destination, including the memory controllers.

Looking at the test results with the chip in a monolithic NUMA configuration, we’re seeing the new chip slightly regress in its latencies, which actually was to be expected given the clock frequency differences between the generation SKUs. It’s not a major disadvantage, however something to keep in mind for later tests. With Intel slightly regressing, and AMD having greatly improved memory latencies in Milan, the gap between the two competitors is smaller than in previous generations where Intel had a more formidable lead.

Intel’s presentations disclosed similar figures, although they’re using a different measuring methodology with MLC and simple patterns with prefetchers disabled, whereas we simply measure full random latency including TLB misses.

We weren’t able to verify the claim, but Intel also advertises advantages in remote socket DRAM latencies. The difference here matches what we’re seeing with the core-to-core latency tests, but it’s a bit of an oddball metric as I have trouble thinking of workloads where this would matter much, unless you’re running a single NUMA node across two sockets, which should be rare.

 

Intel had sub-NUMA clustering in prior generations, however the bigger a chip is and the larger the core count, the more these setup configurations are expected to have a difference in performance and latencies. Running the chip in SNC2 mode, meaning splitting the chip into two NUMA domains, splits the mesh into two logical parts, halving the L3 accessible to a single core, and each half only having access to their local memory controllers.

DRAM latencies here are reduced by 1.7ns, which isn’t a very significant difference, and the L3 latencies go down by 1.2ns, which is around a 5.9% reduction.

Looking at access latencies from a core cycle view, the new Ice Lake SP system is actually quite impressive. The L1 does regress from 4 to 5 cycles with the increase from 32KB to 48KB, however the L2 remains at the same 14 cycles even though it has grown from 1MB to 1.25MB.

What Intel didn’t mention in their presentation as much is that although the absolute latencies in the L3 mesh have slightly regressed from the 8280 to the new 8380, on a core clock cycle basis it’s actually faster: we’re measuring a reduction from an average of 70.5 cycles down to 63.5 cycles, which is a very impressive feat given that the mesh now contains 43% more cores and grows from 38.5MB at 1.375MB/slice to 60MB at 1.5MB/slice.
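To put those cycle counts into absolute terms, here is a rough conversion. It assumes both parts ran at the peak boost clocks mentioned earlier (4GHz for the 8280 and 3.4GHz for the 8380), which may not match the clocks actually sustained during the test, and it assumes one L3 slice per core. The function name and dictionary layout are illustrative only.

```python
# Convert L3 latency from core cycles to nanoseconds and rebuild the L3 totals.
# Assumption: cores run at their peak boost clocks (4.0 GHz and 3.4 GHz);
# the clocks actually sustained during the latency test may have differed.
# Assumption: one L3 slice per core.


def cycles_to_ns(cycles: float, clock_ghz: float) -> float:
    """One cycle lasts 1 / clock_ghz nanoseconds."""
    return cycles / clock_ghz


parts = {
    "Xeon 8280": {"l3_cycles": 70.5, "clock_ghz": 4.0, "cores": 28, "mb_per_slice": 1.375},
    "Xeon 8380": {"l3_cycles": 63.5, "clock_ghz": 3.4, "cores": 40, "mb_per_slice": 1.5},
}

for name, p in parts.items():
    latency_ns = cycles_to_ns(p["l3_cycles"], p["clock_ghz"])
    l3_total_mb = p["cores"] * p["mb_per_slice"]
    print(f"{name}: {p['l3_cycles']} cycles = {latency_ns:.1f} ns, L3 = {l3_total_mb} MB")

core_increase = parts["Xeon 8380"]["cores"] / parts["Xeon 8280"]["cores"] - 1
print(f"core count increase: {core_increase:.1%}")
```

Under those assumed clocks, fewer cycles on the newer part still translate into slightly more nanoseconds, which matches the pattern of a small absolute regression alongside a per-cycle improvement.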

Memory Bandwidth

In terms of memory bandwidth, we’re falling back to the standard STREAM benchmark, in particular the Triad test, which is a simple streaming memory compute test. As I had talked about more extensively in our Ampere Altra review, I find it much more interesting to test how bandwidth scales with increasing thread count, as it can reveal many more nuances of system behaviour than a single figure at the arbitrary maximum thread count.

We’re also using a vanilla compilation of STREAM with GCC without any explicit optimisations that alter memory operations types, as this way it’s a more realistic representation of how most generic workloads will behave on a system.

STREAM Triad is a simple test consisting of a a[j] = b[j]+scalar*c[j]; compute kernel iterating over three memory arrays. The test assumes 3 memory operations: two memory reads and one memory write. From a hardware perspective, this can actually be 4 memory operations, as many cores first have to read out the contents of the target cache line before writing to it.
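The difference between the 3 counted operations and the 4 that the hardware may perform can be sketched as below. This is a plain Python illustration of the accounting, not a bandwidth measurement; it assumes 8-byte double-precision elements, and the function names are hypothetical.

```python
# STREAM Triad kernel plus a count of the memory traffic it generates.
# Assumption: 8-byte (double precision) array elements.

from array import array

BYTES_PER_ELEMENT = 8


def triad(a, b, c, scalar):
    """Apply a[j] = b[j] + scalar * c[j] to every element."""
    for j in range(len(a)):
        a[j] = b[j] + scalar * c[j]


def traffic_bytes(n, write_allocate):
    """Bytes moved for n elements.

    The benchmark counts two reads (b, c) and one write (a). With a
    write-allocate cache, the target line of a is also read before it
    is written, which adds a third read.
    """
    reads = 3 if write_allocate else 2
    writes = 1
    return (reads + writes) * n * BYTES_PER_ELEMENT


n = 1_000
a = array("d", [0.0] * n)
b = array("d", [1.0] * n)
c = array("d", [2.0] * n)
triad(a, b, c, scalar=3.0)

counted = traffic_bytes(n, write_allocate=False)  # what the benchmark reports
actual = traffic_bytes(n, write_allocate=True)    # what the memory system may see
print(f"a[0] = {a[0]}, counted = {counted} B, actual = {actual} B, "
      f"ratio = {actual / counted:.2f}")
```

In other words, a core that performs the extra read moves a third more data than the benchmark gives it credit for, so its reported figure understates the traffic actually generated.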

Generally, what we were expecting with Ice Lake SP were figures that were +45% ahead of Cascade Lake SP, thanks to the improved memory controllers and more memory channels. Instead, what we’re seeing here are improvements reaching up to +86%, well beyond the figures we were expecting.

As per the data, the new ICX design appears to be vastly outperforming its predecessor, and also essentially leaving AMD trailing far behind in terms of raw memory performance, only falling behind Ampere’s Altra which is able to dynamically detect streaming memory workloads and transform memory operations into non-temporal ones.

What’s also to be noted is that the per-core bandwidth this generation doesn’t seem to have improved very much from Cascade Lake SP, with AMD’s newest Milan still vastly outperforming Ice Lake SP at lower thread counts, and single-core bandwidth being much higher on the competitor systems.

Inspecting Intel’s prior disclosures about Ice Lake SP in last year’s HotChips presentations, one point sticks out, and that is the “SpecI2M optimisation”, where the system is able to convert traditional RFO (Read for ownership) memory operations into another mechanism. We don’t know exactly what SpecI2M does, but Intel does disclose that it’s meant to optimise bandwidth and data traffic under streaming workloads. It doesn’t seem that this is a full kind of memory operation transformation into non-temporal writes as on the Arm systems we’ve seen lately, but it does significantly boost the bandwidth well beyond what we’ve seen of other x86 systems.
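As a back-of-envelope check on how far such a mechanism could push the reported Triad figure, consider the sketch below. The assumptions are mine and not Intel's: the baseline pays 4 memory operations per element while STREAM counts 3, an ideal mechanism removes the extra read entirely, and achieved bandwidth scales linearly with theoretical peak. Real systems will deviate from all three.

```python
# Rough upper bound on the reported Triad uplift if the read-for-ownership
# traffic were avoided completely.
# Assumptions (not from Intel): baseline performs 4 memory operations per
# element while the benchmark counts 3; an ideal mechanism removes the
# extra read; achieved bandwidth scales with theoretical peak.

peak_gain = (8 * 3200) / (6 * 2933)  # theoretical peak ratio, channels x transfer rate
rfo_factor = 4 / 3                   # counted-bandwidth gain from skipping the extra read

upper_bound = peak_gain * rfo_factor
measured = 1.86                      # best-case uplift quoted in the text (+86%)

print(f"peak-only uplift:      +{peak_gain - 1:.0%}")
print(f"with no extra reads:   +{upper_bound - 1:.0%}")
print(f"measured (best case):  +{measured - 1:.0%}")
```

Under these assumptions the measured uplift sits between the peak-only figure and the no-extra-reads bound, which would be consistent with a mechanism that removes much, but not all, of the RFO traffic.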

It’s a bit unfortunate that system vendors have ended up publishing STREAM results with hyper-optimised binaries that are compiled with non-temporal instructions from the get-go, as, for example, we would not have seen this new mechanism on Ice Lake SP with those kinds of binaries. Intel themselves are only disclosing a +47% increase in STREAM Triad performance. I consider the real-world improvement to be significantly higher than that figure, as this new dynamic mechanism doesn’t depend on specifically tuned software.

Overall, Intel’s larger mesh, new memory controllers, and architectural improvements in regards to memory bandwidth are absolutely impressive, and well beyond what I had expected of this generation. The latter STREAM results were really great to see, as I view this as a true design innovation that will benefit a lot of workloads.

169 Comments

  • Oxford Guy - Sunday, April 11, 2021 - link

    'The faulty logic I see is that you seem to believe it's the review's job to...'

    'I think it could be appropriate to do that sort of thing, in articles that...'

    Don't contradict yourself or anything.

    If you're not interested in knowing how fast a CPU is that's ... well... I don't know.

    Telling people to go for marketing info (which is inherently deceptive — the entire fundamental reason for marketing departments to exist) is obviously silly.
  • mode_13h - Monday, April 12, 2021 - link

    > Don't contradict yourself or anything.

    I think the point of confusion is that I'm drawing a distinction between the initial product review and subsequent follow-up articles they often publish to examine specific points of interest. This would also allow for more time to do a more thorough investigation, since the initial reviews tend to be conducted under strict deadlines.

    > If you're not interested in knowing how fast a CPU is that's ... well... I don't know.

    There's often a distinction between the performance, as users are most likely to experience it, and the full capabilities of the product. I actually want to know both, but I think the former should be the (initial) priority.
  • ballsystemlord - Thursday, April 8, 2021 - link

    Spelling and grammar errors (there are a lot!):

    "At the same time, we have also spent time a dual Xeon Gold 6330 system from Supermicro, which has two 28-core processors,..."
    Nonsensical English: "time a duel". I haven't the faintest what you were trying to say.

    "DRAM latencies here are reduced by 1.7ns, which isn't very much a significant difference,..."
    Either use "very much", or use "a significant":
    "DRAM latencies here are reduced by 1.7ns, which isn't a very significant difference,..."

    "Inspecting Intel's prior disclosures about Ice Lake SP in last year's HotChips presentations, one point sticks out, and that's is the "SpecI2M optimisation" where the system is able to convert traditional RFO (Read for ownership) memory operations into another mechanism"
    Excess "is":
    "Inspecting Intel's prior disclosures about Ice Lake SP in last year's HotChips presentations, one point sticks out, and that's the "SpecI2M optimisation" where the system is able to convert traditional RFO (Read for ownership) memory operations into another mechanism"

    "It's a bit unfortunate that system vendors have ended up publishing STREAM results with hyper optimised binaries that are compiled with non-temporal instructions from the get-go, as for example we would not have seen this new mechanism on Ice Lake SP with them"
    You need to rewrite the sentence or add more commas to break it up:
    "It's a bit unfortunate that system vendors have ended up publishing STREAM results with hyper optimised binaries that are compiled with non-temporal instructions from the get-go, as, for example, we would not have seen this new mechanism on Ice Lake SP with them"

    "The latter STREAM results were really great to see as I view is a true design innovation that will benefit a lot of workloads."
    Exchange "is" for "this as":
    "The latter STREAM results were really great to see as I view this as a true design innovation that will benefit a lot of workloads."
    Or discard "view" and rewrite as a definitive statement instead of as an opinion:
    "The latter STREAM results were really great to see as this is a true design innovation that will benefit a lot of workloads."

    "Intel's new Ice Lake SP system, similarly to the predecessor Cascade Lake SP system, appear to be very efficient at full system idle,..."
    Missing "s":
    "Intel's new Ice Lake SP system, similarly to the predecessor Cascade Lake SP system, appears to be very efficient at full system idle,..."

    "...the new Ice Lake part to most of the time beat the Cascade Lake part,..."
    "to" doesn't belong. Rewrite:
    "...the new Ice Lake part can beat the Cascade Lake part most of the time,..."

    "...both showcasing figures that are still 25 and 15% ahead of the Xeon 8380."
    Missing "%":
    "...both showcasing figures that are still 25% and 15% ahead of the Xeon 8380."

    "Intel had been pushing very hard the software optimisation side of things,..."
    Poor sentence structure:
    "Intel had been pushing the software optimisation side very hard,..."

    "...which unfortunately didn't have enough time to cover for this piece."
    Missing "we":
    "...which unfortunately we didn't have enough time to cover for this piece."

    "While we are exalted to finally see Ice lake SP reach the market,..."
    "excited" not "exalted":
    "While we are excited to finally see Ice lake SP reach the market,..."

    Thanks for the article!
  • Oxford Guy - Sunday, April 11, 2021 - link

    Perhaps Purch would be willing to take you on as a volunteer unpaid intern for proofreading for spelling and grammar?

    I would think there are people out there who would do it for resume building. So... if it bothers you perhaps you should make an inquiry.
  • evilpaul666 - Saturday, April 10, 2021 - link

    Are the W-1300s going to use 10nm this year?
  • mode_13h - Saturday, April 10, 2021 - link

    You mean the bottom-tier Xeons? Those are just mainstream desktop chips with less features disabled, so that question depends on when Alder Lake hits.

    I'd say "no", because the Xeon versions typically lag the corresponding mainstream chips by a few months. So, if Alder Lake launches in November, then maybe we get the Xeons in February-March of next year.

    The more immediate question is whether they'll release a Xeon version of Rocket Lake. I think that's likely, since they skipped Comet Lake and there are significant platform enhancements for Rocket Lake.
  • AdrianBc - Monday, April 12, 2021 - link

    No, the W-1300 Xeons will be Rocket Lake. The top model will be Xeon W-1390P, which will be equivalent to the top i9 Rocket Lake, with 125 W TDP and 5.3 GHz maximum turbo.
  • rahvin - Tuesday, April 20, 2021 - link

    Andre does some of the best server reviews available, IMO.
  • KKK11 - Tuesday, May 11, 2021 - link

    That is a curious-looking wafer. I thought it was fake at first but then I noticed the alignment notch. Actually, I'm still not convinced it's real because I have seen lots and lots of wafers in various stages of production and I have never seen one where partial chips go all the way out to the edges. It's a waste of time to deal with those in the steppers so no one does that.
