Memory Subsystem: Latency

AMD chose to share a single core design among mobile, desktop and server for scalability and economic reasons. The Core Complex (CCX) is carried over to Rome from the previous generation.

What has changed is that each CCX now communicates with a central IO hub, instead of four dies communicating in a 4-node NUMA layout. (That layout is still available via the NPS4 setting, which keeps each CCD local to its quadrant of the sIOD, along with that quadrant's memory controllers, avoiding the slight latency penalty incurred by hops between sIOD quadrants.) Since the performance of modern CPUs depends heavily on the cache subsystem, we were more than curious what kind of latency a server thread would see as it accesses more and more pages in the cache hierarchy.
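On Linux, the NPS setting shows up directly in the exposed NUMA topology. As a usage sketch (the `./app` workload name is hypothetical), the standard numactl tool can both reveal which mode the BIOS is in and keep a latency-sensitive process within one sIOD quadrant:

```shell
# In NPS1 a single Rome socket exposes one NUMA node; in NPS4 it exposes four,
# one per sIOD quadrant, each with its local memory controllers.
numactl --hardware

# Pin a latency-sensitive process to node 0's cores and memory so its DRAM
# accesses never hop between sIOD quadrants.
numactl --cpunodebind=0 --membind=0 ./app
```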

We're using our own in-house latency test. In particular, what we're interested in publishing is the estimated structural latency of the processors, meaning we try to account for TLB misses and exclude them from these numbers. The exception is the DRAM latencies, where measurement gets more complex across platforms and we revert to full random figures.

Mem Hierarchy            | AMD EPYC 7742, DDR4-3200 (ns @ 3.4GHz)                 | AMD EPYC 7601, DDR4-2400 (ns @ 3.2GHz) | Intel Xeon 8280, DDR4-2666 (ns @ 2.7GHz)
L1 Cache                 | 32KB: 4 cycles, 1.18ns                                 | 32KB: 4 cycles, 1.25ns                 | 32KB: 4 cycles, 1.48ns
L2 Cache                 | 512KB: 13 cycles, 3.86ns                               | 512KB: 12 cycles, 3.76ns               | 1024KB: 14 cycles, 5.18ns
L3 Cache                 | 16MB/CCX (4C), 256MB total: ~34 cycles (avg), ~10.27ns | 16MB/CCX (4C), 64MB total              | 38.5MB shared (28C): ~46 cycles (avg), ~17.5ns
DRAM (128MB full random) | ~122ns (NPS1), ~113ns (NPS4)                           | ~116ns                                 | ~89ns
DRAM (512MB full random) | ~134ns (NPS1), ~125ns (NPS4)                           |                                        | ~109ns

Update 2019/10/1: We've discovered inaccuracies with our originally published latency numbers, and have subsequently updated the article with more representative figures with a new testing tool.

Things get really interesting when we start looking at cache depths beyond the L2. For Intel this happens at 1MB, while for AMD it is after 512KB; however, AMD's smaller L2 has a speed advantage over Intel's larger cache.

Where AMD has an even clearer speed advantage is in the L3 caches, which are significantly faster than those of Intel's chips. The big difference is that AMD's L3 is local to a CCX of 4 cores; for the EPYC 7742 it is now doubled to 16MB, up from 8MB on the 7601.

Currently this is a double-edged sword for the AMD platforms. On one hand, the EPYC processors have significantly more total cache, coming in at a whopping 256MB for the 7742, quadruple the 64MB of the 7601, and a lot more than Intel's platforms, which come in at 38.5MB for the Xeon 8180, 8176, and 8280, and a larger 55MB for the Xeon E5-2699 v4.

The disadvantage for AMD is that while it has more cache, the EPYC 7742 consists of 16 CCXs, each with its own very fast 16MB L3. Although the 64 cores now form one big NUMA node, the chip is basically 16x 4 cores, each with a 16MB L3 cache. Once you get beyond that 16MB, the prefetchers can soften the blow, but you will be accessing main DRAM.

A little bit weird is the fact that accessing data that resides on the same die (CCD) but not within the same CCX is just as slow as accessing data on a totally different die. This is because regardless of where the other CCX is, whether nearby on the same die or on the other side of the chip, the data access still has to go over the Infinity Fabric to the IO die and back again.

Is that necessarily a bad thing? Most of the time it is not. First of all, in most applications only a low percentage of accesses must be answered by the L3 cache. Secondly, each core on the CCX has no less than 4MB of L3 available, far more than the Intel cores have at their disposal (1.375MB). The prefetchers thus have a lot more space to make sure that the data is there before it is needed.

But database performance might still suffer somewhat. For example, keeping a large part of the index in the cache improves performance, and OLTP accesses in particular tend to be quite random. Secondly, the relatively slow communication over a central hub slows down synchronization. That this is a real issue is shown by Intel's claim that the OLTP HammerDB benchmark runs 60% faster on a 28-core Xeon 8280 than on the EPYC 7601. We were not able to verify this before the deadline, but it seems reasonable.

But the vast majority of these high-end CPUs will be running many parallel applications: microservices, Docker containers, virtual machines, map/reduce jobs over smaller chunks of data, and parallel HPC jobs. In almost all of those cases, 16MB of L3 per 4 cores is more than enough.

Although, come to think of it, when running an 8-core virtual machine there might be small corner cases where performance suffers a little bit.

In short, AMD still leaves a bit of performance on the table by not using a larger 8-core CCX. We will have to wait and see what happens in future platforms.

Comments

  • sing_electric - Thursday, August 8, 2019 - link

    Not just Netburst - remember, Intel's plans were ORIGINALLY for Itanium to migrate down through the stack, ending up in consumer machines. Two massively costly mistakes when it came to planning the future of CPUs. Honestly, I hope Intel properly compensated the team behind the P6, an architecture so good that it was essentially brought back to cover for those two failures.

    OTOH, it's kind of amazing that AMD survived the Bulldozer years, since their margin for error is much smaller than Intel's. Good thing they bought ATI, since I'm not sure the company survives without the money they made from graphics cards and consoles...
  • JohanAnandtech - Thursday, August 8, 2019 - link

    Thank you for the kudos and sympathy. It was indeed hot! At 39°C/102°F, the server was off.

    I agree - I too admire the no-nonsense leadership of Lisa Su: focus, careful execution, and customer centricity.
  • WaltC - Thursday, August 8, 2019 - link

    AMD has proven once again that Intel can be beaten, and soundly, too...;) The myth of the indestructible Intel is forever shattered, and Intel's CPU architectures are so old they creak and are riddled with holes, imo. Where would Intel have put us, if there'd been no AMD? You like Rdram, you like Itanium, just for starters? You like paying through the nose? That's where Intel wanted to go in its never-ending quest to monopolize the market! AMD stopped all of that by offering an alternative path the market was happy to take--a path that didn't involve emulators and tossing out your software library just to give Intel a closed bus! Intel licensed AMD's x86-64, among other things--and they flourished when AMD dropped the ball. I chalk all that up to AMD going through a succession of horrible CEOs--people who literally had no clue! Remember the guy who ran AMD for awhile who concluded it made sense for AMD to sell Intel servers...!? Man, I thought AMD was probably done! There's just no substitute for first-class management at the top--Su was the beginning of the AMD renaissance! Finally! As a chip manufacturer, Intel will either learn how to exist in a competitive market or the company over time will simply fade away. I often get the feeling that Intel these days is more interested in the financial services markets than in the computer hardware markets. While Intel was busy milking its older architectures and raking in the dough, AMD was busy becoming a real competitor once again! What a difference the vision at the top, or the lack of it, makes.
  • aryonoco - Thursday, August 8, 2019 - link

    That dude was Rory Read, and while the SeaMicro acquisition didn't work out, he did some great work and restructured AMD and in many ways saved the company while dealing with the Bulldozer disaster.

    Rory stabilized the finances of the company by lowering costs by over 30% and created the semi-custom division that enabled them to win the contracts for both the Xbox and PS4, creating a stable stream of revenue. Of course, Rory's greatest accomplishment was hiring Lisa Su and then grooming her to become the CEO.

    Rory was a transitional CEO and he did exactly what was required of him. If there is a CEO that should be blamed for AMD's woes, it's Dirk Meyer.
  • aryonoco - Thursday, August 8, 2019 - link

    Forgot to mention, Rory also hired Jim Keller to design K12, and in effect he started the project that would later become Zen.

    Of course Lisa deserves all the glory from then on. She has been an exceptional leader, bringing focus and excelling at execution, things that AMD always traditionally lacked.
  • tamalero - Sunday, August 11, 2019 - link

    I'd blame Hector Ruiz first.
    It was his crown to lose during the Athlon 64 era, and he simply didn't have anything to show, making the Athlon 64 core architecture a one-hit wonder for more than a decade.
  • MarcusTaz - Wednesday, August 7, 2019 - link

    Another site's article that starts with an F stated that Rome runs hot and uses 1.4 volts, above TSMC's recommended 1.3 volts. Did you need to run 1.4 volts for these tests?
  • evernessince - Wednesday, August 7, 2019 - link

    Well 1st, that 1.3v figure is from TSMC's mobile focused 7nm LPP node. Zen 2 is made on the high performance 7nm node, not the mobile focused LPP. Whatever publication you read didn't do their homework. TSMC has not published information on their high performance node and I think it rather arrogant to give AMD an F based on an assumption. As if AMD engineers are stupid enough to put dangerous voltages through their CPUs that would result in a company sinking lawsuit. It makes zero sense.

    FYI all AMD 3000 series processors go up to 1.4v stock. Given that these are server processors, they will run hot. After all, more cores = more heat. It's the exact same situation for Intel server processors. The only difference here is that AMD is providing 50 - 100% more performance in the same or less power consumption at 40% less cost.
  • DigitalFreak - Thursday, August 8, 2019 - link

    You reading Fudzilla?
  • Kevin G - Wednesday, August 7, 2019 - link

    AMD is back. They have the performance crown again and have decided to lap the competition with what can be described as an embarrassing price/performance comparison to Intel. The only thing they need to do is be able to meet demand.

    One thing I wish they would have done is added quad socket support. Due to the topology necessary, intersocket bandwidth would be a concern at higher core counts but if you just need lots of memory, those low end 8 core chips would have been fine (think memcache or bulk NVMe storage).

    With the topology improvements, I also would have liked AMD to try something creative: a quad chip + low clocked/low voltage Vega 20 in the same package, all linked together via Infinity Fabric. That would be something stunning for HPC compute. I do see AMD releasing some GPU in a server socket at some point for this market, as things have been aligning in this direction for some time.

    Supporting something like CCIX or OpenCAPI also would have been nice. A nod toward my previous point, even enabling Infinity Fabric to Vega 20 compute cards instead of PCIe 4.0 would have been yet another big step for AMD as that'd permit full coherency between the two chips without additional overhead.

    I think it would be foolish to ignore AVX-512 for Zen 3, even if the hardware they run it on continues to use 256-bit wide SIMD units. ISA parity is important even if they don't inherently show much of a performance gain (though considering the clock speed drops seen in Skylake-SP, if AMD could support AVX-512 at the clocks they're able to sustain at AVX2 on Zen 2, they might pull off an overall throughput win).

    With regards to Intel, they have Cooper Lake due later this year. If Intel was wise, they'd use that as a means to realign their pricing structure and ditch the memory capacity premium. Everything else Intel can do in the short term is flex their strong packaging techniques and push integrated accelerators: on-package fabric, FPGA, Optane DIMMs etc. Intel can occupy several lucrative niches in important, growing fields with what they have in-house right now, but they need to get them to market and at competitive prices. Otherwise it is AMD's game for the next 12 to 15 months until Ice Lake-SP arrives to bring back the competitive landscape. It isn't even certain that Intel can score a clean win either, as Zen 3 based chips may start to arrive in the same time frame.
