Frequency Ramp, Latency and Power

Frequency Ramp

One of the key attributes of a modern processor is its ability to move from an idle state up to a peak turbo state. For consumer workloads this matters for system responsiveness, such as opening a program or interacting with a web page, but in the enterprise market it becomes more relevant when each core can control its own turbo, as in multi-user instances or database accesses. For these systems, saving power obviously helps with the total cost of ownership, but being able to offer low-latency transactions is often a key selling point.

For our 7F52 system, we measured a jump up to peak frequency within 16.2 milliseconds, which lines up well with the other AMD systems we have tested recently.

In a consumer system, we would normally point out that 16 milliseconds is equivalent to a single frame on a 60 Hz display; for enterprise, it means that any transaction normally completed within 16 milliseconds is a very light workload that might not kick up the turbo at all.
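The idea behind this kind of measurement can be sketched in software: run a fixed busy-loop in short windows starting from idle, and report the first window whose throughput reaches the steady-state plateau. A rough illustration in Python (the function name `ramp_time_ms`, the 1 ms window, and the 95% threshold are arbitrary choices for this sketch, not the methodology behind the numbers above):

```python
import time

def ramp_time_ms(window_ms=1.0, settle_windows=200, threshold=0.95):
    """Estimate how long the core takes to reach steady-state throughput.

    Runs a trivial busy-loop in ~1 ms windows and reports the first window
    whose iteration count reaches `threshold` of the peak seen over the run.
    """
    window_s = window_ms / 1000.0
    scores = []
    for _ in range(settle_windows):
        end = time.perf_counter() + window_s
        iters = 0
        while time.perf_counter() < end:
            iters += 1  # trivial work; iteration count scales with frequency
        scores.append(iters)
    peak = max(scores)
    for i, score in enumerate(scores):
        if score >= threshold * peak:
            return (i + 1) * window_ms  # first window at >=95% of peak
    return settle_windows * window_ms

print(f"approximate ramp time: {ramp_time_ms():.1f} ms")
```

On a real system the result is at the mercy of the OS frequency governor and only has window-size resolution, so hardware frequency counters are far more precise; this only shows the shape of the test.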

Cache Latency

As we’ve discussed in the past, the key element of cache latency on AMD's EPYC systems is the L3 cache – because these cores are designed around quad-core core complexes (CCXes), the only L3 each core can access is the slice within its own CCX. That means for every EPYC CPU, whether four cores per CCX are enabled or only one, each core has access to just 16 MB of L3. The 256 MB across the whole chip is simply a function of repeating units. As a result, we get the following cache latency graph:

This structure mirrors what we’ve seen in AMD CPUs in the past. What we get here for the 7F52 is:

  • 1.0 nanoseconds for L1 (4 clks) up to 32 KB
  • 3.3 nanoseconds for L2 (13 clks) up to 256 KB
  • 4.8-5.6 nanoseconds (19-21 clks) at 256-512 KB (accesses start to miss the L1 TLB here)
  • 12-14 nanoseconds (48-51 clks) from 1 MB to 8 MB, inside the first half of the CCX L3
  • Up to 37 nanoseconds (60-143 clks) at 8-16 MB for the rest of the L3
  • ~150 nanoseconds (580-600+ clks) from 16 MB onward, moving into DRAM

Compared to one of our more recent tests, Ryzen Mobile, we see the bigger L3 cache structure, but also a sizeable increase in latency when going beyond the L3 into DRAM, due to the hop to the IO die and then out to main memory. It means that for those 600 or so cycles, the core needs to keep itself busy with other work. As the L3 only takes L2 cache line evictions, there has to be a lot of reuse of L3 data, or repeated math on the same data, to take advantage of it.
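Curves like this are typically produced with a pointer-chasing benchmark: the buffer is filled with a random cyclic permutation so that every access depends on the result of the previous one, defeating the hardware prefetchers. A sketch of the technique in Python (interpreter overhead of tens to hundreds of nanoseconds per access swamps the actual cache latencies, so this illustrates the method rather than reproducing the numbers; `chase_latency_ns` and the 8-bytes-per-slot assumption are ours):

```python
import random
import time

def chase_latency_ns(n_elems, iters=1_000_000):
    """Average time per dependent access over a random cyclic permutation.

    Each slot stores the index of the next slot to visit, forming a single
    cycle, so every access depends on the previous one and cannot be
    prefetched ahead of time.
    """
    perm = list(range(n_elems))
    random.shuffle(perm)
    nxt = [0] * n_elems
    for i in range(n_elems):
        nxt[perm[i]] = perm[(i + 1) % n_elems]  # single cycle over all slots
    idx = 0
    start = time.perf_counter()
    for _ in range(iters):
        idx = nxt[idx]  # dependent load: next address unknown until now
    elapsed = time.perf_counter() - start
    return elapsed / iters * 1e9

# Larger working sets fall out of successive cache levels:
for kib in (16, 256, 4096):
    n = kib * 1024 // 8  # rough stand-in: ~8 bytes per slot
    print(f"{kib:>5} KiB: {chase_latency_ns(n):.1f} ns per access")
```

A real version of this test is written in C or assembly with a fixed stride and cache-line-sized elements, which is how graphs like the one above are generated.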

Core-to-Core Latency

By only having one core per CCX, the 7F52 takes away one segment of its latency structure.

  • Thread to Thread in same core: 8 nanoseconds
  • Core to Core in same CCX: doesn't apply
  • Core to Core in different CCX on same CPU in same quadrant: ~110 nanoseconds
  • Core to Core in different CCX on same CPU in different socket quadrant: 130-140 nanoseconds
  • Core to Core in a different socket: 250-270 nanoseconds
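Core-to-core figures like these usually come from a ping-pong test: two threads, pinned to the cores under test, hand a token back and forth, and half the round-trip time approximates the one-way cost. A minimal sketch of the idea in Python (no core pinning here, and CPython's GIL plus `threading.Event` overhead inflate the result by orders of magnitude, so this shows the structure of the test only; `pingpong_ns` is our own illustrative name):

```python
import threading
import time

def pingpong_ns(rounds=10_000):
    """Round-trip time between two threads handing a token back and forth.

    Half the average round-trip approximates one-way core-to-core
    signalling cost (plus, in CPython, substantial interpreter overhead).
    """
    ping, pong = threading.Event(), threading.Event()

    def responder():
        for _ in range(rounds):
            ping.wait()
            ping.clear()
            pong.set()

    t = threading.Thread(target=responder)
    t.start()
    start = time.perf_counter()
    for _ in range(rounds):
        ping.set()      # hand the token over
        pong.wait()     # wait for it to come back
        pong.clear()
    elapsed = time.perf_counter() - start
    t.join()
    return elapsed / rounds / 2 * 1e9  # one-way, in nanoseconds

print(f"~{pingpong_ns():.0f} ns one-way (heavily inflated by Python overhead)")
```

A native implementation would spin on a shared cache line with atomic operations and pin each thread to a specific core, which is what produces the CCX/quadrant/socket breakdown above.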


All of the Power

Enterprise systems, unlike consumer systems, often have to adhere to a strict thermal envelope for the server and chassis designs they go into. This means that, even in a world where there’s a lot of performance to be gained from a fast turbo, the sustained power draw of these processors is mirrored in the TDP specification of the processor. The chip may offer sustained boosts higher than this, which different server OEMs can design for and enable through the BIOS; however, the typical expectation when ‘buying a server off the shelf’ is that if the chip has a specific TDP value, that will be the sustained turbo power draw. At that power, the system will try to run the highest frequency it can, and depending on the design of the power delivery, it might be able to move specific cores up and down in frequency if the workload on other cores is lighter.

By contrast, consumer-grade CPUs will often boost well beyond the TDP label, up to the second power limit set in the BIOS. This limit differs by motherboard, as manufacturers will design their boards beyond Intel’s specifications to accommodate it.

For our power numbers, we take the CPU-only power draw at both idle and when running a heavy AVX2 load.
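On Linux, a software-only approximation of package power can be read from the RAPL energy counters exposed through the powercap interface: sample the counter twice and divide the energy delta by the interval. A sketch (the path covers package 0 only, counter wraparound is ignored for brevity, and `package_watts` is our own name; this is not necessarily how the numbers below were gathered):

```python
import time
from pathlib import Path

# RAPL package-0 energy counter, in microjoules, via the powercap interface.
RAPL = Path("/sys/class/powercap/intel-rapl:0/energy_uj")

def package_watts(interval_s=1.0, counter=RAPL):
    """Average package power over `interval_s`, from the RAPL energy counter.

    Returns None when the counter is unavailable (non-Linux systems, or no
    powercap support). Counter wraparound is ignored for brevity.
    """
    if not counter.exists():
        return None
    e0 = int(counter.read_text())
    time.sleep(interval_s)
    e1 = int(counter.read_text())
    return (e1 - e0) / 1e6 / interval_s  # microjoules -> watts

print(package_watts(0.5))
```

Reading the counter may require elevated privileges depending on kernel configuration, and on recent kernels AMD packages also appear under the same powercap tree.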

Load Power Per Socket

When we pile on the calories, all of our enterprise systems essentially go into TDP-max mode, with every system sitting just under its total TDP. The consumer processors, by contrast, give it a bit more oomph, drawing anywhere from 5-50% more.

Idle Power Per Socket

In our high performance power plan, the AMD CPUs idle quite high compared to the Intel CPUs – both of our EPYC setups draw nearly 70 W per processor, while the 32-core Threadripper is in the 45 W region. Intel seems to idle far more aggressively here.

Comments

  • hehatemeXX - Thursday, April 16, 2020 - link

    No one in their right mind would evaluate a server CPU, designed for datacenters against a consumer CPU that will never see the light of day. WTB a real data center oriented website.. you consumers are just annoying when it comes to this stuff.
  • Oxford Guy - Saturday, April 18, 2020 - link

    " No one in their right mind would evaluate a server CPU, designed for datacenters against a consumer CPU that will never see the light of day. WTB a real data center oriented website.. you consumers are just annoying when it comes to this stuff."

    Is there any kind of metric about ECC vs. non-ECC RAM, in terms of cost-benefit ratio? It's not all about CPU speed. It's also about data stability, correct?

    How much value does ECC RAM bring to the table? That information seems to be critical when comparing consumer CPUs to enterprise and prosumer CPUs.
  • twtech - Monday, April 20, 2020 - link

    Generally that may be the case. However, this CPU may be considered as an option for workstations. For that use case, it's nice to know how it stacks up against consumer CPUs, HEDT, TR, etc.
  • DanNeely - Tuesday, April 14, 2020 - link

    Can you borrow Johan's server benchmarks as a baseline for building your own out?
  • Ivan Argentinski - Tuesday, April 14, 2020 - link

    Hi Ian,

    I am big fan of Anandtech. But I have always missed articles, relevant to me. I am a decision maker for database servers for ERP (among other things). We heavily employ frequency-optimized processors and I feel I can shed some light on the subject.
    Unfortunately, I feel like the article (and just about everybody) is partially missing the point of these processors. Frequency optimized processors are a niche product. They have only one use-case - for enterprise software, which is licensed per-core (like Microsoft SQL Server). So, it is irrelevant to discuss them in any other role.
    Per-core performance is not the same as single-threaded performance. Also, it is not lightly threaded performance. It is also not multi-thread performance, e.g. total CPU power. All these are irrelevant. We pay for SQL Server per core, per month. And it is costly. The CPU cost is nothing, compared to this. However, the total number of frequency-optimized cores we can cram in a server matters to some extent. Hence, the new 7F52 totally makes sense and I guess it will be the best-selling 7Fx2 CPU. If I can get 32 high-per-core-performant cores in a 2P server, it would be great.
    For example, our servers are usually 40-70% loaded during peak times of the day (with some 100% bursts). The thing that matters the most is how each core is handling, while all cores are loaded. This can be roughly stated as:

    Per-core Perf = Total Perf / Number of cores

    Hence, it is meaningless for this niche to:
    - Measure single threaded workloads
    The CPU can trick us by leveraging single-thread boost, which never happens in production.
    - Compare total CPU performance
    If the CPUs have different number of cores, this is meaningless. If we need more performance, we can just purchase more CPUs/servers.
    - Compare the CPU to non-frequency optimized CPUs
    These will just plain lose in per-core performance. But, on second thought, it would be fun to know what the actual difference is!
    - Compare to desktop or other kinds of CPUs.
    We just can’t use these in the data center. And if you are not purchasing for a data center and for a per-core software, then frequency optimized CPUs are not for you. Again, maybe just for fun.

    What is meaningful to compare for F-CPUs:
    - Per-core performance
    Throw heavy multi-threaded workload, then divide by the number of cores and see what you get for each CPU.
    - Watts for a unit of per-core perf
    Power is the other thing we are paying for.
    - $ for a unit of per-core perf
    Not of utmost importance, but still relevant.

    Ideas for relevant test scenarios:
    - 1P * 7F52
    - 2P * 7F52
    - 2P * 7F32
    - 2P * Gold 6250
    - 2P * Gold 6244 (our current setup)
    - 2P * Gold 6244, but with fewer DIMMs than memory channels (if you initially buy with less RAM, how much perf are you losing?)

    Tests, relevant to databases:
    - OLTP - tpm
    - OLAP - qph

    If I have these figures, it can actually alter my purchasing behavior.

    Good Luck and all the best to you and the team!
  • romrunning - Tuesday, April 14, 2020 - link

    Agreed - more enterprise-focused tests would be more relevant to this enterprise-focused EPYC.

    I also would have liked to see VM tests and database tests.
  • Icehawk - Wednesday, April 15, 2020 - link

    Agreed, not quite as relevant for this specific SKU but I’ve been wanting to see VM testing for ages along with a lot of other server related testing like SQL performance. Of course consumer is this site’s focus and that’s OK.
  • Atari2600 - Tuesday, April 14, 2020 - link

    Hey Ivan, you prob need to go look at servethehome.com

    As I'm sure you're well aware, Anand is much more consumer focused with a benchmarking philosophy geared toward that.
  • Oxford Guy - Saturday, April 18, 2020 - link

    What is the point of discussing $7000 CPUs, or even $3000 CPUs if you're only going to be "consumer-focused"?

    The only point I can think of is to try to convince people to buy some company's other product via mindshare (i.e. marketing).
  • brucethemoose - Tuesday, April 14, 2020 - link

    Could you link those 2 benchmarks?
