NVIDIA's Carmel CPU Core - SPEC2006 Speed

While the Xavier’s vision and machine processing capabilities are definitely interesting, its use-cases will be largely out-of-scope for the average AnandTech reader. One of the aspects of the chip that I was personally more interested in was NVIDIA’s newest generation Carmel CPU cores, as they represent one of the rare custom Arm CPU efforts in the industry.

Memory Latency

Before going into the SPEC2006 results, I wanted to see how NVIDIA’s memory subsystem compares against some comparable platforms in the Arm space.

In the first logarithmic latency graph, we see the exaggerated latency curves which make it easy to determine the various cache hierarchy levels of the systems. As NVIDIA advertises, we see the 64KB L1D cache of the Carmel cores. What is interesting here is that NVIDIA is able to achieve quite a high performance L1 implementation with just under 1ns access times, representing a 2-cycle access which is quite uncommon. The second hierarchy cache is the L2 that continues on to the 2MB depth, after which we see the 4MB L3 cache. The L3 cache here looks to be of a non-uniform-access design as its latency steadily rises the further we go.
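As a rough check of the 2-cycle figure, a latency in nanoseconds is simply the cycle count divided by the clock frequency in GHz. The sketch below assumes a Carmel clock of about 2.26GHz, which is not stated in this section, so the constant should be read as an assumption rather than a measured value.

```python
# Assumed Carmel CPU clock in GHz; this value is NOT given in this section.
ASSUMED_CLOCK_GHZ = 2.26


def cycles_to_ns(cycles: int, clock_ghz: float) -> float:
    """Convert a latency expressed in core cycles to nanoseconds."""
    return cycles / clock_ghz


# A 2-cycle L1D access at the assumed clock.
l1_latency_ns = cycles_to_ns(2, ASSUMED_CLOCK_GHZ)
print(f"2-cycle L1 access at {ASSUMED_CLOCK_GHZ} GHz: {l1_latency_ns:.2f} ns")
```

Under that assumed clock, a 2-cycle access lands a little below 1ns, which is consistent with the curve described above.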

Switching back to a linear graph, NVIDIA does have a latency advantage over Arm’s Cortex-A76 and the DSU L3 of the Kirin 980; however, it loses out at deeper test depths and in latency at the memory controller level. The Xavier SoC comes with 8x 32bit (256bit total) LPDDR4X memory controller channels, representing a peak bandwidth of 137GB/s, significantly higher than the 64 or 128bit interfaces on the Kirin 980 or the Apple A12X. Apple overall still has an enormous memory latency advantage over the competition, as its massive 8MB L2 cache and 8MB SLC (system level cache) allow for significantly lower latencies across all test depths.
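The 137GB/s figure can be reproduced from the bus width and the transfer rate. The sketch below assumes LPDDR4X running at 4266MT/s; that data rate is not stated in the text and is only an assumption, as is applying the same rate to the narrower 64bit and 128bit interfaces.

```python
# Assumed LPDDR4X data rate in transfers per second; not stated in the article.
ASSUMED_TRANSFERS_PER_SEC = 4266e6


def peak_bandwidth_gbs(channels: int, bits_per_channel: int,
                       transfers_per_sec: float) -> float:
    """Theoretical peak bandwidth in GB/s (10^9 bytes per second)."""
    bus_bytes = channels * bits_per_channel / 8  # bytes moved per transfer
    return bus_bytes * transfers_per_sec / 1e9


# Xavier: 8 channels of 32 bits each (256bit total), per the article.
xavier_gbs = peak_bandwidth_gbs(8, 32, ASSUMED_TRANSFERS_PER_SEC)
# Narrower interfaces at the same assumed data rate, for scale.
bus_64_gbs = peak_bandwidth_gbs(2, 32, ASSUMED_TRANSFERS_PER_SEC)
bus_128_gbs = peak_bandwidth_gbs(4, 32, ASSUMED_TRANSFERS_PER_SEC)

print(f"256bit: {xavier_gbs:.1f} GB/s")
print(f"128bit: {bus_128_gbs:.1f} GB/s")
print(f" 64bit: {bus_64_gbs:.1f} GB/s")
```

These are theoretical peaks only; sustained bandwidth seen by a single CPU core will be considerably lower.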

SPEC2006 Speed Results

A rarity when we're looking at Arm SoCs and the products built around them, NVIDIA’s Jetson AGX comes with a custom image of Ubuntu Linux (18.04 LTS). On one hand, including a Linux OS gives us a lot of flexibility in terms of test platform tools; on the other hand, it also shows the relative immaturity of Arm on Linux. One of the more regrettable aspects of Arm on Linux is browser performance; to date the available browsers still lack optimised JavaScript JIT engines, resulting in performance that is far worse than on any commodity mobile device.

While we can’t really test our usual web workloads, we do have the flexibility of Linux to simply compile whatever we want. In this case we’re continuing our use of SPEC2006, as we have a relatively established set of figures on all relevant competing ARMv8 cores.

To best mimic the setup of the iOS and Android harnesses, we chose the Clang 8.0.0 compiler. To keep things simple, we didn’t use any special flags other than -Ofast and a scheduling model targeting the Cortex-A53 (it performed better overall than no model or A57 targets). We also have to remind readers that SPEC2006 has been retired in favour of SPEC2017, and that the results published here are not officially submitted scores, but rather internal figures that we have to describe as estimates.
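For readers who want to try a similar build, the sketch below assembles a Clang command line of this general shape. The article does not say which flag selected the scheduling model; `-mtune=cortex-a53` is one plausible way to do it and is an assumption here, as are the helper name and the file names, which are purely illustrative.

```python
from typing import List


def build_clang_command(source: str, output: str,
                        tune_cpu: str = "cortex-a53") -> List[str]:
    """Assemble a clang invocation with -Ofast and a scheduling-model target.

    The use of -mtune (rather than, say, -mcpu) is an assumption; the
    article only states that the Cortex-A53 scheduling model was targeted.
    """
    return ["clang", "-Ofast", f"-mtune={tune_cpu}", source, "-o", output]


# Hypothetical file names, for illustration only.
cmd = build_clang_command("benchmark.c", "benchmark")
print(" ".join(cmd))
```

The list form can be passed directly to `subprocess.run` on a machine with Clang installed; swapping `tune_cpu` makes it easy to compare scheduling models as described above.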

The power efficiency figures presented for the AGX, much like all other mobile platforms, represent the active workload power usage of the system. This means we’re measuring the total system power under a workload and subtracting the idle power of the system under similar circumstances. The Jetson AGX has a relatively high idle power consumption of 8.92W in this scenario, much of which can simply be attributed to a relatively non-power-optimised board as well as the fact that we’re actively outputting via HDMI while having the board connected to GbE.
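The subtraction described above can be sketched as follows. The 8.92W idle figure comes from the text; the load power and runtime are hypothetical placeholders rather than measured values, and expressing the result as energy (power multiplied by runtime) is one common way to compare efficiency, not necessarily the exact procedure used for these figures.

```python
IDLE_POWER_W = 8.92  # Jetson AGX idle power, from the article


def active_power_w(total_power_w: float, idle_power_w: float = IDLE_POWER_W) -> float:
    """Active workload power: total system power minus idle power."""
    return total_power_w - idle_power_w


def energy_joules(power_w: float, runtime_s: float) -> float:
    """Energy consumed by a workload: average power times runtime."""
    return power_w * runtime_s


# Hypothetical measurement: 12.0 W total system power over a 300 s run.
example_active_w = active_power_w(12.0)
example_energy_j = energy_joules(example_active_w, 300.0)
print(f"active power: {example_active_w:.2f} W")
print(f"energy:       {example_energy_j:.0f} J")
```

Because the idle draw is subtracted, a board with high idle power such as this one is not penalised for it directly, although inefficiencies in power delivery under load still show up in the active figure.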

In the integer workloads, the Carmel CPU cores' performance is quite average. Overall, the performance across most workloads is extremely similar to that of Arm’s Cortex-A75 inside the Snapdragon 845, with the only outlier being 462.libquantum which showcases larger gains due to Xavier’s increased memory bandwidth.

In terms of power and efficiency, the NVIDIA Carmel cores again aren’t quite the best. The fact that the Xavier module is targeted at a totally different industry means that its power delivery possibly isn’t quite as power optimised as on a mobile device. We also must not forget that the Xavier has an inherent technical disadvantage of being manufactured on a 12FFN TSMC process node, which should be lagging behind Samsung’s 10LPP processes of the Exynos 9810 and the Snapdragon 845, and most certainly represents a major disadvantage against the newer 7nm Kirin 980 and Apple A12.

On the floating point benchmarks, Xavier fares better overall because some of the benchmarks are characterised by their sensitivity to the memory subsystem; this is most obvious in 433.milc. 470.lbm also sees the Carmel cores perform relatively well. In the other workloads, however, we again see Xavier having trouble differentiating itself much from the performance of a Cortex A75.

Here’s a wider performance comparison across SPEC2006 workloads among the most recent and important ARMv8 CPU microarchitectures:

Overall, NVIDIA’s Carmel core seems like a big step up for NVIDIA and their in-house microarchitecture. However, when compared against the most recent cores from the competition, we see the new core having trouble really distinguishing itself in terms of performance. Power efficiency of the AGX also lags behind, although this was to be expected given that the Jetson AGX is not a power optimised platform and that the chip’s 12FFN manufacturing process is a generation or two behind the latest mobile chips.

The one aspect of NVIDIA’s Carmel cores that we can’t quantify is their features: this is a shipping CPU with ASIL-C functional safety features that we have in our hands today. The only competition in this regard would be Arm’s new Cortex A76AE, which we won’t see in silicon for at least another year or more. When taking this into account, it could possibly make sense for NVIDIA to have gone with its in-house designs. However, as Arm starts to offer more designs for this space, I’m having a bit of a hard time seeing a path forward in the generations after Xavier, as the Carmel cores don’t position themselves too well competitively.


51 Comments


  • linuxgeex - Friday, November 8, 2019 - link

    Add this line to the following files (linux/bsd or windows)

    /etc/hosts or c:\windows\system32\drivers\etc\hosts

    127.0.0.1 ads.servebom.com

    job done.
  • TheinsanegamerN - Friday, January 4, 2019 - link

    auto video ads are hell incarnate.
  • Yojimbo - Friday, January 4, 2019 - link

    Regarding NVIDIA's future CPU core development, I think it's important to note that NVIDIA has developed all major IP blocks on the SoC. That probably allows them to work on integration sooner than if they relied on externally developed IP blocks. Also, they have the ability to tune their cores and fabric to their intended application, which is a narrow subset of what ARM is developing for. I'm guessing NVIDIA doesn't tune the performance of their CPU cores using specint or specfp. They probably look at much more specific and realistic benchmarks.

    And by the time the Cortex A76AE is available for NVIDIA to use they will probably have a next iteration of their CPU which perhaps will show up in Orin in early 2021 or even late 2020. It's not clear to me what delayed Xavier from NVIDIA's original schedule. It's possible they'll be able to get the next one out with less time between the launch of the underlying GPU architecture and the availability of the SoC. There was a lot of new stuff that went into Xavier other than the GPU architecture, such as the increased safety features, the DLA, and the PVA.
  • DeepLearner - Friday, January 4, 2019 - link

    I hope they'll send you a T4 soon! I'm dying for numbers on those.
  • eastcoast_pete - Friday, January 4, 2019 - link

    @Andrei: thanks for this review. I wonder if the recent loss of a larger client in the automotive sector (Audi/Volkswagen) to Samsung played a role in Nvidia's willingness to make samples available to you for review. As of model year 2021, Audi will stop using Tegra-based units and move to Samsung's Exynos Auto V9 SoC, which actually features 8 cores based on ARM's A76 AE design for automotive/vehicular use.
    While that specialized SoC is still awaiting mass production, I also wonder if Samsung's choice to use straight-up ARM A76 cores (yes, they are AE, so not standard A76) portends a sea change for the mainstream Exynos lines also? As you pointed out, Mongoose turned out to be quite disappointing, so is there a change coming? Would appreciate your insights and comment!
  • webdoctors - Friday, January 4, 2019 - link

    I was also confused by the news of Audi using Samsung chips. I don't think this changes the Audi/Nvidia relationship from googling: http://fortune.com/2017/01/05/audi-nvidia-2020/

    I think in the infotainment sector there's just a lot of competition for cheap chips and a low bar for entry. Any Mediatek or run of the mill cellphone chip should do. I doubt you'd care about ECC or safety in the HW playing your music or watching movies. My current car has an aftermarket unit that's 10 years old that can play DVD movies, has GPS maps and integrates a backup camera.

    I'm not sure how you'd program a beast of a chip here, or even what the right benchmarks are, since you wouldn't need it just to play movies, show maps or run CPU benchmarks. With all the inferencing and visual processing it'd be a waste of resources and money to use it for the traditional tasks done today in cars.

    I'm really curious how Anandtech evaluates these specialized products that aren't your run of the mill CPU/GPU/HDD.
  • unrulycow - Saturday, January 5, 2019 - link

    This is obviously overkill for the entertainment system. Its main purpose is for the semi-autonomous driving systems like Cadillac's SuperCruise or Tesla's Autopilot.
  • Andrei Frumusanu - Friday, January 4, 2019 - link

    As far as I know their mobile roadmap still uses custom cores; there are probably different requirements for automotive, or they could have simply said that 8 A76s make a lot more sense than 8 custom cores.
  • eastcoast_pete - Saturday, January 5, 2019 - link

    Thanks Andrei! Yes, design requirements for automotive/vehicle-embedded are different in key areas (safety/security). However, I was/am struck by Samsung not adapting their own Mongoose design for AE use. Maybe their client (Audi) preferred the stock A76 AE design, and it wasn't economical to adapt Mongoose. However, this now means that the most powerful Samsung SoC design (A76 octacore) might be found in - Audi cars.
  • unrulycow - Saturday, January 5, 2019 - link

    They are also losing Tesla as a client. Tesla decided to create their own chip which will theoretically start going into cars in Q2. I would love to see a comparison between the two chips.
