Cavium Thunder-X

A few months ago, we talked briefly with the people at Cavium. Cavium specializes in designing MIPS SoCs that enable intelligent networking, communications, storage, video, and security applications. The picture below sums it all up: present and future.

Cavium's "Thunder Project" started from Cavium's existing Octeon III network SoC, the CN78xx. Cavium's bread and butter has been integrating high-speed network capabilities in SoCs, so you will be able to choose between SoCs that have 100Gbit Ethernet and 10Gbit Ethernet. PCI Express root complexes and multiple SATA ports are all integrated. There is no doubt that Cavium can design a highly integrated, feature-rich SoC, but what about the processing core?

The MIPS cores inside the Octeon are much simpler – dual-issue, in-order – but they are also much smaller and need very little power compared to a typical server core. Four (28nm) MIPS cores can fit in the space of one (32nm) Sandy Bridge core.

Replace the MIPS decoders with ARMv8 decoders and you are almost there. However, while the Cavium Thunder-X is definitely not made to run SAP, server workloads are a bit more demanding than network processing, so Cavium needed to beef up the Octeon cores. The new Thunder-X cores are still dual-issue, but they're now out-of-order instead of in-order, and the pipeline length has been increased from eight to nine stages to allow for higher clocks. Each core has a 78KB L1 instruction cache and a 32KB L1 data cache.

The 37-way 78KB L1 I-cache is certainly odd, but it might be more than just "network processor heritage". Our own testing and a few academic studies have shown that scale-out workloads such as memcached have a higher than normal (meaning the typical SPECint_rate2006 characterization) I-cache miss rate. The reason is that these applications run a lot of kernel code, and more specifically the code of the network stack. As a result, the instruction footprint is much larger than expected.

Another reason why we believe Cavium has done its homework is that more die area is spent on cores (up to 48) than on large caches; an L3 cache is nowhere to be found. The Thunder-X has only one centralized, relatively low-latency 16MB L2 cache running at full core speed. A lot of academic studies have confirmed that a large L3 cache is a waste of transistors for scale-out workloads. Besides the most frequently used instructions that reside in the I-cache, there is a huge amount of less frequently used kernel code that does not fit in an L3 cache. In other words, an L3 cache just adds more latency to requests that missed the L1 cache and that will end up in the DRAM anyway. That is also the reason why Cavium made sure that a beefy memory controller is available: the Thunder-X comes with four 72-bit DDR3/4 memory controllers and currently supports the fastest DRAM available for servers, DDR4-2133.
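To put a rough number on "beefy", the theoretical peak bandwidth can be estimated from the figures in the paragraph above. This is only a back-of-the-envelope sketch: it assumes that each 72-bit controller carries 64 data bits plus 8 ECC bits and drives a single channel, which is the usual arrangement for ECC DIMMs but is not something Cavium spelled out to us. The function name is ours, and sustained bandwidth will be lower than this peak.

```python
def peak_bandwidth_gb_s(transfers_mt_s: int, data_bits: int, channels: int) -> float:
    """Theoretical peak DRAM bandwidth in GB/s (decimal gigabytes)."""
    bytes_per_transfer = data_bits // 8
    return transfers_mt_s * 1e6 * bytes_per_transfer * channels / 1e9

# DDR4-2133 nominally performs 2133 million transfers per second.
# Assumption: 64 of the 72 bits per controller are data, the other 8 are ECC.
per_controller = peak_bandwidth_gb_s(2133, 64, 1)
total = peak_bandwidth_gb_s(2133, 64, 4)

print(f"per controller: {per_controller:.1f} GB/s, four controllers: {total:.1f} GB/s")
# prints "per controller: 17.1 GB/s, four controllers: 68.3 GB/s"
```

Spread evenly over 48 cores, that peak works out to roughly 1.4 GB/s per core.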

On the flip side, having 48 cores, each with a relatively small 32KB D-cache, that all access one centralized 16MB L2 cache also means that the Thunder-X is less suited for some "traditional" server workloads such as SQL databases. So a Thunder-X core is simpler and probably quite a bit weaker than an ARM Cortex-A57 in some ways, let alone an X-Gene core. The fact that the Thunder-X spends far fewer transistors on cache than on cores clearly indicates that it is targeting other workloads. Single-threaded performance is likely to be lower than that of the AMD Seattle and X-Gene, but it could be close enough: the Thunder-X will run at 2.5GHz, courtesy of GlobalFoundries' 28nm process technology. Cavium is claiming that even the top SKU will keep the TDP below 100W.
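The cache trade-off is easy to quantify with the numbers quoted above. The sketch below divides the shared L2 evenly over the cores, which is a simplification: a shared cache is not partitioned per core, so this is an average share and not a guarantee. The variable names are ours.

```python
L2_BYTES = 16 * 1024 * 1024   # one shared L2 cache, 16MB
L1D_BYTES = 32 * 1024         # per-core L1 data cache, 32KB
CORES = 48                    # core count of the largest configuration

# Simplifying assumption: every core gets an equal slice of the shared L2.
l2_per_core_kib = L2_BYTES / CORES / 1024
ratio_to_l1d = L2_BYTES / CORES / L1D_BYTES

print(f"L2 share per core: {l2_per_core_kib:.0f} KiB "
      f"({ratio_to_l1d:.1f}x the L1 D-cache)")
# prints "L2 share per core: 341 KiB (10.7x the L1 D-cache)"
```

With all 48 cores busy, each one has on average about a third of a megabyte of L2 behind its L1, which is why workloads with large hot data sets, such as SQL databases, are a less natural fit.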

There is more. The Thunder-X uses the proprietary Cavium Coherent Processor Interconnect (CCPI) and can thus work in a dual-socket NUMA configuration. As a result, a Thunder-X based server can have up to 96 cores and is capable of supporting 1TB of memory, 512GB per socket. Multiple 10/40GbE, PCIe Root Complex, and SATA controllers are integrated in the SoC. Depending on the SKU, TCP/IPsec offload and SSL accelerators are also integrated.
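The dual-socket totals follow directly from the per-socket figures. As a quick sanity check, here is a small sketch; the `Socket` class and `system_totals` function are hypothetical names used only for illustration, and the per-core memory figure assumes the maximum memory is actually installed.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Socket:
    cores: int
    max_memory_gb: int

def system_totals(socket: Socket, sockets: int) -> tuple[int, int]:
    """Return (total cores, total memory in GB) for identical sockets."""
    return socket.cores * sockets, socket.max_memory_gb * sockets

# Per-socket figures as quoted in the text: 48 cores and 512GB.
top_sku = Socket(cores=48, max_memory_gb=512)
cores, memory_gb = system_totals(top_sku, sockets=2)

print(f"{cores} cores, {memory_gb} GB ({memory_gb / 1024:.0f} TB), "
      f"{memory_gb / cores:.1f} GB per core")
# prints "96 cores, 1024 GB (1 TB), 10.7 GB per core"
```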

The recent launch of Cavium's Thunder-X SKUs makes it clear that Cavium is trying to compete with the venerable Xeon E5 in some niche but large markets:

  1. ThunderX_CP: For cloud compute workloads such as public and private clouds, web caching, web serving, search, and social media data analytics.
  2. ThunderX_ST: For cloud storage, big data, and distributed databases.
  3. ThunderX_NT: For telecom/NFV server and embedded networking applications.
  4. ThunderX_SC: For secure computing applications.

Considering Cavium's background and expertise, it is pretty obvious that ThunderX_NT and ThunderX_SC should be very capable challengers to the Xeon E5 (and Xeon-D), but only a thorough review will tell how well the ThunderX_CP will do. One of the strongest points of Calxeda was the highly integrated fabric that lowered the total power consumption and network latency of such a server cluster. Just like AMD/SeaMicro, Cavium is well positioned to make sure that Thunder-X based server clusters also have this high level of network/compute integration.

78 Comments

  • esterhasz - Thursday, December 18, 2014

    But this is exactly why a wider array of machines based on their chips would make sense: the R&D cost is already spent anyways, since iPhone and iPad need chips, selling more units thus reduces R&D cost per unit. Economies of scale.

    I don't believe a MBA variant with ARM is down the road either, but the rumored iPad Pro could develop into something similar rather quickly.
  • OreoCookie - Tuesday, December 16, 2014

    If you want to talk about ARM on the desktop, that's a whole other discussion, but one that most certainly needs to include price: if the price difference between a Broadwell-based Core M and a fictitious Apple A9X is $200~$230, then this changes the discussion completely. Two other factors are graphics performance (the Core M has »only« 1.3 billion transistors, the A8X ~2 billion, indicating that the mythical A9X may have faster graphics) and the fact that Apple controls the release schedule and can spec the SoC to meet its projected needs. To view this topic solely through the lens of CPU performance is myopic.
  • darkich - Friday, December 19, 2014

    Your comparisons missed the picture spectacularly.
    A8X is a 20nm 2-4W TDP chip with a price that is probably around 70$.
    Top of the line Core M5Y70 is a 14nm 4.5 W TDP chip with a price of 270$.
    And it has a weaker GPU, btw. (raw performance). And it throttles massively, effectively giving only 50% of the benchmark performance.

    If you're going to compare that to an Apple chip, compare it to a 14nm A9X with custom derived PowerVR series 7 GPU,(scales up to 1,4 TFLOPS) vastly expanded memory controllers connected to a much faster RAM (compared to one in the iPad) upclocked to 2GHz, that are available at any time.
  • darkich - Friday, December 19, 2014

    .. *with cores upclocked to about 2GHz
  • Flunk - Tuesday, December 16, 2014

    Nintendo already sells ARM systems, the 3DS and the DS before it are both ARM-based. The PSVita is ARM too. I don't see an ARM Macbook Air anytime soon, they need a bigger and higher-clocking chip for that and it doesn't look like that's going to happen anytime soon.
  • Nintendo Maniac 64 - Tuesday, December 16, 2014

    Even the Game Boy Advance used an ARM7 for its main CPU.
  • jjj - Tuesday, December 16, 2014

    Obviously there are handhelds using ARM but the point was about bigger cores and clearly not handhelds.
  • DLoweinc - Tuesday, December 16, 2014

    Don't quote Wikipedia, not suitable for this level of writing.
  • garbagedisposal - Tuesday, December 16, 2014

    Says DLoweinc, master of knowledge and scholarly writing.
    In contrast to your childish and outdated opinion, Wikipedia is a perfectly valid source of information, go read about it and quit crying.
  • Daniel Egger - Tuesday, December 16, 2014

    The problem really is the custom solutions can simply not compete with Intel on any level for general purpose computing (which the majority of applications are), not on performance/price, performance/power and not even on features/price.

    For instance I can see a huge market for sub-Xeon (or Atom C) performance at a corresponding price -> not going to happen because everyone is targeting > Xeon performance at ridiculous prices because they're expecting the margin to be there however there're simply too many compromises to be made by the buyers so that has to fail.

    Also I can see a huge demand for Atom C - Xeon performance at lower power consumption however no one seems to be really targeting this, all we get are Raspberry Pi's and a bit beefier but close from even Atom C. The new virtualisation techniques (Docker et al) opened a whole new can of possibilities for non-x86(_64) devices because virtualisation is suddenly possible and much more lightweight than ever before but no one seems to want to jump this opportunity.

    I'd really like to buy some affordable general purpose (BYOM/BYOS) hardware which has a little bit of oomph and takes little power which should be the powerful sides of any of the contenders but somehow all fail to deliver and I don't even see an attempt to change that.

    If I want mind-boggling performance at decent performance/price ratio with real virtualisation and 100% standard software compatibility there's no way around the high end Xeons (and maybe AMD iff they manage to get their asses back up) and none of the contenders is ever going to challenge that so they might as well stop trying.
