Qualcomm this month demonstrated its 48-core Centriq 2400 SoC in action and announced that it had started to sample its first server processor with select customers. The live showcase is an important milestone for the SoC because it proves that the part is functional and is on track for commercialization in the second half of next year.

Qualcomm announced plans to enter the server market more than two years ago, in November 2014, but the first rumors about the company’s intention to develop server CPUs emerged long before that. In fact, as one of the largest designers of ARM-based SoCs for mobile devices, Qualcomm was well prepared to move beyond smartphones and tablets. However, while it is not easy to develop a custom ARMv8 processor core and build a server-grade SoC, building an ecosystem around such a chip is even more complicated in a world where ARM-based servers are used only in isolated cases. From the very start, Qualcomm has been serious not only about the processors themselves but also about the ecosystem and support from third parties (Facebook was one of the first companies to back Qualcomm’s server efforts). In 2015, Qualcomm teamed up with Xilinx and Mellanox to ensure that its server SoCs are compatible with FPGA-based accelerators and data-center connectivity solutions (the fruits of this partnership will likely emerge in 2018 at the earliest). The company then released a development platform featuring its custom 24-core ARMv8 SoC and made it available to customers and various partners among ISVs, IHVs and others. Earlier this year the company co-founded the CCIX consortium to standardize interfaces for special-purpose data-center accelerators and make certain that its processors can support them. Given all the evangelization and preparation work that Qualcomm has disclosed so far, it is evident that the company is very serious about its server business.

From the hardware standpoint, Qualcomm’s initial server platform will rely on the company’s Centriq 2400-series family of microprocessors, which will be made using a 10 nm FinFET fabrication process in the second half of next year. Qualcomm does not name the exact manufacturing technology, but the timeframe points to either Samsung’s performance-optimized 10LPP or TSMC’s CLN10FF (keep in mind that TSMC has a lot of experience fabbing large chips, and a 48-core SoC is not going to be small). The key element of the Centriq 2400 will be Qualcomm’s custom ARMv8-compliant 64-bit core, code-named Falkor. Qualcomm has yet to disclose more information about Falkor, but the important thing here is that this core was purpose-built for data-center applications, which means it should be faster than the cores used inside the company’s mobile SoCs when running appropriate workloads. Qualcomm currently keeps the particulars of its cores under wraps, but it is logical to expect the developer to increase the frequency potential of the Falkor cores (versus mobile ones), add support for an L3 cache and make other tweaks to maximize performance. The SoCs support neither simultaneous multi-threading nor multi-socket configurations, hence boxes based on the Centriq 2400-series will be single-socket machines able to handle up to 48 threads. The core count is an obvious promotional point that Qualcomm is going to use against competing offerings, and it is naturally going to capitalize on the fact that it takes two Intel multi-core CPUs to offer the same number of physical cores. Another advantage of the Qualcomm Centriq over rivals could be the integration of various I/O components (storage, network, basic graphics, etc.) that are currently handled by the PCH or other chips, but that is something the company has yet to confirm.
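On a part without SMT, the logical CPU count simply equals the physical core count. As a hedged illustration (not Qualcomm tooling), the distinction can be read from Linux's CPU topology: `lscpu -p=CPU,CORE,SOCKET` emits one line per logical CPU, and counting distinct core IDs per socket separates physical cores from SMT threads. The sample strings below are hypothetical topologies, not measurements of any real chip.

```python
def topology(lscpu_lines):
    """Return (logical_cpus, physical_cores, sockets) from
    `lscpu -p=CPU,CORE,SOCKET`-style output."""
    cpus, cores, sockets = set(), set(), set()
    for line in lscpu_lines.strip().splitlines():
        if line.startswith("#"):       # lscpu -p prefixes header lines with '#'
            continue
        cpu, core, socket = line.split(",")
        cpus.add(cpu)
        cores.add((socket, core))      # core IDs are unique within a socket
        sockets.add(socket)
    return len(cpus), len(cores), len(sockets)

# Hypothetical sample: a 4-core chip without SMT (1 thread per core)...
no_smt = "0,0,0\n1,1,0\n2,2,0\n3,3,0"
# ...versus a 2-core chip with 2-way SMT (4 logical CPUs, 2 cores).
smt2 = "0,0,0\n1,0,0\n2,1,0\n3,1,0"

print(topology(no_smt))  # logical CPUs == physical cores
print(topology(smt2))    # logical CPUs == 2 * physical cores
```

The same logic explains the marketing point above: without SMT, Qualcomm's 48 threads are 48 physical cores, whereas a hyper-threaded Xeon needs only 24 cores to expose 48 threads.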

From the platform point of view, Qualcomm follows ARM’s guidelines for servers, which is why machines running the Centriq 2400-series SoC will be compliant with ARM’s Server Base System Architecture (SBSA) and Server Base Boot Requirements (SBBR). The former is not a mandatory specification, but it defines an architecture that developers of OSes, hypervisors, software and firmware can rely on. As a result, servers compliant with the SBSA promise to support more software and hardware components out of the box, an important thing for high-volume products. Apart from giant cloud companies like Amazon, Facebook, Google and Microsoft that develop their own software (and that are evaluating Centriq CPUs), Qualcomm targets traditional server OEMs like Quanta or Wiwynn (a subsidiary of Wistron) with the Centriq, and for these companies software compatibility matters a lot. That said, Qualcomm’s primary targets are the large cloud companies; traditional server makers do not have their Centriq samples yet.

During the presentation, Qualcomm demonstrated Centriq 2400-based 1U single-socket (1P) servers running Apache Spark, Hadoop on Linux, and Java: a typical set of server software. No performance numbers were shared, and the company did not open up the boxes so as not to disclose any further information about the CPUs (e.g., the number of DDR memory channels, the type of cooling, supported storage options, etc.).

Qualcomm intends to start selling its Centriq 2400-series processors in the second half of next year. It typically takes server platform developers about a year to polish their designs before they can ship, so normally it would make sense to expect Centriq 2400-based machines to emerge in the second half of 2018. But since Qualcomm wants to address operators of cloud data-centers first, and companies like Facebook and Google develop and build their own servers, those customers do not have to extensively test the chips across a range of applications; they only have to make sure the chips can run their own software stacks.

As for the server world outside of the cloud companies, it remains to be seen whether the industry will bite on Qualcomm’s server platform, given the lukewarm welcome for ARMv8 servers in general. For these markets, performance, compatibility, and longevity are all critical factors in adopting a new platform.


Source: Qualcomm


  • name99 - Monday, December 19, 2016 - link

    "ARM has a longer code to run."
    I assume this is supposed to mean that ARM has lower code density. Except that this is wrong.
    ARM 32-bit using Thumb2 has better code density than x86-32.
    ARMv8 Aarch64 has better code density than x86-64.
    These are both academically verified numbers.
  • beginner99 - Saturday, December 17, 2016 - link

    Krysto, you are only half right. The article compared Tegra 3 on 40 nm vs Clover Trail on 32 nm. So yes, it wasn't fair, but it was not planar vs tri-gate.

    ARM server SoCs will only be able to compete if they get close to Intel's single-threaded (ST) performance. The thing is, you don't really need tons of slow cores; fewer, faster cores are actually a lot better. Why? Fast cores can easily be shared by assigning more vCPUs than are physically available across the virtual machines. Besides that, VMs make management a lot easier, especially if you are running dev, test, and prod servers: just clone them instead of doing a complete physical reinstall...

    What also often gets forgotten isn't just raw throughput but latency for the end user. If you have a complicated web service that needs some serious grunt, the CPU with faster ST performance wins. And with VMs, all the applications running on that server profit from that fast ST speed, a latency advantage compared to many slow cores.

    Then there is also the per-core licensing of certain software, which also hurts slow cores. So it's a very steep uphill battle for ARM servers.
  • serendip - Saturday, December 17, 2016 - link

    Wouldn't a lot of distributed stuff like Hadoop run better on more but slightly slower cores? Qualcomm isn't targeting typical hosting companies running VMs for the initial rollout; it's going for the big cloud providers who run their own custom software stacks. Some things respond well to throwing lots and lots of low-power cores at them, as long as there's plenty of shared RAM and fast interconnects.
  • deltaFx2 - Saturday, December 17, 2016 - link

    If the only thing you ran was Hadoop, then that _may_ possibly be true. Most data centers would run things other than Hadoop as well, to utilize the server to near-100%. It's important to remember that the more you parallelize, the more you pay in overhead. Also, a parallel workload is only as fast as its slowest thread, so at some point single-threaded performance will show up. And then there's Amdahl's law.
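The Amdahl's law point made in the comment above can be sketched in a few lines of Python (the parallel fraction p used here is a hypothetical figure for illustration, not a measurement of any workload):

```python
def amdahl_speedup(p, n):
    """Amdahl's law: speedup on n cores when a fraction p of the
    work parallelizes and the remaining (1 - p) stays serial."""
    return 1.0 / ((1.0 - p) + p / n)

# Even a 95%-parallel workload tops out well below 48x on 48 cores,
# because the serial 5% dominates once p/n becomes small:
for n in (1, 8, 48):
    print(n, round(amdahl_speedup(0.95, n), 1))
```

At n = 48 the speedup is only about 14x, which is why a single slow serial phase can erase the advantage of a high core count.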
  • Kevin G - Monday, December 19, 2016 - link

    @deltaFx2
    It depends on the data center and redundancy. Running servers at 100% is not actually advised, as that doesn't leave room for failover in application clusters. If you have two nodes in a cluster, then it is advised to keep each node under 50% load so that if one dies, the other can handle the additional workload without issue.

    With turbo functionality being common place, this actually works out slightly better than a single server at 100% load due to the marginally higher clocks obtainable at 50% load.
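The headroom rule in the comment above is simple arithmetic: a cluster of N nodes that must survive f failures can safely load each node to (N - f)/N. A minimal sketch (the cluster sizes are hypothetical examples):

```python
def safe_utilization(nodes, failures_tolerated=1):
    """Max per-node load such that the surviving nodes can absorb
    the work of the failed ones: (nodes - f) / nodes."""
    assert nodes > failures_tolerated, "cluster cannot survive that many failures"
    return (nodes - failures_tolerated) / nodes

print(safe_utilization(2))  # 0.5  -> the 50% rule for a two-node cluster
print(safe_utilization(4))  # 0.75 -> larger clusters waste less headroom
```

This also shows why larger clusters are more efficient: the reserved headroom shrinks as N grows, which interacts with the turbo point made above (lightly loaded cores clock higher).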
  • FunBunny2 - Saturday, December 17, 2016 - link

    -- Wouldn't a lot of distributed stuff like Hadoop run better on more but slightly slower cores?

    2 points:
    a) there are only a handful of embarrassingly parallel user-state problems, so massive core counts only solve such problems; web servers being the most likely
    b) Intel CPUs run on RISC-like hardware internally; one might expect it to look at least as refined as ARM's, since Intel has been building x86 "decoding"-to-RISC CPUs for nearly two decades. Caching the decoded RISC-like instructions makes an x86 run much like (or just like) an ARM CPU.
  • patrickjp93 - Saturday, December 17, 2016 - link

    No. The overhead of broadcasting data to more nodes, and of launching more threads on more cores, adds up very quickly, more quickly than Gustafson's law and Amdahl's law predict scaling does, even in the best case. It's one major reason IBM sticks with a scale-up core design philosophy.
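The claim above, that real overhead outpaces what the textbook laws predict, can be illustrated with a toy model that adds a per-core coordination cost to Amdahl's formula (the parallel fraction and overhead constant below are hypothetical, chosen only to show the shape of the curve):

```python
def speedup_with_overhead(p, n, c=0.0):
    """Amdahl-style speedup with a coordination cost c*n added to the
    critical path; with c > 0, speedup peaks and then declines."""
    return 1.0 / ((1.0 - p) + p / n + c * n)

# Without overhead (c=0), more cores always help a 99%-parallel job.
# With even a tiny per-core cost, going from 48 to 96 cores hurts:
print(round(speedup_with_overhead(0.99, 48, c=0.001), 1))
print(round(speedup_with_overhead(0.99, 96, c=0.001), 1))
```

The non-monotonic curve is the scale-up argument in miniature: past the peak, extra cores cost more in coordination than they return in parallel work.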
  • Wilco1 - Sunday, December 18, 2016 - link

    "the CPU with faster ST performance wins". Note Xeons don't have the fastest ST performance. High-end Xeons typically have a 2GHz base frequency and can just about reach 3GHz for a single thread. Note 48 cores is similar to 48-thread Xeons (which support 8 sockets), so clearly there are markets for lots and lots of medium-performance threads.

    So ARM servers only need to get close to 2GHz Xeon performance to beat high-end Xeons. And that's a much lower barrier than you suggest.
  • serendip - Sunday, December 18, 2016 - link

    My initial reasoning as well. The way Qualcomm talks about the chip, they're not concerned about the single-threaded performance of a few big cores; they'd rather focus on a lot of medium-performance cores. If this chip eats into Xeon territory, Intel should be very worried.
  • extide - Saturday, December 17, 2016 - link

    Did you even read the article? First of all, TSMC never had a 32 nm generation, and no ARM chips ever made it to market on 32 nm as far as I know. This was a first-gen Atom he was testing with, so the Atom was 32 nm, but the ARM cores were 40 nm, the best available at the time.
