Analyzing Falkor’s Microarchitecture: A Deep Dive into Qualcomm’s Centriq 2400 for Windows Server and Linux

Author: Dr. Ian Cutress

by Ian Cutress on August 20, 2017 11:00 AM EST

The Duplex and Power Management

As with many processors on the market, design companies use building blocks to assemble their complete processors. Equip those blocks with the right protocols, put them together, optimize, and you create an advanced piece of sand that can decompress cat gifs if we prod it in the right way. Qualcomm’s main building block at the SoC level is the Falkor duplex, which contains two Falkor cores, a shared L2 cache, and QSB/fabric connectivity, and which represents the lowest level of power management.

SoC design followers might look at this design and see similarities with other dual-core designs such as AMD’s original Bulldozer design from 2011 or Intel's Xeon Phi. Internally, the cores are completely separate in terms of instruction throughput, with no shared resources before the L2 cache. Consequently, between the two ends of the spectrum, Falkor is much closer to a Xeon Phi dual-core module, where each core has its own set of execution ports and vector extensions but shares an L2 cache and network connectivity.

But before diving into the cores, the L2 cache and power control require some explaining.

The L2 cache is unified between both cores, has ECC support, and is inclusive of the L1-Data caches on both. Accesses are 128-byte interleaved with 128-byte lines, with 32 bytes per direction per interleave per cycle and 8-way associativity. ECC uses the SEC-DED (single error correction, double error detection) method, and the overall result is a minimum 15 cycle latency for an L2 hit, which is very competitive in the market. Qualcomm isn’t stating the size of the L2 cache at this time, which is somewhat of a surprise. In the market we see a variety of L2 cache options, so Qualcomm might end up offering a series of processors with different amounts of L2, especially if L2 defects are a factor in the manufacturing.
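To make the SEC-DED idea concrete, here is a minimal Python sketch of an extended Hamming code over 4 data bits. The 4-bit word size and the function names are assumptions chosen for readability; Qualcomm has not described the code it actually uses, and hardware caches typically protect much wider words. The principle is the same: a single flipped bit is located and repaired, while two flipped bits are flagged as uncorrectable.

```python
def encode(nibble):
    """Encode 4 data bits into an 8-bit extended Hamming (SEC-DED) codeword."""
    d = [(nibble >> i) & 1 for i in range(4)]  # d[0] is the least significant bit
    word = [0] * 8                             # index 0 holds the overall parity bit
    word[3], word[5], word[6], word[7] = d     # data bits sit at non-power-of-two positions
    word[1] = word[3] ^ word[5] ^ word[7]      # parity over positions with bit 0 set
    word[2] = word[3] ^ word[6] ^ word[7]      # parity over positions with bit 1 set
    word[4] = word[5] ^ word[6] ^ word[7]      # parity over positions with bit 2 set
    word[0] = sum(word[1:]) % 2                # makes the whole codeword even parity
    return word


def decode(word):
    """Return (nibble, status); nibble is None when a double error is detected."""
    word = list(word)
    syndrome = 0
    for pos in range(1, 8):
        if word[pos]:
            syndrome ^= pos                    # XOR of set positions is 0 for a valid word
    odd_parity = sum(word) % 2 == 1
    if syndrome == 0 and not odd_parity:
        status = "ok"
    elif odd_parity:
        # An odd number of flips: treat it as a single error and repair it.
        # A syndrome of 0 here means the overall parity bit (index 0) flipped.
        word[syndrome] ^= 1
        status = "corrected"
    else:
        # Even parity with a non-zero syndrome: two bits flipped, not correctable.
        return None, "double-error"
    nibble = word[3] | (word[5] << 1) | (word[6] << 2) | (word[7] << 3)
    return nibble, status
```

Flipping any one element of `encode(n)` and passing the result to `decode` returns the original value with the status `"corrected"`; flipping any two returns `None` with `"double-error"`.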

For power control, Qualcomm uses this unified design to control both cores. During our briefing we were told that both cores have to share the same frequency for L2 consistency; however the voltage per core can be adjusted and optimized for the best power implementation. As a result, power states between the cores can vary, and depending on the workflow needed, the cores and the L2 can also have different power states.
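The constraint described here (one frequency per duplex, one voltage per core) can be sketched as a small data model. This is only an illustration: the class and method names are hypothetical, and the default values of 2.0 GHz and 1.0 V are placeholders rather than confirmed operating points.

```python
from dataclasses import dataclass, field


@dataclass
class Duplex:
    """Hypothetical model of one Falkor duplex (all names are illustrative)."""
    frequency_ghz: float = 2.0  # a single clock shared by both cores
    core_voltage: list = field(default_factory=lambda: [1.0, 1.0])  # one entry per core

    def set_frequency(self, ghz: float) -> None:
        # Frequency is a duplex-level property, so both cores change together.
        self.frequency_ghz = ghz

    def set_core_voltage(self, core: int, volts: float) -> None:
        # Voltage is a per-core property and can differ between the two cores.
        if core not in (0, 1):
            raise ValueError("a duplex has exactly two cores: 0 and 1")
        self.core_voltage[core] = volts

    def core_operating_point(self, core: int) -> tuple:
        return (self.frequency_ghz, self.core_voltage[core])


duplex = Duplex()
duplex.set_core_voltage(1, 0.9)  # only core 1 is affected
duplex.set_frequency(1.5)        # both cores are affected
```

After these calls, core 0 reports `(1.5, 1.0)` and core 1 reports `(1.5, 0.9)`: the frequency moved for both, the voltage for only one.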

The cores in the duplex are powered by a block head switch or a low-dropout regulator (LDO), depending on the requirement. This allows for a variety of power down modes for the core logic, registers and caches:

Light Sleep: CPU clock is gated/lowered
Voltage Retention: Registers and caches retain state, logic is effectively off
Register Retention: Registers retain state using main chip power rail, caches are off
Collapse: Registers and L1 state not retained
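One way to read the four modes is as a table of what each one preserves, from which the deepest usable mode can be picked. The Python sketch below is an assumption-laden illustration: the `PowerState` type and `deepest_state` helper are hypothetical names, and the entry for Light Sleep (logic still powered, all state retained) is inferred from the fact that only the clock is gated or lowered. It does not describe how Qualcomm's hardware state machines actually choose a mode.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PowerState:
    name: str
    logic_powered: bool
    registers_retained: bool
    caches_retained: bool


# Ordered from shallowest to deepest, following the list above.
STATES = [
    PowerState("Light Sleep", True, True, True),  # assumption: only the clock changes
    PowerState("Voltage Retention", False, True, True),
    PowerState("Register Retention", False, True, False),
    PowerState("Collapse", False, False, False),
]


def deepest_state(need_registers: bool, need_caches: bool) -> PowerState:
    """Pick the deepest state that still preserves everything the caller needs."""
    chosen = STATES[0]
    for state in STATES:
        if need_registers and not state.registers_retained:
            continue
        if need_caches and not state.caches_retained:
            continue
        chosen = state
    return chosen
```

For example, `deepest_state(True, False)` selects Register Retention, while `deepest_state(False, False)` selects Collapse.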

The power control also maintains the state of the L2 cache, which offers modes similar to the CPU and may clock gate completely even with the CPUs in use. We confirmed that the L2 cache can only be on or off, and not in a half-use state.

Overall for power, Qualcomm is also implementing hardware state machines to enable quick entry to and exit from low power modes. Qualcomm explained that CPU use in data centers can be very low depending on time of day and requirements, so the ability to save power and wake quickly was a fundamental design aspect for this chip, with the aim of reducing the electricity bill.

We know that these chips are built on a 10nm process, and when questioned Qualcomm stated that they will run above 2.0 GHz, taking in about 1.0 V to do so, while also being competitive in performance per watt. Unfortunately, requests regarding TDP were returned with 'competitive for a data center environment'.

41 Comments

tipoo - Sunday, August 20, 2017 - link
Big ARM server CPUs will be interesting. The ISA is very sane and scalable, if the investment and demand was there it would have no issue getting to where large x86 cores are, the ISA was never the limit.

Then we can see if they can actually exceed them.
Kevin G - Sunday, August 20, 2017 - link
This makes me wish that Apple would license their cores to 3rd parties. Recent Apple cores are getting very close to where x86 lies per clock and they've certainly exceeded x86 in performance/watt in the ultra mobile space (granted Intel's last round of ultra mobile chips was flat out cancelled, skewing such a comparison).

Considering Apple's work in ultra mobile, I find it believable that a higher performance per clock design in the server space is feasible for an ARM design. A company with enough resources just needs to do it.
iwod - Sunday, August 20, 2017 - link
If the leaked numbers for A11 were true then Apple may have exceeded the performance / clock against Intel x86 as well.

While Apple are highly unlikely to ever license their cores out, I wish they could use those cores and make an Xserve server comeback.
peevee - Monday, August 21, 2017 - link
XServe died because of their own OS. Nobody is interested in anything but Linux (and sometimes a little Windows).
But they could have sold it with Linux though.
Dr. Swag - Sunday, August 20, 2017 - link
Apple never will though, since it's Apple we're talking about. They keep their tech to themselves to give themselves the advantage.
name99 - Sunday, August 20, 2017 - link
The only benchmarks that exist are geekbench4 and the browser benchmarks against Apple laptop hardware. By THOSE benchmarks A9X matched Intel in IPC and A10X exceeds by around 15%.

This is clearly an area that draws out the crazies in full screaming mode because a lot of assumptions have to be made (for example the most realistic assumption is that the high-end Intel scores occur at the maximum turbo frequency, but the crazies will insist that, no, you have to normalize to the baseline intel frequency for that particular CPU). Or you get the insistence that the ONLY measurement that matters is against SPEC2006 compiled with icc, which runs into the issues that icc has MASSIVE effects on SPEC; and that no SPEC numbers in any form exist for the A10/A10X.

At the end of the day, it boils down to "what is your goal?" If your goal is an honest comparison of the two processor families, the best data available suggests the summary I gave. If your goal is "my CPU can beat up your CPU" then all the data in the world presumably won't change your mind, and the best data of all is non-existent data (like the certain claims as to how the A10X would or would not behave on SPEC2006).

Final point. It is not at all implausible, IMHO, that Apple have a plan, and have already started proceeding down it, for ARM in their data centers. After all, why not? It saves them money, it allows them to run at their pace not Intel's (eg install AI or compression or encryption accelerators as they need them) and provides better security (both security through obscurity and not having as large an attack surface as Intel).
But why would they talk about it? Apple says nothing ever, unless they have to. No way they would advertise to their competitors the extent to which they have comparative advantage through use of their own data warehouse chips (for at least some purposes).
zodiacfml - Monday, August 21, 2017 - link
Not so fast. Apple's SoC's are huge in die size which is the reason for their performance. They are as big or bigger than Intel Core. The best part for comparison are the Core M parts. There is little or no business for Apple to do this. There are rumors using Apple SoC on a Macbook Air but that will make little sense as they will need to port OSX to ARM. Again, that is not a good idea as Macbook Pro nor the Mac Pros will continue with OSX.
cdillon - Monday, August 21, 2017 - link
Apple has already ported OSX to ARM, and they call it "iOS". It's not going to be as big a deal as you think to get OSX as we know it to run in ARM. Not only that, but they already have experience with juggling two processor architectures (PPC and x86) at the same time in one OS.
extide - Monday, August 21, 2017 - link
And 68k to PPC, back in the day
name99 - Monday, August 21, 2017 - link
Apple's SoCs are not huge, neither are their cores.
The iPhone SoCs tend to hover around 100 to 120mm^2, the iPad SoCs sometimes reach 150, though the A10X is below 100.
The cores are a few mm^2. Eyeballing it, I'd say the entire CPU complex (2 large cores, two small cores, and L2) is about 12mm^2. This is substantially larger than ARM cores (four A73s+their L2 in the same process technology would fit in 8mm^2) but substantially smaller than Intel (an Intel core these days runs at around 8mm^2 in Intel's 14nm).
