There has been a lot of recent movement in the ARM Server SoC space, which has three major players. The third player, AppliedMicro, has been acquired by MACOM. MACOM has announced that the third-generation 16-nanometer FinFET Server-on-a-Chip (SoC) solution, X-Gene 3, is sampling to "lead customers". Despite all the ARMv8 products released so far, the ARM server world is still maturing and moving forward.

The AppliedMicro X-Gene 3

Back in 2015, we reviewed the 40 nm 8-core X-Gene 1 (2.4 GHz, 45W), which found a home in HP's Moonshot servers. Performance-wise, the SoC was on par with the Atom C2750 (8 cores @ 2 GHz) but consumed twice as much power, which led to an overall negative conclusion in our review. The power consumption issue was understandable: it was baked on a very old 40 nm process. But the performance was rather underwhelming, as we expected more from a 4-issue superscalar processor at 2.4 GHz. The Atom core, by comparison, was only a dual-issue design and offered similar performance at a lower frequency.

Moving forward, we got the X-Gene 2. This was a refresh of the first design, but built on 28 nm. It still ran at 2.4 GHz, but with lower power consumption (35 W TDP) and a smaller die size of around 100 mm². Despite the relatively lackluster CPU performance, the overall efficiency increase meant that the X-Gene 2 did find a home in several appliances where CPU performance was not the top priority, such as switches and storage devices.

MACOM, the new owner of the X-Gene IP, claims that the new X-Gene 3 is a totally different beast. The main performance claim is that it should be more than 6 times faster in SPECintRate than X-Gene 1 or 2. That performance increase is mostly because the new SoC has 4 times as many cores: 32 rather than 8. Besides its 32 ARMv8-A 64-bit cores, X-Gene 3 will also include eight ECC-capable DDR4-2667 memory channels, supporting up to 16 DIMMs (max. 1 TB), and 42 PCIe Gen 3.0 lanes.

MACOM's reference X-Gene 3 platform has everything working at near full speed: all 32 cores are functional and run as fast as 3.3 GHz. The SoC design provides 32 MB of L3 cache through a coherent network, which we are told also runs 'at full speed'. The PCIe, USB and integrated SATA ports all work at full speed as well. Memory is initially limited to 2400 MT/s instead of 2667 MT/s, but considering that the current memory market only offers buffered DDR4 DIMMs at 2400 MT/s, that is not an immediate issue.

That set of specifications is impressive, but if the X-Gene 3 really wants to be a "Cloud SoC", performance has to be competitive. We look forward to testing.

The ARM Competition

The other two players are Cavium and Qualcomm. 

Cavium has been on a buying spree of late, acquiring Broadcom's Vulcan IP and also QLogic, a network/storage vendor. If Cavium can inject all that IP into its Thunder-X server SoC line, its next generation could be a very powerful contender.

Qualcomm will have its 48-core Centriq-2400 SoC ready by the second half of this year, and it will run Windows Server. 

Predicted Performance Analysis: Xeon-D Alternative

The only performance figures for X-Gene 3 we have seen so far are the ones found in a Linley Group white paper that can be accessed here:

Based on testing of the current configuration of 3.0GHz CPU frequency and DDR4-2400, the company expects the chip to deliver a SPECint_rate2006 (peak) score of at least 500 when running at its peak speed of 3.3GHz and DDR4-2667 and with some additional hardware and compiler tuning. 

That benchmark value is the basis for the claim of "6x more powerful than the predecessor". We can somewhat predict how this can be possible, since SPECInt_Rate2006 scales almost perfectly: 32 cores instead of 8 already give us a 4 times increase. In order to get an overall 6x bump in performance then, each core must be (overall, including frequency) about 50% faster. 

Most of the per-core boost will come from the frequency: X-Gene 3 can boost to 3.3 GHz versus 2.4 GHz for X-Gene 2, which translates to a 37.5% increase. The rest of the gains are most likely related to IPC improvements in branch prediction and TLB architecture. All in all, 6 times higher performance is not an outrageous claim, but there are a few snakes in the grass to consider.
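The scaling argument above can be checked with some quick arithmetic. The sketch below is only an illustration: it uses the core counts and clock speeds quoted in this article, assumes (as the text does) that SPECint_rate2006 scales linearly with both core count and frequency, and solves for the per-clock gain that is left over. The function and variable names are our own and do not come from MACOM or the Linley paper.

```python
def required_ipc_gain(target_speedup, old_cores, new_cores, old_ghz, new_ghz):
    """Per-clock (IPC) gain needed to reach target_speedup.

    Assumes throughput scales linearly with core count and frequency,
    which is an idealization rather than a measured property.
    """
    core_scaling = new_cores / old_cores
    freq_scaling = new_ghz / old_ghz
    return target_speedup / (core_scaling * freq_scaling)


core_scaling = 32 / 8              # X-Gene 3 cores vs. X-Gene 1/2 cores
per_core_gain = 6 / core_scaling   # what each core must deliver overall
freq_scaling = 3.3 / 2.4           # peak clock vs. X-Gene 2 clock
ipc_gain = required_ipc_gain(6, 8, 32, 2.4, 3.3)

print(f"core scaling:       {core_scaling:.2f}x")
print(f"per-core gain:      {per_core_gain:.2f}x")
print(f"frequency scaling:  {freq_scaling:.3f}x")
print(f"remaining IPC gain: {ipc_gain:.3f}x")
```

Under these assumptions, cores and clock speed together account for roughly 5.5x, so the claimed 6x only requires a per-clock improvement of around 9%.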

Firstly, MACOM extrapolates from numbers at 3 GHz to 3.3 GHz. Thus the final frequency for the parts is still at the whim of tweaking and optimization, and may result in an increase in TDP over 125W. Also of note is that "additional hardware and compiler tuning is necessary", which is a general term for expected software improvements. While that might turn out to be true, other companies have made similar promises and been unable to deliver, so it is hard to judge the claim until there is some proof.

Last year APM estimated that the new X-Gene 3 would achieve 550 SPECInt_Rate2006 at 3 GHz. That claim has been revised to 500 at 3.3 GHz. 
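To put that revision in perspective, the two estimates can be normalized by clock speed. This is a rough sketch using only the figures quoted above; both scores are vendor estimates rather than measurements, and dividing by GHz assumes the score scales linearly with frequency, which is our simplification.

```python
def score_per_ghz(score, ghz):
    """Estimated SPECint_rate2006 points per GHz of clock speed."""
    return score / ghz


old_estimate = score_per_ghz(550, 3.0)   # last year's estimate at 3 GHz
new_estimate = score_per_ghz(500, 3.3)   # revised estimate at 3.3 GHz
drop = 1 - new_estimate / old_estimate

print(f"{old_estimate:.1f} -> {new_estimate:.1f} points per GHz "
      f"({drop:.1%} lower)")
```

On a per-clock basis the revised estimate is about 17% lower than the earlier one, which is a larger step back than the headline numbers (550 versus 500) suggest.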

The graph above also seems to show SPEC scores run with GCC, as most published scores place the Xeon E5-2580v4 at 669. While we too favor results obtained with GCC as they are more realistic, based on experience we are wary that the graph above could paint a rosy picture of X-Gene's performance.

The Linley Group states:

 “X-Gene 3 can handle a broad range of cloud workloads, including scale-up and scale-out applications. The processor excels on big data, particularly in-memory databases, because of its high memory bandwidth."

The 8-channel, 32-core X-Gene 3 achieves 67 GB/s. It is odd that the paper, written in March 2017, still mentions the use of DDR4-2133. If we compare the results to the typical Xeon scores we have measured in previous reviews, we get the following:

Our testing methodology is described here

Stream Triad

For those of you who are not familiar with Stream: the CPU does not matter much. Once there are enough cores/threads to generate sufficient demand on the memory subsystem, the peak bandwidth numbers are observed regardless of additional cores (see testing done by Dell). In some circumstances adding more cores actually results in a net decrease. So even though we included Intel's top model in the graph above, it shows no bandwidth benefit.
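For readers who have not seen it, the Triad kernel itself is tiny. The following is a minimal pure-Python sketch of the operation and of the way STREAM counts bytes (two reads and one write per element, so 24 bytes per iteration with 8-byte values). It is not the benchmark we ran: real STREAM is compiled C or Fortran, typically multi-threaded, and the figure printed here mostly reflects interpreter overhead. The function names and array size are our own choices.

```python
import time


def triad(b, c, scalar):
    """STREAM Triad kernel: a[i] = b[i] + scalar * c[i]."""
    return [bi + scalar * ci for bi, ci in zip(b, c)]


def triad_bytes(n, element_size=8):
    """Bytes STREAM attributes to one Triad pass over n elements.

    Two arrays are read and one is written, so 3 * n * element_size.
    The 8-byte default treats every element as a double.
    """
    return 3 * n * element_size


n = 200_000
b = [1.0] * n
c = [2.0] * n

start = time.perf_counter()
a = triad(b, c, 3.0)
elapsed = time.perf_counter() - start

# Illustrative only: Python lists are not contiguous arrays of doubles.
print(f"{triad_bytes(n) / elapsed / 1e9:.3f} GB/s (interpreter-bound)")
```

Because the kernel does so little arithmetic per byte moved, a compiled version saturates the memory controllers long before it saturates the cores, which is why extra cores stop helping.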

The 8-channel X-Gene 3 achieves, with 32 cores, somewhere between 24% (compared to the best result of the Xeon with ICC) and 63% better bandwidth than a similar single Xeon system with DDR4 at the same speed. But an Intel system with the same number of memory channels would still be better. For comparison, although not listed on the chart, a single-CPU Power8 system in our testing achieved 91 GB/s thanks to its memory subsystem (using Centaur chips), despite our relatively simple GCC settings and the use of DDR3-1333. The X-Gene 3 bandwidth numbers are vastly superior to those of the X-Gene 1 (19 GB/s, see here), but it is worth noting that X-Gene 1 had only 4 channels using DDR3-1600.
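One way to put these Stream figures in context is to compare them against the theoretical peak of each memory configuration. The sketch below assumes a 64-bit (8-byte) data bus per channel and, importantly, assumes that the 67 GB/s X-Gene 3 result was obtained with the DDR4-2133 mentioned in the paper, which has not been confirmed. The Xeon values are merely implied by the 24% and 63% figures above, not read from the chart. All names are our own.

```python
def peak_bandwidth_gbs(channels, mega_transfers_per_s, bus_bytes=8):
    """Theoretical peak in GB/s: channels x MT/s x bytes per transfer."""
    return channels * mega_transfers_per_s * bus_bytes / 1000


# Assumed memory configurations (see the caveats in the text above).
xgene3_peak = peak_bandwidth_gbs(8, 2133)   # 8 channels, DDR4-2133
xgene1_peak = peak_bandwidth_gbs(4, 1600)   # 4 channels, DDR3-1600

xgene3_efficiency = 67 / xgene3_peak   # measured Triad / theoretical peak
xgene1_efficiency = 19 / xgene1_peak

# Xeon results implied by "24% to 63% better" (derived, not measured here).
xeon_best_implied = 67 / 1.24
xeon_worst_implied = 67 / 1.63

print(f"X-Gene 3: {xgene3_peak:.1f} GB/s peak, {xgene3_efficiency:.0%} achieved")
print(f"X-Gene 1: {xgene1_peak:.1f} GB/s peak, {xgene1_efficiency:.0%} achieved")
print(f"Implied Xeon range: {xeon_worst_implied:.0f} to {xeon_best_implied:.0f} GB/s")
```

If the DDR4-2133 assumption holds, X-Gene 3 would be reaching roughly half of its theoretical peak, which is a better ratio than X-Gene 1 managed but still leaves headroom.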

The X-Gene 3 results are more than respectable, but the Linley paper's statement that 'the processor excels on big data' comes across as an exaggeration without any direct benchmark data to back it up. As always at AnandTech, we like to base our conclusions on hard data, and look forward to being able to verify the claims.

Conclusion

From the announcement and released data, the new X-Gene 3 would appear to be the fastest ARMv8-A server SoC announced so far. The engineers behind X-Gene deserve some applause for their tenacity, and for gradually improving their product to the point where it is a serious threat to the low end and midrange of the Xeon E5 line. But those numbers need to be externally verified.

There are still a number of uncertainties, for sure. The bandwidth numbers are good, but not impressive. The power usage has not been tested, and publishing only SPECInt_rate2006 estimates (that have already been revised downwards) does not by itself guarantee good overall server performance for the platform.

One thing is interesting: the arrival of the X-Gene 3 puts a lot of pressure on Intel's decision to artificially curtail the Xeon D* platform. Intel's fastest Xeon D (D-1587) offers a lot of performance with 16 cores and 32 threads at 2.3 GHz, all inside a low 65W TDP - but the Xeon D has only 2 memory channels, can support only 128 GB of memory, and costs $1754 list price.

*We say curtail because Xeon-D is still based on Broadwell rather than being updated to the latest microarchitecture. Intel's recent release of new networking-focused Broadwell-based Xeon-D parts suggests that an update to the platform might be far off in the distance.

From what we can tell, the X-Gene 3 is rumored to cost less than $1200. At that price, it offers much more memory bandwidth and capacity, given its 8 channels and support for up to 1 TB. So although we have some reservations, we welcome the X-Gene 3 as the cat among the Xeon D pigeons.

Additional Images

While at MWC, Anton was able to score some images of an X-Gene 3 system being demonstrated at the show. Despite it being a mobile show, given the size of ARM's presence, it is perhaps not unexpected to see a system like this on display. The unit was at the Kontron booth, and the date code on the heatspreader puts the manufacturing timing at 2016, week 53.

24 Comments

  • milli - Wednesday, March 15, 2017

    'Back in 2015, we reviewed the 40 nm 8-core X-Gene 1 (2.4 GHz, 45W), which found a home in HP's Moonshot processors. Performance wise the SoC was on par with the Atom C2750 (8 cores @ 2 GHz), but consumed twice as much power, which in our review to an overall negative conclusion.'

    I'm reading the first paragraph and it's already barely readable. Wonder how the rest will be.
    HP's Moonshot processors = HP's Moonshot servers
    ... which in our review to an overall negative conclusion = which in our review led to an overall negative conclusion
  • Ian Cutress - Wednesday, March 15, 2017

    Sorry, I take responsibility for that - I edited the piece last night.
  • Yojimbo - Wednesday, March 15, 2017

    I wonder who will end up buying the X-Gene line, as Macom said they will divest themselves of it.
  • rahvin - Wednesday, March 15, 2017

    Just like the HP Moonshot, no one at all. It'll cost more, use more power and perform less work than a cheaper Xeon if we can go by all the other Arm server attempts. Someone like Facebook might buy a few to serve a need that involves some specialized load but probably not. But ultimately I suspect it'll go the same way all the others have, they'll tape out, get to the test servers, realize no ones going to buy it and then dump it like a hot potato.

    There's a growing pile of ARM server chips that never were at this point. ARM's attempts to develop a real server spec that uses ACPI, UEFI and others might have a chance down the road but all these early attempts have been underwhelming and total failures and I suspect that's going to continue. Ultimately if one of them succeeds Intel will just pull a generation forward 6 months or cut prices so that it never has a chance.
  • webdoctors - Wednesday, March 15, 2017

    Is applied micro's server CPU still relevant? With AMD cores finally competitive, and their bargain basement pricing, most of the motivation for ARM servers is gone, the power advantages is negligible in IO heavy environments, and performance abysmal in other cases.

    Plus with AMD you're getting tried and true x86 compatible performance. The risk to jump to an ARM server now just doesn't make any sense.
  • deltaFx2 - Monday, March 20, 2017

    @ webdoctors: Precisely. The whole noise for ARM servers started as a response to AMD's Bulldozer fiasco. Vendors want an alternative even if they only buy Intel, as a way of getting Intel to price their parts appropriately (See Dell, circa 2005?). I expect a lot of the noise Microsoft is making re. ARM servers is also to this end. Intel has raised prices on server parts following AMD's de facto exit from the market. With Naples, this changes, at least in certain segments of the market. HPC will probably want intel for AVX-256/512, but ARM hasn't a chance in that market yet.

    Here's the other thing: Intel and AMD make the same core for 3 different markets: laptop, desktop, and server. Say what you will about shrinking PC market, it's still a massive market in which to amortize your development costs. Unless ARM vendors can plug the exact same core in multiple markets like Qualcomm using Falcor in both phones and servers, or ARM builds a beefy core for anyone to license, x86 will have the economy of scale that ARM does not. Building custom cores for server will not be profitable. x86 won in servers vs DEC, SPARC, IBM etc because it was able to amortize the cost, and was able therefore to justify more resources into pushing performance/features while keeping costs down. Of the big iron mainframes, only IBM remains (as I understand, Oracle has all but killed SPARC). For how long remains to be seen. Again, volumes is the issue.
  • Krysto - Friday, March 24, 2017

    I think AMD is only competitive against Intel, price-wise. I doubt either of them are competitive against ARM chips in terms of pricing. At least compared to these latest and upcoming ARM server chips.
  • Krysto - Friday, March 24, 2017

    And I'm basing this on the fact that Intel wasn't anywhere near competitive in terms of pricing in the mobile market with ARM. So I imagine this will be the case for servers, too, especially considering that both Intel and AMD will charge premiums for their server chips.
  • deltaFx2 - Saturday, March 25, 2017

    @Krysto: Intel did follow a "contra-revenue" strategy in the mobile market. As I understand it, they were effectively giving it away free. Furthermore, the fallacy here is that the server market is not the same as the mobile market. Server is driven by Total Cost of Ownership (TCO). In a rack, the CPU accounts for ~30% of the cost of acquisition. Datacenters are provisioned for a certain performance; for the sake of this discussion, let's assume it's transactions/second and response/latency. The ARM vendors so far have been throwing lots of weak cores (low IPC) at the problem (not sure what Qualcomm's doing). The problem with this approach is that while it does well at spec int rate, its performance on actual workloads is lacking. To get the same transactions/second, you need to scale up to more cores, which translates into more racks. At some point, your TCO equation gets out of hand, because adding more racks costs more money; even if the CPU were free, it would be cheaper to use Intel. Weak cores may have worse latency than strong cores depending on the workload, and in that case, you're hosed.

    This is precisely why Atom servers and AMD's Bulldozer-based servers have little market share. AMD had huge incentive to price their product competitively and yet, they found themselves effectively priced out of the market. This strategy of many weak cores might get you 1% of the server market, if at all, and be sure that Intel will price Atom servers to undercut any ARM vendor that's a threat. Not to make money, but to keep ARM out of the data center.

    AMD is highly unlikely to charge big premiums for their server chips. They want to grow market share; they've even stated as much. And their cost structure is lower than Qualcomm's or Cavium's as the same design cost gets amortized over multiple markets (see previous post). The way I see it, AMD's resurgence in x86 has pushed the ARM server market back by another 5 years.

    Additional reading: http://www.realworldtech.com/vax-cpu-economics/ You may find this insightful.
  • deltaFx2 - Saturday, March 25, 2017

    PS: As to your comment "I think AMD is only competitive against Intel, price-wise." That's certainly not true. Price alone cannot get you into the server market for reasons mentioned earlier. (AMD tried with Bulldozer). Naples has a lot of things going for it that Intel does not offer: 8-channel vs 4 (Broadwell)/6 (Purley), 128 PCIe 3 lanes in 1P, 64 in 2P (vs intel's 44/88 in 1P/2P), solid power numbers, very competitive single threaded performance*, and doesn't need an external chipset. And, according to reports, AMD claims 170GB/s bandwidth vs the rather paltry numbers posted for X-gene. AMD is a solid option for applications that need a lot of cores, lots of I/O bandwidth, and lots of memory capacity and bandwidth. Intel's 32-core offering is likely to have an eye-watering price, given that unlike AMD, it builds all 32 on the same die, necessitating spectacular yields.

    * Except in AVX-256/512.