Soft Machines

To put it succinctly, having a thread take resources from multiple cores - when the performance can be extracted - sounds like the long-desired solution to the problem making multi-core designs more useful in lightly-threaded scenarios. Having multiple threads use resources on a single core on the same clock cycle is an even bigger leap in the same direction. Now obviously Soft Machines didn’t come up with this overnight.

Soft Machines came out of stealth mode at the 2014 Linley Conference. Their main goal was to increase performance-per-watt using better IPC designs, which is often one of the better ways if you can keep a design fed with data. One big challenge to this is that IPC has been somewhat flat these past few years - we're seeing small sub-10% yearly increases from the big players using standard designs. Soft Machines were already six years old at the time, with $150M+ raised from investors that include Samsung Ventures, GlobalFoundries, AMD, Mubadala and others (with another $25M since). If those names all seem interlinked, it’s because they all have historic business or investment dealings with each other (AMD/GloFo, Samsung/GloFo, AMD/Mubadala etc.). The team at Soft Machines is 250+ strong, with ex Intel, ex Qualcomm, ex AMD engineers on staff from processor design to platform architects. Half the staff is currently located in California.

At the 2014 conference, aside from explaining what they were doing, Soft Machines also exhibited working silicon of their design. The first generation proof of concept was fabbed at 28nm at TSMC and running at 500 MHz.

It seems odd to say that it was done at TSMC, especially with Samsung and Global Foundries as investors. We were told that this was due to timing and positioning with IP more than anything else, and the same is true for the next generation at 16nm FF+, rather than 14nm.

VISC and Roadmaps

The first generation chip wasn’t perfect – there were some design flaws in silicon that required specific workarounds relating to cache flushing and various methods, but at the time it was compared to a single thread Cortex A15 running at a similar frequency in a Samsung processor. The results with SPEC2000, SPEC2006, Denbench and Kraken gave a corresponding IPC relative to A15 of 1.5x to 7x, or as Soft Machines likes to put it: 3-4x "on average." It was estimated that access to a second physical core improves performance by an average of 50-60%, or an average IPC of 1.3 per core compared to 0.71 for Cortex A15, which explains the 3-4x average.

The roadmap for Soft Machines put their second generation VISC core, Shasta, in line for 2016. It was formally announced at the 2015 Linley Conference, with this month’s announcement being more about availability for licensing on 16FF+. The Shasta core on this node is designed as a 2C/2VC design, or two of these can be put together using a custom protocol interconnect to form a dual 2C/2VC design.

The custom interconnect fabric here is capable of over 200 GB/s, although in current designs only a single interface is present, allowing only two chips to be connected.

The dual processor design is going to be part of the Mojave IP as a fully integrated SoC.

Along with the requisite VISC cores, the Mojave SoC includes PowerVR graphics, a DDR4 memory controller, virtualization management, a PCIe root complex capable of eight lanes of PCIe 3.0, USB ports, support for SATA, UFS, OpenCL 2.0 and other standards.

Looking forward, Soft Machines would like to see production move to 10nm in 2017 to take advantage of further power and area scaling. Meanwhile along that same timeframe they also want to expand the Shasta design to allow for four virtual cores per two physical cores, essentially allowing more threads to be in flight at one time and fully use the resources better. 2018 sees the move to four physical cores and eight virtual cores per design, while still supporting SMP and SoC designs as well.

Dealing with Guest ISAs and a Translation Layer Show Me the Proof
Comments Locked

97 Comments

View All Comments

  • xdrol - Saturday, February 13, 2016 - link

    The Cruzoe was (and Denver is) a VLIW design, it needed software translation to run *anything*, telling what pipeline ports to schedule (a hard optimization problem). Here the translation is supposedly just an ARMv8 to internal ISA mapping, scheduling is still done by hardware like with a normal superscalar design.
  • Jtaylor1986 - Friday, February 12, 2016 - link

    Excellent article Ian. Thanks
  • jjj - Friday, February 12, 2016 - link

    1 more thing.
    Any clue about thermal management? Can they turn off individual physical cores or they just lower clocks? Being able to do both would be interesting.
  • matt321 - Friday, February 12, 2016 - link

    This would make sense for someone like Apple to buy/invest/license the technology for their own processor development. They could have common cores with translations for both ARM and x86 (for iOS and OS X respectively) with the long-term goal of migrating completely to VISC ISA.
  • extide - Friday, February 12, 2016 - link

    This is interesting, because I have thought of doing a processor design somewhat like this for a long time. Remember when BD was coming out, there were rumors of "reverse Hyperthreading" well this is kinda that.

    I had thought that someone should make a suuuper wide cpu, like 20 or 30 wide, put TONS of execution resources on it, and then put a bunch of hyperthreads. That way a single thread could use all 20-30 execution resources, if possible, or you could have multiple threads sharing all that. Like instead of a quad core, with 2 threads/core have like a super core with 8+ threads, and then maybe a couple of those.
  • extide - Friday, February 12, 2016 - link

    Although, I had always thought that engineers had thought of this already, and that maybe it was a bad idea due to some reason I don't understand, and that's why we haven't ever seen a design like that. Well, this is pretty similar to my idea, except they aren't making a super core, they are allowing a thread to use resources from several cores, if it needs.
  • Exophase - Friday, February 12, 2016 - link

    The problem is that going wider decreases efficiency and slows down critical paths. So the processor that's N * 2 wide will have to be a lot slower and/or less efficient than the one that's N wide. If software can rarely extract enough parallelism to go beyond N wide then the N * 2 wide version will almost always be worse. There's a good balance point to be found here.

    Some components in the CPU even scale worse than linearly as they increase in width. The wiring can increase quadratically or even exponentially.

    In practice, a lot of the code that you could realistically extract a ton of ILP from is the type of code that's easiest to vectorize or thread (and a lot of vector + thread friendly can run well on GPUs). What remains, outside of some benchmarks anyway, is mostly a lot of code that has fairly limited ILP due to eventually hitting mispredicted branches or from very long dependency chains. Branch mispredictions are particularly bad on a CPU that has a ton of instructions in flight due to being very wide because that much more energy is wasted on failed speculation.
  • Oxford Guy - Friday, February 12, 2016 - link

    So why wasn't Prescott really great (narrow and deep) versus the G5 (very wide and shallow)?
  • Exophase - Saturday, February 13, 2016 - link

    It's like I said, "there's a good balance point to be found here."

    Faster clocks need higher voltage which scales super-linearly with power consumption. They require longer pipelines which have worse branch misprediction penalties. They take more cycles to talk to other components that don't scale with CPU clock like RAM. More transistors (more space, power) are thrown at these things to try to compensate, like better branch predictors and more reordering, more aggressive prefetching, etc.

    So there's a balancing act between two extremes and what makes the most sense will depend on the manufacturing process, target market and various other things.

    G5 was actually not very wide and shallow anyway. It was a 2.7GHz processor in 2003 and was supposed to hit 3GHz. It had a 16-21 stage pipeline with up to > 200 instructions in flight. That's not shallow at all. 4 wide decode with 2x ALU + 2x L/S is not really that wide either.
  • AlexTi - Friday, February 12, 2016 - link

    If algorithm is developed which can split current single-threaded code into "threadlets", which can be run in parallel, why can't it be used in compilers to make multi-threaded code to run on existing architecture? Especially in enviroments which use JIT?

Log in

Don't have an account? Sign up now