But 16FF+ Silicon Exists

One of the salient points of our talk with Soft Machines was that silicon speaks louder than simulations. Their CTO was very candid and said as much before I even had the chance to. The 28nm design was shown in 2014 and data was provided, but no 16FF+ design had since been made public. Soft Machines was happy enough to share with us that it does have the 16nm core design at HQ being examined:


16nm Silicon of a Shasta design

This is a test chip of cores rather than a full SoC, and Soft Machines is currently running the correlation between simulation and silicon data. We were told that the design errors in the 28nm silicon, such as caches not flushing properly, have been fixed. The new silicon also includes power plane management, although customers are welcome to apply their own power plane adjustments.

The goal, according to Soft Machines' numbers, is to provide a Shasta core on an optimized 16nm FF+ process at 2GHz and around 2W. The company also aims to scale the design from SoC to server, covering a range of 0.5W up to 5W per core. With only one early-run 16FF+ partial SoC currently at headquarters, it remains to be seen whether that is possible, and it will require a partner or investor willing to get their hands dirty with the technology first.

Before someone jumps up and asks "is platform XYZ going to use VISC?", it should be fairly obvious from public roadmaps covering the next 1-2 years that major platforms will not be using VISC. What we see on those roadmaps is a mix of ARM and x86, and the fact that VISC is a different ISA under the hood (one that can run native VISC code without translation) means that an ecosystem change is required. With its announcement last week, Soft Machines is at this point principally fishing for clients, investors, and potentially something more.

The reason this design has attracted so much attention from the media and analysts is its potential. Having many lightweight cores that can share resources between threads would be a major milestone in semiconductor design and the next point in the CISC/RISC lineage. It embodies the idea of having all the hardware work on a task no matter what that task is: many slower, power-efficient cores can gang up on a single thread, standing in for one fast but inefficient high-power core. If you can spare the die area and have a good ISA translation layer, this frees up some of the power budget in a power-limited device. Discussion around laptops and smartphones is all about power, although Soft Machines believes this can impact servers just as easily.

Arguably, future processors will have to do something like VISC in order to get better IPC – when a thread needs a large, wide core, a VISC design can become one on demand. We already have semiconductor designs that work very well on prepared data – vector calculations and graphics are handled by thousands of small, simple cores. But those only work when the data is uniform and the same calculation is applied to every data point; with a VISC design, the code can be complex with dependencies, and the virtual cores will shrink or expand as needed. Plenty of questions about the translation layer are to be expected: whether it can be watertight when other ISAs are passed through (ARM to VISC, x86 to VISC), and whether it can take advantage of compiler optimizations as SMI claims.
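
To make that distinction concrete, here is a minimal C sketch (purely illustrative; the function names are ours, not Soft Machines code). The first loop is the uniform, independent work that thousands of simple GPU-style cores already handle well; the second carries a dependency from one iteration to the next, the kind of complex dependent code that resists GPU-style parallelism and that VISC claims to address by resizing its virtual cores.

    /* Uniform, independent work: every iteration stands alone, so
       thousands of simple cores or SIMD lanes can run it in parallel. */
    void scale(float *out, const float *in, float k, int n) {
        for (int i = 0; i < n; i++)
            out[i] = in[i] * k;
    }

    /* Dependent work: each iteration needs the previous result, so
       extra cores and SIMD lanes sit mostly idle; performance is
       bounded by the dependency chain rather than by core count. */
    float filter(const float *in, int n) {
        float acc = 0.0f;
        for (int i = 0; i < n; i++)
            acc = 0.5f * acc + in[i];   /* loop-carried dependency */
        return acc;
    }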

As it stands the design promises a lot, but until we see a proper silicon implementation it is hard to judge, and that may not happen until a company in the technology ecosystem decides to take that step. It would be an interesting differentiation point for sure, but it requires investment to reach mass production. That makes a number of analysts wary and conservative, with good reason, especially given the assumptions behind that data graph.

Soft Machines has invited us to its offices the next time I'm in the Bay Area, an invitation I will probably take up.

Sources:

Soft Machines
Microprocessor Report
2014 Linley Conference Video
2015 Linley Conference Video

Comments

  • extide - Friday, February 12, 2016 - link

    Because a compiler can only schedule instructions to the CPU's front end. This is scheduling of instructions to different ports on the back end of the CPU. The compiler can't tell the CPU what port an instruction goes down; the CPU picks that. The compiler only gets to pick which instructions are issued, and in what order, and of course modern CPUs can even change that order if they deem it faster to do so.
  • Exophase - Friday, February 12, 2016 - link

    To have any realistic chance of working the threading speculation/detection has to have a large dynamic component (detecting threadlets as they become desirable at runtime) and has to have architectural support for very lightweight thread splitting, merging, and inter-thread communication.

    That can't be provided by compilers targeting existing instruction sets.
  • AlexTi - Friday, February 12, 2016 - link

    Thanks, I think I finally got the point. This looks similar to what the instruction scheduler currently does for execution units in a conventional CPU. Virtualization layer + CPUs would be a kind of very wide core, right?
    As was already noted in the article, making current CPUs wider is problematic and not universally beneficial. So this new engine would have to be much more efficient than current implementations.

    Good thing is that we'll see eventually :)
  • Senti - Friday, February 12, 2016 - link

    Bullshit. I have no idea how technically incompetent writers have to be to reprint that marketing nonsense again and again.

    First of all, this brings absolutely no advantage over the existing fat core + SMT concept. More IPC per core with more pipelines is not done, not because it's hard without that 'virtual cores' nonsense, but simply because there are not enough actually independent instructions that can be automatically extracted from real code in the parts where performance matters.

    "Alternatively, if multiple programs or threads want to use the hardware, then a single core is inaccessible to additional threads while the first thread is still in use (though this can be avoided somewhat by simultaneous multithreading or SMT which will let another thread have access when the first has encountered a stall such as waiting for L1/L2 memory)." - total lies. That describes coarse-grained multithreading which is not very popular atm. For example, Intel HT allows usually 2 threads to execute simultaneously dynamically sharing pipelines of the same core all the time. POWER8 uses 8 'virtual' threads per core.

    Why does no one split instructions from the same thread over several cores (other than the obvious reason that there are not enough independent instructions to split)? To almost quote the text: "cross-core communication adds latency and reduces performance".

    Instruction set emulation? Far from a new concept. Why is it not popular? The reason is very simple: significant overhead. Try translating something non-trivial like AVX/NEON instructions to some generic internal instruction set (see the sketch at the end of this comment).

    Finally, the last point: everyone can draw cute performance graphs and huge numbers in marketing presentations, but how about providing actual working chips for independent reviews of performance and power efficiency on real code?
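
    To put a simplified picture on that SIMD overhead point (an illustration of the general cost, not SMI's actual translation mechanism): a single 256-bit AVX add processes eight packed floats in one instruction, so an internal ISA without an equivalently wide operation has to emit several narrower ops in its place.

        /* One AVX vaddps (eight packed float adds in one instruction)
           re-expressed with scalar adds, as a translator must do if its
           internal ISA lacks a 256-bit vector add: eight operations
           plus loop overhead replace a single instruction. */
        void translated_vaddps(float dst[8], const float a[8], const float b[8]) {
            for (int i = 0; i < 8; i++)
                dst[i] = a[i] + b[i];
        }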
  • vladx - Sunday, February 14, 2016 - link

    Skim the article again, there's a roadmap so let's see how things will go from here.
  • Exophase - Friday, February 12, 2016 - link

    There's another big question with their power measurements. They take the differences between idle and 1C, and between 1C and 2C, to cancel out the static contribution of other peripherals. But this still ignores the dynamic contribution.

    For example, we can look at the Cortex-A72, where ARM claims that one core at 2.5GHz on the TSMC 16FF+ process will consume about 750 mW. In the Kirin 950, the power consumption appears to be about 900 mW at 2.3GHz. Is ARM exaggerating, or is Huawei's implementation inferior to ARM's expectations? The discrepancy can actually be pretty easily explained by losses in the PMIC/VRMs, the SoC's memory controller, and the DRAM - all components which use more power as CPU load increases (rough numbers at the end of this comment).

    This is especially a factor for wall measurements, because those also take in an additional AC/DC converter. While it's possible that Soft Machines included these effects in their power estimations, I doubt it since they didn't mention it, and like ARM it's more practical and beneficial for them to work with core power estimations only.

    So there could easily be another 25+% by which the non-VISC platforms are being penalized.

    Something else that raises a red flag for me is the 16FF+ test chip. There are only 100 pins. Once power, ground, and various control signals are accounted for, that leaves a very small interface either to a memory controller or to memory (if the controller is integrated). Even a single-channel 32-bit interface would be a hard fit. So does this chip really represent both realistic power consumption and realistic performance? I think they're trading one for the other here, and that makes me question the applicability of the power numbers they've given for it.
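
    Rough numbers on the power point above (simple linear frequency scaling only, ignoring voltage differences, which would make the gap even larger):

        /* Back-of-envelope check: scale ARM's 750 mW @ 2.5 GHz claim down
           to the Kirin 950's 2.3 GHz and compare with the ~900 mW measured.
           The remainder is the PMIC/VRM, memory controller and DRAM share. */
        #include <stdio.h>
        int main(void) {
            double core_est = 750.0 * 2.3 / 2.5;                /* ~690 mW */
            double measured = 900.0;                            /* mW      */
            double overhead = (measured - core_est) / core_est; /* ~0.30   */
            printf("core estimate %.0f mW, overhead %.0f%%\n",
                   core_est, overhead * 100.0);
            return 0;
        }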
  • Arnulf - Friday, February 12, 2016 - link

    Since one cannot buy these "scaled" chips, IMHO it'd make more sense for SMI to publish performance-per-watt figures for real hardware and let the market decide whether their concept is attractive enough. Yes, Intel may have a process node advantage, and yes, different CPUs are targeting different performance and power profiles, but at least it's a straight comparison, and if VISC doesn't beat its entire competition on at least one metric then it's destined to fail anyway.

    Oh and the remark in the article regarding "VISC advantage" because of it using twice the number of cores while running a single thread in tests - who cares as long as it comes out on top in performance per watt? If they can beat other CPUs by using more cores, kudos to them!
  • ppi - Saturday, February 13, 2016 - link

    Regarding core count, I would direct you to the recent AT article on Android's usage of multiple cores. A simplified conclusion may be that Android tends to utilise 4 cores pretty well.

    In the real world, this significantly reduces the impact of distributing a single thread over multiple cores.
  • kgardas - Friday, February 12, 2016 - link

    Interesting stuff, but to be honest, combining "simpler" cores into something more complex is also done in software on SPARC64. At least Fujitsu mentions this in one of their Hot Chips presentations for SPARC64 VII. So you have a 4-core CPU with 4-wide cores, and you can combine them in software (compiler) into 8-wide or more depending on your needs for instruction parallelism.
    Another thing is that something like this is IIRC also supported by POWER8, where you have a lot of duplicated resources, but not enough, so in case 1 thread is able to consume all core resources you may switch off the 7 others. IIRC IBM's compilers contain some optimizations for this too.
    It's a pity you have mentioned Itanium only in this negative way. Honestly speaking, the Itanium design was really great, and it's a real pity that Intel stopped developing it and never provided any OoO designs on this architecture. Whether Denver will be successful we will see, but NVidia still counts on it for some designs, which is interesting in light of the fact that they have been using ARM's cores (A57) for some time now and don't need Denver that much. Also, automotive does not care whether Denver is there or not, yet nVidia pushes it there, so I would bet they have a really good reason for it. Perhaps their VLIW is good for some special tasks...
    So to me this whole thing looks like they are on another round of fishing for money.
  • Oxford Guy - Friday, February 12, 2016 - link

    "Honestly speaking Itanium design was really great and really pity that Intel stopped developing it"

    The market disagreed, so if you're right, it's a pity the market dictates product success to such a degree.
