But 16FF+ Silicon Exists

One of the salient points of our talk with Soft Machines was the fact that silicon talks louder than simulations. Their CTO was very honest and said this before I even had the chance to raise it. The 28nm design was shown in 2014 with data to match, but no 16FF+ design had been made public since. Soft Machines were happy to share with us that they do have 16nm silicon of the core design at HQ being examined:


16nm Silicon of a Shasta design

This is literally a test chip of cores rather than a full SoC, and they are currently running correlation data between simulation and silicon. We were told that the design errors in the 28nm silicon, such as caches not flushing properly, have been fixed. The new silicon also includes power plane management, although customers are welcome to use their own power plane adjustments.

The goal, according to Soft Machines' numbers, is to provide a Shasta core on an optimized 16nm FF+ process at 2GHz at around 2W. Their roadmap also covers scaling the design from SoC to server, with a target range of 0.5W to 5W per core. Because the only 16FF+ silicon so far is the early part-SoC run at their headquarters, it remains to be seen whether that is possible, and it will require a partner or investor to get their hands dirty with the technology first.

Before someone jumps up and says "is platform XYZ going to use VISC?", it should be fairly obvious from most public roadmaps covering the next 1-2 years that major platforms will not be using VISC. What we see on public roadmaps is a mix of ARM and x86, and the fact that VISC is a different ISA under the hood (which can run native VISC code without translation) means that there has to be an ecosystem change. Soft Machines, with their announcement last week, is at this time principally fishing for clients, investors, and potentially something more.

The main reason this design has attracted so much attention in the media and among analysts is its potential. Being able to have many lightweight cores that can share resources between threads would be a major milestone in semiconductor design and the next point in the CISC/RISC lineage. It epitomizes the idea of having all the hardware working on a task no matter what it is, such that you can have many slower, power-efficient cores working on a single task, or one fast but power-hungry core. If you can spare the die area and have a good ISA translation layer, this opens up some of the power budget in a power-limited device. Much of the discussion around laptops and smartphones centers on power, although Soft Machines believes this can benefit servers just as easily.
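To make the concept a little more concrete, here is a deliberately simplified sketch of the idea: a pool of narrow physical cores is handed out as wider or narrower "virtual cores" depending on how much instruction-level parallelism each thread exposes. This is our own toy illustration, not Soft Machines' actual scheduler; the core counts, thread names, and ILP figures are invented for the example.

```python
# Toy sketch of the "virtual core" idea (not Soft Machines' implementation):
# narrow physical cores are composed into a wider virtual core when one
# thread can use them, or split across several threads when it cannot.
from dataclasses import dataclass

PHYSICAL_CORES = 4        # pool of narrow cores (invented for illustration)
ISSUE_WIDTH_PER_CORE = 4  # each core assumed to be ~4-wide

@dataclass
class Thread:
    name: str
    ilp_estimate: float   # instructions/cycle the thread could plausibly use

def compose_virtual_cores(threads):
    """Greedily grant physical cores in proportion to each thread's ILP."""
    free = PHYSICAL_CORES
    plan = {}
    for t in sorted(threads, key=lambda t: t.ilp_estimate, reverse=True):
        want = max(1, round(t.ilp_estimate / ISSUE_WIDTH_PER_CORE))
        grant = min(want, free)
        if grant == 0:
            break  # out of physical cores; remaining threads wait their turn
        plan[t.name] = grant   # this thread's virtual core is `grant` cores wide
        free -= grant
    return plan

# One demanding thread grabs most of the machine, a light thread gets the rest:
print(compose_virtual_cores([Thread("physics", 12.0), Thread("ui", 2.0)]))
# -> {'physics': 3, 'ui': 1}
```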

Arguably, future processors will have to do something like VISC in order to get better IPC – when a thread needs a large, wide core, a VISC design can provide one on demand. Technically we already have semiconductor designs that work very well on prepared data – vector calculations and graphics are handled by lots of small, simple cores in their thousands. But these only work with regular data, where the same calculation is applied to every data point; with a VISC design, the code can be complex with dependencies, and the virtual cores will shrink or expand as needed. A lot of questions surrounding the translation layer are to be expected: whether it can be as watertight as possible when other ISAs are passed through (ARM to VISC, x86 to VISC), and whether it can take advantage of compiler optimizations as SMI claims.
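As a rough illustration of that distinction (again our own sketch, not code from Soft Machines), the first loop below is the regular, independent work that thousands of simple GPU cores already handle well; the second is a dependency chain that extra cores cannot attack directly. Amdahl's law then caps what the parallelizable portion can buy; the 70% figure at the end is an assumed value chosen for the example.

```python
import random

data = [random.random() for _ in range(1_000_000)]

# Data-parallel: the same independent operation on every element. Iterations
# never depend on each other, so GPUs and wide SIMD units handle this well.
scaled = [x * 2.0 for x in data]

# Dependency chain: every step needs the previous result, so extra cores do
# not help this loop directly. Hardware (VISC included) can only look for
# independent work *around* such chains, not remove them.
acc = 0.0
for x in data:
    acc = acc * 0.99 + x   # acc depends on the previous value of acc

# Amdahl's law caps the overall gain: if a fraction p of the work is
# parallelizable, n cores give at most 1 / ((1 - p) + p / n) speedup.
def amdahl_speedup(p, n):
    return 1.0 / ((1.0 - p) + p / n)

# ~1.54x: a ~50% gain from two cores implies roughly 70% of the work
# can be spread across them.
print(amdahl_speedup(0.7, 2))
```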

As it stands the design promises a lot, but until we see a proper silicon implementation it remains hard to judge, and that may not happen until a company in the technology ecosystem decides to take that step. It would be an interesting differentiation point for sure, but it requires investment to reach utility in mass production. That makes a number of analysts wary and conservative, with good reason, especially given the assumptions made in that data graph.

Soft Machines has invited us to their offices next time I'm in the Bay Area, an invitation I will probably take them up on.

Sources:

Soft Machines
Microprocessor Report
2014 Linley Conference Video
2015 Linley Conference Video

97 Comments

  • xdrol - Saturday, February 13, 2016 - link

    I somehow fail to see why scheduling 2 threads to 2 cores of 4-wide pipelines - including the overhead from 'in-thread' cross-core communication - should be more effective than a 2-thread 8-wide SMT core (aka Skylake - it's not 6-wide, and SMT is fine-grained, threads don't 'wait' like the article suggests).
  • Alexvrb - Saturday, February 13, 2016 - link

    Right! It's like getting the best of narrow and wide designs at the same time. You can go wide or narrow and more/less threads as needed. It'll probably need a lot of OS support to work well. Still, the concept is interesting, and if their translation layer is fast, it could eventually handle legacy well enough.
  • FunBunny2 - Saturday, February 13, 2016 - link

    -- You can go wide or narrow and more/less threads as needed.

    but no known processor (or design algorithm) can create parallelism in serial code. just because a cpu wants to implement ILP to a greater extent than extant processors, it can't make parallel from nothing; it can only discover "hidden" parallel that extant processors are missing. not something I'd bet on.
  • Alexvrb - Sunday, February 14, 2016 - link

    I was talking about the processor itself. The CPU can act like a wide or narrow design on demand. Whether or not a particular piece of code will benefit was not something I was discussing. My point is that where it helps, it can go wider than current designs. Where it doesn't help it can scale back and go narrow, leaving more cores available for other threads.

    In other words this doesn't displace multi-threading. A single piece of demanding software may still want to run multiple threads concurrently, to indirectly extract more parallel performance - such as a game splitting up AI, physics, audio, rendering, networking, etc into their own threads. I don't think their design eliminates the necessity of doing this sort of thing.

    However they can boost average efficiency with a narrow design and lots of cores (similar to mobile ARM designs), without losing performance vs high-power designs (and in some cases gaining performance) because they can act as a wide pipeline by combining cores. It's a flexible form of virtual cores that people will tend to just simplify as "reverse HyperThreading".

    This is all just in theory of course, their implementation has to prove itself. Not to mention the difficulties they'll face with ISA translation, at least in the near term. If their technology takes off and gets licensed out, there will be ports of modern OS and APIs, and thus apps will be ported to run native (in the case of Windows, cloud compilation would handle the majority of RunTime apps).
  • Samus - Monday, February 15, 2016 - link

    If the translation layer does what they say it does, that is exactly what this processor can do. It can break up serial code for parallel processing. I don't know how, or how efficiently, it can do this. To analyze serial code and say ohh, so there's this complex part in the middle and the rest is simple, and send the complex part to one core and the rest to another, and somehow reassemble it after it's processed, seems impossible. We have all seen promising tech flop before, Cyrix and Transmeta had some radical ideas for the way x86 worked, in the end neither could trump Intel or AMD.
  • Alexvrb - Monday, February 15, 2016 - link

    What he was saying is that some code can NOT be made parallel. They CAN take single threads and break them up, and when it's possible they can find parallel processing opportunities. But some tasks are inherently serial. Neither the programmer, nor the compiler, nor the VISC processor can make inherently serial tasks parallel. For example, if A has to happen before you can work on B.

    Uh, at least not with a conventional binary architecture. I don't know much about quantum processors.
  • easp - Friday, February 12, 2016 - link

    I think uninformed pundits/press/commenters will miss the limits imposed by Amdahl's law.

    It's still plausible to me though that this approach will allow more efficient use of silicon and power by allowing better allocation of processor resources at runtime than is possible with traditional compilers, operating system scheduling, hardware scheduling and organization of execution resources.

    Whether they can establish a viable foothold in today's competitive landscape is another issue.
  • Sufiyan - Saturday, February 13, 2016 - link

    If anything this shows that Amdahl's law is still true.
  • Bleakwise - Tuesday, March 14, 2017 - link

    They never said it violates Amdahl's law.

    In fact they said that 2 cores give a speedup of 53%.

    Amdahl's law says it would be a 100% speedup maximum.

    Since when is 53% > 100%?
  • Bleakwise - Tuesday, March 14, 2017 - link

    Of course, it does beg the question...

    Why don't we just make 24-wide CPU pipelines and allow for 3-way SMT and fatten the cores up with more units instead?
