Examining Soft Machines' Architecture: An Element of VISC to Improving IPC

Name: Examining Soft Machines' Architecture: An Element of VISC to Improving IPC
Item: Examining Soft Machines' Architecture: An Element of VISC to Improving IPC
Author: Dr. Ian Cutress

by Ian Cutress on February 12, 2016 8:00 AM EST

97 Comments | Add A Comment

97 Comments

But 16FF+ Silicon Exists

One of the salient points of our talk with Soft Machines was the fact that silicon talks louder than simulations. Their CTO was very honest and said this before I even had the chance to. The 28nm design was shown in 2014 and data was provided, but no 16FF+ design had since been made public. Soft Machines were happy enough to share with us that they do have the core design for 16nm at HQ being examined:

16nm Silicon of a Shasta design

This is literally a test chip of cores rather than a full SoC, and they are currently running the correlation data between simulation and silicon. We were told that the design errors that the 28nm silicon had, such as cache flushing properly, were fixed. The new silicon also includes power plane management, although customers are welcome to use their own power plane adjustments.

The goal, according to Soft Machines' numbers, is to provide a Shasta core on an optimized 16nm FF+ process at 2GHz at around 2W. Their goal includes scaling the design from SoC to server, meaning that there is the goal to reach a range of 0.5W per core up to 5W per core. Because there’s only one 16FF+ part-SoC early run currently at their headquarters it remains to be seen if that is possible, and requires a partner or investor to get their hands dirty with the technology first.

Before someone jumps up and says "is platform XYZ going to use VISC?", it should be fairly obvious from most public roadmaps covering the next 1-2 years that major platforms will not be using VISC. What we see on public roadmaps is a mix of ARM and x86, and the fact that VISC is a different ISA under the hood (which can run native VISC code without translation) means that there has to be an ecosystem change. Soft Machines, with their announcement last week, is at this time principally fishing for clients, investors, and potentially something more.

The big thing about why this design has got a lot of attention in the media and between analysts is because of the potential. Being able to have many light-weight cores that can share resources between threads would be a major milestone in semiconductor design and the next point in the CISC/RISC lineage. It epitomizes the idea of having all the hardware working on a task no matter what it is, such that you can have many slower power efficient cores working on a single task or one inefficient high power but fast core. If you can spare the die area and have a good ISA translation layer, this opens up some of the power budget in a power limited device. A lot of discussion on laptops or smartphones is all about the power, although Soft Machines believes this can impact servers just as easily.

Arguably one could state that future processors will have to do something like VISC in order to get better IPC – when a thread needs a large wide core, then a VISC design can be one when needed. Technically we already have semiconductor designs that work very well on prepared data – vector calculations and graphics are handled by lots of small, simple cores in their thousands. But these only work with consistent data and when the same calculation on all the data points is needed; with a VISC design, the code can be complex with dependencies and the virtual cores will shrink/expand as needed. A lot of questions surrounding the translation layer are to be expected, and if it can be as water-tight as possible when other ISAs are passed through (ARM to VISC, x86 to VISC) and also take advantage of compiler benefits as to SMI’s claims.

As it stands the design promises a lot, but because we really need to see the proper silicon implementation, it might be hard to visualize until a company in the technology ecosystem decides to make that step. It would be an interesting differentiation point for sure, but it requires investment to reach utility in mass production. That makes a number of analysts wary and conservative with good reasons, especially with the assumptions made on that data graph.

Soft Machines has invited us to their offices next time I’m in the Bay Area, which I will probably take them up on.

Sources:

Soft Machines
Microprocessor Report
2014 Linley Conference Video
2015 Linley Conference Video

Show Me the Proof

PRINT THIS ARTICLE

Post Your Comment
Please log in or sign up to comment.

Comments Locked

97 Comments

View All Comments

KAlmquist - Sunday, February 14, 2016 - link
If I understand the article correctly, the difference between VISC and SMT is that in SMT there is a single scheduler which manages all of the execution units. VISC implements a two stage scheduling algorithm. In the first stage, an operation is assigned to a core. In the second stage, the scheduler for that core assigns the operation to an execution unit.

The downside of SMT is that the amount of silicon required to implement the scheduler grows faster than the number of execution units. So as you add more threads and more execution units, it becomes harder and harder to keep the cost of the scheduler to a reasonable level.

In the second stage of VISC, you have multiple schedulers, each feeding a small number of execution units, which keeps these schedulers simple. In the first stage, the schedulers require at least some awareness of all the execution units. For example, if you have an integer multiply instruction, you want to send it to a core that doesn't have other integer multiply operations outstanding rather than just chosing the core with the smallest total number of outstanding operations. What may keep the first stage scheduling reasonably simple is that it doesn't appear to do any instruction reordering (though it does have to do the bookwork to keep track of which instructions have been retired).

In short, VISC appears to be intended to scale better than SMT as you add more threads and execution units.

What is strange, then, is that Soft Machines isn't talking about building an 8 thread device like IBM's POWER8. Instead, they have a two and four thread designs, and are mostly talking about the former. A two thread VISC design makes sense only if you believe that the SMT approach is already hitting its limits with two threads.

My sense is that VISC is not going to be a game changer, but Soft Machines could be successful if ARM Holdings screws up. If ARM has has a major screw up technologically (like AMD did with Bulldozer), Soft Machines could end up with a superior product. Conversely, if ARM screws up on customer relations, all Soft Machines would need is something close to technological parity with ARM to win customers.
Shadowmaster625 - Monday, February 15, 2016 - link
When Intel purchased Altera I immediately began to visualize all sorts of great potential breakthroughs in single threaded IPC. I imagine that within 5 years, we will have at least a modest number of FPGA cells integrated within Intel CPU cores. These cells will be programmed on-the-fly with application specific DSPs that will be capable of completing commonly used combinations of instructions MUCH faster than the general x86 instruction set would allow. I expect this to be the singularly largest breakthrough in computing of the last 20 years. Within 10 years, I expect the CPU itself to create its own DSP code on the fly as it profiles its own instruction loading in real time. The potential here is utterly massive. Think about what ASICs have done for bitcoin mining... Soon they will be able to do that for javascript!
FunBunny2 - Monday, February 15, 2016 - link
-- capable of completing commonly used combinations of instructions MUCH faster than the general x86 instruction set would allow. I expect this to be the singularly largest breakthrough in computing of the last 20 years.

that's what the real cpu/RISC core/micro-architecture has done for decades. twerked continually.

-- I imagine that within 5 years, we will have at least a modest number of FPGA cells integrated within Intel CPU cores.

done: http://www.extremetech.com/extreme/184828-intel-un...
"This new Xeon+FPGA chip will fit in the standard E5 LGA2011 socket, but the integrated FPGA will allow each chip to be customized to specific workloads."
Shadowmaster625 - Monday, February 15, 2016 - link
That's not what I mean. That is of course a good start, but what I'm talking about is programmable logic linked tightly to the actual execution units of the CPU core. Smaller blocks, probably only a square millimeter or perhaps even less. But many of them. Just like Skylake has 6 execution units. One of these programmable blocks would be only about the same size as one of those existing execution units. They would have direct access to the prefetcher and scheduler and instruction/data caches. They would be power gated.
dustwalker13 - Saturday, February 20, 2016 - link
yes it looks good on paper ... but up to now that is all that it does.

silicon existing at HQ is so much smoke and mirrors until some independant source has an actual go at it and publishes results.

it looks promising, but so did a million other things that ended up as just another failiur or worse scam.

i will keep an eye on this one but for now there simply is nothing to see than mirror images produced by a lot of hot air.
mikato - Saturday, February 20, 2016 - link
So why did they come out of stealth mode?
TruePath - Saturday, April 16, 2016 - link
I've been curious for a long time why more wasn't done to use parallel resouces to extract instruction level parrelism.

However, what puzzles me is why do so much of the work on the fly at run time. Sure, one needs to be able to respond to dynamic performance information like failed speculation but it seems like there is substantial overhead in translating the host ISA into native instructions and (I assume) encoding information into the native instructions about resource needs and dependencies.

Even before a program is run knowledge of the exact processor would enable software to translate the ISA (targeting the exact chip), hint at resources needs and perform a degree of instruction reordering (over a larger window than in hardware).

So why not push as much of this into the software as possible. One can even cache the results of software ISA translation. Is it just a desire to be totally hardware compatible?

Examining Soft Machines' Architecture: An Element of VISC to Improving IPC

But 16FF+ Silicon Exists

Post Your Comment

97 Comments

View All Comments

KAlmquist - Sunday, February 14, 2016 - link

Shadowmaster625 - Monday, February 15, 2016 - link

FunBunny2 - Monday, February 15, 2016 - link

Shadowmaster625 - Monday, February 15, 2016 - link

dustwalker13 - Saturday, February 20, 2016 - link

mikato - Saturday, February 20, 2016 - link

TruePath - Saturday, April 16, 2016 - link

Log in

Don't have an account? Sign up now