But 16FF+ Silicon Exists

One of the salient points of our talk with Soft Machines was the fact that silicon talks louder than simulations. Their CTO was very honest and said this before I even had the chance to. The 28nm design was shown in 2014 and data was provided, but no 16FF+ design has since been made public. Soft Machines were happy enough to share with us that they do have the 16nm core design at HQ being examined:


16nm Silicon of a Shasta design

This is literally a test chip of cores rather than a full SoC, and they are currently running the correlation data between simulation and silicon. We were told that the design errors the 28nm silicon had, such as the cache not flushing properly, have been fixed. The new silicon also includes power plane management, although customers are welcome to use their own power plane adjustments.

The goal, according to Soft Machines' numbers, is to provide a Shasta core on an optimized 16nm FF+ process at 2GHz at around 2W. They also intend to scale the design from SoC to server, which means covering a range of 0.5W per core up to 5W per core. Because there is only one early 16FF+ part-SoC run currently at their headquarters, it remains to be seen whether that is possible, and it will require a partner or investor to get their hands dirty with the technology first.
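
As a rough illustration of how a single core design might span that kind of power range, dynamic CMOS power scales roughly with C·V²·f, so modest drops in frequency and voltage compound quickly. The sketch below is a back-of-the-envelope calculation with hypothetical, normalized numbers; it is not Soft Machines' data.

```c
/* Back-of-the-envelope only: dynamic power ~ C * V^2 * f.
 * All values are hypothetical and normalized; none are Soft Machines' figures. */
#include <stdio.h>

static double dyn_power(double c, double v, double f)
{
    return c * v * v * f;                      /* classic C*V^2*f approximation */
}

int main(void)
{
    double c    = 1.0;                          /* switched capacitance (arbitrary units) */
    double p_hi = dyn_power(c, 1.00, 2.0);      /* e.g. 2.0 GHz at nominal voltage */
    double p_lo = dyn_power(c, 0.70, 1.0);      /* e.g. 1.0 GHz at ~30% lower voltage */

    printf("low-power point is %.1f%% of the high-power point\n",
           100.0 * p_lo / p_hi);                /* ~24.5%, roughly a 4x swing */
    return 0;
}
```

Voltage and frequency scaling alone gets a core design a long way toward a wide power envelope; the rest of a 10x spread would presumably come from process options and how much of the core is kept active, which the public material does not spell out.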

Before someone jumps up and says "is platform XYZ going to use VISC?", it should be fairly obvious from most public roadmaps covering the next 1-2 years that major platforms will not be using VISC. What we see on public roadmaps is a mix of ARM and x86, and the fact that VISC is a different ISA under the hood (which can run native VISC code without translation) means that there has to be an ecosystem change. Soft Machines, with their announcement last week, are at this time principally fishing for clients, investors, and potentially something more.

The main reason this design has attracted so much attention in the media and among analysts is its potential. Being able to have many lightweight cores that can share resources between threads would be a major milestone in semiconductor design and the next point in the CISC/RISC lineage. It epitomizes the idea of having all the hardware working on a task no matter what it is, such that you can have many slower, power-efficient cores working on a single task, or one inefficient, high-power but fast core. If you can spare the die area and have a good ISA translation layer, this opens up some of the power budget in a power-limited device. A lot of the discussion around laptops and smartphones is about power, although Soft Machines believes this can impact servers just as easily.
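
To make that idea a little more concrete, here is a minimal toy sketch of the resourcing decision: a pool of physical cores is handed out as wider or narrower "virtual cores" depending on how many threads are runnable. The scheme, names, and core count are illustrative assumptions, not Soft Machines' actual scheduler.

```c
/* Toy model only: composing physical cores into "virtual cores" sized to the
 * number of runnable threads. Everything here is hypothetical, not the VISC design. */
#include <stdio.h>

#define PHYSICAL_CORES 4

/* Print how many physical cores back each thread's virtual core. */
static void size_virtual_cores(int runnable_threads)
{
    if (runnable_threads < 1) runnable_threads = 1;
    if (runnable_threads > PHYSICAL_CORES) runnable_threads = PHYSICAL_CORES;

    int per_vcore = PHYSICAL_CORES / runnable_threads;  /* cores fused per thread */
    int leftover  = PHYSICAL_CORES % runnable_threads;  /* spares go to the first threads */

    for (int t = 0; t < runnable_threads; t++) {
        int width = per_vcore + (t < leftover ? 1 : 0);
        printf("thread %d -> virtual core spanning %d physical core(s)\n", t, width);
    }
}

int main(void)
{
    size_virtual_cores(1);   /* one hot thread: one wide virtual core */
    size_virtual_cores(4);   /* four threads: four narrow virtual cores */
    return 0;
}
```

The interesting (and hard) part in real hardware is doing this reallocation quickly and transparently; the sketch only shows the accounting, not the dispatch, register, or cache plumbing it would require.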

Arguably, one could state that future processors will have to do something like VISC in order to get better IPC: when a thread needs a large, wide core, a VISC design can provide one on demand. Technically we already have semiconductor designs that work very well on prepared data – vector calculations and graphics are handled by thousands of small, simple cores. But these only work with regular data, where the same calculation is applied to every data point; with a VISC design the code can be complex, with dependencies, and the virtual cores will shrink or expand as needed. Plenty of questions about the translation layer are to be expected: whether it can be as watertight as possible when other ISAs are passed through (ARM to VISC, x86 to VISC), and whether it can take advantage of compiler benefits as SMI claims.
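
The data-shape distinction above is easier to see in code. In the hypothetical C fragments below, the first loop is the regular, independent work that thousands of simple GPU/SIMD lanes handle well; the second has a loop-carried dependency and a data-dependent branch, which is the sort of code that stands to benefit from a core that can widen on demand.

```c
/* Illustrative fragments only; not taken from any VISC material. */
#include <stddef.h>
#include <stdio.h>

/* Data-parallel: every iteration is independent, so the work maps cleanly onto
 * wide vectors or thousands of GPU threads. */
void scale(float *out, const float *in, float k, size_t n)
{
    for (size_t i = 0; i < n; i++)
        out[i] = in[i] * k;
}

/* Loop-carried dependency plus a data-dependent branch: each iteration needs the
 * previous result, so it cannot simply be spread across many simple cores;
 * extracting ILP here takes speculation and a wide out-of-order window. */
float running_sum_with_clamp(const float *in, size_t n)
{
    float acc = 0.0f;
    for (size_t i = 0; i < n; i++) {
        acc += in[i];
        if (acc > 1.0f)          /* branch decided by the data itself */
            acc = 1.0f;
    }
    return acc;
}

int main(void)
{
    float in[4] = { 0.25f, 0.5f, 0.5f, 0.25f };
    float out[4];
    scale(out, in, 2.0f, 4);
    printf("scaled first element: %.2f\n", out[0]);                  /* 0.50 */
    printf("sum with clamp:       %.2f\n", running_sum_with_clamp(in, 4)); /* 1.00 */
    return 0;
}
```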

As it stands the design promises a lot, but we really need to see a proper silicon implementation, and that may be hard to come by until a company in the technology ecosystem decides to take that step. It would be an interesting differentiation point for sure, but it requires investment to reach mass production. That makes a number of analysts wary and conservative, with good reason, especially given the assumptions made on that data graph.

Soft Machines has invited us to their offices next time I’m in the Bay Area, which I will probably take them up on.

Sources:

Soft Machines
Microprocessor Report
2014 Linley Conference Video
2015 Linley Conference Video

97 Comments

  • Bleakwise - Tuesday, March 14, 2017 - link

    "Floating point code"

    "Integer code"

    Do you have any idea what you're talking about?

    Bulldozer does "floating point code" faster than the fucking 1080Ti

    At least on one thread. Unless you're going to go wide, it doesn't help.

    The point of this isn't to "go wide" it's to massively increase speculation ability.

    The 1080Ti has ZERO speculative ability, NONE. GPUs simply don't do branching, that's not what GPUs do, they rely on ACE units and SMX units and so on to balance thousands of cores.

    A CPU on the other hand has more speculative branches than cores.

    SIMD and SIMT that GPUs do are not "FPU code"

    FFS
  • dcbronco - Friday, February 12, 2016 - link

    AMD helped finance this. They may already have a stake and I would bet some right of first refusal. They used their investment in HBM to get earlier access than NVIDIA; I doubt they would have invested without some sort of incentive for themselves.
  • Bleakwise - Tuesday, March 14, 2017 - link

    Of course not.
  • bcronce - Saturday, February 13, 2016 - link

    There is no such thing as a free lunch. They are trading something. Their benchmarks are for single-thread performance, where the graphs showed much greater efficiency and performance than Intel. Very impressive and I'm sure they'll be great for something.

    The problem is the platform sounds great for highly coupled cores and very wide single thread execution with few data dependencies. Could be great for computation.

    What I'm wondering is how their platform scales for IO workloads like web servers, file servers, or even video games. Suddenly a large part of the work is communicating with other devices and synchronizing many cores.

    One thing that has helped ARM for a long time is that they were mostly single-core and only recently multi-core. They didn't use to have complex cache-coherency like x86. This dramatically reduced transistor counts, increased efficiency, and allowed for great decoupled core performance. But as soon as you wanted two cores to work together, it went to crap. Cache-coherency is hardware-accelerated inter-core communication. Amdahl's law was not very forgiving to ARM's non-cache-coherent cores for anything except GPU-like workloads.

    Based on the description, VISC sounds like it needs highly coupled cores to maintain low latency and high bandwidth. This is probably why they also seem to have lower frequency. Keeping many parts far away from each other in sync takes time. But lower frequency also means lower voltage, and power consumption scales with the square of the voltage and linearly with frequency.

    I wonder how tightly coupled they can keep 4, 8, or 16 cores. Maybe they don't need the core counts for their target workloads, or possibly they can stay competitive with a fraction of the core count by having better efficiency in power and IPC.

    In the end, I'm sure they'll at least find a niche market and I'm glad some new ideas are making it out there. I wouldn't be surprised if they can take over the dual or quad core market, forcing Intel to add more cores.
  • Bleakwise - Tuesday, March 14, 2017 - link

    It's not a "free lunch"

    Obviously all of this crap is going to cost DIE space, it's not free.

    If all we cared about was raw processing power we'd just make 2046kb wide vector units and ignore branching and speculation altogether.

    Bulldozer has better theoretical performance than Haswell i5s. I'd rather have the extra out-of-order pipes, the SMT unit to use any unused pipes, better branch prediction and so on, and in the real world this stuff wins the day.

    Not everyone can become a world class programmer and re-factor all their code so that it spreads across thousands of cores like it can on a GPU.

    Sometimes it's not even possible. Sometimes what you need is branch prediction; branch prediction lets you see the future, LITERALLY, this is what the CPU does. Obviously the more branches you predict, the more cycles you're wasting on that thread, because more of the speculations you make turn out wrong.

    You also reduce the number of misses and increase cache hits.

    As for coupling 4 or 16 cores, they haven't even talked about going beyond 4 cores. Obviously it doesn't scale into infinity; if you're getting 90% speculative accuracy you can only gain 10% more. Spending 30% of your transistor budget to bind up 8 or 16 cores, when spending 10% of your budget on 4, for a 10% performance gain would be dumb.

    You'd be much better off going for more clock speed, or reducing latency, adding a victim cache or L2 cache coherency, beefing up the GPU, a better memory controller, or just beefing up your underlying branch predictor.
  • Bleakwise - Tuesday, March 14, 2017 - link

    You'll never get perfect speculation anyway. Unless a language is developed that puts limits on the number of branches possible per X lines of code and keeps the number of branches below the number the CPU can handle, you're ALWAYS going to have to deal with the risk of cache misses.

    Not sure there is even anything you could gain from a 100% target prediction hit guarantee beyond having no lost cycles on a miss. Getting there even through a core-binding fabric/bus like this across 16 cores would blow your transistor budget to the point that you could hardly afford a reasonably sized cache in the first place.

    You'd be better off just reducing the number of stages in the pipelines or just adding more pipelines to each core instead of blowing your budget on this fabric.

    For example, binding together 100 in-order CPUs to make a virtual 100-pipeline CPU would be ridiculously expensive and power hungry vs just having an 8-core superscalar CPU with 12 out-of-order pipelines in each CPU.
  • tipoo - Friday, February 12, 2016 - link

    Question: since this is testing their core design in isolation and the rest of the package hasn't been built around it, is that accounted for in the comparisons to other SoCs, which all have far more die area dedicated to non-core stuff than the cores?
  • Flunk - Friday, February 12, 2016 - link

    If VISC is not an acronym then don't capitalize it, idiots.

    The technology looks like it could be really good, I'm hoping we see some practical applications.
  • smilingcrow - Friday, February 12, 2016 - link

    They can capitalize it for any reason they like; it's just a word so nothing to GYKIATO (Get your knickers .... over).
  • andychow - Friday, February 12, 2016 - link

    It's an acronym, you can't trademark acronyms, so now they claim it's not an acronym. Legal bs 101.
