An Introduction to Neural Network Processing

AI is currently the big buzzword in consumer electronics. While marketing departments everywhere are eager to embrace the term, when we’re talking about the current use of AI in computing we’re specifically talking about machine learning. More precisely, when talking about the latest generations of silicon IP, we’re talking about the implementation of specialized hardware blocks which are optimized to run convolutional neural networks (CNNs).

While explaining how convolutional neural networks work in detail is far beyond the scope of this piece, they have been a research topic since the 1980s. The idea is to simulate the behaviour of the human brain’s neurons. The keyword here is simulation; the various neural network IPs’ hardware implementations do not mimic the structure of the human brain. While the field of neural networks in academia has been around for a long time, it’s only been in the last decade, with the introduction of software implementations able to run on GPUs, that things have accelerated and become a lot more interesting. Through breakthroughs over the last half-decade, we’ve seen researchers iterate and develop CNN models that keep improving in both accuracy and efficiency.

Looking under the hood, it turns out that CNNs map pretty well to highly threaded execution models. The work itself has minimal branching or other "complex" behavior that requires a general purpose processor (CPU), and instead can typically be broken up into discrete, semi-independent threads. Furthermore the required computational accuracy is not all that high – running fully developed networks can be done via low-precision integers in some cases – again simplifying the scope of the problem. As a result, CNN research & development hit its stride earlier this decade when GPUs began shipping with the necessary compute features and the overall performance to resolve complex CNN execution in a reasonable-by-human-standards timeframe.
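To make the parallelism argument concrete, here is a minimal sketch (plain NumPy, not any vendor’s implementation) of a single 2D convolution written as independent multiply-accumulate operations: every output element depends only on a small input window and the kernel weights, which is why the work splits cleanly into thousands of semi-independent threads and tolerates low-precision integer math.

```python
import numpy as np

def conv2d_naive(image, kernel):
    """One convolution channel as plain multiply-accumulate (MAC) loops."""
    kh, kw = kernel.shape
    oh, ow = image.shape[0] - kh + 1, image.shape[1] - kw + 1
    out = np.zeros((oh, ow), dtype=np.int32)      # accumulate in wider precision
    for y in range(oh):                           # every (y, x) output is independent,
        for x in range(ow):                       # so a GPU/NPU can run them as threads
            window = image[y:y + kh, x:x + kw].astype(np.int32)
            out[y, x] = np.sum(window * kernel.astype(np.int32))
    return out

# 8-bit inputs and weights are often sufficient for inference; only the
# accumulator needs to be wider to avoid overflow.
img = np.random.randint(0, 256, (8, 8), dtype=np.uint8)
k = np.random.randint(-128, 128, (3, 3), dtype=np.int8)
print(conv2d_naive(img, k))
```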

Of course, while GPUs have been the processors most readily adapted to running them, GPUs are not the only kind of highly parallel processor out there. As the field evolves and companies look to commercialize neural networks in actual use-cases, we’ve seen the need for much higher performance as well as greater consideration for power efficiency. At this point we started seeing a move towards more specialized processing units whose architecture is built with machine learning in mind. Google was the first to announce such hardware with the unveiling of the TPU back in 2016. More specialized hardware loses some flexibility, but in turn it gains power and area (die space) efficiency by only including the hardware and features necessary for the task.


Google's TPU chip and board.

There are two key aspects to actually running NN workloads: first, you need a trained model, which contains the information describing the data the model is later meant to be run on. The training of models is rather processor intensive – not only is it a lot of work to begin with, but it has to be done with greater levels of precision than the execution of those models, which is to say that efficient neural network training requires more powerful and complex hardware than executing neural networks. Consequently, the idea is that the bulk of models will be trained by high performance hardware, such as server-class GPUs and specialized hardware such as Google’s TPUs, on servers in the cloud.
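As a rough numeric illustration of why training needs more precision than inference (a toy example, not how any production framework is implemented): a weight update is typically tiny relative to the weight itself, and in a 16-bit float the update can be rounded away entirely, while a 32-bit float preserves it.

```python
import numpy as np

# One hypothetical single-parameter gradient step: w <- w + lr * grad
weight, grad, lr = 1.0, 0.1, 1e-3
update = lr * grad                                  # 1e-4

w32 = np.float32(weight) + np.float32(update)       # FP32 accumulation
w16 = np.float16(weight) + np.float16(update)       # FP16 accumulation

print(w32)                          # 1.0001  -> the update survives
print(w16)                          # 1.0     -> the update is rounded away
print(w16 == np.float16(weight))    # True: effectively no learning happened
```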

The second aspect of NN workloads is the execution of those models: taking a completed model, feeding it new data, and generating results based on what the model perceives. The execution of a neural network model with input data to get an output result is called inferencing. And not unlike the conceptual differences between training and inferencing, the compute requirements for inferencing are quite different as well. The name of the game is still highly parallel compute, but it can be done with lower precision computations, and the overall amount of performance required for timely execution is lower as well. This means that inferencing can be done on cheaper hardware in many more locations and scenarios.
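As a sketch of what “lower precision computations” means in practice (generic symmetric quantization, not any particular NPU’s scheme): FP32 weights and activations can be mapped to 8-bit integers plus a scale factor, the multiply-accumulates then run in integer arithmetic, and only the final result is rescaled.

```python
import numpy as np

def quantize(x, bits=8):
    """Symmetric quantization: map floats to signed ints plus a scale factor."""
    scale = np.max(np.abs(x)) / (2 ** (bits - 1) - 1)
    return np.round(x / scale).astype(np.int8), scale

weights = np.random.randn(16).astype(np.float32)
inputs  = np.random.randn(16).astype(np.float32)

qw, sw = quantize(weights)
qx, sx = quantize(inputs)

# Integer MACs with a 32-bit accumulator, rescaled once at the end.
int8_result = np.sum(qw.astype(np.int32) * qx.astype(np.int32)) * (sw * sx)
fp32_result = np.dot(weights, inputs)

print(fp32_result, int8_result)   # close agreement despite 8-bit storage and math
```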


Graphic source: Nvidia Blog

This in turn has pushed the industry to move towards inferencing on edge devices (consumer devices), because it is much more performant and power efficient. If you have your trained model stored locally on your device, you can simply use the device’s own processing power to run the inference and avoid having to upload data to the cloud and have a server do it. This alleviates issues such as latency, bandwidth, and power consumption, and also removes privacy concerns, as the input data never leaves your device.

With the goal of running neural network inferencing locally on an edge device, we have a choice of various processing blocks on devices such as smartphones. CPUs, GPUs, and even DSPs are all able to run inferencing tasks, however there are vast efficiency differences between them. General purpose CPUs are the least suited for the task as they are not designed with massively parallelised execution in mind. GPUs and DSPs are much better choices, but even then there’s much room for improvement. It is here where we see a new class of processing accelerator emerge, such as the NPU on the Kirin 970.

As these IP blocks are still new, the industry hasn’t had time to agree on a common nomenclature. HiSilicon/Huawei have coined the term NPU/neural processing unit, while Apple publicly uses NE/neural engine. Other IP providers such as Cadence/Tensilica just outright call their implementation a neural network DSP (Vision C5), Imagination Technologies (Series 2NX) uses the term NNA/neural network accelerator, and CEVA’s NeuPro settled on the marketing friendly “AI processor”. For the sake of simplicity I’ll just continue to refer to them as neural network IPs.

In the case of the Kirin 970, the NPU is provided by a new Chinese IP provider called Cambricon. The Kirin 970 NPU isn’t a straight off-the-shelf offering, however, but rather a co-development between Cambricon and HiSilicon optimized to HiSilicon’s requirements. Huawei quotes 2 TeraOPS FP16 performance for the IP, however this metric is misleading, as the performance figure quotes sparse-equivalent peak data, meaning the 8-bit quantized throughput. At this point in time we should largely shy away from theoretical performance figures of the neural network IPs, as they don’t necessarily correlate to actual performance, and there are less-understood architectural characteristics of the IPs that can play a larger role in the resulting end performance.
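For context on where headline “TeraOPS” figures generally come from, here is a back-of-the-envelope calculation; the MAC count and clock below are purely hypothetical placeholders, not the Kirin 970’s actual configuration.

```python
# Peak throughput = MAC units x 2 ops per MAC (multiply + add) x clock rate.
mac_units = 1024          # hypothetical number of multiply-accumulate units
ops_per_mac = 2           # one multiply and one add per cycle
clock_hz = 1.0e9          # hypothetical 1.0 GHz clock

peak_ops = mac_units * ops_per_mac * clock_hz
print(f"{peak_ops / 1e12:.2f} TOPS peak")   # ~2.05 TOPS on these made-up numbers

# Real workloads rarely keep every MAC busy every cycle, which is one reason
# peak figures correlate poorly with delivered performance.
```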

The first hurdle to running a neural network on a hardware block other than the CPU is to make use of the proper APIs to access that block. The SoC and IP vendors all currently ship proprietary APIs and SDKs to enable application development using hardware acceleration for neural networks. In the case of HiSilicon, they offer the HiAI API, which can manage workloads between the CPU, GPU, and NPU. The API is currently not publicly available as it’s still under development, but developers who reach out to HiSilicon can get early access before the public release later in the year. Vendors such as Qualcomm make available the SNPE (Snapdragon Neural Processing Engine) SDK, which does the equivalent task of enabling app developers to tap into the resources of the GPU and DSP for neural network processing workloads. Other IP vendors of course have their own SDKs for their respective IPs.

However, vendor-specific APIs may end up being a temporary quirk of the present time; the goal for the future is to have a common universal API alongside the respective vendors’ IPs. Google has already been working on this, and the NN API introduced in Android 8.1 is already actively shipping on Pixel 2 devices. One note that I’ve been made aware of is that the NN API currently only supports a subset of the features available on IP like the NPU, so for developers to take full advantage of the hardware and extract maximum performance, Huawei still sees application developers targeting the various proprietary APIs, while using the NN API as a fallback method.
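The fallback arrangement described above would look something like the sketch below; the two backend functions are stand-in stubs of my own, not real HiAI or NN API calls, and only the dispatch-and-fallback structure is the point.

```python
def run_on_vendor_npu(model, inputs):
    # Stand-in for a proprietary SDK path (HiAI, SNPE, etc.); here we pretend
    # the model uses an operation the vendor path does not accelerate yet.
    raise NotImplementedError("operation not supported by vendor path")

def run_on_nn_api(model, inputs):
    # Stand-in for the common Android NN API path (subset of features,
    # but available across compliant devices).
    return [0.0] * 10     # dummy output vector

def run_inference(model, inputs):
    try:
        return run_on_vendor_npu(model, inputs)   # full NPU feature set
    except NotImplementedError:
        return run_on_nn_api(model, inputs)       # portable fallback

print(run_inference(model="hypothetical_model", inputs=[1, 2, 3]))
```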

Comments

  • lilmoe - Monday, January 22, 2018 - link

    Unfortunately, they're not "fully" vertical as of yet. They've been held back since the start by Qualcomm's platform, because of licensing and "other" issues that no one seems to be willing to explain. Like Andrei said, they use the lowest common denominator of both the Exynos and Snapdragon platforms, and that's almost always lower on the Snapdragons.

    Where I disagree with Andrei, and others, is with the efficiency numbers and the type of workloads used to reach those results. Measuring efficiency at MAX CPU and GPU load is unrealistic, and frankly, misleading. Under no circumstance is there a smartphone workload that demands that kind of constant load from either the CPU or GPU. A better measure would be running an actual popular game for 30 mins in airplane mode and measuring power consumption accordingly, or loading popular websites, using the native browser, and measuring power draw at set intervals for a set period of time (not even a benchmarking web application).

    Again, these platforms are designed for actual, real world, modern smartphone workloads, usually running Android. They do NOT run workstation workloads and shouldn't be measured as such. Such notions, as Andrei has admitted, are what push OEMs to be "benchmark competitive", not "experience competitive". Apple is also guilty of this (proof is in the latest events, where their power delivery can't handle the SoC, or the SoC is designed well above sustainable TDP). I can't stress this enough. You just don't run SPEC and then measure "efficiency". It just doesn't work that way. There is no app out there that stresses a smartphone SoC this much, not even the leading game. As a matter of fact, there isn't an Android (or iPhone) game that saturates last year's flagship GPU (probably not even the year before).

    We've reached a point of perfectly acceptable CPU and GPU performance for flagships running 1080p and 1440p resolution screens. Co-processors, such as the decoder, ISP, DSP and NPU, in addition to software optimization, are far, FAR more important at this time, and what Huawei has done with their NPU is very interesting and meaningful. Kudos to them. I just hope these co-processors are meant to improve the experience, not collect and process private user data in any form.
  • star-affinity - Monday, January 22, 2018 - link

    Just curious about your claims about Apple – so you think it's a design fault? I'm thinking that the problem arises only when the battery has worn out, and that a healthy battery won't have the problem of not sustaining enough juice for the SoC.
  • lilmoe - Monday, January 22, 2018 - link

    Their batteries are too small, by design, so that's the first design flaw. But that still shouldn't warrant unexpected slowdowns within 12-18 months of normal usage; their SoCs are too power hungry at peak performance, and the constant bursts were taking their toll on the already smaller batteries, which weren't protected by a proper power delivery system. It goes both ways.
  • Samus - Monday, January 22, 2018 - link

    Exactly this. Apple still uses 1500mAh batteries in 4.7" phones. When more than half the energy is depleted in a cell this small, the nominal voltage drops to 3.6-3.7V from the 3.9-4.0V peak. A sudden spike in demand for a cell hovering around 3.6V could cause it to hit the low-voltage cutoff, normally 3.4V for Li-Ion and 3.5V for Li-Polymer. To prevent damage to the chemistry, the internal power management will shut the phone down, or slow the phone down to prevent these voltage drops.

    Apple designed their software to protect the hardware. It isn't necessarily a hardware problem, it's just an inherently flawed design. A larger battery that can sustain voltage drops, or even a capacitor, would have helped, but both take up "valuable space" according to Apple, like that headphone jack that was erroneously eliminated for no reason. A guy even successfully reinstalled a headphone jack in an iPhone 7 without losing any functionality... it was just a matter of relocating some components.
  • ZolaIII - Wednesday, January 24, 2018 - link

    Try the Dolphin emulator & you will see not only how stressed the GPU is, but also how much more performance it needs.
  • Shadowfax_25 - Monday, January 22, 2018 - link

    "Rather than using Exynos as an exclusive keystone component of the Galaxy series, Samsing has instead been dual-sourcing it along with Qualcomm’s Snapdragon SoCs."

    This is a bit untrue. It's well known that Qualcomm's CDMA patents are the stumbling block for Samsung. We'll probably see Exynos-based models in the US within the next two versions once Verizon phases out their CDMA network.
  • Andrei Frumusanu - Monday, January 22, 2018 - link

    Samsung has already introduced a CDMA-capable Exynos in the 7872 and also offers a standalone CDMA-capable modem (S359). Two years ago when I talked to SLSI's VP, they openly said that it's not a technical issue of introducing CDMA and that it would take them two years to bring it to market once they decided they needed to do so (hey, maybe I was the catalyst!), but they didn't clarify the reason why it wasn't done earlier. Of course the whole topic is a hot mess and we can only speculate as outsiders.
  • KarlKastor - Thursday, January 25, 2018 - link

    Uh, how many devices have shipped with the 7872 yet?
    Why do you think they came with an MDM9635 in the Galaxy S6 in all CDMA2000 regions? In all other regions they used their integrated Shannon modem.
    The other option is to use a Snapdragon SoC with a QC modem. They do also opt for this alternative, but in the S6 they didn't want to use the crappy Snapdragon 810.

    It is possible that Qualcomm today skips their politics concerning CDMA2000 because it is obsolete.
  • jjj - Monday, January 22, 2018 - link

    Don't forget that Qualcomm is a foundry customer for Samsung and that could be why they still use it.
    Also, cost is a major factor when it comes to vertical integration, at sufficient scale integration can be much cheaper.
    What Huawei isn't doing is prioritizing the user experience and using their high end SoCs in lower end devices too, and that's a huge mistake. They've got much lower costs than others in high end, and gaining scale by using these SoCs in lower end devices would decrease costs further. It's an opportunity for much more meaningful differentiation that they fail to exploit. Granted, the upside is being reduced nowadays by upper mid range SoCs with big cores, and Huawei might be forced into using their high end SoCs more as the competition between Qualcomm and Mediatek is rather ferocious and upper mid becomes better and better.

    Got to wonder about A75 and the clocks it arrives at ... While at it, I hope that maybe you take a close look at the SD670 when it arrives as it seems it will slightly beat SD835 in CPU perf.

    On the GPU side, the biggest problem is the lack of real world tests. In PC we have that and we buy what we need; in mobile, somehow, being anything but first is a disaster, and that's nuts. Not everybody needs a Ferrari, but mobile reviews are trying to sell one to everybody.
  • HStewart - Monday, January 22, 2018 - link

    This could be a good example of why Windows 10 for ARM will fail - it only works with Qualcomm CPUs, and it could explain why Samsung created Intel based Windows tablets.

    I do believe that ARM, especially Samsung, has a good market in phones and tablets - I love my Samsung Tab S3, but I also love my Samsung TabPro S - both serve different purposes.
