Cambricon Technologies, the company that worked with HiSilicon / Huawei to license specialist AI silicon intellectual property for the Kirin 970 smartphone chipset, has gone solo and created its own series of chips for the data center.

The IP inside the Kirin 970 is known as Cambricon-1A, the company’s first licensable IP. At the time, finding information on Cambricon was difficult: its website was a series of static images with the Chinese text embedded in the images themselves. Funnily enough, we used the AI-accelerated translate feature on the Huawei Mate 10 to translate what the website said. Fast forward 12-18 months, and the Cambricon website is now interactive and has information about upcoming products, a few of which were announced recently.

The Big Chip: Going for Data Center

Built on TSMC’s 16FF process, the MLU100 is an 80W chip capable of 64 TFLOPS at traditional half precision, or 128 TOPS using the 8-bit integer metric commonly used in machine learning algorithms. This is at 1.0 GHz, the ‘standard’ mode. Cambricon’s CEO, Dr Chen Tianshi, stated that the new chip also has a high-performance mode at 1.30 GHz, which allows for 83.2 TFLOPS (16-bit float) or 166.4 TOPS (8-bit int), but power rises to 110W. This lowers performance per watt, but allows for a faster chip. All of these figures rely on sparse data modes being enabled.
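As a rough check on those figures, the sketch below reproduces the high-performance numbers from the standard-mode ones and compares performance per watt in the two modes. It assumes peak throughput scales linearly with clock speed, which matches the quoted figures but is our assumption rather than anything Cambricon has stated. The function and variable names are illustrative only.

```python
# Figures quoted in the article for the MLU100 (sparse modes enabled).
BASE_CLOCK_GHZ = 1.0
BASE_FP16_TFLOPS = 64.0
BASE_INT8_TOPS = 128.0
BASE_TDP_W = 80.0

PERF_CLOCK_GHZ = 1.3
PERF_TDP_W = 110.0


def scale_with_clock(base_rate, base_clock_ghz, new_clock_ghz):
    """Assumption: peak throughput scales linearly with clock frequency."""
    return base_rate * new_clock_ghz / base_clock_ghz


# Derive the high-performance mode throughput from the standard mode.
perf_fp16_tflops = scale_with_clock(BASE_FP16_TFLOPS, BASE_CLOCK_GHZ, PERF_CLOCK_GHZ)
perf_int8_tops = scale_with_clock(BASE_INT8_TOPS, BASE_CLOCK_GHZ, PERF_CLOCK_GHZ)

# Efficiency in 8-bit TOPS per watt for each mode.
base_tops_per_watt = BASE_INT8_TOPS / BASE_TDP_W
perf_tops_per_watt = perf_int8_tops / PERF_TDP_W

print(f"High-performance FP16: {perf_fp16_tflops:.1f} TFLOPS")
print(f"High-performance INT8: {perf_int8_tops:.1f} TOPS")
print(f"Standard mode:         {base_tops_per_watt:.2f} TOPS/W")
print(f"High-performance mode: {perf_tops_per_watt:.2f} TOPS/W")
```

Under that assumption the 30% clock increase costs roughly 37.5% more power, so the 8-bit efficiency drops slightly between the two modes.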

The technology behind the chip is Cambricon’s latest MLUv01 architecture, which is understood to be a variant of the Cambricon-1A used in the Kirin chipsets but scaled to something bigger and faster. Obviously additional rules have to be implemented for data and power management compared to the mobile IP. Cambricon also has its 1H architecture and newly announced 1M architecture, although there is no disclosure as to how these might relate to the chip.

David Schor from WikiChip (the main source of this article) states that this could be NVIDIA’s first major ASIC competition for machine learning, if made available to commercial partners. To that end, Cambricon is also manufacturing a PCIe card.

Specification Comparison

AnandTech           Cambricon        Cambricon        Tesla V100       Tesla V100
                    MLU100-Base      MLU100-Perf      (SXM2)           (PCIe)
CUDA Cores          -                -                5120             5120
Tensor Cores        -                -                640              640
Core Clock          1.0 GHz          1.3 GHz          ?                ?
Boost Clock         -                -                1455 MHz         1370 MHz
Memory Clock        DDR4-1600        DDR4-1600        1.75Gbps HBM2    1.75Gbps HBM2
Memory Bus Width    256-bit          256-bit          4096-bit         4096-bit
Memory Bandwidth    102.4GB/sec      102.4GB/sec      900GB/sec        900GB/sec
VRAM                16GB / 32GB      16GB / 32GB      16GB / 32GB      16GB / 32GB
L2 Cache            -                -                6MB              6MB
Half Precision      64.0 TFLOPS      83.2 TFLOPS      30 TFLOPS        28 TFLOPS
Single Precision    -                -                15 TFLOPS        14 TFLOPS
Double Precision    -                -                7.5 TFLOPS       7 TFLOPS
Deep Learning       128.0 TOPS       166.4 TOPS       120 TFLOPS       112 TFLOPS
GPU                 -                -                GV100            GV100
Transistor Count    ?                ?                21B              21B
TDP                 80 W             110 W            300 W            250 W
Form Factor         PCIe             PCIe             SXM2             PCIe
Cooling             Active           Active           Passive          Passive
Process             TSMC 16FF        TSMC 16FF        TSMC 12FFN       TSMC 12FFN
Architecture        Cambricon-1?     Cambricon-1?     Volta            Volta

NVIDIA obviously has a strong user base and several generations of experience in this market, along with the software in hand to take advantage of its hardware. Cambricon did not go into detail about how it plans to support the new chip with SDKs, but it does list a series of SDKs on its website, supporting TensorFlow, Caffe, and MXNet.

Getting Into the Data Center: PCIe

The best way to be plug and play in a data center is through a PCIe card. Cambricon’s MLU100 accelerator card is just that: a PCIe 3.0 x16 implementation with either 16 or 32 GB of DDR4-3200 memory on a 256-bit bus, which is good for 102.4 GB/s of bandwidth. Getting that much memory from NVIDIA requires its high-end cards, but those cards offer several times the memory bandwidth. The memory on the MLU100 card also has ECC enabled.
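The bandwidth figures follow directly from bus width and transfer rate, and a short sketch makes the gap to NVIDIA concrete. It uses only the numbers quoted in this article (DDR4-3200 on a 256-bit bus, and 1.75 Gbps HBM2 on a 4096-bit bus for the V100). The helper name is hypothetical, and these are theoretical peaks rather than measured rates.

```python
def peak_bandwidth_gb_s(bus_width_bits, mega_transfers_per_sec):
    """Theoretical peak bandwidth: bytes per transfer x transfers per second."""
    bytes_per_transfer = bus_width_bits / 8
    # MT/s x bytes gives MB/s; divide by 1000 for GB/s (decimal units).
    return bytes_per_transfer * mega_transfers_per_sec / 1000


# MLU100 card: DDR4-3200 (3200 MT/s) on a 256-bit bus.
mlu100_gb_s = peak_bandwidth_gb_s(256, 3200)

# Tesla V100: HBM2 at 1.75 Gbps per pin (1750 MT/s) on a 4096-bit bus.
# NVIDIA rounds this to the 900 GB/s shown in the comparison table.
v100_gb_s = peak_bandwidth_gb_s(4096, 1750)

ratio = v100_gb_s / mlu100_gb_s

print(f"MLU100: {mlu100_gb_s:.1f} GB/s")
print(f"V100:   {v100_gb_s:.1f} GB/s")
print(f"V100 has {ratio:.2f}x the peak memory bandwidth")
```

On these numbers the V100 has a little under nine times the peak memory bandwidth of the MLU100 card.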

The reports so far state that Lenovo is offering the cards as add-ons to its ThinkSystem SR650 dual Intel Xeon servers; up to two per machine. Looking on the Lenovo website it does not look like they are available quite yet. Given Huawei’s big enterprise presence, it is likely that we might see the chips in those systems as well.

Next Generation: 5 TOPS/Watt

Also reported was the new Cambricon-1M product IP, although the company was not forthcoming with details. WikiChip states that this new IP is built primarily for 7nm, so we are likely to see it when Huawei/HiSilicon starts shipping 7nm mobile processors, and then in the next generation of server-focused products. The goal for this IP is to hit 5 TOPS/Watt, compared to the 3 TOPS/Watt advertised by ARM's IP. Schor also states that Cambricon has a training and inference chip planned for later this year, with another update in 2019.

Source: WikiChip, Cambricon 1, Cambricon 2

26 Comments

  • Bulat Ziganshin - Saturday, May 26, 2018

    Obviously, 8-bit integer and 16/32 FP operations cannot be directly compared.
  • mode_13h - Saturday, May 26, 2018

    Yeah, that was pretty weak, Ian. You should've made two rows - training TFLOPS and inferencing TOPS. The 64 (or 83.2) TFLOPS of half-precision performance are clearly meant to address the same purposes as V100's fp16 tensor cores.

    And do we know that V100 lacks the 8-bit integer dot product found in most of their Pascal GPUs?
  • Yojimbo - Sunday, May 27, 2018

    The V100 has the 8-bit integer operations. But the theoretical peak for the operation is 60 TOPS for the V100, less than the Tensor Core peak of 120 TOPS. So there's probably no reason to use it, as the Tensor Cores give better precision and faster execution.

    Of course, theoretical comparisons of such different chips are not very useful. We need real application benchmarks. But yeah, the comparisons in the chart are very wrong.

    Firstly, NVIDIA's half-precision FLOPS are general purpose FMAs. I get the idea that this ASIC's half-precision FLOPS are not, but are rather more akin to the Tensor Core FLOPS on the V100. The reason I say this is because the chip only has 102.4 GB/s of memory bandwidth, so all those FLOPS would be useless in most general purpose applications. They need specialized algorithms able to reuse data for high compute density to have any hope of taking advantage of those FLOPS with that bandwidth.

    Secondly, 8-bit TOPS should not generically be compared to NVIDIA's Tensor Core FLOPS under "deep learning". 8-bit quantization cannot be used for training and even in inference is only successfully used sometimes. The Tensor Cores can be used for more inference applications than 8-bit integer and can be used for training as well.

    Thirdly, NVIDIA Tensor Core implementation is 16 bit multiply with 32 bit accumulate, which is superior to 16 bit arithmetic, and this difference has been shown in research to be important.

    So, from the best I can tell, the proper comparison should be the V100's Tensor Core "Deep Learning" numbers with the Cambricon chip's "Half Precision" numbers, with the caveat that the Tensor Cores provide potentially better accuracy because of the 32-bit accumulate. The V100's Tensor Core numbers can also be compared with the Cambricon chip's 8-bit numbers for inference applications, but it should be noted that mixed precision floating point is being compared to 8-bit integer in that case.
  • steve_musk - Sunday, May 27, 2018

    The other elephant in the room for both these chips when discussing FLOPS/OPS is the memory bandwidth needed to feed the execution units. For the V100 Tensor Cores, you are doing really well if you can get even 40-50% of theoretical FLOPS (I’m a CUDA dev and have talked with some Nvidia guys) because you can’t get enough data in from memory to feed the cores. Even the Nvidia gurus who write “assembly” code for cuDNN only get near peak performance in very specific and limited circumstances, which involve loading a Tensor Core with data and then reusing it 10+ times.
  • Yojimbo - Sunday, May 27, 2018

    That's always the case, though. That's why you have to test hardware on real applications. Here the difference in bandwidth is so great that I would guess results would vary wildly depending on the test. But people still compare theoretical specs of products.
  • mode_13h - Tuesday, May 29, 2018

    That's why people use cuDNN and batching.
  • Bizwacky - Wednesday, May 30, 2018

    Thanks for this comment; it really helps clear up the comparison between the two. I think if these can hit 30% of the performance of the Nvidia cards, they still might have great price performance if Huawei can manage to sell them profitably at ~10% of the price.
  • npz - Saturday, May 26, 2018

    And that's highly specialized 8-bit Tensor Ops and not generic Integer operations either.
  • mode_13h - Tuesday, May 29, 2018

    Well, yes. Same goes for V100, in fact.
  • Pork@III - Saturday, May 26, 2018

    Poor Nvidia, poor Tesla, poor green fans.
