Several CAPI-Enabled Accelerators for OpenPOWER Servers Revealed

by Anton Shilov on April 12, 2016 2:00 PM EST

More than a dozen special-purpose accelerators compatible with next-generation OpenPOWER servers that feature the Coherent Accelerator Processor Interface (CAPI) were revealed at the OpenPOWER Summit last week. The accelerators are intended to encourage the use of OpenPOWER-based machines for technical and high-performance computing. Most of them are based on Xilinx high-performance FPGAs, but some feature custom silicon.

IBM’s CAPI port is a PCIe 3.0-based interconnect designed specifically for programmable processors (e.g., ASICs, GPUs and FPGAs) that lets them address the same memory address space as the CPU. CAPI requires custom hardware incorporated into IBM’s POWER8 processors, called the coherent accelerator processor proxy (CAPP), as well as a POWER service layer (PSL) integrated into the CAPI-supporting accelerator. The CAPP maintains a directory of cache lines held by the accelerator and snoops the processor bus on the accelerator’s behalf. The PSL performs address translations and caches coherent data for quick access by the accelerating hardware. To work, CAPI has to be supported by the hardware, the operating system and the application in use. At present, IBM’s POWER8 CPUs, a number of accelerators, Red Hat Enterprise Linux 7.2 LE (and higher), and Ubuntu LE, as well as select programs, support CAPI.
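
The division of labor between the two blocks can be illustrated with a toy Python model. This is a sketch, not IBM’s implementation: the class names (`ToyPSL`, `ToyCAPP`), the dictionary-backed memory and the 64 KiB page size are assumptions made purely for illustration, and only the 128-byte POWER8 cache line size reflects the real hardware. The model shows address translation, a directory of lines held by the accelerator, and invalidation when the CPU writes to a tracked line.

```python
CACHE_LINE = 128          # POWER8 cache line size in bytes
PAGE_SIZE = 64 * 1024     # assumed page size for this sketch


class ToyPSL:
    """Hypothetical stand-in for the PSL: translates addresses, caches lines."""

    def __init__(self, page_table):
        self.page_table = page_table  # virtual page number -> physical page number
        self.cache = {}               # physical line address -> cached value

    def translate(self, vaddr):
        """Map a virtual address to a physical address via the page table."""
        vpn, offset = divmod(vaddr, PAGE_SIZE)
        return self.page_table[vpn] * PAGE_SIZE + offset


class ToyCAPP:
    """Hypothetical stand-in for the CAPP: tracks lines held by the accelerator."""

    def __init__(self, psl):
        self.psl = psl
        self.directory = set()        # physical line addresses the accelerator holds

    @staticmethod
    def line_of(paddr):
        """Round a physical address down to its cache-line boundary."""
        return paddr - paddr % CACHE_LINE

    def accelerator_read(self, vaddr, memory):
        """Accelerator reads a virtual address; a miss fills the PSL cache."""
        line = self.line_of(self.psl.translate(vaddr))
        if line not in self.psl.cache:
            self.psl.cache[line] = memory.get(line, 0)
            self.directory.add(line)
        return self.psl.cache[line]

    def snoop_cpu_write(self, paddr, value, memory):
        """CPU writes a line; return True if the accelerator's copy was invalidated."""
        line = self.line_of(paddr)
        memory[line] = value
        if line in self.directory:
            self.directory.discard(line)
            self.psl.cache.pop(line, None)
            return True
        return False
```

In this model the accelerator works with the application’s virtual addresses, and a CPU write to a line in the directory forces the accelerator to fetch the fresh value on its next read, which is the behavior that spares the programmer explicit copies between host and device memory.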

IBM and the OpenPOWER Foundation need CAPI in order to enable a relatively simple and inexpensive way to build special-purpose accelerators for various workloads. The aim is to make POWER8-based machines viable for a variety of market segments as well as to create platforms that can process modern workloads faster.

While it is possible to enable unified memory for CPUs and co-processors using custom hardware and extensive device-driver work, this requires huge investments in silicon development and complex drivers, among other things. By contrast, programming an FPGA (field-programmable gate array) is considerably cheaper, and CAPI gives FPGAs key heterogeneous processing capabilities. While CAPI does not necessarily enable higher bandwidth between the CPU and the accelerator (after all, it is layered on top of PCI Express 3.0 and is bound by that bus’s specified peak bandwidth), IBM says it removes overheads, improves performance and can potentially simplify the workflow for programmers. In short, CAPI is an important part of IBM’s POWER strategy in general as well as of the OpenPOWER initiative.
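
For reference, the PCI Express 3.0 ceiling mentioned above can be worked out from the specification’s 8 GT/s per-lane signaling rate and 128b/130b encoding. The short Python sketch below does that arithmetic; the function name is our own, and the result is a theoretical one-direction figure that ignores packet and protocol overhead, so real-world throughput will be lower.

```python
def pcie3_peak_gbytes_per_s(lanes: int) -> float:
    """Theoretical one-direction PCIe 3.0 bandwidth in GB/s (10^9 bytes/s)."""
    transfers_per_s = 8e9      # 8 GT/s per lane
    encoding = 128 / 130       # 128b/130b line encoding efficiency
    bits_per_byte = 8
    return lanes * transfers_per_s * encoding / bits_per_byte / 1e9


if __name__ == "__main__":
    for lanes in (8, 16):
        print(f"PCIe 3.0 x{lanes}: {pcie3_peak_gbytes_per_s(lanes):.2f} GB/s")
```

By this calculation a x8 link tops out just under 8 GB/s per direction and a x16 link just under 16 GB/s, with or without CAPI on top.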

At this year’s OpenPOWER Summit, IBM and its partners revealed over a dozen special-purpose CAPI-enabled accelerators, most of them FPGA-based. This shows that the OpenPOWER platform is attracting interest and investment from a range of sources. The list of developers includes BittWare, DRC, IBM, Mellanox, Xilinx and others, although judging by OpenPOWER’s press release, some chose not to publish details about their accelerators. The accelerators revealed at the conference are either available now or set to become available in the coming quarters. The devices come in the form of PCIe 3.0 x8 or x16 cards and are compatible with IBM POWER8-based servers. Some are also compatible with machines running other processors, in which case CAPI is not supported.

IBM CAPI-Compatible Accelerators
Developer	Model	Hardware and Application
Alpha Data	ADM-PCIE-8K5	Xilinx UltraScale KU115-2 FPGA; 2×8 GB of DDR4-2400 with ECC (32 GB version can be built); dual Firefly connectors for up to 4×16 Gbps per connector. Reconfigurable accelerator for custom video processing, machine learning, HPC and network acceleration applications. Available as add-in PCIe 3.0 x8 cards.
BittWare	XUSP3S	Xilinx Virtex UltraScale 80/95/125/160/190 or Kintex UltraScale 115; 2×16 GB DDR4 ECC (64 GB version can be built), QDR memory; four QSFP28 cages for 1×400GbE, 4×100GbE, 4×40GbE, 16×25GbE, or 16×10GbE. Massive data flow and packet processing. Available as add-in PCIe 3.0 x16 cards.
DRC	GraphFind	Xilinx Kintex UltraScale KU115 FPGA. Can rapidly discover relationships between people, places, events, and objects while simultaneously identifying focal points with weighted strengths of connections. Available as a PCIe card, or as a pre-configured appliance consisting of multiple cards.
DRC	Novara	Xilinx FPGA. A search engine and accelerator that identifies key imprecise phrases and bit patterns using a fuzzy logic analyzer, which can instantly analyze millions of messages and data streams without the need to index first. Can process up to 2.5 GB of data per second. Available as a 1U server that contains up to four Novara cards. Servers can be clustered.
DRC	Ferrara2	Xilinx FPGA; four QSFP28 cages. Encrypts and/or authenticates data using the AES-256 algorithm with bit-splitting capability from Security First Corporation (SFC) at line rates up to 40 Gb/s. Available as PCIe 3.0 x16 add-in boards for servers, communication or storage systems. Multiple Ferrara2 boards can be placed in one system.
Edico Genome	DRAGEN Genomics Platform	Xilinx Virtex-7 980T FPGA; 4×4 GB DDR3L-1866 memory. Analyzes an entire human genome in 26 minutes (vs. 30 hours on general-purpose hardware). Enables healthcare providers to identify patients at higher risk for cancer before the conditions worsen. Compatible with the IBM S822LC server. Available in a pre-configured POWER8 server.
IBM	Prototype	Xilinx Virtex UltraScale 190; 16 GB of Micron HMC memory. Acceleration of in-memory computing applications. Available as add-in PCIe 3.0 x16 cards.
IBM, Nallatech, Redis Labs, Altera	IBM Data Engine for NoSQL	IBM Power S822L server(s); IBM FlashSystem 840 or 900 all-flash storage system(s); Altera Stratix V FPGA-based interconnection card with 10 GbE SFP+ ports by Nallatech. IBM FlashSystems are attached to the POWER8 processor through the CAPI coherent attach card. Thanks to the new interconnection method, the Redis Enterprise Cluster application can issue read/write commands that eliminate 97% of the code path length. According to IBM, this enables IBM Data Engine for NoSQL to access flash at latency levels comparable to traditional RAM-based x86 implementations. Various configurations available.
IBM, Nallatech, Samsung, Xilinx	Prototype	Xilinx FPGA; 2×1 TB Samsung M.2 NVMe SSDs. IBM Data Engine for NoSQL, which allows fast application exploitation in a smaller, in-server form-factor. Available as add-in PCIe 3.0 x8 cards.
Mellanox	ConnectX-4 VPI	ConnectX-4 VPI. ConnectX-4 adapter cards with virtual protocol interconnect (VPI) support EDR 100 Gb/s InfiniBand and 100 Gb/s Ethernet connectivity. Available as add-in PCIe 3.0 x16 cards.
Semptian	NSA-120, NSA-120B	Xilinx Kintex UltraScale XCKU060/XCKU115; 2×4 GB or 2×8 GB DDR3-1600 memory with ECC; two SATA interfaces. Network and service accelerator. Can be used in big data analysis, image recognition/processing, video encoding/decoding, data compression/decompression, data encryption/decryption, voice recognition, neural network, machine learning, network security, etc. Available as add-in PCIe 3.0 x8 cards.

One of the important announcements at the summit was Edico Genome’s DRAGEN genomics platform, which uses an accelerator powered by the Xilinx Virtex-7 980T FPGA and equipped with 16 GB of quad-channel DDR3L-1866 memory. The platform, which is based on a 2-way IBM S822LC server, can analyze an entire genome in 26 minutes, down from approximately 30 hours on general-purpose processors. An earlier prototype was shown at SuperComputing 2015; this, however, appears to be the announcement of the full product.
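
To put the quoted figures in perspective, the small Python sketch below computes the implied speedup and the number of genomes a single machine could process back to back in a day. The function names are ours, the inputs are the vendor-quoted 26 minutes and roughly 30 hours, and the throughput number assumes no idle time between runs, so treat the output as a rough estimate only.

```python
def speedup(baseline_minutes: float, accelerated_minutes: float) -> float:
    """Ratio of baseline run time to accelerated run time."""
    return baseline_minutes / accelerated_minutes


def genomes_per_day(minutes_per_genome: float) -> int:
    """Whole genomes completed in 24 hours of back-to-back runs."""
    return int((24 * 60) // minutes_per_genome)


if __name__ == "__main__":
    baseline = 30 * 60    # approximately 30 hours on general-purpose processors
    accelerated = 26      # minutes quoted for the DRAGEN platform
    print(f"Speedup: about {speedup(baseline, accelerated):.0f}x")
    print(f"Genomes per day (accelerated): {genomes_per_day(accelerated)}")
    print(f"Genomes per day (baseline): {genomes_per_day(baseline)}")
```

On these numbers the accelerated platform is roughly 69 times faster and could finish dozens of genomes in the time a general-purpose system needs to complete one.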

Other interesting solutions discussed at the summit include an FPGA-based accelerator for discovering relationships hidden in big data; an FPGA-powered fuzzy search engine for imprecise string searching and matching, which can analyze millions of messages and data streams without indexing; as well as various reconfigurable accelerators for HPC, Big Data, and so on. IBM also mentions that there are companies offering CAPI-enabled building blocks for FPGAs for computer vision, machine learning, and other applications. Some of those companies are startups or working in stealth mode (we do not know whether they developed their building blocks thanks to the SuperVessel program, though this is a possibility), and they may announce their products over time.

While the number of CAPI-enabled accelerators available today is not high, it is growing, which is good news for the OpenPOWER ecosystem. Another positive sign, according to IBM, is the number of China-based companies developing CAPI-enabled accelerators, which shows that local companies in growing server markets are interested in such solutions.

Source: OpenPOWER Foundation

9 Comments

SaolDan - Tuesday, April 12, 2016 - link
Neat!
SarahKerrigan - Tuesday, April 12, 2016 - link
Good to see a decent set of CAPI peripherals emerging. The presence of HMC on the mentioned IBM prototype is interesting; I've been very favorably impressed by the performance of HMC so far - gobs of memory bandwidth in a small physical footprint.
mosu - Wednesday, April 13, 2016 - link
What is the estimated cost of a Dragen Genomics platform? numbers please, if possible.
Shadow7037932 - Wednesday, April 13, 2016 - link
If you don't see a price listed, it's probably very expensive.
name99 - Wednesday, April 13, 2016 - link
That's a meaningless statement. "Very expensive" for the casual user who thinks it might be a neat addition to his home PC is more than $100. "Very expensive" for a research lab that is processing genomes by the thousand might be more than, what, $100,000?

Of course the card costs money. The question is: what does it (and the associated POWER8 box) cost compared to the alternative.
Freakie - Thursday, April 14, 2016 - link
Being into genomics myself, I'd be willing to bet that at $130,000 Nvidia's DGX-1 server with 8 Tesla P100s would get the job done in less than an hour, cost less, and still be a general purpose machine to do other types of work on.
tuxRoller - Tuesday, April 19, 2016 - link
It'll also take up more room and blow through more energy.
Freakie - Tuesday, April 19, 2016 - link
The DGX-1 is only 3U compared to (probably) 2U of the Dragen Genomics platform. So room really isn't much of an issue. Probably will go through more energy, but that's a negligible factor as these aren't being scaled to 100 nodes.

But really, having the machine be x86 and general purpose means that I can utilize those Pascal based Tesla's to do advanced DNA and molecular modeling, with the data that I get from analyzing the DNA in the first place which is all that the Dragen unit does. The article even says that its use-case is for situations where reconstructing sequenced DNA is the only thing that is needed. In a research environment, you'd go for the expensive computer that does lots of things, not the expensive computer that does one thing. The Dragen is really just a super-focused machine that only does a single thing, not for genomics research unless you happen to have an absurd amount of money to spend.
kirannmehta - Thursday, April 14, 2016 - link
This---otherwise nice article---omits explicit mention of CAPI support by AIX
