Hot Chips: IBM's Next Generation z14 CPU Mainframe Live Blog (5pm PT, 12am UTC)

Author: Dr. Ian Cutress

by Ian Cutress on August 22, 2017 6:55 PM EST

07:56PM EDT - Sitting down, ready to go

08:01PM EDT - This is the last set of talks at Hot Chips. Starting with IBM, then Intel Xeon, AMD EPYC and Qualcomm Centriq

08:02PM EDT - We've covered Xeon, EPYC and Centriq in recent articles, and nothing new is being announced for them at the show, apart from some minor things that we'll summarize in a news post

08:02PM EDT - But the IBM z14 will be interesting

08:02PM EDT - To clarify, the z series is IBM's mainframe product line

08:02PM EDT - So this isn't POWER8 or POWER9

08:04PM EDT - IBM's z-series has central processors and system control chips with integrated fabric and off-compute chip caches

08:05PM EDT - This is under a 'mainframe' setup, rather than a standard CPU/co-processor setup.

08:05PM EDT - Dr Christian Jacobi to the stage, Chief Architect

08:06PM EDT - z14 was technically announced a few weeks ago

08:06PM EDT - A lot of mainframes still exist

08:06PM EDT - Still used in large corporations for transactional data, e.g. a credit card transaction has a mainframe involved. 90% of airline booking systems involve mainframes

08:07PM EDT - They run large databases and large virtualised Linux installations

08:07PM EDT - Have to make design decisions tailored for those workloads

08:07PM EDT - z10 was high frequency, z196 had OoO, z13 had SMT and now z14

08:08PM EDT - The mainframe uses two different chips - the CP (cores and shared L3) and SCP (large L4 and interconnect logic)

08:08PM EDT - Picture is a deep drawer with DRAM, PCIe, and six CP chips under cold plates and one SC (SCP)

08:08PM EDT - Two clusters of CP chips connect to the SC. Can connect four drawers together

08:09PM EDT - CP and SC are large chips, 17 layer metal in 14nm SOI

08:09PM EDT - 10 cores, each with private 2MB L2-I and 4MB L2-D, plus 128MB of shared L3

08:09PM EDT - SC chip has 672MB of L4 and coherency logic

08:10PM EDT - Up to 24 sockets in the system, 32 TB RAIM protected memory, 40 PCIe lane fanouts, 320 IO cards

08:10PM EDT - New translation and TLB design over z13, and general pipeline optimizations. Changes in instruction set too

08:10PM EDT - Pauseless garbage collection for Java, single and quad vector precision for crypto

08:11PM EDT - Register to register arithmetic

08:11PM EDT - Optimizing for COBOL performance (........)

08:11PM EDT - E.g. gazillions of lines of COBOL in online booking systems

08:11PM EDT - Compression acceleration

08:11PM EDT - This is the pipeline diagram

08:12PM EDT - 5.2 GHz, super long pipeline

08:12PM EDT - 6 instruction parse and decode, CISC instruction cracking

08:12PM EDT - 4-cycle load/use

08:12PM EDT - Directory and TLB pipeline changes

08:13PM EDT - Most designs use logical indexed, absolute tagged directory

08:13PM EDT - Use of partial compare set-predict array reduces latency of data return from L1 cache - TLB and L1 directory access happen in parallel with L1 cache read

08:13PM EDT - (doesn't that sound like way-prediction?)
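
For readers who want to see the idea in miniature, here is a toy software model of a partial-compare set-predict lookup. It is only a sketch of the general technique as described on the slide, not IBM's design: the class name `CacheSet`, the 4-bit `PARTIAL_BITS` width and the tag values are all hypothetical, and real hardware performs the narrow compare and the full directory compare in parallel rather than one after the other.

```python
PARTIAL_BITS = 4  # hypothetical: number of tag bits kept in the predict array
PARTIAL_MASK = (1 << PARTIAL_BITS) - 1


class CacheSet:
    """One set of a toy set-associative cache."""

    def __init__(self, ways: int):
        self.full_tags = [None] * ways     # stand-in for the full directory
        self.partial_tags = [None] * ways  # stand-in for the set-predict array
        self.lines = [None] * ways         # cached data

    def fill(self, way: int, tag: int, line: str) -> None:
        self.full_tags[way] = tag
        self.partial_tags[way] = tag & PARTIAL_MASK
        self.lines[way] = line

    def read(self, tag: int):
        partial = tag & PARTIAL_MASK
        # Fast path: a narrow compare picks one way to read early.
        for way, stored in enumerate(self.partial_tags):
            if stored == partial:
                speculative = self.lines[way]
                # The full compare confirms or rejects the early read.
                if self.full_tags[way] == tag:
                    return speculative
                return None  # prediction was wrong: treat as a miss
        return None  # no way matched even the partial tag


cache_set = CacheSet(ways=4)
cache_set.fill(0, 0x1A3, "line A")
cache_set.fill(1, 0x2B7, "line B")

hit = cache_set.read(0x2B7)         # partial and full tag both match way 1
mispredict = cache_set.read(0x2B3)  # partial tag matches way 0, full tag does not
```

The point of the narrow compare is that the data array read can start before the full tag and TLB results are known, which is where the latency saving comes from.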

08:14PM EDT - A highly associative TLB is area and power inefficient, which limits the L1 TLB size

08:14PM EDT - Sorry, I misread the slide. This is how the L1 cache looks today

08:14PM EDT - This new slide shows how IBM is using it in z14

08:15PM EDT - I-cache and D-cache are now logically tagged, combining TLB1 and cache directory into a single structure

08:15PM EDT - Significant area and power reduction for L1 hit

08:15PM EDT - Now a super large L2 TLB

08:16PM EDT - L2 and TLB2 can be large - 2MB L2I and 4MB L2D, 6k entries TLB2 for 4KB pages

08:16PM EDT - 8 cycle L2 hit latency (that's only 1.5 ns) ...
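
The 1.5 ns figure follows directly from the 5.2 GHz clock quoted earlier. A minimal sketch of the conversion, where the function name `cycles_to_ns` is just an illustrative choice:

```python
CLOCK_GHZ = 5.2      # core clock quoted in the talk
L2_HIT_CYCLES = 8    # L2 hit latency quoted in the talk
LOAD_USE_CYCLES = 4  # load/use latency quoted in the talk


def cycles_to_ns(cycles: int, clock_ghz: float) -> float:
    # One cycle at f GHz lasts 1/f nanoseconds.
    return cycles / clock_ghz


l2_hit_ns = cycles_to_ns(L2_HIT_CYCLES, CLOCK_GHZ)
load_use_ns = cycles_to_ns(LOAD_USE_CYCLES, CLOCK_GHZ)

print(f"L2 hit:   {l2_hit_ns:.2f} ns")
print(f"load/use: {load_use_ns:.2f} ns")
```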

08:17PM EDT - Now crypto

08:17PM EDT - Now redesigned for 4-7x bandwidth

08:17PM EDT - make it simple and fast enough to be able to encrypt all data

08:17PM EDT - combination of OS, firmware and hardware implementation

08:18PM EDT - Execute 2 AES in 3 cycles

08:18PM EDT - Copy up to 256B per instruction from D-cache to coprocessor

08:18PM EDT - can execute multiple AES at once, multiple engines on die

08:19PM EDT - 13.2GB/sec per core (so 132GB/s per CP, and about 1TB/s per 6-socket server)
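
A quick check of the scaling, assuming the per-core figure simply multiplies across the 10 cores per CP and the six CP chips per drawer mentioned earlier (a rough upper bound that ignores any shared bottlenecks):

```python
PER_CORE_GBPS = 13.2  # per-core crypto throughput quoted in the talk
CORES_PER_CP = 10     # cores per CP chip, from earlier in the talk
CP_PER_DRAWER = 6     # CP chips per drawer, from earlier in the talk

per_cp_gbps = PER_CORE_GBPS * CORES_PER_CP
per_drawer_gbps = per_cp_gbps * CP_PER_DRAWER

print(f"per CP chip:       {per_cp_gbps:.0f} GB/s")
print(f"per six-CP drawer: {per_drawer_gbps:.0f} GB/s")
```

The straight product for six CP chips comes to 792 GB/s, which is the same order as the roughly 1 TB/s quoted above.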

08:19PM EDT - Use new instructions to feed crypto engine to avoid branches

08:19PM EDT - Avoid pipeline bubbles using new instructions

08:19PM EDT - Significant effort in prefetching as well

08:20PM EDT - New GCM instruction

08:20PM EDT - Algorithm that does encryption and signature authentication

08:20PM EDT - Implemented using the AES and GHASH engines

08:20PM EDT - the 2 engines used in concert rather than independently

08:21PM EDT - Now key protection - most CPUs work with keys in memory. CryptoExpress6S is a tamper-responding PCIe crypto accelerator. Master key is in physically protected memory on the card

08:21PM EDT - 'Clear Key Cryptography'

08:22PM EDT - Root key access usually means can steal key through mem access or core dump. This method means that the key is protected by tamper protection

08:23PM EDT - Secure Key is another mode, which diverts all crypto off the CPU onto the card instead

08:23PM EDT - This way the application never sees the key, just sees the encrypted data

08:24PM EDT - Creates a key token from the data, which remains in tamper resistant memory, and when data is decrypted, key is thrown away and new key generated

08:24PM EDT - Data Compression Accelerator

08:24PM EDT - Dictionary based data compression

08:25PM EDT - Reduces bandwidth need between memory and disks, increases efficiency, implemented as firmware and co-processor specialized hardware

08:25PM EDT - z14 performance at peak throughput and start up latency. Optimized compression status return to firmware

08:26PM EDT - Order-preserving compression: Allows data to still be compared when compressed

08:26PM EDT - Allows compressed directory/tree structures to do comparisons between elements without decompression
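
To illustrate what "order-preserving" buys you, here is a minimal sketch of an order-preserving dictionary encoding. This is a generic textbook construction, not the z14 scheme, whose details were not given in the talk; the function names and the sample keys are made up for illustration. Codes are assigned in sorted order, so comparing two codes gives the same answer as comparing the original values.

```python
def build_dictionary(values):
    # Assign codes in sorted order so that code order mirrors value order.
    return {value: code for code, value in enumerate(sorted(set(values)))}


def compress(values, dictionary):
    # Replace each value with its (much smaller) integer code.
    return [dictionary[value] for value in values]


keys = ["SMITH", "ADAMS", "JONES", "ADAMS", "SMITH"]
dictionary = build_dictionary(keys)
codes = compress(keys, dictionary)

# Every pairwise comparison on the codes agrees with the comparison on
# the original strings, so no decompression is needed to compare or sort.
order_preserved = all(
    (a < b) == (ca < cb)
    for a, ca in zip(keys, codes)
    for b, cb in zip(keys, codes)
)
print(codes, order_preserved)
```

This is why a compressed index or tree can be searched directly: the traversal only ever needs the relative order of keys.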

08:27PM EDT - CP has 7b transistors, SC has 10b transistors

08:27PM EDT - water cooled

08:28PM EDT - of 240 CPUs in a full system, 170 can be customer configured

08:28PM EDT - +35% capacity, +10% single thread, +25% SMT2 perf over z13

08:29PM EDT - Now for Q&A

08:29PM EDT - Q: Please generate workstations. I want to swap out x86 with z14

08:29PM EDT - (at same price, insert laughs)

08:29PM EDT - Not a serious question

08:30PM EDT - Q: What power for the chips?

08:31PM EDT - A: You can get the chips to run at any power you need. Could go 400-500W on high workload. We aim around 300-350W. We don't bin - there's only one product and we stay within the drawer power

08:31PM EDT - The chips themselves are water cooled, but customers can run an aircooled system, or you can hook up datacenter water

08:32PM EDT - Q: Doesn't going over the PCIe card cause extra latency?

08:32PM EDT - A: Card only has the master key - the data has a key token, which doesn't need to keep going back and forth

08:32PM EDT - Q: Have you considered something like SGX?

08:33PM EDT - A: That's not an apples to apples comparison. We consider the tamper resistant element a key feature of our products.

08:34PM EDT - Q: But SGX prevents someone with a logic analyzer going in

08:34PM EDT - A: Our solution does not need recoding - our customers use older software and it is transparent

08:34PM EDT - Q: What would you do to make COBOL run faster?

08:35PM EDT - A: COBOL spends a lot of time doing BCD arithmetic, but there are traditional issue queue limitations, so we use packed BCD compute to reduce that bottleneck

08:36PM EDT - Q: What did +35% capacity and +25% SMT2 mean?
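
For anyone unfamiliar with packed BCD, the sketch below shows the data format only, not how the hardware computes on it. It follows the usual packed decimal layout (two decimal digits per byte, with the final nibble holding the sign: 0xC for positive, 0xD for negative); the helper names `pack_bcd` and `unpack_bcd` are illustrative.

```python
def pack_bcd(value: int) -> bytes:
    # Two decimal digits per byte; the last nibble is the sign.
    sign = 0xD if value < 0 else 0xC
    nibbles = [int(digit) for digit in str(abs(value))] + [sign]
    if len(nibbles) % 2:
        nibbles.insert(0, 0)  # pad with a leading zero digit to fill a byte
    return bytes((hi << 4) | lo for hi, lo in zip(nibbles[0::2], nibbles[1::2]))


def unpack_bcd(data: bytes) -> int:
    nibbles = []
    for byte in data:
        nibbles += [byte >> 4, byte & 0xF]
    sign = nibbles.pop()  # final nibble carries the sign
    magnitude = int("".join(str(nibble) for nibble in nibbles))
    return -magnitude if sign == 0xD else magnitude


positive = pack_bcd(1234)
negative = pack_bcd(-56)
print(positive.hex(), negative.hex())
print(unpack_bcd(positive), unpack_bcd(negative))
```

Because each digit stays a decimal digit, values convert to and from printable form without any binary/decimal conversion, which is a large part of why COBOL workloads lean on this format.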

08:37PM EDT - A: +35% is instructions for a whole system. The +10% single thread is a large scale number for benchmarks on capacity planning. +25% SMT2 from tuning and tweaking in our implementation due to maturity

08:37PM EDT - That seems to be a wrap. This is our last live blog on Hot Chips - I'll be writing up some of these talks on my flight home tomorrow. Hope you enjoyed them :)

67 Comments

beginner99 - Wednesday, August 23, 2017
I would like to say I wonder who buys these systems. But then my daily work involves interacting with an Oracle DB running on IBM AIX...

FreckledTrout - Wednesday, August 23, 2017
Oracle 12c runs great on IBM Power8 and AIX7. We have a lot of that here too....

SarahKerrigan - Wednesday, August 23, 2017
Banks. Insurers. Airlines.

Mainframes are not uncommon, and IBM is not the only vendor building them (although it is the dominant one.)

FunBunny2 - Wednesday, August 23, 2017
-- IBM is not the only vendor building them (although it is the dominant one.)

but, IIRC, only the 360/370/390/z ISA is implemented on all the others, Burroughs possibly excepted. Unisys bought up the dregs years ago, and the MCP machines are in limbo.

SarahKerrigan - Wednesday, August 23, 2017
Fujitsu and Hitachi are both IBM compatible (more or less.) Fujitsu has a roadmap going through 2030 with new hardware every couple years. Hitachi is moving to rebadging IBM systems, starting next year. Fujitsu historically is about on par with IBM in Japan, with product cycles causing IBM or Fujitsu to gain temporary sales boosts over each other. They also have a German mainframe business they inherited from Siemens, which does okay. Hitachi's business is smaller.

NEC has mainframes with several generations roadmapped. These are most assuredly not IBM compatible; they're distant relatives of Bull's (mostly-dead) DPS-7 mainframe family. NEC made an abortive attempt to move to emulation, which sold badly, but is now firmly back in the hardware game and has had 20-25% mainframe market share in Japan.

Bull (now Atos) and Unisys both remain in the mainframe OS business but have exited mainframe hardware. Both do emulation on x86, and neither have platforms that are 360-related. Bull's systems, especially the older one (GCOS 8) are effectively in maintenance mode, with small customer bases (mostly in France.) Unisys continues to throw meaningful development resources at MCP and OS 2200, and they do pretty well, especially in the US or Latin America; in the US, if a company is running non-IBM mainframe operating systems, it's almost certainly Unisys.

Anyway, sorry for the wall of text. tl;dr version: IBM is dominant but not the only one - especially in Japan.

name99 - Wednesday, August 23, 2017
No need to apologize.
The one-in-a-hundred comment that contains actual content is the only reason to wade through the other ninety nine!

That Japan situation is bizarre, especially when you throw in that Fujitsu also makes HPC SPARC designs and is soon going to be making HPC ARM designs.
Maybe that's why Japan is so wimpy in software --- every CS-competent engineer they produce goes into making yet another hardware variant?!

SarahKerrigan - Wednesday, August 23, 2017
I'm glad you appreciate it!

For some context, there was a lot of government and quasi-government (NTT) support for the domestic computer industry in Japan. Originally, there were even more, with the vendors organized into pairs: Hitachi/Fujitsu worked together on IBM clones, often in collaboration with Amdahl; NEC and Toshiba worked on derivatives of Honeywell/Bull's GCOS lines; Oki and Mitsubishi cloned, if I recall, the same RCA product line that got recycled into Univac VS/9 and Siemens BS2000 - but that's the one I know the least about, so don't take it as gospel.

Fujitsu and Hitachi ultimately diverged (a friend of mine, who worked in their mainframe unit, called it a "very messy divorce" without elaborating.) NEC bought out Toshiba's mainframe line and continues alone to the present day. Somewhere, the Oki/Mitsubishi mainframe partnership seems to have fallen apart, and Mitsubishi ended up shipping IBM clones and later IBM rebadges running a Mitsubishi OS, but afaik they're long since out of that business.

On top of all that, Japanese companies often have a real preference for local vendors - which is why you see Hitachi rebadging HPE's Itanium boxes, for instance. Fujitsu, Hitachi, and NEC are *not* necessarily trying to compete with z directly on performance or scalability, but they have a customer base that likes their products, so improvements continue. Hitachi's systems, for instance, top out at 8 cores and 64GB RAM, afaik.

By the way, of the Japanese vendors, Fujitsu is the only one I know of with any kind of real mainframe customer base internationally (running BS2000 in Europe, and some MSP/XSP in Australia.)

Again, I apologize for the text wall. :-)

jospoortvliet - Wednesday, August 30, 2017
Thanks for the interesting comments!!!

SFoster4 - Thursday, September 7, 2017
RCA did become Univac VS/9. I worked on conversions from VS/9 to OS1100 in the early 80's.

sorten - Wednesday, August 23, 2017
What was that about lies, damn lies and statistics? 91% of CEOs say "new customer facing apps are accessing the mainframe?" Is that because most CIOs are incompetent?
