08:30PM EDT - Remember back when Arm announced Scalable Vector Extensions? Well, Fujitsu has made an Arm CPU that uses it with a 512-bit width. The presentation looks super interesting, so follow along with our live blog. The talk is set to start at 5:30pm PT / 12:30am UTC.

08:32PM EDT - Last time we were here, had a 3-min presentation about Post-K

08:32PM EDT - Called A64FX

08:32PM EDT - First chip to use Arm SVE

08:32PM EDT - Scalable Vector Extensions

08:33PM EDT - New microarch maximises SVE perf

08:33PM EDT - Fujitsu has been making processors for 60 years

08:34PM EDT - SPARC? Remember that?

08:34PM EDT - UNIX, HPC, Mainframe, now HPC + AI

08:34PM EDT - New CPU inherits DNA from Fujitsu

08:34PM EDT - Reliability, speed, flexibility, high perf/watt

08:34PM EDT - end up with CPU w/ extremely high throughput

08:35PM EDT - low power

08:35PM EDT - (A64FX doesn't mean Athlon 64, FX)

08:35PM EDT - Optimized for massively parallel

08:35PM EDT - Four features

08:35PM EDT - Perf: FP64 through to INT8

08:36PM EDT - Throughput: 512-bit SIMD x 2 pipes/core, HBM2, 48-cores, Tofu interconnect

08:36PM EDT - Efficiency: GEMM and Triad perf

08:37PM EDT - Standards: Arm v8.2 + SVE + SBSA level 3 (Server Base System Architecture)

08:37PM EDT - AArch64 only, no 32

08:37PM EDT - 48 computing cores and 4 identical assistant cores

08:37PM EDT - 32GiB HBM2

08:38PM EDT - 6D Mesh - 28 Gbps x 2 lanes x 10 ports
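A quick sanity check on what the quoted link figures multiply out to. This is only a sketch: it assumes the 28 Gbps number is a raw per-lane signalling rate in one direction and ignores any encoding or protocol overhead, none of which the talk spelled out here.

```python
# Figures from the slide: 28 Gbps per lane, 2 lanes per port, 10 ports.
GBPS_PER_LANE = 28
LANES_PER_PORT = 2
PORTS = 10

def tofu_raw_gbps(gbps_per_lane=GBPS_PER_LANE, lanes=LANES_PER_PORT, ports=PORTS):
    """Raw aggregate rate in Gbit/s (assumed: one direction, no encoding overhead)."""
    return gbps_per_lane * lanes * ports

raw = tofu_raw_gbps()
print(f"{raw} Gbit/s raw, {raw / 8} GB/s raw")
```

Usable bandwidth will be lower than the raw figure once encoding and protocol overhead are taken out.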

08:38PM EDT - PCIe 3.0 x16

08:38PM EDT - Built on 7nm FinFET

08:38PM EDT - 8.786B transistors, but only 594 pins

08:38PM EDT - 2.7 TFLOPS
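No clock speed appears in these notes, but the quoted peak implies one. A rough back-calculation, assuming the 2.7 TFLOPS is FP64, that only the 48 compute cores count, and that each FMA counts as two floating-point operations:

```python
CORES = 48                # compute cores (from the talk)
PIPES_PER_CORE = 2        # 512-bit SIMD pipes per core (from the talk)
VECTOR_BITS = 512
PEAK_FP64_FLOPS = 2.7e12  # quoted peak, assumed to be FP64

def flops_per_cycle(element_bits, ops_per_fma=2):
    """Chip-wide floating-point ops per cycle, assuming one FMA per lane per pipe."""
    lanes = VECTOR_BITS // element_bits
    return CORES * PIPES_PER_CORE * lanes * ops_per_fma

def implied_clock_ghz(peak_flops, element_bits=64):
    """Clock (GHz) that would yield the quoted peak under the assumptions above."""
    return peak_flops / flops_per_cycle(element_bits) / 1e9

print(flops_per_cycle(64))                           # FP64 ops per cycle, whole chip
print(round(implied_clock_ghz(PEAK_FP64_FLOPS), 3))  # implied clock in GHz
```

Under those assumptions the chip would be running somewhere around 1.76 GHz. Treat that as an inference, not a spec.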

08:38PM EDT - 1TB/s memory bandwidth
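Triad was called out under efficiency, and it is the kernel where this memory bandwidth matters most. Below is a sketch of the STREAM-style triad and the bandwidth-bound ceiling it would hit. It assumes FP64 elements, 1 TB/s taken as 1e12 bytes/s, and three 8-byte memory streams per iteration with no write-allocate traffic.

```python
BYTES_PER_ELEM = 8  # FP64

def triad(b, c, q):
    """STREAM-style triad: a[i] = b[i] + q * c[i]."""
    return [bi + q * ci for bi, ci in zip(b, c)]

def triad_bound_gflops(mem_bw_bytes_per_s):
    """Upper bound on triad GFLOPS when memory bandwidth is the only limit."""
    bytes_per_iter = 3 * BYTES_PER_ELEM  # load b, load c, store a
    flops_per_iter = 2                   # one multiply, one add
    return mem_bw_bytes_per_s / bytes_per_iter * flops_per_iter / 1e9

print(triad([1.0, 2.0], [3.0, 4.0], 2.0))
print(round(triad_bound_gflops(1e12), 1))
```

That ceiling is a few percent of the 2.7 TFLOPS compute peak, which is why triad results track memory bandwidth, not the SIMD units.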

08:39PM EDT - ISA feature support

08:39PM EDT - Optimized SVE for wide range of applications

08:39PM EDT - INT8 Dot Product

08:39PM EDT - Enhanced compression

08:39PM EDT - AI applications

08:40PM EDT - HW Barrier and Sector cache - implementation defined system registers from AArch64

08:40PM EDT - Enhanced blocks in chip

08:40PM EDT - Predicated operations dedicated pipe

08:41PM EDT - SVE has limitation on operands - FMA equivalent requires destructive 3-operand FMA3

08:41PM EDT - MOVPRFX instruction

08:41PM EDT - hides overhead of main pipeline
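For anyone unfamiliar with the destructive-FMA point: in SVE the accumulator register is also the destination, so keeping the original accumulator needs a copy first, and MOVPRFX is that copy. Below is a toy model in Python. The register names, the 4-lane width, and the lack of predication are simplifications for illustration, and nothing here models how A64FX handles the pair internally.

```python
def movprfx(regs, zd, zs):
    """Copy zs into zd; architecturally a prefix for the next instruction."""
    regs[zd] = list(regs[zs])

def fmla(regs, zda, zn, zm):
    """Destructive FMA: zda <- zda + zn * zm (the accumulator is overwritten)."""
    regs[zda] = [a + n * m for a, n, m in zip(regs[zda], regs[zn], regs[zm])]

def fma4(regs, zd, za, zn, zm):
    """Constructive 4-operand FMA built from MOVPRFX + destructive FMLA."""
    movprfx(regs, zd, za)
    fmla(regs, zd, zn, zm)

regs = {
    "z0": [0.0] * 4,
    "z1": [1.0, 2.0, 3.0, 4.0],  # accumulator input, should survive
    "z2": [2.0] * 4,
    "z3": [10.0] * 4,
}
fma4(regs, "z0", "z1", "z2", "z3")
print(regs["z0"], regs["z1"])
```

The copy is extra work on paper, which is presumably the overhead the slide says gets hidden.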

08:42PM EDT - 21.6 TOPS for INT8 dot product
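The 21.6 figure is 8x the 2.7 TFLOPS quoted earlier, which lines up with the lane count: a 512-bit vector holds 64 INT8 elements versus 8 FP64 elements. A small sketch of that scaling, assuming the same number of operations per lane per cycle at every element width (my reading of the ratio, not something stated on the slide):

```python
VECTOR_BITS = 512

def lanes(element_bits):
    """Number of elements of the given width in one 512-bit vector."""
    return VECTOR_BITS // element_bits

def scaled_peak(peak_fp64_tera, element_bits):
    """Scale the FP64 peak by the lane-count ratio (assumed uniform ops per lane)."""
    return peak_fp64_tera * lanes(element_bits) / lanes(64)

print(lanes(8), lanes(64))
print(scaled_peak(2.7, 8))
```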

08:42PM EDT - 90% execution efficiency

08:42PM EDT - Still 2x in 64-bit DGEMM over SPARC64 PrimeHPC FX100

08:43PM EDT - Almost 20x the K comp in DGEMM

08:43PM EDT - L1 cache is key to design for 512-bit SIMD

08:43PM EDT - Combined Gather mechanism to increase throughput

08:44PM EDT - Combined Gather can return up to two consecutive elements in a 128-byte aligned block at once
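To get a feel for why this helps, here is a toy access counter. It assumes "consecutive elements" means adjacent vector lanes (0-1, 2-3, and so on) and that a pair is combined only when both addresses land in the same 128-byte aligned block. The pairing rule is my guess at the mechanism, not a description of the hardware.

```python
BLOCK_BYTES = 128

def gather_accesses(addresses, combine=True):
    """Count L1 accesses for one gather, combining adjacent lane pairs when allowed."""
    accesses = 0
    for i in range(0, len(addresses), 2):
        pair = addresses[i:i + 2]
        same_block = (len(pair) == 2
                      and pair[0] // BLOCK_BYTES == pair[1] // BLOCK_BYTES)
        accesses += 1 if (combine and same_block) else len(pair)
    return accesses

dense = [i * 16 for i in range(8)]    # 8 lanes, all inside one 128-byte block
sparse = [i * 256 for i in range(8)]  # 8 lanes, every lane in a different block
print(gather_accesses(dense), gather_accesses(dense, combine=False))
print(gather_accesses(sparse))
```

Under this model the best case halves the number of accesses, and a fully scattered gather sees no benefit.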

08:44PM EDT - Throughput per core is 32 bytes/cycle

08:44PM EDT - Full chip is divided into four Core Memory Groups (CMGs)

08:45PM EDT - One CMG is 13 cores, an L2 cache, and a memory controller

08:45PM EDT - One core handles Daemon/IO
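The CMG numbers reconcile with the earlier 48 + 4 core count if the daemon/IO core is one per CMG, which is how I read the slide. A trivial check under that assumption:

```python
CMGS = 4
CORES_PER_CMG = 13
ASSISTANT_PER_CMG = 1  # assumed: the daemon/IO core is one per CMG

def core_breakdown(cmgs=CMGS, per_cmg=CORES_PER_CMG, assistant=ASSISTANT_PER_CMG):
    """Return (compute cores, assistant cores) across the whole chip."""
    return cmgs * (per_cmg - assistant), cmgs * assistant

print(core_breakdown())
```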

08:45PM EDT - Cache coherency by ccNUMA with on-chip directory

08:45PM EDT - X-bar connection for L2 cache efficiency

08:45PM EDT - Process binding ensures scaling

08:45PM EDT - Wide Ring Bus for IO across whole chip

08:46PM EDT - Bandwidth in cache and memory is key

08:46PM EDT - Out-of-order mechanisms in cores, caches, and IMCs

08:46PM EDT - L1 cache at 11.0 TB/s

08:46PM EDT - L2 cache is 3.6 TB/s
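One way to read these bandwidth numbers is as bytes available per floating-point operation at each level. A sketch using only figures quoted in the talk (11.0 and 3.6 TB/s for the caches, 1 TB/s for memory, 2.7 TFLOPS peak), assuming they are all chip-wide aggregates in the same decimal units:

```python
PEAK_TFLOPS = 2.7
BANDWIDTH_TBPS = {"L1": 11.0, "L2": 3.6, "HBM2": 1.0}

def bytes_per_flop(bandwidth_tbps, peak_tflops=PEAK_TFLOPS):
    """Bytes deliverable per floating-point operation at peak compute rate."""
    return bandwidth_tbps / peak_tflops

for level, bw in BANDWIDTH_TBPS.items():
    print(f"{level}: {bytes_per_flop(bw):.2f} B/FLOP")
```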

08:47PM EDT - Normalized compared to previous processor, perf is 2x across wide range of workloads

08:48PM EDT - For AI, convolution low precision is 9.4x using INT8 dot product

08:48PM EDT - Each chip has energy monitor in msec

08:49PM EDT - Each core has energy analyzer in nanosec

08:49PM EDT - Fine grained power analysis of a core, an L2 cache and memory

08:49PM EDT - Power Knob for optimization

08:49PM EDT - Can change hardware config for power

08:50PM EDT - Change decode width, floating point pipeline, and general frequency reduction

08:50PM EDT - Extensive RAS

08:50PM EDT - ECC on all caches

08:50PM EDT - Parity check on execution units

08:50PM EDT - 128400 error checkers

08:51PM EDT - Hardware instruction retry

08:52PM EDT - Software stacks developed by RIKEN and Fujitsu

08:52PM EDT - Will continue to use Arm in the future

08:52PM EDT - Work with partners

08:53PM EDT - Q&A time

08:54PM EDT - Q: When can you reach exascale? A: The Post-K system will be available in 2021. 100x perf from K-comp. But exa-scale not answerable

08:55PM EDT - Q: Nanosecond-level power monitoring - what techniques do you use? A: Activity-based estimate, using coefficients tied to operations
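The answer was brief, but the general idea of an activity-based estimate is to count events and weight each by an energy coefficient. Below is a generic sketch of that technique. The event names and picojoule values are entirely hypothetical and are not Fujitsu's.

```python
# HYPOTHETICAL per-event energy coefficients in picojoules; not Fujitsu's numbers.
COEFF_PJ = {"fma512": 100.0, "l1_load": 20.0, "l2_access": 60.0}

def estimate_energy_pj(event_counts, coeff=COEFF_PJ):
    """Energy estimate: sum of (event count x per-event coefficient)."""
    return sum(count * coeff[event] for event, count in event_counts.items())

def average_power_w(event_counts, interval_ns, coeff=COEFF_PJ):
    """Average power over the interval; 1 pJ/ns equals 1 mW."""
    return estimate_energy_pj(event_counts, coeff) / interval_ns * 1e-3

counts = {"fma512": 10, "l1_load": 20, "l2_access": 5}
print(estimate_energy_pj(counts), average_power_w(counts, interval_ns=10))
```

Because this is a model fed by counters, not a direct measurement, it can plausibly be evaluated at very fine time granularity.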

08:56PM EDT - Q: Support 64-bit FP, not 128-bit? A: No.

08:59PM EDT - That's a wrap. Next talk is on the NEC Vector processor: https://www.anandtech.com/show/13259/hot-chips-2018-nec-vector-processor-live-blog

8 Comments

  • eastcoast_pete - Wednesday, August 22, 2018

    Thanks Ian! One immediate takeaway: Interesting design, but why only 16 PCIe Gen3 lanes? That can really limit the A64fx's usefulness. Any comments from Fujitsu, and your thoughts?
  • TeXWiller - Wednesday, August 22, 2018

    My guess would be that the PCIe bus is mostly used for burst buffers (non-volatile fast storage) and system management (networking) and most of the magic happens through the Tofu in the very large system these processors are part of.
  • eastcoast_pete - Thursday, August 23, 2018

    Hi TeXWiller, that was my initial thought, too. But, why limit this chip "only" to supercomputer-type machines? They did show their new chip as being superior to their SPARC64 in their talk, which is why I found this confusing, and Fujitsu needs a follow-up to their Sparc64 line ASAP, unless they have given up on the server category. However, for server use, 16 Gen3 PCIe lanes won't cut it by a mile. The A64fx design could fit the bill, and the ability to run 512 bit SIMD extensions (vs. 256 bit for current gen Sparc64) would add a strong performance boost (absolute and perf/Wh) for some key server-run applications.
  • TeXWiller - Thursday, August 23, 2018

    They did the same thing with the K-computer: some of the HPC features that were useful for the server version were integrated into it. As to SPARC -> ARM transition, the post-K project has to show its performance and reliability, both in terms of hardware and software first. By that time the server version can be integrated with any next generation IO that is needed. That is my relatively humble opinion. ;)
  • SarahKerrigan - Saturday, August 25, 2018

    They compare it to SPARC64fx, which is also supercompute-only and highly limited on the peripheral side. SPARC64 XIfx also has exactly 16 PCIe 3 lanes, just like A64FX does.

    SPARC64 and SPARC64fx are not the same - SPARC64 are commercial systems running Solaris, with multithreading, SMP, and support for more than the 32GB per node of RAM present in the A64FX and SPARC64 XIfx. SPARC64 XIfx is SMP-less (multi-node is done via Tofu), has no multithreading, and has a significantly different cache configuration from SPARC64 XII (or X.)

    There's a new SPARC64 on the roadmap, although I'm skeptical there will be another one after, given the decline in the commercial UNIX market.
  • Santoval - Wednesday, August 22, 2018

    Are they actually going to make a supercomputer in 2021 with no PCIe 4.0 and HBM3 memory? By that time both will certainly be the norm (with the final spec of PCIe 5.0 already released), so it would be very strange.
  • SquarePeg - Wednesday, August 22, 2018

    08:35PM EDT - (A64FX doesn't mean Athlon 64, FX)

    My immediate thought when I read the title was "how the fook are they going to use that branding". Because that's what I was reminded of right away.
  • Nehemoth - Thursday, August 23, 2018

    I guess that the intended target for the product are very different from the one of AMD Athlon :)
