Instruction Sets: Alder Lake Dumps AVX-512 in a BIG Way

One of the big questions we should address here is how the P-cores and E-cores have been adapted to work inside a hybrid design. A critical aspect of any hybrid design is whether the two core types support the same level of instructions. It is possible to build a processor with unbalanced instruction support, but that requires hardware to trap unsupported instructions and migrate the thread to a capable core mid-execution. The simple way around this is to ensure that both types of cores have the same level of instruction support. This is what Intel has done in Alder Lake.
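As a rough illustration of the "same level of instruction support" approach, the exposed instruction set can be thought of as the intersection of what each core type supports natively. The sketch below is only a model: the feature lists are abbreviated and hypothetical, and the names `P_CORE_NATIVE`, `E_CORE_NATIVE` and `common_isa` are ours, not Intel's.

```python
# Hypothetical, abbreviated feature lists for illustration only;
# they are not a complete description of either core.
P_CORE_NATIVE = {"sse4_2", "avx", "avx2", "avx_vnni", "avx512f"}
E_CORE_NATIVE = {"sse4_2", "avx", "avx2", "avx_vnni"}


def common_isa(*core_feature_sets):
    """Return the features that every core type supports (set intersection)."""
    if not core_feature_sets:
        return set()
    return set.intersection(*map(set, core_feature_sets))


# What a hybrid chip could safely expose to software on every core.
exposed = common_isa(P_CORE_NATIVE, E_CORE_NATIVE)

# What the bigger core has to give up to keep the two core types balanced.
dropped = P_CORE_NATIVE - exposed

print(sorted(exposed))
print(sorted(dropped))
```

In this toy model the only feature that falls out of the intersection is the AVX-512 entry, which matches the trade-off described in this section.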

In order to get to this point, Intel had to cut down some of the features of its P-core, and improve some features on the E-core. The biggest cut is that Intel is dropping AVX-512 support inside Alder Lake. When we say dropping support, we mean that AVX-512 is going to be physically fused off, so even if you ran the processor with the E-cores disabled at boot time, AVX-512 would still be disabled.
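For software, the practical consequence is that AVX-512 code paths have to be selected at run time rather than assumed. A minimal sketch of that idea, assuming a Linux-style `/proc/cpuinfo` text where the `flags` line lists names such as `avx2` and `avx512f`; the function names and the sample text are made up for illustration:

```python
def cpu_flags(cpuinfo_text):
    """Collect flag names from the first 'flags' line of /proc/cpuinfo-style text."""
    for line in cpuinfo_text.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip() == "flags":
            return set(value.split())
    return set()


def pick_code_path(flags):
    """Choose the widest vector code path that the reported flags allow."""
    if "avx512f" in flags:
        return "avx512"
    if "avx2" in flags:
        return "avx2"
    return "scalar"


# Made-up sample text, not captured from real hardware.
SAMPLE = "processor\t: 0\nflags\t\t: fpu sse2 avx avx2 avx_vnni\n"

print(pick_code_path(cpu_flags(SAMPLE)))  # avx2
```

On a Linux machine the same functions could be fed the contents of `/proc/cpuinfo`. A chip with AVX-512 fused off would simply not report the flag, and the dispatcher would fall back to the AVX2 path.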

Intel’s journey with AVX-512 has been long and fragmented. Some workloads can be vectorised: when multiple pieces of consecutive data all require the same operation, you can pack them into a single register and process them all at once with a single instruction. Designed as Intel’s third generation of AVX vector instructions (AVX introduced 256-bit floating-point vectors, AVX2 extended them to integer operations, and AVX-512 doubles the width to 512 bits), AVX-512 was initially found on server processors, then mobile, and then in the previous generation of desktop processors. At the time, Intel stated that enabling AVX-512 across its processor line from top to bottom would encourage greater adoption, and the company was leaning hard into this message.
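To put rough numbers on what register width buys, here is a small model that counts how many packed instructions it takes to add two arrays at different register widths. It is a sketch of the principle only: `lanes` and `vector_add` are illustrative names, and real hardware throughput depends on far more than instruction count.

```python
def lanes(register_bits, element_bits):
    """Number of elements that fit into one vector register."""
    return register_bits // element_bits


def vector_add(a, b, register_bits=256, element_bits=32):
    """Add two equal-length lists register by register.

    Returns the result and the number of modelled packed-add instructions.
    """
    if len(a) != len(b):
        raise ValueError("inputs must have the same length")
    width = lanes(register_bits, element_bits)
    out, instructions = [], 0
    for i in range(0, len(a), width):
        # One packed instruction handles `width` elements at once.
        out.extend(x + y for x, y in zip(a[i:i + width], b[i:i + width]))
        instructions += 1
    return out, instructions


data = list(range(16))  # sixteen 32-bit values
for bits in (128, 256, 512):
    result, count = vector_add(data, data, register_bits=bits)
    print(bits, lanes(bits, 32), count)
```

For sixteen 32-bit values the model needs four instructions at 128 bits, two at 256 bits and one at 512 bits, which is the whole appeal of wider vectors for workloads that fit the pattern.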

But that all changes with Alder Lake. Both desktop and mobile processors will now have AVX-512 disabled in all scenarios. The silicon will still be physically present in the core, because Intel uses the same core in its next-generation server processors, called Sapphire Rapids. One could argue that the desktop cores would be a lot smaller if the AVX-512 unit were removed, however Intel has disagreed on this point in previous launches. What it means is that the consumer parts carry some extra dark silicon in the design, which ultimately might help thermals, or absorb defects.

But it does mean that AVX-512 is probably dead for consumers.

Intel isn’t even supporting AVX-512 through a dual-issue AVX2 mode that splits each operation in two - it simply won’t work on Alder Lake. If AMD’s Zen 4 processors support some form of AVX-512, as has been theorized, even as dual-issue AVX2 operations, we might be in some dystopian processor environment where AMD offers the only consumer processors on the market with AVX-512 support.
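The "dual-issue AVX2" idea amounts to executing one 512-bit operation as two 256-bit halves. The sketch below models that for a packed 32-bit integer add. It illustrates the concept only and does not describe any shipping or announced hardware; the function names are our own.

```python
LANE_BITS = 32
MASK = (1 << LANE_BITS) - 1  # keep each lane to 32 bits (wrap-around)


def add_256(a, b):
    """Model a 256-bit packed add: eight 32-bit lanes, wrapping on overflow."""
    if len(a) != 8 or len(b) != 8:
        raise ValueError("a 256-bit operand holds eight 32-bit lanes")
    return [(x + y) & MASK for x, y in zip(a, b)]


def add_512_as_two_256(a, b):
    """Model a 512-bit packed add issued as two 256-bit operations."""
    if len(a) != 16 or len(b) != 16:
        raise ValueError("a 512-bit operand holds sixteen 32-bit lanes")
    low = add_256(a[:8], b[:8])    # first 256-bit half
    high = add_256(a[8:], b[8:])   # second 256-bit half
    return low + high


print(add_512_as_two_256(list(range(16)), [10] * 16))
```

Software sees a single 512-bit result either way. The cost of the split is that each operation takes two passes through 256-bit hardware instead of one pass through a full-width unit.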

On the E-core side, Gracemont will be Intel’s first Atom processor to support AVX2. In testing with the previous generation Tremont Atom core, at 2.9 GHz it performed similarly to a Haswell 2.9 GHz Celeron processor, i.e. identical in non-AVX2 situations. By adding AVX2, plus fundamental performance increases, we’re told to expect ‘Skylake-like performance’ from the new E-cores. Intel also stated that both the P-core and E-core will be at ‘Haswell-level’ AVX2 support.

By enabling AVX2 on the E-cores, Intel is also integrating support for VNNI instructions for neural network calculations. In the past VNNI (and VNNI2) were built for AVX-512, however this time around Intel has done a version of AVX2-VNNI for both the P-core and E-core designs in Alder Lake. So while AVX-512 might be dead here, at least some of those AI acceleration features are carrying over, albeit in AVX2 form.
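For readers unfamiliar with VNNI, its core operation is a small integer dot product that accumulates into a 32-bit result, which is the inner loop of quantised neural network inference. The sketch below models one 32-bit lane of the non-saturating byte form (VPDPBUSD in Intel's naming) as we understand it: four unsigned 8-bit values are multiplied by four signed 8-bit values and the products are added to the accumulator. The helper names are ours, and this is a scalar reference model rather than anything resembling how the hardware is built.

```python
def to_i32(value):
    """Wrap a Python int to a signed 32-bit value."""
    value &= 0xFFFFFFFF
    return value - (1 << 32) if value & 0x80000000 else value


def dot4_u8_s8(acc, a_bytes, b_bytes):
    """One 32-bit lane: acc + sum of four (unsigned 8-bit * signed 8-bit) products."""
    if len(a_bytes) != 4 or len(b_bytes) != 4:
        raise ValueError("each lane consumes four byte pairs")
    if any(not 0 <= u <= 255 for u in a_bytes):
        raise ValueError("a_bytes must be unsigned 8-bit values")
    if any(not -128 <= s <= 127 for s in b_bytes):
        raise ValueError("b_bytes must be signed 8-bit values")
    total = acc + sum(u * s for u, s in zip(a_bytes, b_bytes))
    return to_i32(total)  # non-saturating: the result wraps to 32 bits


# 100 + (1*10 + 2*-20 + 3*30 + 255*-1) = 100 - 195
print(dot4_u8_s8(100, [1, 2, 3, 255], [10, -20, 30, -1]))  # -95
```

A 256-bit register holds eight such lanes, so one instruction in the AVX2 form covers 32 byte-pair multiplies, half of what the 512-bit version handles.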

For the data center versions of these big cores, Intel does have AVX-512 support and new features for matrix extensions, which we will cover in that section.


223 Comments


  • mode_13h - Saturday, August 21, 2021 - link

    > micro-servers Big-little seems much more useful, but Intel typically has gone
    > a long way to ensure that 'desktop' CPUs were not used for that.

    Huh? Their E-series Xeons are simply desktop CPUs with a few less features fused-off.
  • abufrejoval - Saturday, August 21, 2021 - link

    We all know that that's what they are technically. But that didn't keep Intel from selling them, and the required chipsets, which had the same magical snake oil, at a heavy markup, before AMD came along and offered ECC and some RAS for free.

    And that is going to come back, as soon as Intel sees a chance to make an extra buck.
  • mode_13h - Sunday, August 22, 2021 - link

    > that didn't keep Intel from selling them, and the required chipsets, ... at a heavy markup

    Except for maybe the top-end models, I tended to observe E-series (previously E3-series) selling for similar prices as the desktop equivalents. However, workstation motherboards generally have commanded a higher price.
  • mode_13h - Saturday, August 21, 2021 - link

    > given an equal price choice, I cannot imagine preferring the use of AVX-512 for
    > dark silicon and two P-core tiles for eight E-cores over a fully enabled ten P-core chip.

    Aside from the AVX-512 part, the math is quite easy. If you just take what they showed in the Gracemont vs. Skylake comparison, it's clear that 8 E-cores is going to provide more performance than 2 more P-cores. And anything well-threaded enough to fully-load 10 P-cores should probably scale well to at least 16 (or 24) threads.

    As for the AVX-512 part, its absence is irrelevant if your workload doesn't utilize it, as most don't. Ryzen 5000 has been very competitive without it. I'm sure folks at Intel were keen to cite that.

    > And I'd believe that most 'desktop' users would prefer the same.

    I don't love the E-cores, in a desktop, but that's more out of apprehension about how well-scheduled they'll be. If the scheduling is good, then I'm fine with having them instead of 2 more P-cores.
  • Spunjji - Tuesday, August 24, 2021 - link

    "If the scheduling is good, then I'm fine with having them instead of 2 more P-cores"
    It's all going to come down to this. Lakefield wasn't great in that regard; presumably anybody running Windows 10 on ADL will get a slightly more refined version of that experience. Hopefully the Windows 11 + Thread Director combo will be what's needed!
  • Timur Born - Friday, August 20, 2021 - link

    My current experience is that anything based on older Lua versions (like 5.1) does not seem to benefit from IPC gains at all, only clock-rate matters.
  • abufrejoval - Saturday, August 21, 2021 - link

    That's interesting.

    If IPC gains were "uniform", that should not happen, which then means they aren't uniform enough for your workloads.

    But a bit more data would help... especially if a newer version of Lua doesn't show this behavior?
  • mode_13h - Sunday, August 22, 2021 - link

    I've never used it, but it seems to be dynamically-typed and table-based. So, I'd assume it's doing lots of hashtable lookups, which seem harder for a CPU to optimize. Maybe newer versions have some optimizations to reduce the frequency of table lookups, which would also be more OoO-friendly.
  • TristanSDX - Friday, August 20, 2021 - link

    As for the disabled AVX-512, I suspect they found a last-minute bug in the P-cores. ADL is in mass production now and the release can't be postponed, and not many apps use it currently, so they disabled it completely. For Sapphire Rapids AVX-512 is mandatory, which is why they delayed it half a year, from Q4'21 to Q2'22. An HPC product without AVX-512, which a lot of HPC software uses, is just a brick.
  • mode_13h - Saturday, August 21, 2021 - link

    That doesn't explain the E-core situation, though. As the article explains, enabling it on only the P-cores would create a real headache for the OS' thread scheduler.

    Plus, a lot of multi-threaded software naively spawns one worker thread per hardware thread, so you could end up with a situation where 24 software threads are fighting for execution time on 16 hardware threads, leading to more context switches and higher software latencies.

    I'm just saying that the stated explanation of disabling it because it's lacking in the E-cores is a suitable reason.

    As for Sapphire Rapids' delays, it's not hard to imagine they're having yield problems with such big chips on their new "Intel 7" process. Also, they're behind schedule for the software support for it, with AMX still being in really rough shape.
