Instruction Sets: Alder Lake Dumps AVX-512 in a BIG Way

One of the big questions we should address here is how the P-cores and E-cores have been adapted to work inside a hybrid design. One of the critical aspects in a hybrid design is if both cores support different levels of instructions. It is possible to build a processor with an unbalanced instruction support, however that requires hardware to trap unsupported instructions and do core migration mid-execution. The simple way to get around this is to ensure that both types of cores have the same level of instruction support. This is what Intel has done in Alder Lake.

In order to get to this point, Intel had to cut down some of the features of its P-core, and improve some features on the E-core. The biggest thing that gets the cut is that Intel is losing AVX-512 support inside Alder Lake. When we say losing support, we mean that the AVX-512 is going to be physically fused off, so even if you ran the processor with the E-cores disabled at boot time, AVX-512 is still disabled.

Intel’s journey with AVX-512 has been long and fragmented. Some workloads can be vectorised – multiple bits of consecutive data all require the same operation, so you can pack them into a single register and perform it all at once with a single instruction. Designed as its third generation of vector instructions (AVX is 128-bit, AVX2 is 256-bit, AVX512 is 512-bit), AVX-512 was initially found on server processors, then mobile, and we found it in the previous version of desktop processors. At the time, Intel stated that by enabling AVX-512 on its processor line from top to bottom, it would encourage greater adoption, and they were leaning hard into this missive.

But that all changes with Alder Lake. Both desktop processors and mobile processors will now have AVX-512 disabled in all scenarios. But the silicon will still be physically present in the core, only because Intel uses the same core in its next generation server processors called Sapphire Rapids. One could argue that if the AVX-512 unit was removed from the desktop cores that they would be a lot smaller, however Intel has disagreed on this point in previous launches. What it means is that for the consumer parts we have some extra dark silicon in the design, which ultimately might help thermals, or absorb defects.

But it does mean that AVX-512 is probably dead for consumers.

Intel isn’t even supporting AVX-512 with a dual-issue AVX2 mode over multiple operations - it simply won’t work on Alder Lake. If AMD’s Zen 4 processors plan to support some form of AVX-512 as has been theorized, even as dual-issue AVX2 operations, we might be in some dystopian processor environment where AMD is the only consumer processor on the market to support AVX-512.

On the E-core side, Gracemont will be Intel’s first Atom processor to support AVX2. In testing with the previous generation Tremont Atom core, at 2.9 GHz it performed similarly to a Haswell 2.9 GHz Celeron processor, i.e. identical in non-AVX2 situations. By adding AVX2, plus fundamental performance increases, we’re told to expect ‘Skylake-like performance’ from the new E-cores. Intel also stated that both the P-core and E-core will be at ‘Haswell-level’ AVX2 support.

By enabling AVX2  on the E-cores, Intel is also integrating support for VNNI instructions for neural network calculations. In the past VNNI (and VNNI2) were built for AVX-512, however this time around Intel has done a version of AVX2-VNNI for both the P-core and E-core designs in Alder Lake. So while AVX-512 might be dead here, at least some of those AI acceleration features are carrying over, albeit in AVX2 form.

For the data center versions of these big cores, Intel does have AVX-512 support and new features for matrix extensions, which we will cover in that section.

Gracemont Microarchitecture (E-Core) Examined Conclusions: Through The Cores and The Atoms
Comments Locked

223 Comments

View All Comments

  • zamroni - Friday, August 20, 2021 - link

    That low power cores for desktop is waste of transistors.
    They are better to be used for more caches or more performance cores
  • mode_13h - Friday, August 20, 2021 - link

    This is what I thought, until I realized that they have better perf/area than the big cores. Not to mention perf/W.

    So, in highly-threaded workloads, their 8+8 core configuration should out-perform 10 cores of Golden Cove. And, when thermally-limited, the little cores will also more than pull their weight.

    It's an interesting experiment they're trying. I'm interested in seeing how it plays out, in the real world.
  • nevcairiel - Friday, August 20, 2021 - link

    > Designed as its third generation of vector instructions (AVX is 128-bit, AVX2 is 256-bit, AVX512 is 512-bit)

    SSE is 128-bit. AVX is 256-bit FP, AVX2 is 256-bit INT.
    And MMX was 64-bit before that. So doesn't this make it the 4th generation, assuming you don't count all the SSE versions separately? (The big ones were SSE1 with 128-bit FP, and SSE2 with 128-bit INT, SSE3/SSSE3/SSE4.1 are only minor extensions)
  • mode_13h - Saturday, August 21, 2021 - link

    Yeah, I came to the same conclusion. It's the 4th major family of vector instructions. Or, another way to clearly demarcate it would be the 4th vector width.
  • abufrejoval - Friday, August 20, 2021 - link

    I wonder how many side channel attacks the power director will enable.

    Also wonder if the lack of details is due to Intel stepping awfully close to some of Apple's patents.

    The battles between the Big little and AVX-512 teams inside Intel must have been epic: I imagine frothing red faces all around...
  • mode_13h - Saturday, August 21, 2021 - link

    > The battles between the Big little and AVX-512 teams inside Intel must have been epic

    : )

    Although, the AVX-512 folks have some egg on their faces from a problematic implementation in Skylake-SP and its derivatives.
  • abufrejoval - Friday, August 20, 2021 - link

    Does Big-little make any sense on a "desktop"?

    And then: Are there actually still any desktops around?

    All around my corporate workplaces, notebooks have become the de-facto desktop for many depreciation cycles, mostly because personal offices got replaced by open space and home-office days became a regular thing far before the pandemic. Since then even 'workstations' just became bigger notebooks.

    Anywhere else I look it's becoming hard to detect desktops, even for big-screen & multi-monitor setups, it's mostly NUCs or in-screen devices these days.

    Those latter machines rarely seem to get turned off any more and I guess many corporate laptops will remain 'turned on' (= stay in some sort of slumber) most of the time, too, so there Big-little overall power consumption might drop vs. Big-only, when both no longer sleep deeply.

    Supposedly that makes all these voice commands possible, but try as I might, I can see no IT admin turning that on in an office, nor would I want that in my living room.

    The only place I still see 'desktops' are really gamer machines and for those it's hard to see how those small cores might have any significant energy budget impact, even while they are used for ordinary 2D stuff.

    For micro-servers Big-little seems much more useful, but Intel typically has gone a long way to ensure that 'desktop' CPUs were not used for that.

    Intel's desire for market differentiation seems the major factor behind this and many other features since MMX, but given an equal price choice, I cannot imagine preferring the use of AVX-512 for dark silicon and two P-core tiles for eight E-cores over a fully enabled ten P-core chip.

    And I'd belive that most 'desktop' users would prefer the same.
  • mode_13h - Saturday, August 21, 2021 - link

    > The only place I still see 'desktops' are really gamer machines

    We still use traditional desktops for software development and VMs for testing. Our software takes long enough to build and the test environment needs to boot a full image. So, a proper desktop isn't hard to justify.
  • abufrejoval - Saturday, August 21, 2021 - link

    Our developers are encouraged to use build servers and the automatic testing pipelines behind them. Those run on machines with hundreds of GB of RAM and dozens of CPU cores, where loads get distribued via the framework. The QA tests will use containers or VMs as required, which are built and torn down to match by the pipeline. With thousands of developers in the company, that tends to give both better performance to any developer and much better economy to the company, while (home-)offices stay cool and quiet. We still give them laptops with what used to be "desktop" specs (32GB RAM, i7 quads), because, well they're cheap enough, and it allows them to play with VMs locally, even offline, should they want to e.g. for education/self-study.

    These days when you're running a build farm on your "desktop", that may really more of a workstation. It may be the "economy" model, which means from a price point it's what used to be a desktop, in my home-lab case a Ryzen 7 5800X 8-core with an RTX 2080ti and 128GB ECC RAM that runs whisper quiet even at full load. It would have been a 16-Core 5950X today, but when I built it, those were impossible to get. It's still an easy upgrade and would get you 16 "P-cores" on the cheap. It's also pretty much a gamer's rig, which is why I also use it after hours.

    My other home-lab workstation is what used to be a "real workstation" some years ago, an 18-core Haswell E5-2696 v3, which has exactly the same performance as the Ryzen 7 5800X on threaded jobs, even uses the same 110 Watts of power, but much lower clocks (2.7 vs. 4.4 GHz all-cores). Also 128GB of ECC RAM and thankfully just as quiet. It's not so great at gaming, because it only clocks to 4 GHz for single/dual core loads with Haswell IPC and I've yet to find a game that's capable of using 18-cores for profit to balance that out.

    Today you would use a Threadripper in that ballpark, with an easy 64 "P-Cores" and matching RAM, pretty much the same computing capacity as a mid-range server, but much quieter and typically tolerable in a desktop/office setup.

    If threaded software builds were all you do, you'd want to use 64 E-Cores on the "economy" variant and 256 E-Cores on the "premium", much like Ian hinted, because as long as you can fully load those 256 cores for your builds, they would be faster overall. But the chances for that happening are vastly bigger on a shared server than on a dedicated desktop, which is why we see all these ARM servers going for extra cores at the price of max single threaded performance.

    As a thought experiment imagine a machine where tiles can be switched dynamically between being a single P-core or four E-cores. For embarrassingly parallel workloads, the E-Cores would give you both better Watt economy (if you can maintain full load or your idle power consumption is perfect) and faster finish times. But as soon as your workload doesn't go beyond the number of P-cores you can configure, finishing times will be better on P-cores, while power effiency very much gets lost in idle power demands.

    The only way to get that re-configurability is to use shared servers, cloud or DC, while a fixed allocation of P vs E cores on a desktop has a much harder time to match your workload.

    I can tell you that I much prefer working on the 5800X workstation these days, even if it's no faster for the builds. Beause it's twice as fast on all those scalar workloads. And no matter how much most stuff tries to go wide and thready, Amdahl's law still holds true and that where P-Cores help.
  • mode_13h - Sunday, August 22, 2021 - link

    > Our developers are encouraged to use build servers

    We use VM servers, but they're old and the VMs are spec'd worse than desktops. So, there's no real incentive to use them for building. And if you're building on a desktop in your home, then testing on a server-based VM means copying the image over the VPN. So, almost nobody does that, either.

    VM servers are a nice idea, but companies often balk at the price tag. New desktops every 4-5 years is an easier pill to swallow, especially because upgrades are staggered.

    > I much prefer working on the 5800X workstation these days,
    > even if it's no faster for the builds. Beause it's twice as fast on all those scalar workloads.

    Exactly. Most incremental compilation involves relatively few files. I do plenty of other sequential tasks, as well.

Log in

Don't have an account? Sign up now