Compute Tile: New P (Redwood Cove) and E-cores (Crestmont)

The compute tile is the first client-focused tile to be built on the Intel 4 process technology. It houses the latest-generation P-cores and E-cores, both of which are based on updated architectures. The P-cores are officially called Redwood Cove, while the E-cores are Crestmont. Intel also claims that power efficiency is greatly improved over previous generations, helped by its 3D Foveros packaging and by offloading less performance-critical elements such as the SoC, media, and graphics onto other tiles. Intel also uses the same ring fabric to interconnect all the tiles to reduce power and latency penalties across the entirety of the chip.

One thing to note with the new core architectures, Redwood Cove (P-core) and Crestmont (E-core), is that Intel was very light on disclosing many of the finer details. While we got the general blurb of 'it's better than this and has better IPC performance than the last gen,' Intel has omitted details such as the L3 cache, whether there's an L4 cache through Intel's "Adamantine" hierarchy, and the decoder widths within the cores. As such, Intel hasn't provided enough details for us to do a full architectural deep dive of Redwood Cove or Crestmont, so this is more of a light look at what's new and how it's implemented.

Looking at some of the new changes to Meteor Lake, one of the most notable is the introduction of the new Redwood Cove P-core. This new P-core is a successor to the Golden Cove core found in the 12th Gen Core (Alder Lake) processors and, as expected, is designed to bring generational IPC gains. The Redwood Cove core also has increased bandwidth for both cache and memory. The performance monitoring unit has also been updated to enhance monitoring, and one of the standout features of the new P-core is the enhanced feedback provided to Intel's Thread Director, which aids in optimizing core performance and directing workloads to the right cores.

Another inclusion is the new Crestmont-based E-cores, which also benefit from generational IPC gains, and they keep the CPU-based AI acceleration through Vector Neural Network Instructions (VNNI) as seen on Raptor Lake (13th Gen) and Alder Lake (12th Gen). Intel claims improvements over the previous generations, although it hasn't provided anything to substantiate this.

However, Intel does state what it means by improvements: "Architectural simulation vs. Gracemont architecture across a broad set of workloads. VNNI improvements based on doubling the number of VNNI ports. Results may vary." This is a very roundabout way of saying Intel has doubled the number of AVX2 VNNI ports, but it hasn't given us any figures, and with Raptor Lake, not all SKUs had support for the VNNI instruction set. Intel hasn't told us whether this is now a feature of the Crestmont E-core itself or if it's, again, SKU-dependent.
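For readers unfamiliar with what a VNNI port actually executes, the following is a minimal Python sketch of the arithmetic behind one VNNI dot-product step (the non-saturating `VPDPBUSD` form): four unsigned 8-bit values are multiplied by four signed 8-bit values, and the sum of the products is added to a signed 32-bit accumulator. The function names are our own, and the sketch only models the math of the instruction. It says nothing about Crestmont's port count, throughput, or SKU support, none of which Intel has disclosed.

```python
def vpdpbusd_lane(acc, a_bytes, b_bytes):
    """Emulate one 32-bit lane of a VPDPBUSD-style dot-product step.

    acc:     signed 32-bit accumulator
    a_bytes: four unsigned 8-bit values (0..255)
    b_bytes: four signed 8-bit values (-128..127)
    """
    assert len(a_bytes) == 4 and len(b_bytes) == 4
    assert all(0 <= a <= 255 for a in a_bytes)
    assert all(-128 <= b <= 127 for b in b_bytes)
    total = acc + sum(a * b for a, b in zip(a_bytes, b_bytes))
    # Wrap to signed 32-bit, matching the non-saturating form.
    total &= 0xFFFFFFFF
    return total - (1 << 32) if total & 0x80000000 else total


def vpdpbusd_256(acc, a, b):
    """Apply the lane operation across the eight 32-bit lanes of a 256-bit register."""
    return [
        vpdpbusd_lane(acc[i], a[4 * i:4 * i + 4], b[4 * i:4 * i + 4])
        for i in range(8)
    ]


# One lane: 10 + (1*1 + 2*-1 + 3*2 + 4*-2)
print(vpdpbusd_lane(10, [1, 2, 3, 4], [1, -1, 2, -2]))
```

Because each instruction folds 32 multiplies and their additions into a single operation on 256-bit registers, the number of ports that can issue it has a direct bearing on int8 inference throughput, which is presumably why Intel singles out the port count in its footnote.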

VNNI support is designed to bolster the user experience when using AI applications and running AI-based workloads, although the NPU on the SoC tile is predominantly more suited for these. Like the P-cores, the E-cores also benefit from enhanced Thread Director feedback, which provides more granular control and optimization. Workloads that aren't as intensive can be offloaded onto the new Low Power Island E-cores, which are embedded into the SoC tile.

One area where Intel is promising major improvements and optimizations is Thread Director. The Alder Lake (12th Gen) and Raptor Lake (13th Gen) architectures introduced a nuanced approach to scheduling. Under Alder Lake and Raptor Lake, work was assigned a quality-of-service (QoS) ranking: higher-priority, more demanding work was allocated to the P-cores, while lower-ranked workloads were directed to the E-cores, primarily to save power.

In cooperation with Microsoft Windows, Intel is bringing new enhancements and refinements into play for Meteor Lake. Meteor Lake's SoC tile LP E-cores represent a third tier of service, and Thread Director will try to keep work there first. Then, if threads need faster performance, they can be moved to the compute tile, accessing the faster, full-power E-cores, and at the top, the P-cores. This gives the chip better overall workload distribution in terms of power efficiency. Moreover, Meteor Lake's strategy to periodically move highly demanding tasks to the P-cores that can benefit from the higher performance levels offers a dynamic approach to thread scheduling.
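To make the three-tier ordering concrete, here is a toy Python model of the escalation described above. It is a sketch under our own assumptions, not Intel's or Microsoft's algorithm: the `demand` score, the `TIER_CAPACITY` thresholds, and the `place` function are all hypothetical, and the real Thread Director relies on hardware telemetry that Intel hasn't detailed.

```python
from dataclasses import dataclass

# Tiers in the order the article describes: work starts on the SoC tile's
# LP E-cores and escalates only when it needs more performance.
TIERS = ["LP E-core (SoC tile)", "E-core (compute tile)", "P-core (compute tile)"]

# Hypothetical ceiling on how much demand each tier can serve.
# Illustrative numbers only, not Intel data.
TIER_CAPACITY = [0.3, 0.7, 1.0]


@dataclass
class Thread:
    name: str
    demand: float  # hypothetical normalized performance demand, 0.0 to 1.0


def place(thread):
    """Return the lowest tier whose assumed capacity covers the thread's demand."""
    for tier, capacity in zip(TIERS, TIER_CAPACITY):
        if thread.demand <= capacity:
            return tier
    return TIERS[-1]  # anything beyond the scale lands on the P-cores


for t in (Thread("mail sync", 0.1), Thread("browser", 0.5), Thread("render", 0.95)):
    print(f"{t.name}: {place(t)}")
```

The point of the sketch is the search order: the lowest-power tier is tried first, which is the reverse of simply sending important work to the P-cores by default.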

Overall, this is designed to improve Meteor Lake's power efficiency, giving it more versatility than Raptor Lake when it comes to task scheduling. The result is that, on paper, Meteor Lake is a more efficient platform than Raptor Lake, especially in scenarios requiring rapid adjustments to fluctuating workloads and with lighter workloads that can be offloaded onto the LP E-cores within the SoC tile.

Compute Tile: Intel 4 with EUV Lithography

The entirety of the compute tile, including the P-cores and E-cores, is built using the Intel 4 node, making it Intel's first client chip to use EUV lithography. Intel 4 is a key part of Intel's IDM 2.0 strategy, which aims to achieve parity by 2024 and process leadership by 2025. We have already covered the Intel 4 node in great detail in a separate piece, which can be found below:

Intel 4 Process Node In Detail: 2x Density Scaling, 20% Improved Performance

Intel 4 uses extreme ultraviolet (EUV) lithography, which simplifies manufacturing and improves both yield and area scaling. Not only is Intel 4 with EUV designed to scale for better power efficiency, but it's also the precursor to Intel's Intel 3 process, which is still being developed.

According to Intel's '5 nodes in 4 years' goal within the roadmap, Intel 3 is stated to be manufacturing-ready in H2 of 2023. What's interesting about the cadence of Intel 3 in the roadmap is that Intel 3 is design-compatible with Intel 4, and as such, Intel 3 is designed to be the long-lived node with EUV lithography.

One of the primary benefits of Intel 4 is its area scaling capabilities. The Intel 4 process offers 2X the area scaling for high-performance logic libraries compared with the previous Intel 7 process node, a process that was troublesome not only because of its exceedingly long development cycle but also because its yields were not the greatest. Being able to scale in such a way is vital for fitting more and more transistors on a chip, which should theoretically improve the overall performance and efficiency of the silicon. Intel 4 is also optimized for high-performance computing applications and supports both low-voltage (<0.65V) and high-voltage (>1.1V) operations. Intel claims that this flexibility results in more than 20% higher performance at iso-power over Intel 7, and the technology also incorporates high-density Metal-Insulator-Metal (MIM) capacitors, which Intel claims make power delivery to the chip superior.

With Intel 4 and EUV, Intel uses a 30 nm fin pitch with self-aligned quad patterning (SAQP) and a 50 nm tungsten gate pitch, scaled down by 0.83x from 54/60 nm on Intel 7. The M0 pitch is also down by 0.75x to 30 nm from 40 nm, and the HP library height has been reduced greatly from 408 nm on Intel 7 to 240 nm on Intel 4, a scaling of 0.59x. Moving from a 4-fin to a 3-fin allocation means that Meteor Lake on Intel 4 has tighter gate spacing than Intel 7.

One key new introduction with Intel 4 is the materials used, with Intel using what it calls 'Enhanced Copper'. Although Intel hasn't disclosed the specific percentage of the mixture, Enhanced Copper is essentially copper (Cu) clad with cobalt (Co) and is designed to eliminate high resistance and high volume barriers. The combined metallurgy of copper and cobalt is used on layers M0 to M4, while layers M5 to M15 are made from copper using different pitches ranging from 50 nm up to 280 nm.

Comparing Intel 4 to Intel 7

| Metric | Intel 4 | Intel 7 | Change |
|---|---|---|---|
| Fin Pitch | 30 nm | 34 nm | 0.88x |
| Contact Gate Poly Pitch (CPP) | 50 nm | 54/60 nm | 0.83x |
| Minimum Metal Pitch (M0) | 30 nm | 40 nm | 0.75x |
| HP Library Height | 240 nm | 408 nm | 0.59x |
| Area (Library Height x CPP) | 12K nm² | 24.4K nm² | 0.49x |
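The scaling factors in the comparison above are simple ratios of the quoted dimensions, and they can be reproduced with a few lines of Python. One assumption on our part: for Intel 7's 54/60 nm gate pitch we use the 60 nm figure, since that is the value that reproduces the 0.83x and 24.4K nm² entries.

```python
# Dimensions in nanometres, taken from the Intel 4 vs. Intel 7 comparison.
# Assumption: 60 nm is used for Intel 7's 54/60 nm contact gate poly pitch.
intel4 = {"fin_pitch": 30, "cpp": 50, "m0_pitch": 30, "hp_lib_height": 240}
intel7 = {"fin_pitch": 34, "cpp": 60, "m0_pitch": 40, "hp_lib_height": 408}


def scaling(key):
    """Linear scaling factor of Intel 4 relative to Intel 7 for one dimension."""
    return round(intel4[key] / intel7[key], 2)


def cell_area(node):
    """Library height multiplied by contact gate poly pitch, in nm^2."""
    return node["hp_lib_height"] * node["cpp"]


ratios = {key: scaling(key) for key in intel4}
area_ratio = round(cell_area(intel4) / cell_area(intel7), 2)

print(ratios)
print(cell_area(intel4), cell_area(intel7), area_ratio)
```

The area figure is where Intel's "2X" claim comes from: 240 nm x 50 nm gives 12,000 nm², against 408 nm x 60 nm, or 24,480 nm², on Intel 7, so a high-performance library cell shrinks to roughly half its previous footprint.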

Using extreme ultraviolet (EUV) lithography on Intel 4 represents a major advancement in semiconductor fabrication. Accomplished by using light with a wavelength of around 13.5 nanometers (generated by zapping tin with a laser, no less), EUV lithography significantly improves the photolithographic process, allowing for enhanced resolution and pattern fidelity. The technology requires specialized equipment, including high-precision optics and vacuum chambers, with a single EUV lithographic system costing around $150 million (as per Reuters).

When it comes to manufacturing chips, there are different levels of patterning, both single and multi-patterning. Using EUV allows Intel to reduce the number of masks and steps in the fabrication process, with up to 20% fewer masks on Intel 4 than Intel 7, by replacing multi-patterning steps with a single EUV layer. While each patterning level presents its own unique challenges, EUV allows a single pattern to be printed with just one exposure. This means that wafers can flow faster through the process and production can be increased. Opting for multi-patterning means more cost and higher variability. Another advantage of a single-pattern EUV process is that it reduces the number of defects within the silicon.

Despite the substantial capital and operational expenditures, the technology offers compelling advantages, such as a 20% reduction in mask count and a 5% decrease in process steps for Intel 4. These efficiencies contribute to superior area scaling and yield optimization and position EUV lithography as a cornerstone of Intel's processor roadmap as the company tries to achieve process leadership. It is also combined with advanced packaging technologies such as Embedded Multi-die Interconnect Bridge (EMIB) and Foveros 3D packaging, further cementing its role in semiconductor and chip manufacturing.


107 Comments

  • FWhitTrampoline - Wednesday, September 20, 2023 - link

    I'm more focused on the eGPU usage for OCuLink, so I'm not stating that TB4/USB4 connectivity does not have its usage model for your use case. But pure PCIe is lowest latency for eGPU usage and can be easily adopted by more OEMs than just GPD for their handhelds, as OCuLink will work with any maker's GPUs as long as one is using an OCuLink capable eGPU adapter or enclosure.

    And ETA Prime has extensively tested OCuLink adapters with plenty of Mini Desktop PCs and even the Steam Deck(M.2 slot is only PCIe 3.0 capable). It's the 64Gbs on any PCIe 4.0/x4 connection(M.2/NVMe or other) that's what good for eGPUs via OCuLink relative to the current bandwidth of TB4/USB4 40Gbs.
  • Exotica - Wednesday, September 20, 2023 - link

    I’ve seen those videos and the performance advantages for EGPUs. But most of the EGPUs in the market use alpine ridge. A chipset known to reserve bandwidth for DP and have less available for PCIe (22 Gbps). Perhaps there may be one or two based on Titan ridge with slightly more pcie bandwidth. It’s hard to say how barlow ridge will perform in terms of the amount of pcie bandwidth made available to peripherals. But a 64 Gbps pcie connection will not saturate the 80 Gbps link so hopefully we can have most of the available 64 Gbps pcie bandwidth. Another problem with occulink is that there’s no power delivery so you need to have a separate wire for power.

    So Barlow ridge TB5 has the potential to be a one cable solution, power upto 240W, pcie up to 64 Gbps, and it will also tunnel DisplayPort. Occulink is cool. But thunderbolt tunnels more capabilities over the wire.
  • FWhitTrampoline - Wednesday, September 20, 2023 - link

    OCuLink is lower latency as was stated in the earlier posts! And TB4/TB# or USB4/USB# will not be able to beat Pure PCIe connectivity for low latency and latency is the bigger factor for gaming workloads. TB tunneling protocol encapsulation of PCIe/Any other Protocol will add latency the result of having to do the extra encoding/encapsulation and decoding/de-encapsulation steps there and back whereas OCuLink is just unadulterated PCIe passed over an external cable.

    More Device makers need to be adding OCuLink capability to their systems as that's simple to do and requires no TB#/USB4-V# controller chip to be hung off of MB PCIe lanes, as the OCuLink port is just passing PCIe signals outside of the device. And TB5/USB4-V2 is more than 64Gbs but that will require more PCIe lanes be attached to the respective TB5/USB4-V2 controller and use more overhead to do that, whereas if one has the same number of PCIe lanes connected via OCuLink then that's always going to be lower overhead with more available/usable bandwidth and lower latency for OCuLink.

    Most likely the PCIe lane counts will remain at 4 lanes Max and that will just go from PCIe 4.0 to PCIe 5.0 instead to support TB5 and USB4-V2 bandwidth but whatever PCIe standard utilized OCuLink will always have lower overhead and lower latency than TB/Whatever or USB4/Whatever as with OCuLink that's skipping the extra tunneling protocol steps required.

    Plus by extension and with any OCuLink Ports being pure PCIe Protocol Based, that opens up the possibility of OCuLink to TB/USB/Whatever Adapters being utilized for maximum flexibility for other use cases as well.
  • Exotica - Wednesday, September 20, 2023 - link

    OCulink has merit for sure, but again, it is clunky. Unlike thunderbolt, it doesn't tunnel displayport or provide power delivery. It also doesn't support hotplugging. That is why it will most likely remain a niche offering. Also you're saying OcCulink is lower latency, but by how much? Where is the test data to prove that ?

    And does it really matter? Operating systems can be run directly off of thunderbolt NVME storage, the latency is low enough for a smooth experience. And even if OcCulink is technically faster, a GPU such as a 4080 or 4090 or 7900XTX in a PCIe4x4 or even PCIe5x4 eGPU thunderbolt 5 enclosure will be much faster than the iGPU or even internal graphics. And if the eGPU enclosure is thunderbolt enabled, it can power the laptop or host device and probably act as a dock and provide additional downstream thunderbolt ports and possibly USB as well. Thunderbolt provides flexibility that OcCulink does not. Both standards have merit.

    But I have a feeling Thunderbolt 5, if implemented properly in terms of bug-free firmware NVMs from Intel, will gain mass market appeal. The mass market is hungry for the additional bandwidth. AsMedia will probably do extremely well as well with its USB4 and upcoming USB4v2 offerings.
  • TheinsanegamerN - Thursday, September 21, 2023 - link

    Dont waste your time, Trampoline is an OCUlink shill who will ignore any criticism for his beloved zuckertech. The idea that most people dont want to disassemble a laptop to use a dock is totally alien to him.
  • FWhitTrampoline - Thursday, September 21, 2023 - link

    LOL, OCuLink's creator PCI-SIG is a not for profit Standards Organization that's responsible for the PCIe standards so it's not like they are any Business Interest with a Fiduciary responsibility to any investors.

    OCuLink is just a Port/electrical PCIe extension cabling standard that was in fact originally intended to be used in consumer products but Intel, a member of PCI-SIG along with other industry members, had a vested interest in that Intel/Apple co-developed Thunderbolt IP, because of TB controllers and sales of TB controllers related interests.

    And TB4/Later and USB4/Later will never have as low latency owing to the fact that any PCIe signalling will have to be intercepted and encapsulated by the TB/USB/Whatever protocol controller in order to be sent down the TB cabling whereas over the OCuLink ports/cabling that's just the PCIe signalling/packets there and no extra delays there related to any extra tunneling protocol encoding/encapsulation and decoding/de-encapsulating steps required.

    So OCuLink represents the maximum flexibility as that's the better lowest latency solution for eGPUs being just pure unadulterated PCIe signaling. And because it's just PCIe that opens up the possibility of all sorts of external adapters that take in PCIe and can convert that to Display Port/HDMI/USB/TB/Whatever the end users need because all Motherboard external I/O, for the most part, is in the form of PCIe and OCuLink just brings that PCIe directly out of devices via Ports/External cables.

    And to be so dogmatically opposed to OCuLink is the same as being opposed to PCIe! And does any rational person think that that's logical! OCuLink is External PCIe and that's all there is to that and it's the lowest latency method to interface with GPUs via any PCIe Slot or externally via an OCuLink connection(PCIe is PCIe).

    Give me a Laptop with at least One OCuLink PCIe X4/4.0 port and with that I can interface to an eGPU at 64Gbs bandwidth/lowest latency possible! And there can and will be adapters that can be plugged into that One OCulink port that can do what any other ports on the laptop can do because those ports are all just connected to some MB PCIe lanes in the first place.
  • Kevin G - Wednesday, September 20, 2023 - link

    The main advantage of the TB4 is that the form factor is USB-C which can be configured for various other IO. This is highly desirable in a portable form factor like laptops or tablets. Performance is 'good enough' for external GPU usage. OCuLink maybe faster but doesn't have the flexibility like TB4 over the USB-C connector does. OCuLink has its niche but a mainstream consumer IO solution is not one of them.
  • FWhitTrampoline - Thursday, September 21, 2023 - link

    OCuLink is just externally routed PCIe lanes and really there can be one OCuLink port on every laptop specifically for the best and lowest latency eGPU interfacing and even OCuLink to HDMI/Display Port/whatever adapters that can make the OCuLink port into any other port at the end users discretion. So for eGPUs/Enclosures that have OCuLink ports that's 64Gbs/Lowest latency there and for any Legacy TB4/USB only external eGPU devices just get an OCuLink to TB4/USB4 adapter in the interim and live with the lower bandwidth and higher latency.

    GPD already has a line of Handheld Gaming devices that utilize a dedicated OCuLink port and a portable eGPU that supports both OCUlink interfacing and TB4/USB4 interfacing. And I do hope that GPD Branches out into the regular laptop market as GPD's external portable eGPU works with other makers products and even products that have M.2/NVMe capable slots available via an M.2/NVMe to OCuLink adapter! LOL, only Vested Interests would Object to OCuLink in the consumer market space, specifically those Vested Interests with Business Models that do not like any competition.
  • TheinsanegamerN - Thursday, September 21, 2023 - link

    Because most people dont want to disassemble their laptop to plug in a m.2 adapter, you knucklehead.
  • FWhitTrampoline - Thursday, September 21, 2023 - link

    No one is forcing you to do that, and for others that's an option, albeit an inconvenient one. But really the adapters are not meant for Laptops in the first place, and even for Mini Desktop PCs it is not an easy task, but still more manageable than doing that with a laptop. It would just be better if there were more Mini Desktop PC OEMs/Laptop OEMs that would adopt an OCuLink PCIe 4.0/x4 Port for eGPU usage like GPD has done with their line of handheld gaming devices. And with mass adoption of OCuLink there could also be adapters as well to support all the other standards as OCuLink being PCIe based by extension will support that as well.
