Hybrid CPUs: Sunny Cove and Tremont

Now that we’ve gone over the concept of the heterogeneous core design, it’s time to dig into each of the cores separately and some of the tradeoffs that Intel has had to do in order to get this to work.

Big Sunny Cove

As mentioned previously, the big core in Lakefield is known as Sunny Cove, and stands as the same core we currently see in Intel’s Ice Lake mobile processors today. It is officially Intel’s second 10nm-class core (the first one being the DOA Cannon Lake / Palm Cove), but the first one in mass production.

We have covered the Sunny Cove core microarchitecture in great detail, and you can read about it here:

Examining Intel's Ice Lake Processors: Taking a Bite of the Sunny Cove Microarchitecture

The quick recap is as follows.

Very similar to a Skylake design, except that:

  • Better prefetchers and branch predictors
  • +50% L1 Data Cache
  • +100% L1 Store Bandwidth
  • +100% L2 Cache w/improved L2 TLB
  • +50% Micro-op Cache
  • +25% uops/cycle into reorder buffer
  • +57% reorder buffer size
  • +25% execution ports
  • AVX-512 with VNNI

The side effect of increasing the L1 Data cache size was a decrease in latency, with the L1-D moving to a 5-cycle rather than a 4-cycle. Normally that would sound like a 25% automatic speed drop, however the increased L1 size, L1 bandwidth, and L2 cache all help for an overall improvement.

Intel claimed that Sunny Cove should perform ~18% better clock-for-clock compared to a Skylake core design. In our initial review of Ice Lake, we compared the i7-1065G7 processor (Ice Lake) to the Core i9-9900K processor (Coffee Lake, a Skylake derivative), and saw a 19% increase in performance per clock, essentially matching Intel’s advertised numbers.

(However it should be noted that overall we didn’t see that much of an improvement at the overall chip and product level, because the Ice Lake ran at a lower frequency, which removed any raw clock speed gain.)

Small Tremont Atom

Arguably the Tremont core is the more interesting of the two in the Lakefield design. Lakefield will be the first consumer product built with a Tremont core inside, and as a result we have not had a chance to test it yet. But we have gone over the microarchitecture extensively in a previous article.

Intel's new Atom Microarchitecture: The Tremont Core in Lakefield

The reason why Tremont is more exciting is because updates to Intel’s Atom line of processor cores happen at a much slower pace. Traditionally Atom has been a core that focuses on the low cost part of the market, so there isn’t that much of a need to make it right at the bleeding edge as it commands lower margins for the company. It still plays a vital role, but for context, here is what year we’ve seen new Atom designs come into the market:

  • 2008: Bonnell
  • 2011: Saltwell
  • 2013: Silvermont
  • 2015: Airmont
  • 2016: Goldmont
  • 2017: Goldmont Plus
  • 2020: Tremont

Tremont is the first new Atom microarchitecture design for three years, and technically only the third Atom design to be an out-of-order design. However, Tremont is a big jump in a lot of under-the-hood changes compared to Goldmont Plus.

  • Can be in a 1-core, 2-core, or 4-core cluster
  • +33% L1-Data Cache over Goldmont+, no performance penalty
  • Configurable L2 cache per cluster, from 1.5 MB to 4.5 MB
  • +50% L2 TLB (1024-entry, up from 512)
  • New 2x3-wide decoder, rather than single 3-wide decoder
  • +119% re-order buffer (208, up from 92)
  • 8 execution ports, 7 reservation stations
  • 3 ALUs, 2 AGUs
  • Dual 128-bit AES units
  • New Instructions*

What made the most noise is the new dual 3-wide decoder. On Intel’s primary Core line, we haven’t seen much change in the decoder in recent generations – it still uses a 5-wide decoder, split between 1 complex decoder and 4 simple decoders, backed with a micro-op cache. Tremont’s new dual 3-wide decoder can manage dual data streams in order to keep the buffers further down the core fed. Intel stated that for the design targets of Tremont, this was more area and power efficient than a 6-wide decoder, or having a large micro-op cache in the processor design (Atom cores have not have micro-op caches to date). Intel states that the decoder design helps shape the back-end of the core and the balance of resources.

Also worthy of note in Tremont is the L1-Data cache. Intel moved up from a 24 KiB design to a 32 KiB design, an increase of 33%. This is mostly due to using the latest manufacturing node. However, an increase in cache size is typically accompanied with an increase in latency – as we saw on Sunny Cove, we moved from a 4-cycle to a 5-cycle. However in Tremont’s case, the L1-Data cache stays at 3-cycle for an 8-way 32 KiB design. Even Skylake’s L1-D cache, at an 8-way 32 KiB design, is a 4-cycle, which means that Tremont’s L1-D is tuned to surpass even Skylake here.

The final point, Tremont’s new instructions, requires a section all on its own, specifically because none of the new instructions are supported in Lakefield.

What’s Missing in Lakefield

One of the biggest issues with a heterogeneous processor design is software. Even if we go beyond the issues that come with scheduling a workload on such a device, the problem is that most programs are designed to work on whatever microarchitecture they were written for. Generic programs are meant to work everywhere, while big publishers will write custom code for specific optimizations, such as if AVX-512 is detected, it will write AVX-512.

The hair-pulling out moment occurs when a processor has two different types of CPU core involved, and there is the potential for each of them to support different instructions or commands. Typically the scheduler makes no guarantee that software will run on any given core, so for example if you had some code written for AVX-512, it would happily run on an AVX-512 enabled core, but cause a critical fault on a core that doesn’t have AVX-512. The core won’t even know it’s an AVX-512 instruction until it comes time to decode it, and just throw an error when that happens. Not only this, but the scheduler has the right to move a thread when it needs to – if it moves a thread in the middle of an instruction stream, that can cause errors too. The processor could also move a thread to prevent thermal hotspots occurring, which will then cause a fault.

There could be a situation where the programmer can flag that their code has specific instructions. In a program with unique instructions, there’s very often a check that tries to detect support, in order to say to itself something like ‘AVX512 will work here!’. However, all modern software assumes a homogeneous processor – that all cores will support all of the same instructions.

It becomes a very chicken and egg problem, to a certain degree.

The only way out of this is that both processors in a hybrid CPU have to support the same instructions completely. This means that we end up with the worst of both worlds – only instructions supported by both can be enabled. This is the lowest common denominator of the two, and means that in Lakefield we lose support for AVX-512 on Sunny Cove, but also things like GFNI, ENCLV, and CLDEMOTE in Tremont (Tremont is actually rather progressive in its instruction support).

Knowing that Lakefield was going to have to take the lowest common denominator from the two core designs, Intel probably should physically removed the very bulky AVX-512 unit from the Sunny Cove core. Looking at the die shot, it's still there - there was some question going into the recent disclosures as to whether it would still be there, but Intel has stated on the record repeatedly that they removed it. The die shot of the compute silicon shows that not to be the case.

For x86 programmers doing instruction detection by code name or core family, this might have to change. In the smartphone world, where 4+4 processor designs are somewhat the norm, this lowest common denominator issue has essentially been universally adopted. There was some slight issue with a Samsung processor that had a non-unified cache setup, which ended up being rectified in firmware. But both sets of CPUs had to rely on lowest common denominator instructions.

Thermal Management on Stacked Silicon How To Treat a 1+4 Hybrid CPU
Comments Locked

221 Comments

View All Comments

  • ichaya - Sunday, July 19, 2020 - link

    SPEC is useful for some IPC comparisons, but it's questionable to use it for much else. PG bench in the phoronix link has a 50%+ speedup with SMT which is basically inline for perf/W/$ with Graviton 2 instance. The worst case is Casandra, but everything else is within ~5% for similar perf/$ if not comparable perf/W too since comparing TDP is workload dependent as well and not measured by most tests.

    XZ and Blender are ~45% faster with SMT in your openbenchmark link, but that's a 3900X (12-core/24-thread), so any comparisons to server chips (64-core Graviton 2) are unfair given power consumption and core differences. 4 times the L3 is also wrong, it's 50% more L2+L3 with half the cores and SMT if you're being fair between m6g.16xlarge or c6g.16xlarge and c5a.16xlarge.
  • Quantumz0d - Friday, July 3, 2020 - link

    Intel has lost it's edge. And this whole portable nonsense is reaching peaks of stupidity. Those Lakefield processor equipped machines will be close to $1000 for their thin and ultra light 1 USB C / 1 3.5mm audio jack, what a fucking disaster.

    I had owned one ultrabook which is Acer Aspire S3 and I used to even play DotA2 on that, and after 1-2 years the whole machine heated like crazy, I repasted, no dice, cleaned fans, nothing. And then battery also stopped holding a charge. Now what ? That stupid POS is dead, not even worth, meanwhile a Haswell machine with rPGA socket, and an MXM slot from 2013 and guess what ? the GPU got an upgrade to Pascal 1070 MXM from Kepler 860M.

    All these BGA trash machines will no longer hold charge nor have their serviceability, older ultrabooks atleast had a 2.5" drive, newer ones have NVMe SSDs, these 2 in 1 trash like most of the Surface lineup is almost impossible to even repair or service. And because of this thin and light market Windows 10 has been ruined as well to cater to this bs phenomenon and desktop class OS is hit with that ugly Mobile UX which lacks powerful software options, navigation and all. Plus you don't even get to repair it yourself due to non available servicing parts.

    With Apple HW same thing, full BGA not even NVMe SSDs, and now they also started to make their Mac OS look and feel like iOS trash. This whole mobile and ultra portable garbage is ruining everything, from gaming to the HW.
  • PandaBear - Monday, July 6, 2020 - link

    They don't want to cannibalize their highly profitable x86 business, so they have to give you crap for what you want if you want to pay less. The problem right now is other companies don't have to deal with this political monopoly BS and they are eating Intel for lunch.

    Most monopolies die this way: when their monopoly business is obsoleted and they hang on to it to milk the cow till it dies.
  • yeeeeman - Friday, July 3, 2020 - link

    Tigerlake should also be in the pipeline soon, right?
  • Deicidium369 - Saturday, July 4, 2020 - link

    Benchmarks showing it destroying AMD Renoir at single core, and within 17% on MT - despite half the cores...

    https://wccftech.com/intel-10nm-core-i7-1165g7-cpu...
  • watzupken - Sunday, July 5, 2020 - link

    "Benchmarks showing it destroying AMD Renoir at single core, and within 17% on MT - despite half the cores...

    https://wccftech.com/intel-10nm-core-i7-1165g7-cpu...

    Till we see the actual performance, you need to take these leaks with a lot of salt. The test bed are not revealed in leaks and it is not possible to ascertain if it is a realistic number. This we don't have to speculate for long since it should be out pretty soon.
  • pugster - Friday, July 3, 2020 - link

    Lakefield's 2.5w standby sounds kind of high. ARM cpu is probably much lower than that.
  • Ian Cutress - Monday, July 20, 2020 - link

    2.5 mW
  • ProDigit - Friday, July 3, 2020 - link

    Qualcomm has proven that a single fast core isn't enough. Intel needs to at least do 2 fast cores. Then add at least 6 atom cores.
    But if Intel wants to compete with AMD, it'll need to create a quad core big setup, with at least 10 to 12 atom cores.
    Any less will be too little. These are too little as is, competing against the 3000 series of AMD.

    It would be awesome, if Intel could make a 25W quad core cpu, paired with an additional 40 watts on atom cores. That's about 20 additional cores, or a 24 core cpu.
  • abufrejoval - Friday, July 3, 2020 - link

    A great article overall, very informative, deeply technical while still readable to a layman, very little judgement or marketing, allowing readers to form their own opinion: Anandtech at its very best!

    Not mentioned in the article and not covered by the comments so far is that the main driver behind Intel’s low power SoCs has been Apple: This is what Intel thought Apple would want and be happy with!

    And if you contrast it to what Apple will now do on their own, that makes me want to sell all my Intel shares: Good thing I never had any.

    This is another Intel 80432 or i860, tons of great ideas engineered into parts, but great parts don’t automatically make a convincing whole.

    And I simply don’t see them iterate that into many more designs over the next years at competitive prices: With that hot-spot governed layout between the two all the flexibility and cost savings a chiplet design is supposed to deliver goes away and you now have two chips in a very tight symbiosis with no scale-up design benefits.

    It’s a Foveros tech demo, but a super expensive one with very little chance of currying favors even at ‘negative revenues’ in the current market.

    X86 is not competitive in terms of Watts or transistors required for a given amount of compute. It didn’t matter that much in PCs, the competing servers were much worse for a long time, but in the mobile space, phones to ultrabooks, it seems impossible to match ARM, even if you could rewind the clock by ten years and started to take BIG-little seriously. Lakefield is essentially a case study for Core being too big and thus power hungry and Atom failing on performance.

    ISA legacy is still holding x86 from dying completely, but that matters less and less at both the top of the performance range with servers and at the bottom in mobile, where the Linux kernel rules supreme and many userlands and ISAs compile just fine.

    Gaming is a hold-out, but perhaps the last generation consoles on x86, gamer PCs alone too much of a niche to determine the future.

    The desktop will switch to who offers the bigger, longer lasting bang for the buck and there is a very good chance that will be ARM next.

    Microsoft may be allowed to blunder along with lackluster ARM64 support for a couple more days, but Apple’s switch puts them under long deserved pressure. A nice Linux/Android/Chromium hybrid ultrabook running whatever Office could get things moving quicker… at least I hope that, because I’d never want to be forced into the bitten Apple…. by these corporate decision makers I see already twitching.

    No chance I’d ever let a new Apple into my home: The ][ was the last good one they made.

Log in

Don't have an account? Sign up now