The Exynos 990 SoC: Last of Custom CPUs

While we've received a lot of information on the Snapdragon 865 over the last few months due to Qualcomm’s openness and willingness to share details with the public, until now we’ve known almost nothing about the new Exynos 990. Samsung LSI’s newest flagship processor was announced way back in October, but we had to be patient and await commercial devices before we could get any concrete details on the chip’s makings. What we do know is that the new chip employs a new-generation M5 CPU microarchitecture, upgrades the middle cores to Cortex-A76 designs, and employs a new Mali-G77 GPU, all manufactured on a 7nm 7LPP process that uses EUV lithography.

An Exynos 9820 Retrospective


Exynos 9820 CPU Topology

Before we get into the Exynos 990 itself, I want to do a quick retrospective on last year's flagship Samsung SoC, the Exynos 9820, both to catch up on things we've learned since the Galaxy S10 launch, and to illustrate how the Exynos 990 has changed things.

The first thing to note on the Exynos 9820 is that Samsung’s custom CPU cores reside in a completely different cluster than Arm’s cores, with the two being interconnected and cache coherent only via Samsung’s Coherent Interconnect. My more recently written core-to-core latency test demonstrates this topology difference, as the latencies between CPUs in the different clusters are significantly higher than what we see between the cores within the Arm cluster, and higher than what we saw on the Snapdragon 865 on the previous page.

The second correction is that the M4 cores didn’t have just 512KB L2 caches, but rather 1MB. This wasn’t very visible in the latency tests due to issues with the microarchitecture, which we’ll revisit on a later page as well.

The weird cache behavior that we originally reported on in the bandwidth figures of the A75 cores last year ended up being a side-effect of a 2MB last-level cache on the SoC. This SLC acts the same as the 3MB SLC on the Snapdragon 865 and allows for efficient caching of various memory accesses of the SoC IP blocks, saving power for the system.

Enter the Exynos 990


Exynos 990 CPU Topology

The Exynos 990 differs from the Exynos 9820 in a few areas. First off, let’s focus on the Arm cluster. Here Samsung has finally equipped the small A55 cores with private 64KB L2 caches. These were notoriously missing from both the Exynos 9810 and Exynos 9820’s A55 cores, which led them to be less performant and seemingly less efficient than their counterparts on the Snapdragon SoCs. The 64KB L2 caches here are still only half of the 128KB that we find on the Snapdragon 865, so Samsung continues to be extremely conservative in the cache configuration of the Arm CPUs. The new small cores see a slight clock frequency upgrade, going up to 2GHz this time around.

The middle cores see an upgrade from Arm Cortex-A75s to Cortex-A76s, while also getting a frequency lift from 2.3GHz up to 2.5GHz. This is actually a massive performance boost of 38% to 50% depending on the workload, and these cores essentially serve as the Exynos 990’s workhorses for the vast majority of tasks. The L2 caches are still configured at 256KB per core, and the shared L3 of the Arm cluster remains at a more conservative 1MB.
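To get a feel for how much of that middle-core uplift comes from clock speed and how much from per-clock gains, here is a rough back-of-the-envelope sketch. It assumes performance scales as frequency times per-clock throughput, which ignores memory-bound behaviour, and the function name is purely illustrative.

```python
def per_clock_gain(total_uplift: float, old_ghz: float, new_ghz: float) -> float:
    """Return the per-clock (IPC-like) gain implied by a total uplift.

    Assumes performance = frequency x per-clock throughput, a simplification
    that ignores memory-bound behaviour.
    """
    frequency_gain = new_ghz / old_ghz
    return total_uplift / frequency_gain


# Figures from the text: 2.3GHz -> 2.5GHz, 38% to 50% faster overall.
low = per_clock_gain(1.38, 2.3, 2.5)
high = per_clock_gain(1.50, 2.3, 2.5)
print(f"frequency gain: {2.5 / 2.3 - 1:.1%}")
print(f"implied per-clock gain: {low - 1:.1%} to {high - 1:.1%}")
```

Under that assumption the clock bump accounts for just under 9%, leaving roughly 27% to 38% to the move from the A75 to the A76 microarchitecture.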

On the big core side, we see the evolution of the microarchitecture from the M4 cores, codenamed Cheetah, to the newer M5 cores, codenamed Lion. Whilst Samsung has kept the maximum clock frequencies unchanged at 2.73GHz, they did promise a 20% uplift, which should mostly come from IPC improvements.

The biggest externally observable change is the fact that these new cores no longer have private L2 caches for themselves, but rather now come with a shared L2 of 2MB. That’s actually quite the huge microarchitectural design change in an era where we’re used to designs actually introducing private L2 caches. The topology change can be evidenced by the drastic reduction in the core-to-core latencies between the two M5 cores compared to the M4 counterparts in the previous generation, as the coherency now happens at a lower cache level that's closer to the CPUs.

The Exynos 990 is manufactured on Samsung’s 7LPP node, which uses EUV lithography. It’s actually not the first chip on the process, as that title goes to the Exynos 9825 found in the Note10 series last year. However, if TechInsights’ reporting is accurate, it seems that the Exynos 990 is the first chip to be actually designed with the full 7LPP PDK rather than being just a relaxed conversion of the design to another process (the 9825 is functionally identical to the 9820, and it seems this also applies to its lithography implementation).

Samsung describes the 7LPP process as having 7% higher performance than its 8LPP node, which should also manifest itself as a power reduction of a design at otherwise equal frequency. Comparing the voltage curves of one of our S20 Exynos 990 units to the S10 unit last year, we see that there are some differences, but these are somewhat lackluster in the end. First of all it’s to be noted that the bins of our Exynos 990 units are seemingly bad this year, and I’ve seen that most units out there are in the same classification or even worse, pointing to the possibility of bad yields for the chip.

The A55 cores do clock slightly higher this generation, but at the peak frequencies the voltages still remain very high. At more medium frequencies we do however see reductions of up to around 43mV. The A76 cores can’t really be compared to the A75 cores of the previous generation due to their different microarchitectures, but here too we see the voltage curves being lower than on the 9820, even though the binning of our 990 units is considerably worse.

Finally, the M5’s core voltages are extremely disappointing. Not only are there no improvements at equal frequency to the M4 cores on 8nm, but there’s actually a degradation in the frequency scaling: the new Lion cores require higher voltages to reach the same frequencies. Peak voltages at 2.73GHz have gone up from 1068mV to 1118mV in our review sample units between the M4 and M5, meaning the new microarchitecture just scales worse in frequency. None of this bodes well for the power efficiency of the new design.
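As a rough illustration of why the higher peak voltage matters, the sketch below uses the common first-order approximation that dynamic power scales with C·V²·f. It assumes, purely for illustration, that the switched capacitance is the same for both cores; the M4 and M5 are different designs on different nodes, so this isolates the voltage term rather than predicting actual power.

```python
def dynamic_power_ratio(v_old_mv: float, v_new_mv: float,
                        f_old_ghz: float = 1.0, f_new_ghz: float = 1.0) -> float:
    """Ratio of new to old dynamic power under P ~ C * V^2 * f.

    Assumes identical switched capacitance C, which is an illustrative
    simplification and not something the measurements establish.
    """
    return (v_new_mv / v_old_mv) ** 2 * (f_new_ghz / f_old_ghz)


# Peak voltages from the text, both at 2.73GHz: 1068mV (M4) vs 1118mV (M5).
ratio = dynamic_power_ratio(1068, 1118)
print(f"voltage-only dynamic power increase: {ratio - 1:.1%}")
```

With frequency held equal, the 50mV increase on its own works out to a little under 10% more dynamic power in this simplified model.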

Samsung’s own scheduler and CPU characterization is very clear on the power and efficiency curves: throughout its performance scaling, the M5 cores are notably less efficient than the Cortex-A76 cores on the same SoC. We also note that the A55 data this year seemingly looks more realistic than what we’ve encountered on the Exynos 9820’s drivers last year.

The most striking difference in the power data from Samsung is the static leakage characteristics of the A76 and M5 cores. At an equal 1050mV voltage (2.5GHz on the A76, 2.6GHz on the M5), the Arm cores are characterized as leaking 78mW statically while the M5 cores use up 297mW. Static leakage roughly corresponds to the die area of the block: last year’s M4 cores were 3.72x larger than the A75 cores, and the static leakage difference here on the Exynos 990 is 3.8x, so I wouldn’t be surprised if this also ends up being the difference in area between the two CPU types.
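The leakage comparison above is simple arithmetic, and it can be checked in a few lines. The variable names are illustrative; the figures are the ones quoted in the text.

```python
# Static leakage figures from Samsung's power data at 1050mV (from the text).
a76_leakage_mw = 78
m5_leakage_mw = 297

leakage_ratio = m5_leakage_mw / a76_leakage_mw

# Last year's die-area ratio between the M4 and A75 cores (from the text).
m4_vs_a75_area_ratio = 3.72

print(f"M5 vs A76 leakage ratio: {leakage_ratio:.2f}x")
print(f"offset vs last year's area ratio: {leakage_ratio / m4_vs_a75_area_ratio - 1:+.1%}")
```

The leakage ratio lands within a few percent of last year's area ratio, which is what makes the area estimate plausible, though it remains an inference rather than a measurement.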

One odd mechanism that Samsung had introduced in the Exynos 9820 was a more complex scheduler that differentiated power models based on the running ISA of the application. It tracked 32 and 64-bit apps separately and made scheduler decisions based on the microarchitectural performance and power characteristics of the different CPUs on the different execution modes.

This is said to help power efficiency, mostly by scheduling things more often onto the Arm middle cores which seemingly have a better 32-bit execution efficiency.

I was curious and I tried this out on the Exynos 990, comparing the relative differences in performance and efficiency between the M5 cores and the A76 cores. In the aggregate figures of SPECint2006, I unfortunately didn’t see any big difference at all in the execution modes. However, individual subtests such as 456.hmmer, which is mostly execution bound, saw large advantages on the A76 cores, actually outperforming the M5 cores with a score of 13.53 vs 12.83 while using only half the energy. So in that regard, Samsung’s scheduling methodology makes a lot of sense. 400.perlbench was another case of the A76 cores outperforming the M5 cores in 32-bit mode, using less than half the power. However, any more memory intensive workloads heavily favored the M5 cores, probably due to the stark differences in cache sizes. While I’m sure Samsung’s ISA based scheduling model reduces power, I do have to wonder what the absolute performance impact of using this mechanism is.
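To relate the 456.hmmer score and energy figures to average power, here is a small sketch. It relies on the fact that, for the same workload, runtime is inversely proportional to the score and energy equals average power times runtime. "Half the energy" is taken as exactly 0.5, which is an approximation, and the function name is illustrative.

```python
def relative_average_power(score_a: float, score_b: float,
                           energy_ratio_a_to_b: float) -> float:
    """Average power of core A relative to core B on the same workload.

    Runtime is inversely proportional to the score, and
    energy = average power x runtime, so
    power_a / power_b = energy_ratio * (score_a / score_b).
    """
    return energy_ratio_a_to_b * (score_a / score_b)


# 456.hmmer in 32-bit mode, scores from the text: A76 13.53 vs M5 12.83.
# "Half the energy" is taken as exactly 0.5 here, which is an approximation.
speedup = 13.53 / 12.83
power_ratio = relative_average_power(13.53, 12.83, 0.5)
print(f"A76 speedup over M5: {speedup - 1:.1%}")
print(f"A76 average power relative to M5: {power_ratio:.2f}x")
```

Under those assumptions the A76 finishes about 5% sooner while drawing a bit more than half the M5's average power, which is the kind of gap a per-ISA power model would be expected to pick up on.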

Unrelated to the whole ISA scheduling mechanism, I think this is the first time we’ve ever published benchmark numbers on the differences between AArch32 and AArch64 execution modes. AArch64 performs significantly better due to it having more architectural registers available and being able to execute out-of-order code more efficiently, along with some ISA instruction improvements. Whilst there’s a power increase in this mode, we’re seeing much better efficiency as the performance improvements are greater. It’s also a good reason why the wider ecosystem is shifting to deprecate 32-bit on Arm.

It’s also to be noted that the M5 Lion core will be Samsung’s last commercial custom CPU design, as the design team had been disbanded back in October, and most employees by now have found new homes at different companies. I’ll be coming back to this decision in the context of the wider competitive landscape after we dissect the M5’s performance and efficiency.


135 Comments


  • id4andrei - Friday, April 3, 2020

    No need for me to praise this review any longer. Still, I must nitpick. The 3dmark GPU test always has caveats in your reviews. Drop it if you feel it is detected by OEMs or it's a false GPU test like the physics one.

    On web tests. I read on wiki that JetStream is an Apple made test, literally. Wouldn't you say that's a big caveat when testing against ios? Similarly Speedometer is developed by the webkit team at Apple. With Android webview based on Blink, not webkit, wouldn't Android smartphones be at a disadvantage against iphones? I don't see Kraken(Firefox) or Octane(Google) being used.

    Kraken would actually be neutral to both. Other 3rd party tests might be Testdrive(Microsoft) or Basemark.
  • Andrei Frumusanu - Friday, April 3, 2020

    I don't think that the fact that the WebKit team made those tests is a valid argument against using them. You can go and read the source JS yourself if you wish, and they're industry accepted benchmarks. Both Kraken and Octane are ancient and outdated and we dropped them just like we dropped SunSpider of the early days.
  • id4andrei - Friday, April 3, 2020

    Thank you for the prompt answer.
  • s.yu - Friday, April 3, 2020

    Thank you Andrei, again the most comprehensive and reliable set of samples anywhere!
    There seems to be considerable sample variation again (last time with Samsung was the main module since S9 with the variable aperture) in the UWA, S20+E and S20UE should have absolutely identical UWA performance but the S20UE seems to have far worse sagittal resolution than the S20+E, and Samsung's processing isn't that good in the first place, considering the 12MP 1.4μm could produce incredibly sharp pictures as that been the specs of the Pixels' main module for generations.
    I don't regret their switch to f/1.8 because the old module that went up to f/1.5 wasn't sharp wide open, especially in the corners, but a further two stops' variation to f/3.3 could be useful for more DoF in closeups provided inserting that physical aperture into the tiny module doesn't compromise the optical design otherwise.
    This time around the E seems to generally outperform the S, except in color as E doesn't seem to have proper color fidelity...almost as if chroma NR is set too high even in broad daylight, and the "hybridization" of the digital zoom, in which the E clearly uses a smaller portion from the periscope's readout than the S in the resulting merge. Speaking of the zoom, S20+ still performs slightly worse at 2x(16MP readout) than S10's native 12MP, though the difference is small and could be down to lens variation. Considering S10U's Z height, they could've easily fixed the S20U like Xiaomi, going 1/2.3" f/2 12MP with the 2x. Xiaomi used it despite a 4-1 bin, all the more reason to use it with a 9-1 bin. S20U's corner performance at 3x would also be much improved.
    Regarding the comparison with the Fuji though, I suspect your unit has trouble focusing to infinity correctly, because the train and forest samples show clear superiority of the Fuji's zoom. I especially recognize that kind of slight haziness as being very responsive to dehaze and low radius sharpening in LR and would result in far more detail with extraction in post. Also, with an ILC, there's always stopping down a little for more sharpness and more DoF.
    Regarding the full res modes, it's not worth storing 108MP of data with the CFA asking for a 9-1 bin, of course the 64MP would be better, without the RAW it's hard to say for sure, but the 64MP seems to be quad bayer.
  • s.yu - Friday, April 3, 2020

    I don't agree with your remark about the night comparison with Mate30P though, the UWA is not "UW" so it has better image quality, that's true, and the night mode of the Mate30P is far superior, that's also true, but not auto mode, nor any aspect of the telephoto as it's clearly using a crop of the main for 3x. Samsung does attempt to use the 4x for telephoto and although there's a significant issue of chroma noise, it's far sharper than Mate30P's crop, with at least twice 3 times the effective resolution in night mode. With S20U you could also crop out a single shot 3-4x of similar brightness to the Mate30P crop, but it's just a crop.
    As for the potential of P40P surpassing S20U, that model operates on a 9.4MP crop by default, interpolated to 12.5MP which clearly has consequences. In daylight it's often a regression compared to P30P (much less match Mate30P), and in night shots using the current firmware it has severe color issues of rendering large portions of the scene as a crimson red, so it's hard to say at this point too.
  • s.yu - Friday, April 3, 2020

    Oh, there's exception of the Mate30P auto mode in the last sample, but the night mode isn't constantly superior either.
  • RealBeast - Friday, April 3, 2020

    I've been looking forward to getting one of these, not sure which yet. The fly in the ointment now is that I won't see my Mom (who gets my old S9+) until the Fall due to the whole COVID problem, not to mention less income. That will weigh heavily on sales of what is otherwise an amazing looking phone for me.
  • 29a - Friday, April 3, 2020

    How large are the picture file sizes created by this thing?
  • BedfordTim - Sunday, April 5, 2020

    The same size as any other 12MP camera. They will depend on content, hdr, motion and compression options but I would expect about 36MB for a raw image and 8MB for a high quality jpeg.
  • toyeboy89 - Friday, April 3, 2020

    I'm really amazed in the fact that the iPhone XR is still beating snapdragon 865 in GFXBench in both peak and sustained performance. I am hoping the OnePlus 8 has better sustained performance.
