Conclusion & End Remarks

It’s been a tumultuous and busy week, as we’ve only had the new Galaxy S21 Ultra in Snapdragon and Exynos variants for a few days now. That is still sufficient to come to a generally representative conclusion as to how Qualcomm’s and Samsung’s new generation flagship SoCs will play out in 2021 – and for the most part, it’s probably not what people were expecting.

Starting off with the most hyped up part of the new SoCs (mea culpa), both SoCs are the first to employ Arm’s newest Cortex-X1 cores, the first CPU generation in which Arm really went for a more “performance first” design philosophy. In general, the new CPU IP does live up to its claims; however, Arm’s and our own performance projections weren’t met by the new SoCs, as they didn’t quite reach the configurations and clock frequencies we had hoped for in 2021 designs. Neither Qualcomm nor Samsung invested in an 8MB L3 cache, and Samsung in particular didn’t even equip its X1 core with a full 1MB of L2 cache. This does seem to be noticeable in the performance, as the Snapdragon 888 has a small performance edge over the Exynos 2100. Samsung’s choice here seems rather puzzling given their years of wasting lots of silicon on humongous custom CPUs, but generally both vendors aren’t as aggressive as Apple is in investing die area into caches.

Qualcomm still has a clear memory subsystem advantage, as the company has made large strides in latency this generation, and this results in additional performance. The Exynos surprised us this year with a much larger system level cache – which, however, also seems to add to latency and reduce performance.

More worrisome for the Exynos is its weird clock behaviour, with the new chip really struggling in maintaining its peak frequencies other than for very brief moments – the Snapdragon 888’s X1 core had no such issues. My Exynos S21 Ultra chip bin was quite terrible here, but the better silicon on my second S21 doesn’t improve things too much either.

The Exynos 2100’s Cortex-A78 cores are clocked higher than the Snapdragon 888’s, and this shows up in performance. In every-day workloads, however, the DVFS of the Exynos actually behaves more similarly to the Snapdragon, as it generally scales these cores to 2600MHz and only uses their 2808MHz peak frequency in brief multi-threaded workloads – and only as long as thermals allow it, as even these middle cores can get quite power hungry this generation.

Although both are using the same IP on the same process node, the Exynos 2100’s CPUs just look to be more power hungry than the Snapdragon 888’s implementations. Given the apples-to-apples comparison, the only remaining explanation is a weaker physical design implementation on Samsung LSI’s part – which is actually a point of concern, as we had hoped Exynos SoCs would catch up this year following the ditching of their custom CPU cores. Make no mistake – the new X1 cores are massively improved in performance and efficiency over last year’s M5 cores, it’s just that Qualcomm shows that it can be done even better.

On the GPU side of things, this generation feels wrong to me, and that’s solely due to the peak power levels these new SoCs reach, and which vendors actually left enabled in commercial devices.

Qualcomm had advertised 35% improved GPU performance this generation with the Snapdragon 888, and that might indeed be valid for peak performance, but certainly for Samsung devices that figure is absolutely unreachable for any reasonable length of gaming session, as the power consumption is through the roof at over 8W. I don’t see how other vendors might be able to design phones whose thermal dissipation allows such power levels to actually be maintained without the phone’s skin temperature exceeding 50°C (122°F); it’s just utterly pointless in my opinion.

In terms of sustained performance, the Snapdragon 888 is generally a 10-15% improvement over the Snapdragon 865 and 865+ - at least in these Samsung devices whose thermal limits and thermal envelopes are similar this generation, attempting to target 42°C peak skin temperatures, although the phones failed to stay below that threshold during the initial few minutes of the performance burn.

On the Exynos 2100 side, Samsung’s +40% performance claim can be considered accurate just for the fact that it generally applies to both peak and sustained performance figures. At peak performance, the SoC is just as absurd at 8W load, which is impossible to maintain. The good news here though, is that when throttling down, the Exynos 2100 is notably better than the Exynos 990 – however that’s not sufficient to catch up to last year’s Snapdragon 865, much less the new Snapdragon 888.

Samsung’s 5LPE process appears to be lacking

We don’t have deeper technical insights as to how Samsung’s process node compares to TSMC’s nodes other than the actual performance of the chips we have in our hands, so I’m basing my arguments on the measured data that I’m seeing here.

At lower performance levels, we noted that the 5LPE node doesn’t look to be any different than TSMC’s N7P node, as the A55 cores in the Snapdragon 888 performed and used up exactly the same amount of power as in the Snapdragon 865. At higher performance levels however, we’re seeing regressions – the middle Cortex-A78 cores of the S888 should have been equal power, or at least similar, to the identically clocked A77 cores of the S865, however we’re seeing a 25% power increase this generation.

Similarly, in theory, the Exynos 2100 Cortex-A78 cores at 2.81GHz should have been somewhat similar in power to the 2.84GHz A77 cores of a Snapdragon 865, but it’s again at a 20-25% disadvantage in efficiency.
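
As a rough sanity check on how a power increase translates into an efficiency deficit, the arithmetic can be sketched as below. This is only an illustration, assuming efficiency is taken as performance per watt and that performance is held equal between the two chips; the function name `efficiency_change` is hypothetical and not part of our test tooling.

```python
def efficiency_change(power_increase: float, perf_increase: float = 0.0) -> float:
    """Relative change in performance-per-watt.

    Both arguments are relative changes, e.g. 0.25 means +25%.
    Assumption: efficiency is defined as performance divided by power.
    """
    return (1.0 + perf_increase) / (1.0 + power_increase) - 1.0


# A 25% power increase at identical performance (the assumed case).
print(f"{efficiency_change(0.25):+.0%}")  # prints "-20%"
```

Under these assumptions, a 25% power increase at equal performance works out to roughly 20% lower performance per watt, which is why power and efficiency percentages for the same comparison need not be identical.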

In fact, both SoCs on the CPU side don’t seem to be able to reach the Kirin 9000’s lower power levels and efficiency even though that chip is running at 3.1GHz – it’s clear to me that TSMC’s N5 node is quite superior in terms of power efficiency.

There are two conclusions here: For Samsung’s Exynos 2100 – it doesn’t really change the situation all that much. 5LPE does seem to be better than 7LPP, and the new chip is definitely more energy efficient than the Exynos 990 – although it does look like the new, much more aggressive behaviour of the CPUs, while benefiting performance, can have an impact on battery life. We need more time with the phones to get to a definitive conclusion in that regard.

For Qualcomm’s Snapdragon 888, the new chip’s manufacturing seems to be giving it headwinds. At best, we’re seeing flat energy efficiency, and at worst, we’re seeing generational regressions. This all depends on the operating point, but generally, the new chip seems to be slightly more power hungry than its predecessor – although again, performance has indeed improved. On the CPU side, the performance boost could be noticeable, but more problematic is the sustained GPU performance increase, which is still quite minor. It’s at this point where we have to talk about things other than CPU and GPU, such as Qualcomm’s new Hexagon accelerator, or new camera and ISP capabilities. We weren’t able to test the AI/NPUs today as the software frameworks on the S21 Ultra aren’t complete so it’s something we’ll have to revisit in the future. Looking at all these results, it suddenly makes sense as to why Qualcomm launched another bin/refresh of the Snapdragon 865 in the form of the Snapdragon 870.

Overall, this generation seems a bit lacklustre. Samsung LSI still has work ahead of them in improving fundamental aspects of the Exynos SoCs: maturing the CPU cluster integration with the memory subsystem and adopting AMD’s RDNA architecture GPU seem to be the two top items on the to-do list for the next generation, along with general power efficiency improvements. Qualcomm, while seemingly having executed things quite well this generation, seems to be limited by the process node. We can’t really blame them for this if they couldn’t get the required TSMC volume, but it also means we’re nowhere near closing the gap with Apple’s SoCs.

In general, I’m sure this year’s devices will be good – but one should have tempered expectations. We'll be following up with full device reviews of the Galaxy S21 Ultras as well as the smaller Galaxy S21 soon - so stay tuned.


121 Comments


  • Spunjji - Thursday, February 11, 2021 - link

    I'm not an expert by any means, but I think Samsung's biggest problem was always optimisation - they use lots of die area for computing resources but the memory interfaces aren't optimised well enough to feed the beast, and they kept trying to push clocks higher to compensate.

    The handy car analogy would be:
    Samsung - Dodge Viper. More cubes! More noise! More fuel! Grrr.
    Qualcomm / ARM - Honda Civic. Gets you there. Efficient and compact.
    Apple - Bugatti Veyron. Big engine, but well-engineered. Everything absolutely *sings*.
  • Shorty_ - Monday, February 15, 2021 - link

    You're right, but you also don't really touch on why Apple can do that and x86 designs can't. The issue is that uOP decoding on x86 is *awfully* slow and power-inefficient.

    This was explained to me as follows:

    Variable-length instructions are an utter nightmare to work with. I'll try to explain with regular words how a decoder handles variable length. Here's all the instructions coming in:

    x86: addmatrixdogchewspout
    ARM: dogcatputnetgotfin

    Now, ARM is fixed length (3-letters only), so if I'm decoding them, I just add a space between every 3 letters.
    ARM: dogcatputnetgotfin
    ARM decoded: dog cat put net got fin

    done. Now I can re-order them in a huge buffer, avoid dependencies, and fill my execution ports on the backend.

    x86 is variable length. This means I cannot reliably figure out where the spaces should go, so I have to try all of them and then throw out what doesn't work.
    Look at how much more work there is to do.

    x86: addmatrixdogchewspout
    reading frame 1 (n=3): addmatrixdogchewspout
    Partially decoded ops: add, , dog, , ,
    reading frame 2 (n=4): matrixchewspout
    Partially decoded ops: add, , dog, chew, ,
    reading frame 3 (n=5): matrixspout
    Partially decoded ops: add, , dog, chew, spout,
    reading frame 4 (n=6): matrix
    Partially decoded ops: add, matrix, dog, chew, spout,
    Fully Expanded Micro Ops: add, ma1, ma2, ma3, ma4, dog, ch1, ch2, ch3, sp1, sp2, sp3
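
The word analogy above can be sketched in a few lines of Python. This is only a toy illustration under stated assumptions, not a model of a real x86 or Arm decoder: the function names `decode_fixed` and `decode_variable` and the `vocabulary` set are hypothetical, and the variable-length routine simply tries each candidate length at the current position until a known word matches.

```python
def decode_fixed(stream: str, width: int = 3) -> list:
    """Fixed-length decode: every boundary is known up front,
    so each slice could be cut independently of the others."""
    return [stream[i:i + width] for i in range(0, len(stream), width)]


def decode_variable(stream: str, vocabulary: set) -> list:
    """Variable-length decode: the start of each word is only known
    once the length of the previous word has been determined."""
    ops = []
    pos = 0
    max_len = max(len(word) for word in vocabulary)
    while pos < len(stream):
        # Try every candidate length at this position, shortest first.
        for length in range(1, max_len + 1):
            candidate = stream[pos:pos + length]
            if candidate in vocabulary:
                ops.append(candidate)
                pos += length
                break
        else:
            raise ValueError(f"no known word at offset {pos}")
    return ops


# Hypothetical vocabulary taken from the words used in the analogy.
words = {"add", "matrix", "dog", "chew", "spout"}
print(decode_fixed("dogcatputnetgotfin"))
print(decode_variable("addmatrixdogchewspout", words))
```

The fixed-length version needs no lookups at all, while the variable-length version has to test several candidate lengths per word and cannot start on the next word until the current one is resolved, which is the extra work the analogy is pointing at.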

    This is why most x86 cores only have a 3-4 wide frontend. Those decoders are massive, and extremely energy intensive. They cost a decent bit of transistor budget and a lot of thermal budget even at idle. And they have to process all the different lengths and then unpack them, like I showed above with "regular" words. They have excellent throughput because they expand instructions into a ton of micro-ops... BUT that expansion is inconsistent, and hilariously inefficient.

    This is why x86/64 cores require SMT for the best overall throughput -- the timing differences create plenty of room for other stuff to be executed while waiting on large instructions to expand. And with this example... we only stepped up to 6-byte instructions. x86 is 1-15 bytes so imagine how much longer the example would have been.

    Apple doesn't bother with SMT on their ARM core design, and instead goes for a massive reorder buffer, and only presents a single logical core to the programmer, because their 8-wide design can efficiently unpack instructions, and fit them in a massive 630μop reorder buffer, and fill the backend easily achieving high occupancy, even at low clock speeds. Effectively, a reorder buffer, if it's big enough, is better than SMT, because SMT requires programmer awareness / programmer effort, and not everything is parallelizable.
  • Karim Braija - Saturday, February 20, 2021 - link

    I'm not sure the SPECint2006 benchmark is really reliable. On top of that, this benchmark has been around for a long time, and I don't think it is very reliable anymore now that we have these new, powerful processors. So I think it isn't very reliable and doesn't tell us anything precise. I don't think you should trust this benchmark 100%.
  • serendip - Monday, February 8, 2021 - link

    "Looking at all these results, it suddenly makes sense as to why Qualcomm launched another bin/refresh of the Snapdragon 865 in the form of the Snapdragon 870."

    So this means Qualcomm is hedging its bets by having two flagship chips on separate TSMC and Samsung processes? Hopefully the situation will improve once X1 cores get built on TSMC 5nm and there's more experience with integrating X1 + A78. All this also makes SD888 phones a bit pointless if you already have an SD865 device.
  • Bluetooth - Monday, February 8, 2021 - link

    Why would they skimp on the cache? Was the neural engine or something else with higher priority getting the silicon?
  • Kangal - Tuesday, February 9, 2021 - link

    I think Samsung was rushing, and it's usually easier to stamp out something that's smaller (cache takes a lot of silicon real estate). They rushed because of the switch from their M-cores to the X-core, and also because of internalising the 5G radio.

    Here's the weird part, I actually think this time their Mongoose Cores would be competitive. Unlike Andrei, I estimated the Cortex-X1 was going to be a load of crap, and seems I was right. Having node parity with Qualcomm, the immature implementation that is the X1, and the further refined Mongoose core... it would've meant they would be quite competitive (better/same/worse) but that's not saying much after looking at Apple.

    How do I figure?
    The Mongoose core was a Cortex A57 alternative which was competitive against Cortex A72 cores. So it started as midcore (Cortex A72) and evolved into a highcore implementation as early as 2019 with the S9 when they began to get really wide, really fast, really hot/thirsty. Those are great for a Large Tablet or Ultrabook, but not good properties for a smaller handheld.

    There was a precedent for this, in the overclocked QSD 845 SoCs, 855+, and the subpar QSD 865 implementation. Heck, it goes all the way back to 2016 when MediaTek was designing 2+4+4 core chipsets (and they failed miserably as you would imagine). I think when consumers buy these, companies send orders, fabs design them, etc... they always forget about the software. This is what separates Apple from Qualcomm, and Qualcomm from the rest. You can either brute-force your way to the top, or try to do things more cost/thermal efficiently.
  • Andrei Frumusanu - Tuesday, February 9, 2021 - link

    > Unlike Andrei, I estimated the Cortex-X1 was going to be a load of crap, and seems I was right.

    The X1 *is* great, and far better than Samsung's custom cores.
  • Kangal - Wednesday, February 10, 2021 - link

    First of all, apologies for sounding crass.
    Also, you're a professional in this field and I'm merely an enthusiast (aka Armchair Expert), so take what I say with a grain of salt. If you correct me, I stand corrected.

    Nevertheless, I'm very unimpressed by big cores: Mongoose M5, to a lesser extent the Cortex-X1, and to a much Much much lesser extent the Firestorm. I do not think the X1 is great. Remember, the "middle cores" still haven't hit their limits, so it makes little sense to go even thirstier/hotter. Even if the power and thermal issues weren't so dire with these big-cores, the performance difference between the middle cores and the big cores is negligible, and there are no applications that are optimised for, or demand, the big cores. Apple's big-core implementation is much more optimised, they're smarter about thermals, and the performance delta between it and the middle-cores is substantial, hence why their implementation works and why it compares favourably to the X1/M5.

    I can see a future for big-cores. Yet, I think it might involve killing the little-cores (A53/A55) and replacing them with general-purpose cores that will be almost as efficient yet able to perform much better, acting as middle-cores. Otherwise latency is always going to be an issue when shifting work from one core to another, then another. I suspect the Cortex-X2 will right many wrongs of the X1, and combined with a node jump, it should hopefully be a solid platform. Maybe similar to the 20nm-Cortex A57 versus the 16nm-Cortex A72 evolution we saw back in 2016. The vendors have little freedom when it comes to implementing the X1 cores, and I suspect things will ease up for X2, which could mean operating at reasonable levels.

    So even with the current (and future) drawbacks of big-cores, I think they could be a good addition for several reasons: application-specific optimisations, external dock. We might get a DeX implementation that's native to Android/AOSP, and combine that with an external dock that provides higher power delivery AND adequate active-cooling. I can see that as a boon for content creators and entertainment consumers alike. My eye is on emulation performance; perhaps this brute-force can help stabilise the weak Switch and PS2 emulation currently on Android (WiiU next?).
  • iphonebestgamephone - Monday, February 15, 2021 - link

    The improvements with the 888 in DamonPS2 and EggNS are quite good. Check some vids on YouTube.
  • Archer_Legend - Tuesday, February 9, 2021 - link

    Actually, Samsung still has M6 cores in its belly; the development team was shut down only after they completed the M6 cores.

    Difficult to say if they would have been better than an X1.

    However, it seems that Arm has rushed this whole A78 and X1 thing, and Samsung rushed to put too much stuff in the CPU with evidently not enough time to do it well.
