Answered by the Experts: ARM's Cortex A53 Lead Architect, Peter Greenhalgh

Name: Answered by the Experts: ARM's Cortex A53 Lead Architect, Peter Greenhalgh
Item: Answered by the Experts: ARM's Cortex A53 Lead Architect, Peter Greenhalgh
Author: Anand Lal Shimpi

by Anand Lal Shimpi on December 17, 2013 11:56 PM EST

Posted in
Ask the Experts

20 Comments | Add A Comment

20 Comments

Last week we held an awesome Ask the Experts Q&A with ARM's Peter Greenhalgh, lead architect for the Cortex A53. Peter did a great job answering questions in the comments, but for those of you who missed them we're compiling all of the Q&A here for you to go through.

On Friday, December 20th at 12:00PM ET, Peter will be joining me for a live Google Hangouts chat. We'll be posting more details on that later this week. For now, enjoy the responses!

Question from shodanshok

Hi, it would be interesting to know two thing:

- the cache memories (L1/L2) are write-back or write-through? Inclusive or exclusive?

- multiprocessor capabilities are limited to 4 cores or they can scale to 8+ cores without additional glue logic?

Thanks.

Answer

Hi Shodanshok,

All cacheable attributes are supported, but Cortex-A53 is optimised around write-back, write-allocate. The L2 cache is inclusive on the instruction side and exclusive on the data side.

A Cortex-A53 cluster only supports up to 4-cores. If more than 4-cores are required in a platform then multiple clusters can be implemented and coherently connected using an interconnect such as CCI-400. The reason for not scaling to 8-cores per cluster is that the L2 micro-architecture would need to either compromise energy-efficiency in the 1-4 core range to achieve performance in the 4-8 core range, or compromise performance in the 4-8 core range to maximise energy-efficiency in the 1-4 core range. This isn’t a hard and fast rule for all clusters, but is the case for a cluster at the Cortex-A53 power/performance point. For the majority of mobile use cases it is best to focus on energy efficiency and enable more than 4-cores through multi-cluster solutions.

Question from lukarak

We have seen MediaTek introducing an 8xA7 SOC, instead of going to the big.LITTLE configuration of some sorts. Do you expect the same thing to happen with the A53 and A57 generation for low budget SOCs or will this generation's combo be a little easier and cheaper to implement?

Answer

Hi Lukarak,

We expect to see a range of platform configurations using Cortex-A53. A 4+4 Cortex-A53 platform configuration is fully supported and a logical progression from a 4+4 Cortex-A7 platform. A Cortex-A57 in the volume smartphone markets is less likely, but that’s a decision in the hands of the ARM partners. It will be interesting to see the range of Cortex-A53 platforms and configurations announced by partners over the coming months.

Question from ehsan.nitol

Hi there, I have some questions.

We have already seen how well Qualcomm's Cortex A7 can perform thanks to Moto G. How much will it improve with the new Cortex A53? What will be the core and performance wise difference? How will you compare it against Cortex A9, A12 and A15 in terms of performance, battery consumption and all.

With the Exynos Octa core processor Battery Test we haven't seen much battery improvements compared to Qualcomm's Snapdragon 600 and 800 Processor. How will it perform this time?

What is ARM planning do with its Mali GPU? What will be next after Cortex A53 and A57?

Answer

Hi Ehsan,

Cortex-A53 has the same pipeline length as Cortex-A7 so I would expect to see similar frequencies when implemented on the same process geometry. Within the same pipeline length the design team focussed on increasing dual-issue, in-order performance as far as we possibly could. This involved symmetric dual-issue of most of the instruction set, more forwarding paths in the datapaths, reduced issue latency, larger & more associative TLB, vastly increased conditional and indirect branch prediction resources and expanded instruction and data prefetching. The result of all these changes is an increase in SPECInt-2000 performance from 0.35-SPEC/Mhz on Cortex-A7 to 0.50-SPEC/Mhz on Cortex-A53. This should provide a noticeable performance uplift on the next generation of smartphones using Cortex-A53.

Question from Techguy99X

Why are the current A7 quad core phones performing similar to the A9 quad (exynos 4412 , tegra 3), although A9 is more advanced and OoO? What is the main difference between A5 and A7, becuase the A7 is just a bit faster than the A5?

Answer

Hi Techguy99X,

Overall platform performance is dependent on many factors including processor, interconnect, memory controller, GPU, video and more. While the Cortex-A9 is a higher performance processor both in IPC and frequency, ARM partners are continuously improving their platforms and porting them to new process geometries. This allows a new generation Cortex-A7 based platform to improve on an older generation Cortex-A9 based platform.

Compared to Cortex-A5, Cortex-A7 increased load-store bandwidth, allowed more common data-processing operations to dual-issue and made some small improvements in the branch-predictors.

Question from barleyguy

When designing a processor and deciding on which performance attributes to emphasize, do you target current workloads for short term market concerns, or do you target possible future workloads for the market a year or two from now? Or is performance tuning more workload agnostic, and do you say "I want this to be fast for everything"?

For example, since ARM processors are very popular in the Android market, do you tune for content comsumption and gaming? Or since Android may be trending towards more of a primary computing device in the future, is it important to tune for desktop applications?

And finally, what are the considerations of performance tuning for thermally constrained devices?

Answer

Hi Barleyguy,

A good question. A general purpose processor has to be good on all workloads. After all, we expect to see Cortex-A53 in everything from mobile phones to cars, set-top-box to micro-servers and routers. However we do track the direction software is evolving – for example the increased use of Javascript which puts pressure on structures such as indirect predictors. Therefore we design structures within the processor to cope with future code as well as existing expected workloads.

Question from psychobriggsy

Do you have any plans to support various forms of turbo functionality within your next generation ARM cores? An a potential example, in a 28nm quad-core A53 setup at 1.2GHz, you could support dual-core at >1.4GHz and single core at >1.6GHz within the same power consumption (core design allowing, of course), yet single threaded performance could improve significantly.

ARM cores have been historically low power, however that doesn't mean there aren't more power savings to be made. Examples include deeper sleep states, power gated cores, and so on - features that Intel and AMD have had to include in order to reduce their TDPs whereas ARM cores haven't need them (yet). What are the future power saving methods that ARM is considering for its future cores (that you can give away)?

Answer

Hi Psychobriggsy,

A Turbo mode is typically a form of voltage overdrive for brief periods of time to maximise performance, which ARM partners have been implementing on mobile platforms for many years. Whether this is applied to 1,2 or more cores is a decision of the Operating System and the platform power management software. If there is only one dominant thread you can bet that mobile platforms will be using Turbo mode. Due to the power-efficiency of Cortex-A53 on a 28nm platform, all 4 cores can comfortably be executing at 1.4GHz in less than 750mW which is easily sustainable in a current smartphone platform even while the GPU is in operation.

In terms of further power saving techniques, power gating unused cores is a technique that has been used since the first Cortex-A9 platforms appeared on the market several years ago. The technique is so fundamental that I think many in the mobile industry use it automatically and forget that it’s a highly beneficial power saving technique. But you are correct that there is more milage to come from deeper sleep states which is why both Cortex-A53 and Cortex-A57 support state retention techniques in caches and the NEON unit to further improve leakage power.

Question from Xebec

Peter, thanks for offering your time to Anandtech!

I was curious if you could talk a bit about how easy/difficult A53-derived SoCs might be to integrate into solutions that are already using A7/A9 type chips? i.e. Devices like Beagleboards, Raspberry Pis, ODROIDs, etc. Is there anything that makes the A53 particularly difficult or easy to suit to these types of devices?

Also, for Micro and "regular" servers, do you see A57/A53 big.LITTLE being the norm, or do you anticipate a variety of A53-only and A57-only designs? Any predictions on market split between the A5x series here?

Respectfully,

John

Answer

Hi Xebec,

Cortex-A53 has been designed to be able to easily replace Cortex-A7. For example, Cortex-A7 supports the same bus-interface standards (and widths) as Cortex-A7 which allows a partner who has already built a Cortex-A7 platform to rapidly convert to Cortex-A53.

With servers I think we will see a mix of solutions. The most popular approach will be to use Cortex-A57 due to the performance that micro-architecture is capable of providing, but I still expect some Cortex-A53 servers and big.LITTLE too!

Question from Doormat

ARM CPU vendors (Qualcomm, Nvidia, etc) seem to be choosing slower quad core over faster dual core, and I'm suspecting its all a marketing game (e.g. more cores is better, see Motorola's X8 announcement of an "8 core" phone). Do those non-technical decisions impact the decisions of the engineers in developing the ARM architecture?

Answer

Hi Doormat,

You are quite correct that there are a variety of frequencies and core-counts being offered by ARM partners. However, for ARM design micro-architectures these do not have an effect on micro-architectures as we must be able to support a variety of target frequencies and core-counts across many different process geometries.

Question from Factory Factory

How does designing a CPU "by hand" differ from using an automated layout tool? What sort of trade-offs does/would using automated tools cause for ARM's cores?

Second question: With many chips from many manufacturers now implementing technologies like fine-grained power gating, extremely fine control of power and clock states, and efficient out-of-order execution pipelines, where does ARM go from here to keep its leadership in low-power compute IP?

Answer

Hi Factory,

Hand layout versus automated layout is an interesting trade-off. From one perspective, full hand-layout for all circuits in a processor is rarely used now. Aside from cache RAMs which are always custom, hand-layout is reserved for datapath and queues which are regular structures that allow a human to spot the regularity and ‘beat’ an automated approach. However, control logic is not amenable to hand-layout as it’s very difficult to beat automated tools which means that the control logic can end up setting the frequency of the processor without significant effort.

In general the benefit from hand-layout has been reducing in recent years. Partly this is due to the complexity of the design rules for advanced process generations reducing the scope for more specific circuit tuning techniques to be used. But another factor is the development of advanced standard cell libraries that have a large variety of cells and drive strengths which lessens the need for special circuit techniques. When we’re developing our processors we’re fortunate to have access to a large team in ARM designing standard cell libraries and RAMs who can advise us about upcoming nodes (for example 16nm and 10nm). In turn the processor teams can suggest & trial new advanced cells for the libraries which we call POPs (Processor Optimization Packages) that improve frequency, power and area.

A final trade-off to consider is process portability. After an ARM processor is licensed we see it on many different process geometries which is only possible because the designs are fully synthesizable. For example, there are Cortex-A7 implementations on all the major foundries from 65nm to 16nm. In combination with the advanced standard cell libraries for these processes there is little need to go to a hand-layout approach and we instead enable our partners to get to market more rapidly on the process geometry and foundry of their choosing.

More Q&A

PRINT THIS ARTICLE

Post Your Comment
Please log in or sign up to comment.

Comments Locked

20 Comments

View All Comments

Exophase - Wednesday, December 18, 2013 - link
Wonderful article. Thank you very much for your time and information.
syxbit - Wednesday, December 18, 2013 - link
Great answers!
It's too bad none of the questions about A15 losing so badly to Krait and Cyclone weren't brought up.
ciplogic - Wednesday, December 18, 2013 - link
It was about Cortex A53. Also, the answers were politically neutral (as they should), as the politics and other companies future development are not the engineer's talk. Maybe an engineer from Qualcomm could answer accurately.
lmcd - Wednesday, December 18, 2013 - link
Krait 200 is way worse than A15. If A15 revisions come in then A15 could easily keep pace with Krait. But idk if ARM does those.

Cyclone is an ARM Cortex-A57 competitor.
Wilco1 - Wednesday, December 18, 2013 - link
A15 has much better IPC than Krait (which is why in the S4 Krait needs 2.3GHz to get the similar performance as A15 at just 1.6GHz). The only reason Krait can keep up at all is because it uses 28nm HPM, which allows for much higher frequencies.
ddriver - Wednesday, December 18, 2013 - link
Really? The Note3 with krait is pretty much neck to neck with the exynos octa version at 1.9, which was a A15 design last time I checked.
Wilco1 - Wednesday, December 18, 2013 - link
Sorry, it was 1.6 vs 1.9GHz in the S4 and 1.9 vs 2.3GHz in the Note3. Both are pretty much matched on performance, so Krait needs ~20% higher clock.
cmikeh2 - Wednesday, December 18, 2013 - link
I don't know if those are normalized for actual power consumption, although we have to deal with different process technologies as well. Good IPC is pretty much meaningless in this segment if it requires ridiculous voltages to hit the frequencies it needs to.
Wilco1 - Thursday, December 19, 2013 - link
It does seem Samsung had some issues with its process indeed. NVidia was able to reach higher frequencies easily at low power. I haven't seen detailed power consumption comparisons between the 2 variants of S4 and N3 at load, but there certainly is a power cost to pushing your CPU to high frequencies (high voltages indeed!), so having better IPC helps.
twotwotwo - Wednesday, December 18, 2013 - link
Not even sure I'd count A15 out yet. I have a vague that impression power draw is part of why it didn't get more wins; if so, the next process gen might help with that. Folks on AT will have moved on to thinking about 64-bit chips by then, but as Peter put it there will be plenty of lower-end sockets left to fill.

Also, there was an A15 in the most popular ARM laptop yet (the Exynos in the Chromebook) so at least it got one really neat win. :)

Answered by the Experts: ARM's Cortex A53 Lead Architect, Peter Greenhalgh

Post Your Comment

20 Comments

View All Comments

Exophase - Wednesday, December 18, 2013 - link

syxbit - Wednesday, December 18, 2013 - link

ciplogic - Wednesday, December 18, 2013 - link

lmcd - Wednesday, December 18, 2013 - link

Wilco1 - Wednesday, December 18, 2013 - link

ddriver - Wednesday, December 18, 2013 - link

Wilco1 - Wednesday, December 18, 2013 - link

cmikeh2 - Wednesday, December 18, 2013 - link

Wilco1 - Thursday, December 19, 2013 - link

twotwotwo - Wednesday, December 18, 2013 - link

Log in

Don't have an account? Sign up now