SPECing Denver's Performance

Finally, before diving into our look at Denver in the real world on the Nexus 9, let's go over a few performance considerations.

With so much of Denver's performance riding on the DCO, we'll start there, with a slide from NVIDIA profiling the execution of SPECint2000 on Denver. In it NVIDIA showcases how much time Denver spends on each type of code execution – ARM code run through the hardware decoder, the optimizer itself, and finally optimized native code – along with an idea of the IPC achieved over the course of the benchmark.

What we find is that, as expected, it takes a bit of time for Denver's DCO to kick in and produce optimized native code. At the start of the benchmark, with little optimized code to work with, Denver executes ARM code via its ARM decoder while it looks for recurring code. Once that recurring code is found the DCO takes over – consuming CPU time itself – as it begins replacing recurring code segments with optimized, native code.

In this case the amount of CPU time spent on the DCO never amounts to a large percentage of the total, though in NVIDIA's example the DCO noticeably keeps running for quite some time before it finally settles down to an imperceptible fraction. Initially a much larger fraction of the time is spent executing ARM code, owing to the time it takes for the optimizer to find recurring code and optimize it. Similarly, another spike in ARM code execution occurs roughly mid-run, when Denver encounters new code segments that it must execute as ARM code before optimizing them and replacing them with native code.

Meanwhile there's a clear hit to IPC whenever Denver is executing ARM code, with Denver's IPC dropping below 1.0 whenever it's executing large amounts of such code. This, in a nutshell, is why Denver's DCO is so important and why Denver needs recurring code: it achieves its best results with code it can optimize once and then frequently reuse.

Also of note, Denver's IPC per slice of time never rises above 2.0, even with full optimization and significant code recurrence in effect. The specific IPC of any program will depend on the nature of its code, but this serves as a good example of the fact that even with the DCO's full bag of tricks, Denver is not going to sustain anything near its theoretical maximum IPC of 7. Individual VLIW instructions may hit 7, but over any period of time, if a lack of ILP in the code itself doesn't become the bottleneck, then other issues such as VLIW density limits, cache flushes, and unavoidable memory stalls will. The important question is ultimately whether Denver's IPC is enough of an improvement over Cortex A15/A57 to justify both the power consumption and die space costs of its very wide design.

NVIDIA's example also neatly highlights the fact that, given Denver's preference for code reuse, it is in a position to do very well in certain types of benchmarks. CPU benchmarks in particular are known for their extended runs of similar code, meant to let the CPU settle and produce a better sustained measurement of performance, all of which plays into Denver's hands. This is not to say that it can't also do well with real-world code, but in these specific situations Denver is well set up to be a benchmark behemoth.

To that end, we have also run our standard copy of SPECint2000 to profile Denver's performance.

SPECint2000 - Estimated Scores

Benchmark      K1-32 (A15)    K1-64 (Denver)    % Advantage
164.gzip           869             1269             46%
175.vpr            909             1312             44%
176.gcc           1617             1884             17%
181.mcf           1304             1746             34%
186.crafty        1030             1470             43%
197.parser         909             1192             31%
252.eon           1940             2342             20%
253.perlbmk       1395             1818             30%
254.gap           1486             1844             24%
255.vortex        1535             2567             67%
256.bzip2         1119             1468             31%
300.twolf         1339             1785             33%

Given Denver's obvious affinity for benchmarks such as SPEC, we won't dwell on the results too much here. But the results do show that Denver is a very strong CPU under SPEC, and by extension under conditions where it can take advantage of significant code reuse. Similarly, because these benchmarks aren't heavily threaded, they're all the happier with any improvements in single-threaded performance that Denver can offer.

Coming from the K1-32 and its Cortex-A15 CPU to the K1-64 and its Denver CPU, the actual gains are unsurprisingly dependent on the benchmark. The worst case scenario of 176.gcc still has Denver ahead by 17%, while the best case scenario of 255.vortex finds Denver besting the A15 by 67%, coming closer than one might expect to outright doubling the A15's performance. The best case scenario is of course unlikely to occur in real code, though I'm not sure the same can be said for the worst case scenario. At the same time we find that there aren't any performance regressions, which is a good start for Denver.

If nothing else it's clear that Denver is a benchmark monster. Now let's see what it can do in the real world.

Comments

  • seanleeforever - Wednesday, February 4, 2015

    2nd that.
    I am not here to read about how fast the tablet is or how nice it looks. i am here for the in-depth content about the chip. would it have been nice for this content to be available since the release of the product? absolutely, but given the resources it would either be a brief review that is the same as reviews you can find from hundreds of other websites, or late but in depth.
    honestly i think anand should be targeting more tech-oriented content that's fewer in number but in depth, and leave the quick/dirty reviews to other websites.

    superb job.
  • WaitingForNehalem - Wednesday, February 4, 2015

    Yeah but who cares about tablets??!! I don't come to AnandTech to read about budget tablets, or SFF PCs, or more smartphones. The Denver coverage was not even that in depth TBH, just commentary on the NVIDIA slides. I have an EE degree, and some of the previous write-ups were so in depth they could be class material. This one isn't, which is fine, but I don't think it excuses how late it came out. The enthusiast market is growing and you should be targeting that demographic as you previously have, not catering to the mainstream like hundreds of sites already do.
  • retrospooty - Wednesday, February 4, 2015

    The enthusiast market is growing? With CPUs not really getting, or needing to be, any faster for several years now, and a standard mid-range quad-core i5 (non-overclocked) being WAY more than powerful enough to run 99.9% of anything out there, how is the enthusiast market growing? Most enthusiasts I know don't even bother any more... There just isn't a need. Any basic PC is great these days.
  • WaitingForNehalem - Wednesday, February 4, 2015

    I totally agree with you. That doesn't change the fact that the market is growing as more users are adopting gaming PCs. Enthusiasts now actually command a sizable portion of desktops sold. Intel's Devil's Canyon was in response to that.
  • retrospooty - Thursday, February 5, 2015

    OK, I get what you mean.

    I guess I am still in a mindset where a PC "enthusiast" is your overclocker, tweaker, buying the latest and fastest of everything to eke out that extra few frames per second.

    Today, a mid-range quad-core i5 from 3 years ago and a decent mid-to-high range card runs any game quite nicely.
  • FunBunny2 - Thursday, February 5, 2015

    There was a time, readers may be too young to have been there, when there was a Wintel monopoly: M$ needed faster chips to run ever more bloated Windoze and Intel needed a cycle-sink to soak up the increase in cycles that evolving chips provided. Now, we're near (or at?) the limits of single-threaded performance, and still haven't found a way to use multi-processor/core chips in individual applications. There just aren't a) many embarrassingly parallel problems and b) algorithms to turn single-threaded problems into parallel code. I mean, the big deal these days is 4K displays? It looks prettier, to some eyes, but doesn't change the functionality of an application (medical and such excepted, possibly).

    Does anyone really need an i7 to surf the innterTubes for neater porn?
  • nico_mach - Friday, February 6, 2015

    I think the chip coverage was superb. I don't have an EE degree, and I'm pretty sure that's what the website is steered towards, and I still think I got it.

    It's fascinating how many layers are involved in this Android tablet, and it speaks to why Apple can optimize so much better. There's the chip->NVIDIA chip optimizer->executable code->Dalvik compiler/runtime->dalvik code. I mean, when lags are encountered, that's twice as many suspects to investigate.

    I still think that the review is a little harsh on Denver. It's hitting the right performance envelope at the right price. While it's a mildly inefficient design, clearly NVIDIA is pricing it accordingly, and that might be a function of moving some of the optimization work to software. And that's work that Apple and MS do all the time - Apple much more successfully, obviously. There's a real gap in knowledge of how efficient Apple's chips are vs how optimized the software/hardware pairing is.
  • dakishimesan - Wednesday, February 4, 2015

    I have no interest in tablets, but the deep dive on Denver was a fascinating read, and still completely relevant even if the product is a few months old. Thanks for the great review.
  • Sindarin - Wednesday, February 4, 2015

    ...can I offer you a cup of hot chicken soup laddy? .....maybe some vicks vapor rub? lol! c'mon dude! we're all sick(vaca) in December!
  • hahmed330 - Wednesday, February 4, 2015

    Hi, outstanding article with incredible attention to detail... Do you think it's possible to run the Dynamic Code Optimizer on, say, 2 or maybe even 4 small CPU cores dedicated to doing all the software OoOE functions instead of using time slicing? (A53s or just some XYZ narrow cores for a potential 2+2 or 4+4 or maybe even 8+8)

    Also, what's the die size of a Denver core in comparison to an enhanced Cyclone core? That is where a lot of gains are possible, potentially 30%-50%.
