The Secret of Denver: Binary Translation & Code Optimization

As we alluded to earlier, NVIDIA’s decision to forgo a traditional out-of-order design for Denver means that much of Denver’s potential is contained in its software rather than its hardware. The underlying chip itself, though by no means simple, is at its core a very large in-order processor. So it falls to the software stack to make Denver sing.

Accomplishing this task is NVIDIA’s dynamic code optimizer (DCO). The DCO serves two purposes: to translate ARM code to Denver’s native format, and to optimize this code to make it run better on Denver. With no out-of-order hardware on Denver, it is the DCO’s task to find instruction level parallelism within a thread to fill Denver’s many execution units, and to reorder instructions around potential stalls, neither of which is a simple task.
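
To give a concrete feel for the scheduling problem, below is a minimal Python sketch of greedy bundle scheduling: packing a sequential instruction stream into wide issue slots while honoring data dependencies. This illustrates the general technique only, not NVIDIA's actual algorithm; the instruction format, register names, and single-cycle latencies are all simplifying assumptions made for this example.

```python
# Sketch of the core problem a software scheduler faces on an in-order
# machine: pack a sequential instruction stream into wide issue bundles
# (Denver is 7-wide) without violating data dependencies. The 3-tuple
# instruction format is invented for this example; load latency and
# WAR/WAW hazards are ignored (the DCO would handle the latter with
# register renaming).

WIDTH = 7  # Denver can issue up to 7 micro-ops per cycle

# (text, destination register, source registers)
stream = [
    ("ld   r1, [r0]",   "r1", ["r0"]),
    ("add  r2, r1, #4", "r2", ["r1"]),        # depends on the first load
    ("ld   r3, [r4]",   "r3", ["r4"]),        # independent of the first load
    ("mul  r5, r3, #2", "r5", ["r3"]),
    ("add  r6, r2, r5", "r6", ["r2", "r5"]),
]

def schedule(instrs, width=WIDTH):
    """Greedy list scheduling: each cycle, issue any instruction whose
    sources were produced in an earlier cycle, up to the issue width."""
    written = {dest for _, dest, _ in instrs}
    ready = {src for _, _, srcs in instrs for src in srcs
             if src not in written}           # live-in registers
    remaining, bundles = list(instrs), []
    while remaining:
        bundle, rest = [], []
        for instr in remaining:
            _, _, srcs = instr
            if len(bundle) < width and all(s in ready for s in srcs):
                bundle.append(instr)
            else:
                rest.append(instr)
        assert bundle, "cyclic dependency in input"
        ready.update(dest for _, dest, _ in bundle)
        bundles.append([text for text, _, _ in bundle])
        remaining = rest
    return bundles

for cycle, bundle in enumerate(schedule(stream)):
    print(f"cycle {cycle}: {bundle}")
# cycle 0 issues both loads together; the dependent arithmetic follows.
```

The key point is that all of this work happens ahead of execution, in software; a hardware out-of-order engine does the equivalent on the fly with its scheduler and reorder buffer.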

Starting first with the binary translation aspects of the DCO, the binary translator is not used for all code. All code goes through Denver’s ARM decoder units at least once, and only after Denver sees that it has run the same code segment enough times does that code get kicked over to the translator. Code translation and optimization is itself a software task, and as a result it requires a certain amount of real time, CPU time, and power. This means it only makes sense to send code out for translation and optimization if it’s recurring, even though the ARM decoder path fails to exploit much of Denver’s capabilities.
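
The recurrence heuristic can be sketched as a simple counter-and-threshold dispatch. The threshold value and the data structures below are hypothetical stand-ins, as NVIDIA has not published the real figures; what matters is the shape: run through the slow path until the code proves it recurs, then pay the one-time translation cost.

```python
# Hypothetical sketch of hot-code detection; the threshold and the
# per-segment bookkeeping are invented for illustration.

HOT_THRESHOLD = 10      # assumed execution count before translation pays off

execution_counts = {}   # segment entry address -> times executed
translated = {}         # segment entry address -> optimized native routine

def run_segment(addr, interpret, translate):
    """Dispatch one code segment, translating it once it proves hot."""
    if addr not in translated:
        execution_counts[addr] = execution_counts.get(addr, 0) + 1
        if execution_counts[addr] < HOT_THRESHOLD:
            return interpret(addr)      # slow path: the ARM decoder units
        # Recurring code: the one-time translation cost is now worth paying.
        translated[addr] = translate(addr)
    return translated[addr]()           # fast path: optimized native code
```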

This sets up some very clear best and worst case scenarios for Denver. In the best case scenario Denver is entirely running code that has already been through the DCO, meaning it’s being fed the best code possible and isn’t having to run suboptimal code from the ARM decoder or spend resources invoking the optimizer. Conversely, the worst case scenario for Denver is code that doesn’t recur. With non-recurring code the optimizer never gets used because that code is never seen again, and invoking the DCO would be pointless as the benefits of optimizing the code are outweighed by the costs of that optimization.

Assuming that a code segment recurs enough to justify translation, it is kicked over to the DCO for translation and optimization. Because this is itself a software process, the DCO is a critical component twice over: for the code it generates and for the code it is itself built from. The DCO needs to be highly tuned so that Denver isn’t spending more resources than necessary to run the DCO, and it needs to produce highly optimized code to ensure the chip achieves maximum performance. This becomes a very interesting balancing act for NVIDIA, as a longer examination of a code segment could potentially produce even better code, but it would also increase the cost of running the DCO.

In the optimization step NVIDIA undertakes a number of actions to improve code performance. These include out-of-order optimizations such as instruction and load/store reordering, along with register renaming. The DCO also behaves as a traditional compiler would, undertaking actions such as unrolling loops and eliminating redundant or dead code that would never be executed. For NVIDIA this optimization step is the most critical aspect of Denver, as the chip’s performance will live and die by the DCO.
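
Two of these compiler-style passes are easy to demonstrate on a toy intermediate representation. The IR and the pass implementations below are generic textbook techniques written for this article, not Denver's internal machinery.

```python
# Toy demonstrations of dead code elimination and loop unrolling on an
# invented IR; both are standard compiler passes, shown here only to
# illustrate the kind of work the DCO performs.

# (operation, destination register or None, source registers)
ir = [
    ("mul", "t0", ["a", "b"]),   # t0 is never read again: dead
    ("add", "t1", ["a", "c"]),
    ("ret", None, ["t1"]),
]

def eliminate_dead_code(instrs):
    """Walk backwards, keeping only instructions whose results are used."""
    live, kept = set(), []
    for op, dest, srcs in reversed(instrs):
        if dest is None or dest in live:   # result is needed (or side effect)
            kept.append((op, dest, srcs))
            live.update(srcs)
    return list(reversed(kept))

print(eliminate_dead_code(ir))   # the dead mul is gone

def unroll_by(body, factor=4):
    """Return one unrolled loop body: `factor` copies with the index
    offset, so the loop's compare-and-branch runs 1/factor as often."""
    return [instr.replace("[i]", f"[i+{k}]")
            for k in range(factor) for instr in body]

for line in unroll_by(["ld  r, A[i]", "add s, s, r"]):
    print(line)
# The four loads are now independent and can be scheduled together; a
# real optimizer would also rename r and split the s accumulator to
# break the remaining dependencies between the copies.
```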


Denver's optimization cache: optimized code can call other optimized code for even better performance

Once code leaves the DCO, it is stored for future use in an area NVIDIA calls the optimization cache. The cache is a 128MB segment of main memory reserved to hold these translated and optimized code segments for future reuse, with Denver banking on its ability to reuse code to achieve its peak performance. The presence of the optimization cache does mean that Denver suffers a slight memory capacity penalty compared to other SoCs; in the case of the Nexus 9 it means that 1/16th (6.25%) of the tablet’s 2GB of memory is reserved for the cache. Also resident here is the DCO code itself, which is shipped and stored as already-optimized code so that it can achieve its full performance right off the bat.
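
A rough model of the cache's role is below, using the 128MB figure from above; the data structure and the eviction policy are assumptions made purely for illustration, as NVIDIA has not documented them.

```python
# Rough model of a translation cache: hot code pays its translation cost
# once, and every later execution is a lookup. Size per the article; the
# structure and eviction policy are invented for this sketch.

OPT_CACHE_BYTES = 128 * 1024 * 1024   # reserved out of the Nexus 9's 2GB

class OptimizationCache:
    def __init__(self, capacity=OPT_CACHE_BYTES):
        self.capacity = capacity
        self.used = 0
        self.entries = {}   # guest entry address -> translated code bytes

    def lookup(self, addr):
        """Return the translation for addr, or None on a miss."""
        return self.entries.get(addr)

    def insert(self, addr, code):
        while self.entries and self.used + len(code) > self.capacity:
            # Placeholder policy: drop an arbitrary entry to make room.
            _, evicted = self.entries.popitem()
            self.used -= len(evicted)
        self.entries[addr] = code
        self.used += len(code)
```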

Overall the DCO ends up being interesting for a number of reasons, not the least of which are the tradeoffs made by its inclusion. The DCO’s instruction window is larger than that of any comparable OoOE engine, meaning NVIDIA can look at larger code blocks than hardware reorder engines and potentially extract even better ILP and other optimizations from the code. On the other hand the DCO can only work on code in advance, denying it the ability to see and react to code in real time as it’s executing the way a hardware out-of-order implementation can. In such cases, even with a smaller window to work with, a hardware OoOE implementation could produce better results, particularly in avoiding memory stalls.

As Denver lives and dies by its optimizer, NVIDIA is once again in an interesting position owing to its GPU heritage. Much of the above is as true for GPUs as it is for Denver, and while it’s by no means a perfect overlap, it does mean that NVIDIA comes into this with a great deal of experience in optimizing code for an in-order processor. NVIDIA faces a major uphill battle here – hardware OoOE has proven itself reliable time and time again, especially compared to projects banking on superior compilers – so having that compiler background is incredibly important for NVIDIA.

In the meantime, because NVIDIA relies on a software optimizer, Denver’s code optimization routine has one last advantage over hardware: upgradability. NVIDIA retains the ability to upgrade the DCO itself, potentially deploying new versions farther down the line as improvements are made. In principle a DCO upgrade is not a feature you want to find yourself needing to use – ideally Denver’s optimizer would be perfect from the start – but it’s nonetheless a good feature to have for the imperfect real world.

Case in point: we have encountered a floating point bug in Denver that has been traced back to the DCO, which under exceptional workloads causes Denver to overflow an internal register and trigger an SoC reset. Though this bug doesn’t lead to reliability problems in real world usage, it’s exactly the kind of issue that makes DCO updates valuable, as they give NVIDIA an opportunity to fix the bug. At the same time, however, NVIDIA has yet to take advantage of this opportunity, and as of the latest version of Android for the Nexus 9 the issue still occurs. So it remains to be seen whether BSP updates will include DCO updates to improve performance and remove such bugs.

Comments

  • melgross - Wednesday, February 4, 2015 - link

    So, people only buy devices during the first three months?
  • Impulses - Wednesday, February 4, 2015 - link

    Apparently... Although getting the review in before February would've shut all these people up, cheapest place to get the Nexus 9 all thru the holidays was Amazon ($350 for 16GB) and they gave you until January 31 to return it regardless of when you bought it.

    Only reason I'm so keenly aware is I bought one as a February birthday gift, opened it last weekend just to check it was fine before the return window closed... Not much backlight bleed at all even tho it was manufactured in October (bought in late December), some back flex but it's going in a case anyway.
  • blzd - Friday, February 6, 2015 - link

    What does the month of manufacture have to do with the backlight bleed? You don't actually believe those "revision" rumors, do you?

    If you do, consider how practical it is for a hardware revision to come out 1 month after release. Then consider how one set of pictures on a Reddit post proves anything other than that their RMA worked as intended.
  • ToTTenTranz - Wednesday, February 4, 2015 - link

    I wish more smartphone/tablet makers put as much thought into their external speakers as HTC does.

    Having owned an HTC One M7, I simply can't go back to mono speakers on the back of devices.
  • Dribble - Wednesday, February 4, 2015 - link

    Glad the review is here at last, next one a little bit quicker please :)
  • UpSpin - Wednesday, February 4, 2015 - link

    I have the following issues with your review:
    1. You run web browser tests and derive CPU performance from them. That's nonsense! It's a web browser test, and it won't be a CPU test no matter what you do. If you want to test raw CPU performance you have to run native CPU test applications.

    2. Your battery life analysis is based on false assumptions and you derive doubtful claims from it.
    The error is quite evident in the iPad Air 2 test. In your newly introduced white display test, with airplane mode on, CPU/GPU idling, etc., the iPad Air 2 has a battery life of 10.18 hours. Now, in your web browsing battery test with WiFi on and the CPU busy, the iPad Air 2 has a battery life of 9.76 hours. That's a difference of 4%. The Nexus 9 has a difference of 30%, the Note 4 15%, the Shield Tablet 25%.
    You conclude: the Tegra K1 is inefficient. But I could just as well conclude that the A8 is inefficient and the Tegra K1 very efficient. The Tegra K1 needs significantly less power while idling, compared to the A8, which always consumes about the same amount, mostly independent of the load. So finally, the A8 lacks any kind of power saving mode.
    That's absurd, but it's the consequence of your test. Or maybe your test is flawed from the very beginning.

    3. " I suspect we’re looking at the direct result of the large battery, combined with an efficient display as the Nexus 9 can last as long as 15 hours in this test compared to the iPad Air 2’s 10 hours."
    Sorry, but I don't get this either. The Nexus 9 has a 25.46 Wh battery, the iPad Air 2 a 27.3 Wh battery (+7%). The Nexus 9 has an 8.9" display, the iPad Air 2 a 9.7" one (+19% area). The resolution is the same, thus the DPI on the Nexus 9 is higher. The display technology is the same, as you said in your analysis. So the difference must be related to something else, like a highly efficient idle SoC in the Nexus 9.
  • Andrei Frumusanu - Wednesday, February 4, 2015 - link

    The battery life analysis is based on facts about the technical workings of the SoC and its idle power states, and we are confident in the resulting conclusions.
  • JarredWalton - Wednesday, February 4, 2015 - link

    Going along with what Andrei said, an SoC isn't "efficient" if it's doing no work -- the A8 may not have idle power as low as the K1-64, but efficiency matters when you're actually doing something with the tablet in question. It's clear that the Air 2 wins out over the Nexus 9 in some of those tests (GFX in particular). Doing more (or equivalent) work while using less power is efficient.

    Imagine this as an example of why idle power only matters so far: if you were to start comparing cars on how long they could idle instead of actual gas mileage, would anyone care? "Car XYZ can run for 20 hours off a tank while idle while Car ZYX only lasts 15 hours!" Except neither car is actually doing what a car is supposed to do, which is take you from point A to point B.

    The white screen test is merely a way to look at the idle power draw of a device, and from that we can get an idea of how much additional power is needed when the device is actually in use. Also note that due to the difference in OS, it's possible Android simply does a better job of disabling certain services in this test scenario while iOS is wasting power -- the fact that the iPad's battery life hardly changes in our Internet WiFi test even suggests that's the case.

    To that end, the battery life of the N9 is still quite good. Get rid of the smartphones in the charts and it's actually pretty much class leading. But it's still odd that the NVIDIA SHIELD Tablet and iPad Air 2 only show a small drop between idle and Internet use, while the N9 loses 33% of its battery life.
  • ABR - Thursday, February 5, 2015 - link

    Idle power is pretty important for real world use of tablets, for example when you are reading something and the system is just sitting there. Those "load web page then pause for xx time" tests would probably be really good for measuring that.
  • JarredWalton - Thursday, February 5, 2015 - link

    That's exactly what our Internet test does, which is why the 33% drop in battery life is so alarming. What exactly is going on that makes loading a generally not too complex web page every 15 seconds or so kill the N9's battery life?
