Today’s piece is a somewhat unusual review: NVIDIA’s new Jetson AGX embedded system kit isn’t a platform we expect the average reader to think about, much less buy. NVIDIA’s shift over the last few years from consumer-grade Tegra chipsets to more specialised silicon is nowhere more evident than in the new Tegra Xavier, which powers the Jetson AGX. While the board's capabilities fall outside the use-cases of most consumers, it is still a very interesting platform, with a lot of functionality and silicon IP that we don’t find in any other device to this day. So when NVIDIA reached out to offer us a sample, we decided to have a go at assembling a high-level overview of what the board and the new Xavier chip can do.

First of all, we have to describe what this actually is. The Jetson AGX is a full-fledged small form-factor computer / embedded system, with the whole unit measuring no more than 105x105mm. The AGX module itself is designed to be a complete commercial off-the-shelf (COTS) system for use in finished products, with NVIDIA aiming it at AI-centric (read: neural networking) use cases such as robotics and industrial automation. Jetson boards typically occupy the small-to-mid volume end of the market, showing up in one-off products and items with limited production runs, where it doesn't make sense for a manufacturer to develop and deploy their own custom hardware.

But of course the bare module is only half the story. You can't do development against a bare module, and this is where NVIDIA's complete Jetson AGX development kit comes in. The AGX dev kit comes with everything needed to run a single module, including a power supply, a heatsink, and most important of all, a breakout board. The breakout board offers various I/O headers and ports, ranging from the standard two USB-C 3.1 ports, HDMI connectors and Gigabit Ethernet ports, to more specialised connectivity such as MIPI CSI-2 connectors for cameras and a range of typical development board headers such as a 40-pin GPIO connector.

The more unusual connectivity options of the Jetson AGX are the PCIe Gen4 x16 slot and an M.2 PCIe x1 expansion slot meant for connectivity add-ons such as WiFi or cellular modules. Neither feature is common among Arm development boards, as most SoCs don’t have the spare PCIe controllers.

The board comes with many other connectors, and flexibility is one regard in which the new Jetson AGX doesn’t lack at all. Power is supplied by a generic external 19V power supply; the stock unit supplied by NVIDIA is a 65W LiteOn brick that seems no different from most laptop chargers.

Underneath the heavy, solid aluminium heatsink we find what actually powers the Jetson AGX board: the AGX Xavier module. This is a system module that sits on top of the Jetson motherboard. The module has no I/O ports of its own and merely serves as the brains of the system, integrating the core components surrounding the Xavier chip: 16GB of LPDDR4x memory, a small 32GB eMMC storage chip, and all the power delivery circuits that feed the different power rails of the DRAM and of the Xavier SoC's IP blocks.

The Xavier chip, as said, is the brains of the platform and represents NVIDIA’s biggest and most complex SoC to date. With 9 billion transistors on a die size of 350mm², it’s one of the heavyweights of the Arm ecosystem, although between its initial announcement and today Apple has one-upped NVIDIA in terms of transistor count, as the new A12X is a 10 billion transistor chip built on a much smaller manufacturing node.

Coming from the traditional PC industry, NVIDIA doesn’t shy away from showing die shots of its products, which is quite rare these days among Arm SoC vendors. The Xavier SoC is dominated by two big IP blocks which take up the majority of the die area: the 8-core “Carmel” CPU complex and a four-cluster Volta GPU.

At a high level, the CPU complex contains 8 Carmel CPU cores configured in four clusters, each with a pair of Carmel CPU cores. Each cluster has an independent clock plane and shares a 2MB L2 cache between its two CPU cores. At the CPU complex level we find a 4MB L3 cache serving all clusters. We don’t know too much about the microarchitecture of the new Carmel cores; it looks to be a successor to NVIDIA’s Denver µarch, a design that was characterised by its dynamic code optimisation capability. The only thing that NVIDIA does advertise is that this is a 10-wide superscalar machine (10 execution ports in this case, not 10-wide decode) with support for the ARMv8.2+RAS instruction set. We’ll come back to the CPU core later in the article.
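To make the cache hierarchy described above concrete, here is a minimal Python sketch that models the topology as plain data and totals the cache capacity. The class, field and method names are hypothetical, the core numbering scheme is an assumption, and the figures are simply those quoted in the text.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CpuComplex:
    """Toy model of the Carmel CPU complex; all names are illustrative only."""

    clusters: int = 4            # four clusters, each on its own clock plane
    cores_per_cluster: int = 2   # a pair of Carmel cores per cluster
    cluster_cache_mb: int = 2    # cache shared by the two cores of a cluster
    l3_cache_mb: int = 4         # L3 shared by all clusters

    @property
    def total_cores(self) -> int:
        return self.clusters * self.cores_per_cluster

    @property
    def total_cache_mb(self) -> int:
        # Sum of every per-cluster cache plus the single shared L3.
        return self.clusters * self.cluster_cache_mb + self.l3_cache_mb

    def cluster_of(self, core_id: int) -> int:
        """Cluster index of a core, assuming cores are numbered cluster by cluster."""
        if not 0 <= core_id < self.total_cores:
            raise ValueError(f"core_id must be in [0, {self.total_cores})")
        return core_id // self.cores_per_cluster


xavier_cpu = CpuComplex()
print(xavier_cpu.total_cores, "cores,", xavier_cpu.total_cache_mb, "MB of cache in total")
```

Under these numbers the complex adds up to 8 cores and 12MB of cache across the two shared levels.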

The GPU in Xavier has its roots in the Volta architecture. Here we find the GPU configured into four TPCs (Texture Processing Clusters), each with two SMs (Streaming Multiprocessors), for a total of 8 SMs or 512 ALU lanes/CUDA cores. One of the most interesting aspects of the GPU is that because it is based on Volta, it also inherits the Tensor processing units from its bigger brethren. The Tensor cores add up to 22.6 TOPS of 8-bit or 11.3 TOPS of FP16 throughput, on top of the 2.8 TFLOPS of FP16 and 1.4 TFLOPS of FP32 that the SMs provide for regular CUDA operations.
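The throughput figures above can be roughly reproduced with some back-of-the-envelope arithmetic. The Python sketch below is only that, a sketch: the SM count and CUDA core count come from the text, but the GPU clock (1.377GHz), the number of Tensor cores per SM (8), the 64 FMAs per Tensor core per clock, the packed two-wide FP16 rate and the doubled INT8 rate are all assumptions chosen because they are consistent with the quoted numbers.

```python
# Back-of-the-envelope peak-throughput model for the Xavier GPU.
SM_COUNT = 8                           # from the text: 4 TPCs x 2 SMs
CUDA_CORES_PER_SM = 64                 # from the text: 512 CUDA cores / 8 SMs
TENSOR_CORES_PER_SM = 8                # assumption: not stated in the text
FMAS_PER_TENSOR_CORE_PER_CLOCK = 64    # assumption: a 4x4x4 FP16 matrix FMA
GPU_CLOCK_GHZ = 1.377                  # assumption: peak clock, not in the text


def cuda_core_tflops(flops_per_core_per_clock: int) -> float:
    """Peak TFLOPS of the CUDA cores for a given per-core rate."""
    total_cores = SM_COUNT * CUDA_CORES_PER_SM
    return total_cores * flops_per_core_per_clock * GPU_CLOCK_GHZ / 1000.0


def tensor_core_tops(rate_multiplier: float = 1.0) -> float:
    """Peak TOPS of the Tensor cores; one FMA counts as two operations."""
    fmas_per_clock = SM_COUNT * TENSOR_CORES_PER_SM * FMAS_PER_TENSOR_CORE_PER_CLOCK
    return fmas_per_clock * 2 * rate_multiplier * GPU_CLOCK_GHZ / 1000.0


fp32_tflops = cuda_core_tflops(2)        # one FMA per core per clock = 2 FLOPs
fp16_tflops = cuda_core_tflops(4)        # assumption: two packed FP16 FMAs per clock
tensor_fp16_tops = tensor_core_tops()    # baseline Tensor core rate
tensor_int8_tops = tensor_core_tops(2.0) # assumption: INT8 at twice the FP16 rate

print(f"CUDA FP32:   {fp32_tflops:.1f} TFLOPS")
print(f"CUDA FP16:   {fp16_tflops:.1f} TFLOPS")
print(f"Tensor FP16: {tensor_fp16_tops:.1f} TOPS")
print(f"Tensor INT8: {tensor_int8_tops:.1f} TOPS")
```

These are theoretical peaks; sustained throughput depends on clocks, power mode and how well a workload maps onto the hardware.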

Alongside the CPU and GPU there are many other important blocks, many of which NVIDIA had already covered in its Hot Chips 2018 presentation last summer. The one block that really does augment the Xavier SoC is the new DLA IP block. This is very much a new type of block that follows the trend we’ve seen in the mobile SoC space: a dedicated machine inferencing acceleration unit, not unlike those we’ve seen from the likes of HiSilicon or Apple. NVIDIA’s DLA promises performance of up to 11.4 INT8 TOPS and is also capable of FP16 operation at half that rate, 5.7 TOPS. On the SoC, the unit is implemented as a dual-core instance.
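To put the DLA's peak figures into perspective, the following Python sketch converts them into a best-case latency for a single inference. It is a rough lower bound only: the workload size is a made-up illustrative number, the function name is hypothetical, and real networks will not keep the unit at its peak rate.

```python
DLA_INT8_TOPS = 11.4                  # peak figure quoted in the text
DLA_FP16_TOPS = DLA_INT8_TOPS / 2     # FP16 runs at half the INT8 rate


def ideal_latency_ms(ops_per_inference: float, peak_tops: float) -> float:
    """Best-case latency in milliseconds if the unit ran at peak rate throughout."""
    if peak_tops <= 0:
        raise ValueError("peak_tops must be positive")
    seconds = ops_per_inference / (peak_tops * 1e12)
    return seconds * 1e3


# Hypothetical workload: 8 billion operations per inference (illustrative only).
HYPOTHETICAL_OPS = 8e9

int8_ms = ideal_latency_ms(HYPOTHETICAL_OPS, DLA_INT8_TOPS)
fp16_ms = ideal_latency_ms(HYPOTHETICAL_OPS, DLA_FP16_TOPS)

print(f"INT8: {int8_ms:.2f} ms per inference at best")
print(f"FP16: {fp16_ms:.2f} ms per inference at best")
```

The takeaway is the ratio rather than the absolute numbers: at half the throughput, the same network takes at least twice as long in FP16 as in INT8.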

Alongside the DLA, the programmable vision accelerator (PVA) is another key component of the Xavier system, one that allows it to focus on vision and in particular on robotics, embedded AI and automotive use-cases. The PVA is a more traditional vision IP block that handles more rudimentary tasks such as object detection far more efficiently than the GPU or machine inferencing algorithms could. Here the PVA will be the first IP block after the ISP in the vision pipeline, segmenting parts of an image into objects that are then forwarded to other algorithms running on the GPU or DLA.


46 Comments


  • webdoctors - Friday, January 04, 2019

    I was also confused by the news of Audi using Samsung chips. I don't think this changes the Audi/Nvidia relationship from googling: http://fortune.com/2017/01/05/audi-nvidia-2020/

    I think in the infotainment sector there's just a lot of competition for cheap chips and a low bar for entry. Any Mediatek or run of the mill cellphone chip should do. I doubt you'd care about ECC or safety in the HW playing your music or watching movies. My current car has an aftermarket unit that's 10 years old that can play DVD movies, has GPS maps and integrates a backup camera.

    I'm not sure how you'd program a beast of a chip here, or even what the right benchmarks are, since you wouldn't need it just to play movies, show maps or run CPU benchmarks. With all the inferencing and visual processing it'd be a waste of resources and money to use it for the traditional tasks done today in cars.

    I'm really curious how Anandtech evaluates these specialized products that aren't your run of the mill CPU/GPU/HDD.
  • unrulycow - Saturday, January 05, 2019

    This is obviously overkill for the entertainment system. Its main purpose is for the semi-autonomous driving systems like Cadillac's SuperCruise or Tesla's Autopilot.
  • Andrei Frumusanu - Friday, January 04, 2019

    As far as I know their mobile roadmap still uses custom cores; there are probably different requirements for automotive, or they could have simply said that 8 A76s make a lot more sense than 8 custom cores.
  • eastcoast_pete - Saturday, January 05, 2019

    Thanks Andrei! Yes, design requirements for automotive/vehicle-embedded are different in key areas (safety/security). However, I was/am struck by Samsung not adapting their own Mongoose design for AE use. Maybe their client (Audi) preferred the stock A76 AE design, and it wasn't economical to adapt Mongoose. However, this now means that the most powerful Samsung SoC design (A76 octacore) might be found in - Audi cars.
  • unrulycow - Saturday, January 05, 2019

    They are also losing Tesla as a client. Tesla decided to create their own chip which will theoretically start going into cars in Q2. I would love to see a comparison between the two chips.
  • CheapSushi - Friday, January 04, 2019

    This is very minor but I'm surprised the ports/connectors aren't more secure on something meant to be in a car. I would expect cables to be screwed in like classic DVI or twist locked in or some other implementation. I feel like the vibration of the car, or even a minor accident, could loosen the cables. Or maybe I got the wrong impression from the kit.
  • KateH - Friday, January 04, 2019

    afaik the generic breakout boards included in dev kits are just for the "dev" part- development and one-offs. a final design would probably use a custom breakout board with just the interfaces needed and in a more rugged form factor that's integrated into the product.
  • mode_13h - Friday, January 04, 2019

    Would've loved to see a Denver2 (Tegra TX2) in that comparison. According to this, they're actually faster than Carmel:

    https://openbenchmarking.org/result/1809258-RA-180...

    Note that the benchmark results named "TX2-6cores-enabled-gcc-5.4.0" refer to the fact that TX2 had the Denver2 cores disabled by default! Out of the box, it just ran everything on the quad-A57 cluster.
  • edatech - Saturday, January 05, 2019

    The same results also say the TX2 is running at a higher frequency (TX2 @ 2.04GHz while Jetson Xavier @ 1.19GHz), so it's not quite an apples-to-apples comparison.
  • mode_13h - Saturday, January 05, 2019

    I'm not sure how much to read into that number. Would they really run the A57 and Denver2 cores at the same frequency? Is the Xavier figure really the boost, and not just the base clock?

    There's also this (newer) result:

    https://openbenchmarking.org/result/1812170-SK-180...

    Again, my point is that I wish the article had looked at Denver2. It sounds like an interesting, if mysterious core.

    Jetson TX2 boards are still available - and at much lower prices than Xavier. So, it's still a worthwhile and relevant question how it compares - especially for those not needing Xavier's Volta and Tensor cores.
