HPC: Dell Says a Single PCIe x16 Bus Is Enough for Multiple GPUs – Sometimes

One of the more interesting sessions we saw was a sold-out talk held by Dell, discussing whether a single PCIe x16 bus provides enough bandwidth to GPUs working on HPC tasks. Just as in the gaming world, additional x16 buses come at a premium, and that premium doesn’t always buy the kind of performance that justifies the extra cost. In the case of rackmount servers, giving each GPU a full x16 bus either means inefficiently packing GPUs and systems together by locating the GPUs internally with the rest of the system, or running many, many cables from a system to an external cage carrying the GPUs. Just as with gaming, the more GPUs you can get to share a single PCIe bus, the cheaper the setup becomes. While for gaming this means using cheaper chipsets, for servers it means being able to run more GPUs off of a single host system, reducing the number of hosts an HPC provider would need to buy.

Currently the most popular configurations for the market Dell competes in are systems with a dedicated x16 bus for each GPU, and systems where 2 GPUs share an x16 bus. Dell wants to push the envelope here and go to 4 GPUs per x16 bus in the near future, and upwards of 8 and 16 GPUs per bus further out, once NVIDIA enables support for that in their drivers. To make that happen, Dell is introducing the PowerEdge C410x PCIe Expansion Chassis, a 3U chassis capable of holding up to 16 GPUs. Their talk, in turn, covered what they found when testing this chassis filled with Tesla 10-series cards (GT200 based) in conjunction with a C410 server.

Ultimately Dell’s talk, delivered by staff Ph.D. Mark R. Fernandez, ended up being in equal parts about the performance hit of sharing an x16 bus, and about whether the application in question will scale with more GPUs in the first place. Compared to the gold standard of one bus per GPU with internally located GPUs, simply moving to an external box and sharing an x16 bus among 2 GPUs had a negative impact on 4 of the 6 applications Dell tested with. The external connection almost always comes with a slight hit, but as we would expect, it’s the sharing of the x16 bus that imparts the biggest part of the performance hit.

However, when the application in question does scale beyond 1-2 GPUs, Dell found that the additional GPU performance more than offset the loss from a shared bus. In this case 4 of the same 6 benchmarks saw a significant performance improvement in moving from 2 to 4 GPUs, ranging between a 30% and a 90% improvement. With this many GPUs it’s hard to separate the effects of the bus from scaling limitations, but it’s clear there’s a mix of both going on, and the balance seems particularly dependent on just how much bus traffic an application ultimately generates.
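To put those numbers in perspective, here is a quick back-of-the-envelope sketch (ours, not Dell’s methodology) of what a 30% to 90% gain from doubling the GPU count implies about scaling efficiency relative to ideal linear scaling:

```python
# Illustrative only: the 30%/90% figures come from the article's summary,
# not from Dell's raw per-application data.

def scaling_efficiency(old_gpus, new_gpus, speedup):
    """Fraction of ideal linear scaling actually achieved.

    speedup: measured throughput ratio (new / old), e.g. 1.30 for +30%.
    """
    ideal = new_gpus / old_gpus  # perfect linear scaling would double throughput
    return speedup / ideal

for gain in (0.30, 0.90):
    eff = scaling_efficiency(2, 4, 1.0 + gain)
    print(f"+{gain:.0%} going from 2 to 4 GPUs -> {eff:.0%} of linear scaling")
```

In other words, the worst case Dell cited still captures about 65% of the ideal doubling, and the best case captures about 95%, which is why the extra GPUs pay off despite the shared bus.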

So with these results, Dell’s final answer to whether a single x16 PCIe bus is enough was simply “sometimes”. If an application scales across multiple GPUs in the first place, it usually makes sense to go further – after all, if you’re already on GPUs, you probably need all the performance you can get. However if it doesn’t scale across multiple GPUs, then the bus is the least of the problems. It’s in between these positions where the bus matters: sometimes it’s a bottleneck, and sometimes it’s not. It’s almost entirely application dependent.

NVIDIA Quadro: 3D for More than Gaming

While we were at GTC we had a chance to meet with NVIDIA’s Quadro group, the first such meeting since I became AnandTech’s Senior GPU Editor. We haven’t been in regular contact with the Quadro group as we aren’t currently equipped to test professional cards, so this was the first step in changing that.

Much of what we discussed we’ve already covered in our quick news blurb on the new Quadro parts launching this week: NVIDIA is launching Quadro parts based on the GF106 and GF108 GPUs. This contrasts with their earlier Fermi Quadro parts, which used GF100 GPUs (even heavily cut-down ones) in order to take advantage of GF100’s unique compute capabilities: ECC and half-speed double precision (FP64) performance. As such the Quadro 2000 and 600 are more focused on NVIDIA’s traditional professional graphics markets, while the Quadro 4000, 5000, and 6000 cover a mix of GPU compute users and professional graphics users who need especially high performance.

NVIDIA likes to crow about their professional market share, and for good reason – their share of the market is much more one-sided than in consumer graphics, and the profit margins per GPU are much higher. It’s good to be the king of a professional market. It also helps their image that almost every product being displayed is running a Quadro card, but then that’s an NV conference for you.

Along those lines, it’s the Quadro group that gets to claim much of the credit for the big customers NVIDIA has landed. Adobe is well known, as their Premiere Pro CS5 package offers a CUDA backend. However a new member of this stable is Industrial Light & Magic, who just recently moved to CUDA to do all of their particle effects using a new tool they created, called Plume. This is one of the first users that NVIDIA mentioned to us, and for good reason: this is a market they’re specifically trying to break into. Fermi, after all, was designed to be a ray tracing powerhouse along with being a GPU compute powerhouse, and while NVIDIA hasn’t made as much progress here (the gold standard without a doubt being who can land Pixar), this is the next step in getting there.

Finally, NVIDIA is pushing their 3D stereoscopy initiative beyond the consumer space and games/Blu-Ray. NVIDIA is now looking at ways to use and promote stereoscopy for professional use, and to do so they cooked up some new hardware to match the market in the form of a new 3D Vision kit. Called 3D Vision Pro, it’s effectively the existing 3D Vision kit with all Infrared (IR) communication replaced with RF communication. This means the system uses the same design for the base and the glasses (big heads be warned) while offering the benefits of RF over IR: it doesn’t require line of sight, it plays well with other 3D Vision systems in the same area, and as a result it’s better suited for having multiple people looking at the same monitor. Frankly it’s a shame NVIDIA can’t make RF more economical – removing the LoS requirement alone is a big step up from the IR 3D Vision kit, where it can be easy at times to break communication. But economics is why this is a professional product for the time being: the base alone is $399, and the glasses are another $349, a far cry from the IR kit’s cost of $199 for the base and the glasses together.



Comments

  • dtdw - Sunday, October 10, 2010 - link

    we had a chance to Adobe, Microsoft, Cyberlink, and others about where they see GPU computing going in the next couple of years.

    shouldnt you add 'the' before adobe ?

    and adding 'is' after computing ?
  • tipoo - Sunday, October 10, 2010 - link

    " we had a chance to Adobe, Microsoft, Cyberlink, and others about where they see GPU computing going "

    Great article, but I think you accidentally the whole sentence :-P
  • Deanjo - Sunday, October 10, 2010 - link

    "While NVIDIA has VDPAU and also has parties like S3 use it, AMD and Intel are backing the rival Video Acceleration API (VA API)."

    Ummm wrong, AMD is using XvBA for its video acceleration API. VA API provides a wrapper library to XvBA, much like there is a VA API wrapper for VDPAU. Also VDPAU is not proprietary: it is part of Freedesktop, and the open source library package contains a wrapper library and a debugging library allowing other manufacturers to implement VDPAU support in their device drivers. In short, every device manufacturer out there is free to include VDPAU support, and it is up to the driver developer to add that support to a free and truly open API.
  • Ryan Smith - Sunday, October 10, 2010 - link

    AMD is using XvBA, but it's mostly an issue of semantics. They already had the XvBA backend written, so they merely wrote a shim for VA API to get it in there. In practice XvBA appears to be dead, and developers should use VA API and let AMD and the others work on the backend. So in that sense, AMD are backing VA API.

    As for NVIDIA, proprietary or not doesn't really come in to play. NVIDIA is not going to give up VDPAU (or write a VA API shim) and AMD/Intel don't want to settle on using VDPAU. That's the stalemate that's been going on for a couple of years now, and it doesn't look like there's any incentive on either side to come together.

    It's software developers that lose out; they're the ones that have to write in support for both APIs in their products.
  • electroju - Monday, October 11, 2010 - link

    Deanjo, that is incorrect. VA API is not a wrapper. It is the main API from freedesktop.org. It was created by Intel, unfortunately, but they helped extend the stalled XvMC project into a more flexible API. VDPAU and XvBA came later to provide their own ways of doing about the same thing. They also include backward compatibility with VA API. VDPAU is not open source; it just provides structs to be able to use VDPAU, so VDPAU can not be changed by the open source community to implement new features.
  • AmdInside - Sunday, October 10, 2010 - link

    Good coverage. Always good to read new info. Often looking at graphics card reviews can get boring, as I tend to sometimes just glance at the graphs and that is it. I sure wish Adobe would use the GPU more for photography software. Lightroom is one program that works alright on desktops but is too slow for my taste on laptops.
  • AnnonymousCoward - Monday, October 11, 2010 - link

    Holodeck? C'mon. It's a 3D display. You can't create a couch and then lie on it.
  • Guspaz - Tuesday, October 12, 2010 - link

    I'm sort of disappointed with RemoteFX. It sounds like it won't be usable remotely by consumers or small businesses who are on broadband-class connections; with these types of connections, you can probably count on half a megabit of throughput, and that's probably not enough to be streaming full-screen MJPEG (or whatever they end up using) over the net.

    So, sure, works great over a LAN, but as soon as you try to, say, telecommute to your office PC via a VPN, that's not going to fly.

    Even if you're working for a company with a fat pipe, many consumers (around here, at least) are on DSL lines that will get them 3 or 4 megabits per second; that might be enough for lossy motion-compensated compression like h.264, but is that enough for whatever Microsoft is planning? You lose a lot of efficiency by throwing away iframes and mocomp.
  • ABR - Tuesday, October 19, 2010 - link

    Yeah, it also makes no sense from an economic perspective. Now you have to buy a farm of GPUs to go with your servers? And the video capability now and soon being built into every Intel CPU just goes for nothing? More great ideas from Microsoft.
