HPC: Dell Says a Single PCIe x16 Bus Is Enough for Multiple GPUs – Sometimes
One of the more interesting sessions we saw was a sold-out session being held by Dell, discussing whether a single PCIe x16 bus provides enough bandwidth to GPUs working on HPC tasks. Just as in the gaming world, additional x16 busses come at a premium, and that premium doesn’t always offer the kind of performance that justifies the extra cost. In the case of rackmount servers, giving each GPU a full x16 bus either means inefficiently packing GPUs and systems together by locating the GPU internally with the rest of the system, or running many, many cables from a system to an external cage carrying the GPUs. Just as with gaming, if you can get more GPUs to share a PCIe bus then the cheaper it becomes. While for gaming this means using cheaper chipsets, for servers this means being able to run more GPUs off of a single host system, reducing the number of hosts an HPC provider would need to buy.
Currently the most popular configurations for the market Dell competes in are systems with a dedicated x16 bus for each GPU, and systems where 2 GPUs share an x16 bus. Dell wants to push the envelope here, and go to 4 GPUs per x16 bus in the near future, and upwards of 8 and 16 GPUs per bus in the far future when NVIDIA enables support for that in their drivers. To make that happen, Dell is introducing the PowerEdge C410x PCIe Expansion Chassis, a 3U chassis capable of holding up to 16 GPUs. Their talk in turn what about what they found when testing this cassis when filled with Tesla 1000 series cards (GT200 based) in conjunction with a C410 server.
Ultimately Dell’s talk, delivered by staff Ph.D Mark R. Fernandez, ended up being an equal part about the performance hit of sharing a x16 bus, and whether the application in question will scale with more GPUs in the first place. Compared to the gold standard of one bus per GPU and internally locating the GPUs, simply moving to an external box and sharing an x16 bus among 2 GPUs had a negative impact in 4 of the 6 applications Dell tested with. The external connection would almost always come with a slight hit, while the sharing of the x16 bus is what imparted the biggest part of the performance hit as we would expect.
However when the application in question does scale beyond 1-2 GPUs, what Dell found was that the additional GPU performance more than offset the loss through a shared bus. In this case 4 of the same 6 benchmarks saw a significant performance improvement in moving from 2 to 4 GPUs; ranging between a 30% improvement and a 90% improvement. With these many GPUs it’s hard to separate the effects of the bus from scaling limitations, but it’s clear there’s a mix of both going on, in what seems particularly dependent on just how much bus chatter an application eventually causes.
So with these results, Dell’s final answer over whether a single x16 PCIe bus is enough was simply “sometimes”. If an application scales against multiple GPUs in the first place, it usually makes sense to go further – after all if you’re already on GPUs, you probably need all the performance you can get. However if it doesn’t scale against multiple GPUs, then the bus is the least of the problem. It’s in between these positions where the bus matters: sometimes it’s a bottleneck, and sometimes it’s not. It’s almost entirely application dependent.
NVIDIA Quadro: 3D for More than Gaming
While we were at GTC we had a chance to meet with NVIDIA’s Quadro group, the first such meeting since I became AnandTech’s Senior GPU Editor. We haven’t been in regular contact with the Quadro group as we aren’t currently equipped to test professional cards, so this was the first step in changing that.
Much of what we discussed we’ve already covered in our quick news blurb on the new Quadro parts launching this week: NVIDIA is launching Quadro parts based on the GF106 and GF108 GPUs. This contrasts from their earlier Fermi Quadro parts, which used GF100 GPUs (even heavily cut-down ones) in order to take advantage of GF100’s unique compute capabilities: ECC and half-speed double precision (FP64) performance. As such the Quadro 2000 and 600 are more focused on NVIDIA’s traditional professional graphics markets, while the Quadro 4000, 5000, and 6000 cover a mix of GPU compute users and professional graphics users who need especially high performance.
NVIDIA likes to crow about their professional market share, and for good reason – their share of the market is much more one-sided than consumer graphics, and the profit margins per GPU are much higher. It’s good to be the king of a professional market. It also helps their image that almost every product being displayed is running a Quadro card, but then that’s an NV conference for you.
Along those lines, it’s the Quadro group that gets to claim much of the credit for the big customers NVIDIA has landed. Adobe is well known, as their Premiere Pro CS5 package offers a CUDA backend. However a new member of this stable is Industrial Light & Magic, who just recently moved to CUDA to do all of their particle effects using a new tool they created, called Plume. This is one of the first users that NVIDIA mentioned to us, and for good reason: this is a market they’re specifically trying to break in to. Fermi after all was designed to be a ray tracing powerhouse along with being a GPU compute powerhouse, and while NVIDIA hasn’t made as much progress here (the gold standard without a doubt being who can land Pixar), this is the next step in getting there.
Finally, NVIDIA is pushing their 3D stereoscopy initiative beyond the consumer space and games/Blu-Ray. NVIDIA is now looking at ways to use and promote stereoscopy for professional use, and to do so they cooked up some new hardware to match the market in the form of a new 3D Vision kit. Called 3D Vision Pro, it’s effectively the existing 3D Vision kit with all Infrared (IR) communication replaced with RF communication. This means the system uses the same design for the base and the glasses (big heads be warned) while offering the benefits of RF over IR: it doesn’t require line of sight, it plays well with other 3D Vision systems in the same area, and as a result it’s better suited for having multiple people looking at the same monitor. Frankly it’s a shame NVIDIA can’t make RF more economical – removing the LoS requirements alone is a big step up from the IR 3D Vision kit where it can be easy at times to break communication. But economics is why this is a professional product at the time: the base alone is $399, and the glasses are another $349, a far cry from the IR kit’s cost of $199 for the base and the glasses together.