We had the chance to briefly visit Stampede, the first supercomputer based on the Xeon Phi. It is one of the supercomputers at the Texas Advanced Computing Center (TACC).

Stampede consists of 6400 PowerEdge C8220X and C8220 server sleds. Typically these servers contain two eight-core Xeon E5s, 32 GB of RAM and one GPU/MIC.

Eight of those server sleds find a home inside the C8000 4U Chassis, together with two power sleds.


Dual-ported Mellanox ConnectX adapters with FDR InfiniBand interfaces connect all those servers together to form one large supercomputer. Each rack holds 8 C8000s on average.

Connect 200 racks together and you get the Stampede supercomputer.

The Xeon E5s deliver 2 petaflops at the moment. When all Xeon Phi cards are in place, an additional 8 petaflops will be available to researchers on Stampede.
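The counts quoted above can be cross-checked with a little arithmetic. The sketch below is only a back-of-the-envelope check: the function and variable names are ours, and the 4-sleds-per-chassis case is an assumption based on a reader comment below suggesting that the Xeon Phi sleds are larger, not a figure we confirmed on site.

```python
# Back-of-the-envelope check of the figures quoted in the article.
# All names here are illustrative; nothing below comes from TACC tooling.

SERVERS = 6400            # server sleds, per the article
CHASSIS_PER_RACK = 8      # C8000 chassis per rack, on average
XEON_E5_PFLOPS = 2.0      # delivered by the Xeon E5s today
XEON_PHI_PFLOPS = 8.0     # expected once all Xeon Phi cards are installed


def ceil_div(a: int, b: int) -> int:
    """Integer division rounded up."""
    return -(-a // b)


def racks_needed(servers: int, sleds_per_chassis: int,
                 chassis_per_rack: int = CHASSIS_PER_RACK) -> int:
    """Racks required to house `servers` sleds at the given density."""
    chassis = ceil_div(servers, sleds_per_chassis)
    return ceil_div(chassis, chassis_per_rack)


# 8 sleds per chassis is the figure quoted in the article.
# 4 per chassis is an ASSUMPTION (larger sleds that hold a Xeon Phi card).
racks_at_8 = racks_needed(SERVERS, 8)
racks_at_4 = racks_needed(SERVERS, 4)

# Share of the planned peak performance that comes from the Xeon Phi cards.
total_pflops = XEON_E5_PFLOPS + XEON_PHI_PFLOPS
phi_share = XEON_PHI_PFLOPS / total_pflops

# Average Xeon E5 throughput per server (1 petaflops = 1e6 gigaflops).
gflops_per_server_e5 = XEON_E5_PFLOPS * 1e6 / SERVERS

print(f"racks at 8 sleds per chassis: {racks_at_8}")
print(f"racks at 4 sleds per chassis: {racks_at_4}")
print(f"Xeon Phi share of peak: {phi_share:.0%}")
print(f"Xeon E5 GFLOPS per server: {gflops_per_server_e5:.1f}")
```

At 8 sleds per chassis the 6400 servers would fit in 100 racks; at 4 sleds per chassis they fill 200, which is the rack count quoted above.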

Intel's Xeon Phi is not a standalone replacement for a GPU. For example, the Xeon Phi has no texture units. As a result, remote visualization is done by 128 NVIDIA Tesla K20 GPUs. The rest of the supercomputer consists of 272 TB of total memory and 14 PB of disk storage. The complete supercomputer and the necessary cooling will require up to 6 megawatts of power.




  • creed3020 - Friday, November 16, 2012

    There are only 4 per row in the chassis because these units in Stampede feature the Xeon Phi card which requires a bigger sled. The author got the potential specs messed up with the way they are actually configured for this supercomputer.
  • GullLars - Thursday, November 15, 2012

    So, it seems these are great at general purpose supercomputing.
    How do they stack up against the latest FPGAs if they are set up carefully by the people who will be running a specialized problem on them?
    And would these be able to work effectively with offloading of some key functions that would be able to work 20-100x faster (or more power efficiently) on a carefully set up FPGA?

    Some people in the comments mentioned heterogeneous computing. A step on the way is modular accelerated code. I'm interested to see if we get more specialized hardware for acceleration in the coming years, not just graphics (with transcoding) and encryption/decryption like is common in CPUs now. Or if we get an FPGA component (integrated or PCIe) that can be reserved and set up by programs to realize huge speedups or power savings.
  • Jameshobbs - Monday, November 19, 2012

    Why have there not been a lot of reports regarding PCI Express? This was the first source that I was able to find that even mentions the speed of the PCIe bus for the Xeon Phi.

    One of the most challenging things about programming on accelerators is handling PCI Express and trying to balance data transfer with computational complexity. Everyone (NVIDIA, Intel, AMD) seems to be doing a lot of arm waving regarding this issue, and there are many GPU papers that tend to omit the transfer times in their results. I find this dishonest and cheating.

    One thing that continues to shock me as well is that people keep complaining about how difficult it is to debug a GPU program and then cite old, out-of-date sources such as http://lenam701.blogspot.be/2012/02/nvidia-cuda-my... which was mentioned above. The things that the author of that blog post complained about have been resolved in the latest versions of CUDA (from 4.2 onward... maybe even in 4.0).

    Programmers can now use printf, and it is possible to hook a debugger into a GPU application to do more in-depth debugging. The main thing that bothers me about GPU programming is that you must check whether a program has completed successfully. Other than that, I find it relatively easy to debug a GPU application.
  • MySchizoBuddy - Wednesday, November 21, 2012

    The next version of AMD's APU will allow both the GPU and CPU to access the same memory locations.
  • sheepdestroyer - Wednesday, December 5, 2012

    I would really like to see a benchmark of this CPU on LLVMpipe.
    The original Larrabee would have had a DirectX translation layer and this project could be seen as an OpenGL version of it.
    Just loading a distro with Gnome 3 running on LLVMpipe or benchmarking some ioq3 and iodoom3 games would be VERY interesting.
  • tuklap - Sunday, March 3, 2013

    Can this accelerate my normal PC applications, like rendering in AutoCAD/Revit, media conversion, STAAD, ETABS, etc.?

    Or do I have to create my own applications?
