One of the big selling points of the Xeon Phi is that you can simply run multi-threaded Xeon code on the Xeon Phi. If you want to get decent performance out of the Xeon Phi, that code should be compiled with the Intel C or fortan Compiler and the Intel MKL math libraries. In that case, Intel claims many "typical applications" get about 2 to 2.5 higher performance with the Xeon Phi. A few exceptions get more.

That is an impressive performance boost, but not earth shattering. These numbers are much more realistic than the typical benchmarks of 100x that are throw around by the GPU folks. Those benchmarks are typically comparing a single threaded non SIMD binaries running on a CPU to a fully threaded carefully tuned application running on a GPU. 

The question remains in which applications a cheaper quad CPU solution is more effective. Before the Xeon E5 (Sandy Bridge EP) came out, AMD was quite succesful with their less expensive quad CPU platforms in the HPC world. It will be interesting to compare the performance per dollar and performance per watt of such quad CPU platforms with a CPU + Phi solution. There are certainly applications where the CPU + Phi wins hands down, but we are willing to bet that there are lots of HPC applications where it is a close call (e.g. highly threaded, but harder to vectorize code).

The point is of course that the time investment to get there is a lot lower than is the case with CUDA on NVIDIA's Tesla K20. We have heard from several companies that debugging CUDA code is still a pretty daunting experience. One good example can be found here. The maturity of the Intel compilers and high performance software is a big plus for the Xeon Phi. The numerous papers and OpenMP to CUDA frameworks/translators clearly indicate that porting OpenMP applications to CUDA is not necessarily straightforward. That in contrast with the Xeon Phi, where existing OpenMP applications run faster on the Xeon Phi without a recompile. OpenMP is simply the ecosystem where the Xeon Phi thrives. And Intel has an excellent track record when it comes to supporting OpenMP in its compilers.

The Xeon Phi might also prove to be a bit more flexible and forgiving. The Xeon Phi architecture still, at a high level, resembles a general purpose Xeon core. We're talking about 60 in-order x86 cores with wider SIMD units, a 512KB L2 feeding 4 threads per core. 

GPUs on the other hand are built for more "extreme" parallelism: hundreds of stream processors, with small shared L1-caches and one relatively small L2-cache. 

We'll have to hold final judgement until we get a Xeon Phi equipped system in house, but our first impressions are that the Xeon Phi looks like a more cost effective, potentially easier to use alternative to high-end GPUs for HPC.

Dell's C8220 and The TACC Stampede
Comments Locked

46 Comments

View All Comments

  • creed3020 - Friday, November 16, 2012 - link

    There are only 4 per row in the chassis because these units in Stampede feature the Xeon Phi card which requires a bigger sled. The author got the potential specs messed up with the way they are actually configured for this supercomputer.
  • GullLars - Thursday, November 15, 2012 - link

    So, it seems these are great at general purpose supercomputing.
    How do they stack up against the latest FPGAs if they are set up carefully by the people who will be running a specialized problem on them?
    And would these be able to work effectively with offloading of some key functions that would be able to work 20-100x faster (or power efficient) on a carefully set up FPGA?

    Some people in the comments mentioned hetrogenous computing. A step on the way is modular accelerated code. I'm interrested to see if we get more specialized hardware for acceleration in the comming years, not just graphics (with transcoding) and encryption/decryption like is common in CPUs now. Or if we get an FPGA component (integrated or PCIe) that can be reserved and set up by programs to realize huge speedups or power savings.
  • Jameshobbs - Monday, November 19, 2012 - link

    Why have there not been a lot of reports regarding the PCI express. This was the first source that I was able to find that even mentions the speed of the PCI e bus for the Xeon Phi.

    One of the most challenging things for programming on accelerators is handling the PCI express and trying to balance data transfer with computational complexity. Everyone, NVIDIA, Intel, AMD seem to be doing a lot of arm waving regarding this issue, and there are many GPU papers that tend to omit the transfer times in their results. To me I find this dishonest and cheating.

    One thing that continues to shock me as well is that people keep complaining about how difficult it is to debug a GPU program and then they reference old out of date references such as http://lenam701.blogspot.be/2012/02/nvidia-cuda-my... which was mentioned above. The things that the author of that blog post complained about have been resolved in the latest versions of CUDA (from 4.2 onward... maybe even in 4.0).

    Programmers can now use printf and it is possible to hook a debugger into a GPU application to do more in depth debugging. The main thing that bothers me about GPU programming is you must check to make sure a program has successfully completed or not. Other than that I find it relatively easy to debug a GPU application.
  • MySchizoBuddy - Wednesday, November 21, 2012 - link

    Next version of AMD APU will allow both the GPU and CPU access to the same memory locations.
  • sheepdestroyer - Wednesday, December 5, 2012 - link

    i would really like to see a benchmark of this cpu on LLVMpipe
    http://www.mesa3d.org/llvmpipe.html
    The original Larabee would have had a DirectX translation layer and this project could be seen as an OpenGL version of it.
    Just loading a distro with Gnome 3 running on LLVMpipe or benchmarking some ioq3 and iodoom3 games would be VERY interesting.
  • tuklap - Sunday, March 3, 2013 - link

    Can this accelerate my normal PC applications like Rendering in AutoCAD/Revit, Media Conversion, STAAD, ETABS and etc computations???

    or do i Have to create my own applications?

Log in

Don't have an account? Sign up now