Intel's Xeon Phi in 10 Petaflops supercomputer
by Johan De Gelas on September 11, 2012 7:41 PM EST
Posted in: IT Computing, Cloud Computing, HPC
Intel announced the Xeon Phi ("Knights Corner") a few months ago and bought the QLogic InfiniBand team and the Cray fabric team to bolster its HPC efforts. It is a clear signal that Intel will not stand idly by while GPU vendors try to conquer the HPC market.
Dell, Intel and the Texas Advanced Computing Center (TACC) are building the first supercomputer based upon the Xeon Phi, called Stampede. Stampede can deliver 10 petaflops. If it were released right now, it would occupy third place in the Top 500 list of supercomputers. Stampede will go live on January 7, 2013.
The Xeon Phi consists of 64 x86 cores (256 threads), each with a 512-bit vector unit. The vector unit can dispatch 8 double precision SIMD operations per cycle. The Xeon Phi runs at 2 GHz (more or less, probably more soon) and thus delivers about 1 TFLOPS (2 GHz x 64 cores x 8 FLOPs per cycle). For comparison, a quad-core Haswell at 4 GHz will deliver about one fourth of that in 2013. NVIDIA and AMD GPUs can deliver similar FLOPs, but programming the Xeon Phi should be a lot easier than programming with CUDA or OpenCL. The same development tools as for the regular Xeons are available: OpenMP, Intel's Threading Building Blocks, MPI, the Math Kernel Library (MKL)...
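The peak figure above is simply clock speed x core count x FLOPs per cycle per core. A minimal sketch of that arithmetic follows. The function name `peak_gflops` is made up for illustration, and the Haswell value of 16 double precision FLOPs per cycle per core (two 256-bit FMA units) is an assumption that is not stated in this article; the Xeon Phi inputs are the ones quoted above.

```python
def peak_gflops(clock_ghz, cores, flops_per_cycle):
    """Theoretical peak in GFLOPS: clock (GHz) x cores x FLOPs per cycle per core."""
    return clock_ghz * cores * flops_per_cycle


# Xeon Phi figures as quoted in the article: 2 GHz, 64 cores, 8 DP FLOPs per cycle.
xeon_phi = peak_gflops(2.0, 64, 8)

# Quad-core Haswell at 4 GHz. ASSUMPTION: 16 DP FLOPs per cycle per core
# (two 256-bit FMA units); this number does not come from the article.
haswell = peak_gflops(4.0, 4, 16)

print(f"Xeon Phi peak: {xeon_phi:.0f} GFLOPS")
print(f"Haswell peak:  {haswell:.0f} GFLOPS")
print(f"Haswell / Xeon Phi: {haswell / xeon_phi:.2f}")
```

Under these inputs the Haswell estimate comes out at one fourth of the Xeon Phi number, which matches the comparison made above. These are theoretical peaks only; sustained performance on real code will be lower.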
Anyway, the Xeon Phi is definitely not limited to ultra-expensive supercomputers. Supermicro showed us the SuperServer 2027GR-TRF, which holds 4 Xeon Phi cards and is powered by two redundant 1800W (Platinum) PSUs. The rest of the server consists of two Xeon E5 CPUs and 16 DIMM slots in total, supporting up to 256 GB. So it seems that one Xeon Phi card consumes about 300W.
15 Comments
nutgirdle - Tuesday, September 11, 2012 - link
At my workplace we have a fairly well developed MPI/OpenMP environment. We've dabbled with a Tesla card, but we would like to avoid re-writing everything in OpenCL. Even then, we don't know how long nVidia will support OpenCL.
Excited to see if/when this will actually be released, and since we are a single-precision application, if it can hold a candle to the ridiculous speed the K10 cards are exhibiting.
IanCutress - Wednesday, September 12, 2012 - link
I migrated my Brownian motion SP code from OpenMP to CUDA quite easily, got a factor 375x speed up over a single Nehalem core using a GTX480, Though tbh, the code was only 1000 lines max and was easier to do than expected.
boeush - Tuesday, September 11, 2012 - link
"So it seems that the Xeon Phi cards consume about 300W."
Each? Or 4 of them put together? Because if that's per-card, I'm not very impressed considering,
"For comparison, a quadcore Haswell at 4 GHz will deliver about one fourth of that in 2013."
For 300W, you can put together on the order of 10 Haswell quad-cores! That'd give you about 2.5x the max theoretical performance for the same wattage as the Xeon Phi (and, I'd imagine for a fraction of the cost as well...)
JohanAnandtech - Tuesday, September 11, 2012 - link
Very valid points. However, I don't have any measurement nor real benchmarks yet. The 300W is - to my understanding - the upper limit. The last time I tested, Linpack can make a CPU consume 30-35% more than a typical integer application, both running at 100% CPU load.
cmikeh2 - Tuesday, September 11, 2012 - link
While you probably could put together 10 Haswell quad-cores, at around 300 W, I doubt they would be running at 4 GHz.
ArCamiNo - Tuesday, September 11, 2012 - link
Do you have more info about the 2Ghz frequency?
It seems very high for that kind of chip. Maybe the 1 TFlops in double precision can be achieved with an FMA instruction (considered as 2 floating point operations):
1GHz * (512/64) * 64 cores * 2 ops per cycle
codedivine - Tuesday, September 11, 2012 - link
Agree with you.
1008anan - Tuesday, September 11, 2012 - link
I also think 16 double precision flops per core per cycle or 4 double precision flops per thread per clock * 1 gigahertz.
I am surprised to learn that there are only 8 double precision flops per core per clock or 2 double precision flops per thread per clock.
Running a 64 CPU core SoC at 2 gigahertz is astounding.
djgandy - Wednesday, September 12, 2012 - link
I'd agree. Peak rates are usually quoted using FMA.
At 2Ghz I'd expect 2TFLOPS too, or 4TFLOPS in 32-bit which would be consistent with Larrabee numbers.
1008anan - Tuesday, September 11, 2012 - link
Johan De Gelas, thank you very much for your article.
Aren't some of the 64 cores disabled? Previously it was reported that there would be between 51 and 64 CPU cores per Xeon Phi SoC.
Are you sure there are only 8 double precision flops per core per clock or 2 double precision flops per thread per clock? If so the theoretical max assuming 64 cores at 2 gigahertz is:
[2 Gigahertz]*[512 flops/clock] = 1.024 double precision teraflops. Actual performance is always below theoretical performance. You would need quite a bit more than 2 Gigahertz to hit 1 teraflop double precision.
2 Gigahertz is confirmed? Pretty amazing for a 64 core SoC.
How many single precision flops per core per clock?
Please confirm that a 64 CPU core Xeon Phi SoC only has a TDP of 75 watts. Can that be right?