We had the chance to briefly visit Stampede, the first Supercomputer based upon the Xeon Phi. This is one of the supercomputers at  the Texas Advanced Computing Center (TACC).

Stampede consist of 6400 PowerEdge C8220X and C8220 server Sleds.  Typically these servers contain two octal core Xeon E5s, 32 GB of  RAM and one GPU/MIC.

Eight of those server sleds find a home inside the C8000 4U Chassis, together with two power sleds.


Dual ported Mellanox ConnectX with FDR infiniband interfaces connects all those servers together to form one large supercomputer. In each rack you can find on 8 C8000s on average. 

Connect 200 racks together and you get the Stampede supercomputer:

The Xeon E5s deliver two Petaflops at the moment. When all Xeon Phi are in place, an additional 8 Petaflops will be available to researchers on Stampede.

Intel Xeon Phi is not a standalone replacement to a GPU. For example, the Xeon Phi has no texture units. As a result remote visualization is done by 128 NVIDIA Tesla K20 GPUs. The rest of the supercomputer: 272 TB total memory and 14PB of disk storage. The complete supercomputer and the necessary cooling will require up to 6 megawatts of power. 

The Xeon Phi Cards Coding for Xeon Phi


View All Comments

  • SodaAnt - Wednesday, November 14, 2012 - link

    It does support the x86 instruction set though, so it shouldn't be too hard to port. Reply
  • MrSpadge - Wednesday, November 14, 2012 - link

    But you have to use the custom vector format to stampede anything. Reply
  • Kevin G - Saturday, November 17, 2012 - link

    In theory it should run the current the Linux version of F@H without modification. That catch is that the current version is going to be horribly suboptimal as it doesn't natively support the 512 bit wide vector format used by the Xeon Phi. This would leave only the x87 FPU for calculations. This would allow the 60 scalar FPU's to be used but limit performance to a mere 60 GFLOP across all the cores. There maybe some weird scheduling oddities with Linux and/or F@H due to the chips ability to expose 240 logical processors to the host OS (the result would be better performance from running multiple instances in parallel instead of one large instance using 240 threads).

    An OpenCL version of F@H might be coaxed to working and it that would utilize the 512 bit vector units. Intel would have to have OpenCL drivers available for this to even have a chance of working. This would allow the full ~1 TFLOP performance to be utilized.
  • SydneyBlue120d - Wednesday, November 14, 2012 - link

    Why did Intel choose a custom SIMD format? Why not AVX? Reply
  • Jaybus - Thursday, November 15, 2012 - link

    Because they needed heavier duty vector units. Each Phi core has 32 512-bit registers, where Core i7 has 16 256-bit registers. They just didn't implement the backward compatibility, probably to reduce complexity. It is certainly possible to do, and we may indeed see AVX, SSE, etc. added in a future revision. Reply
  • Kevin G - Saturday, November 17, 2012 - link

    The 512 bit vector instructions change how exceptions and the register masking are handled in comparison to AVX. Outside of that, the vector instructions are similar to how AVX instructions are formatted and the output complies with IEEE floating point standards. So while there is a distinct break in ISA capabilities, it does appear that it is possible to bridge the two together in future designs. Still it is odd that Intel has forked their ISA. Reply
  • coder543 - Wednesday, November 14, 2012 - link

    I just want to know how much it will cost.

    Why is Intel keeping this such a ridiculous secret? Knowing Intel, these will easily be $2,000+ a piece, if not much higher, but I still want to *know.*
  • LogOver - Wednesday, November 14, 2012 - link

    Did you read the article at all? Check the second page again. Reply
  • Comdrpopnfresh - Wednesday, November 14, 2012 - link

    How could PCIe 3.0 result in more overhead? Reply
  • nutgirdle - Wednesday, November 14, 2012 - link

    I concur. A major dis-advantage to co-processor computing is the time it takes to move data on and off the card. The PCIe 2.0 bus is already a bottleneck in our workflow involving a Tesla card. This was a very short-sighted omission. Reply

Log in

Don't have an account? Sign up now