Spark Benchmarking

Spark is a wonderful framework, but you need some decent input data and some good coding skills to really test it. Speeding up Big Data applications is the top-priority project at the lab I work for (Sizing Servers Lab of the University College of West-Flanders), so I was able to turn to the coding skills of Wannes De Smet to produce a benchmark that uses many of the Spark features and is based upon real-world usage.

The test is described in the graph above. We start with 300 GB of compressed data gathered from CommonCrawl. These compressed files are a large collection of web archives. We decompress the data on the fly to avoid a long wait that is mostly storage related. We then extract the meaningful text data out of the archives by using the Java library "BoilerPipe". Using the Stanford CoreNLP Natural Language Processing Toolkit, we extract entities ("words that mean something") out of the text, and then count which URLs have the highest occurrence of these entities. The Alternating Least Squares algorithm is then used to recommend which URLs are the most interesting for a certain subject.
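To make the first stages of that pipeline concrete, here is a minimal sketch in plain Python. It is not the benchmark code and it does not use Spark, BoilerPipe or CoreNLP. It only illustrates the shape of the work: decompress on the fly, pull "entities" out of the text, and count them per URL. The record format (one `url<TAB>text` line per page), the capitalized-word regex standing in for entity extraction, and all function names are assumptions made for illustration.

```python
import gzip
import io
import re
from collections import Counter, defaultdict

# Assumption: a crude stand-in for real entity extraction.
# Any capitalized word counts as an "entity" here.
ENTITY_RE = re.compile(r"\b[A-Z][a-z]+\b")


def iter_records(compressed_stream):
    """Yield (url, text) pairs while decompressing on the fly.

    Assumed record format (not the real web archive layout):
    one record per line, 'url<TAB>text'.
    """
    with gzip.open(compressed_stream, mode="rt", encoding="utf-8") as lines:
        for line in lines:
            url, _, text = line.rstrip("\n").partition("\t")
            if url and text:
                yield url, text


def count_entities(records):
    """Return a mapping entity -> Counter({url: occurrences})."""
    counts = defaultdict(Counter)
    for url, text in records:
        for entity in ENTITY_RE.findall(text):
            counts[entity][url] += 1
    return counts


def top_urls(counts, entity, n=3):
    """Return the n URLs where the entity occurs most often."""
    return counts.get(entity, Counter()).most_common(n)


def make_sample():
    """Build a tiny gzip-compressed, in-memory sample (hypothetical data)."""
    raw = (
        "http://a.example/\tSpark runs on Power and Xeon. Spark scales.\n"
        "http://b.example/\tXeon servers run Spark.\n"
    )
    return io.BytesIO(gzip.compress(raw.encode("utf-8")))


if __name__ == "__main__":
    counts = count_entities(iter_records(make_sample()))
    print(top_urls(counts, "Spark"))
```

In the real benchmark each of these stages runs as a distributed Spark job, and the per-URL entity counts would then feed the recommendation step. The sketch stops before that step.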

We tested with Apache Spark 1.5 in standalone mode (non-clustered), because it took us a long time to make sure that the results were repeatable. For now, we keep version 1.5 to be able to compare with earlier results.

Apache Spark 1.5

The POWER8 surprises here with excellent performance: it is able to keep up with a 14-core Xeon E5 "Broadwell EP" and beats the midrange Xeon E5-2690 v3 by a healthy margin. Remember, this is a midrange POWER8: there are SKUs that reach 3.4-3.8 GHz.

49 Comments

  • Eden-K121D - Thursday, September 15, 2016 - link

    Can't wait for Power9
  • Kevin G - Thursday, September 15, 2016 - link

    Same here. I'm really curious about the differences between the four different dies IBM will be offering. Certainly the mix of two core types and IO types should fill the assorted niches found in the server market.
  • rahvin - Thursday, September 15, 2016 - link

    I can wait, it will be a market share failure like every other power because IBM will price it out of reach of any sensible price range. Going by previous attempts it will cost anywhere from 5-10X as much as an equivalent amount of x86 processing power. Something like $10K for the processor and another $2-5 for the case, memory and motherboard and it will be equivalent to a quad x86 Xeon server that costs $5k for the same hardware.

    No one that doesn't need some special sauce it provides will buy them, particularly because you'd have to recompile all your software to use it. IBM has screwed up power so many times at this point that you'd have to be a fool to bet on it.
  • Eden-K121D - Friday, September 16, 2016 - link

    Tell that to Google
  • Brutalizer - Friday, September 16, 2016 - link

    Power9 will be 50% - 125% faster than power8, according to IBM.
    http://www.nextplatform.com/wp-content/uploads/201...
    On average it will be 75% faster.

    The specjbb2013 benchmark is broken, SPEC discovered the benchmark can be vendor optimized to provide false results so they fixed it in specjbb2015. IBM have released specjbb2015 numbers for their S812LC server achieving 44.900 for max-jops and 13.000 for critical-jops. That is almost as good as the Intel Xeon E5-2699v4 result. However, what is interesting is the critical-jops, which measures critical throughput under SLAs. IBM have said they will compete with Intel, with their power9.

    (Of course, one SPARC M7 cpu achieves 120.600 max-jops and 60.300 critical-jops, that is 2.7x faster max-jops and 4.6x faster critical-jops. This is not using the built in hardware accelerators in SPARC. Next year the SPARC M8 arrives, which is 2x faster than M7. Today, Oracle have released six cpus in six years, each doubling performance (except the low cost S7, which is a crippled M7))
  • wingar - Friday, September 16, 2016 - link

    I do like how you come with a comment that's incendiary towards POWER8 and POWER9, doing what you can to make it look worse... and then start touting how magical and wonderful SPARC M7 is. Using the same old Oracle-supplied performance claims without substantiating it. Funny, that. I think it stands out a little bit...

    But that's not what matters. If you run a simple google search, "site:anandtech.com brutalizer", you'll find comments with not a lot of variety. Usually commenting on anything x86 and POWER8, and in every single one (Except this one, actually! You actually reference an IBM supplied Spec result. However, you should link to it next time.) you tout the wonder of the latest SPARC of the time. Linking to Oracle-supplied benchmarks, on Oracle's own site, consistently concluding that Oracle outperforms their competitors. And every time you do this the comment seems to be as close to the top of the comment list as possible, for visibility.

    Have some links.
    http://www.anandtech.com/comments/10158/the-intel-...
    http://www.anandtech.com/comments/9193/the-xeon-e7...
    http://www.anandtech.com/comments/10230/ibm-nvidia...
    http://www.anandtech.com/comments/9567/the-power-8...
    http://www.anandtech.com/comments/7757/quad-ivy-br...
    http://www.anandtech.com/comments/7852/intel-xeon-...
    http://www.anandtech.com/comments/7285/intel-xeon-...

    But I found a couple of comments you left that aren't anti-everyone-not-Oracle. Have some links.
    http://www.anandtech.com/comments/7334/a-look-at-a...
    http://www.anandtech.com/comments/7371/understandi...
    http://www.anandtech.com/comments/5831/amd-trinity...

    I'm sure there's more comments like this where you're actually adding to the conversation but those are the few I found, and they're always unrelated to CPUs and the server market. They seem to perhaps reflect your own interests? But there is one thing to point out here, and that is that the first religiously-pro-Oracle comment you made seemed to be in 2014. What happened then? Did you buy the account? Did someone start paying you? I don't know.

    And hey, for fun I've actually posted this comment before to you, here's a link:
    http://www.anandtech.com/comments/10435/assessing-...
  • Brutalizer - Friday, September 16, 2016 - link

    I am not doing something to make power look worse, I put it in perspective and post other benchmark numbers from Intel and Oracle so people can compare. Yes, I am posting hard facts that can be independently verified, or are you rejecting the benchmarks I post? Why? Why do you think it is a bad thing I post benchmarks from other vendors than IBM? You don't want people to be able to build their own opinion about power by comparing with other vendors? Why not? Why is it dangerous when someone quotes benchmarks from other vendors? What's the problem with that?

    If you insist, here is the SPARC M7 specjbb2015 results.
    https://blogs.oracle.com/BestPerf/entry/201511_spe...
  • PowerOfFacts - Friday, September 16, 2016 - link

    troll
  • Brutalizer - Friday, September 16, 2016 - link

    "...Using the same old Oracle-supplied performance claims without substantiating it..."

    Now this is the same old FUD from the IBM supporters. As I have explained, mathematicians can always prove their claims with links to benchmarks, white papers, research papers, or point to common comp sci knowledge, etc. So you are in deep sh-t now. I can always post links to the numbers I claim. You claim I can not, and I spread unsubstantiated information - now you are lying about me.

    Quote me on any number in any post - and I will post links to prove my numbers. If you ever find any post (you will not find any) where I make up numbers out of the blue to discredit IBM or Intel, you are correct that I post unsubstantiated claims. If you can not find any such posts by me, you are spreading FUD about me, and you lie about me. Now go ahead and quote me on any number where I make out things. I am waiting.

    You are not really smart to claim a mathematician to not be able to prove his figures. I am now able to prove you are a liar and FUDer.

    I think it is funny how the IBM supporters always FUD and try to discredit people, instead of countering the benchmark numbers. I post benchmark numbers, and instead of trying to discuss the numbers you always attack me. That is not the scientific way, to avoid the hard facts and instead try to discredit the opponent. You should instead try to dissect my numbers and links instead of attacking me. But always, always, the IBM crowd does that "oh, he is an Oracle supporter" - so what? You are an IBM supporter! The difference is that I post numbers, and the IBM crowd attacks me instead of countering with other numbers.

    If you want to disprove my claims about Sparc, post numbers that disprove my benchmarks. Do not attack me, that does not win you any discussions.
  • SarahKerrigan - Friday, September 16, 2016 - link

    Sure, it's true that on SPECjbb2015 a T7-1 beats a low-end IBM Turismo machine, an S812LC (with an entry price under $5000 list, compared to over $30000 entry price for the T7-1), by a factor of 2.7x on max-jops. It's also true that M7 came out almost a year and a half after P8 did, and that you can get a dual-CPU P8 server with that same processor, and 256GB RAM, for well under half of the list price of a single-CPU T7-1 with 128GB.

    Starting to see why IBM has over 70% of the non-x86 server market?
