Apache Spark 2.x Benchmarking

Last, but not least, we have Apache Spark. Apache Spark is the poster child of Big Data processing. Speeding up Big Data applications is the top priority project at the university lab I work for (Sizing Servers Lab of the University College of West-Flanders), so we produced a benchmark that uses many of the Spark features and is based upon real world usage.

The test is described in the graph above. We first start with 300 GB of compressed data gathered from the CommonCrawl. These compressed files are a large amount of web archives. We decompress the data on the fly to avoid a long wait that is mostly storage related. We then extract the meaningful text data out of the archives by using the Java library "BoilerPipe". Using the Stanford CoreNLP Natural Language Processing Toolkit, we extract entities ("words that mean something") out of the text, and then count which URLs have the highest occurrence of these entities. The Alternating Least Square algorithm is then used to recommend which URLs are the most interesting for a certain subject.

To get better scaling, we run with 4 executors. Researcher Esli Heyvaert reconfigured the Spark benchmark so it could run on Apache Spark 2.1.1.

Here are the results:

Apache Spark 2.1.1

(*) EPYC and Xeon E5 V4 are older results, run on Kernel 4.8 and a slightly older Java 1.8.0_131 instead of 1.8.0_161. Though we expect that the results would be very similar on kernel 4.13 and Java 1.8.0_161, as we did not see much difference on the Skylake Xeon between those two setups.

Data processing is very parallel and extremely CPU intensive, but the shuffle phases require a lot of memory interactions. The time spent on storage I/O is negligible. The ALS phase does not scale well over many threads, but is less than 4% of the total testing time.

The ThunderX2 delivers 87% of the performance of the twice as expensive EPYC 7601. Since this benchmark scales well with the number of cores, we can estimate that the Xeon 6148 will score around 4.8. So while the ThunderX2 can not really threaten the Xeon Platinum 8176, it gives the Gold 6148 and its ilk a run for their money.

Java Performance: Huge Pages Investigated What We Can Conclude: So Far
Comments Locked

97 Comments

View All Comments

  • Gunbuster - Wednesday, May 23, 2018 - link

    Because it's hard to explain the critical line of business software or database is having some unknown edge case issue because you thought look at me I'm so smart and saved 1% of the project cost using unproven low penetration hardware.
  • daanno2 - Wednesday, May 23, 2018 - link

    I'm guessing you've never dealt with expensive enterprise software before. They are mostly licensed per-core, so getting the absolute best performance per core, even if the CPU is 2-3x more expensive, is worth it. At the end of the day, the CPUs might be <5% of the total cost.
  • SirPerro - Wednesday, May 23, 2018 - link

    You can swallow a big risk if the benefit is 75% of the cost. Hey, it's definitely worth the try.

    If your hardware makes up for 5% of the cost, saving a 3% of the total budget is not worth the risk of migration.
  • FunBunny2 - Thursday, May 24, 2018 - link

    "You can swallow a big risk if the benefit is 75% of the cost. Hey, it's definitely worth the try."

    the EOL of today's machines, the amortization schedules must be draconian. only if a 'different' server pays off in dozens of months, not years, will it have chance. to the extent that enterprise software is a C/C++ and *nix codebase, porting won't be onerous. but, I'm willing to guess, even Oracle code isn't all that parallel, so throwing a truckload of teeny cpu at it won't necessarily work.
  • name99 - Thursday, May 24, 2018 - link

    The bigger problem here is the massive uncertainty around the meaning of the word "server" and thus the target for these new ARM CPUs.
    Some people seem to think "server" means primarily boxes that run SAP or ORACLE, but I think it's clear that the ARM ecosystem has little interest in that, at least right now.

    What's of much more interest is racks on racks of CPUs running commodity (LAMP) or homegrown software, ie data warehouses and HPC. I'm not even sure the Java benchmarks being run are of much interest to this market. The things that matter are the sorts of things Cloudflare was measuring when they tested Centriq -- memcached, nginx, transforming one type of data into another (compression/decompression, encrypt/decrypt, transcode,...) at massive throughput.
    That's where I'd expect to see the big sales of the ARM "server" cores -- to Cloudflare, Baidu, Google, and so on.

    Also now that Marvell is in the game, will be interesting to see the extent to which they pull this downward, into their traditional sorts of markets like infrastructure network and storage control (eg to go into network appliances and NAS boxes).
  • Ed469546 - Wednesday, June 13, 2018 - link

    Some of the commercial software you pay per core. Intel had the best single threaded performance mening power license costs.

    Interesting question is how the Thunderx2 cores are counted in this case: one core can run 4 threads.
  • andrewaggb - Wednesday, May 23, 2018 - link

    I wonder what workloads they are targeting? High throughput with poor single threaded results is somewhat limiting.
  • peevee - Wednesday, May 23, 2018 - link

    Web app servers. VM servers. Hadoop/Spark nodes. All benefit more from having more threads running in parallel instead of each request waiting or switching contexts.

    If you are concerned about single-thread performance on 256-thread server (as 2-CPU server with this CPU will provide) AT ALL, you choose outrageously wrong hardware for the task to begin with. Go buy a 2-core i3. Practically the only test in this article which matters is Critical jOPS (assuming the used quality of service metric was configured realistically).
  • GeekyMcGeekface - Friday, May 25, 2018 - link

    I’m building a cluster now with a few hundred Raspberry Pi’s because scale up is expensive and stupid. By distributing across a pool of clusters, I can handle far more memory bandwidth and compute. Consider 100 Raspberry PIs have 400 64-bit cores and 100GB of RAM. Total cost $3500 + power, mounting and switches.

    Running three clusters of those with Kubernetes, Couchbase and Azure Functions provides 1200 64-bit cores, about 100GB of extremely high performance storage, incredible failover and a map-reduce environment to die for.

    Add some 64GB MicroSD cards and an object storage system to the cluster and there’s 12TB of cold storage (4TB when made redundant).

    Pay a service fee to some sweatshop in the Eastern Block to do the labor intensive bits and you can build a massively parallel, almost impossible to crash, CI/CD friendly, multi-tenant, infinitely scalable PaaS... for less than the cost of the RAM for a single one of the servers here.

    The only expensive bits in the design are the Netscalers.

    Oh... and the power foot print is about the same as one of these servers.

    I honestly have no idea what I what I would use a server like these in a new design for.
  • jospoortvliet - Wednesday, May 30, 2018 - link

    single-core performance with your pi's is considerably lower, as is inter-core bandwidth. If your tasks require little inter-process communication you're good but with highly interdependent compute it won't perform well. But for specific tasks, yes, it might be very cost effective.

Log in

Don't have an account? Sign up now