Apache Spark 2.1 Benchmarking

Apache Spark is the poster child of Big Data processing. Speeding up Big Data applications is the top-priority project at the university lab I work for (Sizing Servers Lab of the University College of West-Flanders), so we produced a benchmark that uses many of the Spark features and is based upon real-world usage.

The test is described in the graph above. We start with 300 GB of compressed data gathered from the CommonCrawl. These compressed files contain a large number of web archives. We decompress the data on the fly to avoid a long wait that is mostly storage related. We then extract the meaningful text data out of the archives by using the Java library "BoilerPipe". Using the Stanford CoreNLP Natural Language Processing Toolkit, we extract entities ("words that mean something") out of the text, and then count which URLs have the highest occurrence of these entities. The Alternating Least Squares (ALS) algorithm is then used to recommend which URLs are the most interesting for a certain subject.
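The stages above can be sketched in miniature with the Python standard library. This is only an illustration of the data flow, not the benchmark's code: the real pipeline runs on Spark with BoilerPipe and Stanford CoreNLP, whereas `extract_entities` below is a hypothetical stand-in that simply treats capitalized words as entities, and the sample pages and helper names are made up. The ALS recommendation step is left out.

```python
import gzip
import io
import re
from collections import Counter

# Made-up sample pages standing in for CommonCrawl records (url, text).
PAGES = [
    ("http://a.example", "Intel launched Xeon. Xeon servers run Spark."),
    ("http://b.example", "Spark is used at Intel labs."),
    ("http://c.example", "nothing capitalized here"),
]


def compress_pages(pages):
    """Pack pages into one gzip blob, one 'url<TAB>text' record per line."""
    raw = "\n".join(f"{url}\t{text}" for url, text in pages).encode("utf-8")
    return gzip.compress(raw)


def stream_records(blob):
    """Decompress on the fly: yield records line by line from the gzip stream."""
    with gzip.open(io.BytesIO(blob), mode="rt", encoding="utf-8") as stream:
        for line in stream:
            url, text = line.rstrip("\n").split("\t", 1)
            yield url, text


def extract_entities(text):
    """Hypothetical stand-in for NER: every capitalized word counts as an entity."""
    return re.findall(r"\b[A-Z][a-z]+\b", text)


def entity_counts(records):
    """Map each entity to a Counter of {url: occurrences}."""
    counts = {}
    for url, text in records:
        for entity in extract_entities(text):
            counts.setdefault(entity, Counter())[url] += 1
    return counts


def top_urls(counts, entity, n=1):
    """Return the n URLs with the most occurrences of the given entity."""
    return [url for url, _ in counts.get(entity, Counter()).most_common(n)]


counts = entity_counts(stream_records(compress_pages(PAGES)))
```

Calling `top_urls(counts, "Xeon")` then returns the page that mentions that entity most often. In the actual benchmark, per-URL counts like these are what the ALS step works from.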

We split our newest servers into virtual clusters to make better use of all those cores. We run with 8 executors. Researcher Esli Heyvaert also upgraded our Spark benchmark so it could run on Apache Spark 2.1.1.
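As a rough illustration of what running 8 executors on one server implies, the sketch below divides a node's hardware threads and memory evenly. The input figures (112 threads, 384 GB), the 10% memory reserve, and the helper name are assumptions for illustration, not the lab's actual settings; `spark.executor.instances`, `spark.executor.cores` and `spark.executor.memory` are standard Spark configuration properties.

```python
def executor_settings(total_threads, total_mem_gb, executors=8, reserve_fraction=0.10):
    """Evenly divide a node's threads and memory across Spark executors.

    reserve_fraction is an assumed share of memory left for the OS and
    other overhead; it is not a value from the article.
    """
    if executors <= 0:
        raise ValueError("executors must be positive")
    cores = total_threads // executors
    usable_gb = total_mem_gb * (1.0 - reserve_fraction)
    mem_gb = int(usable_gb // executors)
    return {
        "spark.executor.instances": str(executors),
        "spark.executor.cores": str(cores),
        "spark.executor.memory": f"{mem_gb}g",
    }


# Hypothetical node: 112 hardware threads and 384 GB of RAM.
settings = executor_settings(total_threads=112, total_mem_gb=384)
```

The resulting dictionary could be passed to Spark as `--conf key=value` pairs; the right split in practice depends on the workload and the cluster manager.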

Here are the results:

[Chart: Apache Spark 2.1.1]

Our Spark benchmark needs about 120 GB of RAM to run. The time spent on storage I/O is negligible. Data processing is very parallel, but the shuffle phases require a lot of memory interaction. The ALS phase does not scale well over many threads, but it is less than 4% of the total testing time.
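To see why a poorly scaling phase that takes under 4% of the run barely matters, consider a simple Amdahl-style estimate. The sketch assumes, purely for illustration, that ALS takes exactly 4% of the measured time and does not speed up at all, while everything else speeds up by a given factor. The function name is hypothetical.

```python
def total_speedup(rest_speedup, fixed_fraction=0.04):
    """Overall speedup when a fixed fraction of the run time does not improve.

    fixed_fraction: share of the original run time that stays constant
                    (assumed 4% here, standing in for the ALS phase).
    rest_speedup:   factor by which the remaining time shrinks.
    """
    if not 0.0 <= fixed_fraction < 1.0:
        raise ValueError("fixed_fraction must be in [0, 1)")
    if rest_speedup <= 0:
        raise ValueError("rest_speedup must be positive")
    new_time = fixed_fraction + (1.0 - fixed_fraction) / rest_speedup
    return 1.0 / new_time
```

Under these assumptions, doubling the speed of everything except ALS still gives an overall gain of about 1.92x, and the ceiling as the rest becomes arbitrarily fast is 1 / 0.04 = 25x.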

For reasons unknown to us, we could not get our 2.7 GHz Xeon 8280 to perform much better than the 2.1 GHz Xeon 8176. We suspect that using the new Xeon chips in an older (Skylake-SP) server could be the reason, as trying different Spark configurations (executors, JVM settings) did not help. A BIOS update did not help either.

OK, this was big data processing combined with mostly "traditional" machine learning: NER and ALS. How about some "deep learning"?

56 Comments

  • tipoo - Monday, July 29, 2019

    Fyi, when on page 2 and clicking "convolutional, etc" for page 3, it brings me back to the homepage
  • Ryan Smith - Monday, July 29, 2019

    Fixed. Sorry about that.
  • Eris_Floralia - Monday, July 29, 2019

    Johan's new piece in 14 months! Looking forward to your Rome review :)
  • JohanAnandtech - Monday, July 29, 2019

    Just when you think nobody noticed you were gone. Great to come home again. :-)
  • Eris_Floralia - Tuesday, July 30, 2019

    Your coverage on server processors are great!
    Can still well remember Nehalem, Barcelona, and especially Bulldozer aftermath articles
  • djayjp - Monday, July 29, 2019

    Not having a Tesla for such an article seems like a glaring omission.
  • warreo - Monday, July 29, 2019

    Doubt Nvidia is sourcing AT these cards, so it's likely an issue of cost and availability. Titan is much cheaper than a Tesla, and I'm not even sure you can get V100's unless you're an enterprise customer ordering some (presumably large) minimum quantity.
  • olafgarten - Monday, July 29, 2019

    It is available https://www.scan.co.uk/products/32gb-pny-nvidia-te...
  • abufrejoval - Tuesday, July 30, 2019

    Those bottlenecks are over now and P100, V100 can be bought pretty freely, as well as RTX6000/8000 (Turings). Actually the "T100" is still missing and the closest siblings (RTX 6000/8000) might never get certified for rackmount servers, because they have active fans while the P100/V100 are designed to be cooled by server fans. I operate a handful of each and getting budget is typically the bigger hurdle than purchasing.
  • SSNSeawolf - Monday, July 29, 2019

    I've been trying to find more information on Cascade Lake's AI/VNNI performance, but came up dry. Thanks, Johan. Eagerly putting this aside for my lunch reading today.