Big Data 101

Many of you have experienced this: you get a massive (text) file (a log of several weeks, some crawled web data) and you need to extract data from it. Opening it in a text editor will not help you. The text editor will probably choke and crash, as it cannot handle hundreds of gigabytes of text. Even if it doesn't, repeated searches (via grep, for example) are neither a fast nor a scientific way to analyze what is hidden inside that enormous heap of data.

Importing and normalizing it into a SQL database via the typical Extract, Transform and Load (ETL) process is tedious and extremely slow. SQL databases are built to handle limited amounts of structured data after all.

That is why Google created MapReduce: by splitting those massive amounts of data into smaller slices (mapping) and reducing (aggregating, calculating, counting) them to the results that matter to you, you can avoid the rather sequential and slow query execution plans that need to evaluate the whole database to provide meaningful results. Combine this with a redundant, distributed filesystem (HDFS) that keeps the data close to the processing nodes, and the result is that you do not need to import data before you can use it, you do not need the ultimate SSD to load that much data quickly, and you can distribute your data mining over a large cluster.
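To make the map and reduce steps concrete, here is a minimal, hypothetical sketch in plain Python: each slice of a log is mapped to partial word counts independently, and the partial results are then reduced into a single total. In a real cluster the map calls would run in parallel on the nodes that hold the data; the slices below are just made-up examples.

    # Map/reduce sketch: count words per slice, then merge the partial counts.
    from collections import Counter
    from functools import reduce

    def map_slice(lines):
        # Map step: one slice of raw text becomes a partial word count.
        return Counter(word for line in lines for word in line.split())

    def reduce_counts(total, partial):
        # Reduce step: merge a partial result into the running total.
        total.update(partial)
        return total

    # Hypothetical slices; in practice each one lives on a different node.
    slices = [
        ["error timeout", "error disk"],
        ["warning retry", "error timeout"],
    ]
    partials = [map_slice(s) for s in slices]
    totals = reduce(reduce_counts, partials, Counter())
    print(totals.most_common(3))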

I am of course talking about the grandfather of all Big Data crunching: Hadoop. However, Hadoop had two significant disadvantages: although it could crunch through terabytes of data where most other data mining systems collapsed, it was pretty slow the moment you had to go through iterative steps, as it wrote the intermediate results to disk. It was also pretty bad if you just wanted to launch a quick, simple query.

Apache Spark 1.5: The Ultimate Big Data Cruncher

This brings us to Spark. In addressing Hadoop's disadvantages, the members of UC Berkeley's AMPLab invented a method of keeping the slices of data that need the same kind of operations in memory (Resilient Distributed Datasets, or RDDs). As a result, Spark makes much better use of DRAM than MapReduce and also avoids many write operations.
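A rough PySpark sketch (Spark 1.x style API, hypothetical log path) of what that means in practice: the filtered RDD is cached in DRAM after the first action, so every further pass over it skips the disk instead of re-reading the file the way an iterative Hadoop MapReduce job would.

    # RDD caching sketch: keep the interesting log lines in memory
    # so that repeated queries do not hit HDFS again.
    from pyspark import SparkContext

    sc = SparkContext(appName="LogMining")

    errors = (sc.textFile("hdfs:///logs/webserver/*.log")    # hypothetical path
                .filter(lambda line: "ERROR" in line)
                .cache())                                     # keep this RDD in DRAM

    print(errors.count())                                     # first action builds and caches the RDD
    print(errors.filter(lambda l: "timeout" in l).count())    # second action is served from memory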

But there is more: the Spark framework also includes machine learning / AI libraries that you can use inside your Scala or Python code. The resulting workload is a marriage of machine learning and data mining that is highly parallel and does not need the best I/O technology to crunch through hundreds of gigabytes of data. In summary, Spark absolutely craves more CPU power and better RAM technology.
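As a small taste of those bundled libraries, the hypothetical MLlib snippet below clusters a handful of made-up feature vectors with k-means, directly on a distributed dataset and inside the same Python code that does the data mining; it is a sketch of the idea rather than a tuned workload.

    # MLlib sketch: k-means clustering on an RDD of (made-up) feature vectors.
    from pyspark import SparkContext
    from pyspark.mllib.clustering import KMeans

    sc = SparkContext(appName="MLlibDemo")

    points = sc.parallelize([[0.0, 0.0], [0.1, 0.2],
                             [9.0, 8.5], [9.2, 9.1]])

    model = KMeans.train(points, k=2, maxIterations=10)
    print(model.clusterCenters)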

Meanwhile, according to Intel, this kind of big data technology is a top growth driver of enterprise compute demand over the next few years, with enterprise demand expected to grow by 67%. And Intel is not the only one that has seen the opportunity: IBM has a Spark Technology Center in San Francisco and launched "Insight Cloud Services", a cloud service built on top of Spark.

Intel now has a specialized Big Data Solutions group, led by Ananth Sankaranarayanan. This group spearheaded the development of the "Big Bench" benchmark, which was adopted by a TPC group as TPCx-BB. The benchmark results are expressed in BBQs per minute... (BBQ = Big Bench Queries). (ed: yummy)

Among the contributors are Cloudera, Cisco, VMware and ... IBM and Huawei. Huawei is the parent company of HiSilicon, which builds ARM-based server processors, and IBM of course has the POWER8. Needless to say, the new benchmark is going to be an important battleground which might decide whether or not Intel will remain the dominant enterprise CPU vendor. We have yet to evaluate TPCx-BB, but the next page gives you some hard benchmark numbers.

Comments

  • ltcommanderdata - Friday, April 1, 2016 - link

    Does anyone know the Windows support situation for Broadwell-EP for workstation use? Microsoft said Broadwell is the last fully supported processor for Windows 7/8.1 with Skylake getting transitional support and Kaby Lake will not be supported. So how does Broadwell-EP fit in? Is it lumped in with Broadwell and is fully supported or will it be treated like Skylake with temporary support until 2018 and only critical security updates after that? And following on will Skylake-EP see any Windows 7/8.1 support at all or will it not be supported since it'll presumably be released after Kaby Lake?
  • extide - Friday, April 1, 2016 - link

    When MS says they are not supporting Skylake on Windows 7, that DOES NOT MEAN it won't work. It just means they are not going to add any specific support for that processor in the older OSes. They are not adding in the Speed Shift support, essentially.

    For some reason the press has not made this very clear, and many people are freaking out thinking that there will be a hard break here where stuff will straight up not work. That is not the case.

    Broadwell has no new OS level features over Haswell (unlike Skylake with speed shift) so there is nothing special about Broadwell to the OS. As the poster above mentions, they are all x86 cpu's and will all still work with x86 OS's.

    The difference here is between "Fully Supported" and "Compatible". Skylake and even Kaby Lake will be compatible with Windows 7/8/8.1.
  • aryonoco - Friday, April 1, 2016 - link

    Johan, this is yet again by far the best Enterprise CPU benchmark that's available anywhere on the net.

    Thank you for your detailed, scientific and well documented work. Work like this is not easy; I can only imagine how many man-hours (weeks?) compiling this article must have taken. I just want you to know that it's hugely appreciated.
  • JohanAnandtech - Friday, April 1, 2016 - link

    Great to read this after weeks of hard work! :-D
  • fsdjmellisse - Friday, April 1, 2016 - link

    Hello, I want to buy an E5-2630L v4.
    Can anyone give me a website where I can buy it?

    Best regards
  • HrD - Friday, April 1, 2016 - link

    I'm confused by the following:

    "The following compiler switches were used on icc:

    -fast -openmp -parallel

    The results are expressed in GB per second. The following compiler switches were used on icc:

    -O3 –fopenmp –static"

    Shouldn't one of these refer to icc and the other to gcc?
  • JohanAnandtech - Friday, April 1, 2016 - link

    Pretty sure I did not mix them up. "-fast" does not work on gcc, nor does -fopenmp work on icc.
  • patrickjp93 - Friday, April 1, 2016 - link

    Um, wrong and wrong. -Ofast works with GCC 4.9 and later for sure. And -fopenmp is a valid ICC flag post-ICC 13.
  • JohanAnandtech - Saturday, April 2, 2016 - link

    "-fast" is a typical icc flag. (I did not write -"Ofast" that works on gcc 4.8 too)
  • extide - Friday, April 1, 2016 - link

    Johan, if you read the comment, you can see that you mention icc for BOTH.
