Big Data 101

Many of you have experienced this: you get a massive (text) file (a log of several weeks, some crawled web data) and you need to extract data from it. Opening it in a text editor will not help you. The editor will probably choke and crash, as it cannot handle hundreds of gigabytes of text. Even if it does not, repeated searches (via grep, for example) are neither a fast nor a particularly scientific way to analyze what is hidden inside that enormous heap of data.

Importing and normalizing it into a SQL database via the typical Extract, Transform and Load (ETL) process is tedious and extremely slow. SQL databases are, after all, built to handle limited amounts of structured data.

That is why Google created MapReduce: by splitting those massive amounts of data into smaller slices (mapping) and reducing (aggregating, calculating, counting) them to just the results that matter to you, you avoid the rather sequential and slow query execution plans that have to evaluate the whole database to produce meaningful results. Combine this with a redundant, distributed file system (such as Hadoop's HDFS) that keeps the data close to the processing nodes, and the result is that you do not need to import data before you can use it, you do not need top-of-the-line SSDs to load enormous amounts of data quickly, and you can distribute your data mining over a large cluster.
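To make the two phases concrete, here is a minimal, single-machine Python sketch; the log chunks and the status-code counting are made-up examples, and a real MapReduce job would run each map task on a different node of the cluster, next to the slice of data it processes.

```python
from collections import Counter
from functools import reduce

# Hypothetical input: the huge file has already been split into smaller slices.
chunks = [
    "GET /index.html 200\nGET /missing 404",
    "GET /index.html 200\nGET /index.html 200",
]

def map_chunk(chunk):
    """Map phase: turn one slice of raw text into partial counts (per HTTP status code)."""
    return Counter(line.split()[-1] for line in chunk.splitlines())

def reduce_counts(a, b):
    """Reduce phase: aggregate two partial counts into one."""
    return a + b

partials = [map_chunk(c) for c in chunks]            # on a cluster these run in parallel
totals = reduce(reduce_counts, partials, Counter())  # Counter({'200': 3, '404': 1})
print(totals)
```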

I am of course talking about the grandfather of all Big Data crunching: Hadoop. However, Hadoop had two significant disadvantages: although it could crunch through terabytes of data where most other data mining systems collapsed, it was pretty slow the moment you had to go through iterative steps, as it wrote the intermediate results to disk. It was also pretty bad if you just wanted to launch a quick, simple query.

Apache Spark 1.5: The Ultimate Big Data Cruncher

This brings us to Spark. To address Hadoop's disadvantages, the members of UC Berkeley's AMPLab invented a way to keep the slices of data that need the same kind of operations in memory: Resilient Distributed Datasets (RDDs). As a result, Spark makes much better use of DRAM than MapReduce, and it also avoids many write operations.
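A minimal PySpark sketch of that idea is shown below; the HDFS path, the application name and the filter expressions are hypothetical, but it illustrates how cache() pins an RDD in DRAM so that every follow-up query reuses the in-memory data instead of hitting the disks again.

```python
from pyspark import SparkContext

sc = SparkContext("local[*]", "rdd-cache-sketch")

# Hypothetical log file on HDFS; path and record format are placeholders.
lines = sc.textFile("hdfs:///logs/access.log")
errors = lines.filter(lambda l: " 500 " in l)

# Keep this RDD in memory: the iterative queries below reuse it
# instead of re-reading (and re-writing) intermediate results on disk.
errors.cache()

print(errors.count())                                     # first action materializes and caches the RDD
print(errors.filter(lambda l: "/checkout" in l).count())  # runs against the cached, in-memory data

sc.stop()
```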

But there is more: the Spark framework also includes machine learning/AI libraries that you can use inside your Scala or Python code. The resulting workload is a marriage of machine learning and data mining that is highly parallel and does not need the best I/O technology to crunch through hundreds of gigabytes of data. In summary, Spark absolutely craves more CPU power and better RAM technology.
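As a rough illustration of that marriage, the sketch below clusters a handful of made-up feature vectors with the RDD-based KMeans from MLlib, Spark's machine learning library; the data points and parameters are placeholders, not part of any real workload.

```python
from pyspark import SparkContext
from pyspark.mllib.clustering import KMeans

sc = SparkContext("local[*]", "mllib-kmeans-sketch")

# Hypothetical feature vectors, e.g. extracted from the raw data earlier in the same job.
points = sc.parallelize([
    [0.0, 0.0], [0.1, 0.2],    # one group of points
    [9.0, 9.5], [9.2, 8.8],    # another group of points
])

# Train a clustering model directly on the distributed dataset.
model = KMeans.train(points, k=2, maxIterations=10)
print(model.clusterCenters)    # the two learned cluster centers

sc.stop()
```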

Meanwhile, according to Intel, this kind of big data technology is the top growth driver of enterprise compute demand over the next few years, with enterprise demand expected to grow by 67%. And Intel is not the only one that has seen the opportunity: IBM has a Spark Technology Center in San Francisco and launched "Insight Cloud Services", a cloud service built on top of Spark.

Intel now has a specialized Big Data Solutions group, led by Ananth Sankaranarayanan. This group spearheaded the development of the "Big Bench" benchmark, which was adopted by the TPC as TPCx-BB. The benchmark's results are expressed in BBQs per minute (BBQ = Big Bench Query). (ed: yummy)

Among the contributors are Cloudera, Cisco, VMware and ... IBM and Huawei. Huawei is the parent company of HiSilicon, which designs ARM-based processors, and IBM of course has POWER8. Needless to say, the new benchmark is going to be an important battleground that might decide whether or not Intel remains the dominant enterprise CPU vendor. We have yet to evaluate TPCx-BB, but the next page gives you some hard benchmark numbers.

Comments

  • jhh - Thursday, March 31, 2016 - link

    The article says TSX-NI is supported on the E5, but if one looks at Intel ARK, it says it's not. Do the processors say they support TSX-NI? Or is this another one of the things which will be left for the E7?
  • JohanAnandtech - Friday, April 1, 2016 - link

    Intel's official slides say: "supports TSX". All SKUs, no exceptions.
  • Oxford Guy - Thursday, March 31, 2016 - link

    Bigger, badder, still obsolete cores.
  • patrickjp93 - Friday, April 1, 2016 - link

    Obsolete? Troll.
  • Oxford Guy - Tuesday, April 5, 2016 - link

    Unlike you, propagandist, I know what Skylake is.
  • benzosaurus - Thursday, March 31, 2016 - link

    "You can replace a dual Xeon 5680 with one Xeon E5-2699 v4 and almost double your performance while halving the CPU power consumption."

    I mean you can, but you can buy 4 X5680s for a quarter the price of a single E5-2699v4. It takes a lot of power savings to make that worthwhile. The pricing in the server market's always seemed weirdly non-linear to me.
  • warreo - Friday, April 1, 2016 - link

    Presumably, it's not just about TCO. Space is at a premium in a datacenter, and so being able to fit more performance per sq ft also warrants a higher price, just like how notebook parts have historically been more expensive than their desktop equivalents.
  • ShieTar - Friday, April 1, 2016 - link

    But you don't get four 1366 systems for the price of one 2011-3 system. Depending on your memory, storage and interconnect needs, even two full systems based on the Xeon 5680 may cost you more than one system based on the E5-2699 v4. One less InfiniBand adapter can easily save you $500 in hardware.

    And you are not only halving the CPU power consumption, but also the power consumption of the rest of the system that you no longer use, so instead of 140W you are probably saving at least 200W per system, which can already add up to more than $1k in electricity and cooling bills for a 24/7 machine running for 3 years.

    And last, but by no means least, fewer parts mean less space, less chance of failure, and less maintenance effort. If you happily waste a few hours here or there to maintain your own workstation, you don't do the math, but if you have to pay somebody to do it, salaries start to matter quickly. With the MTBF for an entire server rarely being much higher than 40,000 hours, and recovery/repair easily taking a person-day of work, each system generates about 1.7 hours of work per year. The cost of that work (it's more than salaries, of course) probably comes to $100 per hour for a skilled technical administrator, thus producing another $500 over 3 years of added operational cost.

    And of course, space matters as well. If your data center is full, it can be more cost effective to replace the old CPUs with new, expensive ones rather than build a new facility to fill with more old systems.

    If you add it all up, I doubt you can get a system with a Xeon 5680 and operate it over 3 years for anything below $20,000. So going from two $20,000 systems to a single $24,000 system (because of an extra $4,000 for the big CPU) should save you a lot of money in the long run.
  • JohanAnandtech - Friday, April 1, 2016 - link

    Where do you get your pricing info from? I cannot imagine that server vendors still sell X5680s.
  • extide - Friday, April 1, 2016 - link

    Yeah, if you go used. No enterprise sysadmin worth his salt is ever going to put used gear that is out of warranty and without support into production.
