When it comes to big data analytics processing, it's something of a two-horse race between the old stallion, Hadoop MapReduce, and the young buck, Apache Spark. Increasingly, companies already running Hadoop environments are choosing to process their big data with Spark instead.

Hadoop MapReduce and Apache Spark aren't necessarily competitors, and, in fact, they can work well together: Spark can run on top of a Hadoop Distributed File System (HDFS). But Spark applications can be an order of magnitude faster than those running on Hadoop's MapReduce; one interviewee even put the figure at 100 times faster. MapReduce work is disk-intensive, shuffling intermediate data across multiple disks, while Spark tries to keep as much of the work as possible in RAM.
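To make that difference concrete, here is a minimal, hypothetical Scala sketch of the in-memory reuse Spark offers; the file path and the filtering logic are invented for illustration, and a MapReduce pipeline would re-read and re-write HDFS between each of these passes.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// A minimal sketch of in-memory reuse with Spark's RDD API.
// The HDFS path and the "price=" filter are illustrative assumptions.
object InMemoryReuse {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("in-memory-reuse"))

    // Load once and cache in RAM; later passes reuse the cached data
    // instead of going back to disk.
    val ads = sc.textFile("hdfs:///logs/ads.txt").cache()

    // Several passes over the same cached dataset, no intermediate disk writes.
    val total     = ads.count()
    val withPrice = ads.filter(_.contains("price=")).count()
    val sample    = ads.take(5)

    println(s"total=$total withPrice=$withPrice sample=${sample.mkString(" | ")}")
    sc.stop()
  }
}
```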

MapReduce is considered a more flexible, wider-ranging option, but Spark can turn a mess of data into actionable information faster.

Why Trovit went from Hadoop MapReduce to Apache Spark

At Trovit, a classified ads search engine, Ferran Gali Reniu originally worked with Hadoop because "instead of buying bigger machines, you buy smaller machines. It solves the storage problems with huge amounts of data." The team built Hadoop Distributed File System (HDFS) layers, which allowed him and his colleagues to code applications as Hadoop jobs.

He said that, "Using Hadoop, we solved challenges that we were having at Trovit [which was] having a good search engine with quality and freshness without duplicates." But when it came time to use Hadoop MapReduce on the HDFS, he found that the application programming interface (API) wasn't very flexible, and that the work was "quite disk intensive."



“That’s why other distributed processing frameworks appear like Spark, which solve the same problems but allow the developers to solve the problems with a different approach much more flexible,” Gali Reniu said.

The company uses Apache Spark to power its recommendation engine. It still runs MapReduce code in production for many duties, but builds out new functionality on Spark, because of its speed and developer ease-of-use.
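The article doesn't describe how that recommendation engine is built, but a collaborative-filtering sketch using Spark MLlib's ALS gives a flavor of the kind of job involved. The column names and input path below are invented, not Trovit's.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.ml.recommendation.ALS

// Hypothetical recommender sketch: generic ALS collaborative filtering,
// with made-up column names and an illustrative input path.
object AdRecommender {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("ad-recommender").getOrCreate()

    // Expected columns: userId (int), adId (int), clicks (numeric)
    val interactions = spark.read
      .option("header", "true")
      .option("inferSchema", "true")
      .csv("hdfs:///trovit/interactions.csv") // illustrative path

    val Array(train, test) = interactions.randomSplit(Array(0.8, 0.2))

    val model = new ALS()
      .setUserCol("userId")
      .setItemCol("adId")
      .setRatingCol("clicks")
      .setImplicitPrefs(true) // treat clicks as implicit feedback
      .fit(train)

    model.transform(test).show(10) // predicted affinity per (user, ad) pair
    spark.stop()
  }
}
```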

The Trovit team decided to make the move, and Gali Reniu chose to demo Spark at the PAPIs Connect predictive APIs conference, because Spark's resilient distributed datasets (RDDs) expose a set of operations that can be freely combined. That flexibility mattered because Trovit's code is written in Scala, Java and Python, all languages Spark's API supports.
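As a hedged illustration of how those RDD operations chain together, here is a made-up deduplication pipeline of the sort a classified-ads site might run; the file layout and keying logic are assumptions, not Trovit's actual code.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Illustrative only: composing RDD transformations to drop duplicate ads.
// The input format ("id\ttitle\tcity\tprice") is an assumption.
object DedupAds {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("dedup-ads"))

    val deduped = sc.textFile("hdfs:///trovit/raw_ads.tsv")
      .map(_.split("\t"))
      .filter(_.length == 4)
      .map(f => ((f(1).toLowerCase, f(2)), f))          // key by title + city
      .reduceByKey((a, b) => if (a(0) < b(0)) a else b) // keep one ad per key
      .values
      .map(_.mkString("\t"))

    deduped.saveAsTextFile("hdfs:///trovit/clean_ads")
    sc.stop()
  }
}
```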

Gali Reniu pointed to other reasons Spark proved useful, chief among them the set of libraries built on top of the core framework.

Those libraries, he said, allowed the team to add more intelligence to the product and to iterate through rapid processing, which in turn let them make better use of their resources.
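Spark SQL is one of those libraries. A small, illustrative example, with the path, table name and columns invented, shows how the same cluster that runs batch jobs can also answer ad-hoc questions:

```scala
import org.apache.spark.sql.SparkSession

// Illustrative sketch of Spark SQL layered on the core API.
// The input path and column names are assumptions.
object AdStats {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("ad-stats").getOrCreate()

    spark.read.json("hdfs:///trovit/clean_ads.json")
      .createOrReplaceTempView("ads")

    // Ad-hoc analytics over the same data the batch jobs use.
    spark.sql(
      """SELECT city, COUNT(*) AS listings, AVG(price) AS avg_price
        |FROM ads GROUP BY city ORDER BY listings DESC""".stripMargin
    ).show(20)

    spark.stop()
  }
}
```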

“Spark allows us to go a bit to the next level on big data processing, innovating the way we do data science” — Ferran Gali Reniu

Now Trovit uses Hadoop and Spark together, with more flexibility in the languages it can use and the ability to do more than one thing at once. Gali Reniu says that with this combination he can spin up a cluster and run a deep learning model, or take a prototype into distributed production, all without worrying about his laptop imploding, since the work is spread across multiple distributed data collections.

Faster Distributed Deep Learning with Spark on AWS

Conferences on machine learning like PAPIs are still niche enough that they are filled with academics, experimenters and early adopters willing to show off what they've created to advance the science. Vincent Van Steenbergen is a freelance data engineer who, after "playing with" Scala, Akka and Spark for three years, decided in his free time to leverage Spark to train a model. His demo walked people through the process he followed, from identifying his deep learning and machine learning needs, to researching and testing different tools, to finally training a model.


He started by explaining the possibilities of distributed deep learning, which range from image analysis and image generation (like turning a photo into a Van Gogh replica) to, most famously, learning to play Go, a game with more possible moves than there are atoms on Earth.

Steenbergen said that training a model typically requires:

  1. A lot of time.
  2. A lot of computer power.

Continuing with the topical AlphaGo reference, he said training that model alone took 1,202 CPUs and 176 GPUs over six weeks. So the obvious question Steenbergen, like other data scientists and engineers, has to answer is: "How can I do that from my laptop in a decent amount of money in a decent timespan?" The solution was to distribute training over a Spark cluster. He explained that the value of Spark, for him, is that it "allows you to distribute your training, your computation over multiple machines and multiple resources."
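One common way to spread that work over a Spark cluster is to fan independent training runs out across executors. The sketch below is a generic pattern, not Van Steenbergen's code, and trainAndScore is a stand-in for whatever framework actually does the learning.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Generic sketch: distribute many training runs across a cluster and
// collect only the scores on the driver. trainAndScore is a placeholder.
object DistributedTrials {
  def trainAndScore(learningRate: Double, hiddenUnits: Int): Double = {
    // Placeholder: a real job would call into Caffe, TensorFlow, etc. here
    // and return a validation score.
    1.0 / (1.0 + math.abs(learningRate - 0.01) + math.abs(hiddenUnits - 128) / 1000.0)
  }

  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("distributed-trials"))

    val grid = for {
      lr    <- Seq(0.001, 0.01, 0.1)
      units <- Seq(64, 128, 256)
    } yield (lr, units)

    // Each configuration trains on a different executor, possibly on a
    // different GPU machine; the driver only gathers the results.
    val best = sc.parallelize(grid, grid.size)
      .map { case (lr, units) => ((lr, units), trainAndScore(lr, units)) }
      .collect()
      .maxBy(_._2)

    println(s"best config: $best")
    sc.stop()
  }
}
```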

Then you need a cluster of servers, and, in his case, as an experimental hobby not backed by an enterprise budget, it had to be reasonably priced. He found that Amazon Web Services' EC2 has GPU instances you can spin up on demand for about a dollar per spot instance, which he says is roughly two to three times cheaper than the regular instances. For that dollar, you get:

  • Four NVIDIA GRID GPUs, each with 1,536 CUDA cores and 4GB video memory
  • 32 vCPUs
  • 60 GiB of memory
  • 240 GB (2 x 120) of SSD storage

The next step was to choose a deep learning framework. The big names right now include Google's TensorFlow, Torch (backed by Facebook and used by DeepMind), and Berkeley's Caffe. He chose Caffe because it's built in C++ "so is fast, optimized for using CPU and GPU very efficiently," and, he said, it already has good documentation, a rich community, and lots of existing deep learning models for training. Mimicking what Yahoo engineers do for deep learning at Flickr, he runs a Caffe wrapper on a Spark cluster.


For Steenbergen, the advantages were clear, like being able to run deep learning on an existing cluster alongside other Spark jobs. He says you can run as many deep learning jobs as you like at the same time, which lets you train multiple models at once and even leverage existing models. "You can even use SQL, DataFrames and existing LMDB files to train your model and do any kind of treatment on your data."
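The general shape of that workflow, sketched loosely here rather than with CaffeOnSpark's actual API, is to prepare the training set with DataFrames or SQL and then hand each partition to whatever wrapper drives the deep learning framework. Everything below, paths, column names and the trainPartition stub, is hypothetical.

```scala
import org.apache.spark.sql.SparkSession

// Rough sketch of the "prepare with DataFrames, train per partition" pattern.
// Not CaffeOnSpark's API; all names and paths are assumptions.
case class Sample(imagePath: String, label: Int)

object FeedTrainer {
  def trainPartition(rows: Iterator[Sample]): Unit = {
    // Placeholder: a real job would push these samples into Caffe here.
    rows.foreach(_ => ())
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("feed-trainer").getOrCreate()
    import spark.implicits._

    val samples = spark.read.parquet("hdfs:///datasets/images.parquet") // assumed layout
      .selectExpr("image_path AS imagePath", "label")
      .where("label IS NOT NULL")
      .as[Sample]

    // One trainer instance per partition, e.g. one per GPU machine.
    samples.rdd.foreachPartition(trainPartition)
    spark.stop()
  }
}
```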


So what did our early adopter build in his live demo on stage? An image recognition system that classifies handwritten digits. It took the machine about five minutes to study 60,000 handwritten digits, zero through nine, looking for distinctive features like the curve of a five versus the curves of a three. He said that while the machine is training the model, it's also testing it against sample test images. This is where machine learning, and even deep learning, comes in, with the machine working to constantly improve.
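His demo ran Caffe on Spark, but as a rough stand-in, the same handwritten-digit exercise can be sketched with Spark's own MLlib, assuming MNIST-style data stored in libsvm format with 784 pixel features per image. The path, layer sizes and split are illustrative choices.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.ml.classification.MultilayerPerceptronClassifier
import org.apache.spark.ml.evaluation.MulticlassClassificationEvaluator

// Hedged sketch of digit recognition with Spark MLlib, not the demo's Caffe code.
// Assumes MNIST in libsvm format: 784 pixel features, labels 0-9.
object DigitClassifier {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("digit-classifier").getOrCreate()

    val data = spark.read.format("libsvm").load("hdfs:///datasets/mnist.libsvm")
    val Array(train, test) = data.randomSplit(Array(0.9, 0.1), seed = 42L)

    // 784 input pixels -> one hidden layer -> 10 output classes (digits 0-9)
    val model = new MultilayerPerceptronClassifier()
      .setLayers(Array(784, 128, 10))
      .setMaxIter(50)
      .fit(train)

    val accuracy = new MulticlassClassificationEvaluator()
      .setMetricName("accuracy")
      .evaluate(model.transform(test))

    println(f"test accuracy: $accuracy%.3f")
    spark.stop()
  }
}
```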