Hadoop on OpenStack: Elastic Data Processing (EDP) with Savanna 0.3
August 26, 2013
Now that version 0.2 of Project Savanna is out, it’s time to start looking at what will be coming up in version 0.3. The goal for this next development phase is to provide elastic data processing (EDP) capabilities, creating a Savanna component that enables data analysis and transformation in an easy and resource-effective way, running a Hadoop cluster only when it’s needed.
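To provide this functionality, Savanna needs you to give it three things: the data to process, the implementation of the processing routine, and guidance on where to process the data.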
These three items define the architecture for the EDP part of Savanna. You can see the high-level architecture in this figure:
As you can see from the image, EDP is made up of the following components:
We’ll circle back to how EDP uses these three components, but first let’s consider in more detail how Savanna interacts with OpenStack as a whole:
The Savanna product communicates with the following OpenStack components:
Let’s consider how Savanna performs elastic data processing using the three components we mentioned in the beginning of this post.
Data for processing
The data for processing can be stored in various locations and have various representations. Let’s take a look at the different use cases for data locations that Savanna is planning to support:
Processing the data: job workflow
In its simplest form, a job is defined by a single jar file. It contains all of the necessary code—in other words, implementation of the map and reduce functions. This means that writing a data processing procedure typically requires knowledge of Java. Another option is to use the Hadoop Streaming API. In that case, you can implement the map and reduce functions in any language supported by the operating system on which Hadoop is running, for example, Python or C++.
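To make the Streaming option concrete, here is a minimal word-count sketch in Python. It is only an illustration, not something Savanna ships: the function names and the convention of passing `map` or `reduce` as a command-line argument are our own assumptions. The one thing it relies on from Hadoop Streaming is that records travel over stdin and stdout as tab-separated key/value lines, and that mapper output is sorted by key before it reaches the reducer.

```python
import sys
from itertools import groupby


def map_words(lines):
    """Map step: emit one 'word<TAB>1' record per word."""
    for line in lines:
        for word in line.split():
            yield f"{word.lower()}\t1"


def reduce_counts(lines):
    """Reduce step: sum the counts for each word.

    Assumes the input is sorted by key, as Hadoop Streaming guarantees,
    so equal words arrive on consecutive lines and groupby is enough.
    """
    pairs = (line.rstrip("\n").split("\t", 1) for line in lines)
    for word, group in groupby(pairs, key=lambda kv: kv[0]):
        yield f"{word}\t{sum(int(count) for _, count in group)}"


# Hypothetical convention: run as "wordcount.py map" or "wordcount.py reduce".
if __name__ == "__main__" and len(sys.argv) == 2:
    step = map_words if sys.argv[1] == "map" else reduce_counts
    for record in step(sys.stdin):
        print(record)
```

You can try the same script locally with `cat input.txt | python wordcount.py map | sort | python wordcount.py reduce`, where `sort` stands in for Hadoop’s shuffle phase. On a cluster, the two commands would be passed to the Hadoop Streaming jar as its `-mapper` and `-reducer` options.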
For those who are not familiar with Java or other programming languages, Savanna provides an opportunity to use a high-level scripting language such as Pig or Hive. These languages can easily be learned by users who are not professional programmers.
Looking ahead, an interesting possibility is support of Mahout as a service. Mahout contains sophisticated algorithms that are difficult to implement in a scripting language like Hive or Pig, and the end user often doesn’t want to know the implementation details. In the future, we hope that Savanna will enable running Mahout jobs in an elastic way.
Currently, Savanna supports users who are professional Hadoop programmers and need to run complex data processing algorithms that involve several tasks. These general workflows are described by directed acyclic graphs (DAGs): vertices correspond to tasks, and edges represent dependencies between them. In Hadoop, this process is usually managed via Oozie, a Hadoop workflow engine and job coordinator. The next version of Savanna, 0.3, will likely include a mechanism for scheduling Oozie workflows for elastic data processing.
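If the DAG idea is new to you, the following Python sketch may help. It does not use Oozie or any Savanna API. It uses only the standard library’s `graphlib` module (Python 3.9 or later), and the task names are made up for illustration. It shows how a dependency graph determines which tasks can run at each stage.

```python
from graphlib import CycleError, TopologicalSorter

# Hypothetical workflow: each task maps to the set of tasks it depends on.
workflow = {
    "import_logs": set(),
    "clean": {"import_logs"},
    "aggregate": {"clean"},
    "sessionize": {"clean"},
    "report": {"aggregate", "sessionize"},
}


def execution_stages(dag):
    """Group tasks into stages; tasks within one stage do not depend on each other."""
    sorter = TopologicalSorter(dag)
    try:
        sorter.prepare()  # detects cycles before any task is scheduled
    except CycleError as exc:
        raise ValueError("workflow is not acyclic") from exc
    stages = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())  # tasks whose dependencies are all done
        stages.append(ready)
        sorter.done(*ready)
    return stages


print(execution_stages(workflow))
# → [['import_logs'], ['clean'], ['aggregate', 'sessionize'], ['report']]
```

Here `aggregate` and `sessionize` land in the same stage because neither depends on the other, which is exactly the kind of parallelism a workflow engine can exploit.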
Now let’s talk about how Savanna can get the job code needed to begin data processing.
The simplest way to provide a program for data processing is to store it in a distributed object store such as Swift. Savanna uses the Horizon dashboard to provide a user-friendly interface for file upload. This works great when you want a quick, one-time execution, for example, for an ad-hoc query. On the other hand, Savanna should also be able to process code from a version control system (VCS) repository, for example, Git or Mercurial. This is critical for users who have a continuous delivery process and need a way to propagate code from development and testing environments to production.
We’ll now drill down and explain job execution in more detail.
The first question when it comes to job execution is whether Savanna should start a new cluster or run the job on one that already exists. To answer that, Savanna takes into account various aspects, such as the current load on the existing clusters, proximity to the data location, and the required speed. Savanna provides this information to the end user, so they can decide where to schedule the job.
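As a rough illustration of that decision, the sketch below ranks candidate clusters by the three aspects just mentioned. Everything in it is an assumption on our part: the `ClusterOption` fields, the `rank_options` function, and the ordering rule are hypothetical and do not reflect how Savanna actually weighs these factors.

```python
from dataclasses import dataclass


@dataclass
class ClusterOption:
    name: str
    load: float              # fraction of capacity in use, 0.0 to 1.0
    data_is_local: bool      # True if the input data sits next to the cluster
    startup_minutes: float   # 0 for a running cluster, more for a new one


def rank_options(options, max_wait_minutes):
    """Return the options a user could choose from, best first.

    Hypothetical rule: drop options that take too long to start, then
    prefer data locality, then lower load, then shorter startup time.
    """
    eligible = [o for o in options if o.startup_minutes <= max_wait_minutes]
    return sorted(
        eligible,
        key=lambda o: (not o.data_is_local, o.load, o.startup_minutes),
    )


options = [
    ClusterOption("existing-a", load=0.9, data_is_local=True, startup_minutes=0),
    ClusterOption("existing-b", load=0.2, data_is_local=False, startup_minutes=0),
    ClusterOption("new-cluster", load=0.0, data_is_local=True, startup_minutes=15),
]

for option in rank_options(options, max_wait_minutes=20):
    print(option.name)
```

With a 20-minute tolerance, the new cluster ranks first because it is idle and close to the data. Tighten the tolerance to 10 minutes and it drops out, leaving only the existing clusters. The final choice still belongs to the user.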
One very useful feature is cluster autoscaling during a job execution. Often, at different stages of data processing, you require different resource quantities. Autoscaling allows you to have only what you need at the time you need it. The Savanna team is still deciding how to handle autoscaling; it is likely that we will support it first within Savanna itself, then integrate with Heat.
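Since the autoscaling design is still open, the following is only a toy sketch of the general idea, not a description of what Savanna or Heat will do. The function name, its parameters, and the sizing rule (enough workers to cover the pending tasks, clamped to a configured range) are all assumptions for illustration.

```python
import math


def desired_workers(pending_tasks, tasks_per_worker, min_workers, max_workers):
    """Return how many workers a stage needs, clamped to the allowed range.

    Hypothetical policy: one worker handles `tasks_per_worker` tasks at a
    time, and the cluster never shrinks below `min_workers` or grows past
    `max_workers`.
    """
    if tasks_per_worker <= 0:
        raise ValueError("tasks_per_worker must be positive")
    needed = math.ceil(pending_tasks / tasks_per_worker)
    return max(min_workers, min(max_workers, needed))


# A map-heavy stage followed by a light final stage (made-up numbers).
print(desired_workers(95, 10, min_workers=2, max_workers=20))  # → 10
print(desired_workers(5, 10, min_workers=2, max_workers=20))   # → 2
```

The point of the clamp is that a busy stage can grow the cluster up to a budget limit, while a quiet stage releases everything except a small baseline.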
Plans for version 0.3 of Savanna include both analytics as a service and elastic data processing (EDP). Making this possible requires providing Savanna with the data, the implementation of the processing routine, and guidance on where to process the data. Once these capabilities are in place, Savanna’s EDP features will allow you to choose where and how to process your data, with resource autoscaling.
To follow progress on the Savanna project, or even to contribute, please visit the Savanna wiki.