Hadoop began as a widely adopted implementation of the MapReduce paradigm and has since grown into a platform for distributed computation, with a number of projects built on top of it. In this post, I will discuss use cases for moving two of its core components to OpenStack: the MapReduce framework and HDFS, its distributed, reliable file system. This is a first step toward bringing the whole Hadoop infrastructure into the private cloud. The project, which we’ve called Savanna, was recently introduced by Mirantis and is detailed by my colleague Dmitry Mescheryakov in the YouTube video below. Here, I’ll walk through the use cases we believe are germane to the problem.
Understanding the Hadoop Cloud Use Cases
The Hadoop ecosystem has several groups of users, each with its own view of the system and specific needs. We can divide these users into three major groups: system administrators, developers and QA engineers, and analysts and data scientists. Let’s consider how we can help each group and how their problems can be solved using the OpenStack cloud.
Hadoop cloud use case: Systems Administrators
System administrators handle the installation, management, and monitoring of clusters. They need tools integrated into a central point of administration so they can monitor the company’s entire IT infrastructure.
Another issue that arises within a big organization is how to support different versions and distributions of Hadoop. One team may prefer the Cloudera distribution, while another may want to try a new feature that is currently available only in the latest release directly from Apache. For many, it is also important to have out-of-the-box integration with enterprise-level tools such as the Cloudera Management Console and Apache Ambari (used by Hortonworks).
Running Hadoop clusters as virtual environments fits well with the current trend of moving applications to the cloud for better resource utilization. The OpenStack cloud thus becomes an ideal tool for solving the problems of the system administrators group, providing an easy way to set up and manage multiple Hadoop environments.
Hadoop cloud use case: Developers and QA
Developer and QA teams require fast creation of different types of environments, such as development, QA, and pre-production. Often a unique environment is created to test a specific issue or conduct an experiment, and does not need a long-lived presence. The problem is especially acute for Hadoop developers, because even a simple test environment requires several machines. Due to its cloud nature, OpenStack is an ideal solution for quickly creating such temporary environments.
Ideally, an OpenStack cloud should provide a REST API for Hadoop cluster setup and management. If we have a continuous delivery process and want to run a test in a distributed environment, the continuous integration server should be able to create a test cluster via API and run a test on it. After the job is complete, the server shuts down this cluster.
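The create-test-teardown flow described above can be sketched in Python. Note that the Savanna API was still being designed at the time of writing, so the endpoint paths and payload fields below are hypothetical illustrations, not the real interface:

```python
import json

# Hypothetical payload builder for a Savanna-style REST API. The field
# names and endpoint paths are illustrative assumptions only.
def build_cluster_request(name, hadoop_version, node_count, flavor="m1.medium"):
    """Build the JSON body for a 'create test cluster' call."""
    return {
        "cluster": {
            "name": name,
            "hadoop_version": hadoop_version,
            "node_templates": {
                "master": {"count": 1, "flavor": flavor},
                "worker": {"count": node_count - 1, "flavor": flavor},
            },
        }
    }

def ci_test_cycle(api_post, api_delete, test_runner):
    """Sketch of the CI flow: create a cluster, run the tests, tear it down.

    api_post/api_delete stand in for real HTTP calls so the flow can be
    shown (and tested) without a live cloud.
    """
    body = build_cluster_request("ci-run", "1.1.1", node_count=4)
    cluster_id = api_post("/v1.0/clusters", json.dumps(body))
    try:
        return test_runner(cluster_id)
    finally:
        # Always release the cluster, even if the tests fail.
        api_delete("/v1.0/clusters/%s" % cluster_id)
```

In real use, `api_post` and `api_delete` would be thin wrappers over an HTTP client carrying the Keystone auth token; the shape of the flow stays the same.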
Another problem that development and QA teams often run into is providing access to data for another team that uses a different Hadoop cluster. If we create a new cluster, we need to populate it with data. An effective way to avoid copying data to each new cluster is to use Swift for permanent storage: any newly created cluster can be configured to read its data directly from the main Swift cluster. The Mirantis team has already made major contributions to Swift that expose data locality information (see Change I6b1ba25b), and the Hadoop part (HADOOP-8545) is nearly finished.
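On the Hadoop side, the integration exposes Swift objects through a `swift://` filesystem scheme, so a job's input path can be built from a container and object name. The `container.provider` URL layout below is an assumption based on the in-progress HADOOP-8545 patch; check the final patch for the authoritative form:

```python
def swift_input_path(container, obj, provider="savanna"):
    """Build a Hadoop-style swift:// URI for an object stored in Swift.

    The 'swift://container.provider/object' layout is an assumption
    based on the in-progress HADOOP-8545 patch, where 'provider' names
    the Swift endpoint configured in the Hadoop cluster.
    """
    return "swift://%s.%s/%s" % (container, provider, obj.lstrip("/"))
```

A newly spawned test cluster could then point its job input at, say, `swift_input_path("logs", "2013/01/part-0000")` without any data having been copied into the cluster's HDFS.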
In the next blog post we will discuss this Hadoop and Swift integration in more detail, and look into use cases for working with Swift as a permanent data source for Hadoop.
Hadoop cloud use case: Data Scientists
Data scientists prefer to work with high-level tools such as Pig or Hive. These tools provide a simpler interface for data processing than writing a raw MapReduce job: analysts are not required to understand implementation details and often have little knowledge of what happens under the hood. Precisely because of that, an analytics job can behave unpredictably and harm an existing production cluster, and its load, mixed with the existing production load, can threaten cluster stability. OpenStack makes it possible to create an isolated environment for analytics, separate from production.
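To see why analysts prefer these tools: a word count that takes a mapper class, a reducer class, and driver boilerplate in raw Java MapReduce collapses into one HiveQL statement. The query below is wrapped in Python only to keep the example self-contained, and the table `docs` with a string column `line` is a hypothetical example:

```python
# A word count in HiveQL. Hive compiles this single statement into the
# map and reduce stages that would otherwise be written by hand.
# The table 'docs' (one string column, 'line') is hypothetical.
WORD_COUNT_HQL = """
SELECT word, COUNT(*) AS freq
FROM (SELECT explode(split(line, ' ')) AS word FROM docs) words
GROUP BY word
""".strip()

def hive_query_lines(query):
    """Count non-empty lines, as a rough measure of how compact the query is."""
    return len([l for l in query.splitlines() if l.strip()])
```

Three lines of HiveQL versus tens of lines of Java is exactly the trade-off that makes these tools attractive, and also why their users rarely think about the cluster resources the generated job will consume.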
Data scientists usually deal with ad hoc queries, meaning that large volumes of data are loaded into the Hadoop cluster on short notice. This can leave production jobs short of resources. Normally, analysts would have to wait for a free window in which to run their queries, but an OpenStack cloud can use the spare capacity of the data center to absorb these bursts. This lets data scientists run their ad hoc queries without waiting for a time slot on the production cluster.
Permanent production load represents a use case for which we don’t recommend running Hadoop on top of OpenStack, even with all of these advantages. In this case, the benefits of manageability are not significant and do not justify the virtualization overhead—at least for now. In the future, when OpenStack’s Grizzly release brings bare-metal provisioning, this overhead will go away, and the different types of Hadoop installations will converge.
Savanna: Hadoop as a Service with OpenStack
To address the use cases above, Mirantis has initiated the Savanna project: Hadoop as a Service for OpenStack. For more detailed information, please visit the Savanna wiki page (https://wiki.openstack.org/wiki/Savanna) and the project website (http://software.mirantis.com/key-related-openstack-projects/savanna-hadoop/).