Sahara Data Processing for OpenStack
Rapidly Configure, Auto-Deploy and Scale Hadoop Clusters on OpenStack
Hadoop, the open source MapReduce framework, is a great solution for meeting the growing demand for Big Data analytics coming from every corner of the enterprise. But because it is resource-intensive, Hadoop demands scale-up/scale-down agility, and that agility is hard to achieve because Hadoop is also complex: difficult to configure, deploy, test, optimize, and maintain.
OpenStack Sahara — elastic Hadoop on demand — helps solve these problems. Sahara lets you rapidly configure, reliably auto-deploy and scale Hadoop clusters on OpenStack. And it helps you (or even end-users) submit jobs to and collect results from them. All from the convenience of the Horizon web UI or OpenStack CLI, while using familiar OpenStack Projects, Users, Quotas and other constructs to isolate Big Data workloads, allocate and manage resources.
What is Sahara?
Sahara began life as an Apache 2.0 project and is now an OpenStack integrated project, meaning it is part of the semi-annual OpenStack release. Developed by open source thought leaders from the Apache and OpenStack Foundations with active participation from Mirantis, Hortonworks, and Red Hat, Sahara provides push-button provisioning of mainstream Hadoop distributions and elastic data processing (EDP) capability similar to Amazon Elastic MapReduce (EMR).
Manually creating a Hadoop cluster on OpenStack requires spinning up instances, installing Hadoop on each instance, configuring the instances to work together, specifying the namenode, jobtracker, tasktrackers, and any storage nodes, and configuring each of hundreds, or even thousands, of nodes. Why spend all that time on manual configuration — with the possibility of human error — when you can use Sahara to specify node characteristics and roles, click to deploy, and access one or more robust, stable Hadoop clusters in parallel with minimal delay?
Simplify deployment by creating node templates
Sahara makes it possible to create templates from which it can deploy as many or as few nodes as your cluster needs. You can easily mix and match roles for a node, perhaps creating a master node that runs the namenode, secondarynamenode and jobtracker processes, a worker node running the datanode and tasktracker processes, and pure storage nodes running datanode.
Sahara provides the ability to create templates that specify:
- Roles a node is to play.
- Flavor for the OpenStack VM on which the node should run.
- Type of storage (Cinder volume or ephemeral drive) it should use.
- IP pool from which to draw.
- HDFS parameters (for storage nodes).
- MapReduce parameters (for jobtracker and tasktracker nodes).
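To make the template concept concrete, here is a minimal sketch of a node group template expressed as the kind of JSON body Sahara's REST API accepts. The field names follow the Sahara v1.1 API; the plugin version, flavor, volume sizes, and pool name are illustrative placeholders, not values from this document.

```python
# A master-node template sketch: roles, flavor, storage type, IP pool,
# and HDFS parameters, mirroring the bullet list above.
master_template = {
    "name": "hadoop-master",
    "plugin_name": "vanilla",            # Hadoop distribution plugin (assumed)
    "hadoop_version": "2.7.1",           # placeholder version
    "node_processes": [                  # roles this node is to play
        "namenode",
        "secondarynamenode",
        "jobtracker",
    ],
    "flavor_id": "m1.large",             # OpenStack VM flavor (placeholder)
    "volumes_per_node": 2,               # Cinder volumes rather than ephemeral disk
    "volumes_size": 100,                 # GB per volume (placeholder)
    "floating_ip_pool": "public",        # IP pool from which to draw addresses
    "node_configs": {
        "HDFS": {"dfs.replication": 3},  # HDFS parameters for storage nodes
    },
}

# A worker or pure storage template reuses the same shape with different
# roles, e.g. node_processes = ["datanode", "tasktracker"] or ["datanode"].
```

From templates like these, Sahara can stamp out as many or as few nodes of each type as the cluster needs.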
Easily scale your cluster up and down
When it’s time to grow your cluster, Sahara makes it simple. Instead of struggling to create and configure new nodes, Sahara gives you the ability to simply add node types using a convenient interface. You can also remove nodes and easily redeploy, providing just the right computing power for your cluster’s needs.
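Scaling works the same way under the covers: a single request tells Sahara which existing node groups to resize and which new groups to add. The sketch below follows the shape of the Sahara v1.1 scaling request; the group names, counts, and template ID are hypothetical.

```python
# A cluster-scaling request body sketch: grow an existing worker group
# and add a new storage group in one operation.
scale_request = {
    "resize_node_groups": [
        {"name": "worker", "count": 15},              # grow workers to 15 nodes
    ],
    "add_node_groups": [
        {
            "name": "storage",
            "node_group_template_id": "<template-uuid>",  # placeholder ID
            "count": 4,                               # add four storage nodes
        },
    ],
}
```

Shrinking a group is the same call with a smaller `count`, which is what makes redeploying to just the right computing power a one-step change.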
Get control of your Hadoop environment
While Sahara takes the pain out of provisioning a Hadoop cluster, it doesn’t tie your hands when it comes to control. Sahara enables you to:
- Decide whether to create clusters using the UI, or integrate Sahara with your own application via the convenient API.
- Choose from among multiple Hadoop distributions, including the Hortonworks Data Platform (HDP) and Cloudera's Distribution of Hadoop (CDH).
- Implement Apache Spark jobs.
- Enable anti-affinity based on roles, so that instances running the same process type are placed on different physical hosts, providing increased stability and performance.
- Easily configure HDFS and MapReduce parameters at both the node and cluster level.
- Specify a base image to be loaded on instances at provisioning time.
- Select a keypair to use for logging in to instances.
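Several of these controls come together in a cluster template. As a hedged sketch (field names follow the Sahara v1.1 API; the UUIDs, image, keypair, and config values are placeholders), anti-affinity, cluster-level MapReduce parameters, the base image, and the login keypair can all be declared in one place:

```python
# A cluster template sketch combining the controls listed above.
cluster_template = {
    "name": "analytics-cluster",
    "plugin_name": "vanilla",                  # assumed plugin
    "hadoop_version": "2.7.1",                 # placeholder version
    "anti_affinity": ["datanode"],             # keep datanodes on distinct hosts
    "default_image_id": "<base-image-uuid>",   # base image loaded at provisioning
    "node_groups": [
        {"name": "master", "node_group_template_id": "<master-uuid>", "count": 1},
        {"name": "worker", "node_group_template_id": "<worker-uuid>", "count": 10},
    ],
    "cluster_configs": {
        "MapReduce": {"mapreduce.map.memory.mb": 2048},  # cluster-level override
    },
}

# When launching a cluster from this template, a keypair for instance
# login is typically supplied as well, e.g. user_keypair_id="<keypair-name>".
```

Node-level settings from the node group templates still apply; the `cluster_configs` section illustrates how the same parameters can be overridden at the cluster level.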
Under the hood
Sahara is tightly integrated with core OpenStack services such as Keystone, Glance, Horizon, and Nova, and is moving towards integration with other services such as Heat and Trove. It supports the native OpenStack APIs, so you can either provide users with the Horizon GUI to provision Hadoop environments and run elastic data processing operations, or code against the APIs directly.
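Coding against the API directly is ordinary authenticated REST: a Keystone token goes in the `X-Auth-Token` header of each call to the Sahara endpoint. The sketch below only composes the request; the endpoint URL, port, project ID, and token are hypothetical placeholders, and actually sending it (for example with the `requests` library) requires a live cloud.

```python
# Compose (but do not send) a request to list Sahara clusters.
SAHARA_ENDPOINT = "http://controller:8386/v1.1"  # hypothetical Sahara endpoint
PROJECT_ID = "<project-uuid>"                    # placeholder project ID

def list_clusters_request(token):
    """Build the URL and headers for GET /clusters, authenticated via Keystone."""
    url = f"{SAHARA_ENDPOINT}/{PROJECT_ID}/clusters"
    headers = {"X-Auth-Token": token, "Content-Type": "application/json"}
    return url, headers

url, headers = list_clusters_request("example-keystone-token")
# requests.get(url, headers=headers)  # would perform the call on a live cloud
```

Because the same Keystone credentials and project scoping apply everywhere, the familiar OpenStack Projects, Users, and Quotas constructs carry over unchanged to Big Data workloads.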