
Improving Data Processing Performance with Hadoop Data Locality

on February 28, 2014
If you’re interested in this topic, please vote to see Andrew Lazarev’s full talk, Performance of Hadoop on OpenStack, at the spring OpenStack Summit.

One of the major bottlenecks in data-intensive computing is cross-switch network traffic.  Fortunately, having map code executing on the node where the data resides significantly reduces this problem.  This technique, called “data locality”, is one of the key advantages of Hadoop Map/Reduce.

In this article, we’ll discuss the requirements for data locality, how the virtualized environment of OpenStack influences the Hadoop cluster topology, and how to achieve data locality using Hadoop with Savanna.

Requirements for Hadoop data locality

To get the full benefit of data locality, you need to make sure that your system architecture satisfies several conditions.

First of all, the cluster should have the appropriate topology. Hadoop map code must have the ability to read data “locally”. Some popular solutions, such as networked storage (network-attached storage [NAS] and storage-area networks [SANs]), will always cause network traffic, so in one sense you might not consider this “local”, but really, it’s a matter of perspective.  Depending on your situation, you might define “local” as meaning “within a single datacenter” or “all on one rack”.

Second, Hadoop must be aware of the topology of the nodes where tasks are executed. TaskTracker nodes execute map tasks, so the Hadoop scheduler needs information about node topology for proper task assignment.
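Hadoop’s standard mechanism for this is a topology script: you point the topology.script.file.name property (net.topology.script.file.name in Hadoop 2.x) in core-site.xml at an executable that maps node addresses to rack paths. Here’s a minimal sketch of such a script; the IP-to-rack mapping is invented for illustration, and in a real deployment it would come from your inventory system:

```python
#!/usr/bin/env python
# Hypothetical Hadoop topology script. Hadoop invokes it with one or
# more node IPs/hostnames as arguments and expects one rack path per
# argument on stdout, in the same order.
import sys

# Illustrative mapping -- replace with data from your own inventory.
NODE_TO_RACK = {
    "10.0.1.11": "/dc1/rack1",
    "10.0.1.12": "/dc1/rack1",
    "10.0.2.21": "/dc1/rack2",
}

DEFAULT_RACK = "/default-rack"  # Hadoop's convention for unknown nodes

def rack_of(node):
    """Return the rack path for a node, or the default rack."""
    return NODE_TO_RACK.get(node, DEFAULT_RACK)

if __name__ == "__main__":
    for node in sys.argv[1:]:
        print(rack_of(node))
```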

Last but not least, Hadoop must know where the data is located. This part can be a bit trickier because of the different storage engines supported by Hadoop. HDFS supports data locality out of the box, while other drivers (for example, Swift) need to be extended in order to provide data topology information to Hadoop.

Hadoop cluster topology on a virtualized infrastructure

Traditionally, Hadoop uses a three-layer network topology.  These layers were originally specified as Data-Center, Rack, and Node, though the cross-Data-Center case isn’t common, and that layer is often used for defining top-level switches.

This topology works well for traditional Hadoop cluster deployments, but it is hard to map a virtual environment onto these three layers because there is no room for the hypervisor. Under certain circumstances, two virtual machines that are running on the same host can communicate much faster than they could on separate hosts, because no physical network is involved. That’s why, starting with version 1.2.0, Hadoop supports a four-layer topology.

This new layer (called the “node group”) corresponds to the hypervisor that hosts the virtual machines; several Hadoop nodes may exist in individual VMs on a single host, controlled by a single hypervisor and thus able to communicate without having to go through the network.
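To make the layer arithmetic concrete, here is a small sketch (not Hadoop’s actual code) of how network distance falls out of a four-layer topology path like /rack/node-group/node: the distance between two nodes is the number of hops up to their closest common ancestor plus the hops back down, so co-located VMs score 2, same-rack nodes score 4, and off-rack nodes score 6. The rack and VM names are illustrative:

```python
# Sketch of Hadoop-style network distance from topology paths.
# A node's full location is written as /rack/node-group/node; the
# distance is hops up to the closest common ancestor plus hops down.

def distance(a_path, b_path):
    """Network distance between two nodes given their topology paths."""
    a = a_path.strip("/").split("/")
    b = b_path.strip("/").split("/")
    common = 0
    for x, y in zip(a, b):
        if x != y:
            break
        common += 1
    return (len(a) - common) + (len(b) - common)

# Same VM:                        distance 0
# Two VMs on the same hypervisor: distance 2
# Same rack, different host:      distance 4
# Different racks:                distance 6
```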

Savanna and Hadoop Data Locality

OK, so knowing all that, how do you actually make it happen?  One way is to use Savanna to set up your Hadoop clusters.

Savanna is an OpenStack project that allows you to deploy Hadoop clusters on top of OpenStack and execute tasks on them. The recent 0.3 release of Savanna added data locality support for the Vanilla plugin (the Hortonworks plugin will support data locality in the upcoming Icehouse release). With this improvement, Savanna can push the cluster topology configuration to Hadoop and enable data locality. Savanna supports both 3-layer and 4-layer network topologies; for the 4-layer topology, it uses the compute node host ID as the Hadoop cluster node group identifier.  (Note:  You’ll want to be careful not to confuse Hadoop cluster node groups with Savanna node groups, which serve a different purpose.)
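As a rough sketch of the configuration involved, enabling data locality in Savanna means switching it on in the service configuration and supplying topology files that map compute hosts (and, for Swift locality, Swift nodes) to racks. The option names and file formats below follow the Savanna documentation of that era as best I can reconstruct them, and the host names are invented; verify everything against the docs for your release:

```
# savanna.conf (illustrative)
[DEFAULT]
enable_data_locality=true
compute_topology_file=/etc/savanna/compute.topology
swift_topology_file=/etc/savanna/swift.topology

# compute.topology -- one "host /rack" pair per line
compute1 /rack1
compute2 /rack2

# swift.topology -- Swift data node addresses mapped the same way
192.168.1.10 /rack1
192.168.1.11 /rack2
```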

Savanna can also enable data locality for Swift input streams in Hadoop, but to do that, Hadoop needed to be enhanced with a specific Swift driver, because the Vanilla plugin uses Hadoop 1.2.1, which has no Swift support out of the box. The Swift driver was developed by the Savanna team and has already been partially merged into Hadoop 2.4.0. The plan is to have it fully merged into the 2.x repo and then backported to 1.x.

Using data locality with Swift involves enabling it within Savanna, then specifying both the Compute and Swift topologies.  Here’s a demonstration of how Savanna starts the Hadoop cluster and configures data locality on it:

[Demo: Savanna Data Locality]

Conclusion

Data processing with Hadoop in a virtualized environment is, perhaps, the next step in the evolution of Big Data. As clusters grow, it becomes extremely important to make efficient use of resources. Techniques like data locality can drastically reduce network traffic and allow you to work with large distributed clusters without losing the advantages of smaller, more local ones, giving you the opportunity to scale a Hadoop cluster nearly without limit.
