Improving Data Processing Performance with Hadoop Data Locality

Andrew Lazarev, Mirantis - February 28, 2014
If you’re interested in this topic, please vote to see
Andrew Lazarev’s full talk, Performance of Hadoop on OpenStack, at the Spring OpenStack summit.

One of the major bottlenecks in data-intensive computing is cross-switch network traffic.  Fortunately, having map code executing on the node where the data resides significantly reduces this problem.  This technique, called “data locality”, is one of the key advantages of Hadoop Map/Reduce.

In this article, we’ll discuss the requirements for data locality, how the virtualized environment of OpenStack influences the Hadoop cluster topology, and how to achieve data locality using Hadoop with Savanna.

Requirements for Hadoop data locality

To take full advantage of data locality, you need to make sure that your system architecture satisfies several conditions.

First of all, the cluster must have the appropriate topology. Hadoop map code must be able to read its data “locally”. Some popular solutions, such as networked storage (network-attached storage [NAS] and storage-area networks [SANs]), always generate network traffic, so in the strictest sense they are never “local”; in practice, though, locality is a matter of perspective. Depending on your situation, you might define “local” as meaning “within a single datacenter” or “all on one rack”.

Second, Hadoop must be aware of the topology of the nodes where tasks are executed. Map tasks run on TaskTracker nodes, so the Hadoop scheduler needs topology information about those nodes to assign tasks properly.
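Hadoop learns this topology from a user-supplied executable (wired in via the `topology.script.file.name` property in core-site.xml in Hadoop 1.x). The script receives one or more host addresses as arguments and prints one rack path per argument. As a rough sketch, with purely illustrative IPs and rack names:

```python
#!/usr/bin/env python
# Hypothetical Hadoop topology script, configured through the
# "topology.script.file.name" property in core-site.xml.
# Hadoop passes one or more host IPs/names as arguments and expects
# one rack path per argument on stdout.
import sys

# Illustrative mapping only; a real deployment would load this from a
# data file kept in sync with the physical layout of the cluster.
HOST_TO_RACK = {
    "10.0.0.11": "/dc1/rack1",
    "10.0.0.12": "/dc1/rack1",
    "10.0.0.21": "/dc1/rack2",
}

DEFAULT_RACK = "/default-rack"

def resolve(host):
    """Return the rack path for a host, or the default rack if unknown."""
    return HOST_TO_RACK.get(host, DEFAULT_RACK)

if __name__ == "__main__":
    for host in sys.argv[1:]:
        print(resolve(host))
```

Hosts the script cannot identify fall back to `/default-rack`, which is also Hadoop's behavior when no topology script is configured at all.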

Last but not least, Hadoop must know where the data is located. This part can be a bit trickier because of the different storage engines supported by Hadoop. HDFS supports data locality out of the box, while other drivers (for example, Swift) need to be extended in order to provide data topology information to Hadoop.

Hadoop cluster topology on a virtualized infrastructure

Traditionally, Hadoop uses a three-layer network topology.  These layers were originally specified as Data-Center, Rack, and Node, though the cross-Data-Center case isn’t common, and that layer is often used for defining top-level switches.

This topology works well for traditional Hadoop cluster deployments, but it is hard to map a virtual environment onto these three layers because there is no room for the hypervisor. Under certain circumstances, two virtual machines running on the same host can communicate much faster than they could on separate hosts, because no physical network is involved. That’s why, starting with version 1.2.0, Hadoop supports a four-layer topology.

This new layer (called a “node group”) corresponds to the hypervisor that hosts the virtual machines: several Hadoop nodes may run in individual VMs on a single host, controlled by a single hypervisor and thus able to communicate without going through the network.
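To make the four-layer scheme concrete, here is a hypothetical sketch of the mapping: each VM’s topology path gains a node-group segment derived from the ID of the hypervisor hosting it, so two VMs on the same compute host share a node group. All names below are made up for illustration.

```python
# Sketch of a four-layer topology mapping. The rack path for each VM
# gains a node-group segment taken from its hypervisor's host ID,
# as used by Hadoop's NetworkTopologyWithNodeGroup (Hadoop >= 1.2.0).

# VM -> (rack, hypervisor host ID); illustrative values only.
VM_PLACEMENT = {
    "vm-101": ("rack1", "compute-host-a"),
    "vm-102": ("rack1", "compute-host-a"),  # same host: no network hop
    "vm-201": ("rack1", "compute-host-b"),
}

def topology_path(vm):
    """Return the /rack/nodegroup path for a VM."""
    rack, host_id = VM_PLACEMENT[vm]
    return "/%s/%s" % (rack, host_id)
```

With this mapping, `vm-101` and `vm-102` resolve to the same node group, so the scheduler can treat traffic between them as cheaper than traffic to `vm-201`, even though all three VMs sit on the same rack.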

Savanna and Hadoop Data Locality

OK, so knowing all that, how do you actually make it happen?  One way is to use Savanna to set up your Hadoop clusters.

Savanna is an OpenStack project that allows you to deploy Hadoop clusters over OpenStack and execute tasks on them. The recent 0.3 release of Savanna added data locality support for the Vanilla plugin (the Hortonworks plugin will support data locality in the upcoming Icehouse release). With this improvement, Savanna can push the cluster topology configuration to Hadoop and enable data locality. Savanna supports both 3-layer and 4-layer network topologies; for a 4-layer topology, it uses the compute node host ID as the Hadoop cluster node group identifier. (Note: be careful not to confuse Hadoop cluster node groups with Savanna node groups, which serve a different purpose.)
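In practice, turning this on amounts to a few settings in the Savanna configuration plus two topology files. The option and file names below follow the Savanna 0.3 conventions, and the file contents (one `host /topology/path` pair per line) are illustrative only:

```ini
# savanna.conf (excerpt)
enable_data_locality=True
compute_topology_file=/etc/savanna/compute.topology
swift_topology_file=/etc/savanna/swift.topology

# /etc/savanna/compute.topology -- compute host ID, then its rack path
# compute-host-a /rack1
# compute-host-b /rack1

# /etc/savanna/swift.topology -- Swift node, then its rack path
# swift-node-1 /rack1
# swift-node-2 /rack2
```

Savanna reads these files when it provisions a cluster and generates the corresponding Hadoop topology configuration on the cluster nodes.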

Savanna can also enable data locality for Swift input streams in Hadoop, but to do that, Hadoop had to be enhanced with a specific Swift driver, because the Vanilla plugin uses Hadoop 1.2.1, which does not include Swift support out of the box. The Swift driver was developed by the Savanna team and has already been partially merged into Hadoop 2.4.0. The plan is to have it fully merged into the 2.x repo and then backported to 1.x.

Using data locality with Swift involves enabling it within Savanna, then specifying both the Compute and Swift topologies.  Here’s a demonstration of how Savanna starts the Hadoop cluster and configures data locality on it:

Savanna Data Locality

Conclusion

Data processing with Hadoop in a virtualized environment is, perhaps, the next step in the evolution of Big Data. As clusters grow, it becomes extremely important to optimize resource consumption. Techniques like data locality can drastically reduce network use and allow you to work with large distributed clusters without losing the advantages of smaller, more local ones, giving a Hadoop cluster far more room to scale.
