Improving Data Processing Performance with Hadoop Data Locality
Andrew Lazarev's full talk, Performance of Hadoop on OpenStack, at the Spring OpenStack summit.
One of the major bottlenecks in data-intensive computing is cross-switch network traffic. Fortunately, having map code executing on the node where the data resides significantly reduces this problem. This technique, called "data locality", is one of the key advantages of Hadoop Map/Reduce.
In this article, we'll discuss the requirements for data locality, how the virtualized environment of OpenStack influences the Hadoop cluster topology, and how to achieve data locality using Hadoop with Savanna.
Requirements for Hadoop data locality
In order to get the benefit of all of the advantages of data-locality, you need to make sure that your system architecture satisfies several conditions.
First of all, the cluster should have the appropriate topology. Hadoop map code must have the ability to read data "locally". Some popular solutions, such as networked storage (networked-attached storage [NAS] and storage-area networks [SANs]) will always cause network traffic, so in one sense you might not consider this "local", but really, it's a matter of perspective. Depending on your situation, you might define "local" as meaning "within a single datacenter" or "all on one rack".
Second, Hadoop must be aware of the topology of the nodes where tasks are executed. Tasktracker nodes are used to execute map tasks, and so the Hadoop scheduler needs information about node topology for proper task assignment.
Last but not least, Hadoop must know where the data is located. This part can be a bit trickier because of the different storage engines supported by Hadoop. HDFS supports data locality out of the box, while other drivers (for example, Swift) need to be extended in order to provide data topology information to Hadoop.
Hadoop cluster topology on a virtualized infrastructure
Traditionally, Hadoop uses a three-layer network topology. These layers were originally specified as Data-Center, Rack, and Node, though the cross-Data-Center case isn't common, and that layer is often used for defining top-level switches.
This topology works well for traditional hadoop cluster deployments, but it is hard to map a virtual environment into these three layers because there is no room for the hypervisor. Under certain circumstances, two virtual machines that are running on the same host can communicate much faster than they could on separate hosts, because there's no network involved. That’s why starting with version 1.2.0, Hadoop supports the four-layer topology.
This new layer (called “node group”) corresponds to hypervisor that hosts virtual machines; several nodes may exist in individual VMs on a single host, controlled by a single hypervisor and thus able to communicate without having to go through the network.
Savanna and Hadoop Data Locality
OK, so knowing all that, how do you actually make it happen? One way is to use Savanna to set up your Hadoop clusters.
Savanna is an OpenStack project that allows you to deploy Hadoop clusters over OpenStack and execute tasks on it. The recent 0.3 release of Savanna added data locality support for the Vanilla plugin (the Hortonworks plugin will support data locality in the upcoming Icehouse release). With the improvement, Savanna can push the cluster topology configuration to Hadoop and enable data locality. In this way, Savanna supports both 3-layer and 4-layer network topologies. For 4-layer topology Savanna uses the compute node host ID as the Hadoop cluster node group identifier. (Note: You'll want to be careful not to confuse Hadoop cluster node groups with Savanna node groups, which serve a different purpose.)
Savanna can also enable data locality for Swift input streams in Hadoop, but to do that, Hadoop needed to be enhanced with a specific Swift driver, because the Vanilla plugin uses Hadoop 1.2.1 without Swift support out of the box. The Swift driver was developed by Savanna team and has already been partially merged into Hadoop 2.4.0. The plan is to have it fully merged into the 2.x repo and then backported to 1.x.
Using data locality with Swift involves enabling it within Savanna, then specifying both the Compute and Swift topologies. Here's a demonstration of how Savanna starts the Hadoop cluster and configures data locality on it:
Data processing on Hadoop in virtual environment is, perhaps, the next step in the evolution of Big Data. As clusters grow, it is extremely important to optimize consumed resources. Technologies like data locality can drastically decrease network use and allow you to work with large distributed clusters without losing the advantages of smaller, more local clusters. This gives you the opportunity for nearly infinite scaling on a Hadoop cluster.