Build Your First Hadoop Project with Maven

On this Fuse Day we jumped into the Big Data world, and our obvious first choice was Hadoop. Hadoop is the de-facto standard in the open source world today for large-scale data processing. In this article I'll describe the “baby steps” I took to set up my first Hadoop Java project, so that any Hadoop newbie can kick-start a Java Maven project more easily.

The first step is to install Hadoop on your working machine. There are plenty of articles describing how to install Hadoop, so this post assumes you have already set up an environment for developing and running Hadoop, and will guide you through what to do next.

The first step after installing Hadoop is to kick off a Maven project. This project might look different from the regular web projects you may have developed lately - it is NOT a typical web project and will NOT run within any web container. Moreover, assuming your program uses some dependencies, you probably want them to be part of its runtime classpath. Hadoop adds every jar found in the “lib” directory of the job jar to the runtime classpath. To achieve this layout with Maven, we use the following assembly descriptor, located in src/main/assembly/hadoop-job.xml:

<?xml version="1.0" encoding="UTF-8"?>
<assembly xmlns="http://maven.apache.org/plugins/maven-assembly-plugin/assembly/1.1.0"
          xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
          xsi:schemaLocation="http://maven.apache.org/plugins/maven-assembly-plugin/assembly/1.1.0 http://maven.apache.org/xsd/assembly-1.1.0.xsd">

	<id>job</id>
	<formats>
		<format>jar</format>
	</formats>
	<includeBaseDirectory>false</includeBaseDirectory>
	<dependencySets>
		<dependencySet>
			<unpack>false</unpack>
			<scope>runtime</scope>
			<outputDirectory>lib</outputDirectory>
			<excludes>
				<exclude>${groupId}:${artifactId}</exclude>
			</excludes>
		</dependencySet>
		<dependencySet>
			<unpack>true</unpack>
			<includes>
				<include>${groupId}:${artifactId}</include>
			</includes>
		</dependencySet>
	</dependencySets>
</assembly>


In our pom.xml file, we simply reference the assembly file as the descriptor for the assembly plugin and add the regular dependencies. This assembles all runtime dependencies into a lib directory inside the output job jar:

<?xml version="1.0" encoding="UTF-8"?>
<project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/maven-v4_0_0.xsd">
	<modelVersion>4.0.0</modelVersion>
	<groupId>com.tikal.fuseday.bigdata</groupId>
	<artifactId>genome</artifactId>
	<packaging>jar</packaging>
	<version>1.0-SNAPSHOT</version>
	<properties>
		<project.build.sourceEncoding>UTF-8</project.build.sourceEncoding>
	</properties>


	<dependencies>
		<dependency>
			<groupId>com.google.guava</groupId>
			<artifactId>guava</artifactId>
			<version>12.0</version>
		</dependency>

		<dependency>
			<groupId>org.apache.hadoop</groupId>
			<artifactId>hadoop-core</artifactId>
			<version>1.0.3</version>
			<scope>provided</scope>
		</dependency>
		<dependency>
			<groupId>org.mockito</groupId>
			<artifactId>mockito-core</artifactId>
			<version>1.8.5</version>
			<scope>test</scope>
		</dependency>
		<dependency>
			<groupId>junit</groupId>
			<artifactId>junit</artifactId>
			<version>4.7</version>
			<scope>test</scope>
		</dependency>
	</dependencies>

	<build>
		<plugins>
			<plugin>
				<artifactId>maven-compiler-plugin</artifactId>
				<configuration>
					<source>1.6</source>
					<target>1.6</target>
				</configuration>
			</plugin>
			<plugin>
				<artifactId>maven-assembly-plugin</artifactId>
				<version>2.2-beta-5</version>
				<configuration>
					<descriptors>
						<descriptor>src/main/assembly/hadoop-job.xml</descriptor>
					</descriptors>
					<archive>
						<manifest>
							<mainClass>com.tikal.fuseday.hadoop.GenomeMapReduce</mainClass>
						</manifest>
					</archive>
				</configuration>
				<executions>
					<execution>
						<id>make-assembly</id>
						<phase>package</phase>
						<goals>
							<goal>single</goal>
						</goals>
					</execution>
				</executions>
			</plugin>
		</plugins>
	</build>

</project>


The last step is to write the MapReduce program. MapReduce is at the heart of the Hadoop framework and serves as the engine for processing your data. We decided to write our MapReduce job to analyze GenBank data, trying to build a length histogram of the nucleotide sequences available there (dozens of billions).

This data can also be found among Amazon's public data sets, but our first step was to create a program that we could run and test locally, and only then run it on AWS.

As sample data we downloaded 20 nucleotide sequences from the GenBank search page. We noticed that the downloaded file is in FASTA format - each record begins with a single-line description, followed by lines of sequence data. Since this is a special format, we needed a custom input format - the FastaInputFormat used in the job setup below - which feeds the mapper one sequence per record, with the description line as the key and the sequence data as the value.
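
The FastaInputFormat itself is not listed in this post, so here is a minimal sketch of what such an input format could look like with the old mapred API - an illustrative assumption, not necessarily our exact implementation. It treats every FASTA file as a single split and emits one (description, sequence) record at a time:

import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;

import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileSplit;
import org.apache.hadoop.mapred.InputSplit;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.RecordReader;
import org.apache.hadoop.mapred.Reporter;

public class FastaInputFormat extends FileInputFormat<Text, Text> {

	@Override
	protected boolean isSplitable(FileSystem fs, Path filename) {
		// Keep each FASTA file in one split so a sequence is never cut in half.
		return false;
	}

	@Override
	public RecordReader<Text, Text> getRecordReader(InputSplit split, JobConf job, Reporter reporter) throws IOException {
		return new FastaRecordReader((FileSplit) split, job);
	}

	public static class FastaRecordReader implements RecordReader<Text, Text> {
		private final BufferedReader reader;
		private final long length;
		private long pos;
		private String pendingHeader;

		public FastaRecordReader(FileSplit split, JobConf job) throws IOException {
			Path path = split.getPath();
			length = split.getLength();
			reader = new BufferedReader(new InputStreamReader(path.getFileSystem(job).open(path)));
		}

		@Override
		public boolean next(Text key, Text value) throws IOException {
			String header = pendingHeader;
			pendingHeader = null;
			String line;
			// Skip ahead to the next description line if we are not already on one.
			while (header == null && (line = reader.readLine()) != null) {
				pos += line.length() + 1;
				if (line.startsWith(">"))
					header = line;
			}
			if (header == null)
				return false;
			// Collect sequence lines until the next description line or end of file.
			StringBuilder sequence = new StringBuilder();
			while ((line = reader.readLine()) != null) {
				pos += line.length() + 1;
				if (line.startsWith(">")) {
					pendingHeader = line;
					break;
				}
				sequence.append(line.trim());
			}
			key.set(header);
			value.set(sequence.toString());
			return true;
		}

		@Override
		public Text createKey() { return new Text(); }

		@Override
		public Text createValue() { return new Text(); }

		@Override
		public long getPos() { return pos; }

		@Override
		public float getProgress() { return length == 0 ? 1.0f : Math.min(1.0f, pos / (float) length); }

		@Override
		public void close() throws IOException { reader.close(); }
	}
}

Making the files non-splittable keeps the reader simple; for very large FASTA files you would want a split-aware reader instead.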

The logic of the Map class is to take the length of the sequence, rounded down to the nearest thousand, as the map output key, and the numerical value “one” as the output value - so, for example, a 2,500-base sequence lands in the 2000 bucket. The reducer then just needs to sum up all the “one” values per key. Here is the simple MapReduce program:

import java.io.IOException;
import java.util.Iterator;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;
import org.apache.hadoop.mapred.TextOutputFormat;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

public final class GenomeMapReduce extends Configured implements Tool {

	public static void main(String[] args) throws Exception {
		int res = ToolRunner.run(new Configuration(), new GenomeMapReduce(), args);
		System.exit(res);
	}

	/** Emits (sequence length rounded down to the nearest 1000, 1) for every sequence. */
	public static class MapClass extends MapReduceBase implements Mapper<Text, Text, IntWritable, IntWritable> {
		private static final IntWritable one = new IntWritable(1);

		private IntWritable lineLength = new IntWritable();

		@Override
		public void map(Text key, Text value, OutputCollector<IntWritable, IntWritable> output, Reporter reporter) throws IOException {
			lineLength.set(1000 * (value.toString().length() / 1000));
			output.collect(lineLength, one);
		}
	}

	/** Sums the "one" values per length bucket, producing the histogram counts. */
	public static class Reduce extends MapReduceBase implements Reducer<IntWritable, IntWritable, IntWritable, IntWritable> {
		@Override
		public void reduce(IntWritable key, Iterator<IntWritable> values, OutputCollector<IntWritable, IntWritable> output, Reporter reporter) throws IOException {
			int count = 0;
			while (values.hasNext())
				count += values.next().get();
			output.collect(key, new IntWritable(count));
		}
	}

	@Override
	public int run(String[] args) throws Exception {
		Configuration conf = getConf();
		JobConf job = new JobConf(conf, GenomeMapReduce.class);
		Path in = new Path(args[0]);
		Path out = new Path(args[1]);
		FileInputFormat.setInputPaths(job, in);
		FileOutputFormat.setOutputPath(job, out);
		job.setJobName("GenomeMapReduceJob");
		job.setMapperClass(MapClass.class);
		job.setReducerClass(Reduce.class);
		// FastaInputFormat is our custom input format for the FASTA files described above.
		job.setInputFormat(FastaInputFormat.class);
		job.setOutputFormat(TextOutputFormat.class);
		job.setOutputKeyClass(IntWritable.class);
		job.setOutputValueClass(IntWritable.class);
		JobClient.runJob(job);
		return 0;
	}

}
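
Since the pom.xml already declares JUnit and Mockito as test dependencies, the mapper can be unit-tested locally without any cluster. Here is a minimal sketch of such a test (the test class and the chosen sequence length are illustrative, not part of the original project):

import static org.mockito.Mockito.mock;
import static org.mockito.Mockito.verify;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;
import org.junit.Test;

public class GenomeMapReduceTest {

	@Test
	@SuppressWarnings("unchecked")
	public void mapperBucketsSequenceLengthToTheNearestThousand() throws Exception {
		OutputCollector<IntWritable, IntWritable> output = mock(OutputCollector.class);

		// Build a 2,500-character dummy sequence; it should land in the 2000 bucket.
		StringBuilder sequence = new StringBuilder();
		for (int i = 0; i < 2500; i++)
			sequence.append('A');

		new GenomeMapReduce.MapClass().map(new Text(">dummy description"), new Text(sequence.toString()), output, Reporter.NULL);

		verify(output).collect(new IntWritable(2000), new IntWritable(1));
	}
}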


Now, all you need to do is run “mvn install”, and you will get a jar file containing our compiled classes, with their runtime dependencies in the lib directory inside that jar.
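
For example (the artifact name comes from the pom.xml above, and listing the jar is just a quick sanity check):

mvn clean install
jar tf target/genome-1.0-SNAPSHOT-job.jar

In the listing you should see the unpacked com/tikal/fuseday/hadoop classes at the jar root and lib/guava-12.0.jar, while hadoop-core stays out because it is a provided dependency.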


In order to run this, you need to put the downloaded data file into HDFS - the Hadoop distributed file system:

hadoop fs -put /Users/yanaif/Downloads/sequence.fasta /user/yanaif/sequence.fasta


Then you can run the job you just built on the “pseudo-distributed” Hadoop running on your machine:

hadoop jar target/genome-1.0-SNAPSHOT-job.jar /user/yanaif/sequence.fasta genome-output


In order to inspect the output you can cat the output file; each line is a length bucket followed by the number of sequences that fall into it:


hadoop fs -cat genome-output/part-00000


1000	4
2000	3
3000	9
4000	2
5000	1
6000	1


Putting this data into Excel produces a sample nucleotide length frequency histogram, revealing the frequencies and their distribution.


Now we are ready to deploy our self-contained jar on AWS and run it against the “real” data set, which contains 200GB. We'll discuss this in our next post.