Monday, 25 April 2016

Hadoop Multi Node Cluster

This chapter explains the setup of the Hadoop Multi-Node cluster on a distributed environment.

As the whole cluster cannot be demonstrated, we are explaining the Hadoop cluster environment using three systems (one master and two slaves); given below are their IP addresses.

  • Hadoop Master: 192.168.1.15 (hadoop-master)
  • Hadoop Slave: 192.168.1.16 (hadoop-slave-1)
  • Hadoop Slave: 192.168.1.17 (hadoop-slave-2)

Follow the steps given below to set up the Hadoop multi-node cluster.

Installing Java

Java is the main prerequisite for Hadoop. First, verify that Java is installed on your system by running the "java -version" command, as shown below.

$ java -version

If everything works fine it will give you the following output.

java version "1.7.0_71"
Java(TM) SE Runtime Environment (build 1.7.0_71-b13)
Java HotSpot(TM) Client VM (build 25.0-b02, mixed mode)

If java is not installed in your system, then follow the given steps for installing java.

Step 1

Download Java (JDK - X64.tar.gz) by visiting the following link: http://www.oracle.com/technetwork/java/javase/downloads/jdk7-downloads-1880260.html

Then jdk-7u71-linux-x64.tar.gz will be downloaded into your system.

Step 2

Generally, you will find the downloaded Java file in the Downloads folder. Verify it and extract the jdk-7u71-linux-x64.tar.gz file using the following commands.

$ cd Downloads/
$ ls
jdk-7u71-linux-x64.tar.gz
$ tar zxf jdk-7u71-linux-x64.tar.gz
$ ls
jdk1.7.0_71 jdk-7u71-linux-x64.tar.gz

Step 3

To make Java available to all users, move it to the location "/usr/local/". Switch to the root user and type the following commands.

$ su
password:
# mv jdk1.7.0_71 /usr/local/
# exit

Step 4

To set up the PATH and JAVA_HOME variables, add the following commands to the ~/.bashrc file.

export JAVA_HOME=/usr/local/jdk1.7.0_71
export PATH=$PATH:$JAVA_HOME/bin

Now verify the java -version command from the terminal as explained above. Follow the above process and install java in all your cluster nodes.

Creating User Account

Create a system user account on both master and slave systems to use the Hadoop installation.

# useradd hadoop
# passwd hadoop

Mapping the nodes

Edit the hosts file in the /etc/ folder on all nodes, and specify the IP address of each system followed by its host name.

# vi /etc/hosts

Enter the following lines in the /etc/hosts file.

192.168.1.15 hadoop-master
192.168.1.16 hadoop-slave-1
192.168.1.17 hadoop-slave-2
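Because every node needs the same mapping, it can help to generate the lines from one place instead of typing them on each machine. The following is a minimal Python sketch and not part of the original setup: the helper name render_hosts_entries is hypothetical, and the addresses are the ones listed at the start of this chapter.

```python
# Hypothetical helper: renders /etc/hosts lines from a single mapping so that
# every node in the cluster receives identical entries.
import ipaddress

# Addresses taken from the node list at the start of this chapter.
CLUSTER_NODES = {
    "hadoop-master": "192.168.1.15",
    "hadoop-slave-1": "192.168.1.16",
    "hadoop-slave-2": "192.168.1.17",
}


def render_hosts_entries(nodes):
    """Return one 'IP hostname' line per node, rejecting bad or duplicate IPs."""
    seen = set()
    lines = []
    for hostname, ip in nodes.items():
        ipaddress.ip_address(ip)  # raises ValueError if the address is malformed
        if ip in seen:
            raise ValueError("duplicate IP address: %s" % ip)
        seen.add(ip)
        lines.append("%s %s" % (ip, hostname))
    return lines


if __name__ == "__main__":
    print("\n".join(render_hosts_entries(CLUSTER_NODES)))
```

The printed lines can then be appended to /etc/hosts on each node. Rejecting duplicate addresses catches the most common copy-and-paste mistake before it reaches the cluster.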

Configuring Key Based Login

Set up ssh on every node so that the nodes can communicate with one another without a password prompt.

# su hadoop
$ ssh-keygen -t rsa
$ ssh-copy-id -i ~/.ssh/id_rsa.pub hadoop@hadoop-master
$ ssh-copy-id -i ~/.ssh/id_rsa.pub hadoop@hadoop-slave-1
$ ssh-copy-id -i ~/.ssh/id_rsa.pub hadoop@hadoop-slave-2
$ chmod 0600 ~/.ssh/authorized_keys
$ exit

Installing Hadoop

In the Master server, download and install Hadoop using the following commands.

# mkdir /opt/hadoop
# cd /opt/hadoop/
# wget http://apache.mesi.com.ar/hadoop/common/hadoop-1.2.1/hadoop-1.2.0.tar.gz
# tar -xzf hadoop-1.2.0.tar.gz
# mv hadoop-1.2.0 hadoop
# chown -R hadoop /opt/hadoop
# cd /opt/hadoop/hadoop/

Configuring Hadoop

Configure the Hadoop server by making the changes given below.

core-site.xml

Open the core-site.xml file and edit it as shown below.

<configuration>
   <property>
      <name>fs.default.name</name>
      <value>hdfs://hadoop-master:9000/</value>
   </property>
   <property>
      <name>dfs.permissions</name>
      <value>false</value>
   </property>
</configuration>

hdfs-site.xml

Open the hdfs-site.xml file and edit it as shown below.

<configuration>
   <property>
      <name>dfs.data.dir</name>
      <value>/opt/hadoop/hadoop/dfs/name/data</value>
      <final>true</final>
   </property>

   <property>
      <name>dfs.name.dir</name>
      <value>/opt/hadoop/hadoop/dfs/name</value>
      <final>true</final>
   </property>

   <property>
      <name>dfs.replication</name>
      <value>1</value>
   </property>
</configuration>

mapred-site.xml

Open the mapred-site.xml file and edit it as shown below.

<configuration>
   <property>
      <name>mapred.job.tracker</name>
      <value>hadoop-master:9001</value>
   </property>
</configuration>

hadoop-env.sh

Open the hadoop-env.sh file and edit JAVA_HOME, HADOOP_CONF_DIR, and HADOOP_OPTS as shown below.

Note: Set the JAVA_HOME as per your system configuration.

export JAVA_HOME=/opt/jdk1.7.0_17
export HADOOP_OPTS=-Djava.net.preferIPv4Stack=true
export HADOOP_CONF_DIR=/opt/hadoop/hadoop/conf
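The core-site.xml, hdfs-site.xml, and mapred-site.xml files above all share one layout: a configuration element holding property entries, each with a name and a value. As an illustration only, the sketch below builds and reads back that layout with Python's standard xml.etree.ElementTree module. The function names build_site_xml and read_site_xml are hypothetical, and the optional final element is deliberately left out to keep the example short.

```python
# Sketch: build a Hadoop-style <configuration> document from a Python dict
# and parse it back. Only <name> and <value> are handled; <final> is omitted.
import xml.etree.ElementTree as ET


def build_site_xml(properties):
    """Return the XML text for a *-site.xml file holding the given properties."""
    root = ET.Element("configuration")
    for name, value in properties.items():
        prop = ET.SubElement(root, "property")
        ET.SubElement(prop, "name").text = name
        ET.SubElement(prop, "value").text = str(value)
    return ET.tostring(root, encoding="unicode")


def read_site_xml(xml_text):
    """Parse a *-site.xml document back into a {name: value} dict of strings."""
    root = ET.fromstring(xml_text)
    return {p.findtext("name"): p.findtext("value") for p in root.findall("property")}


if __name__ == "__main__":
    print(build_site_xml({"mapred.job.tracker": "hadoop-master:9001"}))
```

Reading a file back with read_site_xml is a quick way to confirm that a hand-edited configuration is still well-formed XML before the daemons are started.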

Installing Hadoop on Slave Servers

Install Hadoop on all the slave servers by following the given commands.

# su hadoop
$ cd /opt/hadoop
$ scp -r hadoop hadoop-slave-1:/opt/hadoop
$ scp -r hadoop hadoop-slave-2:/opt/hadoop

Configuring Hadoop on Master Server

Open the master server and configure it by following the given commands.

# su hadoop
$ cd /opt/hadoop/hadoop

Configuring Master Node

$ vi conf/masters

hadoop-master

Configuring Slave Node

$ vi conf/slaves

hadoop-slave-1
hadoop-slave-2

Format Name Node on Hadoop Master

# su hadoop
$ cd /opt/hadoop/hadoop
$ bin/hadoop namenode -format

11/10/14 10:58:07 INFO namenode.NameNode: STARTUP_MSG:
/************************************************************
STARTUP_MSG: Starting NameNode
STARTUP_MSG: host = hadoop-master/192.168.1.15
STARTUP_MSG: args = [-format]
STARTUP_MSG: version = 1.2.0
STARTUP_MSG: build = https://svn.apache.org/repos/asf/hadoop/common/branches/branch-1.2 -r 1479473; compiled by 'hortonfo' on Mon May 6 06:59:37 UTC 2013
STARTUP_MSG: java = 1.7.0_71
************************************************************/
11/10/14 10:58:08 INFO util.GSet: Computing capacity for map BlocksMap
editlog=/opt/hadoop/hadoop/dfs/name/current/edits
………………………………………………….
………………………………………………….
………………………………………………….
11/10/14 10:58:08 INFO common.Storage: Storage directory /opt/hadoop/hadoop/dfs/name has been successfully formatted.
11/10/14 10:58:08 INFO namenode.NameNode: SHUTDOWN_MSG:
/************************************************************
SHUTDOWN_MSG: Shutting down NameNode at hadoop-master/192.168.1.15
************************************************************/

Starting Hadoop Services

The following commands start all the Hadoop services on the Hadoop-Master.

$ cd $HADOOP_HOME/bin
$ ./start-all.sh

Adding a New DataNode in the Hadoop Cluster

Given below are the steps to be followed for adding new nodes to a Hadoop cluster.

Networking

Add new nodes to an existing Hadoop cluster with some appropriate network configuration. Assume the following network configuration.

For New node Configuration:

IP address : 192.168.1.103
netmask : 255.255.255.0
hostname : slave3.in

Adding User and SSH Access

Add a User

On the new node, add a "hadoop" user and set its password to "hadoop123", or anything you want, by using the following commands.

useradd hadoop
passwd hadoop

Set up passwordless connectivity from the master to the new slave.

Execute the following on the master

mkdir -p $HOME/.ssh
chmod 700 $HOME/.ssh
ssh-keygen -t rsa -P '' -f $HOME/.ssh/id_rsa
cat $HOME/.ssh/id_rsa.pub >> $HOME/.ssh/authorized_keys
chmod 644 $HOME/.ssh/authorized_keys

Copy the public key to the new slave node, into the hadoop user's $HOME directory.

scp $HOME/.ssh/id_rsa.pub hadoop@192.168.1.103:/home/hadoop/

Execute the following on the slaves

Log in to the new node as the hadoop user, either by switching user locally or over ssh.

su hadoop or ssh -X hadoop@192.168.1.103

Copy the content of the public key into the file "$HOME/.ssh/authorized_keys" and then change its permissions by executing the following commands.

cd $HOME
mkdir -p $HOME/.ssh
chmod 700 $HOME/.ssh
cat id_rsa.pub >>$HOME/.ssh/authorized_keys
chmod 644 $HOME/.ssh/authorized_keys

Now check from the master machine that you can ssh to the new node without a password.

ssh hadoop@192.168.1.103 or ssh hadoop@slave3

Set Hostname of New Node

You can set the hostname in the file /etc/sysconfig/network.

On new slave3 machine

NETWORKING=yes
HOSTNAME=slave3.in

To make the changes effective, either restart the machine or run the hostname command on the new machine with the respective hostname (a restart is the better option).

On slave3 node machine:

hostname slave3.in

Update /etc/hosts on all machines of the cluster with the following lines:

192.168.1.103 slave3.in slave3

Now ping the machines by hostname to check whether the names resolve to the right IP addresses.

On new node machine:

ping master.in

Start the DataNode on New Node

Start the datanode daemon manually using $HADOOP_HOME/bin/hadoop-daemon.sh script. It will automatically contact the master (NameNode) and join the cluster. We should also add the new node to the conf/slaves file in the master server. The script-based commands will recognize the new node.

Login to new node

su hadoop or ssh -X hadoop@192.168.1.103

Start HDFS on a newly added slave node by using the following command

./bin/hadoop-daemon.sh start datanode

Check the output of jps command on a new node. It looks as follows.

$ jps
7141 DataNode
10312 Jps
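When several nodes are added, reading jps output by eye becomes tedious. The sketch below is one possible way to check the output programmatically. It assumes the two-column "PID name" layout shown above, and the function name running_daemons is hypothetical.

```python
# Sketch: turn the text printed by `jps` into a {process name: PID} mapping,
# assuming each line has the two-column "PID name" layout shown above.
def running_daemons(jps_output):
    """Map each process name in `jps` output to its PID, ignoring jps itself."""
    daemons = {}
    for line in jps_output.splitlines():
        parts = line.split()
        if len(parts) != 2 or not parts[0].isdigit():
            continue  # skip blank lines and entries without a class name
        pid, name = parts
        if name != "Jps":
            daemons[name] = int(pid)
    return daemons


if __name__ == "__main__":
    sample = "7141 DataNode\n10312 Jps\n"
    print("DataNode" in running_daemons(sample))
```

On a real node, the text would come from running jps (for example through subprocess.check_output) rather than from a literal string.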

Removing a DataNode from the Hadoop Cluster

We can remove a node from a cluster on the fly, while it is running, without any data loss. HDFS provides a decommissioning feature, which ensures that removing a node is performed safely. To use it, follow the steps as given below:

Step 1: Login to master

Log in to the master machine as the user under which Hadoop is installed.

$ su hadoop

Step 2: Change cluster configuration

An exclude file must be configured before starting the cluster. Add a key named dfs.hosts.exclude to the $HADOOP_HOME/conf/hdfs-site.xml file. The value associated with this key provides the full path to a file on the NameNode's local file system that contains a list of machines which are not permitted to connect to HDFS.

For example, add these lines to the conf/hdfs-site.xml file.

<property>
   <name>dfs.hosts.exclude</name>
   <value>/home/hadoop/hadoop-1.2.1/hdfs_exclude.txt</value>
   <description>DFS exclude</description>
</property>

Step 3: Determine hosts to decommission

Each machine to be decommissioned should be added to the file identified by dfs.hosts.exclude (hdfs_exclude.txt in this example), one domain name per line. This will prevent them from connecting to the NameNode. If you want to remove DataNode2, the content of the "/home/hadoop/hadoop-1.2.1/hdfs_exclude.txt" file is as shown below.

slave2.in
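Since the exclude file is a plain list with one host per line, it is easy to maintain with a small script. The following Python sketch shows one way to add a host without creating duplicate entries. The function name add_to_exclude_file is hypothetical, and the script only edits the file; the dfsadmin -refreshNodes step still has to be run separately.

```python
# Sketch: append a hostname to the exclude file only if it is not listed yet.
# The file format assumed here is the one described above: one host per line.
import os


def add_to_exclude_file(path, hostname):
    """Add hostname to the exclude file; return False if it was already there."""
    existing = []
    if os.path.exists(path):
        with open(path) as handle:
            existing = [line.strip() for line in handle if line.strip()]
    if hostname in existing:
        return False
    existing.append(hostname)
    with open(path, "w") as handle:
        handle.write("\n".join(existing) + "\n")
    return True
```

For the example above, the call would be add_to_exclude_file("/home/hadoop/hadoop-1.2.1/hdfs_exclude.txt", "slave2.in").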

Step 4: Force configuration reload

Run the command "$HADOOP_HOME/bin/hadoop dfsadmin -refreshNodes" without the quotes.

$ $HADOOP_HOME/bin/hadoop dfsadmin -refreshNodes

This will force the NameNode to re-read its configuration, including the newly updated 'excludes' file. It will decommission the nodes over a period of time, allowing time for each node's blocks to be replicated onto machines which are scheduled to remain active.

On slave2.in, check the jps command output. After some time, you will see that the DataNode process has shut down automatically.

Step 5: Shutdown nodes

After the decommission process has been completed, the decommissioned hardware can be safely shut down for maintenance. Run the dfsadmin report command to check the status of the decommission. The following command describes the status of the decommissioned node and of the nodes connected to the cluster.

$ $HADOOP_HOME/bin/hadoop dfsadmin -report

Step 6: Edit excludes file again

Once the machines have been decommissioned, they can be removed from the 'excludes' file. Running "$HADOOP_HOME/bin/hadoop dfsadmin -refreshNodes" again will read the excludes file back into the NameNode, allowing the DataNodes to rejoin the cluster after the maintenance has been completed, or when additional capacity is needed in the cluster again.

Special Note: If the above process is followed and the tasktracker process is still running on the node, it needs to be shut down. One way is to disconnect the machine as we did in the above steps. The master will recognize this automatically and declare the process dead. There is no need to follow the same process for removing the tasktracker, because it is not as crucial as the DataNode: the DataNode holds the data that you want to remove safely, without any loss.

The tasktracker can be stopped or started on the fly at any time using the following commands.

$ $HADOOP_HOME/bin/hadoop-daemon.sh stop tasktracker
$ $HADOOP_HOME/bin/hadoop-daemon.sh start tasktracker

Hadoop Streaming

Hadoop streaming is a utility that comes with the Hadoop distribution. This utility allows you to create and run Map/Reduce jobs with any executable or script as the mapper and/or the reducer.

Example Using Python

For Hadoop streaming, we consider the word-count problem. Any job in Hadoop must have two phases: mapper and reducer. We have written the mapper and the reducer as Python scripts to run under Hadoop. The same can also be written in Perl or Ruby.

Mapper Phase Code

#!/usr/bin/python

import sys

# Input is taken from standard input
for myline in sys.stdin:
   # Remove whitespace on either side
   myline = myline.strip()

   # Break the line into words
   words = myline.split()

   # Iterate over the words list
   for myword in words:
      # Write the results to standard output
      print('%s\t%s' % (myword, 1))

Make sure this file has execution permission (chmod +x /home/expert/hadoop-1.2.1/mapper.py).

Reducer Phase Code

#!/usr/bin/python

import sys

current_word = ""
current_count = 0
word = ""

# Input is taken from standard input
for myline in sys.stdin:
   # Remove whitespace on either side
   myline = myline.strip()

   # Split the input we got from mapper.py
   word, count = myline.split('\t', 1)

   # Convert count variable to integer
   try:
      count = int(count)
   except ValueError:
      # Count was not a number, so silently ignore this line
      continue

   if current_word == word:
      current_count += count
   else:
      if current_word:
         # Write result to standard output
         print('%s\t%s' % (current_word, current_count))
      current_count = count
      current_word = word

# Do not forget to output the last word if needed!
if current_word == word:
   print('%s\t%s' % (current_word, current_count))

Save the mapper and reducer code as mapper.py and reducer.py in the Hadoop home directory. Make sure these files have execution permission (chmod +x mapper.py and chmod +x reducer.py). Python is indentation-sensitive, so take care to preserve the indentation when copying the code.
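Before submitting the scripts to a cluster, their logic can be exercised on a single machine. The sketch below is an illustration rather than part of the streaming utility: it restates the mapper and the reducer as Python functions (map_lines, reduce_lines, and run_local are hypothetical names) and uses sorted() to stand in for the sorting that Hadoop performs between the two phases.

```python
# Sketch: a single-machine imitation of "mapper | sort | reducer" for word count.
from itertools import groupby


def map_lines(lines):
    """Mimic mapper.py: emit 'word<TAB>1' for every word of every input line."""
    for line in lines:
        for word in line.strip().split():
            yield "%s\t%d" % (word, 1)


def reduce_lines(sorted_lines):
    """Mimic reducer.py: sum the counts of consecutive lines sharing a key."""
    pairs = (line.split("\t", 1) for line in sorted_lines)
    for word, group in groupby(pairs, key=lambda pair: pair[0]):
        yield "%s\t%d" % (word, sum(int(count) for _, count in group))


def run_local(lines):
    """Map, sort (standing in for Hadoop's shuffle and sort), then reduce."""
    return list(reduce_lines(sorted(map_lines(lines))))


if __name__ == "__main__":
    for result in run_local(["the cat", "the dog"]):
        print(result)
```

The reducer depends on its input being sorted by key, which is why the sort step cannot be skipped even in a local test.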

Execution of WordCount Program

$ $HADOOP_HOME/bin/hadoop jar contrib/streaming/hadoop-streaming-1.2.1.jar \
   -input input_dirs \
   -output output_dir \
   -mapper <path>/mapper.py \
   -reducer <path>/reducer.py

Where "\" is used for line continuation for clear readability.

For Example,

./bin/hadoop jar contrib/streaming/hadoop-streaming-1.2.1.jar -input myinput -output myoutput -mapper /home/expert/hadoop-1.2.1/mapper.py -reducer /home/expert/hadoop-1.2.1/reducer.py

How Streaming Works

In the above example, both the mapper and the reducer are python scripts that read the input from standard input and emit the output to standard output. The utility will create a Map/Reduce job, submit the job to an appropriate cluster, and monitor the progress of the job until it completes.

When a script is specified for mappers, each mapper task will launch the script as a separate process when the mapper is initialized. As the mapper task runs, it converts its inputs into lines and feeds the lines to the standard input (STDIN) of the process. In the meantime, the mapper collects the line-oriented outputs from the standard output (STDOUT) of the process and converts each line into a key/value pair, which is collected as the output of the mapper. By default, the prefix of a line up to the first tab character is the key and the rest of the line (excluding the tab character) is the value. If there is no tab character in the line, then the entire line is considered the key and the value is null. However, this can be customized as needed.

When a script is specified for reducers, each reducer task will launch the script as a separate process when the reducer is initialized. As the reducer task runs, it converts its input key/value pairs into lines and feeds the lines to the standard input (STDIN) of the process. In the meantime, the reducer collects the line-oriented outputs from the standard output (STDOUT) of the process and converts each line into a key/value pair, which is collected as the output of the reducer. By default, the prefix of a line up to the first tab character is the key and the rest of the line (excluding the tab character) is the value. However, this can be customized as per specific requirements.
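The default key/value rule described above can be written down in a few lines. This Python sketch only illustrates the rule and is not the utility's own code; the function name split_key_value is hypothetical, and None is used here to represent the null value.

```python
# Sketch of the default rule: the text before the first tab is the key, the
# rest of the line is the value, and a line without a tab is all key.
def split_key_value(line, separator="\t"):
    """Split one output line into (key, value) at the first separator."""
    line = line.rstrip("\n")
    key, found, value = line.partition(separator)
    if not found:
        return line, None  # no separator: whole line is the key, value is null
    return key, value


if __name__ == "__main__":
    print(split_key_value("hadoop\t1\n"))
```

Only the first tab matters, so any further tabs stay inside the value.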

Important Commands

Parameters | Description
-input directory/file-name | Input location for mapper. (Required)
-output directory-name | Output location for reducer. (Required)
-mapper executable or script or JavaClassName | Mapper executable. (Required)
-reducer executable or script or JavaClassName | Reducer executable. (Required)
-file file-name | Makes the mapper, reducer, or combiner executable available locally on the compute nodes.
-inputformat JavaClassName | Class you supply should return key/value pairs of Text class. If not specified, TextInputFormat is used as the default.
-outputformat JavaClassName | Class you supply should take key/value pairs of Text class. If not specified, TextOutputFormat is used as the default.
-partitioner JavaClassName | Class that determines which reduce a key is sent to.
-combiner streamingCommand or JavaClassName | Combiner executable for map output.
-cmdenv name=value | Passes the environment variable to streaming commands.
-inputreader | For backwards-compatibility: specifies a record reader class (instead of an input format class).
-verbose | Verbose output.
-lazyOutput | Creates output lazily. For example, if the output format is based on FileOutputFormat, the output file is created only on the first call to output.collect (or Context.write).
-numReduceTasks | Specifies the number of reducers.
-mapdebug | Script to call when map task fails.
-reducedebug | Script to call when reduce task fails.

Hadoop MapReduce

  • MapReduce is a framework using which we can write applications to process huge amounts of data, in parallel, on large clusters of commodity hardware in a reliable manner.

What is MapReduce?

MapReduce is a processing technique and a programming model for distributed computing based on Java. The MapReduce algorithm contains two important tasks, namely Map and Reduce. Map takes a set of data and converts it into another set of data, where individual elements are broken down into tuples (key/value pairs). The Reduce task then takes the output from a map as its input and combines those data tuples into a smaller set of tuples. As the sequence of the name MapReduce implies, the reduce task is always performed after the map job.

The major advantage of MapReduce is that it is easy to scale data processing over multiple computing nodes. Under the MapReduce model, the data processing primitives are called mappers and reducers. Decomposing a data processing application into mappers and reducers is sometimes nontrivial. But, once we write an application in the MapReduce form, scaling the application to run over hundreds, thousands, or even tens of thousands of machines in a cluster is merely a configuration change. This simple scalability is what has attracted many programmers to use the MapReduce model.

The Algorithm

  • Generally, the MapReduce paradigm is based on sending the computation to where the data resides.
  • A MapReduce program executes in three stages, namely the map stage, the shuffle stage, and the reduce stage.

    • Map stage : The map or mapper's job is to process the input data. Generally the input data is in the form of a file or directory and is stored in the Hadoop file system (HDFS). The input file is passed to the mapper function line by line. The mapper processes the data and creates several small chunks of data.
    • Reduce stage : This stage is the combination of the Shuffle stage and the Reduce stage. The Reducer's job is to process the data that comes from the mapper. After processing, it produces a new set of output, which will be stored in the HDFS.
  • During a MapReduce job, Hadoop sends the Map and Reduce tasks to the appropriate servers in the cluster.
  • The framework manages all the details of data-passing such as issuing tasks, verifying task completion, and copying data around the cluster between the nodes.
  • Most of the computing takes place on nodes with data on local disks, which reduces the network traffic.
  • After completion of the given tasks, the cluster collects and reduces the data to form an appropriate result, and sends it back to the Hadoop server.

[Figure: MapReduce Algorithm]

Inputs and Outputs (Java Perspective)

The MapReduce framework operates on <key, value> pairs, that is, the framework views the input to the job as a set of <key, value> pairs and produces a set of <key, value> pairs as the output of the job, conceivably of different types.

The key and the value classes must be serializable by the framework and hence need to implement the Writable interface. Additionally, the key classes have to implement the WritableComparable interface to facilitate sorting by the framework. Input and Output types of a MapReduce job: (Input) <k1, v1> -> map -> <k2, v2> -> reduce -> <k3, v3> (Output).

 | Input | Output
Map | <k1, v1> | list (<k2, v2>)
Reduce | <k2, list(v2)> | list (<k3, v3>)
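The type flow in the table can be made concrete with a small in-memory model. The Python sketch below is only an illustration of the data flow, not of Hadoop's Java API: map_reduce, length_mapper, and count_reducer are hypothetical names, and the grouping dictionary stands in for the shuffle that the framework performs between the map and reduce phases.

```python
# Sketch: an in-memory model of <k1, v1> -> map -> <k2, v2> -> reduce -> <k3, v3>.
from collections import defaultdict


def map_reduce(records, mapper, reducer):
    """Run mapper over (k1, v1) records, group by k2, then run reducer."""
    groups = defaultdict(list)
    for k1, v1 in records:
        for k2, v2 in mapper(k1, v1):            # map: <k1, v1> -> list(<k2, v2>)
            groups[k2].append(v2)                # shuffle: collect values per key
    output = []
    for k2 in sorted(groups):                    # keys are handed over in sorted order
        output.extend(reducer(k2, groups[k2]))   # reduce: <k2, list(v2)> -> list(<k3, v3>)
    return output


def length_mapper(offset, line):
    """<k1, v1> = (offset, line) -> list of <k2, v2> = (word length, word)."""
    return [(len(word), word) for word in line.split()]


def count_reducer(length, words):
    """<k2, list(v2)> = (length, words) -> list of <k3, v3> = (length, count)."""
    return [(length, len(words))]


if __name__ == "__main__":
    records = [(0, "map reduce job"), (15, "key value")]
    print(map_reduce(records, length_mapper, count_reducer))
```

The example counts words by length, so the intermediate key type (an integer) differs from the input key type (an offset), which is the "conceivably of different types" case mentioned above.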

Terminology

  • PayLoad - Applications implement the Map and the Reduce functions, and form the core of the job.
  • Mapper - Maps the input key/value pairs to a set of intermediate key/value pairs.
  • NameNode - Node that manages the Hadoop Distributed File System (HDFS).
  • DataNode - Node where the data is stored before any processing takes place.
  • MasterNode - Node where the JobTracker runs and which accepts job requests from clients.
  • SlaveNode - Node where the Map and Reduce programs run.
  • JobTracker - Schedules jobs and tracks the jobs assigned to the TaskTracker.
  • TaskTracker - Tracks the tasks and reports status to the JobTracker.
  • Job - An execution of a Mapper and a Reducer across a dataset.
  • Task - An execution of a Mapper or a Reducer on a slice of data.
  • Task Attempt - A particular instance of an attempt to execute a task on a SlaveNode.

Example Scenario

Given below is the data regarding the electrical consumption of an organization. It contains the monthly electrical consumption and the annual average for various years.

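Year | Jan | Feb | Mar | Apr | May | Jun | Jul | Aug | Sep | Oct | Nov | Dec | Avg
1979 | 23 | 23 | 2 | 43 | 24 | 25 | 26 | 26 | 26 | 26 | 25 | 26 | 25
1980 | 26 | 27 | 28 | 28 | 28 | 30 | 31 | 31 | 31 | 30 | 30 | 30 | 29
1981 | 31 | 32 | 32 | 32 | 33 | 34 | 35 | 36 | 36 | 34 | 34 | 34 | 34
1984 | 39 | 38 | 39 | 39 | 39 | 41 | 42 | 43 | 40 | 39 | 38 | 38 | 40
1985 | 38 | 39 | 39 | 39 | 39 | 41 | 41 | 41 | 00 | 40 | 39 | 39 | 45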

If the above data is given as input, we have to write applications to process it and produce results such as finding the year of maximum usage, the year of minimum usage, and so on. This is straightforward for programmers when the number of records is small. They will simply write the logic to produce the required output, and pass the data to the application written.

But think of the data representing the electrical consumption of all the large-scale industries of a particular state, since its formation.

When we write applications to process such bulk data,

  • They will take a lot of time to execute.
  • There will be heavy network traffic when we move data from the source to the network server, and so on.

To solve these problems, we have the MapReduce framework.

Input Data

The above data is saved as sample.txt and given as input. The fields must be separated by tab characters, because the mapper in the example program splits each line on tabs. The input file looks as shown below.

1979   23   23   2   43   24   25   26   26   26   26   25   26   25
1980   26   27   28   28   28   30   31   31   31   30   30   30   29
1981   31   32   32   32   33   34   35   36   36   34   34   34   34
1984   39   38   39   39   39   41   42   43   40   39   38   38   40
1985   38   39   39   39   39   41   41   41   00   40   39   39   45

Example Program

Given below is the program that processes the sample data using the MapReduce framework.

package hadoop;

import java.util.*;
import java.io.IOException;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.conf.*;
import org.apache.hadoop.io.*;
import org.apache.hadoop.mapred.*;
import org.apache.hadoop.util.*;

public class ProcessUnits
{
   //Mapper class
   public static class E_EMapper extends MapReduceBase implements
   Mapper<LongWritable,  /*Input key Type */
   Text,                 /*Input value Type*/
   Text,                 /*Output key Type*/
   IntWritable>          /*Output value Type*/
   {
      //Map function
      public void map(LongWritable key, Text value,
      OutputCollector<Text, IntWritable> output,
      Reporter reporter) throws IOException
      {
         String line = value.toString();
         String lasttoken = null;
         StringTokenizer s = new StringTokenizer(line,"\t");
         String year = s.nextToken();

         while(s.hasMoreTokens())
         {
            lasttoken=s.nextToken();
         }

         int avgprice = Integer.parseInt(lasttoken);
         output.collect(new Text(year), new IntWritable(avgprice));
      }
   }

   //Reducer class
   public static class E_EReduce extends MapReduceBase implements
   Reducer< Text, IntWritable, Text, IntWritable >
   {
      //Reduce function
      public void reduce( Text key, Iterator <IntWritable> values,
      OutputCollector<Text, IntWritable> output, Reporter reporter) throws IOException
      {
         int maxavg=30;
         int val=Integer.MIN_VALUE;

         while (values.hasNext())
         {
            if((val=values.next().get())>maxavg)
            {
               output.collect(key, new IntWritable(val));
            }
         }
      }
   }

   //Main function
   public static void main(String args[])throws Exception
   {
      JobConf conf = new JobConf(ProcessUnits.class);

      conf.setJobName("max_eletricityunits");
      conf.setOutputKeyClass(Text.class);
      conf.setOutputValueClass(IntWritable.class);
      conf.setMapperClass(E_EMapper.class);
      conf.setCombinerClass(E_EReduce.class);
      conf.setReducerClass(E_EReduce.class);
      conf.setInputFormat(TextInputFormat.class);
      conf.setOutputFormat(TextOutputFormat.class);

      FileInputFormat.setInputPaths(conf, new Path(args[0]));
      FileOutputFormat.setOutputPath(conf, new Path(args[1]));

      JobClient.runJob(conf);
   }
}

Save the above program as ProcessUnits.java. The compilation and execution of the program is explained below.

Compilation and Execution of Process Units Program

Let us assume we are in the home directory of a Hadoop user (e.g. /home/hadoop).

Follow the steps given below to compile and execute the above program.

Step 1

The following command is to create a directory to store the compiled java classes.

$ mkdir units 

Step 2

Download hadoop-core-1.2.1.jar, which is used to compile and execute the MapReduce program. Visit http://mvnrepository.com/artifact/org.apache.hadoop/hadoop-core/1.2.1 to download the jar. Let us assume the download folder is /home/hadoop/.

Step 3

The following commands are used for compiling the ProcessUnits.java program and creating a jar for the program.

$ javac -classpath hadoop-core-1.2.1.jar -d units ProcessUnits.java
$ jar -cvf units.jar -C units/ .

Step 4

The following command is used to create an input directory in HDFS.

$HADOOP_HOME/bin/hadoop fs -mkdir input_dir 

Step 5

The following command is used to copy the input file named sample.txt into the input directory of HDFS.

$HADOOP_HOME/bin/hadoop fs -put /home/hadoop/sample.txt input_dir 

Step 6

The following command is used to verify the files in the input directory.

$HADOOP_HOME/bin/hadoop fs -ls input_dir/ 

Step 7

The following command is used to run the ProcessUnits application by taking the input files from the input directory.

$HADOOP_HOME/bin/hadoop jar units.jar hadoop.ProcessUnits input_dir output_dir 

Wait for a while until the job finishes. After execution, as shown below, the output will contain the number of input splits, the number of Map tasks, the number of reducer tasks, etc.

INFO mapreduce.Job: Job job_1414748220717_0002 completed successfully
14/10/31 06:02:52 INFO mapreduce.Job: Counters: 49

File System Counters
   FILE: Number of bytes read=61
   FILE: Number of bytes written=279400
   FILE: Number of read operations=0
   FILE: Number of large read operations=0
   FILE: Number of write operations=0
   HDFS: Number of bytes read=546
   HDFS: Number of bytes written=40
   HDFS: Number of read operations=9
   HDFS: Number of large read operations=0
   HDFS: Number of write operations=2

Job Counters
   Launched map tasks=2
   Launched reduce tasks=1
   Data-local map tasks=2
   Total time spent by all maps in occupied slots (ms)=146137
   Total time spent by all reduces in occupied slots (ms)=441
   Total time spent by all map tasks (ms)=14613
   Total time spent by all reduce tasks (ms)=44120
   Total vcore-seconds taken by all map tasks=146137
   Total vcore-seconds taken by all reduce tasks=44120
   Total megabyte-seconds taken by all map tasks=149644288
   Total megabyte-seconds taken by all reduce tasks=45178880

Map-Reduce Framework
   Map input records=5
   Map output records=5
   Map output bytes=45
   Map output materialized bytes=67
   Input split bytes=208
   Combine input records=5
   Combine output records=5
   Reduce input groups=5
   Reduce shuffle bytes=6
   Reduce input records=5
   Reduce output records=5
   Spilled Records=10
   Shuffled Maps =2
   Failed Shuffles=0
   Merged Map outputs=2
   GC time elapsed (ms)=948
   CPU time spent (ms)=5160
   Physical memory (bytes) snapshot=47749120
   Virtual memory (bytes) snapshot=2899349504
   Total committed heap usage (bytes)=277684224

File Output Format Counters
   Bytes Written=40

Step 8

The following command is used to verify the resultant files in the output folder.

$HADOOP_HOME/bin/hadoop fs -ls output_dir/ 

Step 9

The following command is used to see the output in the part-00000 file. This file is generated by the MapReduce job and stored in HDFS.

$HADOOP_HOME/bin/hadoop fs -cat output_dir/part-00000 

Below is the output generated by the MapReduce program.

1981    34
1984    40
1985    45
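The result above can be cross-checked without a cluster. The following Python sketch applies the same rule as the example program: it takes the first field as the year and the last field as the average, and keeps the years whose average exceeds 30. It is an illustration only; the names SAMPLE_ROWS and years_above_threshold are hypothetical, and the input is assumed to be tab-separated.

```python
# Sketch: reproduce the job's logic on the sample data in plain Python.
SAMPLE_ROWS = [
    "1979 23 23 2 43 24 25 26 26 26 26 25 26 25",
    "1980 26 27 28 28 28 30 31 31 31 30 30 30 29",
    "1981 31 32 32 32 33 34 35 36 36 34 34 34 34",
    "1984 39 38 39 39 39 41 42 43 40 39 38 38 40",
    "1985 38 39 39 39 39 41 41 41 00 40 39 39 45",
]

# sample.txt is assumed to be tab-separated, since the Java mapper splits on "\t".
SAMPLE_TEXT = "\n".join("\t".join(row.split()) for row in SAMPLE_ROWS)

THRESHOLD = 30  # mirrors maxavg in the reducer


def years_above_threshold(text, threshold=THRESHOLD):
    """Return sorted (year, average) pairs whose last column exceeds threshold."""
    results = []
    for line in text.splitlines():
        fields = line.split("\t")
        if len(fields) < 2:
            continue  # skip blank or malformed lines
        year, average = fields[0], int(fields[-1])
        if average > threshold:
            results.append((year, average))
    return sorted(results)


if __name__ == "__main__":
    for year, average in years_above_threshold(SAMPLE_TEXT):
        print("%s\t%d" % (year, average))
```

Only 1981, 1984, and 1985 have an average above 30, which matches the three lines in part-00000.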

Step 10

The following command is used to copy the output folder from HDFS to the local file system for analyzing.

$HADOOP_HOME/bin/hadoop fs -get output_dir /home/hadoop

Important Commands

All Hadoop commands are invoked by the $HADOOP_HOME/bin/hadoop command. Running the hadoop script without any arguments prints the description for all commands.

Usage : hadoop [--config confdir] COMMAND

The following table lists the options available and their description.

Options | Description
namenode -format | Formats the DFS filesystem.
secondarynamenode | Runs the DFS secondary namenode.
namenode | Runs the DFS namenode.
datanode | Runs a DFS datanode.
dfsadmin | Runs a DFS admin client.
mradmin | Runs a Map-Reduce admin client.
fsck | Runs a DFS filesystem checking utility.
fs | Runs a generic filesystem user client.
balancer | Runs a cluster balancing utility.
oiv | Applies the offline fsimage viewer to an fsimage.
fetchdt | Fetches a delegation token from the NameNode.
jobtracker | Runs the MapReduce job Tracker node.
pipes | Runs a Pipes job.
tasktracker | Runs a MapReduce task Tracker node.
historyserver | Runs job history servers as a standalone daemon.
job | Manipulates the MapReduce jobs.
queue | Gets information regarding JobQueues.
version | Prints the version.
jar <jar> | Runs a jar file.
distcp <srcurl> <desturl> | Copies file or directories recursively.
distcp2 <srcurl> <desturl> | DistCp version 2.
archive -archiveName NAME -p <parent path> <src>* <dest> | Creates a hadoop archive.
classpath | Prints the class path needed to get the Hadoop jar and the required libraries.
daemonlog | Gets/sets the log level for each daemon.

How to Interact with MapReduce Jobs

Usage: hadoop job [GENERIC_OPTIONS]

The following are the Generic Options available in a Hadoop job.

GENERIC_OPTIONS | Description
-submit <job-file> | Submits the job.
-status <job-id> | Prints the map and reduce completion percentage and all job counters.
-counter <job-id> <group-name> <countername> | Prints the counter value.
-kill <job-id> | Kills the job.
-events <job-id> <fromevent-#> <#-of-events> | Prints the events' details received by jobtracker for the given range.
-history [all] <jobOutputDir> | Prints job details, failed and killed tip details. More details about the job such as successful tasks and task attempts made for each task can be viewed by specifying the [all] option.
-list [all] | -list all displays all jobs. -list displays only jobs which are yet to complete.
-kill-task <task-id> | Kills the task. Killed tasks are NOT counted against failed attempts.
-fail-task <task-id> | Fails the task. Failed tasks are counted against failed attempts.
-set-priority <job-id> <priority> | Changes the priority of the job. Allowed priority values are VERY_HIGH, HIGH, NORMAL, LOW, VERY_LOW.

To see the status of job

$ $HADOOP_HOME/bin/hadoop job -status <JOB-ID>
e.g.
$ $HADOOP_HOME/bin/hadoop job -status job_201310191043_0004

To see the history of job output-dir

$ $HADOOP_HOME/bin/hadoop job -history <DIR-NAME>
e.g.
$ $HADOOP_HOME/bin/hadoop job -history /user/expert/output

To kill the job

$ $HADOOP_HOME/bin/hadoop job -kill <JOB-ID>
e.g.
$ $HADOOP_HOME/bin/hadoop job -kill job_201310191043_0004
