Sunday, January 8, 2012

Installing Apache Hadoop



It’s easy to install Hadoop on a single machine to try it out.

Prerequisites

Hadoop is written in Java, so you will need to have Java installed on your machine,
version 6 or later. Sun's JDK is the one most widely used with Hadoop, although others
have been reported to work.
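If you are unsure which version of Java is installed, you can check from the command line (assuming java is on your path):

% java -version

The reported version should be 1.6 or later.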

Hadoop runs on Unix and on Windows. Linux is the only supported production platform, but other flavors of Unix (including Mac OS X) can be used to run Hadoop for development. Windows is only supported as a development platform, and additionally requires Cygwin to run. During the Cygwin installation process, you should include the openssh package if you plan to run Hadoop in pseudo-distributed mode.

Installation

Start by deciding which user you’d like to run Hadoop as. For trying out Hadoop or
developing Hadoop programs, it is simplest to run Hadoop on a single machine using
your own user account.
Download a stable release, which is packaged as a gzipped tar file, from the Apache
Hadoop releases page (http://hadoop.apache.org/core/releases.html) and unpack it
somewhere on your filesystem:

% tar -xzf hadoop-x.y.z.tar.gz


Before you can run Hadoop, you need to tell it where Java is located on your system.

If you have the JAVA_HOME environment variable set to point to a suitable Java installation, that will be used, and you don’t have to configure anything further. Otherwise, you can set the Java installation that Hadoop uses by editing conf/hadoop-env.sh and specifying the JAVA_HOME variable.

 For example, on my Mac, the command is:
export JAVA_HOME=/System/Library/Frameworks/JavaVM.framework/Versions/1.6.0/Home

to point to version 1.6.0 of Java.

On Ubuntu, the equivalent line is:

export JAVA_HOME=/usr/lib/jvm/java-6-sun

It’s very convenient to create an environment variable that points to the Hadoop installation directory (HADOOP_INSTALL, say) and to put the Hadoop binary directory on your command-line path.

 For example:

% export HADOOP_INSTALL=/home/tom/hadoop-x.y.z
% export PATH=$PATH:$HADOOP_INSTALL/bin

Check that Hadoop runs by typing:

% hadoop version

Hadoop 0.20.0
Subversion https://svn.apache.org/repos/asf/hadoop/core/branches/branch-0.20 -r 763504
Compiled by ndaley on Thu Apr 9 05:18:40 UTC 2009

Configuration
Each component in Hadoop is configured using an XML file. Core properties go in
core-site.xml, HDFS properties go in hdfs-site.xml, and MapReduce properties go in
mapred-site.xml. These files are all located in the conf subdirectory.


Hadoop can be run in one of three modes:
Standalone (or local) mode
There are no daemons running and everything runs in a single JVM. Standalone
mode is suitable for running MapReduce programs during development, since it
is easy to test and debug them.

Pseudo-distributed mode
The Hadoop daemons run on the local machine, thus simulating a cluster on a
small scale.

Fully distributed mode
The Hadoop daemons run on a cluster of machines.

To run Hadoop in a particular mode, you need to do two things: set the appropriate properties, and start the Hadoop daemons. Table A-1 shows the minimal set of properties to configure each mode. In standalone mode, the local filesystem and the local MapReduce job runner are used, while in the distributed modes the HDFS and MapReduce daemons are started.
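The table itself is not reproduced here, but roughly, the key properties for each mode are as follows (values marked "default" need not be set explicitly):

Component   Property             Standalone           Pseudo-distributed   Fully distributed
Core        fs.default.name      file:/// (default)   hdfs://localhost/    hdfs://namenode/
HDFS        dfs.replication      N/A                  1                    3 (default)
MapReduce   mapred.job.tracker   local (default)      localhost:8021       jobtracker:8021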




Standalone Mode

In standalone mode, there is no further action to take, since the default properties are
set for standalone mode, and there are no daemons to run.
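As a quick sanity check in standalone mode, you can run one of the example MapReduce programs that ship with Hadoop against the local filesystem (the exact jar name depends on the release you downloaded):

% hadoop jar $HADOOP_INSTALL/hadoop-*-examples.jar pi 2 1000

This runs the pi estimator example with 2 maps and 1,000 samples per map, entirely in a single JVM.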


Pseudo-Distributed Mode


The configuration files should be created with the following contents, and placed in
the conf directory (although you can place configuration files in any directory as long
as you start the daemons with the --config option).


<?xml version="1.0"?>
<!-- core-site.xml -->
<configuration>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://localhost/</value>
  </property>
</configuration>

<?xml version="1.0"?>
<!-- hdfs-site.xml -->
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
</configuration>

<?xml version="1.0"?>
<!-- mapred-site.xml -->
<configuration>
  <property>
    <name>mapred.job.tracker</name>
    <value>localhost:8021</value>
  </property>
</configuration>




Configuring SSH


In pseudo-distributed mode, we have to start daemons, and to do that, we need to have
SSH installed. Hadoop doesn’t actually distinguish between pseudo-distributed and
fully distributed modes: it merely starts daemons on the set of hosts in the cluster
(defined by the slaves file) by SSH-ing to each host and starting a daemon process.
Pseudo-distributed mode is just a special case of fully distributed mode in which the
(single) host is localhost, so we need to make sure that we can SSH to localhost and log
in without having to enter a password.
First, make sure that SSH is installed and a server is running. On Ubuntu, for example,
this is achieved with:


% sudo apt-get install ssh


On Windows with Cygwin, you can set up an SSH server (after having
installed the openssh package) by running ssh-host-config -y.

Then to enable password-less login, generate a new SSH key with an empty passphrase:


% ssh-keygen -t rsa -P '' -f ~/.ssh/id_rsa

% cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
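On some systems, SSH will refuse key-based login if the permissions on the authorized_keys file are too open, so it may also be necessary to tighten them:

% chmod 600 ~/.ssh/authorized_keys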



Test this with:

% ssh localhost


You should be logged in without having to type a password.



Formatting the HDFS filesystem


Before it can be used, a brand-new HDFS installation needs to be formatted. The formatting process creates an empty filesystem by creating the storage directories and the initial versions of the namenode’s persistent data structures. Datanodes are not involved in the initial formatting process, since the namenode manages all of the filesystem’s metadata, and datanodes can join or leave the cluster dynamically. For the same reason, you don’t need to say how large a filesystem to create, since this is determined by the number of datanodes in the cluster, which can be increased as needed, long after the filesystem is formatted.

Formatting HDFS is quick to do. Just type the following:


% hadoop namenode -format


Starting and stopping the daemons
To start the HDFS and MapReduce daemons, type:


% start-dfs.sh

% start-mapred.sh



If you have placed configuration files outside the default conf directory, start the daemons with the --config option, which takes an absolute path to the configuration directory:



% start-dfs.sh --config path-to-config-directory

% start-mapred.sh --config path-to-config-directory



Five daemons will be started on your local machine: a namenode, a secondary namenode, a datanode, a jobtracker, and a tasktracker. You can check whether the daemons started successfully by looking at the logfiles in the logs directory (in the Hadoop installation directory), or by looking at the web UIs, at http://localhost:50030/ for the jobtracker and at http://localhost:50070/ for the namenode.


You can also use Java’s jps command to see whether they are running.
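For example, after starting both sets of daemons, the jps output should list entries along the following lines (each name is preceded by its process ID, omitted here):

% jps
NameNode
SecondaryNameNode
DataNode
JobTracker
TaskTracker
Jps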


Stopping the daemons is done in the obvious way:


% stop-dfs.sh

% stop-mapred.sh


If you encounter any problems, feel free to ask in the comment box.
