Sunday, January 8, 2012

Cloudera’s Distribution for Hadoop


Cloudera’s Distribution for Hadoop is based on the most recent stable version of Apache Hadoop with numerous patches, backports, and updates. Cloudera shares this distribution in a number of different formats: compressed tar files, RPMs, Debian packages, and Amazon EC2 AMIs. Cloudera’s Distribution for Hadoop is free, released under the Apache 2.0 license and available at http://www.cloudera.com/hadoop/.

Cloudera has an online configurator at http://www.cloudera.com/configurator to make setting up a Hadoop cluster easy (Figure B-1). The configurator has a simple wizardlike interface that asks targeted questions about your cluster. When you’ve finished, the configurator generates customized Hadoop packages and places them in a package repository for you. You can manage any number of clusters and return at a later time to update your active configurations.

To simplify package management, Cloudera shares RPMs from a yum repository and Debian packages from an apt repository. Cloudera’s Distribution for Hadoop enables you to install and configure Hadoop on each machine in your cluster by running a single, simple command. Kickstart users benefit even more by being able to commission entire Hadoop clusters automatically without any manual intervention.

Prerequisites


Cloudera’s Distribution for Hadoop requires Sun Java 6 or later to be installed. The
Sun Java Debian and RPM packages require that you agree to the Sun license before use. For a Debian-based system, you will want to enable the non-free apt repository, as it contains the sun-java6-* packages. For a Red Hat–based system, download the Sun Java RPM package from http://java.sun.com/javase/downloads/.

Before you can use your favorite package manager (e.g., yum, apt-get, aptitude) to install Cloudera packages, you’ll need to add the Cloudera repositories to your list of yum and/or apt sources.


Please refer to http://www.cloudera.com/hadoop/ for up-to-date instructions on the simplest way to satisfy these prerequisites.


Standalone Mode

To install Hadoop standalone mode, run the following command on Red Hat–based
systems:


% yum install hadoop


Or on Debian-based systems, run the command:


% apt-get install hadoop


The hadoop package includes a man page. To read the man page, run the command:


% man hadoop


If you want to install the full Hadoop documentation on a machine, install the hadoop-docs package. On Red Hat–based systems, run the command:


% yum install hadoop-docs


To install the documentation on Debian-based systems, run the command:


% apt-get install hadoop-doc
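
If you script installations across a mix of Red Hat– and Debian-based machines, it can help to pick the package manager and the matching documentation package name automatically. The following is only a minimal sketch: the function names detect_pkg_manager and hadoop_docs_package are hypothetical helpers invented for this illustration, and the package names are the ones used in the commands above.

```shell
# Hypothetical helper: print the name of the first available package manager.
detect_pkg_manager() {
    if command -v yum >/dev/null 2>&1; then
        echo yum
    elif command -v apt-get >/dev/null 2>&1; then
        echo apt-get
    else
        echo "no supported package manager found" >&2
        return 1
    fi
}

# Hypothetical helper: map a package manager to the documentation
# package name used in the commands above.
hadoop_docs_package() {
    case "$1" in
        yum)     echo hadoop-docs ;;
        apt-get) echo hadoop-doc ;;
        *)       return 1 ;;
    esac
}
```

With these defined, something like pm=$(detect_pkg_manager) && $pm install "$(hadoop_docs_package "$pm")" would run the appropriate command from this section on either family of systems.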



Pseudo-Distributed Mode

To install Hadoop in pseudo-distributed mode, run the following command on Red
Hat–based systems:


% yum install hadoop-conf-pseudo


Or on Debian-based systems, run the command:


% apt-get install hadoop-conf-pseudo


Once you’ve installed the Hadoop pseudo-distributed configuration package, you start the Hadoop services by running the same command on both Red Hat– and Debian-based systems:


% for x in namenode secondarynamenode datanode jobtracker tasktracker ;
do /etc/init.d/hadoop-$x start ; done


There is no need to worry about creating a hadoop user or formatting HDFS, as that is handled automatically by the hadoop-conf-pseudo package. You can use Hadoop immediately after installing the package and starting the Hadoop services. The hadoop-conf-pseudo package will also ensure that your Hadoop services are started at system boot.
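
The loop used to start the services can be wrapped in a small function so that the same five services can be driven with other actions. This is a sketch under assumptions: the function name hadoop_services and the DRY_RUN variable are hypothetical, and it assumes the init scripts accept actions such as stop or status in addition to start, as init scripts conventionally do. Check the scripts under /etc/init.d/ on your system before relying on it.

```shell
# Hypothetical helper: run one init-script action for every Hadoop service.
# Usage: hadoop_services ACTION   (for example: start)
# Set DRY_RUN=1 to print the commands instead of executing them.
hadoop_services() {
    action="$1"
    for x in namenode secondarynamenode datanode jobtracker tasktracker; do
        if [ "${DRY_RUN:-0}" = "1" ]; then
            # Only show what would be run.
            echo "/etc/init.d/hadoop-$x $action"
        else
            "/etc/init.d/hadoop-$x" "$action"
        fi
    done
}
```

Running DRY_RUN=1 hadoop_services start first prints the five commands without touching any service, which is a cheap way to confirm the script paths before running them for real.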



Fully Distributed Mode

For details about deploying a fully distributed Hadoop cluster, visit Cloudera’s Distribution for Hadoop web page at http://www.cloudera.com/hadoop/. When you run Cloudera’s online configurator, it creates a personalized apt or yum repository to hold the configuration packages for every cluster you manage. For example, let’s say you gave one of your clusters the name mycluster. To see a list of all the configuration packages for mycluster, run the following command on Red Hat–based systems:


% yum search hadoop-conf-mycluster


or on Debian-based systems, run the command:


% apt-cache search hadoop-conf-mycluster


These commands will return a list of configuration packages for the mycluster cluster. The number and types of configuration packages depend on how you answered the questions posed by the Cloudera configurator. Some of the packages will be generated for specific hosts in your cluster; others will be for groups or classes of machines in your cluster. For host-specific configurations, the fully qualified hostname will be added to the package name. For example, there may be a configuration for myhost.mydomain in the mycluster cluster. To install Hadoop on myhost.mydomain on Red Hat–based systems, run the command:


% yum install hadoop-conf-mycluster-myhost.mydomain


or on Debian-based systems, run the command:


% apt-get install hadoop-conf-mycluster-myhost.mydomain
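
Because host-specific package names follow the pattern hadoop-conf-, the cluster name, and the fully qualified hostname, the name can be assembled on each machine rather than typed by hand. The sketch below assumes that pattern holds for your cluster and that hostname -f returns the same fully qualified name the configurator used; the function name hadoop_conf_package is hypothetical.

```shell
# Hypothetical helper: build the host-specific configuration package name.
# Usage: hadoop_conf_package CLUSTER [HOSTNAME]
hadoop_conf_package() {
    cluster="$1"
    # Default to this machine's fully qualified hostname when none is given.
    host="${2:-$(hostname -f)}"
    echo "hadoop-conf-${cluster}-${host}"
}
```

For example, hadoop_conf_package mycluster myhost.mydomain prints hadoop-conf-mycluster-myhost.mydomain, and the result can be passed to yum install or apt-get install as shown above.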


The Hadoop configuration packages will ensure that your services are set up to run at system boot.

Hadoop-Related Packages


Cloudera’s Distribution for Hadoop allows you to easily deploy tools built on top of Hadoop, such as Hive and Pig. Hive is a data warehouse infrastructure that allows you to query data in Hadoop with a query language based on SQL. For more information on Hive, see “Hadoop and Hive at Facebook”. Pig is a platform for analyzing large datasets using a high-level language. To install Hive and Pig on Red Hat–based systems, run the command:


% yum install hadoop-hive hadoop-pig



To install Hive and Pig on Debian-based systems, run the command:


% apt-get install hadoop-hive hadoop-pig


More Hadoop-related packages will be added to Cloudera’s Distribution for Hadoop
over time.

If you encounter any problems, feel free to respond via the comment box.
