Get This: Cloudera’s Distribution for Hadoop

Cloudera’s Distribution for Hadoop is based on the most recent stable version of Apache Hadoop with numerous patches, backports, and updates. Cloudera shares this distribution in a number of different formats: compressed tar files, RPMs, Debian packages, and Amazon EC2 AMIs. Cloudera’s Distribution for Hadoop is free, released under the Apache 2.0 license and available at http://www.cloudera.com/hadoop/.

Cloudera has an online configurator at http://www.cloudera.com/configurator to make setting up a Hadoop cluster easy (Figure B-1). The configurator has a simple wizard-like interface that asks targeted questions about your cluster. When you’ve finished, the configurator generates customized Hadoop packages and places them in a package repository for you. You can manage any number of clusters and return at a later time to update your active configurations.

To simplify package management, Cloudera shares RPMs from a yum repository and Debian packages from an apt repository. Cloudera’s Distribution for Hadoop enables you to install and configure Hadoop on each machine in your cluster by running a single, simple command. Kickstart users benefit even more by being able to commission entire Hadoop clusters automatically without any manual intervention.

Prerequisites

Cloudera’s Distribution for Hadoop requires Sun Java 6 or later to be installed. The Sun Java Debian and RPM packages require that you agree to the Sun license before use. For a Debian-based system, you will want to enable the non-free apt repository, as it contains the sun-java6-* packages. For a Red Hat–based system, download the Sun Java RPM package from http://java.sun.com/javase/downloads/.

Before you can use your favorite package manager (e.g., yum, apt-get, aptitude) to install Cloudera packages, you’ll need to add the Cloudera repositories to your list of yum and/or apt sources.

Please refer to http://www.cloudera.com/hadoop/ for up-to-date instructions on the simplest way to satisfy these prerequisites.
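As a quick sanity check before installing, the Java requirement can be tested from a script. The following is only a sketch: the helper names java_major and java_ok are made up for illustration, and the check looks at the version number alone, not at whether the JVM is Sun's.

```shell
#!/bin/sh
# Hypothetical helpers (not part of any Cloudera package).

# Extract the major Java version from a version string such as "1.6.0_20"
# (legacy scheme: the major version is the second field) or "11.0.2"
# (newer scheme: the major version is the first field).
java_major() {
    version=$1
    case $version in
        1.*) echo "$version" | cut -d. -f2 ;;
        *)   echo "$version" | cut -d. -f1 ;;
    esac
}

# Succeed (exit status 0) when the version string denotes Java 6 or later.
# Empty or non-numeric major versions are treated as a failure.
java_ok() {
    major=$(java_major "$1")
    case $major in
        ''|*[!0-9]*) return 1 ;;
    esac
    [ "$major" -ge 6 ]
}

# Usage: `java -version` writes to stderr a first line such as
#   java version "1.6.0_20"
# so the quoted part of that line is taken as the version string.
if command -v java >/dev/null 2>&1; then
    installed=$(java -version 2>&1 | sed -n '1s/.*"\(.*\)".*/\1/p')
    if java_ok "$installed"; then
        echo "Java $installed satisfies the prerequisite"
    else
        echo "Java '$installed' is too old or could not be parsed"
    fi
else
    echo "java not found on PATH"
fi
```

The script only reports what it finds; it installs nothing and changes nothing on the machine.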

Standalone Mode

To install Hadoop in standalone mode, run the following command on Red Hat–based systems:

% yum install hadoop

Or on Debian-based systems, run the command:

% apt-get install hadoop

The hadoop package includes a man page. To read the man page, run the command:

% man hadoop

If you want to install the full Hadoop documentation on a machine, install the Hadoop documentation package. On Red Hat–based systems, the package is named hadoop-docs; run the command:

% yum install hadoop-docs

To install the documentation on Debian-based systems, run the command:

% apt-get install hadoop-doc
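If you administer both kinds of systems, the two sets of commands above can be folded into one script. This is a minimal sketch under the assumption that exactly one of yum or apt-get is on the PATH; the function names are hypothetical, and the script only prints the commands rather than running them.

```shell
#!/bin/sh
# Hypothetical helpers; they print commands and never install anything.

# Print the install command for a package under the given package manager.
# Only yum and apt-get, the two tools used in this section, are handled.
install_command() {
    manager=$1
    package=$2
    case $manager in
        yum)     echo "yum install $package" ;;
        apt-get) echo "apt-get install $package" ;;
        *)       echo "unsupported package manager: $manager" >&2
                 return 1 ;;
    esac
}

# Print the documentation package name, which differs between the two
# families: hadoop-docs for yum, hadoop-doc for apt-get.
docs_package() {
    case $1 in
        yum)     echo "hadoop-docs" ;;
        apt-get) echo "hadoop-doc" ;;
        *)       return 1 ;;
    esac
}

# Print the first supported package manager found on PATH.
detect_manager() {
    for candidate in yum apt-get; do
        if command -v "$candidate" >/dev/null 2>&1; then
            echo "$candidate"
            return 0
        fi
    done
    return 1
}

# Usage: show the commands that apply to this machine.
if manager=$(detect_manager); then
    install_command "$manager" hadoop
    install_command "$manager" "$(docs_package "$manager")"
else
    echo "neither yum nor apt-get found"
fi
```

Printing the command instead of executing it keeps the sketch safe to run anywhere; to actually install, run the printed command as root.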

Pseudo-Distributed Mode

To install Hadoop in pseudo-distributed mode, run the following command on Red Hat–based systems:

% yum install hadoop-conf-pseudo

Or on Debian-based systems, run the command:

% apt-get install hadoop-conf-pseudo

Once you’ve installed the Hadoop pseudo-distributed configuration package, you start the Hadoop services by running the same command on both Red Hat– and Debian-based systems:

% for x in namenode secondarynamenode datanode jobtracker tasktracker ; do /etc/init.d/hadoop-$x start ; done

There is no need to worry about creating a hadoop user or formatting HDFS, as that is handled automatically by the hadoop-conf-pseudo package. You can use Hadoop immediately after installing the package and starting the Hadoop services. The hadoop-conf-pseudo package will also ensure that your Hadoop services are started at system boot.
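The start loop shown earlier can be wrapped in a small function so that the commands can be previewed before they are run. This is a sketch only: service_commands is a hypothetical name, and passing any action other than start (for example stop) assumes the init scripts accept it, which the text above does not confirm.

```shell
#!/bin/sh
# Hypothetical helper: print one init-script invocation per Hadoop daemon,
# in the same order as the start loop in this section.
service_commands() {
    action=$1
    for x in namenode secondarynamenode datanode jobtracker tasktracker; do
        echo "/etc/init.d/hadoop-$x $action"
    done
}

# Usage: preview the five start commands without executing them.
service_commands start
```

To execute the commands rather than print them, the output could be piped to sh as root, for example `service_commands start | sh`.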

Fully Distributed Mode

For details about deploying a fully distributed Hadoop cluster, visit Cloudera’s Distribution for Hadoop web page at http://www.cloudera.com/hadoop/. When you run Cloudera’s online configurator, it creates a personalized apt or yum repository to hold the configuration packages for every cluster you manage. For example, let’s say you gave one of your clusters the name mycluster. To see a list of all the configuration packages for mycluster, run the following command on Red Hat–based systems:

% yum search hadoop-conf-mycluster

or on Debian-based systems, run the command:

% apt-cache search hadoop-conf-mycluster

These commands will return a list of configuration packages for the mycluster cluster. The number and types of configuration packages depend on how you answered the questions posed by the Cloudera configurator. Some of the packages will be generated for specific hosts in your cluster; others will be for groups or classes of machines in your cluster. For host-specific configurations, the fully qualified hostname will be added to the package name. For example, there may be a configuration for myhost.mydomain in the mycluster cluster. To install Hadoop on myhost.mydomain on Red Hat–based systems, run the command:

% yum install hadoop-conf-mycluster-myhost.mydomain

or on Debian-based systems, run the command:

% apt-get install hadoop-conf-mycluster-myhost.mydomain

The Hadoop configuration packages will ensure that your services are set up to run at system boot.
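The naming rule for host-specific packages described above (cluster name, then fully qualified hostname) is easy to capture in a helper. This is a sketch with a hypothetical function name, conf_package; it covers only host-specific packages, not the group or class packages, whose names depend on your configurator answers.

```shell
#!/bin/sh
# Hypothetical helper: build the name of a host-specific configuration
# package from a cluster name and a fully qualified hostname.
conf_package() {
    cluster=$1
    host=$2
    echo "hadoop-conf-$cluster-$host"
}

# Usage with the example names from this section.
conf_package mycluster myhost.mydomain
# prints: hadoop-conf-mycluster-myhost.mydomain
```

On a cluster machine, the second argument could come from `hostname -f`, assuming that it reports the same fully qualified name you entered in the configurator.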

Hadoop-Related Packages

Cloudera’s Distribution for Hadoop allows you to easily deploy tools built on top of Hadoop like Hive and Pig. Hive is a data warehouse infrastructure that allows you to query data in Hadoop with a query language based on SQL. For more information on Hive, see “Hadoop and Hive at Facebook”. Pig is a platform for analyzing large datasets using a high-level language. To install Hive and Pig on Red Hat–based systems, run the command:

% yum install hadoop-hive hadoop-pig

To install Hive and Pig on Debian-based systems, run the command:

% apt-get install hadoop-hive hadoop-pig

More Hadoop-related packages will be added to Cloudera’s Distribution for Hadoop over time.

If you encounter any problems, feel free to respond in the comment box.

Sunday, January 8, 2012