
If you are using, or planning to use the Hadoop framework for big data and Business Intelligence (BI), this document can help you navigate some of the technology and terminology, and guide you in setting up and configuring the system.
In this document we provide some background information about the framework, the key distributions, modules, components, and related products. We also provide you with single and multi-node Hadoop installation commands and configuration parameters.
The final section includes some tips and tricks to help you get started, and provides guidance in setting up a Hadoop project.

To learn how we make BI work on Hadoop, see the Jethro Data Sheet.

Key Hadoop Distributions

Apache Hadoop: The open source distribution from Apache

Cloudera: The leading vendor, with proprietary components for enterprise needs

MapR: Committed to ease of use, while supporting high performance and scalability

Hortonworks: 100% open source package

IBM: Integration with IBM analytics products

Pivotal: Integration with Greenplum and Cloud Foundry (CF)

Hadoop Modules

Common: Common utilities that support the other Hadoop modules

HDFS: Hadoop Distributed File System: provides high-throughput access to application data on commodity hardware

YARN: Yet Another Resource Negotiator: a framework for cluster resource management, including job scheduling

MapReduce: A YARN-based software framework for parallel processing of large data sets

Hadoop Components

NameNode (HDFS): Maintains the directory tree and metadata of the HDFS file system (analogous to the inodes of a traditional file system)

Secondary NameNode (HDFS): Creates checkpoints of the namespace by merging the edits log into the fsimage file. Despite its name, it is a checkpointing helper, not a hot standby for the NameNode

JournalNode (HDFS): Arbiter node that supports automatic failover between NameNodes

DataNode (HDFS): Node (or server) that stores the actual data blocks

NFS3 Gateway (HDFS): Daemons that enable NFSv3 access to HDFS

ResourceManager (YARN): Global daemon that arbitrates resources among all the applications in the cluster

ApplicationMaster (YARN): Manages a single application: negotiates resources from the ResourceManager and works with the NodeManagers to run and monitor its tasks

NodeManager (YARN): Per-machine agent responsible for containers: launching them, monitoring their resource usage (CPU, memory, disk) and reporting back to the ResourceManager

Container (YARN): An allocation of resources (CPU, memory) on a specific machine, in which a specific task of a specific application runs

Hadoop Ecosystem – Related Products

Ambari: Completely open-source management platform for provisioning, managing, monitoring and securing Apache Hadoop clusters

Apex: Platform for big data in motion (stream processing), based on YARN

Azkaban: Workflow job scheduling and management system for Hadoop

Flume: Reliable, distributed and available service for streaming logs into HDFS

Knox: Authentication and access gateway service for Hadoop

HBase: Distributed non-relational database that runs on top of HDFS

Hive: Data warehouse system built on Hadoop

Mahout: Machine learning algorithms (clustering, classification and batch-based collaborative filtering) implemented on top of MapReduce

Pig: High-level platform (with a script-like language) for creating and running programs on MapReduce, Tez and Spark

Impala: Enables low-latency SQL queries on data in HBase and HDFS

Oozie: Workflow job scheduling and management system for Hadoop

Ranger: Access policy manager for HDFS files and folders, and for databases, tables and columns

Spark: Cluster computing framework that can use YARN and HDFS. Supports streaming and batch jobs, and provides an SQL interface and a machine learning library

Sqoop: Command-line tool for migrating data between RDBMSs and Hadoop

Tez: Application framework for running a complex Directed Acyclic Graph (DAG) of tasks, based on YARN

ZooKeeper: Distributed name registry, synchronization service and configuration service, used as a sub-system by Hadoop

Major Hadoop Cloud Providers

Amazon Web Services: EMR (Elastic MapReduce)

IBM SoftLayer: IBM BigInsights

Microsoft Azure: HDInsight

Common Data Formats

Avro: JSON-defined format with RPC and serialization support, designed for systems that exchange data

Parquet: Columnar storage format

ORC: Optimized Row Columnar: a fast columnar storage format

RCFile: Record Columnar File: a data placement format for relational tables

SequenceFile: Binary format for records of specific data types

Unstructured: Hadoop also handles various unstructured data formats

Single Node Installation

Requirement

Task

Command

Java Installation

Check version

java -version


Install

sudo apt-get -y update && sudo apt-get -y install default-jdk

Create User and Permissions

Create User

useradd hadoop
passwd hadoop
mkdir /home/hadoop
chown -R hadoop:hadoop /home/hadoop


Create keys

su - hadoop
ssh-keygen -t rsa
cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
chmod 0600 ~/.ssh/authorized_keys

Install from source

 

wget http://apache.spd.co.il/hadoop/common/hadoop-2.7.2/hadoop-2.7.2.tar.gz
tar xzf hadoop-2.7.2.tar.gz
mv hadoop-2.7.2 hadoop

Environment

Env Vars

vi ~/.bashrc
export HADOOP_HOME=/home/hadoop/hadoop
export HADOOP_INSTALL=$HADOOP_HOME
export HADOOP_MAPRED_HOME=$HADOOP_HOME
export HADOOP_COMMON_HOME=$HADOOP_HOME
export HADOOP_HDFS_HOME=$HADOOP_HOME
export YARN_HOME=$HADOOP_HOME
export HADOOP_COMMON_LIB_NATIVE_DIR=$HADOOP_HOME/lib/native
export PATH=$PATH:$HADOOP_HOME/sbin:$HADOOP_HOME/bin
source ~/.bashrc



vi $HADOOP_HOME/etc/hadoop/hadoop-env.sh
export JAVA_HOME=/opt/jdk1.8.0_05/

Configuration files

 

core-site.xml
hdfs-site.xml
mapred-site.xml
yarn-site.xml
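These four files live in $HADOOP_HOME/etc/hadoop. As a minimal single-node sketch (the port and replication value below are common examples, not requirements), core-site.xml and hdfs-site.xml might look like:

```xml
<!-- core-site.xml: point clients at the local NameNode (example port) -->
<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://localhost:9000</value>
  </property>
</configuration>

<!-- hdfs-site.xml: a single node cannot hold 3 replicas, so use 1 -->
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
</configuration>
```

Typically, mapred-site.xml also sets mapreduce.framework.name to yarn, and yarn-site.xml sets yarn.nodemanager.aux-services to mapreduce_shuffle.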

Format NameNode

 

hdfs namenode -format

Start System

 

cd $HADOOP_HOME/sbin/
start-dfs.sh
start-yarn.sh

Test System

 

bin/hdfs dfs -mkdir /user
bin/hdfs dfs -mkdir /user/hadoop
bin/hdfs dfs -put /var/log/httpd logs

Multi-node Installation

Task

Command

Configure hosts on each node

> vi /etc/hosts
192.168.1.11 hadoop-master
192.168.1.12 hadoop-slave-1
192.168.1.13 hadoop-slave-2

Enable cross node authentication

> su - hadoop
> ssh-keygen -t rsa
> ssh-copy-id -i ~/.ssh/id_rsa.pub hadoop@hadoop-master
> ssh-copy-id -i ~/.ssh/id_rsa.pub hadoop@hadoop-slave-1
> ssh-copy-id -i ~/.ssh/id_rsa.pub hadoop@hadoop-slave-2
> chmod 0600 ~/.ssh/authorized_keys
> exit

Copy system

> su - hadoop
> cd /opt/hadoop
> scp -r hadoop hadoop-slave-1:/opt/hadoop
> scp -r hadoop hadoop-slave-2:/opt/hadoop

Configure Master

> su - hadoop
> cd /opt/hadoop/hadoop
> vi conf/masters
hadoop-master
> vi conf/slaves
hadoop-slave-1
hadoop-slave-2
> bin/hadoop namenode -format

Start system

bin/start-all.sh

Backup HDFS Metadata

Task

Command

Stop the cluster

stop-all.sh

Perform cold backup to metadata directories

cd /data/dfs/nn
tar -czvf /tmp/backup.tar.gz .

Start the cluster

start-all.sh
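The cold backup above can be wrapped in a small dated script. This is a sketch only: it stages a scratch directory for illustration, and on a real cluster you would point NN_DIR at your actual metadata directory (e.g. /data/dfs/nn) while the cluster is stopped.

```shell
# Sketch: dated cold backup of the NameNode metadata directory.
# NN_DIR is a scratch path here for illustration; use the real metadata path in production.
NN_DIR=/tmp/demo-nn
mkdir -p "$NN_DIR"
echo demo > "$NN_DIR/fsimage"

BACKUP="/tmp/nn-backup-$(date +%Y%m%d).tar.gz"
tar -czf "$BACKUP" -C "$NN_DIR" .   # -C archives the directory contents, not the full path
tar -tzf "$BACKUP"                  # list the archive to verify it is readable
```

Archiving with -C keeps the paths inside the tarball relative, which makes restoring into a fresh metadata directory simpler.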

HDFS Basic Commands

Task

Command

List the contents of a directory

hdfs dfs -ls /data/

Upload a file from the local file system to HDFS

hdfs dfs -put logs.csv /data/

Read the content of the file from HDFS

hdfs dfs -cat /data/logs.csv

Change the permission of a file

hdfs dfs -chmod 744 /data/logs.csv

Set the replication factor of a file to 3

hdfs dfs -setrep -w 3 /data/logs.csv

Check the size of the file

hdfs dfs -du -h /data/logs.csv

Move the file to a sub-directory

hdfs dfs -mv logs.csv logs/

Remove directory from HDFS

hdfs dfs -rm -r logs

HDFS Administration

Task

Command

Balance the cluster storage

hdfs balancer [-threshold <percent>]

Run the NameNode

hdfs namenode

Run the secondary NameNode

hdfs secondarynamenode

Run a datanode

hdfs datanode

Run the NFS3 gateway

hdfs nfs3

Run the RPC portmap for the NFS3 gateway

hdfs portmap

YARN

Task

Command

Show yarn help

yarn

Define configuration file

yarn [--config confdir]

Define log level

yarn [--loglevel loglevel] where loglevel is FATAL, ERROR, WARN, INFO, DEBUG or TRACE

User commands

 

Show Hadoop classpath

yarn classpath

Show and kill application

yarn application

Show application attempt

yarn applicationattempt

Show container information

yarn container

Show node information

yarn node

Show queue information

yarn queue

Administration commands

 

Start NodeManager

yarn nodemanager

Start Proxy web server

yarn proxyserver

Start ResourceManager

yarn resourcemanager

Run ResourceManager admin client

yarn rmadmin

Start Shared Cache Manager

yarn sharedcachemanager

Start TimeLineServer

yarn timelineserver

MapReduce

Task

Command

Submit the WordCount MapReduce job to the cluster

hadoop jar /usr/lib/hadoop-mapreduce/hadoop-mapreduce-examples.jar wordcount input logs-output

Check the output of this job in HDFS

hadoop fs -cat logs-output/*

Submit a scalding job

hadoop jar scalding.jar com.twitter.scalding.Tool <job class> --hdfs

Kill a MapReduce job

yarn application -kill <application_id>
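The WordCount job above is the canonical MapReduce example. Its map-shuffle-reduce flow has a well-known local analogue as a shell pipeline, which can help build intuition (the /tmp paths are scratch files for illustration):

```shell
# tr      = map:     emit one word per line
# sort    = shuffle: bring identical keys together
# uniq -c = reduce:  count each group of identical keys
printf 'hello world\nhello hadoop\n' > /tmp/wc-input.txt
tr -s ' ' '\n' < /tmp/wc-input.txt | sort | uniq -c | awk '{print $2, $1}' > /tmp/wc-output.txt
cat /tmp/wc-output.txt
```

On the cluster, the same three phases run in parallel across DataNodes, with YARN scheduling the map and reduce tasks.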

Web UIs and Default Ports

NameNode: http://<host>:50070/

DataNode: http://<host>:50075/

Secondary NameNode: http://<host>:50090/

ResourceManager: http://<host>:8088/

HBase Master: http://<host>:60010/

Secure Hadoop

Aspect

Best Practice

Authentication

  • Define users
  • Enable Kerberos in Hadoop
  • Set up a Knox gateway to control access and authentication to the HDFS cluster
  • Integrate with the organization's SSO and LDAP

Authorization

  • Define groups
  • Define HDFS Permissions
  • Define HDFS ACLs
  • Enable Ranger policies to control access to HDFS files and folders, and to databases, tables and columns

Audit

  • Enable process execution audit trail

Data Protection

  • Wire encryption with Knox or Hadoop
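As a sketch of the Kerberos step above, authentication and service-level authorization are switched on in core-site.xml. The full setup also requires a KDC, principals and keytabs, which are beyond this snippet:

```xml
<!-- core-site.xml: enable Kerberos authentication and service-level authorization -->
<property>
  <name>hadoop.security.authentication</name>
  <value>kerberos</value>
</property>
<property>
  <name>hadoop.security.authorization</name>
  <value>true</value>
</property>
```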

Hadoop Tips and Tricks

Project Concept

Iterate cluster sizing to optimize performance and meet actual load patterns

Hardware

Clusters with more nodes recover faster

The higher the storage per node, the longer the recovery time

Use commodity hardware:

  • Use large slow disks (SATA) without RAID (3-6TB disks)
  • Use as much RAM as is cost-effective (96-192GB RAM)
  • Use mainstream CPU with as many cores as possible (8-12 cores)

Invest in reliable hardware for the NameNodes

NameNode RAM should be 2GB + 1GB for every 100TB raw disk space

Networking cost should be 20% of hardware budget

40 nodes is the critical mass to achieve best performance/cost ratio

Your actual net storage capacity will be about 25% of raw capacity: 3x replication divides raw storage by three, and roughly a quarter of that should be kept free as spare capacity
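The sizing rules of thumb above reduce to quick arithmetic. A sketch, using an illustrative 400TB of raw disk:

```shell
RAW_TB=400                          # total raw disk across all DataNodes (example figure)
NET_TB=$(( RAW_TB * 25 / 100 ))     # ~25% usable: 3x replication plus ~25% headroom
NN_RAM_GB=$(( 2 + RAW_TB / 100 ))   # NameNode rule of thumb: 2GB + 1GB per 100TB raw
echo "net capacity: ${NET_TB}TB, NameNode RAM: ${NN_RAM_GB}GB"
```

So a 400TB-raw cluster yields roughly 100TB of usable storage and calls for about 6GB of NameNode heap by these rules.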

Operating System and JVM

Must be 64-bit

Set file descriptor limit to 64K (ulimit)

Enable time synchronization using NTP

Speed up reads by mounting disks with NOATIME

Disable hugepages
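The file descriptor and NOATIME tips above translate into entries like the following (the user name, device and mount point are examples, not prescriptions):

```
# /etc/security/limits.conf: raise the open-file limit for the hadoop user
hadoop  soft  nofile  65536
hadoop  hard  nofile  65536

# /etc/fstab: mount data disks with noatime to skip access-time writes on reads
/dev/sdb1  /data/1  ext4  defaults,noatime  0  0
```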

System

Enable monitoring using Ambari

Monitor the checkpoints of the NameNodes to verify that they occur at the correct times. This will enable you to recover your cluster when needed

Avoid reaching 90% cluster disk utilization

Balance the cluster periodically using balancer

Edit metadata files using Hadoop utilities only, to avoid corruption

Keep replication >= 3

Place quotas and limits on users and project directories, as well as on tasks to avoid cluster starvation

Clean /tmp regularly – it tends to fill up with junk files

Optimize the number of reducers to avoid system starvation

Verify that the file system you selected is supported by your Hadoop vendor

Data and System Recovery

Disk failure is not an issue

DataNode failure is not a major issue

NameNode failure is an issue, even in a clustered environment

Make regular backups of the NameNode metadata

Enable NameNode clustering using ZooKeeper

Provide sufficient disk space for NameNode logging

Enable trash in core-site.xml to protect against accidental permanent deletion (hdfs dfs -rm -r)
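A sketch of the trash setting in core-site.xml (the interval is in minutes; 1440 keeps deleted files recoverable in .Trash for 24 hours):

```xml
<property>
  <name>fs.trash.interval</name>
  <value>1440</value>
</property>
```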
