Feb 28, 2015 hadoop

Notes on the Hadoop ecosystem.

Hadoop cluster modes

pseudo-distributed mode (single node)

fully distributed mode (clusters)

HDFS

Master daemons

machine 1: NameNode (NN)

machine 2: Secondary NameNode (2NN)

Slave daemons

on each slave machine: DataNode (DN)

NameNode (NN): holds the filesystem metadata (the directory tree and the location of every block).

DataNode (DN): stores the actual data blocks and serves them to clients.
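To see this layout on a running cluster, the Hadoop 1.x-era admin command (matching the CDH4 setup used below) asks the NameNode for a report:

hadoop dfsadmin -report    # capacity plus the list of live DataNodes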

Hadoop filesystem

hadoop fs -<command> ...
    # some older docs use "hadoop dfs"; "dfs" and "fs" are the same

hadoop fs -ls
hadoop fs -mkdir
hadoop fs -rm
hadoop fs -mv

Put a file into HDFS

hadoop fs -put <file> <destination>

Retrieve a file to the local file system

hadoop fs -get <file> <destination>
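A quick round trip between the local disk and HDFS (data.txt and data_copy.txt are hypothetical file names):

hadoop fs -mkdir myinput
hadoop fs -put data.txt myinput/
hadoop fs -ls myinput
hadoop fs -get myinput/data.txt data_copy.txt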

Map/Reduce

If not using Java, one can use Python or any other language via "streaming".

hadoop jar <path/to/hadoop-streaming.jar> \
    -mapper <mapper_code> -reducer <reducer_code> \
    -file <mapper_code> -file <reducer_code> \
    -input <input_dir> -output <output_dir>
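Streaming hands the mapper and reducer their input on stdin and takes their output from stdout, one tab-separated key/value pair per line; Hadoop sorts the map output by key before the reducer sees it. A minimal word-count sketch under those assumptions (the names mapper.py/reducer.py match the example below; remember chmod +x on both):

#!/usr/bin/env python
# mapper.py: emit "word <TAB> 1" for every word on stdin
import sys

for line in sys.stdin:
    for word in line.split():
        print("%s\t%d" % (word, 1))

#!/usr/bin/env python
# reducer.py: input arrives sorted by key, so counts can be
# summed over each run of identical words
import sys

current_word, current_count = None, 0
for line in sys.stdin:
    word, count = line.rstrip("\n").split("\t", 1)
    if word == current_word:
        current_count += int(count)
    else:
        if current_word is not None:
            print("%s\t%d" % (current_word, current_count))
        current_word, current_count = word, int(count)

if current_word is not None:
    print("%s\t%d" % (current_word, current_count))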

Shell aliases (.bashrc, etc.), from Udacity

# alias to run a streaming mapreduce job:
#   $1 = mapper, $2 = reducer, $3 = input dir, $4 = output dir
run_mapreduce() {
    hadoop jar /usr/lib/hadoop-0.20-mapreduce/contrib/streaming/hadoop-streaming-2.0.0-mr1-cdh4.1.1.jar \
        -mapper "$1" -reducer "$2" -file "$1" -file "$2" -input "$3" -output "$4"
}
alias hs=run_mapreduce

# same, but also uses the reducer as a combiner
run_mapreduce_with_combiner() {
    hadoop jar /usr/lib/hadoop-0.20-mapreduce/contrib/streaming/hadoop-streaming-2.0.0-mr1-cdh4.1.1.jar \
        -mapper "$1" -reducer "$2" -combiner "$2" -file "$1" -file "$2" -input "$3" -output "$4"
}
alias hsc=run_mapreduce_with_combiner
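With these aliases, the full streaming invocation below becomes a one-liner (same mapper/reducer/directories as the Udacity example):

hs mapper.py reducer.py myinput joboutput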

Example from Udacity

# there's a mapper.py and a reducer.py.
# data is in HDFS under the myinput/ directory.
# output will be stored in the joboutput/ directory.

hadoop jar \
    /usr/lib/hadoop-0.20-mapreduce/contrib/streaming/hadoop-streaming-2.0...jar \
    -mapper mapper.py -reducer reducer.py -file mapper.py -file reducer.py \
    -input myinput -output joboutput

# ... job runs ...

When finished, it produces:

joboutput/_SUCCESS       # success marker
joboutput/_logs          # job logs
joboutput/part-00000     # <-- the actual result

# see result
hadoop fs -cat joboutput/part-00000 | less 

# copy result to local drive
hadoop fs -get joboutput/part-00000 my_output.txt

By adding a combiner, we can improve efficiency and reduce the load on the reducer: the combiner pre-aggregates each mapper's output before it is shuffled across the network.
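With the hsc alias above, the same reducer script doubles as the combiner. A caveat (standard MapReduce practice, not from the course notes): this shortcut is only safe when the reduce operation is associative and commutative, e.g. sums and counts; an average is not. joboutput2 is just a hypothetical fresh output directory, since streaming refuses to overwrite an existing one:

hsc mapper.py reducer.py myinput joboutput2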

Hive

HiveQL: SQL-like queries that Hive converts into MapReduce jobs.
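The example below assumes the tables were already created and populated. A hedged sketch of that setup step (table and column names taken from the query below; types, delimiters, and file paths are assumptions):

CREATE TABLE emp (ename STRING, sal DOUBLE, deptno INT)
    ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';
CREATE TABLE dept (deptno INT, loc STRING)
    ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';
LOAD DATA LOCAL INPATH '/path/to/emp.csv' INTO TABLE emp;
LOAD DATA LOCAL INPATH '/path/to/dept.csv' INTO TABLE dept;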

Example (InfiniteSkills)

# first, create tables and populate tables
# and then...

SELECT loc, AVG(sal)
FROM emp JOIN dept ON (emp.deptno = dept.deptno)
WHERE sal > 3000
GROUP BY loc;

Pig

Alternative to Hive; uses a dataflow scripting language (Pig Latin) instead of SQL, so it suits programmers more than SQL users.

Example (InfiniteSkills):

-- same query as the Hive example: avg salary per location, sal > 3000
emp = LOAD '/hdfs/path/emp_file.txt' AS (deptno:int, sal:double);
dept = LOAD '/hdfs/path/dept_file.txt' AS (dno:int, loc:chararray);
filtered = FILTER emp BY sal > 3000;
joined = JOIN filtered BY deptno, dept BY dno;
grouped = GROUP joined BY loc;
result = FOREACH grouped GENERATE group, AVG(joined.sal);
DUMP result;
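To run the script (standard Pig usage, not specific to this example): save the lines to a file and run it in batch mode, or type them into the interactive grunt shell:

pig script.pig    # batch mode; running plain `pig` starts the grunt shell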

Distro

Reference distro: Apache Hadoop (the one used in Hadoop: The Definitive Guide)

Distros covering the entire ecosystem, e.g. Cloudera CDH (used in the Quickstart VM below)

Using AWS EMR (Amazon's managed Hadoop)

Learning sources

BigData University

Coursera

Udacity

edX

Udemy

free courses

Safari

Hadoop video courses

Setup for learning/testing

Ref

Uses Lubuntu, but useful: http://hadooppseudomode.blogspot.in/

A little outdated, but useful: http://www.michael-noll.com/tutorials/running-hadoop-on-ubuntu-linux-single-node-cluster/

Cloudera Quickstart VM for VirtualBox

To share files with the host, use "Bridged Adapter" in the VM's network settings, not NAT.

On boot, a browser opens; click the "Cloudera Manager" bookmark to start.

username: cloudera, password: cloudera (the account has sudo privileges)

mysql root password: cloudera

Cloudera Manager

HDFS web UI

http://localhost:50070    # NameNode UI (the DataNode UI is on 50075)

Accessing the JobTracker via the web interface

http://localhost:50030
http://192.168.1.1:50030  # if on another server

Spinning up AWS

http://insightdataengineering.com/blog/hadoopdevops/

https://class.coursera.org/datasci-002/wiki/awssetup

References

Hadoop: The Definitive Guide, 4th Ed. O’Reilly