Notes on the Hadoop ecosystem.

Hadoop cluster modes

standalone mode (single node)

  • no daemons; everything runs in a single JVM
  • local filesystem only, no HDFS
  • quick way to test MapReduce code, etc.

pseudo-distributed mode (single node)

  • all daemons run on one machine; HDFS with a single DataNode (replication factor 1)

fully distributed mode (cluster)

  • production
  • default HDFS replication factor: 3

HDFS

  • Distributed File System
  • When a file is copied to HDFS, it is split into blocks; each block is stored on a DataNode (slave).
  • Each block is 64MB by default, but 128MB is commonly recommended.
  • Replication factor (RF) = 3: each block is replicated 3 times by default.
  • Write once, read many. No edits, no appends.
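A quick back-of-the-envelope sketch (Python) of what these defaults mean for storage; the 1 GB file size is made up purely for illustration:

# illustrative only: blocks and raw storage for a hypothetical 1 GB file
import math

file_size_mb  = 1024   # hypothetical 1 GB file
block_size_mb = 128    # recommended block size (default is 64 MB)
replication   = 3      # default replication factor (RF)

num_blocks   = math.ceil(file_size_mb / block_size_mb)  # 8 blocks
num_replicas = num_blocks * replication                 # 24 stored block copies
raw_storage  = file_size_mb * replication               # 3072 MB of raw disk used

print("{0} blocks, {1} block replicas, {2} MB raw storage".format(
    num_blocks, num_replicas, raw_storage))

(The last block is usually smaller than the block size; a block only occupies the space its data needs.)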

Master daemons

machine 1: NameNode (NN)

machine 2: Secondary NameNode (2NN)

  • periodically merges the edit log into the fsimage (checkpointing); not a hot standby

Slave daemons

on each slave machine: DataNode (DN)

NameNode (NN):

  • controls (orchestrates, delegates)
  • holds all metadata (file names, permissions, directory structure) in memory and on disk
  • memory hungry
  • if the NN is lost, all of HDFS is lost
  • 2 files on disk: fsimage and the edit log
  • RAID and frequent backups are recommended!

DataNode (DN):

  • stores the actual raw data (in blocks)
  • DN sends a heartbeat to the NN every 3 seconds (default)
    • "I'm alive"
    • "I can accept read/write requests"
    • if it doesn't report for 10 heartbeats, the DN is taken out of service
    • if it doesn't report for 10 minutes, its blocks are re-replicated elsewhere
  • DN sends a block report to the NN every hour (by default)
    • the NN can create or delete block replicas to match the RF
  • each block also has a "checksum" file to detect corruption
    • the NN handles corruption by re-replicating good copies and deleting bad blocks
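A toy sketch (Python, not Hadoop code) of the liveness policy described above; the thresholds are taken from the timings in these notes, and all names are made up for illustration:

# toy model of how the NN might classify DataNodes from heartbeat silence
# (illustrative only; intervals come from the notes above, not Hadoop source)
import time

HEARTBEAT_INTERVAL = 3                   # DN heartbeats every 3 seconds
STALE_AFTER = 10 * HEARTBEAT_INTERVAL    # ~10 missed heartbeats: take DN out of service
DEAD_AFTER  = 10 * 60                    # silent for 10 minutes: re-replicate its blocks

last_heartbeat = {"dn1": time.time(), "dn2": time.time() - 700}  # dn2 silent for ~700 s

def datanode_state(dn):
    silence = time.time() - last_heartbeat[dn]
    if silence > DEAD_AFTER:
        return "dead: re-replicate its blocks on other DNs"
    if silence > STALE_AFTER:
        return "out of service: stop sending read/write requests"
    return "alive: can accept read/write requests"

for dn in sorted(last_heartbeat):
    print("{0} -> {1}".format(dn, datanode_state(dn)))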

Hadoop filesystem

hadoop fs -<command>...
    # some old commands use "hadoop dfs"; "dfs" and "fs" are the same!

hadoop fs -ls
hadoop fs -mkdir
hadoop fs -rm
hadoop fs -mv

Put a file into HDFS

hadoop fs -put <file> <destination>

Retrieve a file from HDFS to the local file system

hadoop fs -get <file> <destination>

Map/Reduce

  • master node: JobTracker daemon
  • worker nodes: TaskTracker daemons

If not using Java, one can use Python or any other language via "streaming" (a minimal mapper/reducer sketch follows the command below).

hadoop jar <path/to/hadoop-streaming.jar> \
    -mapper <mapper_code> -reducer <reducer_code> \
    -file <mapper_code> -file <reducer_code> \
    -input <input_dir> -output <output_dir>

  • output_dir must not exist (the job fails if it does).

  • use an alias script (below) to shorten the above to

    hs <mapper_code> <reducer_code> <input_dir> <output_dir>
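A minimal sketch of what mapper_code and reducer_code could look like in Python, using word count as an assumed example (the file names mapper.py / reducer.py match the Udacity example below). Streaming feeds raw lines to the mapper on stdin, sorts the mapper output by key, and expects key<TAB>value lines on stdout:

#!/usr/bin/env python
# mapper.py -- emit "word<TAB>1" for every word on stdin
import sys

for line in sys.stdin:
    for word in line.strip().split():
        print("{0}\t1".format(word))

#!/usr/bin/env python
# reducer.py -- input arrives sorted by key, so counts for one word are contiguous
import sys

current_word, current_count = None, 0
for line in sys.stdin:
    word, count = line.strip().split("\t", 1)
    if word != current_word and current_word is not None:
        print("{0}\t{1}".format(current_word, current_count))
        current_count = 0
    current_word = word
    current_count += int(count)
if current_word is not None:
    print("{0}\t{1}".format(current_word, current_count))

Both scripts usually need to be executable (chmod +x mapper.py reducer.py); they can be tested locally with a pipe: cat somefile | ./mapper.py | sort | ./reducer.py. Because this reduce step is a plain sum, the same reducer.py can also be passed as the combiner (the hsc alias below does exactly that).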

Shell aliases (.bashrc, etc.), from Udacity

# alias to run a streaming MapReduce job: hs <mapper> <reducer> <input_dir> <output_dir>
run_mapreduce() {
    hadoop jar /usr/lib/hadoop-0.20-mapreduce/contrib/streaming/hadoop-streaming-2.0.0-mr1-cdh4.1.1.jar \
        -mapper "$1" -reducer "$2" -file "$1" -file "$2" -input "$3" -output "$4"
}
alias hs=run_mapreduce

# alias to run a streaming MapReduce job with the reducer also used as combiner:
#   hsc <mapper> <reducer> <input_dir> <output_dir>
run_mapreduce_with_combiner() {
    hadoop jar /usr/lib/hadoop-0.20-mapreduce/contrib/streaming/hadoop-streaming-2.0.0-mr1-cdh4.1.1.jar \
        -mapper "$1" -reducer "$2" -combiner "$2" -file "$1" -file "$2" -input "$3" -output "$4"
}
alias hsc=run_mapreduce_with_combiner

Example from Udacity

# there's a mapper.py and reducer.py.
# data is in hadoop's fs in /myinput/ directory.  
# output will be stored in joboutput/ directory.

hadoop jar \
    /usr/lib/hadoop-0.20-mapreduce/contrib/streaming/hadoop-streaming-2.0.0-mr1-cdh4.1.1.jar \
    -mapper mapper.py -reducer reducer.py -file mapper.py -file reducer.py \
    -input myinput -output joboutput

# runs.... 

When finished, it produces 

file "joboutput/_SUCCESS"
file "joboutput/_logs"
file "joboutput/part-00000"  #<-- this is the actual result

# see result
hadoop fs -cat joboutput/part-00000 | less 

# copy result to local drive
hadoop fs -get joboutput/part-00000 my_output.txt

By adding a combiner, which pre-aggregates mapper output locally before the shuffle, we can improve efficiency and reduce the load on the reducer. This works when the reduce operation is associative and commutative (e.g., sums and counts); the hsc alias above simply reuses the reducer script as the combiner.

Hive

HiveQL: converts SQL-like queries into MapReduce jobs

  • metastore DB, which can be centralized if there are multiple users
  • server interface via HUE or HiveServer
  • roughly 10% slower than hand-written Java MapReduce
  • faster than Hadoop streaming
  • excellent for joins

  • Impala (Cloudera) is overtaking Hive

Example (InfiniteSkills)

# first, create tables and populate tables
# and then...

SELECT loc, AVG(sal)
FROM emp JOIN dept USING (deptno)
WHERE sal > 3000
GROUP BY loc;

Client apps

"HiveServer2 Clients" page on the Apache Hive wiki (Apache Software Foundation)

  • CLI client: Beeline (the HiveServer2 CLI, replacing the old Hive CLI)

Pig

Alternative to Hive (Hive is the better fit for SQL users)

  • "Pig eats anything"
  • developers tend to prefer Pig over Hive
  • no need for a metastore DB; works on raw HDFS
  • good for unstructured data

Example (InfiniteSkills):

-- assumed schemas: emp(ename, sal, deptno), dept(deptno, loc)
emp  = LOAD '/hdfs/path/emp_file.txt'  AS (ename, sal, deptno);
dept = LOAD '/hdfs/path/dept_file.txt' AS (deptno, loc);
filtered_emp = FILTER emp BY sal > 3000;
joined  = JOIN filtered_emp BY deptno, dept BY deptno;
grouped = GROUP joined BY dept::loc;
result  = FOREACH grouped GENERATE group, AVG(joined.filtered_emp::sal);
DUMP result;

Distro

Reference distro: Apache Hadoop (Hadoop: The Definitive Guide)

Distro for the entire ecosystem

using AWS EMR

Learning sources

BigData University

Coursera

Udacity

  • Intro to Hadoop and MapReduce: excellent course (9/2015); works with its Cloudera VM

edX

  • Big data with Spark - UCB (not much on Hadoop)
  • Data mining (MIT)

Udemy

free courses

  • also see other classes to check which distro and VM they use and how they are set up

Safari

Hadoop video courses

Setup for learning/testing

Ref

uses Lubuntu, but useful: http://hadooppseudomode.blogspot.in/

a little outdated, but useful: http://www.michael-noll.com/tutorials/running-hadoop-on-ubuntu-linux-single-node-cluster/

Cloudera Quickstart VM for VirtualBox

  • go to the Cloudera website -> Downloads menu -> QuickStart VM -> VirtualBox
  • it uses CentOS 6.4 and requires 4GB of RAM, but I don't have enough RAM / it's too slow
  • use the ICH9 chipset (not PIIX3) and enable I/O APIC

To share files, use "Bridged Adapter" in the Network settings, not NAT

The browser opens; click the "Cloudera Manager" bookmark to start.

username: cloudera, password: cloudera

  • has sudo privileges

MySQL root password: cloudera

Cloudera Manager

HDFS web UI

http://localhost:50070    # NameNode web UI (the DataNode UI is on port 50075)

Accessing the JobTracker via its web interface

http://localhost:50030
http://192.168.1.1:50030  # if on another server

Spinning up AWS

http://insightdataengineering.com/blog/hadoopdevops/

  • excellent guide, recent
  • you can also spin up multiple t2.micro instances for an hour for free, but terminate them afterward

https://class.coursera.org/datasci-002/wiki/awssetup

  • Hadoop and Pig, using up-to-date info

References

Hadoop: The Definitive Guide, 4th ed., O'Reilly.