Friday, July 5, 2013

What is Apache Hadoop?

Newbies can get a clean and simple introduction to Hadoop from the following Pivotal blog posts:

1) Demystifying Apache Hadoop in 5 Pictures
2) Hadoop 101: Programming MapReduce with Native Libraries, Hive, Pig, and Cascading
3) 20+ Examples of Getting Results with Big Data

Highlights

  • Hadoop was developed to assist in Big Data analysis
  • Hadoop uses distributed computing to process large data sets quickly
  • Hadoop divides and distributes work across a large number of computers, spreading the data processing logic across tens, hundreds, or even thousands of commodity servers.
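
The divide-and-distribute idea can be sketched in a few lines of plain Python (a single-machine stand-in: thread workers play the role of servers, and the chunk size and sum-of-numbers task are arbitrary choices for illustration):

```python
from concurrent.futures import ThreadPoolExecutor

def split(data, n_chunks):
    """Divide the data into roughly equal chunks, one per worker."""
    size = max(1, len(data) // n_chunks)
    return [data[i:i + size] for i in range(0, len(data), size)]

def process(chunk):
    """Stand-in for the per-node work: here, just sum the numbers."""
    return sum(chunk)

data = list(range(1, 101))
chunks = split(data, 4)
with ThreadPoolExecutor(max_workers=4) as pool:
    partials = list(pool.map(process, chunks))
total = sum(partials)          # combine the partial results
print(total)                   # 5050, same answer as processing in one piece
```

The point is that each worker only ever sees its own chunk, and the partial results are cheap to combine — the same shape Hadoop scales out to thousands of machines.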

Hadoop's main components

1) Hadoop Distributed File System (HDFS): helps us split the data, put it on different nodes, replicate it, and manage it.

2) MapReduce: processes the data on each node in parallel and computes the results of the job
     a) Map: performs computation on the local data set on each node and outputs a list of key-value pairs
     b) Reduce: the output of the map step is sent to other nodes as input for the reduce step. Before reduce runs, the key-value pairs are typically sorted and shuffled so that all values for the same key end up together. The reduce phase then aggregates each key's list of values into a single result

3) Managing Hadoop jobs: in Hadoop the entire process is called a job.
    a) The JobTracker divides the job into tasks and schedules them to run on the nodes. It keeps track of the participating nodes, monitors the processes, orchestrates data flow, and handles failures.
    b) TaskTrackers run the tasks and report back to the JobTracker.
  With this layer of management automation, Hadoop can automatically distribute jobs across a large number of nodes in parallel and scale as more nodes are added.
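
The map, shuffle/sort, and reduce steps above can be sketched in plain Python — a word-count simulation on a single machine, with no Hadoop cluster involved; the node boundaries are only implied by the list of chunks:

```python
from collections import defaultdict

# Map: each "node" turns its local chunk of text into key-value pairs.
def map_phase(chunk):
    return [(word, 1) for word in chunk.split()]

# Shuffle/sort: group all values by key before reducing.
def shuffle(pairs):
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

# Reduce: aggregate each key's list of values into a single result.
def reduce_phase(groups):
    return {key: sum(values) for key, values in groups.items()}

chunks = ["big data big", "data big"]      # two "nodes", each with local data
mapped = [pair for c in chunks for pair in map_phase(c)]
counts = reduce_phase(shuffle(mapped))
print(counts)                              # {'big': 3, 'data': 2}
```

In real Hadoop the map output never gathers in one place like this — the shuffle moves each key's pairs over the network to whichever node runs that key's reduce task — but the data flow is the same.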

Hadoop programming

There are four coding approaches:
1) Native Hadoop library: writing Java directly against the Hadoop API gives the greatest performance and the most fine-grained control
2) Pig: its language, Pig Latin, resembles SQL but is procedural, not declarative
3) Hive: started at Facebook. It provides a more SQL-like interface (HiveQL) and is considered the slowest of these approaches
4) Cascading: a set of .jars that define data processing APIs and integration APIs, as well as a process planner and scheduler
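
Beyond these four, Hadoop also ships a Streaming utility that runs any executable as the mapper and reducer, reading lines on stdin and emitting tab-separated key-value pairs on stdout. It is not covered above, but it makes the programming model easy to see. A minimal word-count pair, sketched here as Python functions chained in-process rather than as the two separate scripts a real Streaming job would use:

```python
from itertools import groupby

# Mapper: emit one "word<TAB>1" line per word (the Streaming convention).
def map_lines(lines):
    for line in lines:
        for word in line.split():
            yield f"{word}\t1"

# Reducer: Streaming delivers the mapper output sorted by key, so
# consecutive lines share a key and can be summed with groupby.
def reduce_lines(sorted_lines):
    pairs = (line.split("\t") for line in sorted_lines)
    for word, group in groupby(pairs, key=lambda kv: kv[0]):
        yield f"{word}\t{sum(int(v) for _, v in group)}"

mapped = sorted(map_lines(["big data", "big big data"]))
print(list(reduce_lines(mapped)))   # ['big\t3', 'data\t2']
```

The `sorted()` call stands in for the shuffle/sort that the framework performs between the two scripts.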
