Introduction to Parallel Programming and MapReduce
- Serial vs. Parallel Programming
- Parallel Programming
- processing is broken up into parts
- each part can be executed concurrently
- The Basics
- identify sets of tasks that can run concurrently
- identify partitions of data that can be processed concurrently
- Fibonacci function cannot be parallelized: each value depends on the two values before it
- ideal parallel computing
- no dependencies in the computations
- no communication required between tasks
- Master
- initializes array
- splits array according to number of workers
- sends each worker its subarray
- receives result from each worker
- Worker
- receives subarray from master
- processes subarray
- returns result to master
- static load balancing
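The master/worker scheme above can be sketched on a single machine, using Python threads in place of separate machines; the names `master` and `worker` are illustrative, and the "result" each worker returns is a partial sum.

```python
from concurrent.futures import ThreadPoolExecutor

def worker(subarray):
    # Worker: receives a subarray from the master, processes it,
    # and returns the result (here, a partial sum)
    return sum(subarray)

def master(data, num_workers=4):
    # Master: splits the array according to the number of workers
    # (static load balancing: the split is fixed up front), sends
    # each worker its subarray, and combines the received results
    chunk = (len(data) + num_workers - 1) // num_workers
    parts = [data[i:i + chunk] for i in range(0, len(data), chunk)]
    with ThreadPoolExecutor(max_workers=num_workers) as pool:
        partials = list(pool.map(worker, parts))
    return sum(partials)

print(master(list(range(100))))  # 4950
```

Note there are no dependencies between the subarrays and no communication between workers, so this is the ideal case described above.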
- What is MapReduce?
- map and reduce combinators from functional languages like Lisp
- map
- function and sequence of values
- applies the function to each value in the sequence
- reduce
- combines all elements of a sequence
- uses binary operation
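These two combinators exist directly in Python, which makes the origin of the names concrete:

```python
from functools import reduce

nums = [1, 2, 3, 4]
# map: applies the function to each value in the sequence
squares = list(map(lambda x: x * x, nums))
# reduce: combines all elements using a binary operation
total = reduce(lambda a, b: a + b, squares)
print(squares, total)  # [1, 4, 9, 16] 30
```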
- MapReduce is an abstraction that allows an engineer to perform simple computations while hiding the details of
- parallelization
- data distribution
- load balancing
- fault tolerance
- Map
- part of MapReduce library
- takes input pair
- produces set of intermediate key/value pairs
- Reduce
- intermediate key
- set of values for key
- merges together values to form a possibly smaller set of values
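Word counting is the canonical illustration of these two user-supplied functions; a minimal sketch, with the hypothetical names `map_fn` and `reduce_fn`:

```python
def map_fn(key, value):
    # Map: key is a document name (unused here), value is its contents;
    # produces intermediate (word, 1) pairs
    return [(word, 1) for word in value.split()]

def reduce_fn(key, values):
    # Reduce: key is a word, values is the set of counts emitted for it;
    # merges them into a (possibly smaller) result
    return sum(values)

print(map_fn("doc1", "the quick the"))  # [('the', 1), ('quick', 1), ('the', 1)]
print(reduce_fn("the", [1, 1]))         # 2
```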
- MapReduce Execution Overview
- shards
- input data partitioned to be distributed across multiple machines
- typically 16 to 64 MB per piece
- map tasks
- reduce tasks
- master picks who does what
- map task
- reads input shard
- parses key/value pairs out of input data
- passes each pair to map function
- produces intermediate key/value pairs
- tells master where data is located
- reduce task
- assigned a partition of the intermediate key/value pairs by the master
- read intermediate data
- sort by intermediate keys
- group all occurrences of the same key
- appends results of reduce function to a final output file
- workers pinged periodically by master
- if no answer in allotted time, worker marked as failed
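The whole execution pipeline (map over shards, group occurrences of the same intermediate key, then reduce per key) can be simulated in a few lines of single-process Python; the shard dict and function names are illustrative:

```python
from collections import defaultdict

def map_fn(name, text):
    # user-supplied Map: emit (word, 1) for each word in the shard
    return [(w, 1) for w in text.split()]

def reduce_fn(key, values):
    # user-supplied Reduce: merge all values emitted for one key
    return sum(values)

def run_mapreduce(shards, map_fn, reduce_fn):
    # Map phase: one map task per input shard
    intermediate = []
    for name, text in shards.items():
        intermediate.extend(map_fn(name, text))
    # Shuffle/sort: group all occurrences of the same intermediate key
    groups = defaultdict(list)
    for key, value in intermediate:
        groups[key].append(value)
    # Reduce phase: one reduce call per intermediate key
    return {key: reduce_fn(key, values)
            for key, values in sorted(groups.items())}

print(run_mapreduce({"s1": "a b a", "s2": "b a"}, map_fn, reduce_fn))
# {'a': 3, 'b': 2}
```

In the real system the map tasks, the grouping, and the reduce tasks each run on different machines, with the master coordinating.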
- MapReduce Examples
- distributed grep
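For distributed grep, the map function emits a line if it matches the pattern, and the reduce function just passes the intermediate data through. A sketch of the map side (argument names are illustrative):

```python
import re

def grep_map(filename, contents, pattern):
    # emit (filename, line) for each line that matches the pattern;
    # the identity reduce then simply copies these pairs to the output
    return [(filename, line)
            for line in contents.splitlines()
            if re.search(pattern, line)]

print(grep_map("f.txt", "hello\nworld\nhell", "hell"))
# [('f.txt', 'hello'), ('f.txt', 'hell')]
```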
Hadoop Basics
- open source project for processing large datasets in parallel
- two main parts
- Hadoop Distributed File System (HDFS)
- Map Reduce Framework
- two main phases to process data
- Map phase
- Reduce phase
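One common way to write these two phases for Hadoop is Hadoop Streaming, where the mapper and reducer are ordinary programs reading lines from stdin and writing tab-separated key/value lines to stdout. A hedged sketch of a word-count pair in that style (written as generators so the logic is easy to test; Streaming itself would wire them to stdin/stdout):

```python
def mapper(lines):
    # Map phase: emit "word<TAB>1" for every word seen
    for line in lines:
        for word in line.split():
            yield f"{word}\t1"

def reducer(lines):
    # Reduce phase: Hadoop delivers input sorted by key,
    # so all lines for one word arrive adjacent to each other
    current, count = None, 0
    for line in lines:
        word, n = line.rsplit("\t", 1)
        if word != current:
            if current is not None:
                yield f"{current}\t{count}"
            current, count = word, 0
        count += int(n)
    if current is not None:
        yield f"{current}\t{count}"

print(list(reducer(sorted(mapper(["a b a", "b"])))))
# ['a\t2', 'b\t2']
```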