Introduction to Parallel Programming and MapReduce
- Serial vs. Parallel Programming
- Parallel Programming
- processing is broken up into parts
- each part can be executed concurrently
- The Basics
- identify sets of tasks that can run concurrently
- identify partitions of data that can be processed concurrently
- Fibonacci function cannot be parallelized: each value depends on the two values before it
- ideal parallel computing
- no dependencies in the computations
- no communication required between tasks
- Master
- initializes array
- splits array according to number of workers
- sends each worker its subarray
- receives result from each worker
- Worker
- receives subarray from master
- processes subarray
- returns result to master
- static load balancing
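The master/worker scheme above can be sketched on a single machine, using Python threads in place of separate machines; the names `master` and `worker` are illustrative, and the "result" each worker returns is a partial sum.

```python
from concurrent.futures import ThreadPoolExecutor

def worker(subarray):
    # Worker: receives a subarray from the master, processes it,
    # and returns the result (here, a partial sum)
    return sum(subarray)

def master(data, num_workers=4):
    # Master: splits the array according to the number of workers
    # (static load balancing: the split is fixed up front), sends
    # each worker its subarray, and combines the received results
    chunk = (len(data) + num_workers - 1) // num_workers
    parts = [data[i:i + chunk] for i in range(0, len(data), chunk)]
    with ThreadPoolExecutor(max_workers=num_workers) as pool:
        partials = list(pool.map(worker, parts))
    return sum(partials)

print(master(list(range(100))))  # 4950
```

Note there are no dependencies between the subarrays and no communication between workers, so this is the ideal case described above.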
- What is MapReduce?
- map and reduce combinators from functional languages like Lisp
- map
- function and sequence of values
- applies the function to each value in the sequence
- reduce
- combines all elements of a sequence
- uses binary operation
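These two combinators exist directly in Python, which makes the origin of the names concrete:

```python
from functools import reduce

nums = [1, 2, 3, 4]
# map: applies the function to each value in the sequence
squares = list(map(lambda x: x * x, nums))
# reduce: combines all elements using a binary operation
total = reduce(lambda a, b: a + b, squares)
print(squares, total)  # [1, 4, 9, 16] 30
```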
- MapReduce is an abstraction that allows an engineer to perform simple computations while hiding the details of
- parallelization
- data distribution
- load balancing
- fault tolerance
- Map
- part of MapReduce library
- takes input pair
- produces set of intermediate key/value pairs
- Reduce
- intermediate key
- set of values for key
- merges together values to form a possibly smaller set of values
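Word counting is the canonical illustration of these two user-supplied functions; a minimal sketch, with the hypothetical names `map_fn` and `reduce_fn`:

```python
def map_fn(key, value):
    # Map: key is a document name (unused here), value is its contents;
    # produces intermediate (word, 1) pairs
    return [(word, 1) for word in value.split()]

def reduce_fn(key, values):
    # Reduce: key is a word, values is the set of counts emitted for it;
    # merges them into a (possibly smaller) result
    return sum(values)

print(map_fn("doc1", "the quick the"))  # [('the', 1), ('quick', 1), ('the', 1)]
print(reduce_fn("the", [1, 1]))         # 2
```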
- MapReduce Execution Overview
- shards
- input data partitioned to be distributed across multiple machines
- typically 16 to 64 MB per piece
- map tasks
- reduce tasks
- master picks who does what
- map task
- reads input shard
- parses key/value pairs out of input data
- passes each pair to map function
- produces intermediate key/value pairs
- tells master where data is located
- reduce task
- assigned a partition of the intermediate key/value pairs by the master
- read intermediate data
- sort by intermediate keys
- group all occurrences of the same key
- appends results of reduce function to a final output file
- workers pinged periodically by master
- if no answer in allotted time, worker marked as failed
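The whole execution pipeline (map over shards, group occurrences of the same intermediate key, then reduce per key) can be simulated in a few lines of single-process Python; the shard dict and function names are illustrative:

```python
from collections import defaultdict

def map_fn(name, text):
    # user-supplied Map: emit (word, 1) for each word in the shard
    return [(w, 1) for w in text.split()]

def reduce_fn(key, values):
    # user-supplied Reduce: merge all values emitted for one key
    return sum(values)

def run_mapreduce(shards, map_fn, reduce_fn):
    # Map phase: one map task per input shard
    intermediate = []
    for name, text in shards.items():
        intermediate.extend(map_fn(name, text))
    # Shuffle/sort: group all occurrences of the same intermediate key
    groups = defaultdict(list)
    for key, value in intermediate:
        groups[key].append(value)
    # Reduce phase: one reduce call per intermediate key
    return {key: reduce_fn(key, values)
            for key, values in sorted(groups.items())}

print(run_mapreduce({"s1": "a b a", "s2": "b a"}, map_fn, reduce_fn))
# {'a': 3, 'b': 2}
```

In the real system the map tasks, the grouping, and the reduce tasks each run on different machines, with the master coordinating.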
- MapReduce Examples
- distributed grep
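For distributed grep, the map function emits a line if it matches the pattern, and the reduce function just passes the intermediate data through. A sketch of the map side (argument names are illustrative):

```python
import re

def grep_map(filename, contents, pattern):
    # emit (filename, line) for each line that matches the pattern;
    # the identity reduce then simply copies these pairs to the output
    return [(filename, line)
            for line in contents.splitlines()
            if re.search(pattern, line)]

print(grep_map("f.txt", "hello\nworld\nhell", "hell"))
# [('f.txt', 'hello'), ('f.txt', 'hell')]
```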
Hadoop Basics
- open source project for processing large datasets in parallel
- two main parts
- Hadoop Distributed File System (HDFS)
- Map Reduce Framework
- two main phases to process data
- Map phase
- Reduce phase
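One common way to write these two phases for Hadoop is Hadoop Streaming, where the mapper and reducer are ordinary programs reading lines from stdin and writing tab-separated key/value lines to stdout. A hedged sketch of a word-count pair in that style (written as generators so the logic is easy to test; Streaming itself would wire them to stdin/stdout):

```python
def mapper(lines):
    # Map phase: emit "word<TAB>1" for every word seen
    for line in lines:
        for word in line.split():
            yield f"{word}\t1"

def reducer(lines):
    # Reduce phase: Hadoop delivers input sorted by key,
    # so all lines for one word arrive adjacent to each other
    current, count = None, 0
    for line in lines:
        word, n = line.rsplit("\t", 1)
        if word != current:
            if current is not None:
                yield f"{current}\t{count}"
            current, count = word, 0
        count += int(n)
    if current is not None:
        yield f"{current}\t{count}"

print(list(reducer(sorted(mapper(["a b a", "b"])))))
# ['a\t2', 'b\t2']
```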