Monday, September 17, 2012

Tech Review: Map Reduce: Thursday, September 17, 2012

Google Code University

Introduction to Parallel Programming and MapReduce
  • Serial vs. Parallel Programming
    • Parallel Programming
      • processing is broken up into parts
      • each part can be executed concerrently
  • The Basics
    • identify set of task that can run concurrently
    • partitions of data that can be process concurrently
    • Fibonacci function cannot be parallelized
    • ideal parallel computing 
      • no dependencies in the computations
      • no communication required between tasks
    • Master
      • initializes array
      • splits array according to number of workers
      • sends each worker its subarray
      • receives result from each worker
    • Worker
      • receives subarray from master
      • processes subarray
      • returns result to master
    • static load balancing
  • What is MapReduce?
    • map and reduce combinators from functional language like Lisp
    • map
      • function and sequence of values
      • applies function to each values in sequence
    • reduce
      • combines all elements of a sequence
      • uses binary operation
    • MapReduce is an abstraction that allows an engineer to perform simple computations while hiding the details of
      • parallelization
      • data distribution
      • load balancing
      • fault tolerance
    • Map
      • part of MapReduce library
      • takes input pair
      • produces set of intermediate key/value pairs
    • Reduce
      • intermediate key
      • set of values for key
      • merges together values to form a possibly smaller set of values
    • MapReduce Execution Overview
      • shards
        • partitioned input data to by distributed across multiple machines
        • typically 16 to 64 MB per piece
      • map tasks
      • reduce tasks
      • master picks who does what
    • map task
      • reads input shard
      • parses key/value pairs out of input data
      • passes each pair to map function
      • produces intermediate key/value pairs
      • tells master where data is located
    • reduce task
      • assigns key/values pairs
      • read intermediate data
      • sort by intermediate keys
      • group all occurrences of the same key
      • outputs results of reduce function to master
    • worker ping by master
      • if no answer in allotted time, worker marked as failed
  • MapReduce Examples
    • distributed grep
Hadoop Basics
  • open source project for processing large datasets in parallel
  • two main parts
    • Hodoop Distributed File System (HDFS)
    • Map Reduce Framework
  • two main phases to process data
    • Map phase
    • Reduce phase

No comments:

Post a Comment