Monday, September 17, 2012

Tech Review: Map Reduce: Thursday, September 13, 2012

Google Code University

Introduction to Parallel Programming and MapReduce
  • Serial vs. Parallel Programming
    • Parallel Programming
      • processing is broken up into parts
      • each part can be executed concurrently
  • The Basics
    • identify sets of tasks that can run concurrently
    • identify partitions of data that can be processed concurrently
    • the Fibonacci function cannot be parallelized (each term depends on the two before it)
    • ideal parallel computing 
      • no dependencies in the computations
      • no communication required between tasks
    • Master
      • initializes array
      • splits array according to number of workers
      • sends each worker its subarray
      • receives result from each worker
    • Worker
      • receives subarray from master
      • processes subarray
      • returns result to master
    • static load balancing
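
The master/worker steps above can be sketched in Python, with threads standing in for separate machines; the names `master` and `worker` are illustrative, not from any library:

```python
# Minimal sketch of the master/worker pattern with static load balancing.
# Threads stand in for worker machines; in a real system each worker
# would run on its own node and communicate over the network.
from concurrent.futures import ThreadPoolExecutor

def worker(subarray):
    # Worker: receives its subarray, processes it, returns the result.
    return sum(subarray)

def master(array, num_workers):
    # Master: splits the array evenly (static load balancing),
    # sends each worker its subarray, and combines the results.
    chunk = (len(array) + num_workers - 1) // num_workers
    subarrays = [array[i:i + chunk] for i in range(0, len(array), chunk)]
    with ThreadPoolExecutor(max_workers=num_workers) as pool:
        results = list(pool.map(worker, subarrays))
    return sum(results)
```

The load balancing is "static" because the split is decided up front, before any worker reports how fast it is running.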
  • What is MapReduce?
    • map and reduce combinators from functional languages like Lisp
    • map
      • takes a function and a sequence of values
      • applies the function to each value in the sequence
    • reduce
      • combines all elements of a sequence
      • uses binary operation
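
Both combinators exist directly in Python, which makes the idea concrete:

```python
# map: apply a function to each value in a sequence.
# reduce: combine all elements of a sequence with a binary operation.
from functools import reduce

squares = list(map(lambda x: x * x, [1, 2, 3, 4]))
total = reduce(lambda a, b: a + b, squares)
```

Here `squares` is `[1, 4, 9, 16]` and `total` is `30`.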
    • MapReduce is an abstraction that allows an engineer to perform simple computations while hiding the details of
      • parallelization
      • data distribution
      • load balancing
      • fault tolerance
    • Map
      • part of MapReduce library
      • takes input pair
      • produces set of intermediate key/value pairs
    • Reduce
      • intermediate key
      • set of values for key
      • merges together values to form a possibly smaller set of values
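
The canonical word-count example shows both halves; a Python sketch, with the function names chosen for illustration:

```python
def map_fn(key, value):
    # key: document name (unused), value: document contents.
    # Emits a set of intermediate (word, 1) key/value pairs.
    for word in value.split():
        yield (word, 1)

def reduce_fn(key, values):
    # key: an intermediate key (a word), values: all counts emitted
    # for that key. Merges them into a single, smaller value.
    return (key, sum(values))
```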
    • MapReduce Execution Overview
      • shards
        • input data partitioned to be distributed across multiple machines
        • typically 16 to 64 MB per piece
      • map tasks
      • reduce tasks
      • master picks who does what
    • map task
      • reads input shard
      • parses key/value pairs out of input data
      • passes each pair to map function
      • produces intermediate key/value pairs
      • tells master where data is located
    • reduce task
      • is assigned intermediate key/value pairs by the master
      • read intermediate data
      • sort by intermediate keys
      • group all occurrences of the same key
      • appends the output of the reduce function to a final output file
    • master pings each worker periodically
      • if no answer in the allotted time, the worker is marked as failed
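
The whole execution flow above (shard the input, map each shard, group by intermediate key, reduce each group) can be simulated in a few lines of Python; this is an in-memory sketch only, with no distribution or fault tolerance:

```python
# Toy single-process simulation of the MapReduce execution flow.
from collections import defaultdict

def map_fn(key, value):
    # Parse key/value pairs out of the shard; emit intermediate pairs.
    for word in value.split():
        yield (word, 1)

def reduce_fn(key, values):
    # Merge together all values for one intermediate key.
    return key, sum(values)

def run_mapreduce(shards):
    # Map phase: each shard is read and passed through the map function.
    intermediate = defaultdict(list)
    for shard_id, shard in enumerate(shards):
        for k, v in map_fn(shard_id, shard):
            intermediate[k].append(v)  # group occurrences of the same key
    # Reduce phase: iterate intermediate keys in sorted order, merge values.
    return dict(reduce_fn(k, vs) for k, vs in sorted(intermediate.items()))
```

In the real system each shard would be 16–64 MB and the map/reduce tasks would be scheduled onto separate machines by the master.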
  • MapReduce Examples
    • distributed grep
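
Distributed grep maps cleanly onto the model: map emits a line if it matches the pattern, and reduce just passes the matches through. A sketch, with the pattern hard-coded purely for illustration:

```python
import re

PATTERN = re.compile(r"error")  # illustrative pattern, not from the source

def map_fn(filename, line):
    # Map: emit the line if it matches the pattern.
    if PATTERN.search(line):
        yield (filename, line)

def reduce_fn(key, values):
    # Reduce: identity -- just collect the matching lines.
    return key, list(values)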
Hadoop Basics
  • open source project for processing large datasets in parallel
  • two main parts
    • Hadoop Distributed File System (HDFS)
    • Map Reduce Framework
  • two main phases to process data
    • Map phase
    • Reduce phase
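
One common way to drive Hadoop's two phases from Python is Hadoop Streaming, where the mapper and reducer are plain programs that read lines and write `key<TAB>value` lines, and Hadoop sorts the mapper output by key before the reduce phase. A hedged sketch of that convention (the functions take iterables of lines so the stdin/stdout plumbing is left out):

```python
def mapper(lines):
    # Map phase: emit one "word<TAB>1" line per word.
    for line in lines:
        for word in line.split():
            yield f"{word}\t1"

def reducer(sorted_lines):
    # Reduce phase: input arrives sorted by key, so all lines for a
    # word are adjacent; sum the counts for each run of equal keys.
    current, count = None, 0
    for line in sorted_lines:
        word, n = line.split("\t")
        if word != current:
            if current is not None:
                yield f"{current}\t{count}"
            current, count = word, 0
        count += int(n)
    if current is not None:
        yield f"{current}\t{count}"
```

In a real job, Hadoop itself performs the sort between the two phases; `sorted(mapper(...))` stands in for it when testing locally.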

Friday, September 7, 2012

Tech Review Big Data: Thursday, September 6, 2012


Big data
  • a collection of data sets so large and complex that it becomes awkward to work with using on-hand database management tools.
  • Difficulties
    • capture
    • storage
    • search
    • sharing
    • analysis
    • visualization
  • What is considered "big data" varies depending on the capabilities of the organization managing the data set.
  • Big data sizes are a constantly moving target
    • few dozen terabytes
    • many petabytes
  • new platform of "big data" tools 
    • Apache Hadoop
  • MIKE2.0
  • Doug Laney
    • data growth challenges and opportunities are three-dimensional
      • increasing volume (amount of data)
      • velocity (speed of data in and out)
      • variety (range of data types and sources)
  • Big players in big data
    • Oracle
    • IBM
    • Microsoft
    • SAP
    • HP


10 Steps for Testing and Choosing a Big Data Appliance




Marko Grobelnik



IBM:  What is big data?
  • Spans four dimensions
    • Volume
    • Velocity
    • Variety
    • Veracity
  • big data is more than simply a matter of size
IBM big data platform

Do a search on "ibm what is big data?" for some more reading.

O'Reilly:  What is big data?
  • data that exceeds the processing capacity of conventional db systems.
    • too big
    • too fast
    • doesn't fit structures of db architectures
Google BigQuery


Stanford University:  Data Mining Certificates Online

  • concepts not tools
  • doesn't seem to be hands on
Texas A&M:  Data Mining Certificate
  • SAS classes
  • heavily statistic based

Who are the top influencers in Big Data, Analytics, Data Mining?


Big Data on Campus
  • should be called big data in education
Online Education in Analytics, Data Mining and Data Science



Big Data University



Web Intelligence and Big Data



Army: Manning Snuck 'Data-Mining' Software Onto Secret Network

Monday, August 13, 2012

Tech Review: Big Data: Thursday, August 9, 2012


Unified computing makes Alaska smaller, faster, more secure

  • unified computing
    • high-performance system
    • loosely coupled storage
    • networking
    • parallel processing functions 
    • high-bandwidth
      • 10-gigabit Ethernet network
    • Cisco's Unified Computing System (UCS)
      • single management interface
      • thousands of virtual machines.