#cs451

Based around key-value pairs. If the input is a text file, key = position of a line and value = text

Two functions

  • map: (k1,v1) => List[(k2, v2)]
  • reduce: (k2, List[v2]) => List[(k3, v3)]

Example for Word Count

def map(line):
	for word in line:
		emit(word, 1)
 
def reduce(key, values):
	sum = 0
	for v in values:
		sum += v
	emit(key, sum)

We can also have multiple reducers and each reducer gets ALL pairs for a given key

Optional: define third function, partitioner

  • partition: (k2, v2, n) -> [0, n)

Apache Hadoop is the famous open-source implementation of MapReduce/

Framework:

  • Assign workers to map and reduce tasks
  • Divides data between map workers
  • Group intermediate values, sorting pairs by key
  • Handle errors

Optional: additional function combiner

  • combine: (k2, List[v2]) => List[(k2, v2)]

Map side:

  • map outputs are buffered in memory (circular buffer)
  • when full, spilled to disk into R intermediate files

Hadoop Distributed File System (HDFS)

  • Very large, millions of files
  • Assume commodity hardware
  • Optimized for batch processing

Very similar to GFS

NameNode (similar to GFS master node):

  • manages file system namespace
    • maps filename to set of blocks
    • maps blocks to the DataNodes where it resides
  • cluster configuration management
  • replication engine for blocks

DataNode:

  • block server
    • stores data in the local file system
    • stores metadata of a block
    • servers data and metadata to clients
  • periodically sends a report of all existing blocks to NameNode
  • pipelining => forwards data to other specified DataNodes

Policy:

  • 3 replicas will be stored on at least 2 racks.
  • client reads from nearest replica.

Hadoop Cluster Architecture

Use SAN (Storage Area Network) to get data to the workers.

  • not optimal because of high communication costs

Solution:

  • move workers to the data, co-locate storage and compute.
  • when assigning tasks to servers, considers which servers contain what part of the file to minimize copy over network

Combiners

Have same signature as reducers.

They are optional

  • may not be run
  • may run once
  • may run many times

They improve performance by reducing network traffic

In-Mapper Combine

In-memory combiner Pros: speed Cons: requires memory management