02 - MapReduce

Based around key-value pairs. If the input is a text file, key = position of a line and value = text

Two functions

map: (k1,v1) => List[(k2, v2)]
reduce: (k2, List[v2]) => List[(k3, v3)]

Example for Word Count

def map(line):
	for word in line:
		emit(word, 1)
 
def reduce(key, values):
	sum = 0
	for v in values:
		sum += v
	emit(key, sum)

We can also have multiple reducers and each reducer gets ALL pairs for a given key

Optional: define third function, partitioner

partition: (k2, v2, n) -> [0, n)

Apache Hadoop is the famous open-source implementation of MapReduce/

Framework:

Assign workers to map and reduce tasks
Divides data between map workers
Group intermediate values, sorting pairs by key
Handle errors

Optional: additional function combiner

combine: (k2, List[v2]) => List[(k2, v2)]

Map side:

map outputs are buffered in memory (circular buffer)
when full, spilled to disk into R intermediate files

Hadoop Distributed File System (HDFS)

Very large, millions of files
Assume commodity hardware
Optimized for batch processing

Very similar to GFS

NameNode (similar to GFS master node):

manages file system namespace
- maps filename to set of blocks
- maps blocks to the DataNodes where it resides
cluster configuration management
replication engine for blocks

DataNode:

block server
- stores data in the local file system
- stores metadata of a block
- servers data and metadata to clients
periodically sends a report of all existing blocks to NameNode
pipelining => forwards data to other specified DataNodes

Policy:

3 replicas will be stored on at least 2 racks.
client reads from nearest replica.

Hadoop Cluster Architecture

Use SAN (Storage Area Network) to get data to the workers.

not optimal because of high communication costs

Solution:

move workers to the data, co-locate storage and compute.
when assigning tasks to servers, considers which servers contain what part of the file to minimize copy over network

Combiners

Have same signature as reducers.

They are optional

may not be run
may run once
may run many times

They improve performance by reducing network traffic

In-Mapper Combine

In-memory combiner Pros: speed Cons: requires memory management

Xin Cheng Shen

Table of Contents

02 - MapReduce

Hadoop Distributed File System (HDFS)

Hadoop Cluster Architecture

Combiners

Graph View

Recent Notes

Xin Cheng Shen

06 - Server Design

07 - Distributed Naming

Backlinks

Xin Cheng Shen

Table of Contents

02 - MapReduce

Hadoop Distributed File System (HDFS) §

Hadoop Cluster Architecture §

Combiners §

Graph View

Recent Notes

Xin Cheng Shen

06 - Server Design

07 - Distributed Naming

Backlinks

Hadoop Distributed File System (HDFS)

Hadoop Cluster Architecture

Combiners