
3 - Implementation
Many different implementations are possible; the environment at Google:
- Dual-processor machines with 2-4 GB of memory
- Commodity networking hardware
- Clusters of hundreds or thousands of machines, so failures are common
- Storage provided by inexpensive IDE disks attached directly to the machines
- Jobs are submitted to a scheduling system
Execution Overview
The input data is split into M pieces so that map invocations can be processed in parallel by different machines. Reduce invocations are distributed by partitioning the intermediate key space into R pieces using a partitioning function (e.g. hash(key) mod R); R is specified by the user.
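A minimal Python sketch of the default hash-based partitioning; the function name is mine, and md5 is chosen only because Python's built-in hash() is randomized per process, whereas the partitioner must be stable across machines:

```python
import hashlib

def partition(key: str, R: int) -> int:
    """Default partitioning: hash(key) mod R.

    A stable hash guarantees that every intermediate pair with the
    same key lands in the same reduce partition on every machine.
    """
    digest = hashlib.md5(key.encode("utf-8")).digest()
    return int.from_bytes(digest, "big") % R
```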
Steps
- The input files are split into M pieces (typically 16 to 64 MB each).
- One copy of the program is the master; it picks idle workers and assigns each one a task (there are M map tasks and R reduce tasks).
- A map worker reads the contents of its input split, passes each key/value pair to the map function, and buffers the results in memory.
- Periodically, the buffered pairs are written to local disk, partitioned into R regions, and their locations are passed back to the master, which forwards them to the reduce workers.
- When a reduce worker is notified of these locations, it uses RPCs to read the data from the map workers' local disks. Once it has all the intermediate data, it sorts it by intermediate key so that all occurrences of the same key are grouped together.
- Each unique key and its corresponding list of values are passed to the reduce function; the result is appended to a final output file for that reduce partition.
- The job is done after all map tasks and reduce tasks have completed.
There are R final output files, one per reduce task.
Map workers store their output on local disk; reduce workers store their output in a global file system. A toy sketch of both worker loops is below.
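A single-process toy sketch of the two worker loops described above, reusing the partition sketch from earlier. All function names are made up for illustration; the real system distributes these tasks across machines and moves data via local disk, RPC, and the global file system rather than returning it in memory:

```python
import itertools
from collections import defaultdict

def run_map_task(split, map_fn, R):
    """Map side: apply map_fn to every record in the split, buffer
    the intermediate pairs, and partition them into R regions. In
    the real system each region is written to local disk and its
    location reported to the master; here the regions are returned."""
    regions = defaultdict(list)
    for key, value in split:
        for k, v in map_fn(key, value):
            regions[partition(k, R)].append((k, v))
    return regions

def run_reduce_task(pairs, reduce_fn):
    """Reduce side: sort the fetched intermediate pairs by key,
    group occurrences of each key, and call reduce_fn once per
    unique key, collecting the final output."""
    pairs = sorted(pairs, key=lambda kv: kv[0])
    out = []
    for k, group in itertools.groupby(pairs, key=lambda kv: kv[0]):
        out.append((k, reduce_fn(k, [v for _, v in group])))
    return out

# Example: word count over one split, with R = 2 reduce partitions.
def wc_map(_, line):
    for word in line.split():
        yield word, 1

def wc_reduce(word, counts):
    return sum(counts)

regions = run_map_task([(0, "the cat the hat")], wc_map, R=2)
for r, pairs in regions.items():
    print(r, run_reduce_task(pairs, wc_reduce))
```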
Master Data Structures
For each map and reduce task, the master stores its state (idle, in-progress, or completed) and the identity of the worker machine (for non-idle tasks). For each completed map task, it also stores the locations and sizes of the R intermediate file regions, which it pushes incrementally to in-progress reduce tasks.
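A minimal sketch of what the master might track per task. The field names are assumptions; the paper only specifies the state, the worker identity, and (for completed map tasks) the intermediate region locations and sizes:

```python
from dataclasses import dataclass, field
from enum import Enum, auto

class TaskState(Enum):
    IDLE = auto()
    IN_PROGRESS = auto()
    COMPLETED = auto()

@dataclass
class Task:
    task_id: int
    kind: str                          # "map" or "reduce"
    state: TaskState = TaskState.IDLE
    worker: str | None = None          # machine id, for non-idle tasks
    # For completed map tasks: location of each of the R intermediate
    # file regions on the worker's local disk (sizes omitted here).
    region_locations: dict[int, str] = field(default_factory=dict)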
Fault Tolerance
Worker Failure
The master pings every worker periodically; if no response arrives within a certain amount of time, the worker is marked as failed.
In-progress tasks on a failed worker are reset to their initial idle state, making them eligible for rescheduling. Completed map tasks are also reset, because their output sits on the failed machine's local disk and is now inaccessible; completed reduce tasks are not, because their output is already in the global file system.
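A sketch of the rescheduling rule, reusing the Task/TaskState sketch above (the function name is mine):

```python
def handle_worker_failure(tasks, failed_worker):
    """Reset tasks owned by a failed worker back to IDLE so the
    master will hand them out to other workers."""
    for task in tasks:
        if task.worker != failed_worker:
            continue
        if task.state is TaskState.IN_PROGRESS:
            task.state, task.worker = TaskState.IDLE, None
        elif task.state is TaskState.COMPLETED and task.kind == "map":
            # Completed map output lives on the failed machine's
            # local disk and is now unreachable, so redo the task.
            task.state, task.worker = TaskState.IDLE, None
            task.region_locations.clear()
        # Completed reduce tasks are left alone: their output is
        # already safe in the global file system.
```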
Master Failure
One option: write periodic checkpoints of the master data structures, so that if the master dies, a new copy can be started from the last checkpointed state.
Since there is only a single master, its failure is unlikely, so the implementation instead simply aborts the computation if the master fails; clients can check for this condition and retry the MapReduce operation.
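A sketch of the checkpoint option, assuming the master's task table is serializable; the path and use of pickle are illustrative, not the paper's mechanism:

```python
import pickle

CHECKPOINT_PATH = "master.ckpt"  # hypothetical stable-storage location

def checkpoint(tasks):
    """Periodically dump the master's task table to stable storage."""
    with open(CHECKPOINT_PATH, "wb") as f:
        pickle.dump(tasks, f)

def recover():
    """A new master copy restarts from the last checkpointed state."""
    with open(CHECKPOINT_PATH, "rb") as f:
        return pickle.load(f)
```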
Semantics in the Presence of Failures