MapReduce
Problems involving large datasets can often be divided into many small problems that are solved in parallel and whose results are then combined.
MapReduce is a programming framework that allows us to implement two methods:
- Map - The computation on the smaller chunks of data.
- Reduce - How the results are combined to the final result.
MapReduce is not a universal parallelism framework: not every problem that is parallelisable can be expressed and solved nicely with MapReduce.
Counting Words in a Document
Given a collection of text files, for each word $w$, count how many times $w$ occurs in the files.
To do this we input the data like so:
("file1.txt","<file contents>"),("file2.txt","<file contents>"),("file3.txt","<file contents>"),...
These are grouped into (key, value) pairs.
The MapReduce computation will then return:
(<word1>,<count>),(<word2>,<count>),(<word3>,<count>),...
The Map Function
The map function is applied to a single key/value pair and produces a list of zero or more key/value pairs:
Map(String filepath, String contents):
    for each word w in contents:
        output pair (w, "1")
The Reduce function will handle adding together the counts for each word, so we don’t have to do that at this stage.
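As a concrete illustration, here is a minimal Python sketch of this Map function; the name word_count_map and the returned list of pairs are illustrative, not part of any real MapReduce framework.

```python
# A minimal sketch (not a real framework API): Map for word counting.
def word_count_map(filepath, contents):
    """Emit the pair (w, "1") once for every occurrence of a word w."""
    pairs = []
    for word in contents.split():
        pairs.append((word, "1"))  # counts are emitted as strings, as in the pseudocode
    return pairs

# word_count_map("file1.txt", "hello world hello")
# -> [("hello", "1"), ("world", "1"), ("hello", "1")]
```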
Grouping
After the Map function has been applied, we group all values by key:
("hello",1),("world",1),("hello",1),("hello",1)
This would then be grouped into:
("hello", (1, 1, 1)), ("world", (1))
Copy & Merge
The grouped output of each Map task (one per text file) is copied to the reducers and merged, so that the values for each key are grouped across all of the files rather than per file.
The Reduce Function
The Reduce function takes a key and a list of values as input and outputs a list of key/value pairs:
Reduce(String word, Iterator<String> values):
    int count = 0
    for each v in values:
        count = count + parseInt(v)
    output pair (word, toString(count))
Reduce is applied to a single key and its list of values, not to the whole collection of keys. This allows different keys to be processed in parallel.
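A minimal Python sketch of this Reduce function, under an assumed name word_count_reduce; the commented lines show how it would combine with the earlier map and grouping sketches:

```python
# A minimal sketch of Reduce for word counting: sum the values for one key.
def word_count_reduce(word, values):
    count = 0
    for v in values:
        count = count + int(v)        # parseInt in the pseudocode
    return [(word, str(count))]       # toString in the pseudocode

# Combining the sketches for one file:
#   mapped  = word_count_map("file1.txt", "hello world hello")
#   grouped = group_by_key(mapped)
#   results = [pair for w, vs in grouped.items() for pair in word_count_reduce(w, vs)]
#   results -> [("hello", "2"), ("world", "1")]
```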
Additional Examples
There are additional examples of the following in the lecture video:
- Matrix Multiplication (PageRank)
- Relational Algebra:
  - Selection $\sigma_c(R)$ (see the sketch after this list)
  - Natural Join $R\bowtie S$
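As a hedged sketch of how one of these examples can be phrased in MapReduce: selection $\sigma_c(R)$ needs only the Map phase to filter tuples, with Reduce acting as the identity. The function names and the condition parameter below are illustrative assumptions, not from a real library.

```python
# A sketch of selection sigma_c(R) in MapReduce (illustrative names, not a real API):
# Map emits a tuple keyed by itself only if it satisfies the condition c,
# so the filtering happens entirely in the Map phase.
def select_map(tuple_, condition):
    return [(tuple_, tuple_)] if condition(tuple_) else []

# Reduce is the identity: each surviving tuple is passed straight through.
def select_reduce(key, values):
    return [(key, key)]

# select_map(("alice", 34), condition=lambda t: t[1] > 30)
# -> [(("alice", 34), ("alice", 34))]
```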
Distributed File System
MapReduce operates on its own distributed file system:
- All the data is distributed over the nodes.
- Replication is used for fault-tolerance.
It appears to the user as if it were a regular file system.
Input files can be very large. To deal with this, MapReduce splits them into chunks (16-64 MB) and provides an ID for each chunk in a key/value pair.
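A hedged Python sketch of what splitting an input file into fixed-size chunks keyed by a chunk ID might look like; the chunk size, "path:index" naming scheme, and function name are assumptions for illustration.

```python
# A hedged sketch of splitting a large input file into fixed-size chunks,
# each identified by an assumed "path:index" chunk ID in a key/value pair.
CHUNK_SIZE = 64 * 1024 * 1024  # 64 MB, the upper end of the 16-64 MB range

def split_into_chunks(filepath):
    chunks = []
    with open(filepath, "rb") as f:
        index = 0
        while True:
            data = f.read(CHUNK_SIZE)
            if not data:
                break
            chunks.append((f"{filepath}:{index}", data))  # (chunk ID, chunk contents)
            index += 1
    return chunks
```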
Implementation
Map returns a list of key/value pairs:
Map(String key, String value)
Reduce returns a list of key/value pairs:
Reduce(String key, Iterator<String> values)
In practice there is also some additional code to set:
- Locations of input and output files.
- Tuning Parameters:
  - Number of Machines
  - Memory per Map
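A hedged Python sketch of this extra driver code, reusing the word_count_map, group_by_key, and word_count_reduce sketches from earlier; the file locations, worker count, and use of multiprocessing.Pool as a stand-in for a cluster of machines are all illustrative assumptions.

```python
from multiprocessing import Pool

INPUT_FILES = ["file1.txt", "file2.txt"]  # locations of the input files (assumed)
OUTPUT_FILE = "counts.txt"                # location of the output file (assumed)
NUM_WORKERS = 4                           # tuning parameter: workers stand in for machines

def run_job():
    # Read each input file and apply Map to the (filepath, contents) pairs in parallel.
    inputs = [(path, open(path).read()) for path in INPUT_FILES]
    with Pool(NUM_WORKERS) as pool:
        mapped = pool.starmap(word_count_map, inputs)
    # Group all emitted pairs by key, then apply Reduce per key and write the output.
    grouped = group_by_key(pair for pairs in mapped for pair in pairs)
    with open(OUTPUT_FILE, "w") as out:
        for word, values in grouped.items():
            for w, c in word_count_reduce(word, values):
                out.write(f"{w}\t{c}\n")

if __name__ == "__main__":
    run_job()
```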