MapReduce
Problems involving large datasets can often be divided into many small problems that are solved in parallel and whose results are then combined.
MapReduce is a programming framework that allows us to implement two methods:
- Map - The computation on the smaller chunks of data.
- Reduce - How the results are combined to the final result.
MapReduce is not a universal parallelism framework: not every problem that is parallelisable can be expressed and solved nicely with MapReduce.
Counting Words in a Document
Given a collection of text files, for each word $w$, count how many times $w$ occurs in the files.
To do this we input the data like so:
("file1.txt","<file contents>"),("file2.txt","<file contents>"),("file3.txt","<file contents>"),...
These are grouped into (key, value) pairs.
The MapReduce computation will then return:
(<word1>,<count>),(<word2>,<count>),(<word3>,<count>),...
The Map Function
The map function is applied to a single key/value pair and produces a list of zero or more key/value pairs:
Map(String filepath, String contents):
    for each word w in contents:
        output pair (w, "1")
The Reduce function will handle adding together the counts for each word, so we don’t have to do that at this stage.
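As a concrete illustration, here is a minimal Python sketch of this Map function; the name word_count_map and the returned list of pairs are illustrative, not part of any real MapReduce framework.

```python
# A minimal sketch (not a real framework API): Map for word counting.
def word_count_map(filepath, contents):
    """Emit the pair (w, "1") once for every occurrence of a word w."""
    pairs = []
    for word in contents.split():
        pairs.append((word, "1"))  # counts are emitted as strings, as in the pseudocode
    return pairs

# word_count_map("file1.txt", "hello world hello")
# -> [("hello", "1"), ("world", "1"), ("hello", "1")]
```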
Grouping
After the Map function has been applied, we group all values by key:
("hello",1),("world",1),("hello",1),("hello",1)
This would then be grouped into:
("hello", (1, 1, 1)), ("world", (1))
Copy & Merge
The grouped output of each Map task (one per text file) is copied to the reducers and merged, so that the values for each key are grouped across all of the files rather than per file.
The Reduce Function
The Reduce function takes a key and a list of values as input and outputs a list of key/value pairs:
Reduce(String word, Iterator<String> values):
    int count = 0
    for each v in values:
        count = count + parseInt(v)
    output pair (word, toString(count))
Reduce is applied to a single key and its list of values, not to the whole collection of keys. This allows different keys to be processed in parallel.
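A minimal Python sketch of this Reduce function, under an assumed name word_count_reduce; the commented lines show how it would combine with the earlier map and grouping sketches:

```python
# A minimal sketch of Reduce for word counting: sum the values for one key.
def word_count_reduce(word, values):
    count = 0
    for v in values:
        count = count + int(v)        # parseInt in the pseudocode
    return [(word, str(count))]       # toString in the pseudocode

# Combining the sketches for one file:
#   mapped  = word_count_map("file1.txt", "hello world hello")
#   grouped = group_by_key(mapped)
#   results = [pair for w, vs in grouped.items() for pair in word_count_reduce(w, vs)]
#   results -> [("hello", "2"), ("world", "1")]
```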
Additional Examples
There are additional examples of the following in the lecture video:
- Matrix Multiplication (PageRank)
- Relational Algebra:
  - Selection $\sigma_c(R)$ (see the sketch after this list)
  - Natural Join $R\bowtie S$
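As a hedged sketch of how one of these examples can be phrased in MapReduce: selection $\sigma_c(R)$ needs only the Map phase to filter tuples, with Reduce acting as the identity. The function names and the condition parameter below are illustrative assumptions, not from a real library.

```python
# A sketch of selection sigma_c(R) in MapReduce (illustrative names, not a real API):
# Map emits a tuple keyed by itself only if it satisfies the condition c,
# so the filtering happens entirely in the Map phase.
def select_map(tuple_, condition):
    return [(tuple_, tuple_)] if condition(tuple_) else []

# Reduce is the identity: each surviving tuple is passed straight through.
def select_reduce(key, values):
    return [(key, key)]

# select_map(("alice", 34), condition=lambda t: t[1] > 30)
# -> [(("alice", 34), ("alice", 34))]
```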
Distributed File System
MapReduce operates on its own distributed file system:
- All the data is distributed over the nodes.
- Replication is used for fault-tolerance.
It appears to the user as if it were a regular file system.
Input files can be very large. To deal with this, MapReduce splits them into chunks (16-64 MB) and provides an ID for each chunk in a key/value pair.
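A hedged Python sketch of what splitting an input file into fixed-size chunks keyed by a chunk ID might look like; the chunk size, "path:index" naming scheme, and function name are assumptions for illustration.

```python
# A hedged sketch of splitting a large input file into fixed-size chunks,
# each identified by an assumed "path:index" chunk ID in a key/value pair.
CHUNK_SIZE = 64 * 1024 * 1024  # 64 MB, the upper end of the 16-64 MB range

def split_into_chunks(filepath):
    chunks = []
    with open(filepath, "rb") as f:
        index = 0
        while True:
            data = f.read(CHUNK_SIZE)
            if not data:
                break
            chunks.append((f"{filepath}:{index}", data))  # (chunk ID, chunk contents)
            index += 1
    return chunks
```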
Implementation
Map returns a list of key/value pairs:
Map(String key, String value)
Reduce returns a list of key/value pairs:
Reduce(String key, Iterator<String> values)
In practice there is also some additional code to set:
- Locations of input and output files.
- Tuning Parameters:
  - Number of Machines
  - Memory per Map
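A hedged Python sketch of this extra driver code, reusing the word_count_map, group_by_key, and word_count_reduce sketches from earlier; the file locations, worker count, and use of multiprocessing.Pool as a stand-in for a cluster of machines are all illustrative assumptions.

```python
from multiprocessing import Pool

INPUT_FILES = ["file1.txt", "file2.txt"]  # locations of the input files (assumed)
OUTPUT_FILE = "counts.txt"                # location of the output file (assumed)
NUM_WORKERS = 4                           # tuning parameter: workers stand in for machines

def run_job():
    # Read each input file and apply Map to the (filepath, contents) pairs in parallel.
    inputs = [(path, open(path).read()) for path in INPUT_FILES]
    with Pool(NUM_WORKERS) as pool:
        mapped = pool.starmap(word_count_map, inputs)
    # Group all emitted pairs by key, then apply Reduce per key and write the output.
    grouped = group_by_key(pair for pairs in mapped for pair in pairs)
    with open(OUTPUT_FILE, "w") as out:
        for word, values in grouped.items():
            for w, c in word_count_reduce(word, values):
                out.write(f"{w}\t{c}\n")

if __name__ == "__main__":
    run_job()
```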