Skip to content
UoL CS Notes

MapReduce

COMP207 Lectures

Often problems involving large datasets can be divided into many small problems that can be solved in parallel and then have their results combined.

MapReduce is a programming framework that allows us to implement two methods:

  • Map - The computation on the smaller chunks of data.
  • Reduce - How the results are combined to the final result.

MapReduce is not a universal parallelism framework, not every problem that is parallelisable can be expresses and solved nicely with MapReduce.

Counting Words in a Document

Given a collection of text files, for each word $w$, count how many times $w$ occurs in the files.

To do this we input the data like so:

("file1.txt","<file contents>"),("file2.txt","<file contents>"),("file3.txt","<file contents>"),...

These are grouped into (key,value) pairs.

The MapReduce computation will then return:

(<word1>,<count>),(<word2>,<count>),(<word3>,<count>),...

The Map Function

The map function is applied to a single key/value pair and produces a list of zero or more key/value pairs:

Map(String filepath, String contents):
	for each word w in contents:
		output pair (w, "1")

The Reduce function will handle adding together all of the key/value pairs so we don’t have to do that at this stage.

Grouping

After the Map function have been applied, we group all values by key:

("hello",1),("world",1)("hello",1),("hello",1)

this would then go to:

("hello", (1, 1, 1),("world",1)

Copy & Merge

This completes the grouping again for each text file.

The Reduce Function

The Reduce function takes a key and a list of values as input and outputs a list of key/value pairs:

Reduce(String word, Iterator<String>values):
	int count = 0
	for each v in values:
		count = count + parseInt(v)
	output pair (word,toString(count))

This only applies to a single key/value pair, not the whole list. This enables more parallel computing.

Additional Examples

There are addition examples on the following in this lecture video:

  • Matrix Multiplication (page rank)
  • Relational Algebra:
    • Selection $\sigma_c(R)$
    • Natural Join $R\bowtie S$

Distributed File System

MapReduce operates on it’s own distributed file system:

  • All the data is distributed over the nodes.
  • Replication is used for fault-tolerance.

It appears to the user as if it was a regular file system.

Input files can be very large. To deal with this, MapReduce splits these into chunks (16-64 MB) and provides an ID for each chunk in a key/value pair.

Implementation

Map returns a list of key/value pairs:

Map(String key, String value)

Reduce returns a list of key/value pairs:

Reduce(String key, Iterator<String> values)

In practice that is also some additional code to set:

  • Locations of input and output files.
  • Tuning Parameters:
    • Number of Machines
    • Memory per Map