Compiler Optimisations

COMP328 Lectures 2023-02-14

Optimisation Types

There are several operations that can increase the execution speed of the same code:

Instruction ordering
Memory access order
Inlining
- Functions are included to where they are called from.
Dead code/stores
- Code and variables that are never reached, or never have their values read, are removed.
Code Hoisting
- Code not required in a loop is moved out of the loop.
Common sub-expression elimination
Loop unrolling
Vectorisation

Optimisation Reports

When using the Intel compiler icc we can monitor what -On has done by using:

-qopt-reportN

where $N=1,2,3,\ldots$.

Manual Optimisations

Don’t:

Optimise what the compiler has already done.
Focus on small improvements.

Profiling

Use profiling (gprof) to find where the majority of time is spent in your code. You can then take the following steps:

Identify memory bottlenecks.
Register bottlenecks.
Cache Utilisation.

Then stop when:

Maximum performance is reached.
You run out of time or give up.

You can look for the following activities:

Remove I/O, including prints.
Remove debug code.
Remove dead code.
Check order of memory access (row major).

You can also give the compiler hits (*restrict).

Loop Fusion

If there are multiple loops iterating over the same arrays where work can be combined into a single loop:

If overdone this can lead to many live variables and register spillage.
- Therefore in some cases we need to complete loop fission, where we separate a loop with many variables into two loops to make better use of registers.

Spatial Tiling

If we have data that we access in a particular order (such as in convolution). We may want to reorder our data into the order we will access them:

For a convolution, order the data in the order read by the sliding window as opposed to logically.

This makes better use of the cache lines.

Imagine for a convolution with $N4$ neighbours. We can change the order in which we read data such that we can re-use data from the cache instead of reading it from RAM:

Process the convolution on the diagonal instead of linearly. Therefore we are able to make use of temporal locality by reusing more cells in a shorter period of time.
Another solution would be to break the image into tiles that make best use of the size of the L1 cache.