Compiler Optimisations
Optimisation Types
There are several operations that can increase the execution speed of the same code:
- Instruction ordering
- Memory access order
- Inlining
- Functions are included to where they are called from.
- Dead code/stores
- Code and variables that are never reached, or never have their values read, are removed.
- Code Hoisting
- Code not required in a loop is moved out of the loop.
- Common sub-expression elimination
- Loop unrolling
- Vectorisation
Optimisation Reports
When using the Intel compiler icc
we can monitor what -On
has done by using:
-qopt-reportN
where $N=1,2,3,\ldots$.
Manual Optimisations
Don’t:
- Optimise what the compiler has already done.
- Focus on small improvements.
Profiling
Use profiling (gprof
) to find where the majority of time is spent in your code. You can then take the following steps:
- Identify memory bottlenecks.
- Register bottlenecks.
- Cache Utilisation.
Then stop when:
- Maximum performance is reached.
- You run out of time or give up.
You can look for the following activities:
- Remove I/O, including prints.
- Remove debug code.
- Remove dead code.
- Check order of memory access (row major).
You can also give the compiler hits (*restrict
).
Loop Fusion
If there are multiple loops iterating over the same arrays where work can be combined into a single loop:
- If overdone this can lead to many live variables and register spillage.
- Therefore in some cases we need to complete loop fission, where we separate a loop with many variables into two loops to make better use of registers.
Spatial Tiling
If we have data that we access in a particular order (such as in convolution). We may want to reorder our data into the order we will access them:
- For a convolution, order the data in the order read by the sliding window as opposed to logically.
This makes better use of the cache lines.
Imagine for a convolution with $N4$ neighbours. We can change the order in which we read data such that we can re-use data from the cache instead of reading it from RAM:
- Process the convolution on the diagonal instead of linearly. Therefore we are able to make use of temporal locality by reusing more cells in a shorter period of time.
- Another solution would be to break the image into tiles that make best use of the size of the L1 cache.