
Compiler Optimisations

COMP328 Lectures

Optimisation Types

There are several optimisations that can increase the execution speed of the same code (a sketch of a few of these follows the list):

  • Instruction ordering
  • Memory access order
  • Inlining
    • The body of a function is copied into the place where it is called, removing the call overhead.
  • Dead code/stores
    • Code that can never be reached, and stores whose values are never read, are removed.
  • Code Hoisting
    • Code inside a loop whose result does not change between iterations (loop-invariant code) is moved outside the loop so it is only executed once.
  • Common sub-expression elimination
  • Loop unrolling
  • Vectorisation
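
As a rough sketch of a few of these (the function and variable names are hypothetical): the first loop below contains a loop-invariant expression, a dead store, and a repeated sub-expression; the second shows the effect of hoisting, dead-store elimination, and common sub-expression elimination, which an optimising compiler will typically perform itself at -O2 and above.

    /* Before: naive version with optimisation opportunities. */
    void scale(double *a, int n, double x) {
        for (int i = 0; i < n; i++) {
            double factor = x * x + 1.0;          /* loop-invariant               */
            double unused = a[i] * 2.0;           /* dead store: never read       */
            a[i] = a[i] * factor + a[i] * factor; /* sub-expression computed twice */
        }
    }

    /* After hoisting, dead-store elimination, and CSE. */
    void scale_opt(double *a, int n, double x) {
        double factor = x * x + 1.0;              /* hoisted out of the loop      */
        for (int i = 0; i < n; i++) {
            double t = a[i] * factor;             /* computed once and reused     */
            a[i] = t + t;
        }
    }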

Optimisation Reports

When using the Intel compiler icc, we can monitor what -On has done by using:

    -qopt-report=N

where $N = 1, 2, 3, \ldots$ controls the level of detail of the report (larger $N$ gives more detail).
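
For example, an invocation such as icc -O3 -qopt-report=2 code.c should produce a report (typically written to a .optrpt file next to the object file) describing which loops were vectorised, unrolled, or inlined; the exact flag syntax and output location can vary between compiler versions.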

Manual Optimisations

Don’t:

  • Optimise what the compiler has already done.
  • Spend time chasing small improvements that have little overall impact.

Profiling

Use a profiler such as gprof (a brief example follows this list) to find where the majority of time is spent in your code. In those hotspots you can then:

  • Identify memory bottlenecks.
  • Identify register bottlenecks.
  • Check cache utilisation.
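
For example, with GCC you can compile and link with -pg, run the program once to produce a gmon.out file, and then run gprof ./prog gmon.out to get a flat profile and call graph showing where the time goes.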

Then stop when:

  • Maximum performance is reached.
  • You run out of time or give up.

You can then carry out the following activities:

  • Remove I/O, including prints.
  • Remove debug code.
  • Remove dead code.
  • Check the order of memory accesses (row-major in C; see the sketch after this list).
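
As a minimal sketch of the memory-access point (the array name and size are hypothetical): C stores two-dimensional arrays in row-major order, so the innermost loop should run over the last index to walk through memory contiguously.

    #define N 1024
    static double a[N][N];

    /* Poor: jumps N doubles between consecutive accesses (column-wise). */
    double sum_by_columns(void) {
        double s = 0.0;
        for (int j = 0; j < N; j++)
            for (int i = 0; i < N; i++)
                s += a[i][j];
        return s;
    }

    /* Better: consecutive accesses are adjacent in memory (row-wise),
       so each cache line fetched from RAM is fully used.               */
    double sum_by_rows(void) {
        double s = 0.0;
        for (int i = 0; i < N; i++)
            for (int j = 0; j < N; j++)
                s += a[i][j];
        return s;
    }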

You can also give the compiler hints, for example the restrict qualifier on pointer arguments (sketched below).
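
A minimal sketch of such a hint, assuming C99 or later (function and parameter names are hypothetical): restrict promises the compiler that the two arrays do not overlap, so it can keep values in registers, reorder loads and stores, and vectorise without worrying about aliasing.

    /* Without restrict, the compiler must assume in and out may alias. */
    void scale_add(int n, double alpha,
                   const double *restrict in, double *restrict out) {
        for (int i = 0; i < n; i++)
            out[i] += alpha * in[i];   /* free to vectorise: no aliasing */
    }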

Loop Fusion

If there are multiple loops iterating over the same arrays, their work can often be combined (fused) into a single loop so that data is reused while it is still in cache (see the sketch after this list). However:

  • If overdone, fusion can lead to too many live variables and register spilling.
    • Therefore, in some cases we apply the opposite transformation, loop fission, splitting a loop with many live variables into two loops to make better use of the registers.
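
A minimal sketch of fusion (the array and function names are hypothetical); after fusing, each a[i] is loaded once per iteration and b[i] is still in a register or cache when it is reused.

    /* Before: two separate passes over the same arrays. */
    void before(int n, const double *a, double *b, double *c) {
        for (int i = 0; i < n; i++)
            b[i] = 2.0 * a[i];
        for (int i = 0; i < n; i++)
            c[i] = a[i] + b[i];
    }

    /* After fusion: one pass, so a[i] and b[i] are reused while still
       close to the processor (registers/cache).                        */
    void after(int n, const double *a, double *b, double *c) {
        for (int i = 0; i < n; i++) {
            b[i] = 2.0 * a[i];
            c[i] = a[i] + b[i];
        }
    }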

Spatial Tiling

If we access data in a particular order (such as in a convolution), we may want to reorder the data in memory to match the order in which we will access it:

  • For a convolution, store the data in the order it is read by the sliding window, rather than in its logical (row-by-row) order.

This makes better use of the cache lines.

Imagine a convolution over an $N_4$ neighbourhood (each cell's four nearest neighbours). We can change the order in which we read data so that we re-use data already in the cache instead of reading it again from RAM:

  • Process the convolution along diagonals instead of linearly, so that we exploit temporal locality by reusing more cells within a shorter period of time.
  • Another solution is to break the image into tiles sized to make the best use of the L1 cache (a sketch of this tiling follows).
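
A minimal sketch of such tiling for an $N_4$ stencil (the sizes and names are hypothetical); TILE would be tuned so that a block of the input, plus its halo, fits in the L1 cache.

    #define N    4096
    #define TILE 64   /* tuned so a TILE x TILE block (plus halo) fits in L1 */

    /* 4-neighbour (N_4) stencil computed tile by tile for cache reuse. */
    void stencil_tiled(double in[N][N], double out[N][N]) {
        for (int ii = 1; ii < N - 1; ii += TILE)
            for (int jj = 1; jj < N - 1; jj += TILE)
                for (int i = ii; i < ii + TILE && i < N - 1; i++)
                    for (int j = jj; j < jj + TILE && j < N - 1; j++)
                        out[i][j] = 0.25 * (in[i-1][j] + in[i+1][j]
                                          + in[i][j-1] + in[i][j+1]);
    }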