Accelerators (GPUs)

COMP328 Lectures 2023-04-18

Programming Model

Directives Based	Languages/Extensions
OpenMP	OpenCL
OpenACC	CUDA

OpenMP:

Prescriptive - User explicitly specifies actions to be undertaken to the compiler.

OpenACC:

Descriptive - User guides the compiler on how to compile. The compiler decides how to implement the parallelism.

OpenMP for Accelerators

These examples can’t be run using OpenMP on Barkla. They are to show the specification only.

#pragma omp target

Defines a region of code to be off-loaded to the target (GPU)

#pragma omp parallel for

If used within a target region, will run using GPU threads.

There are also the following target clauses available:

device(N)

Run on device N.

map(A, B)

Ensure A and B vars are available on the target.

map(toFrom: C)

Copy C to target device, run and copy back.

We can put this together to form the following example:

#include <omp.h>
#include <stdio.h>
int main() {
  int runningOnGPU = 0;
/* Test if GPU is available using OpenMP4.5 */
#pragma omp target map(fromto : runningOnGPU)
  {
    if (omp_is_initial_device() == 0)
      runningOnGPU = 1;
  }
  /* If still running on CPU, GPU must not be available */
  if (runningOnGPU)
    printf("### Able to use the GPU! ### \n");
  else
    printf("### Unable to use the GPU, using CPU! ###\n");
  return 0;
}

OpenACC

We can use the following directives:

#pragma acc parallel

This loop should be parallelised

#pragma acc kernels

Use compiler to find the parallelism.

Data Transfer in OpenACC

Each time a kernel is called, the data needs to be transferred over the bus. We can direct the compiler to keep data on the GPU in order to reduce data transfers. We can use the following directive:

#pragma data copy(variable to copy) create(temp to keep on GPU)

This can be useful to wrap while loops that can’t be parallelised due to their unknown length.

Caching in OpenACC

We can instruct OpenACC in which order to complete a for loop. This allows us to better use the cache as make use of spatio-temporal locality instead of completing the instructions linearly:

#pragma acc kernels
{
	#pragma acc loop tile(32, 8)
	for(int j=1; j<n-1; j++){
		for(int i=1; i<n-1; i++){
			A[j][i] = Anew[j][i]
		}
	}
}

It is presumed tile completes the iterations based on the nesting level of the for loops. This then completes the iterations in 32x8 tiles.

OpenACC Data Clauses

We can use the following additonal clauses on #pragma acc data to reduce data movement:

copy(x) - Copy of host x to device at start and exit of region.
copyin(x) - Only copy to device at start.
copyout(x) - Only copy off device at the end.
create(y) - Create a variable y only on the accelerator.
present(z) - z already exists on the device from a previous kernel.