Accelerators (GPUs)
Programming Model
Directives Based | Languages/Extensions |
---|---|
OpenMP | OpenCL |
OpenACC | CUDA |
OpenMP:
- Prescriptive - The user explicitly tells the compiler which actions to take.
OpenACC:
- Descriptive - User guides the compiler on how to compile. The compiler decides how to implement the parallelism.
OpenMP for Accelerators
These examples can’t be run using OpenMP on Barkla. They are to show the specification only.
#pragma omp target
- Defines a region of code to be off-loaded to the target (GPU)
#pragma omp parallel for
- If used within a target region, the loop will run using GPU threads.
There are also the following target clauses available:
device(N)
- Run on device N.
map(A, B)
- Ensure variables A and B are available on the target.
map(tofrom: C)
- Copy C to the target device, run the region, then copy C back to the host.
We can put this together to form the following example:
#include <omp.h>
#include <stdio.h>

int main() {
    int runningOnGPU = 0;

    /* Test if GPU is available using OpenMP 4.5 */
    #pragma omp target map(tofrom : runningOnGPU)
    {
        /* omp_is_initial_device() returns 0 when executing on the target device */
        if (omp_is_initial_device() == 0)
            runningOnGPU = 1;
    }

    /* If still running on CPU, GPU must not be available */
    if (runningOnGPU)
        printf("### Able to use the GPU! ### \n");
    else
        printf("### Unable to use the GPU, using CPU! ###\n");

    return 0;
}
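To show how target, parallel for and map combine on an actual loop, here is a minimal sketch of a vector addition. The function name vector_add and its parameters are hypothetical and not part of the original notes. It assumes a compiler with OpenMP offload support; without one, the pragmas are ignored or the region falls back to the host, and the result is the same.

```c
#include <stddef.h>

/* Hypothetical helper (not from the notes): c[i] = a[i] + b[i].
   The loop is offloaded if a target device is available. */
void vector_add(const float *a, const float *b, float *c, int n)
{
    /* map(to: ...) copies the inputs to the device only;
       map(from: ...) copies the result back to the host only.
       The array sections a[0:n] give the extent, because a bare
       pointer carries no size information. */
    #pragma omp target map(to: a[0:n], b[0:n]) map(from: c[0:n])
    #pragma omp parallel for
    for (int i = 0; i < n; i++) {
        c[i] = a[i] + b[i];
    }
}
```

Using map(to:) and map(from:) rather than map(tofrom:) for every array avoids copying data in a direction where it is not needed.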
OpenACC
We can use the following directives:
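#pragma acc parallel
- The programmer asserts that this region should be parallelised. It is usually combined with loop (#pragma acc parallel loop) to parallelise the loop that follows.
#pragma acc kernels
- Let the compiler analyse the region and find the parallelism itself.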
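The difference between the two directives can be sketched on two simple loops. The function names scale and add are hypothetical, and the data clauses on each directive are an assumption added so the compiler knows the array extents. A compiler without OpenACC support ignores the pragmas and runs the loops serially.

```c
/* Hypothetical example (not from the notes): multiply every element by s.
   "parallel loop": the programmer asserts the iterations are independent. */
void scale(float *x, float s, int n)
{
    #pragma acc parallel loop copy(x[0:n])
    for (int i = 0; i < n; i++) {
        x[i] *= s;
    }
}

/* Hypothetical example: c = a + b.
   "kernels": the compiler analyses the region and decides what is safe
   to parallelise; restrict tells it the arrays do not overlap. */
void add(const float *restrict a, const float *restrict b,
         float *restrict c, int n)
{
    #pragma acc kernels copyin(a[0:n], b[0:n]) copyout(c[0:n])
    {
        for (int i = 0; i < n; i++) {
            c[i] = a[i] + b[i];
        }
    }
}
```

With parallel loop the responsibility for correctness lies with the programmer; with kernels the compiler may decline to parallelise a loop it cannot prove safe.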
Data Transfer in OpenACC
Each time a kernel is called, the data needs to be transferred over the bus. We can direct the compiler to keep data on the GPU in order to reduce data transfers. We can use the following directive:
#pragma acc data copy(variable to copy) create(temp to keep on GPU)
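This is useful for wrapping while loops that can’t themselves be parallelised because their iteration count is not known in advance: the kernels launched inside the loop then reuse the data already on the GPU.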
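As an illustration of a data region wrapped around a while loop, the following is a minimal sketch of a 1D Jacobi-style relaxation. The function jacobi_1d, its parameters and the stopping rule are hypothetical choices for this example, not taken from the notes. A is copied to the GPU once and back once, while Anew exists only on the device.

```c
/* Hypothetical example: relax the interior of A until the largest change
   in one sweep is at most tol, or max_iter sweeps have run.
   Returns the number of sweeps performed. */
int jacobi_1d(float *A, float *Anew, int n, float tol, int max_iter)
{
    int iter = 0;
    float error = tol + 1.0f; /* force at least one sweep */

    /* A: copied in at the start and out at the end of the region.
       Anew: temporary that is only created on the device. */
    #pragma acc data copy(A[0:n]) create(Anew[0:n])
    {
        /* The while loop itself is sequential: its length is unknown. */
        while (error > tol && iter < max_iter) {
            error = 0.0f;

            #pragma acc parallel loop reduction(max:error)
            for (int i = 1; i < n - 1; i++) {
                Anew[i] = 0.5f * (A[i - 1] + A[i + 1]);
                float diff = Anew[i] - A[i];
                if (diff < 0.0f) diff = -diff;
                error = (diff > error) ? diff : error;
            }

            #pragma acc parallel loop
            for (int i = 1; i < n - 1; i++) {
                A[i] = Anew[i];
            }

            iter++;
        }
    }
    return iter;
}
```

Without the data region, each of the two loops would transfer its arrays over the bus on every sweep.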
Caching in OpenACC
We can instruct OpenACC in which order to complete the iterations of a for loop nest. This allows us to make better use of the cache by exploiting spatial and temporal locality, instead of completing the iterations linearly:
#pragma acc kernels
{
    #pragma acc loop tile(32, 8)
    for (int j = 1; j < n - 1; j++) {
        for (int i = 1; i < n - 1; i++) {
            A[j][i] = Anew[j][i];
        }
    }
}
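The tile sizes are matched to the nesting level of the for loops: in the OpenACC specification the first size applies to the innermost loop and the last size to the outermost. The iterations here are therefore completed in tiles of 32 i iterations by 8 j iterations.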
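To make the tiled iteration order concrete, the following plain C sketch writes out by hand roughly what a 32x8 tiling of the loop nest corresponds to. It assumes the first tile size applies to the innermost (i) loop, as the OpenACC specification describes. The function copy_tiled, the macros and the flat row-major array layout are hypothetical choices for this illustration; the code a compiler actually generates will differ.

```c
#define TILE_I 32 /* tile width along the inner (i) loop */
#define TILE_J 8  /* tile height along the outer (j) loop */

/* Hypothetical illustration: copy the interior of Anew into A,
   visiting the n x n grid one 32x8 tile at a time.
   Arrays are stored flat in row-major order: element (j, i) is [j*n + i]. */
void copy_tiled(int n, float *A, const float *Anew)
{
    /* Outer two loops step from tile to tile. */
    for (int jj = 1; jj < n - 1; jj += TILE_J) {
        for (int ii = 1; ii < n - 1; ii += TILE_I) {
            /* Clamp the tile so it does not run past the interior. */
            int j_end = (jj + TILE_J < n - 1) ? jj + TILE_J : n - 1;
            int i_end = (ii + TILE_I < n - 1) ? ii + TILE_I : n - 1;

            /* Inner two loops visit every element inside one tile. */
            for (int j = jj; j < j_end; j++) {
                for (int i = ii; i < i_end; i++) {
                    A[j * n + i] = Anew[j * n + i];
                }
            }
        }
    }
}
```

Every interior element is still visited exactly once; only the order changes, so that elements close together in memory are processed close together in time.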
OpenACC Data Clauses
We can use the following additional clauses on #pragma acc data to reduce data movement:
copy(x)
- Copy host x to the device at the start of the region and back to the host at exit.
copyin(x)
- Only copy to the device at the start.
copyout(x)
- Only copy off the device at the end.
create(y)
- Create a variable y only on the accelerator.
present(z)
- z already exists on the device from a previous kernel.
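The clauses can be combined on one data region, as in the following minimal sketch. The function square_plus_one and its arrays are hypothetical; the point is only to show which clause suits which kind of array. The input x is needed only on the way in, the result y only on the way out, and tmp never leaves the device.

```c
/* Hypothetical example: y[i] = x[i] * x[i] + 1, computed in two steps
   through a temporary array tmp. */
void square_plus_one(const float *x, float *y, float *tmp, int n)
{
    /* copyin:  x is read on the device but never copied back.
       copyout: y is written on the device and copied back at the end.
       create:  tmp is allocated on the device with no transfers. */
    #pragma acc data copyin(x[0:n]) copyout(y[0:n]) create(tmp[0:n])
    {
        /* present: the arrays are already on the device, so no
           transfer happens for these two loops. */
        #pragma acc parallel loop present(x[0:n], tmp[0:n])
        for (int i = 0; i < n; i++) {
            tmp[i] = x[i] * x[i];
        }

        #pragma acc parallel loop present(tmp[0:n], y[0:n])
        for (int i = 0; i < n; i++) {
            y[i] = tmp[i] + 1.0f;
        }
    }
}
```

Using copy for all three arrays would give the same result but would move x, y and tmp across the bus in both directions.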