UoL CS Notes

COMP328 (High Performance Computing)

OpenMP Barriers


There are implicit barriers at the end of work-sharing constructs and at the end of parallel regions.

nowait

We can use nowait to remove the implicit barrier on a directive:

#pragma omp for nowait

Additionally, schedule(static) can guarantee the ordering of operations across consecutive loops, provided that the total number of iterations in each for loop is the same.

This is because each thread is tied to a particular set of iterations, so the operations on those iterations stay with (and remain sequential within) the same thread.
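A minimal sketch of this (the arrays and their length n are hypothetical arguments):

void scale_sum(int n, double *a, const double *b, const double *c, double *d) {
	#pragma omp parallel
	{
		/* Same iteration count and schedule(static): each thread is given the
		   same iterations of i in both loops, so the barrier can be dropped. */
		#pragma omp for schedule(static) nowait
		for (int i = 0; i < n; i++) {
			a[i] = b[i] + c[i];
		}

		#pragma omp for schedule(static)
		for (int i = 0; i < n; i++) {
			d[i] = 2.0 * a[i];	/* only reads the a[i] written by the same thread */
		}
	}
}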

barrier

Manually places a barrier at some point in a parallel region:

#pragma omp barrier

master

The enclosed code only executes on the master thread. There is no implicit barrier on entry or exit:

#pragma omp master

NUMA Consideration

The first thread to touch a variable determines in whose memory the variable resides.

flowchart LR
MEM0 <--> CPU0
CPU0 <--> CPU1
CPU1 <--> MEM1

This code would be very slow, as the array was initialised entirely on the main thread (so it all resides in that thread’s local memory):

// init on thread 0
for (int i=m; i<n; i++) {
	z[i] = init(i);
}

// work on all threads
#pragma omp parallel for
for (int i=m; i<n; i++) {
	z[i] *= work(i);
}

It would be better to initialise the data from the same threads (and therefore the same socket) that will use it later:

// init on all threads
#pragma omp parallel for schedule(static)
for (int i=m; i<n; i++) {
	z[i] = init(i);
}

// work using the threads that initialised each element
#pragma omp parallel for schedule(static)
for (int i=m; i<n; i++) {
	z[i] *= work(i);
}

Thread Affinity

We can ensure our threads don’t change cores over the course of the program’s runtime by using the following environment variables:

  • OMP_DYNAMIC=false:
    • This guarantees that the number of threads you asked for is used (the runtime may not adjust it).
  • OMP_PROC_BIND=true:
    • Binds threads to a core.
  • OMP_PLACES:
    • Allows you to hand-pick which core (or other place) each thread lives on.
    • This significantly reduces portability.
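For example (assuming a bash-like shell, and using the standard abstract place name cores), a single run with threads pinned to cores might be launched as:

OMP_DYNAMIC=false OMP_PROC_BIND=true OMP_PLACES=cores ./a.out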

OpenMP `parallel` `for`


#pragma omp parallel for

This is a shortcut to create a parallel region and work-share a for loop in one go.

Note that #pragma omp parallel on its own does not divide a loop’s work between threads (each thread would execute every iteration); the for work-sharing construct is what splits the iterations, and it must appear inside a parallel region.

Data Clauses

Variable Clauses

Variables that are only declared and used inside a parallel region don’t need to exist outside (due to C scoping rules).

Variables declared outside the scope of the parallel region, accessed from inside, can have different sharing settings:

shared
Leave the variable as it was originally defined. All threads will refer to the original memory location when trying to access it (useful for read-only).
private
Allocate a new memory location for each thread corresponding to this variable. Any accesses to this variable inside the parallel region will now refer to this memory location instead of the original variable. The final value of this variable is not of interest.
firstprivate
Same as private, with the addition that the private copies of the variable will be initialised with the value that it was at the start of the parallel region.
lastprivate
Same as private, with the addition that the value of the variable will persist after the parallel region is over. Works with omp for, but not with general omp parallel sections.

We define data clauses for variables using the following syntax:

#pragma omp parallel for shared(x)
for (int i = 0; i <= 10; i++) {
	...
}

or

#pragma omp parallel for firstprivate(x)
for (int i = 0; i <= 10; i++) {
	...
}

We can also set a default using:

#pragma omp parallel for default(...)

In the brackets we can set a default sharing attribute, but the recommended value is default(none), which forces every variable used in the region to be listed explicitly.
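A minimal sketch combining these clauses (the variable names are illustrative):

#include <stdio.h>

int main(void) {
	int offset = 10;	// firstprivate: each thread starts with a copy holding 10
	int last = 0;   	// lastprivate: keeps the value from the final iteration

	#pragma omp parallel for default(none) firstprivate(offset) lastprivate(last)
	for (int i = 0; i < 100; i++) {
		last = i + offset;
	}

	printf("last = %d\n", last);	// prints 109 (99 + 10)
	return 0;
}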

Reduction

We can define a reduction to be stored in a global variable with:

reduction(op:var)

To add values into a variable sum:

reduction(+:sum)

Each thread will have its own private copy of sum, and these copies will be combined into the original sum at the end of the loop.
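A minimal sketch summing an array with a reduction (the array contents are illustrative):

#include <stdio.h>

#define N 1000

int main(void) {
	double x[N];
	for (int i = 0; i < N; i++) x[i] = 1.0;

	double sum = 0.0;

	// Each thread accumulates into its own private copy of sum; the
	// copies are combined with + into the original sum after the loop.
	#pragma omp parallel for reduction(+:sum)
	for (int i = 0; i < N; i++) {
		sum += x[i];
	}

	printf("sum = %f\n", sum);	// 1000.000000
	return 0;
}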

Work-Sharing

We can define the work sharing schedule by using:

schedule(static/dynamic/guided/runtime, <chunk size>)
  • static - Schedule is statically determined. This has less overhead if we can allocate optimally.

  • dynamic - Chunks of iterations are handed to threads as they become idle, so work is effectively redistributed at runtime. This is good for loops with unpredictable per-iteration cost (and for while-style loops).

We can use this, for example, for a lower-triangular matrix-vector product, where each row contains a different amount of work:

#pragma omp parallel for schedule(static, 1)
for (row = 0; row < 6; row++) {
	for (col = 0; col <= row; col++) {
		res[row] += l[row][col] * y[col];
	}
}

We also have available:

  • guided - Chunks of decreasing size handed out as threads get free.
  • runtime - The schedule is not fixed at compile time; the schedule options are provided at runtime via an environment variable:

      OMP_SCHEDULE="guided,10"
    
  • auto - Leaves the choice of schedule to the compiler and/or runtime, which may tune it over the course of several runs.

    This is not preferred in HPC as it is non-deterministic.

Serial & Parallel Code

We can use the following ifdef to compile conditionally:

#ifdef _OPENMP
// parallel code
#endif
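As a sketch (assuming we only need the thread count when OpenMP is enabled), the same source can be built serially or with -fopenmp:

#include <stdio.h>
#ifdef _OPENMP
#include <omp.h>
#endif

int main(void) {
	int nthreads = 1;	// serial fallback
#ifdef _OPENMP
	nthreads = omp_get_max_threads();
#endif
	printf("Running with %d thread(s)\n", nthreads);
	return 0;
}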

OpenMP (Processes vs. Threads)


Process
The basic unit of work for the operating system.

Processes have a lot of overhead due to their isolation.

Thread
A portion of a process that can be run concurrently with another.

Threads have low overhead as there is implicit trust. They share a heap but have their own stack.

OpenMP

OpenMP, available at https://www.openmp.org, is a parallel programming API (compiler directives plus a runtime library) that operates on a thread level.

This is unlike MPI (e.g. Open MPI), which operates on a process level.

Consider the following example program:

#include <omp.h>
#include <stdio.h>

int main(void) {
    #pragma omp parallel
    {
        printf("Thread id is %d\n", omp_get_thread_num());
    }
    return 0;
} 

We can compile this with the following flags:

$ gcc -fopenmp thread.c
$ ./a.out

We can define the number of threads to spawn with the following variable:

OMP_NUM_THREADS=200 ./a.out

Otherwise, the number of available hardware threads is used by default.

MPI Barrier


Barrier (Synchronisation)

We may use synchronisation for the following reasons:

  • Debugging:
    • We can test communication speed (not load imbalance) by using a barrier before a gather.
  • Ensuring various tasks have been completed before continuing:
    • We could also test for completion in a loop while running other non-dependent tasks.
  • Reducing side-effects for time-sensitive code.
  • Enforcing ordering.

Barrier Syntax

C Syntax
       #include <mpi.h>
       int MPI_Barrier(MPI_Comm comm)

We should be careful when using barrier as it can significantly slow down code.
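A sketch of the debugging use above: synchronising all ranks before a gather so that the timed region measures communication rather than load imbalance (the buffer sizes are illustrative):

#include <stdio.h>
#include <stdlib.h>
#include <mpi.h>

int main(void) {
	MPI_Init(NULL, NULL);

	int rank, size;
	MPI_Comm_rank(MPI_COMM_WORLD, &rank);
	MPI_Comm_size(MPI_COMM_WORLD, &size);

	double local = (double)rank;
	double *all = NULL;
	if (rank == 0) all = malloc(size * sizeof(double));

	// Every rank waits here, so the timer below measures only the gather.
	MPI_Barrier(MPI_COMM_WORLD);
	double t0 = MPI_Wtime();
	MPI_Gather(&local, 1, MPI_DOUBLE, all, 1, MPI_DOUBLE, 0, MPI_COMM_WORLD);
	double t1 = MPI_Wtime();

	if (rank == 0) {
		printf("Gather took %f seconds\n", t1 - t0);
		free(all);
	}

	MPI_Finalize();
	return 0;
}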

MPI Messages


Cost of MPI Messages

Over a given network we can calculate the time to send a message:

\[\text{time}\propto{k}\left(\text{latency}+\frac{\text{message size}}{\text{bandwidth}}\right)\]

where $k$ is the number of messages sent.

With smaller messages, the latency term makes a much larger relative impact on the total time.

MPI_Reduce

SYNTAX
C Syntax
       #include <mpi.h>
       int MPI_Reduce(const void *sendbuf, void *recvbuf, int count,
                      MPI_Datatype datatype, MPI_Op op, int root,
                      MPI_Comm comm)

Every rank calls this function, resulting in collective communication.
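A minimal sketch: every rank contributes its rank number and rank 0 receives the sum:

#include <stdio.h>
#include <mpi.h>

int main(void) {
	MPI_Init(NULL, NULL);

	int rank, size;
	MPI_Comm_rank(MPI_COMM_WORLD, &rank);
	MPI_Comm_size(MPI_COMM_WORLD, &size);

	int local = rank;
	int total = 0;

	// Every rank calls MPI_Reduce; only the root (rank 0) receives the result.
	MPI_Reduce(&local, &total, 1, MPI_INT, MPI_SUM, 0, MPI_COMM_WORLD);

	if (rank == 0) {
		printf("Sum of ranks 0..%d is %d\n", size - 1, total);
	}

	MPI_Finalize();
	return 0;
}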

Collective Functions

| Name | Function | Type |
| --- | --- | --- |
| MPI_Bcast | Send the same data to all ranks. | One to many |
| MPI_Scatter | Distribute data between ranks. | One to many |
| MPI_Gather | Collect data together from ranks. | Many to one |
| MPI_Reduce | Pull data from all ranks and perform a reduction operation (summation, product, etc.). | Many to one |
  • There are also MPI_Allgather and MPI_Allreduce that act as usual, but broadcast the result to all ranks.
  • MPI_Scatter, MPI_Gather and MPI_Allgather are also available in 'v' variants (e.g. MPI_Scatterv, MPI_Gatherv). These allow variable-sized data chunks to be sent to or from each rank.

Message Passing Interface (MPI) Introduction


Types of Locks

Deadlock
Occurs when two or more tasks wait for each other and neither will resume until some action is taken.
Livelock
Occurs when the tasks involved in a deadlock take action to resolve the original deadlock but in such a way that there is no progress.

Parallel Programming Models

Shared Memory:

  • All memory is accessible
  • Programmed using OpenMP.

Distributed Memory:

  • Sharing of memory must be explicit.
  • Network latency and bandwidth become important.
  • Programmed using MPI:
    • Message passing interface.

Flynn’s Taxonomy

MPMD
Multiple programs multiple data.
SPMD
Single program multiple data. Can also be multiple copies of the same program.

Data must be passed explicitly between the processes.

MPI Vocabulary

Process
Each instance of the code runs as an MPI process. Each process has a numbered rank.
Communicator
A collection of processes that can send messages to each other. The default communicator, containing all processes, is MPI_COMM_WORLD.
Rank
A numerical ID of a process within a communicator (indexed from 0).

Example MPI Programmes

#include <stdio.h>
#include <mpi.h>
int main(void){
	MPI_Init(NULL, NULL);
	mpiRankWorkToDo();
	MPI_Finalize();
}

#include <stdio.h>
#include <mpi.h>
int main(void) {
	int myID;
	MPI_Init(NULL, NULL);
	MPI_Comm_rank(MPI_COMM_WORLD, &myID);
	printf("Hi from process ranked %d\n", myID);
	MPI_Finalize();
}

We can compile this with mpicc and run it with:

$ mpirun -np <num_process> ./a.out

MPI Point-to-Point Communications

  • One sender
  • One receiver

There are two important functions for MPI (taken from man):

SYNTAX
C Syntax
       #include <mpi.h>
       int MPI_Send(const void *buf, int count, MPI_Datatype datatype, int dest,
            int tag, MPI_Comm comm)
SYNTAX
C Syntax
       #include <mpi.h>
       int MPI_Recv(void *buf, int count, MPI_Datatype datatype,
            int source, int tag, MPI_Comm comm, MPI_Status *status)

These functions also take the datatype, the source or destination rank, and a tag, which are used to match sends with the corresponding receives.

We need to build in logic so that our code can detect its own rank:

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <mpi.h>
const int MAX_STRING = 100;
int main(void) {
	int commSize;
	int myRank;
	
	char greeting[MAX_STRING];
	
	MPI_Init(NULL, NULL);
	MPI_Comm_size(MPI_COMM_WORLD, &commSize);
	MPI_Comm_rank(MPI_COMM_WORLD, &myRank);
	
	char processor_name[MPI_MAX_PROCESSOR_NAME];
	int name_len;
	MPI_Get_processor_name(processor_name, &name_len);
	
	if(myRank != 0){
		sprintf(greeting, "Greetings from processor %s, process %d of %d!", processor_name, myRank, commSize);
		MPI_Send(greeting, strlen(greeting)+1, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
	} else {
		printf("Greetings from processor %s, process %d of %d!", processor_name, myRank, commSize);
		int i;
		for(i = 1; i < commSize; i++){
			MPI_Recv(greeting, MAX_STRING, MPI_CHAR, i, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
			printf("%s\n", greeting);
		}
	}

	MPI_Finalize();
	return 0;
}

Point-to-Point Problems

Both MPI_Recv and MPI_Send are blocking functions:

  • Send will not return until the send buffer can safely be reused (which may mean the message has been received).
  • Receive will not return until a matching message has arrived.

This can lead to deadlocks if all ranks are trying to send and receive at the same time.

One solution is to have odd ranks send and even ranks receive for one cycle, and then swap if required. This way data flows properly and there are no deadlocks.
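A sketch of that pattern for pairs of neighbouring ranks (the ordering here is illustrative: even ranks send first, odd ranks receive first):

#include <stdio.h>
#include <mpi.h>

int main(void) {
	MPI_Init(NULL, NULL);

	int rank, size;
	MPI_Comm_rank(MPI_COMM_WORLD, &rank);
	MPI_Comm_size(MPI_COMM_WORLD, &size);

	int mine = rank, theirs = -1;
	int partner = (rank % 2 == 0) ? rank + 1 : rank - 1;

	if (partner >= 0 && partner < size) {
		if (rank % 2 == 0) {
			// Even ranks send first, then receive.
			MPI_Send(&mine, 1, MPI_INT, partner, 0, MPI_COMM_WORLD);
			MPI_Recv(&theirs, 1, MPI_INT, partner, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
		} else {
			// Odd ranks receive first, then send, so no pair blocks forever.
			MPI_Recv(&theirs, 1, MPI_INT, partner, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
			MPI_Send(&mine, 1, MPI_INT, partner, 0, MPI_COMM_WORLD);
		}
		printf("Rank %d exchanged with rank %d (received %d)\n", rank, partner, theirs);
	}

	MPI_Finalize();
	return 0;
}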

Scaling & Efficiency


Speedup & Efficiency

Speedup
The ratio of the time it takes to run a program without parallelism to the time it takes to run in parallel.
Scalability (Parallel Efficiency)
Is the ratio of speedup to the ideal speedup.

Given that $t_1$ is the time on one core and $t_p$ is the time on $p$ cores:

\[\text{speedup}(s_p) = \frac{t_1}{t_p}\]

Ideal speedup is when $s_p=p$.

Speedup alone is not a complete measure, as the serial and parallel programs could be wildly different and may not be optimal in either case.

\[\text{efficiency} = \frac{s_p}{p}\]

Runtime information should also be provided as this only compares a single implementation.
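For example, with illustrative numbers: if a code takes $t_1 = 100\,\text{s}$ on one core and $t_8 = 20\,\text{s}$ on $p = 8$ cores, then

\[s_8=\frac{t_1}{t_8}=\frac{100}{20}=5,\qquad \text{efficiency}=\frac{s_8}{p}=\frac{5}{8}=0.625\]

i.e. the cores are being used at 62.5% efficiency.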

Scaling

Strong Scaling

For strong scaling we:

  • Keep the input fixed.
  • Change the number of cores.
  • Report the runtime.

Ideally, the runtime decreases in proportion to the number of cores (doubling the cores should halve the runtime).

Weak Scaling

Weak scaling also changes the problem size:

  • Increase the problem size proportionally to the number of cores (resolution of an image).
  • Report time for solution.

We want the time taken to stay constant between runs, since the input size and the resources grow in proportion.

Parallel Laws

Amdahl’s Law

A computer program can never run faster than the parts that do not run in parallel (the serial parts of the program), no matter how many processing elements we have.

\[\lim_{p\rightarrow\infty}s_p=\frac1\alpha\]

where:

  • $\alpha$ - the fraction of the original program’s runtime that is inherently serial.
  • $s_p$ - speedup.
  • $p$ - number of cores.

Therefore, given infinite resources, the maximum speedup will be $\frac1\alpha$.

Gustafson’s Law

This law allows the problem size to increase as we add more cores:

\[s_p=\alpha+p\times(1-\alpha)\]

This offers a more optimistic speedup.
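For example, with an illustrative serial fraction of $\alpha=0.1$:

\[\lim_{p\rightarrow\infty}s_p=\frac{1}{0.1}=10\quad\text{(Amdahl)},\qquad s_{100}=0.1+100\times(1-0.1)=90.1\quad\text{(Gustafson)}\]

so Amdahl caps the speedup at 10 regardless of core count, while Gustafson, by growing the problem with the machine, predicts a speedup of about 90 on 100 cores.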

Parallelism


It is usually better to optimise single thread performance before going to parallelism:

  • You often have to completely re-implement your algorithm when moving to multi-threaded operations.

You should consider your options in the following order:

  1. Vectorisation
  2. OpenMP
  3. MPI
  4. GPU
  5. ASIC/FPGA

Considerations

Generally there is load imbalance in parallel operations. The expected time is:

\[\mathbb{E}[t]=\frac{\text{work}}{\#\text{people}}\]

This is a lower bound, which is generally not met due to imbalances and overheads.

Performance By Design

Our implementation will often become tailored to the system it will run on. For any given system we will either:

  • meet the maximum achievable performance of the system (the roof-line model), or
  • fall below the expected performance.

We can observe:

  • There is no best solution for any system
  • There is no universal solution for all systems.

This is a problem when trying to make portable code that will make best use of the hardware of all systems.

To optimise for this problem we can use DSLs like:

  • pytorch
  • tensorflow

to make our programs performance portable. This means that we may get 80% of roof-line on all systems instead of variable performance based on the system.

Data Parallelism

Given a dataset and an operation that can be applied to each element:

  • Apply the operation concurrently to each element.

If you have more work than cores then equally divide between the cores.

Task Parallelism

Several (different) independent tasks are to be applied to the same data:

  • Ask each core to work on a different task on the same data.

This often results in load imbalance as the different operations often take different amounts of time.

Pipelines

  • Many (potentially dependent) tasks are to be applied to a stream of data.
  • Each data item is processed by the stages that it passes through.
  • Different items can be processed by different stages of the pipeline at the same time.

graph LR
1[task 1] --> 2[task 2]
2 --> 3[task 3]
3 --> 4[task 4]

In general, a job that takes

\[n_j\times n_d\ \text{timesteps}\]

on one core will take

\[n_d + (n_j - 1)\ \text{timesteps}\]

in a pipeline, where:

  • $n_d$ is the number of data items
  • $n_j$ is the number of jobs (pipeline stages)

For this to hold there should be as many cores as jobs, and the pipeline needs enough items to fill up.
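For example, with illustrative numbers $n_j=4$ stages and $n_d=10$ items:

\[n_j\times n_d=4\times10=40\ \text{timesteps serially},\qquad n_d+(n_j-1)=10+3=13\ \text{timesteps pipelined}\]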

Compiler Optimisations


Optimisation Types

There are several optimisations that can increase the execution speed of the same code:

  • Instruction ordering
  • Memory access order
  • Inlining
    • Function bodies are inserted at the point where they are called.
  • Dead code/stores
    • Code and variables that are never reached, or never have their values read, are removed.
  • Code Hoisting
    • Code not required in a loop is moved out of the loop.
  • Common sub-expression elimination
  • Loop unrolling
  • Vectorisation

Optimisation Reports

When using the Intel compiler icc we can monitor what -On has done by using:

-qopt-report=N

where $N=1,2,3,\ldots$.

Manual Optimisations

Don’t:

  • Optimise what the compiler has already done.
  • Focus on small improvements.

Profiling

Use profiling (gprof) to find where the majority of time is spent in your code. You can then take the following steps:

  • Identify memory bottlenecks.
  • Identify register bottlenecks.
  • Check cache utilisation.

Then stop when:

  • Maximum performance is reached.
  • You run out of time or give up.

You can look for the following activities:

  • Remove I/O, including prints.
  • Remove debug code.
  • Remove dead code.
  • Check order of memory access (row major).

You can also give the compiler hints (e.g. the restrict qualifier on pointers).

Loop Fusion

If there are multiple loops iterating over the same arrays, their work can sometimes be combined into a single loop (see the sketch after this list):

  • If overdone this can lead to many live variables and register spilling.
    • Therefore in some cases we need to perform loop fission, where we separate a loop with many variables into two loops to make better use of the registers.
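A minimal sketch of fusion (the arrays and loop bodies are illustrative):

#define N 1000
double a[N], b[N], c[N];

// Before fusion: a is streamed through the cache twice.
void unfused(void) {
	for (int i = 0; i < N; i++) b[i] = a[i] * 2.0;
	for (int i = 0; i < N; i++) c[i] = a[i] + b[i];
}

// After fusion: a[i] and b[i] are reused while still in registers/cache.
void fused(void) {
	for (int i = 0; i < N; i++) {
		b[i] = a[i] * 2.0;
		c[i] = a[i] + b[i];
	}
}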

Spatial Tiling

If we have data that we access in a particular order (such as in a convolution), we may want to reorder the data into the order in which we will access it:

  • For a convolution, order the data in the order read by the sliding window as opposed to logically.

This makes better use of the cache lines.

Imagine a convolution that uses each cell’s four nearest ($N_4$) neighbours. We can change the order in which we read data so that we re-use data from the cache instead of reading it from RAM:

  • Process the convolution on the diagonal instead of linearly. Therefore we are able to make use of temporal locality by reusing more cells in a shorter period of time.
  • Another solution would be to break the image into tiles that make best use of the size of the L1 cache.
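A hedged sketch of the tiled traversal (TILE is an illustrative size that would be tuned to the L1 cache, and the loop body stands in for the real stencil work):

#define N    1024
#define TILE 64

double in[N][N], out[N][N];

// Visit the grid tile by tile so that a tile's cache lines are reused
// before they are evicted, instead of sweeping whole rows at a time.
void process_tiled(void) {
	for (int ii = 0; ii < N; ii += TILE) {
		for (int jj = 0; jj < N; jj += TILE) {
			for (int i = ii; i < ii + TILE; i++) {
				for (int j = jj; j < jj + TILE; j++) {
					out[i][j] = in[i][j];
				}
			}
		}
	}
}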

Computer Architectures


Von Neumann Architecture

The Von Neumann architecture has the following block diagram:

flowchart LR
subgraph cpu
cu[control unit]
alu[arithmetic logic unit]
end
id[input device] --> cpu
cpu --> od[output device]
cpu <--> mu[memory unit]

Building on Alan Turing’s work, Von Neumann invented the concept of the stored program where the programming is separate from the hardware of a general purpose computer.

Data and instructions share a common bus, hence the Von Neumann bottleneck.

Von Neumann Cycle

  1. IF - Fetch the instruction corresponding to the program counter from memory.
  2. ID - Decode the instruction.
  3. MEM - Fetch data from memory.
  4. EX - Execute the instruction.
  5. WB - Write back the results.

As each step is discrete, we can pipeline this process to complete 5 instructions in 9 cycles instead of 25 (1 instruction every 5 cycles).

This assumes each step takes one cycle to complete.

Once the pipeline is full then it effectively takes 1 cycle for each instruction.

We should be aware that running instructions that take many clock cycles create pipeline bubbles. This will block further instructions and empty the pipeline.

Advanced Pipe-lining

  • Out of order execution completes execution when data is ready and presents the results in order when required.
  • Speculative execution occurs on a branch. The CPU predicts which side is executed before the evaluation is complete. If it is incorrect then it will have to backtrack.
  • Fused multiply-add implements

    \[a=a+(b\times c)\]

    as a single operation to speed up convolutions and matrix operations.

Von Neumann Bottleneck Solutions

To move less data on the bus we can use:

  • Cache on die:
    • The bigger the cache the slower it is due to searching for the data location.

      This can be solved by having a cache hierarchy: L1, L2, L3…

  • Temporal Locality:
    • If a memory location is accessed then it is very likely to be needed again soon.
  • Spatial Locality:
    • Often with arrays, if a memory location is accessed, it is likely that its neighbours will be accessed.

      When data is accessed, the whole cache line (64 bytes) is pulled.

Instruction Level Parallelism (ILP)

This is additional parallelism within a core.

Vector Instructions (SIMD)

Consider we have the following code:

for(int i = 0; i < 1000; i++) {
	c[i] = a[i] + b[i];
}

The compiler may optimise this to use vector instructions so that we can complete multiple additions in one clock cycle.

This requires that the loop iterations are independent of one another.
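A sketch of helping the compiler prove independence (the restrict qualifiers promise that the arrays do not overlap; this is the same kind of hint mentioned under compiler optimisations):

// restrict tells the compiler that a, b and c do not alias, so the
// iterations are independent and the loop can be safely vectorised.
void vec_add(int n, float *restrict c, const float *restrict a, const float *restrict b) {
	for (int i = 0; i < n; i++) {
		c[i] = a[i] + b[i];
	}
}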

Simultaneous Multi-threading (SMT)

If a processor has the hardware to support two threads natively:

  • Two Program Counters
  • Two Sets of Registers

Provided that we are using different pipelines within the processor (different instructions) we can get performance gains by making better use of the blocks in the processor.

OS Thread Scheduling

We can use thread pinning to ensure that there are no context-switches while a program is running.

This removes the overhead involved in changing processor:

  • Saving the program counter and registers, etc.

Multi-Core Systems

With multi-core systems we run into the following issues:

  • The memory bus is stretched further.
  • Cache contention:
    • What happens if one core changes some data that another core needs soon after?
  • Memory Contention:
    • Lock/semaphores.
  • Interconnects need to be faster to support more cores.
  • Programs need to be written differently to make best use of the cores.
  • Load balancing.

We will use OpenMP as a solution for this.

Multi-Socket Systems

So far we have seen systems like this:

graph TD
cpu0 --> memory
cpu1 --> memory

This stretches the memory bandwidth even thinner between the two.

NUMA

  • Each CPU gets a dedicated bus to a section of local memory.
  • Each CPU sees all the address space.
  • If a processor needs data from non-local memory it must travel over the (slow) interconnect.

Measuring Performance of Code


Performance Lists

You can measure super computers by different metrics:

  • Top500 (By linpack, $R_\max$)
  • Green500 (Flops/Watt)
  • Graph500 (Graph computing)
  • HPL-AI (AI computing)

Using a HPC Cluster

  1. Log in to the login node (don’t do any hard work here).
  2. Submit work to the compute nodes.

Measuring Runtime

  1. Use one of the functions to get the current time in millis.
  2. Check the time again at the end of execution.

    Ensure you have exclusive access over the compute node you are running on.

  3. Run Multiple times (3-5) to ramp up clocks and average out interference from other programs.
  4. Report the minimum time of all the runs.

As the error $\epsilon$ will always be positive, reporting the minimum time will give the closest value to the true runtime:

\[r_0=r_t+\epsilon\]

where:

  • $r_0$ - Observed runtime
  • $r_t$ - True runtime
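As a sketch (assuming OpenMP is available for its timer), a region can be timed with omp_get_wtime() and the minimum of several runs reported:

#include <stdio.h>
#include <omp.h>

// Stand-in for the code being measured.
static double work(void) {
	double s = 0.0;
	for (int i = 0; i < 10000000; i++) s += (double)i;
	return s;
}

int main(void) {
	double best = 1.0e30;
	double r = 0.0;

	for (int run = 0; run < 5; run++) {	// repeat 3-5 times
		double t0 = omp_get_wtime();
		r = work();
		double t1 = omp_get_wtime();
		if (t1 - t0 < best) best = t1 - t0;	// report the minimum
	}

	printf("Best runtime: %f s (checksum %f)\n", best, r);
	return 0;
}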

Arithmetic Intensity

Arithmetic intensity relates the amount of computation an algorithm performs to the amount of data it moves. It can be characterised by the following equation:

\[I(n)=\frac{W(n)}{Q(n)}\]

where $W$ is the number of flops carried out by the program and $Q$ is the number of bytes transferred from memory to cache.

For example, if our CPU has 416 Gflop/s of compute and 68 GB/s of memory bandwidth, then roughly 6 flops per byte ($416/68\approx6.1$) would be ideal. At this intensity we would be making best use of both the compute units and the memory system.

  • Programs with high AI are called compute-bound programs.
  • Programs with low AI are called memory-bound programs.

Generally, modern programs are memory bound.
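For example (an illustrative kernel): the vector add c[i] = a[i] + b[i] on FP32 data performs 1 flop per element while moving 12 bytes (two 4-byte loads and one 4-byte store), so

\[I=\frac{1}{12}\approx0.08\ \text{flop/byte}\]

which is far below the roughly 6 flop/byte balance point above, making the kernel memory-bound.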

Roofline Model

[Graph of the roofline model]

We can measure algorithm performance by comparing to the maximum performance of the system.

  • The peak performance bound is the flop/s of a system.

Terminology and Performance Optimisation


Terminology

Parallelism
Many queues using many resources.
Concurrency
Many queues sharing one resource.
Processor
One of: CPU, GPU, ASIC etc.
Core
A single processor on a CPU
CPU
One or more cores on a package.
Thread
In software, a single flow of control in a program.
Die
A piece of silicon (cut from a wafer) containing the CPU and supporting components.
Socket
The whole CPU package and pins that is inserted into a motherboard.
Node
A single motherboard.
Cluster/Supercomputer
100+ nodes connected through a high-speed interconnect.

Reasons for Sub-optimal Performance

  • CPU
    • Specialised CPU instructions are not utilised.
  • Memory Hierarchy
    • Code not exploiting spatio-temporal locality.
    • NUMA effects.
  • Work
    • Load imbalance between threads/processes.
    • Too much communication creating a network bottleneck.
    • Resource contention.
  • Parallel
    • Overheads of parallel implementation
    • Synchronisation/communication between threads.
    • Threads being scheduled on and off too quickly.

Characterising Computer Performance

We can approximate FP32 flops using the following calculation:

\[R_\text{peak}=2\times w_\text{vex}\times r_\text{clock}\times n_\text{core} \times n_\text{sock}\times n_\text{node}\]

The leading factor of 2 is due to the fused multiply-add (an add and a multiply counted per cycle).

Where:

  • $w_\text{vex}$ - Vector width (an AVX-512 vector holds 16 FP32 values)
  • $r_\text{clock}$ - Clock speed
  • $n_\text{core}$ - Cores per socket
  • $n_\text{sock}$ - Sockets per node
  • $n_\text{node}$ - Nodes
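For example, with illustrative numbers (16-wide FP32 vectors, a 2.0 GHz clock, 32 cores per socket, 2 sockets, 1 node):

\[R_\text{peak}=2\times16\times(2.0\times10^9)\times32\times2\times1\approx4.1\times10^{12}\ \text{flop/s}\approx4.1\ \text{Tflop/s}\]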

High-Performance Computing Introduction


What is a Supercomputer

Supercomputers use parallelism, through the use of multiple powerful computers, to compute very fast. Multiple compute nodes communicate through fast networking in order to consolidate their power.

Applications

  • Weather Modelling
  • Disease Modelling
  • Drug Design and Testing
  • Large Machine Learning Models
  • Scientific Research

Reasoning

  • Faster computers allow more accurate answers. This allows us to:
    • Produce longer forecasts.
    • Capture more phenomena together.
  • Multiprocessor machines allow us to:
    • Address more memory.
    • Not be limited by the speed of a single processor.

Parallel Computing

Parallel computing makes use of pipe-lining serial code, or solving the same problem using different data, in order to increase the speed of computation by distributing its load onto several processors. Parallelism is hard to achieve as:

  • Humans find it easier to think, and program, sequentially.
  • The von Neumann architecture is inherently sequential.