Skip to content
UoL CS Notes

Terminology and Performance Optimisation

COMP328 Lectures

Terminology

Parallelism
Many queues using many resources.
Concurrency
Many queues sharing one resource.
Processor
One of: CPU, GPU, ASIC etc.
Core
A single processor on a CPU
CPU
One or more cores on a package.
Thread
In software, a single flow of control in a program.
Die
A silicon wafer including the CPU and supporting components.
Socket
The whole CPU package and pins that is inserted into a motherboard.
Node
A single motherboard.
Cluster/Supercomputer
100+ nodes connected through a high-speed interconnect.

Reasons for Sub-optimal Performance

  • CPU
    • Specialised CPU instructions are not utilised.
  • Memory Hierarchy
    • Code not exploiting spatio-temporal locality.
    • NUMA effects.
  • Work
    • Load imbalance between threads/processes.
    • Too much communication creating a network bottleneck.
    • Resource contention.
  • Parallel
    • Overheads of parallel implementation
    • Synchronisation/communication between threads.
    • Threads being scheduled on and off too quickly.

Characterising Computer Performance

We can approximate FP32 flops using the following calculation:

\[R_\text{peak}=2\times w_\text{vex}\times r_\text{clock}\times n_\text{core} \times n_\text{sock}\times n_\text{node}\]

The initial times 2 is due to combined add and multiply.

Where:

  • $w_\text{vex}$ - Vector width (AVX512 has 16 FP32 flops/s)
  • $r_\text{clock}$ - Clock speed
  • $n_\text{core}$ - Cores per socket
  • $n_\text{sock}$ - Sockets per node
  • $n_\text{node}$ - Nodes