Terminology and Performance Optimisation
Terminology
- Parallelism
- Many queues using many resources.
- Concurrency
- Many queues sharing one resource.
- Processor
- One of: CPU, GPU, ASIC etc.
- Core
- A single processor on a CPU
- CPU
- One or more cores on a package.
- Thread
- In software, a single flow of control in a program.
- Die
- A silicon wafer including the CPU and supporting components.
- Socket
- The whole CPU package and pins that is inserted into a motherboard.
- Node
- A single motherboard.
- Cluster/Supercomputer
- 100+ nodes connected through a high-speed interconnect.
Reasons for Sub-optimal Performance
- CPU
- Specialised CPU instructions are not utilised.
- Memory Hierarchy
- Code not exploiting spatio-temporal locality.
- NUMA effects.
- Work
- Load imbalance between threads/processes.
- Too much communication creating a network bottleneck.
- Resource contention.
- Parallel
- Overheads of parallel implementation
- Synchronisation/communication between threads.
- Threads being scheduled on and off too quickly.
Characterising Computer Performance
We can approximate FP32 flops using the following calculation:
\[R_\text{peak}=2\times w_\text{vex}\times r_\text{clock}\times n_\text{core} \times n_\text{sock}\times n_\text{node}\]The initial times 2 is due to combined add and multiply.
Where:
- $w_\text{vex}$ - Vector width (AVX512 has 16 FP32 flops/s)
- $r_\text{clock}$ - Clock speed
- $n_\text{core}$ - Cores per socket
- $n_\text{sock}$ - Sockets per node
- $n_\text{node}$ - Nodes