ELEC320 (Neural Networks)

Neural Networks Introduction

ELEC320 Lectures 2023-01-30

Brains and (A)NNs

The brain is:

Highly Complex
Non-linear
Massively parallel

ANNs (artificial neural networks) are biologically inspired analogues of the brain and are based on primitive neuron components.

Neuron Structure

Synapses: Facilitate learning via creation of new, and modification of existing ones.
Axons: Signal transmission structures
Dendrites: Receptive structures

We can simulate a neuron with the following strucutre:

graph LR
x1 -->|w1| n[Neuron]
x2 -->|w2| n
xn -->|wn| n
n --> output

Each input has a weight and is summed as it enters the neuron to form the output:

\[\sum^N_{i=1} x_i w_i\]

We can then apply a threshold to the output if we want a binary output from the neuron.

Advantages of ANNs

Non-linearity
Input-to-output Mapping
Adaptivity
- NN can be retrained to adapt to solve a changing task or be adapted to a similar one.
Evidential Response
- A confidence can be provided as to whether the output is correct.
Contextual Information
- Knowledge is integrated into the network. Each neuron is affected by all others; local processing and global awareness.
Fault Tolerance
- When small damage occurs, the network usually degrades gracefully.
Uniformity of Analysis and Design
- NNs are composed of similar components. This allows for modularity and use of mathematics for learning and network analysis.
Neuro-biological Analogy
- As the system is already implemented in nature. The system is proven.

Using Machine Learning

Consider we want to use a NN to classify fish based on their lightness and width. We can use the following feature vector to represent each fish:

\[x = [x_1,x_2]^\top\]

Where:

$x_1$ is lightness
$x_2$ is width

Based on a set of training samples, the neural network could produce the following decision boundary by use of it’s weights:

decision boundary

From this training we may observe the following phenomenon:

Under-fitting
- A machine learning model is not complex enough to accurately capture relationships between a dataset’s features and the target variable.
Over-fitting
- The model is too closely fit to a limited set of data points and does not generalise well to unseen data points.

Neural Network Structure

ELEC320 Lectures 2023-02-03

Neuron Scheme

Each neuron is an information processing unit. Basic elements are:

Synapses/Connections - Characterised by a weight $w_i$ and a signal $x_i$.
Adder - Sums the input signals from the weighted synapses.
Bias - $b_k$ is added to the sum.
Activation function $\phi$ - Usually a non-linear function that produces the final output $y_k$.

graph LR
x1 -->|w1| n[Neuron]
x2 -->|w2| n
xn -->|wn| n
b["bias (bk)"] --> n
n -->|vk| af["Activation Function (phi(v))"]
af --> output["output (yk)"]

\[\begin{aligned} v_k&=\sum^m_{j=0}w_{kj}x_j\\ &=\mathbf w^T_k\mathbf X\\ y_k&=\phi(v_k) \end{aligned}\]

Where:

$\mathbf w^T_k=[x_{k0},\ldots,w_{km}]$
$\mathbf X=[x_0, \ldots,x_m]$

Types of Activation Functions

Threshold/Heavyside Function

Threshold function graph ¹

This has a binary output:

\[\phi(v)\begin{cases}1&\text{if}&v\geq0\\0&\text{if}&v<0\end{cases}\]

Limited to binary classification only.
No gradient information.

Piece-wise Linear Function

Piecewise linear function graph ¹

\[\phi(v)=\begin{cases}1 & \text{if} &v\geq\frac12\\v+\frac12&\text{if}&-\frac12<v<\frac12\\0&\text{if}&v\leq-\frac12\end{cases}\]

Narrow window of non-saturation.

Sigmoid & Hyperbolic Tangent Functions

The following is a sigmoid function:

Sigmoid function graph ¹

there is also another s-shaped continuous function using $\tanh$.

Their ranges are:

Sigmoid: 0 to 1
$\tanh$: -1 to 1

\[\phi(v)=\frac1{1+\exp(-\alpha\cdot v)}\]

\[\begin{aligned} \phi(v)&=\tanh(v)\\ &=\frac{\exp(2v)-1}{\exp(2v)+1}\in(-1,1) \end{aligned}\]

Rectified Linear Unit (ReLU)

ReLU function graph ²

Most common choice in state-of-the-art neural networks (deep learning):

\[\phi(v)=\max(v,0)=\begin{cases}v&v\geq0\\0&v<0\end{cases}\]

Non-linear
Variants include leaky ReLU:
- Has a shallow negative gradient when $v<0$.

Network Architectures

Single-layer Feed-forward Network

One input layer of source nodes that project directly onto the output computation.

There are no cycles allowed.

The output can be calculated using the following function:

\[y_k(\mathbf X)=\phi\left(\sum^p_{j=0}W_{kj}X_j\right)\]

Multi-layer Feed-forward Network

⁴

Has additional hidden layers in-between the input and output. These provide a more capable network:

Typically the neurons of each layer take the previous layer as their input.
Still no cycles.

The output can be calculated using the following function:

\[y_k(\mathbf X)=\phi\left(\sum^{p_\text{hidden}}_{j=0}W_{kj}\underbrace{\phi\left(\sum^{p_\text{input}}_{i=0}W_{ji}\overbrace{X_i}^\text{input layer}\right)}_\text{hidden layer output}\right)\]

Convolutional Neural Networks (CNNs)

convoultional neural network processing an image ⁵

A class of multi-layer feed-forward network with a special type of local connections:

Widely used for image classification.
A convolution is used to gain deeper understanding of the image.

Recurrent Networks

These have feedback loops which form cycles in the graph:

Feedback loops involve connections with unit-delay elements $z^{-1}$, which together with nonlinear activations enable the nonlinear dynamic behaviour of the network.

We can calculate the output of a recurrent, no hidden, delay one, self feedback network with the following:

\[y_k(n)\equiv y_k(\mathbf X(n),n)=\phi\left(\underbrace{\sum^{p_\text{input}}_{i=0}W_{ki}X_i(n)}_\text{current input}+\underbrace{\sum^{p_\text{output}}_{j=0}W_{kj}y_j(n-1)}_\text{past output}\right)\]

Knowledge Representation

The primary task of a NN is to learn it’s environment. To accomplish this, we need to feed it with environment knowledge, which is based on:

Observations - These are often noisy, contain errors and have redundant/missing information.
- These can be labelled or unlabelled with the ground truth for that sample.

Training Dataset Example

A set ${{x_i, y_i}}, i=1,2,\ldots$ of experimentally acquired data is referred as a training dataset:

OCR:
- $x$ = Pixel values or curves.
- $y$ = A, B, C, …

Training Process

Choose a subset of the available observations as a training dataset to teach the NN.

This set should be diverse and representative of data used in production.
Choose an appropriate architecture, with as many input nodes as the length of the input feature vector $x$ and output nodes needed for the output $y$.
Train the network on the training dataset.
Test the network with the remaining data.

Learning Processes

ELEC320 Lectures 2023-02-06

Learning

Learning is a process by which the free parameters of a NN are adapted through a process of stimulation by the environment in which the network is embedded.

Learning is defined in the following actions:

The network is stimulated by the environment.
It’s free parameters undergo changes to reflect this stimulation. (training mode)
The network responds in a new way to the environment, as a result of the changes in its internal structure. (online mode)

Learning Rules

Error Correction Learning

Assume that the $k^\text{th}$ neuron receives a signal vector $\mathbf x(n)$, where $n$ is the tiem variable of a discrete synaptic adaptation procedure. This signal can be the result of other neurons in the previous layers.

We can define the error in the signal by:

\[e_k(n)=d_k(n)-y_k(n)\]

where:

$y_k(n)$ - the neuron output signal.
$d_k(n)$ - the desired output signal.

Learning is achieved by minimising a cost function $E(n)$ until the system reaches a steady-state.

An example of a cost function could be squared error:

\[E(n)=\frac12e^2_k(n)\]

We can iterate on our error to converge on the desired $d_k$ value.

The amount we change the weight is given by:

\[\begin{aligned} \Delta W_{kj}(n)&=\eta e_k(n)x_j(n)\\ \underbrace{\Delta W_{kj}(n+1)}_\text{new weight}&=\underbrace{W_{kj}(n)}_\text{old weight}+\Delta W_{kj}(n)\\ \end{aligned}\]

We can change $\eta$ to change the speed of convergence and overshoot.

Memory Based Learning

Under this model we gather a large amount of data for past experiences which are stored in memory as a set $D_\text{train}$:

\[D_\text{train}=\{(\mathbf x_i, d_i)\}^N_{i=1}\]

where:

$\mathbf x_i$ - the sample
$d_i$ - the desired response

In a graph of this data, we classify new data by choosing the nearest neighbour to the new data.

To get a better response we may use $n$-nearest-neighbour where we average the response of the $n$ neighbours.

Hebbian Learning

A Hebbian synapse is defines as a time-dependent, highly local and strongly interactive mechanism to increase synaptic efficiency:

If neurons on either side of the synapse are activated simultaneously, then the strength of that synapse should be selectively increased.
If the activation is asynchronous, then the synapse should be weakened.

This section is unfinished.

Competitive Learning

The output neurons compete amongst themselves to become active, as opposed to other learning rules where more than one neuron can be active.

This is better suited to classification problems as only one neuron is allowed to fire.

This section is unfinished.

Boltzmann Learning

Learning Paradigms

Credit/Blame (Credit Assignment Problem)

Supervised

Unsupervised Learning

Reinforcement Learning

Single Layer Perceptrons

ELEC320 Lectures 2023-02-17

Least Mean Square (LMS)

This is a sample by sample approach. The linear least square method would be a batch approach.

\[\mathbf w(n+1)=\mathbf w(n)+\eta e(n)\mathbf x(n)\]

where:

$\mathbf w$ - weight vector
$\eta$ - learning rate
$e$ - the error $(d-y)$
$d$ - expected result
$y$ - neron output
$\mathbf x$ - input vector

This uses value from a simple single layer network:

graph LR
x1 --> w1
x2 --> w2
xp --> wp
w1 --> adder
w2 --> adder
wp --> adder
adder --> y

Perceptron

Rosenblatt’s Perceptrons are based on the McCulloch-Pitts neuronal modal. They are similar to linear neurons, but they have non-linear activations based on thresholding activations:

\[v_j=\sum^p_{i=1}w_ix_i+\theta=\mathbf w^T\mathbf x\]

with:

\[y_j=\phi(v_j)\]

This uses the following network:

graph LR
x1 --> w1
x2 --> w2
xp --> wp
w1 --> t[threshold θ]
w2 --> t
wp --> t
t -->|1| vj
vj -->|"ϕ(vj)"| yj

Single-Layer Perceptron Training Example

ELEC320 Tutorials 2023-02-20

Learn an `OR` function using the following perceptron:

graph LR
x1 --> w1
x2 --> w2
x0 --> w0
w0 --> n[neuron]
w1 --> n
w2 --> n
n --> ϕ
ϕ --> y

where:

$W = [w_0, w_1, w_2] = [-1, 1, -2]$
$\eta = \frac12$

$w_0$ is a bias and $w_1$ and $w_2$ are the two inputs to the neuron.

We want to adjust the weights until all the input patterns on $x_1$ and $x_2$ produce the correct outputs.

The general method is as follows:

Calculate the induced local field by calculating the dot product of the inputs and weights:
\[\mathbf x \cdot \mathbf w =\text{ILF}\]
For this example:
\[\begin{bmatrix}-1 & 0 & 1\end{bmatrix} \cdot\begin{bmatrix}0 & 1 & 0\end{bmatrix}=0\]
Calculate $y$ by using the activation function:
\[y=\phi(\text{ILF})\]
The activation function in this case has a positive bias so pulls values $\geq0$ to $1$ and drops values $<0$ to $-1$:
\[y=\phi(0) =1\]
For an input of $x_1=0, x_2=1$ we expect an output of $d=1$. We calculate the error $e$ by taking the difference between $d$ and $y$:
\[\begin{aligned} e&=d-y\\ &=1-1\\ &=0 \end{aligned}\]
If the error is non-zero we adjust the weights by the following function:
\[\mathbf w(n+1)=\mathbf w(n)+\eta e(n)\mathbf x(n)\]
In this case where no adjustment is needed and $\eta=\frac12$:
\[\begin{aligned} \mathbf w(n+1)&=\mathbf w(n)+\eta e(n)\mathbf x(n)\\ &=\begin{bmatrix}0\\1\\0\end{bmatrix}+\frac12\times 0\times\begin{bmatrix}-1\\0\\1\end{bmatrix}\\ &=\begin{bmatrix}0\\1\\0\end{bmatrix} \end{aligned}\]

Multi-Layer Perceptrons

ELEC320 Lectures 2023-02-28

Backpropagation

To update the weights for a multi-layer perceptron we need to complete the following steps:

Feed Forward
Backwards Pass

The general formula to update the weights on a multi-layer perceptron is:

\[\mathbf w_{(n+1)}=\mathbf w_n-\eta\nabla_\mathbf wE\]

where:

$\nabla_\mathbf wE$ is the gradient with respect to $E$.
$\eta$ is the learning rate.

We also have the following functions available:

Output error for the $j^{\text{th}}$ neuron:
\[e_j(n)=d_j(n)-y_i(n)\]
Output network error:
\[E(n)=\frac12\sum_{j\in\text{output}}e^2_j(n)\]
Average network error:
\[E_\text{avg}=\frac1N\sum^N_{n=1}E(n)\]
Induced local field (ILF):
\[v_j(n)=\text{sum weights and measures}\]
Neuron output:
\[y_j(n)=\phi(v_j(n))\]
Local gradient (for one node):
\[\delta_j(n)=e_j(n)\times\underbrace{\phi'(v_j(n))}_\text{activation function gradient}\]
Weight update function:
\[\nabla_\mathbf WE=-\delta_j(n)y_i(n)\]

Backwards Pass

Consider the following key equations:

A (output neuron):

\[\delta_j(n)=e_j(n)\phi'(v_j(n))\]

B (hidden neuron):

\[\delta_j(n)=\phi'_j\left(v_j\left(n\right)\right)\sum_n\left(\delta_k(n)w_{kj}(n)\right)\]

1 (weight difference):

\[\begin{aligned} \Delta w_{ji}(n)&=-\eta\frac{\partial E(n)}{\partial w_{ji}(n)}\\ &=\eta\delta_j(n)y_i(n) \end{aligned}\]

and follow the steps:

Calculate the local gradients by using A and B for the output and hidden layers.
Use equation 1 to update the gradients connecting the $j^\text{th}$ and $k^\text{th}$ neurons.

Multi-Layer Perceptron Training

ELEC320 Lectures 2023-02-28

Backpropagation Activation Functions

In multi-layer perceptrons we require activation functions with continuous gradients so that their derivatives are significant.

Using a threshold would not be acceptable as it isn’t continuous.

Both of the following functions create small derivatives when used in large networks. Leaky ReLU may be a better option in those applications.

Sigmoid

This is given by the following function:

\[\phi(z)=\frac1{1+\exp(-\alpha z)}\]

The derivative of this function is:

\[\phi'(v_j(n))=\alpha y_j(n)(1-y_i(n))\]

This can be derived by working out the partial derivatives for $\frac{\partial\phi}{\partial z}$.

Hyperbolic Tangent

This function $\tanh$ is readily available and has range of $(-1, 1)$ as opposed to $(0, 1)$ for sigmoid.

We can combine it with the scalars $\alpha$ and $\beta$ like so:

\[\phi_j(v_j(n))=\alpha\tanh\left(\beta v_j(n)\right)\]

It derivative is:

\[\phi'_j\left(v_j(n)\right)=\frac\beta\alpha\left(\alpha^2-y^2_j(n)\right)\]

Backpropagation Training Modes

Sequential training mode (also known as stochastic gradient descent) adjust weights after each input.
Batch training mode sums the errors across all samples and adjusts the weights after running all the input patterns.

A combination of the two is called mini batch which has smaller batch sizes:

This reduces storage requirements over batch training, produces better gradients than sequential and avoids local minima better than sequential.

Enhancing BP Learning Rate

We can use momentum ($\mu$) to enhance the learning rate $\eta$ without having trouble with local minima:

\[\delta w_{ji}(n)=\underbrace{\eta\delta_j(n)y_i(n)}_\text{standard}+\underbrace{\mu\delta w_{ji}(n-1)}_\text{momentum}\]

Solutions to Overfitting

There are several solutions to overfitting:

Early Stopping
Regularisation (smoothing the gradient)
- Weight Decay
- Penalisation of Derivatives
- Weight Elimination
Optimal Brain Surgeon (OBS):
- Remove least significant connections during training.
Dropout
- Remove least significant connections during testing.

Early Stopping

This method splits the labelled data into three sets:

Training Set
Validation Set

This is new.
Testing Set

This method stops learning when validation error is at minimum. This stops the network learning noise and helps it be more generalised.

Multi-Layer Perceptron Example

ELEC320 Tutorials 2023-02-28

Radial Basis Function Networks

ELEC320 Lectures 2023-03-06

Pattern Separability

A complex pattern-classification problem, cast in a high-dimensional space non-linearly, is more likely to be linearly separable than in a low-dimensional space.

This can be reduced to:

We can use a non-linear function to transform our data so that it can be classified using a linear perceptron.

This has the benefit of using a small network with less layers.

Let $D=\{\mathbf x_1,\ldots,\mathbf x_N\}$ be a set of $d$-dimensional patterns which belong to either of two classes $C_1$ or $C_2$.

For each pattern $\mathbf x_i$ we define some real-valued functions $\phi_i(\mathbf x)$ for $i=1,\ldots,b$. This defines teh multivalued mapping:

\[\phi(\mathbf X)\equiv\left[\phi_1(\mathbf x),\ldots,\phi_b(\mathbf x)\right]^T:\mathbb R^d\rightarrow\mathbb R^b\]

A dichotomy (binary partition) of the samples in $D$ if $\phi$-separable if there is a vector $\mathbf w$ such that:

\[\exists\mathbf x\in\mathbb R^b: \begin{cases} \mathbf x^T\phi(\mathbf xi)\geq 0&\forall\mathbf x\in C_1\\ \mathbf x^T\phi(\mathbf x)<0&\forall\mathbf x\in C_2 \end{cases}\]

The hyperplane $\mathbf x^T\phi(\mathbf x)=0$ partitions the $b$-dimensional $\phi$-space to two sides. At the same time the $d$-dimensional input-space to two parts.

Hence instead of using straight lines to partition the space we could use the following:

Planes
Spheres/Circles
Quadrics

This allows us to classify XOR using a linear perceptron instead of using a multi-layer one.

Interpolation

For a given set of $N$ $d$-dimensional points $\mathbf x_i$, and a corresponding set of associated vlues $y_i$, there is a function $F$ such that $F(\mathbf x_i)=y_i,\forall i$. This is to say there is a surface that passes through all the points in the dataset.

One way for solving this problem is with the method of radial basis functions (RBFs). We define a function:

\[F(\mathbf x)\equiv\sum^N_{i=1}\mathbf w_i\phi(\parallel x-x_i\parallel)\]

The parallel brackets mean that we calculate the distance between the two points.

$\phi(\parallel x-x_i\parallel)$ is a RBF: A real-valued function whose value depends only on the distance between the input an some fixed point $x_i$.

Commonly Used RBFs

Gaussian Functions:

\[\phi(r)=\exp\left(-\frac{r^2}{r\sigma^2}\right),\sigma>0,r\in\mathbb R\]

There are more but they won’t be used in this module generally.

RBF Neural Network

This network approximates the nearest neighbour approach. There is only one hidden layer that uses RBFs instead of a simple sum:

graph LR
x1["x1(n)"] --> ϕ1
x2["x2(n)"] --> ϕ2
xd["xd(n)"] --> ϕM
x1 --> ϕ2
x1 --> ϕM
x2 --> ϕ1
x2 --> ϕM
xd --> ϕ1
xd --> ϕ2
ϕ1 -->|w1| sum
ϕ2 -->|w2| sum
ϕM -->|wM| sum
sum --> y
ϕ0 -->|w0| sum
subgraph input
x1
x2
xd
end
subgraph hidden
ϕ1
ϕ2
ϕM
end
subgraph output
sum
end

The following functions are used:

\[\begin{equation} \tag{1} F(\mathbf x)=\sum^M_{i=1}w_i\phi_i(\mathbf x) \end{equation}\] \[\begin{equation} \tag{2} \phi_i(\mathbf x)\equiv\exp\left(-\frac{\parallel x-c_i\parallel^2_2}{2\sigma^2_i}\right) \end{equation}\]

Radial Basis Function Training (RBF NN)

ELEC320 Lectures 2023-03-10

Weight Update Function

\[\mathbf w=(\phi^T\phi+\lambda\mathbf I_{M\times M})^{-1}\phi^T\mathbf y\]

where:

$\lambda$ if the user-defined regularisation parameter:
- $\lambda=0$ indicates that the input data is completely trustworthy.
- $\lambda=\infty$ indicates that the dataset needs significant smoothing.

Radial Basis Functions vs. Multi-Layer Perceptrons

Universal Approximators: Given any function $f(x)=y$, we can approximate such a function using either network type.

Radial Basis Functions Require a Matrix inverse to be taken. This is slow and doesn’t scale well.
Despite the lengthy weight update process in MLP, they are preferred as it can be parallelised for large datasets.

Support Vector Machines

ELEC320 Lectures 2023-03-20

Optimal Hyperplane

Support vector machines create a hyperplane that linearly separates a set of data. This is called the optimal hyperplane.

The equation of optimal hyperplane takes the form:

\[g(x)=\mathbf x^T+b=0\]

Support Vectors

Support vectors are the closes data-points in the dataset to the optimal hyperplane. They are used to determine the location of the hyperplane that separates the set.

Using the function for the optimal hyperplane we can say:

\[g(\mathbf x_\text{sv})=1\]

The euclidean distance of a support vector from the optimal hyperplane can be calculated by:

\[\begin{aligned} \text{distance}\ r &= \frac{g(\mathbf x)}{\parallel \mathbf w\parallel_2}\\ &=\frac1{\parallel \mathbf w\parallel_2} \end{aligned}\]

Non-Separable Patterns

We can either use a slack variable ($\xi$) if the data is close to linearly separable, or we can transform the data to become linear.

Kernels

We can project each patterns $\mathbf x_i$ to a new feature space of higher dimensionality (compared to the original input space) using a projection function $\phi$:

\[\phi(\mathbf x_i):\mathbb R^d\rightarrow\mathbb R^b\]

We can use various types of kernels but the general form of the kernel machine is:

\[\begin{aligned} K(\mathbf x_i, \mathbf x_j)&=\phi(\mathbf x_i)^T\phi(\mathbf x_j)\\ &=\sum^b_{k=1}\phi_k(\mathbf x_i)\phi_k(\mathbf x_j) \end{aligned}\]

ELEC320 (Neural Networks)

Brains and (A)NNs

Neuron Structure

Advantages of ANNs

Using Machine Learning

Neuron Scheme

Types of Activation Functions

Threshold/Heavyside Function

Piece-wise Linear Function

Sigmoid & Hyperbolic Tangent Functions

Rectified Linear Unit (ReLU)

Network Architectures

Single-layer Feed-forward Network

Multi-layer Feed-forward Network

Convolutional Neural Networks (CNNs)

Recurrent Networks

Knowledge Representation

Training Dataset Example

Training Process

Learning

Learning Rules

Error Correction Learning

Memory Based Learning

Hebbian Learning

Competitive Learning

Boltzmann Learning

Learning Paradigms

Credit/Blame (Credit Assignment Problem)

Supervised

Unsupervised Learning

Reinforcement Learning

Least Mean Square (LMS)

Perceptron

Learn an OR function using the following perceptron:

Backpropagation

Backwards Pass

Backpropagation Activation Functions

Sigmoid

Hyperbolic Tangent

Backpropagation Training Modes

Enhancing BP Learning Rate

Solutions to Overfitting

Early Stopping

Pattern Separability

Interpolation

Commonly Used RBFs

RBF Neural Network

Weight Update Function

Radial Basis Functions vs. Multi-Layer Perceptrons

Optimal Hyperplane

Support Vectors

Non-Separable Patterns

Kernels

Learn an `OR` function using the following perceptron: