Multi-Layer Perceptron Training
Backpropagation Activation Functions
In multi-layer perceptrons we require activation functions that are continuously differentiable, so that meaningful (non-zero) derivatives are available for backpropagation.
Using a hard threshold would not be acceptable, as it is not continuous (and therefore not differentiable).
Both of the following functions saturate and so produce very small derivatives when used in large networks (vanishing gradients). Leaky ReLU may be a better option in those applications.
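As a point of comparison, here is a minimal NumPy sketch of Leaky ReLU and its derivative; the negative-side slope `a` is an assumed hyperparameter (0.01 is a common choice, not a value from these notes):

```python
import numpy as np

def leaky_relu(z, a=0.01):
    """Leaky ReLU: identity for positive inputs, small slope `a` for negative ones."""
    return np.where(z > 0, z, a * z)

def leaky_relu_derivative(z, a=0.01):
    """Derivative is 1 for positive inputs and `a` otherwise, so it never vanishes."""
    return np.where(z > 0, 1.0, a)
```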
Sigmoid
This is given by the following function:
\[\phi(z)=\frac1{1+\exp(-\alpha z)}\]The derivative of this function is:
\[\phi'(v_j(n))=\alpha y_j(n)(1-y_j(n))\]This can be derived by working out the derivative $\frac{\partial\phi}{\partial z}$ and substituting $y_j(n)=\phi(v_j(n))$.
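A minimal NumPy sketch of the sigmoid and its derivative, with the derivative expressed in terms of the output $y_j(n)=\phi(v_j(n))$ as above; the vectorised form and default `alpha` are assumptions, not from the notes:

```python
import numpy as np

def sigmoid(v, alpha=1.0):
    """Logistic sigmoid: phi(v) = 1 / (1 + exp(-alpha * v))."""
    return 1.0 / (1.0 + np.exp(-alpha * v))

def sigmoid_derivative(y, alpha=1.0):
    """Derivative expressed via the output y = phi(v): alpha * y * (1 - y)."""
    return alpha * y * (1.0 - y)

v = np.array([-2.0, 0.0, 2.0])
y = sigmoid(v)
print(sigmoid_derivative(y))  # largest at v = 0, shrinks as |v| grows (saturation)
```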
Hyperbolic Tangent
This function, $\tanh$, is readily available and has a range of $(-1, 1)$, as opposed to $(0, 1)$ for the sigmoid.
We can combine it with the scalars $\alpha$ and $\beta$ like so:
\[\phi_j(v_j(n))=\alpha\tanh\left(\beta v_j(n)\right)\]Its derivative is:
\[\phi'_j\left(v_j(n)\right)=\frac\beta\alpha\left(\alpha^2-y^2_j(n)\right)\]
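A minimal sketch of the scaled tanh and its derivative, again written in terms of the output $y_j(n)$; the values $\alpha=1.7159$ and $\beta=2/3$ are commonly quoted defaults and are an assumption here, not values given in these notes:

```python
import numpy as np

def scaled_tanh(v, alpha=1.7159, beta=2.0 / 3.0):
    """phi(v) = alpha * tanh(beta * v); range is (-alpha, alpha)."""
    return alpha * np.tanh(beta * v)

def scaled_tanh_derivative(y, alpha=1.7159, beta=2.0 / 3.0):
    """Derivative via the output y: (beta / alpha) * (alpha**2 - y**2)."""
    return (beta / alpha) * (alpha**2 - y**2)
```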
Backpropagation Training Modes
- Sequential training mode (also known as stochastic gradient descent) adjusts the weights after each input pattern is presented.
- Batch training mode sums the errors across all samples and adjusts the weights only after all the input patterns have been run.
A compromise between the two is mini-batch training, which uses smaller batches:
- This reduces storage requirements compared to batch training, produces smoother gradient estimates than sequential training, and avoids local minima better than batch training (a sketch of the three update schemes follows this list).
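A minimal sketch contrasting the three update schemes on a single linear unit trained by gradient descent; the data, batch size, and hyperparameters are placeholders (assumptions). Sequential and batch modes fall out as the special cases `batch_size=1` and `batch_size=len(X)`:

```python
import numpy as np

def train(X, t, w, eta=0.1, batch_size=16, epochs=10):
    """Gradient descent on a linear unit y = X @ w with squared error.

    batch_size=1        -> sequential (stochastic) mode
    batch_size=len(X)   -> batch mode
    anything in between -> mini-batch mode
    """
    n = len(X)
    for _ in range(epochs):
        order = np.random.permutation(n)               # shuffle patterns each epoch
        for start in range(0, n, batch_size):
            idx = order[start:start + batch_size]
            y = X[idx] @ w                             # forward pass on the (mini-)batch
            grad = X[idx].T @ (y - t[idx]) / len(idx)  # error gradient averaged over the batch
            w -= eta * grad                            # one weight update per (mini-)batch
    return w
```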
Enhancing BP Learning Rate
We can use momentum ($\mu$) to enhance the effective learning rate $\eta$ without running into trouble with local minima:
\[\Delta w_{ji}(n)=\underbrace{\eta\delta_j(n)y_i(n)}_\text{standard}+\underbrace{\mu\Delta w_{ji}(n-1)}_\text{momentum}\]
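A minimal sketch of the momentum update above; `local_grad` plays the role of $\delta_j(n)\,y_i(n)$, and the default values of `eta` and `mu` are assumptions:

```python
def momentum_step(w, delta_w_prev, local_grad, eta=0.1, mu=0.9):
    """One update: Delta w(n) = eta * delta_j(n) * y_i(n) + mu * Delta w(n-1).

    Returns the updated weights w(n+1) = w(n) + Delta w(n) and the step itself,
    which is fed back in as `delta_w_prev` on the next call.
    """
    delta_w = eta * local_grad + mu * delta_w_prev
    return w + delta_w, delta_w
```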
Solutions to Overfitting
There are several solutions to overfitting:
- Early Stopping
- Regularisation (smoothing the gradient)
- Weight Decay
- Penalisation of Derivatives
- Weight Elimination
- Optimal Brain Surgeon (OBS):
  - Remove the least significant connections during training.
- Dropout:
  - Randomly remove (drop out) units during training; the full network is used at test time (see the sketch after this list).
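A minimal sketch of (inverted) dropout applied to a layer's activations; the keep probability `p` is an assumed hyperparameter:

```python
import numpy as np

def dropout(activations, p=0.5, training=True):
    """Inverted dropout: randomly zero units during training and rescale by 1/p
    so the expected activation is unchanged; do nothing at test time."""
    if not training:
        return activations
    mask = (np.random.rand(*activations.shape) < p) / p
    return activations * mask
```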
Early Stopping
This method splits the labelled data into three sets:
- Training Set
- Validation Set (this is new)
- Testing Set
This method stops learning when the validation error is at its minimum. This prevents the network from learning the noise in the training data and helps it generalise better.
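A minimal sketch of early stopping against a held-out validation set; `train_one_epoch`, `validation_error`, and the weight accessors are hypothetical helpers standing in for a full training loop, and the patience value is an assumption:

```python
def early_stopping_fit(model, train_set, val_set, max_epochs=1000, patience=10):
    """Stop when the validation error has not improved for `patience` epochs,
    keeping the weights from the epoch with the lowest validation error."""
    best_error = float("inf")
    best_weights = model.get_weights()               # hypothetical accessor
    epochs_without_improvement = 0
    for epoch in range(max_epochs):
        train_one_epoch(model, train_set)            # hypothetical helper
        error = validation_error(model, val_set)     # hypothetical helper
        if error < best_error:
            best_error = error
            best_weights = model.get_weights()
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                break
    model.set_weights(best_weights)                  # restore the best weights found
    return model
```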