
Multi-Layer Perceptron Training

ELEC320 Lectures

Backpropagation Activation Functions

In multi-layer perceptrons we require activation functions with continuous gradients, so that meaningful derivatives are available for backpropagation.

Using a hard threshold would not be acceptable: it is not continuous, and its derivative is zero everywhere except at the step (where it is undefined), so no gradient can be propagated.

Both of the following functions produce small derivatives (vanishing gradients) when used in large networks; Leaky ReLU may be a better option in those applications.
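For reference, Leaky ReLU keeps a small, non-zero gradient for negative inputs (a standard definition, not from the lecture slides):

\[\phi(z)=\begin{cases}z, & z>0\\ az, & z\le 0\end{cases}\qquad\text{with a small slope such as }a=0.01\]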

Sigmoid

This is given by the following function:

\[\phi(z)=\frac1{1+\exp(-\alpha z)}\]

The derivative of this function is:

\[\phi'_j(v_j(n))=\alpha y_j(n)\left(1-y_j(n)\right)\]

This can be derived by working out the derivative $\frac{d\phi}{dz}$ and substituting $y_j(n)=\phi(v_j(n))$.
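As a sketch of that derivation:

\[\phi'(z)=\frac{d}{dz}\left(1+\exp(-\alpha z)\right)^{-1}=\frac{\alpha\exp(-\alpha z)}{\left(1+\exp(-\alpha z)\right)^{2}}=\alpha\,\phi(z)\left(1-\phi(z)\right)\]

Substituting $z=v_j(n)$ and $\phi(v_j(n))=y_j(n)$ gives the expression above.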

Hyperbolic Tangent

The $\tanh$ function is readily available and has a range of $(-1, 1)$, as opposed to $(0, 1)$ for the sigmoid.

We can combine it with the scalars $\alpha$ and $\beta$ like so:

\[\phi_j(v_j(n))=\alpha\tanh\left(\beta v_j(n)\right)\]

Its derivative is:

\[\phi'_j\left(v_j(n)\right)=\frac\beta\alpha\left(\alpha^2-y^2_j(n)\right)\]
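This follows from $\frac{d}{dv}\tanh(\beta v)=\beta\left(1-\tanh^2(\beta v)\right)$ and $\tanh\left(\beta v_j(n)\right)=y_j(n)/\alpha$:

\[\phi'_j\left(v_j(n)\right)=\alpha\beta\left(1-\tanh^2\left(\beta v_j(n)\right)\right)=\alpha\beta\left(1-\frac{y_j^2(n)}{\alpha^2}\right)=\frac{\beta}{\alpha}\left(\alpha^2-y_j^2(n)\right)\]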

Backpropagation Training Modes

  • Sequential training mode (also known as stochastic gradient descent) adjusts the weights after each input pattern.
  • Batch training mode sums the errors across all samples and adjusts the weights after running all the input patterns.

A combination of the two is called mini-batch training, which uses smaller batches:

  • This reduces storage requirements compared to batch training, while producing better gradient estimates and avoiding local minima better than sequential training (a sketch of all three modes follows this list).
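The following is a minimal sketch (not the lecture's code) contrasting the three modes on a single linear neuron trained with squared error; the function names are illustrative.

```python
# Sketch of sequential, batch and mini-batch weight updates.
import numpy as np

def gradient(w, x, d):
    """Gradient of 0.5 * (d - w.x)^2 with respect to w for one pattern."""
    e = d - w @ x        # error signal for this pattern
    return -e * x        # dE/dw

def train(X, D, mode="mini-batch", eta=0.01, batch_size=4, epochs=100):
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        if mode == "sequential":
            # Sequential (stochastic): update after every single pattern.
            for x, d in zip(X, D):
                w -= eta * gradient(w, x, d)
        elif mode == "batch":
            # Batch: sum the gradients over all patterns, then update once.
            g = sum(gradient(w, x, d) for x, d in zip(X, D))
            w -= eta * g
        else:
            # Mini-batch: update after each small group of patterns.
            for i in range(0, len(X), batch_size):
                g = sum(gradient(w, x, d)
                        for x, d in zip(X[i:i + batch_size], D[i:i + batch_size]))
                w -= eta * g
    return w
```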

Enhancing BP Learning Rate

We can use momentum ($\mu$) to effectively increase the learning rate $\eta$ without running into trouble with local minima:

\[\Delta w_{ji}(n)=\underbrace{\eta\delta_j(n)y_i(n)}_\text{standard}+\underbrace{\mu\Delta w_{ji}(n-1)}_\text{momentum}\]
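A minimal sketch of this update rule, assuming NumPy and illustrative variable names:

```python
import numpy as np

def momentum_update(w, delta_j, y_i, prev_dw, eta=0.1, mu=0.9):
    """One backpropagation weight update with a momentum term.

    w        -- weight matrix, w[j][i] connects input i to neuron j
    delta_j  -- local gradients delta_j(n) for the neurons in this layer
    y_i      -- outputs y_i(n) of the previous layer
    prev_dw  -- the weight change Delta w(n-1) from the previous step
    """
    dw = eta * np.outer(delta_j, y_i) + mu * prev_dw   # standard term + momentum term
    return w + dw, dw                                  # return dw for reuse next step
```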

Solutions to Overfitting

There are several solutions to overfitting:

  • Early Stopping
  • Regularisation (smoothing the gradient)
    • Weight Decay (see the example after this list)
    • Penalisation of Derivatives
    • Weight Elimination
  • Optimal Brain Surgeon (OBS):
    • Remove least significant connections during training.
  • Dropout
    • Randomly drop units and their connections during training (the full network is used at test time).
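For example, weight decay adds a penalty on large weights to the cost function (a standard form; the exact regulariser used in the lectures may differ):

\[E_\text{total}(n)=E(n)+\frac{\lambda}{2}\sum_{j,i}w_{ji}^{2}\]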

Early Stopping

This method splits the labelled data into three sets:

  • Training Set
  • Validation Set

    This is the new addition: it is used to monitor the error on data not used for training.

  • Testing Set

This method stops training when the validation error is at its minimum. This prevents the network from learning the noise in the training data and helps it generalise better.
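A minimal sketch of an early-stopping loop; the `model` interface (`train_epoch`, `validation_error`, `get_weights`, `set_weights`) is assumed for illustration and is not a real API:

```python
def train_with_early_stopping(model, train_set, val_set,
                              max_epochs=1000, patience=10):
    """Stop training once the validation error stops improving."""
    best_error = float("inf")
    best_weights = model.get_weights()
    epochs_since_best = 0

    for _ in range(max_epochs):
        model.train_epoch(train_set)               # one pass over the training set
        error = model.validation_error(val_set)    # error on the validation set

        if error < best_error:                     # validation error still falling
            best_error = error
            best_weights = model.get_weights()     # remember weights at the minimum
            epochs_since_best = 0
        else:
            epochs_since_best += 1
            if epochs_since_best >= patience:      # minimum has (likely) been passed
                break

    model.set_weights(best_weights)                # restore the best weights found
    return model
```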