Multi-Layer Perceptron Training
Backpropagation Activation Functions
In multi-layer perceptrons we require activation functions that are continuously differentiable, so that meaningful (non-zero) derivatives are available for backpropagation.
Using a hard threshold would not be acceptable, as it is not continuous (and therefore not differentiable).
Both of the following functions saturate and so produce very small derivatives when used in large networks (vanishing gradients). Leaky ReLU may be a better option in those applications.
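As a point of comparison, here is a minimal NumPy sketch of Leaky ReLU and its derivative; the negative-side slope `a` is an assumed hyperparameter (0.01 is a common choice, not a value from these notes):

```python
import numpy as np

def leaky_relu(z, a=0.01):
    """Leaky ReLU: identity for positive inputs, small slope `a` for negative ones."""
    return np.where(z > 0, z, a * z)

def leaky_relu_derivative(z, a=0.01):
    """Derivative is 1 for positive inputs and `a` otherwise, so it never vanishes."""
    return np.where(z > 0, 1.0, a)
```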
Sigmoid
This is given by the following function:
\[\phi(z)=\frac1{1+\exp(-\alpha z)}\]The derivative of this function is:
\[\phi'(v_j(n))=\alpha y_j(n)(1-y_j(n))\]This can be derived by working out the derivative $\frac{\partial\phi}{\partial z}$ and substituting $y_j(n)=\phi(v_j(n))$.
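A minimal NumPy sketch of the sigmoid and its derivative, with the derivative expressed in terms of the output $y_j(n)=\phi(v_j(n))$ as above; the vectorised form and default `alpha` are assumptions, not from the notes:

```python
import numpy as np

def sigmoid(v, alpha=1.0):
    """Logistic sigmoid: phi(v) = 1 / (1 + exp(-alpha * v))."""
    return 1.0 / (1.0 + np.exp(-alpha * v))

def sigmoid_derivative(y, alpha=1.0):
    """Derivative expressed via the output y = phi(v): alpha * y * (1 - y)."""
    return alpha * y * (1.0 - y)

v = np.array([-2.0, 0.0, 2.0])
y = sigmoid(v)
print(sigmoid_derivative(y))  # largest at v = 0, shrinks as |v| grows (saturation)
```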
Hyperbolic Tangent
This function, $\tanh$, is readily available and has a range of $(-1, 1)$, as opposed to $(0, 1)$ for the sigmoid.
We can combine it with the scalars $\alpha$ and $\beta$ like so:
\[\phi_j(v_j(n))=\alpha\tanh\left(\beta v_j(n)\right)\]Its derivative is:
\[\phi'_j\left(v_j(n)\right)=\frac\beta\alpha\left(\alpha^2-y^2_j(n)\right)\]
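A minimal sketch of the scaled tanh and its derivative, again written in terms of the output $y_j(n)$; the values $\alpha=1.7159$ and $\beta=2/3$ are commonly quoted defaults and are an assumption here, not values given in these notes:

```python
import numpy as np

def scaled_tanh(v, alpha=1.7159, beta=2.0 / 3.0):
    """phi(v) = alpha * tanh(beta * v); range is (-alpha, alpha)."""
    return alpha * np.tanh(beta * v)

def scaled_tanh_derivative(y, alpha=1.7159, beta=2.0 / 3.0):
    """Derivative via the output y: (beta / alpha) * (alpha**2 - y**2)."""
    return (beta / alpha) * (alpha**2 - y**2)
```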
Backpropagation Training Modes
- Sequential training mode (also known as stochastic gradient descent) adjusts the weights after each input pattern is presented.
- Batch training mode sums the errors across all samples and adjusts the weights only after all the input patterns have been run.
A compromise between the two is mini-batch training, which uses smaller batches:
- This reduces storage requirements compared to batch training, produces smoother gradient estimates than sequential training, and avoids local minima better than batch training (a sketch of the three update schemes follows this list).
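A minimal sketch contrasting the three update schemes on a single linear unit trained by gradient descent; the data, batch size, and hyperparameters are placeholders (assumptions). Sequential and batch modes fall out as the special cases `batch_size=1` and `batch_size=len(X)`:

```python
import numpy as np

def train(X, t, w, eta=0.1, batch_size=16, epochs=10):
    """Gradient descent on a linear unit y = X @ w with squared error.

    batch_size=1        -> sequential (stochastic) mode
    batch_size=len(X)   -> batch mode
    anything in between -> mini-batch mode
    """
    n = len(X)
    for _ in range(epochs):
        order = np.random.permutation(n)               # shuffle patterns each epoch
        for start in range(0, n, batch_size):
            idx = order[start:start + batch_size]
            y = X[idx] @ w                             # forward pass on the (mini-)batch
            grad = X[idx].T @ (y - t[idx]) / len(idx)  # error gradient averaged over the batch
            w -= eta * grad                            # one weight update per (mini-)batch
    return w
```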
Enhancing BP Learning Rate
We can use momentum ($\mu$) to enhance the effective learning rate $\eta$ without running into trouble with local minima:
\[\Delta w_{ji}(n)=\underbrace{\eta\delta_j(n)y_i(n)}_\text{standard}+\underbrace{\mu\Delta w_{ji}(n-1)}_\text{momentum}\]
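A minimal sketch of the momentum update above; `local_grad` plays the role of $\delta_j(n)\,y_i(n)$, and the default values of `eta` and `mu` are assumptions:

```python
def momentum_step(w, delta_w_prev, local_grad, eta=0.1, mu=0.9):
    """One update: Delta w(n) = eta * delta_j(n) * y_i(n) + mu * Delta w(n-1).

    Returns the updated weights w(n+1) = w(n) + Delta w(n) and the step itself,
    which is fed back in as `delta_w_prev` on the next call.
    """
    delta_w = eta * local_grad + mu * delta_w_prev
    return w + delta_w, delta_w
```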
Solutions to Overfitting
There are several solutions to overfitting:
- Early Stopping
- Regularisation (smoothing the gradient)
- Weight Decay
- Penalisation of Derivatives
- Weight Elimination
- Optimal Brain Surgeon (OBS):
  - Remove the least significant connections during training.
- Dropout:
  - Randomly remove (drop out) units during training; the full network is used at test time (see the sketch after this list).
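A minimal sketch of (inverted) dropout applied to a layer's activations; the keep probability `p` is an assumed hyperparameter:

```python
import numpy as np

def dropout(activations, p=0.5, training=True):
    """Inverted dropout: randomly zero units during training and rescale by 1/p
    so the expected activation is unchanged; do nothing at test time."""
    if not training:
        return activations
    mask = (np.random.rand(*activations.shape) < p) / p
    return activations * mask
```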
Early Stopping
This method splits the labelled data into three sets:
- Training Set
- Validation Set (this is new)
- Testing Set
This method stops learning when the validation error is at its minimum. This prevents the network from learning the noise in the training data and helps it generalise better.
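A minimal sketch of early stopping against a held-out validation set; `train_one_epoch`, `validation_error`, and the weight accessors are hypothetical helpers standing in for a full training loop, and the patience value is an assumption:

```python
def early_stopping_fit(model, train_set, val_set, max_epochs=1000, patience=10):
    """Stop when the validation error has not improved for `patience` epochs,
    keeping the weights from the epoch with the lowest validation error."""
    best_error = float("inf")
    best_weights = model.get_weights()               # hypothetical accessor
    epochs_without_improvement = 0
    for epoch in range(max_epochs):
        train_one_epoch(model, train_set)            # hypothetical helper
        error = validation_error(model, val_set)     # hypothetical helper
        if error < best_error:
            best_error = error
            best_weights = model.get_weights()
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                break
    model.set_weights(best_weights)                  # restore the best weights found
    return model
```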