Skip to content
UoL CS Notes

Regression & Data Analysis

COMP116 Lectures

This is the method that we use to draw trends from experimental data. This is better than a table as it makes the data clearer.

Example

How to we best fit a line to this graph:

$x$ 1.5 2.4 3.6 4.2 4.8 5.7 6.3 7.9
$y$ 1.2 1.5 1.9 2.0 2.2 2.4 2.5 2.8

It seems that a curve would be best suited.

Least Squares

The standard interpretation of best fit is least squares.

We have data observation pairs:

\[\{\langle x_k,y_k\rangle:1\leq k\leq n\}\]

We want a function $F:\Bbb R\rightarrow \Bbb R$ that best describes these (minimises):

\[\sum^n_{k=1}(y_k-F(x_k))^2\]

This gives smaller values if the function fits the data better.

Linear Functions & Regression

This function is in the following form:

\[F(x)=ax+b\]

This means that finding the best line is finding the combination of $a$ and $b$ which results in the least squares sum.

Applying Least Squares

To minimise we would use the following function:

\[F(a,b)=\sum^n_{k=1}(y_k-ax_k-b)^2\]

Notice that the values $\{\langle x_k,y_k\rangle:1\leq k\leq n\}$ are constants.

To solve this we use partial differentiation, finding $\left(\frac{\partial F}{\partial a},\frac{\partial F}{\partial b}\right)$, and find the critical points for these.

Best Line Fit Function

From that we find that the best line fit function is:

\[\left(\frac{nW_{xy}-W_xW_y}{nW_{xx}-(W_x)^2}\right)x+\left(\frac{W_{xx}W_y-W_{xy}W_x}{nW_{xx}-(W_x)^2}\right)\]

Where:

\[\begin{aligned} W_x&=\sum^n_{k=1}x_k\\ W_y&=\sum^n_{k=1}y_k\\ W_{xx}&=\sum^n_{k=1}x_k^2\\ W_xy&=\sum^n_{k=1}x_ky_k\\ \end{aligned}\]

Other Functions

Other functions can have the following forms:

  • Powers - $ax^b$
  • Exponentials - $ae^{bx}$
    • Also can be written as $a\exp(bx)$.
  • Logarithms - $a\ln x+b$
  • Linear Rational Functions - $\frac1{ax+b}$

We may also have more than one parameter to optimise:

  • Quadratic Functions - $ax^2+bx+c$

Data Analysis Issues

The functions shown so far require manual selection and thus don’t always exactly match the pairs of data that have been observed.

  • Lagrange interpolation finds the best polynomial to fit a set of data.
  • Trigonometric interpolation finds the best trigonometric curve to fit a set of data.
    • This is useful in signal processing.