Chapter Four – Least Squares Regression

Stuart Murphy

This technique is often used when many data points are involved and the analyst would like the resulting polynomial to be influenced by all of them. The degree of the interpolating polynomial should be selected ahead of time based on the expertise of the analyst. As a general rule of thumb, the lowest-degree polynomial that appears to fit is the better choice. So, one might fit a quadratic or cubic to dozens or even hundreds of points. For the chosen degree, the result is the best fit to the data in the least squares sense: no other polynomial of that degree has a smaller sum of squared differences from the data.

To gain an understanding of the underlying principle and process we will begin with a simple data set consisting of five points.

Scenario

A helium balloon that gathers meteorological data is released. For each mile it rises, the distance it travels downrange is also recorded. The data is recorded in the following table.

Altitude and Downrange

Altitude - x miles	Downrange - y miles
1	2
2	3
3	5
4	5
5	4

Figure 4.1 Helium Balloon Data Points. The five graphed points show how far downrange the weather balloon travels for each mile of increase in altitude.

Let’s begin with the simplest model: the straight line. We want to find the best-fit linear equation, that is, the one that minimizes the sum of the squared differences between the actual and interpolated values of y at each given value of x.

1) A generalized linear equation $y = a x + b$ will serve as our starting point.

2) With a little rearranging, we obtain the difference between an actual value of y and the value the line gives for the same x: $y - (a x + b)$. This difference is zero only when the point lies exactly on the line.

We will square each difference so that the result is never negative, as we are not interested in the direction of a difference but only in its size.

Since we want the sum of these squared differences, we have the following for this example:

$[y_{1} - (a x_{1} + b)]^{2}$

$+ [y_{2} - (a x_{2} + b)]^{2}$

$+ [y_{3} - (a x_{3} + b)]^{2}$

$+ [y_{4} - (a x_{4} + b)]^{2}$

$+ [y_{5} - (a x_{5} + b)]^{2}$
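The sum above can be evaluated for any trial line, which makes it easier to see what is being minimized. The following Python sketch is illustrative only: the function name `sum_squared_error` and the two trial lines are assumptions chosen for the example, not fitted values.

```python
# Balloon data from the table: altitude x (miles), downrange y (miles).
xs = [1, 2, 3, 4, 5]
ys = [2, 3, 5, 5, 4]


def sum_squared_error(a, b):
    """Sum of the squared differences between each y and the line a*x + b."""
    return sum((y - (a * x + b)) ** 2 for x, y in zip(xs, ys))


# Two arbitrary trial lines (hypothetical guesses, not the best fit).
print(sum_squared_error(1, 1))  # line y = x + 1 gives 5
print(sum_squared_error(0, 4))  # horizontal line y = 4 gives 7
```

A smaller result means the trial line sits closer to the data overall; least squares looks for the a and b that make this number as small as possible.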

Interestingly, squaring these differences gives an expression that is quadratic in the constants a and b, which is what makes a linear solution possible. Writing E for the sum above, we take one partial derivative of E for each constant we are trying to solve for, in this case a and b. Because E is an upward-facing quadratic in each constant, its minimum occurs where both partial derivatives equal zero. Setting them to zero results in two linear equations in two unknowns, which we can solve using elimination/substitution or more advanced techniques such as matrix computations.

1) $\frac{\partial E}{\partial a} = - 2 \sum_{i = 1}^{5} [y_{i} - (a x_{i} + b)] x_{i} = 0$

2) $\frac{\partial E}{\partial b} = - 2 \sum_{i = 1}^{5} [y_{i} - (a x_{i} + b)] x_{i}^{0} = 0$
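One way to gain confidence in the two derivative formulas above is to compare them with a numerical slope estimate at an arbitrary trial point. This is a minimal sketch; the function names, the trial point (1, 1), and the step size h are all assumptions made for illustration.

```python
xs = [1, 2, 3, 4, 5]
ys = [2, 3, 5, 5, 4]


def error(a, b):
    """Sum of squared differences for the line a*x + b."""
    return sum((y - (a * x + b)) ** 2 for x, y in zip(xs, ys))


def d_error_da(a, b):
    """Formula 1: derivative of the sum with respect to a."""
    return -2 * sum((y - (a * x + b)) * x for x, y in zip(xs, ys))


def d_error_db(a, b):
    """Formula 2: derivative of the sum with respect to b."""
    return -2 * sum(y - (a * x + b) for x, y in zip(xs, ys))


def central_difference(f, a, b, wrt, h=1e-6):
    """Numerical slope of f in the a or b direction (h is an assumed step)."""
    if wrt == "a":
        return (f(a + h, b) - f(a - h, b)) / (2 * h)
    return (f(a, b + h) - f(a, b - h)) / (2 * h)


a0, b0 = 1.0, 1.0  # arbitrary trial point, not the solution
print(d_error_da(a0, b0), central_difference(error, a0, b0, "a"))
print(d_error_db(a0, b0), central_difference(error, a0, b0, "b"))
```

Each printed pair should agree to several decimal places. Neither derivative is zero at this trial point, which indicates that the line y = x + 1 is not the best fit.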

Next, we simplify each equation. Since both are equal to zero, we can divide out the -2, and then we distribute the summation notation over the terms.

Simplify 1) $\sum_{i = 1}^{5} x_{i} y_{i} - a \sum_{i = 1}^{5} x_{i}^{2} - b \sum_{i = 1}^{5} x_{i} = 0$

Simplify 2) $\sum_{i = 1}^{5} y_{i} - a \sum_{i = 1}^{5} x_{i} - b \sum_{i = 1}^{5} 1 = 0$

We now have two equations in two unknowns a, b. Let’s calculate the various sums and plug in.

$\sum_{i = 1}^{5} x_{i} y_{i} = 63$

$\sum_{i = 1}^{5} x_{i}^{2} = 55$

$\sum_{i = 1}^{5} x_{i} = 15$

$\sum_{i = 1}^{5} y_{i} = 19$

$\sum_{i = 1}^{5} 1 = 5$
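The five sums are easy to check by machine. The sketch below recomputes them from the balloon table; the variable names are assumptions chosen for readability.

```python
# Balloon data from the table: altitude x (miles), downrange y (miles).
xs = [1, 2, 3, 4, 5]
ys = [2, 3, 5, 5, 4]

sum_xy = sum(x * y for x, y in zip(xs, ys))  # sum of x_i * y_i
sum_xx = sum(x * x for x in xs)              # sum of x_i squared
sum_x = sum(xs)                              # sum of x_i
sum_y = sum(ys)                              # sum of y_i
n = len(xs)                                  # sum of 1 over the five points

print(sum_xy, sum_xx, sum_x, sum_y, n)  # 63 55 15 19 5
```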

I) Plug in to set up the two equations as follows:

One: $63 - 55 a - 15 b = 0$

Two: $19 - 15 a - 5 b = 0$

II) Rearrange:

One: $55 a + 15 b = 63$

Two: $15 a + 5 b = 19$

III) Apply substitution/elimination to solve for a, b

$a = \frac{3}{5} = 0.6$

$b = 2$
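The elimination step can be sketched in a few lines of Python. Exact fractions are used here so that the result matches 3/5 and 2 without rounding; the variable names are assumptions made for this illustration.

```python
from fractions import Fraction

# One: 55a + 15b = 63
# Two: 15a +  5b = 19
a1, b1, c1 = 55, 15, 63
a2, b2, c2 = 15, 5, 19

# Scale equation Two so its b coefficient matches equation One (factor 3),
# then subtract it from One to eliminate b and solve for a.
factor = Fraction(b1, b2)
a = (c1 - factor * c2) / (a1 - factor * a2)

# Substitute a back into equation Two to solve for b.
b = (c2 - a2 * a) / b2

print(a, b)  # 3/5 2
```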

We now have a polynomial that can interpolate values in the interval [1, 5]:

$y = \frac{3}{5} x + 2$ or $y = 0.6 x + 2$

Figure 4.2 Graph of Linear Solution. The graph of the linear polynomial shows that some of the points are near the interpolation line while others are not close.

As we can see, the linear solution offers an estimate that is closer to some of the given points than others. Can we do better with a curved line, that is, a second-degree polynomial?
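How close the line comes to each point can be listed directly. The sketch below, with variable names chosen only for illustration, prints the difference between each recorded y and the value given by y = 0.6x + 2. Fractions are used so the differences are exact.

```python
from fractions import Fraction

xs = [1, 2, 3, 4, 5]
ys = [2, 3, 5, 5, 4]

a, b = Fraction(3, 5), Fraction(2)  # the linear solution found above

# Difference between the recorded y and the line's value at each x.
residuals = [y - (a * x + b) for x, y in zip(xs, ys)]

for x, y, r in zip(xs, ys, residuals):
    print(f"x={x}  actual={y}  line={float(a * x + b)}  difference={float(r)}")
```

The largest difference occurs at x = 3, where the line passes 1.2 miles below the recorded value.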

The Quadratic Solution

The challenge is to expand on the above technique and apply it to develop the best fit quadratic equation.

In the linear case, our goal was to solve two equations in two unknowns. Now we want to solve three equations in three unknowns. The unknowns are the constants a, b, c of our quadratic equation in standard form:

$y = a x^{2} + b x + c$

Rearranging the standard form, we develop the Least Squares Summation equation:

$E = \sum_{i = 1}^{5} [y_{i} - (a x_{i}^{2} + b x_{i} + c)]^{2}$

Now we take partial derivatives with respect to each of the three constants a, b, c and set each one to zero, as follows:

$\frac{\partial E}{\partial a} = - 2 \sum_{i = 1}^{5} [y_{i} - (a x_{i}^{2} + b x_{i} + c)] x_{i}^{2} = 0$

$\frac{\partial E}{\partial b} = - 2 \sum_{i = 1}^{5} [y_{i} - (a x_{i}^{2} + b x_{i} + c)] x_{i} = 0$

$\frac{\partial E}{\partial c} = - 2 \sum_{i = 1}^{5} [y_{i} - (a x_{i}^{2} + b x_{i} + c)] = 0$

Simplify by dividing out the -2 and distributing the summation notation:

a: $\sum_{i = 1}^{5} x_{i}^{2} y_{i} - a \sum_{i = 1}^{5} x_{i}^{4} - b \sum_{i = 1}^{5} x_{i}^{3} - c \sum_{i = 1}^{5} x_{i}^{2} = 0$

b: $\sum_{i = 1}^{5} x_{i} y_{i} - a \sum_{i = 1}^{5} x_{i}^{3} - b \sum_{i = 1}^{5} x_{i}^{2} - c \sum_{i = 1}^{5} x_{i} = 0$

c: $\sum_{i = 1}^{5} y_{i} - a \sum_{i = 1}^{5} x_{i}^{2} - b \sum_{i = 1}^{5} x_{i} - 5 c = 0$

Let’s calculate the additional sums needed. We already calculated some of the sums for the linear equation. These are:

$\sum_{i = 1}^{5} x_{i} y_{i} = 63$

$\sum_{i = 1}^{5} x_{i}^{2} = 55$

$\sum_{i = 1}^{5} x_{i} = 15$

$\sum_{i = 1}^{5} y_{i} = 19$

$\sum_{i = 1}^{5} 1 = 5$

Additional sums:

$\sum_{i = 1}^{5} x_{i}^{2} y_{i} = 239$

$\sum_{i = 1}^{5} x_{i}^{4} = 979$

$\sum_{i = 1}^{5} x_{i}^{3} = 225$
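Higher-degree fits need sums of higher powers, so a small helper is convenient. The sketch below uses a hypothetical function `power_sum`, named only for this illustration, to recompute the three additional sums.

```python
xs = [1, 2, 3, 4, 5]
ys = [2, 3, 5, 5, 4]


def power_sum(x_power, y_power=0):
    """Sum of x_i**x_power * y_i**y_power over all data points."""
    return sum(x ** x_power * y ** y_power for x, y in zip(xs, ys))


print(power_sum(2, 1))  # sum of x^2 * y: 239
print(power_sum(4))     # sum of x^4: 979
print(power_sum(3))     # sum of x^3: 225
```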

Plugging in shows the three equations in three unknowns:

$239 - a (979) - b (225) - c (55) = 0$

$63 - a (225) - b (55) - c (15) = 0$

$19 - a (55) - b (15) - c (5) = 0$

Rearranging

$979 a + 225 b + 55 c = 239$

$225 a + 55 b + 15 c = 63$

$55 a + 15 b + 5 c = 19$

Solving manually or using spreadsheet software, the following equation is obtained (coefficients rounded to four decimal places):

$y = - 0.4286 x^{2} + 3.1714 x - 1$
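For readers who prefer code to a spreadsheet, the three equations can be solved with a short elimination routine. This is a minimal sketch, not the method used in the text: the function `solve_linear_system` is a hypothetical helper written for this example, and it uses exact fractions so the rounded decimals above can be checked.

```python
from fractions import Fraction


def solve_linear_system(matrix, rhs):
    """Solve a square system by Gauss-Jordan elimination with exact fractions.

    Assumes the system has a unique solution; a singular system makes the
    pivot search below fail with StopIteration.
    """
    n = len(rhs)
    # Augmented matrix: each row is the coefficients followed by the right side.
    aug = [[Fraction(v) for v in row] + [Fraction(r)]
           for row, r in zip(matrix, rhs)]
    for col in range(n):
        # Find a row at or below the diagonal with a nonzero entry in this column.
        pivot_row = next(r for r in range(col, n) if aug[r][col] != 0)
        aug[col], aug[pivot_row] = aug[pivot_row], aug[col]
        # Scale the pivot row so the pivot becomes 1.
        pivot = aug[col][col]
        aug[col] = [v / pivot for v in aug[col]]
        # Clear this column in every other row.
        for r in range(n):
            if r != col:
                factor = aug[r][col]
                aug[r] = [v - factor * p for v, p in zip(aug[r], aug[col])]
    return [row[-1] for row in aug]


a, b, c = solve_linear_system(
    [[979, 225, 55], [225, 55, 15], [55, 15, 5]],
    [239, 63, 19],
)
print(a, b, c)  # -3/7 111/35 -1
print(round(float(a), 4), round(float(b), 4), float(c))
```

The exact values -3/7 and 111/35 round to -0.4286 and 3.1714, matching the equation above.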

This is the interpolation polynomial. It generates a curved line (a parabola) that is the best-fit quadratic for the five given data points, and it estimates y values for any other point within the interval [1, 5].
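Whether the parabola improves on the straight line can be checked by comparing the sum of squared differences for each. In this sketch the function names are assumptions, and the quadratic coefficients are written as the exact fractions -3/7 and 111/35, which round to the decimals shown above.

```python
from fractions import Fraction

xs = [1, 2, 3, 4, 5]
ys = [2, 3, 5, 5, 4]


def sum_squared_error(predict):
    """Sum of squared differences between the data and a fitted function."""
    return sum((y - predict(x)) ** 2 for x, y in zip(xs, ys))


def linear(x):
    return Fraction(3, 5) * x + 2


def quadratic(x):
    return Fraction(-3, 7) * x ** 2 + Fraction(111, 35) * x - 1


linear_error = sum_squared_error(linear)
quadratic_error = sum_squared_error(quadratic)

print(float(linear_error))     # 3.2
print(float(quadratic_error))  # about 0.6286
```

The quadratic's total is roughly one fifth of the line's total, so on this data set the curved line is the closer fit.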

Matrix Operations simplify the calculations

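Note: if A is the matrix with one row $[x_{i}^{2}, x_{i}, 1]$ per data point, then multiplying the transpose of A by A produces the sums that form the left-hand sides of the n equations in n unknowns, and multiplying the transpose of A by the column of y values produces the sums on the right-hand sides. This holds true no matter how many data points are involved.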
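The note can be illustrated for the quadratic case. The sketch below assumes that "the matrix" is the one whose rows are x squared, x, and 1 for each data point; the helper functions `transpose` and `matmul` are written for this example rather than taken from a library.

```python
xs = [1, 2, 3, 4, 5]
ys = [2, 3, 5, 5, 4]
degree = 2

# One row per data point, with columns x^2, x, 1 (assumed layout).
A = [[x ** p for p in range(degree, -1, -1)] for x in xs]


def transpose(m):
    """Swap rows and columns of a matrix stored as a list of rows."""
    return [list(col) for col in zip(*m)]


def matmul(p, q):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(u * v for u, v in zip(row, col)) for col in zip(*q)]
            for row in p]


At = transpose(A)
AtA = matmul(At, A)                   # coefficient sums (left-hand sides)
Aty = matmul(At, [[y] for y in ys])   # right-hand sides

print(AtA)  # [[979, 225, 55], [225, 55, 15], [55, 15, 5]]
print(Aty)  # [[239], [63], [19]]
```

These are the same coefficients and right-hand sides as the three rearranged equations in the quadratic solution. Changing `degree` to 1 reproduces the two equations of the linear case.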