Title: Find a linear least squares fit for a set of points in Python
Linear least squares is a technique for finding the line that best fits a set of points. That lets you approximate the hypothetical function that might have generated the points.
Overview
Suppose you have a set of data points that you believe were generated by a process that should ideally be linear. In that case, you might like to find the best parameters m and b to make the line y = m * x + b fit those points as closely as possible.
A common approach to this problem is to minimize the sum of the squares of the vertical distances between the line and the points. For example, suppose the point P0 = (x0, y0) is one of your data points. The vertical error for that point is the difference between y0 and the line's Y coordinate for that X position. In this case, that's y0 - (m * x0 + b). To calculate the total squared error, you square this error and add up the results for all of the data points.
Keep in mind that you know all of the points so, for given values of m and b, you can easily loop through all of the points and calculate the error.
Here's a function that does just that:
import math

def error_squared(points, m, b):
    '''Return the error squared for this linear curve fit.'''
    total = 0
    for point in points:
        dy = point[1] - (m * point[0] + b)
        total += dy * dy
    return total
This code loops through the points, subtracting the line's Y coordinate at each point's X position from the point's Y coordinate. It squares that error and adds it to the total. When it finishes its loop, the function returns the total of the squared errors.
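Here's a quick sanity check of error_squared. (The function is repeated in compact form so this snippet runs on its own; the sample points are hypothetical.)

```python
# Compact restatement of error_squared so this snippet is self-contained.
def error_squared(points, m, b):
    return sum((y - (m * x + b)) ** 2 for x, y in points)

pts = [(0, 1), (1, 3), (2, 5)]   # These lie exactly on y = 2 * x + 1.
print(error_squared(pts, 2, 1))  # Perfect fit, so the total is 0.
print(error_squared(pts, 2, 0))  # Each point misses by 1, so the total is 3.
```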
As a mathematical equation, the error function E is:

    E(m, b) = Σ [yi - (m * xi + b)]²

where the sum is performed over all of the points (xi, yi).
To find the least squares fit, you need to minimize this function E(m, b). That sounds intimidating until you remember that the xi and yi values are all known; they're the values you're trying to fit with the line.
The only variables in this equation are m and b so it's relatively easy to minimize this equation by using a little calculus. Simply take the partial derivatives of E with respect to m and b, set the two resulting equations equal to 0, and solve for m and b.
Taking the partial derivative with respect to m and rearranging a bit to gather common terms and pull constants out of the sums you get:

    ∂E/∂m = 2 * m * Σ xi² + 2 * b * Σ xi - 2 * Σ xi * yi
Taking the partial derivative with respect to b and rearranging a bit you get:

    ∂E/∂b = 2 * m * Σ xi + 2 * b * Σ 1 - 2 * Σ yi
To find the error function's minimum, you set these two equations equal to 0 and solve for m and b.
To make working with the equations easier, let:

    S1  = Σ 1 (the number of points)
    Sx  = Σ xi
    Sy  = Σ yi
    Sxx = Σ xi²
    Sxy = Σ xi * yi
If you make these substitutions, set the equations equal to 0, and divide out the common factor of 2, you get:

    m * Sxx + b * Sx - Sxy = 0
    m * Sx + b * S1 - Sy = 0
Solving for m and b gives:

    m = (Sxy * S1 - Sx * Sy) / (Sxx * S1 - Sx * Sx)
    b = (Sxy * Sx - Sy * Sxx) / (Sx * Sx - S1 * Sxx)
Again these look like intimidating equations, but all of the S's are values that you can calculate given the data points that you are trying to fit.
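To see that these formulas really do minimize E, here's a small sketch (with made-up data points) that computes m and b from the S values and then checks that perturbing either parameter only increases the error:

```python
# Sum of squared vertical errors for a candidate line y = m * x + b.
def error_squared(points, m, b):
    return sum((y - (m * x + b)) ** 2 for x, y in points)

pts = [(0, 1.0), (1, 2.5), (2, 5.5), (3, 6.0)]  # hypothetical sample data

# Compute the S sums from the data.
S1 = len(pts)
Sx = sum(x for x, y in pts)
Sy = sum(y for x, y in pts)
Sxx = sum(x * x for x, y in pts)
Sxy = sum(x * y for x, y in pts)

# Apply the closed-form solution.
m = (Sxy * S1 - Sx * Sy) / (Sxx * S1 - Sx * Sx)
b = (Sxy * Sx - Sy * Sxx) / (Sx * Sx - S1 * Sxx)
print(m, b)  # m = 1.8, b = 1.05 for this data

# Any nearby line should have at least as much error.
best = error_squared(pts, m, b)
for dm in (-0.1, 0.1):
    for db in (-0.1, 0.1):
        assert error_squared(pts, m + dm, b + db) >= best
```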
Python Code
The following code calculates the S's and uses them to find the linear least squares fit for the points in the list points.
def find_linear_least_squares_fit(points):
    '''Find m and b for the least squares linear fit.'''
    # Find the values S1, Sx, Sy, Sxx, and Sxy.
    S1 = len(points)
    Sx = 0
    Sy = 0
    Sxx = 0
    Sxy = 0
    for point in points:
        Sx += point[0]
        Sy += point[1]
        Sxx += point[0] * point[0]
        Sxy += point[0] * point[1]

    # Solve for m and b.
    m = (Sxy * S1 - Sx * Sy) / (Sxx * S1 - Sx * Sx)
    b = (Sxy * Sx - Sy * Sxx) / (Sx * Sx - S1 * Sxx)
    return m, b, math.sqrt(error_squared(points, m, b))
That's all there is to it! Note that in addition to m and b, the function returns the square root of the total squared error so you can see how closely the line fits the points.
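Here's one way you might call the function on some slightly noisy data. (The function is restated compactly so the snippet runs standalone, and the data points are hypothetical.)

```python
import math

# Compact restatement of find_linear_least_squares_fit so this runs standalone.
def find_linear_least_squares_fit(points):
    S1 = len(points)
    Sx = sum(x for x, y in points)
    Sy = sum(y for x, y in points)
    Sxx = sum(x * x for x, y in points)
    Sxy = sum(x * y for x, y in points)
    m = (Sxy * S1 - Sx * Sy) / (Sxx * S1 - Sx * Sx)
    b = (Sxy * Sx - Sy * Sxx) / (Sx * Sx - S1 * Sxx)
    error = math.sqrt(sum((y - (m * x + b)) ** 2 for x, y in points))
    return m, b, error

# Points scattered loosely around y = 2 * x + 1.
pts = [(0, 1.1), (1, 2.9), (2, 5.2), (3, 6.8)]
m, b, err = find_linear_least_squares_fit(pts)
print(m, b, err)  # m ≈ 1.94, b ≈ 1.09, with a small residual error
```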
Conclusion
Run the program, click to select some points, and then click Fit to find a linear least squares fit. Click Clear to start over.
If you like, you can fit some points and then add more points to see how they change the fit.
And, of course, download the example to experiment with it and to see additional details.