How to calculate linear regression by hand

The slope of a regression line is calculated from this formula:

\[m = \frac{n\sum xy - \sum x \sum y}{n\sum x^2 - (\sum x)^2}\]

Once you have the slope, the intercept follows from:

\[b = \frac{\sum y - m \sum x}{n}\]

Together they define the regression line: y = mx + b. This line is the one that minimizes the sum of squared vertical distances from each data point to the line. That property is called least squares, and it means no other straight line fits the data with a smaller total squared error.
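
Formally, the least-squares line is the pair (m, b) that minimizes the residual sum of squares:

\[\min_{m,\,b} \; \sum_{i=1}^{n} \left(y_i - (m x_i + b)\right)^2\]

The slope and intercept formulas above are exactly the solution to this minimization.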

The five sums you need

Every linear regression calculation reduces to five numbers: n (count of data points), Σx, Σy, Σxy, and Σx². Once you have those, the slope and intercept are arithmetic.
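
As a sketch in Python (using the worked example below as data), the five sums are one pass over the points:

```python
# Compute the five sums needed for simple linear regression.
# Data: (hours studied, exam score) pairs from the worked example below.
points = [(1, 55), (2, 60), (3, 65), (4, 75), (5, 85)]

n = len(points)
sum_x = sum(x for x, _ in points)
sum_y = sum(y for _, y in points)
sum_xy = sum(x * y for x, y in points)
sum_x2 = sum(x * x for x, _ in points)

print(n, sum_x, sum_y, sum_xy, sum_x2)  # 5 15 340 1095 55
```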

Work through a complete example. Suppose you have five data points tracking hours studied and exam score:

x (hours)   y (score)   xy           x²
1           55          55           1
2           60          120          4
3           65          195          9
4           75          300          16
5           85          425          25
Σx = 15     Σy = 340    Σxy = 1095   Σx² = 55

The count n = 5. Now plug everything into the slope formula.

Calculating the slope

\[m = \frac{n\sum xy - \sum x \sum y}{n\sum x^2 - (\sum x)^2}\]

\[m = \frac{5(1095) - (15)(340)}{5(55) - (15)^2}\]

\[m = \frac{5475 - 5100}{275 - 225} = \frac{375}{50} = 7.5\]

The slope is 7.5. For each additional hour studied, the model predicts an approximately 7.5-point increase in exam score.

Calculating the intercept

\[b = \frac{\sum y - m \sum x}{n} = \frac{340 - 7.5(15)}{5} = \frac{340 - 112.5}{5} = \frac{227.5}{5} = 45.5\]

The regression equation is:

\[\hat{y} = 7.5x + 45.5\]

You can verify this makes sense. Plugging x = 3 back in gives 7.5(3) + 45.5 = 22.5 + 45.5 = 68. The actual y value at x = 3 is 65, so the model is close but not exact. The line does not pass through every point; it passes through the middle of the scatter to minimize overall error.
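
The whole calculation above fits in a small Python function; the data and expected results are from the worked example:

```python
def fit_line(points):
    """Least-squares slope and intercept via the five sums."""
    n = len(points)
    sx = sum(x for x, _ in points)
    sy = sum(y for _, y in points)
    sxy = sum(x * y for x, y in points)
    sx2 = sum(x * x for x, _ in points)
    m = (n * sxy - sx * sy) / (n * sx2 - sx ** 2)
    b = (sy - m * sx) / n
    return m, b

m, b = fit_line([(1, 55), (2, 60), (3, 65), (4, 75), (5, 85)])
print(m, b)  # 7.5 45.5
```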

Calculating R²

R² (the coefficient of determination) measures how well the regression line explains the variation in y. It ranges from 0 to 1, where 1 means the line fits the data perfectly.

First compute the mean of y:

\[\bar{y} = \frac{340}{5} = 68\]

Then build two sums. SS_tot is total variation in y. SS_res is the variation left unexplained after fitting the line.

x   y    ŷ = 7.5x + 45.5   y - ŷ   (y - ŷ)²   (y - ȳ)²
1   55   53.0              2.0     4.00       169
2   60   60.5              -0.5    0.25       64
3   65   68.0              -3.0    9.00       9
4   75   75.5              -0.5    0.25       49
5   85   83.0              2.0     4.00       289
                                   SS_res = 17.5   SS_tot = 580

\[R^2 = 1 - \frac{SS_{res}}{SS_{tot}} = 1 - \frac{17.5}{580} = 1 - 0.0302 \approx 0.970\]

An R² of 0.970 means the number of hours studied explains approximately 97% of the variation in exam scores in this data set. The remaining 3% is attributed to other factors or random variation.
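
In code, the two sums translate directly (continuing the same example, with the slope and intercept found above):

```python
def r_squared(points, m, b):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    y_mean = sum(y for _, y in points) / len(points)
    ss_res = sum((y - (m * x + b)) ** 2 for x, y in points)  # unexplained
    ss_tot = sum((y - y_mean) ** 2 for _, y in points)       # total variation
    return 1 - ss_res / ss_tot

points = [(1, 55), (2, 60), (3, 65), (4, 75), (5, 85)]
r2 = r_squared(points, 7.5, 45.5)
print(f"{r2:.3f}")  # 0.970
```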

Predicting a new value

The regression line is useful for prediction within the range of observed x values. To estimate the exam score for a student who studied 3.5 hours:

\[\hat{y} = 7.5(3.5) + 45.5 = 26.25 + 45.5 = 71.75\]

The model predicts approximately 72 points. This is called interpolation, predicting within the range of the data. Predicting beyond the observed range (say, x = 10 hours) is extrapolation and carries much more uncertainty. A regression line trained on 1 to 5 hours has no way to know whether the relationship stays linear at 10 hours, or whether returns diminish, or whether there is a ceiling effect on scores.
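
A small prediction helper can also flag extrapolation. The X_MIN/X_MAX bounds here are illustrative, taken from the observed range of the example data:

```python
X_MIN, X_MAX = 1, 5  # observed range of x in the example data

def predict(x, m=7.5, b=45.5):
    """Predict y from the fitted line, warning when x is outside the data."""
    if not (X_MIN <= x <= X_MAX):
        print(f"warning: x={x} is outside [{X_MIN}, {X_MAX}] (extrapolation)")
    return m * x + b

print(predict(3.5))  # 71.75
```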

When linear regression is appropriate

Linear regression works when the relationship between x and y is reasonably straight, the residuals (errors) are roughly random and not patterned, and there are no severe outliers dominating the fit.

It breaks down when the underlying relationship is curved (use polynomial or nonlinear regression instead), when one or a few extreme data points pull the line away from the majority, or when the variance of residuals grows or shrinks with x (called heteroscedasticity).

A quick check: plot your data before fitting a line. If the scatter plot shows a curve, a U-shape, or a funnel, a straight line will give a poor fit, even when R² looks respectable.

Linear regression also assumes the relationship runs in one direction: you are using x to predict y. It says nothing about causation. Hours studied and exam scores may be correlated, but the regression line does not prove that studying causes better scores. A third variable (prior knowledge, motivation, class difficulty) could drive both.

Multiple regression

The formula above covers simple linear regression with one predictor variable. Real datasets often require multiple predictors. Multiple regression extends the same least-squares logic to:

\[\hat{y} = b_0 + b_1 x_1 + b_2 x_2 + \cdots + b_k x_k\]

Each coefficient represents the estimated change in y for a one-unit increase in that predictor, holding all other predictors constant. The arithmetic becomes matrix algebra and is not practical to do by hand with more than two predictors, but the interpretation remains the same.
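
In matrix notation, with X the design matrix (a column of ones followed by the predictor columns), the least-squares coefficients are given by the standard normal-equations solution:

\[\mathbf{b} = (X^\top X)^{-1} X^\top \mathbf{y}\]

For one predictor this reduces to the slope and intercept formulas at the top of this article.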

Doing it faster

The five sums take time when n is large. The linear regression calculator handles this instantly: paste or enter your data points, and it returns the slope, intercept, R², and a predicted value for any x you enter. For a full dataset with 20 or 50 points, the manual approach described here takes 10 to 15 minutes; the calculator takes seconds and eliminates arithmetic errors in the intermediate sums.

Understanding the manual steps matters because they show exactly what the line is optimizing. It is not magic: it is five sums, two formulas, and a measure of fit.