Understanding R-Squared as a Measure of Model Fit

Tags: dev, ai, datascience, tutorial

Published at: 20/08/2025

What is R-Squared?

R-squared ($R^2$) is a metric that measures how much of the variance in the actual data is captured by a model’s predictions. It’s most commonly used to evaluate linear regression models, but it can also be applied to non-linear models.

The formula is:

$$R^2 = 1 - \frac{SS_{res}}{SS_{total}}$$

Where:

  • $SS_{res} = \sum_i (y_i - \hat{y}_i)^2$ : Sum of squared residuals (errors between actual and predicted values)
  • $SS_{total} = \sum_i (y_i - \bar{y})^2$ : Total sum of squares (errors between actual values and their mean)
  • $y_i$ : Actual value at index $i$
  • $\hat{y}_i$ : Predicted value at index $i$
  • $\bar{y}$ : Mean of actual values
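
To make the formula concrete, here is a minimal sketch of computing $R^2$ directly from the definitions above (using NumPy; the helper name `r_squared` is just an illustrative choice):

```python
import numpy as np

def r_squared(y_true, y_pred):
    """Compute R^2 = 1 - SS_res / SS_total from the definitions above."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)              # sum of squared residuals
    ss_total = np.sum((y_true - np.mean(y_true)) ** 2)   # total sum of squares
    return 1 - ss_res / ss_total

# Predictions close to the actual values give an R^2 near 1
print(r_squared([3.0, 5.0, 7.0, 9.0], [2.8, 5.1, 7.2, 8.9]))
```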

R-Squared Value Space and Interpretations

  • $R^2 = 1$ : The model perfectly explains all variability in the data ($SS_{res} = 0$).
  • $R^2 = 0$ : The model explains none of the variability; predictions are no better than predicting the mean.
  • $R^2 < 0$ : The model performs worse than simply predicting the mean.
  • $0 < R^2 < 1$ : The model explains part of the variance; the higher, the better.
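
As a quick illustration of these regimes, the sketch below scores a perfect prediction, a mean-only prediction, and a clearly wrong constant prediction (using scikit-learn’s `r2_score`, which follows the same formula; the numbers are arbitrary):

```python
import numpy as np
from sklearn.metrics import r2_score

y_true = np.array([3.0, 5.0, 7.0, 9.0])

print(r2_score(y_true, y_true))                                # 1.0: perfect fit
print(r2_score(y_true, np.full_like(y_true, y_true.mean())))   # 0.0: same as predicting the mean
print(r2_score(y_true, np.full_like(y_true, 100.0)))           # negative: worse than the mean
```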

Limitations

Although $R^2$ is widely used, it can be misleading. A high $R^2$ does not always translate into a good model. Some pitfalls include:

1. Overfitting

A model can achieve a very high $R^2$ by overfitting the training data. This means it captures noise rather than general patterns, and will likely perform poorly on unseen data.
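
A small sketch of that effect (the data, polynomial degree, and split below are arbitrary choices for illustration): a high-degree polynomial can score a near-perfect $R^2$ on its training points while scoring far worse on held-out points.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

rng = np.random.default_rng(0)
X = rng.uniform(0, 1, size=(30, 1))
y = np.sin(2 * np.pi * X).ravel() + rng.normal(0, 0.3, size=30)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.5, random_state=0)

# A degree-12 polynomial has enough flexibility to chase the noise in 15 training points
model = make_pipeline(PolynomialFeatures(degree=12), LinearRegression())
model.fit(X_train, y_train)

print("train R^2:", r2_score(y_train, model.predict(X_train)))  # close to 1
print("test  R^2:", r2_score(y_test, model.predict(X_test)))    # typically much lower
```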

2. Coincidental Fits

Sometimes a model, especially a linear one applied to non-linear data, may still produce a high $R^2$ by coincidence. Similarly, including irrelevant but correlated features can inflate $R^2$ without improving real predictive power.
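
For example, a straight line fitted to data that is actually quadratic can still score a high $R^2$ when the sampled range is narrow (the setup below is just an illustrative choice):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

# The true relationship is quadratic, but over the narrow range [1, 3]
# a straight line tracks it closely, so R^2 still looks excellent.
x = np.linspace(1, 3, 50).reshape(-1, 1)
y = x.ravel() ** 2

line = LinearRegression().fit(x, y)
print(r2_score(y, line.predict(x)))  # around 0.98 despite the wrong functional form
```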

3. Increasing Complexity

$R^2$ always increases, or at least stays the same, when you add more features, even if those features are irrelevant. This happens because there is more flexibility for optimization, allowing residuals to shrink.

This can feel abstract, so let’s look at it intuitively and mathematically.

Intuitive Justification

Adding more features increases the dimensionality of the model, giving it more flexibility to fit the training data. This flexibility usually lowers the residuals, thereby increasing $R^2$.

Mathematical Justification

Here’s one way to visualize it:

  • Let $A$ be a subset of a larger set $B$.
  • If $x_A \in A$ attains $\min_{x \in A} f(x)$, then because $A \subseteq B$, $\min_{x \in B} f(x)$ cannot be worse; it is either the same or better.

[Figure: Mathematical justification illustrated]

This extends naturally to regression with feature sets:

  • $X_1$ : dataset with fewer features, with coefficient vector $W_1$
  • $X_2$ : dataset containing $X_1$ plus extra features, with coefficient vector $W_2$
  • Predictions: $\hat{y}_1 = X_1 W_1$, $\hat{y}_2 = X_2 W_2$

We can always construct $W_2$ so that it reproduces $\hat{y}_1$ by setting the coefficients of the extra features to zero:

$$W_2 = \begin{bmatrix} W_1 \\ 0 \\ 0 \end{bmatrix}$$

Thus:

  • If the added features are useless, the residual sum of squares stays the same ($SS_{res_2} = SS_{res_1}$).
  • If the added features do help, the residuals shrink ($SS_{res_2} < SS_{res_1}$).
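
Written as a single chain of (in)equalities, with $\lVert \cdot \rVert^2$ denoting the sum of squares, the least-squares fit on the larger feature set can never do worse:

$$SS_{res_2} = \min_{W} \lVert y - X_2 W \rVert^2 \;\le\; \left\lVert y - X_2 \begin{bmatrix} W_1 \\ 0 \\ 0 \end{bmatrix} \right\rVert^2 = \lVert y - X_1 W_1 \rVert^2 = SS_{res_1}$$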

Either way, $R^2$ never decreases when adding features.
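
A numerical sketch of this (the data-generating setup below is arbitrary, chosen only to illustrate the point): appending pure-noise features to a linear regression and refitting never lowers the training $R^2$.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

rng = np.random.default_rng(42)
n = 200
X = rng.normal(size=(n, 2))                        # two genuinely useful features
y = 3 * X[:, 0] - 2 * X[:, 1] + rng.normal(0, 1.0, size=n)

for _ in range(5):
    model = LinearRegression().fit(X, y)
    print(f"{X.shape[1]} features -> training R^2 = {r2_score(y, model.predict(X)):.4f}")
    X = np.hstack([X, rng.normal(size=(n, 1))])    # append one pure-noise feature and refit
```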

Conclusion

$R^2$ is a helpful metric for understanding how well a model fits training data, but it must be interpreted with caution. A high $R^2$ can be the result of overfitting, coincidental correlations, or simply adding more features. For a more reliable assessment, $R^2$ should be used alongside other metrics and validation techniques. I’ll be covering these complementary approaches in upcoming articles, so stay tuned!
