
Understanding R-Square (R²)

Here you will be introduced to the concept of R-Square (R²).

R-Square (R²) is a statistical measure representing the proportion of variance in the dependent variable that is predictable from the independent variable(s).

What Does It Mean to Explain Variability in the Data?

Explaining variability in the data means understanding how much of the variation (the changes or differences) in one variable (the dependent variable) can be accounted for by changes in another variable (the independent variable). This concept is crucial in statistical modelling and regression analysis.

R-Square (R²): A Measure of Explained Variability

R-Square (R²) is a statistical measure that tells us how well the independent variables in a model explain the variability of the dependent variable. It ranges from 0 to 1, where:

  • 0 means the model explains none of the variability.
  • 1 means the model explains all the variability.

For example, if R² = 0.70, it means that 70% of the variability in the dependent variable is explained by the independent variables in the model. The remaining 30% is due to factors not included in the model or random chance.
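To make the proportions concrete with made-up numbers: if the total variation in the dependent variable amounts to 100 squared units and 30 of those remain unexplained by the model, then

\[ R^2 = 1 - \frac{30}{100} = 0.70 \]

which is exactly the 70% / 30% split described above. (The next section defines these quantities precisely.)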

How R-Square is Calculated (Conceptually)

Here we go over how R² works conceptually – remember, the computer will spit out the results for you. You do not have to calculate anything by hand – just understand what the numbers mean.

  1. Total Variability (SST): This is the total variation in the dependent variable. It’s calculated by summing the squared differences between each actual value and the mean of the dependent variable.
  2. Explained Variability (SSR): This is the part of the total variability that the model explains. It’s calculated by summing the squared differences between each predicted value (from the model) and the mean of the dependent variable.
  3. Unexplained Variability (SSE): This is the part of the total variability that the model does not explain. It’s calculated by summing the squared differences between each actual value and its predicted value.

For a linear regression fitted by least squares, these three quantities are related by SST = SSR + SSE (see the Python sketch after the formula below for the calculation on a small dataset).

The formula for R² is:

\[ R^2 = 1 - \frac{SSE}{SST} \]

Where:
  • SSE is the sum of squared errors (unexplained variability).
  • SST is the total sum of squares (total variability).
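As a concrete illustration, here is a minimal Python sketch (the numbers are made up for illustration) that fits a simple least-squares line with numpy and computes SST, SSR, SSE, and R² exactly as defined above:

```python
import numpy as np

# Hypothetical data: hours studied (x) and test scores (y).
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([52.0, 55.0, 61.0, 64.0, 70.0, 71.0])

# Fit a straight line by ordinary least squares and get predicted values.
slope, intercept = np.polyfit(x, y, 1)
y_pred = slope * x + intercept

sst = np.sum((y - y.mean()) ** 2)        # total variability
ssr = np.sum((y_pred - y.mean()) ** 2)   # explained variability
sse = np.sum((y - y_pred) ** 2)          # unexplained variability

r_squared = 1 - sse / sst
print(f"SST = {sst:.2f}, SSR = {ssr:.2f}, SSE = {sse:.2f}")
print(f"R² = {r_squared:.4f}")
```

Because the line is fitted by ordinary least squares with an intercept, SSR + SSE equals SST, so the same R² could equally be computed as SSR / SST.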

Adjusted R-Square: A More Accurate Measure

Adjusted R-Square modifies the R² value to account for the number of predictors in the model. It provides a more accurate measure of how well the model explains the variability, especially when multiple predictors are involved.
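For reference, the standard formula for the adjustment, where n is the number of observations and p is the number of predictors, is:

\[ R^2_{adj} = 1 - (1 - R^2)\,\frac{n - 1}{n - p - 1} \]

Because the fraction grows as p grows, each additional predictor must raise R² by enough to offset the penalty, or the adjusted value will fall.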

Why Use Adjusted R-Square?

  • Penalises for Adding Irrelevant Predictors: Adjusted R² decreases when a new predictor improves the model by less than would be expected by chance (demonstrated in the sketch after this list).
  • Better for Model Comparison: It helps compare models with different numbers of predictors, rewarding only variables that genuinely improve the model.
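To see the penalty in action, here is a short numpy sketch (simulated data, so the exact numbers will vary with the seed) that adds a pure-noise predictor to a simple model: R² creeps up, as it always does, while adjusted R² typically falls:

```python
import numpy as np

rng = np.random.default_rng(42)
n = 30
x1 = rng.uniform(0, 10, n)
y = 2.0 * x1 + rng.normal(0, 2, n)   # y genuinely depends on x1 only
noise = rng.normal(0, 1, n)          # an irrelevant predictor

def r2_and_adjusted(predictors, y):
    """Fit OLS with an intercept; return R² and adjusted R²."""
    X = np.column_stack([np.ones(len(y))] + list(predictors))
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    r2 = 1 - np.sum(resid ** 2) / np.sum((y - y.mean()) ** 2)
    p = len(predictors)
    adj = 1 - (1 - r2) * (len(y) - 1) / (len(y) - p - 1)
    return r2, adj

print("x1 only:    R² = %.4f, adjusted R² = %.4f" % r2_and_adjusted([x1], y))
print("x1 + noise: R² = %.4f, adjusted R² = %.4f" % r2_and_adjusted([x1, noise], y))
```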

Practical Example

Imagine you’re predicting students’ test scores based on hours studied, attendance, and sleep hours. If your model has an R² of 0.80, it means 80% of the differences in test scores can be explained by these three factors. If the Adjusted R² is 0.78, the small drop suggests that, while the model is good, one or more of the predictors may be adding little beyond what chance would produce. The sketch below shows how these two values are obtained in practice.
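A minimal sketch of this scenario, assuming the statsmodels package is available (the data are simulated, so your numbers will differ):

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
n = 50

# Hypothetical predictors: hours studied, attendance rate, and sleep hours.
hours = rng.uniform(0, 10, n)
attendance = rng.uniform(0.5, 1.0, n)
sleep = rng.uniform(5, 9, n)

# Simulated test scores driven by the three predictors plus random noise.
scores = 40 + 3 * hours + 15 * attendance + 0.5 * sleep + rng.normal(0, 3, n)

X = sm.add_constant(np.column_stack([hours, attendance, sleep]))
model = sm.OLS(scores, X).fit()

print(f"R²          = {model.rsquared:.3f}")
print(f"Adjusted R² = {model.rsquared_adj:.3f}")
```

In practice you would compare these two numbers just as described above: a large gap between them is a hint that some predictors are pulling little weight.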

Key Points:

  1. Range:
    • R² ranges from 0 to 1 (or 0% to 100%).
    • 0 means the model explains none of the variability in the data.
    • 1 means it explains all the variability.
  2. Interpretation:
    • If R² = 0.65, your model explains 65% of the variation in your data.
    • The remaining 35% is due to factors not included in your model or random chance.
  3. Visual Representation:
    • Imagine a scatter plot. R² tells you how closely your data points cluster around your regression line.
    • A higher R² means a tighter cluster around the line.
  4. Use Cases:
    • Comparing different models for the same data.
    • Assessing how well your independent variables predict the dependent variable.
  5. Limitations:
    • R² never decreases when you add a variable, even one that isn’t meaningful.
    • It doesn’t indicate whether the coefficients are biased.
    • A high R² doesn’t necessarily mean your model is good; it may simply fit this particular dataset well. Hence, we use adjusted R² whenever the model has more than one independent variable, as the noise-predictor sketch earlier demonstrates.

Keep the following in mind:

  • If Adjusted R² decreases when you add a variable, that variable might not be worth including.
  • Always report both R² and Adjusted R²; together they give readers a fuller picture of your model’s performance.

Limitations:

  • Neither R² nor Adjusted R² tells you whether your model is biased or whether you’ve chosen the right regression method.
  • They don’t indicate whether your predictions are reliable.

Context Matters:

  • In some fields, an R² of 0.5 might be considered high.
  • In others, you might need 0.9 or higher for a “good” model.
  • Always interpret these values in the context of your specific field and research question.

Remember, while R² and Adjusted R² are valuable tools, they’re just one part of the model evaluation process. Always consider other diagnostic measures and the practical significance of your results when interpreting your model’s effectiveness.