Updated July 6, 2023
Introduction to R Squared Regression
R Squared is a statistical measure defined as the proportion of the variance in the dependent variable that can be explained by the independent variables.
In other words, in a regression model, the R squared value measures the goodness of fit, that is, how well the data fit the model. It is also known as the coefficient of determination. Its value ranges from 0% to 100%.
What is R Squared?
To understand R squared analysis, consider n points in two-dimensional space, say (x1, y1), (x2, y2), …, (xn, yn). We plot them and fit a line y = mx + b through them that minimizes the squared error.
You can calculate the total squared error between the points and the line using the formula below. We will consider errors along the y-axis.
e1 = (y1 - (mx1 + b))²
e2 = (y2 - (mx2 + b))²
…
en = (yn - (mxn + b))²
Let the total squared error of the line be SEline.
SEline = (y1 - (mx1 + b))² + (y2 - (mx2 + b))² + … + (yn - (mxn + b))²
Now, to know how well the line fits the data points, we ask: how much (or what percentage) of the variation in y is described by the variation in x?
The total variation in y is the squared error of y around its mean ȳ, which we denote SEY:
SEY = (y1 - ȳ)² + (y2 - ȳ)² + … + (yn - ȳ)²
(so that Variance(y) = SEY / n).
The part of this total variation that the regression line does not describe is SEline, so the line fails to describe the fraction SEline / SEY (as a percentage) of the variation.
So, for example, if this fraction turns out to be 25%, that is, the line does not describe 25% of the variation, then the variance explained by the line is 1 - 0.25 = 0.75, or 75%. The formula below gives the share of the total variance described by the regression line.
(1 – SEline / SEY) %
This indicates the percentage of the total variation that the variation in x can explain. This is called the coefficient of determination or R squared.
Coefficient of Determination = R² = 1 - SEline / SEY
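As a concrete illustration, the short sketch below computes R squared directly from the SEline / SEY definition for a small set of made-up points (the data values, and the use of np.polyfit to fit the line, are assumptions for illustration only).
import numpy as np
# Made-up sample points for illustration only
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])
# Fit y = mx + b by least squares
m, b = np.polyfit(x, y, 1)
y_line = m * x + b
se_line = np.sum((y - y_line) ** 2)  # squared error of the line
se_y = np.sum((y - y.mean()) ** 2)   # total squared variation of y about its mean
r_squared = 1 - se_line / se_y
print(r_squared)  # close to 1, since the points lie almost on a straight line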
Example to Implement R Squared Regression
Let us consider an example using Python. The scikit-learn (sklearn) library provides the r2_score metric, and for the linear regression model we will use LinearRegression from sklearn. We will use the matplotlib library to plot the regression graph and the NumPy library to reshape the input array.
1. Let us install the sklearn package and import the necessary packages.
pip install scikit-learn
from sklearn.metrics import r2_score
from sklearn.linear_model import LinearRegression
from matplotlib import pyplot as plt
import numpy as np
2. Then, we need to download the wage2.csv dataset, which is available here: http://murraylax.org/datasets/wage2.csv
The fields in the dataset can be inspected with wage.columns.
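The later snippets assume the CSV has been loaded into a pandas DataFrame named wage; a minimal way to do that (assuming pandas is installed) is:
import pandas as pd
# Load the downloaded dataset into a DataFrame named wage; read_csv also accepts the URL directly
wage = pd.read_csv("wage2.csv")
print(wage.columns)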
3. We will take YearsEdu as x (the independent variable) and MonthlyEarnings as y (the dependent variable). We will scatter plot all the points to visualize the distribution.
y = wage["MonthlyEarnings"]
x = wage["YearsEdu"]
plt.scatter(x, y)
Output: a scatter plot of YearsEdu (x-axis) against MonthlyEarnings (y-axis).
4. Now, we will train the linear regression model on our data.
X = np.array(x).reshape(-1, 1)
regModel = LinearRegression()
regModel.fit(X, y)
y_pred = regModel.predict(X)
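After fitting, the slope and intercept of the estimated line can be inspected through the model's coef_ and intercept_ attributes (standard scikit-learn attributes); for example:
print("slope:", regModel.coef_[0])       # change in MonthlyEarnings per extra year of education
print("intercept:", regModel.intercept_)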
5. Now, to view the best-fitting line over our data, we will execute the following script.
plt.scatter(x,y)
plt.plot(X, y_pred, color='red')
Output: the same scatter plot with the fitted regression line drawn in red.
6. To compute the R-squared value of the line, the following function can be used.
r2_score(y, y_pred)
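As a quick sanity check, the same value can also be obtained from the fitted model itself, since LinearRegression's score method returns the coefficient of determination:
print(r2_score(y, y_pred))   # R squared computed from the predictions
print(regModel.score(X, y))  # the same value, computed by the model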
Interpretation
- If the squared error of the line is very small, the line is a good fit. A small SEline makes the fraction SEline / SEY small, and subtracting it from one gives a value close to one. Thus, when the squared error is small, R squared (the coefficient of determination) approaches one, indicating that the line fits well.
- Conversely, if the squared error of the line is large, meaning there is a lot of error between the data points and the line, SEline is large and the fraction SEline / SEY grows. R squared (the coefficient of determination) is then small, indicating a poorly fitting regression line, as the short sketch below illustrates.
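A minimal sketch of this behaviour on made-up synthetic data (the noise levels and random seed are arbitrary assumptions, not part of the article):
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
rng = np.random.default_rng(0)
x = np.linspace(0, 10, 50).reshape(-1, 1)
for noise in (0.5, 5.0):  # small vs. large scatter around the true line y = 3x + 2
    y = 3 * x.ravel() + 2 + rng.normal(scale=noise, size=x.shape[0])
    model = LinearRegression().fit(x, y)
    print(f"noise={noise}: R squared = {r2_score(y, model.predict(x)):.3f}")
# The low-noise data gives an R squared near 1; the noisy data gives a much smaller value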
Limitation
R squared only tells us how the movements of the dependent variable relate to those of the independent variables. It is not, by itself, a criterion for judging whether the chosen model is good, and it gives no information about possible bias in the data or the predictions. It is possible to obtain a high R squared value for an inadequate model and a low value for a well-fitted model. Thus, R squared alone does not establish the reliability of the model.
Conclusion
The R squared value alone does not verify that a trained linear regression model fits well. It does, however, tell us what proportion of the variance of the dependent variable is explained by the variance of the independent variables.
Recommended Articles
This is a guide to R Squared Regression. Here we discuss R squared analysis and its limitations, interpretation, and example. You can also go through our other related articles to learn more –