Introduction to Multiple Linear Regression
In machine learning, linear regression is a supervised learning algorithm used to predict a continuous output with a constant slope. This means that the values predicted by linear regression lie on a continuous scale; the model won't try to classify them into categories.
Linear regression is classified into two types:
- Simple Linear Regression: Only one independent variable is present.
- Multiple Linear Regression: Multiple independent variables are present.
In most applications, more than one feature is used to predict the dependent variable, so in this article we will cover multiple linear regression and see its implementation in Python.
Explanation for Multiple Linear Regression
- Multiple Linear Regression is an extension of the simple linear regression model in that multiple independent variables (features) are used to predict the dependent variable.
- Suppose an analyst wants to predict the price of a house. A simple linear model would use the area of the land as the independent variable and the price as the dependent variable. But in reality, more features affect the price of a house: the number of floors, the number of bedrooms, the age of the house, and so on. To model a relationship in which multiple independent variables are involved, we use multiple linear regression.
- MLR is used to determine a mathematical relationship between the multiple independent variables and the dependent variable. Once the coefficient for each independent variable is determined, that information can be used to accurately predict the outcome. The relationship created by the model is in the form of a linear (straight-line) equation that best approximates all the individual data points.
Now that we have an idea of multiple linear regression, let us understand the math behind it.
Equation in Multiple Linear Regression
The equations used in multiple linear regression are given below:
1. Hypothesis
The hypothesis, or model, of multiple linear regression is given by the equation:

h_θ(x) = θ₀ + θ₁x₁ + θ₂x₂ + … + θₙxₙ

Where,
- xᵢ is the iᵗʰ feature (independent variable).
- θᵢ is the weight or coefficient of the iᵗʰ feature.
This linear equation is used to approximate all the individual data points. If there were only one feature, this equation would describe a straight line.
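As a quick numeric illustration (all values below are hypothetical), the hypothesis is just a dot product of the weight vector with the feature vector:

```python
import numpy as np

# Hypothetical weights [theta0, theta1, theta2] and one example's features
theta = np.array([1.0, 0.5, 2.0])
x = np.array([1.0, 3.0, 4.0])  # x0 = 1 accounts for the intercept theta0

h = theta @ x  # theta0*1 + theta1*3 + theta2*4 = 1 + 1.5 + 8 = 10.5
print(h)
```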
2. Cost Function
Our objective is to find the optimal weights of the hypothesis so that the difference between the computed value and the actual value is minimal. This difference between the actual and computed value is termed the cost. The cost helps in determining how well the model fits overall.
So, the cost of multiple linear regression is given by the following equation:

J(θ) = (1/2m) · Σᵢ₌₁ᵐ (h_θ(x⁽ⁱ⁾) − y⁽ⁱ⁾)²

Where,
- J(θ): Cost function
- m: Number of examples in the training dataset
- h_θ: Hypothesis
- θ: Weights
- y: Labels
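A minimal NumPy sketch of this cost function, assuming X already includes a leading column of ones for the intercept term:

```python
import numpy as np

def compute_cost(X, y, theta):
    """J(theta) = (1 / 2m) * sum((h_theta(x) - y)^2)."""
    m = len(y)
    error = X @ theta - y  # h_theta(x^(i)) - y^(i) for every example
    return (error @ error) / (2 * m)
```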
3. Gradient Descent
Next is gradient descent, which is needed to update the weight values. Gradient descent makes use of the partial derivatives of the cost function.
Gradient descent for multiple linear regression is given by:

θⱼ := θⱼ − (α/m) · Σᵢ₌₁ᵐ (h_θ(x⁽ⁱ⁾) − y⁽ⁱ⁾) · xⱼ⁽ⁱ⁾, updated simultaneously for every j

Where,
- α is the learning rate.
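A minimal sketch of batch gradient descent implementing this update rule (the learning rate and iteration count here are arbitrary choices):

```python
import numpy as np

def gradient_descent(X, y, alpha=0.01, iterations=1000):
    """Fit theta for h_theta(x) = X @ theta by repeating the update rule above."""
    m, n = X.shape
    X = np.hstack([np.ones((m, 1)), X])       # prepend x0 = 1 for the intercept
    theta = np.zeros(n + 1)
    for _ in range(iterations):
        error = X @ theta - y                 # h_theta(x^(i)) - y^(i)
        theta -= (alpha / m) * (X.T @ error)  # simultaneous update of all theta_j
    return theta
```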
4. Normalization
Normalization is used to speed up the convergence of gradient descent. To do so, we need to ensure that all the features are within a similar range. This is important for datasets whose attributes have high standard deviations or very different ranges. So we first bring the features into roughly the range of -1 to 1.
To perform normalization, we can use feature scaling and mean normalization. It is given by the equation:

xᵢ := (xᵢ − μᵢ) / sᵢ

where μᵢ is the mean of the iᵗʰ feature and sᵢ is its range (max − min) or standard deviation.
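A minimal NumPy sketch of mean normalization with range scaling (the feature values are hypothetical):

```python
import numpy as np

# Hypothetical features: area (sq ft) and number of bedrooms
X = np.array([[2104.0, 3.0],
              [1600.0, 3.0],
              [2400.0, 4.0]])

# Subtract each feature's mean and divide by its range
X_norm = (X - X.mean(axis=0)) / (X.max(axis=0) - X.min(axis=0))
print(X_norm)  # every column now lies roughly within [-1, 1]
```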
That's it; cost computation, gradient descent, and normalization are the three major steps we perform to fit the parameters in multiple linear regression.
Implementing Multiple Linear Regression
We will now see the implementation of MLR using the scikit-learn module of Python. Since it contains optimized implementations of most models, including linear regression, it is recommended to use them instead of creating your own. To implement the code, I have used Kaggle's workspace.
The steps involved in the implementation are:
- Import the Libraries
- Import Dataset
- Data Pre-Processing
- Splitting Data Into Test and Training Set
- Training Model
- Evaluating Model
1. Import Libraries
- To implement the multiple linear regression model, we will take the help of the scikit-learn module, as it comes prepackaged with some sample datasets and useful functions. The imports are sketched below.
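A sketch of the imports used in the rest of the walkthrough. Note that load_boston was deprecated and removed in scikit-learn 1.2, so this assumes an older version (1.1 or earlier):

```python
import numpy as np
import pandas as pd

# load_boston was removed in scikit-learn 1.2; this assumes scikit-learn <= 1.1
from sklearn.datasets import load_boston
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
```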
2. Import Dataset
- The next step is to import the dataset. Since scikit-learn already ships with sample datasets, we'll be using the Boston Housing dataset. This dataset consists of information about homes in Boston, and its features are used to predict the price of a house.
Let's create an instance of the dataset and see what it contains:
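A sketch of loading the dataset and listing its keys (the exact set of keys may vary by scikit-learn version):

```python
# Load the Boston Housing dataset bundled with older scikit-learn versions
boston = load_boston()
print(boston.keys())  # e.g. dict_keys(['data', 'target', 'feature_names', 'DESCR', ...])
```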
The dataset contains the following information:
- data: The actual information about the houses.
- target: The price of each house.
- feature_names: The names of the features that the data columns contain.
- DESCR: A full description of the dataset and its features.
3. Data Pre-Processing
- Now we will create a DataFrame object of the data, keeping the feature names as the header, using the pandas library.
- You can check the loaded data using the head() method of pandas.
- We want to predict the price of the house, but our current DataFrame does not contain that information. So we add a new column and store the target values from the dataset in it.
- Normally we would perform multiple data pre-processing techniques, but that is for another article. I will directly take the features with the strongest correlation to the price, i.e. 'LSTAT' (% lower status of the population) and 'RM' (average number of rooms). So, we will use these two features to perform the linear regression.
- The X variable will contain the independent values, while the y variable will store the target values, as sketched below.
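A sketch of these pre-processing steps, assuming the boston object loaded earlier (the column name PRICE is my choice):

```python
# Build a DataFrame with the feature names as the column headers
df = pd.DataFrame(boston.data, columns=boston.feature_names)
print(df.head())

# Store the target (house price) in a new column; 'PRICE' is an arbitrary name
df['PRICE'] = boston.target

# Keep only the two features with the strongest correlation to the price
X = df[['LSTAT', 'RM']]
y = df['PRICE']
```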
4. Splitting Data Into Test and Training Set
- Now we will split the data into training and test sets. Let's keep 80% of the data for training and the remaining 20% for testing.
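A sketch of the split (random_state is an arbitrary seed, fixed here so the split is reproducible):

```python
# Hold out 20% of the data for testing
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
```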
5. Training Model
- Now we need to create an instance of the model using scikit-learn. We do so by simply calling LinearRegression().
- Once the instance is created we can fit the model on the training data.
- That's it; our model has been fitted. Now we can use it to make predictions by calling the predict() method. We will run the prediction on the test data, as shown below.
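A sketch of fitting the model and predicting on the held-out test set:

```python
# Create the model instance and fit it on the training data
model = LinearRegression()
model.fit(X_train, y_train)

# Predict prices for the unseen test data
y_pred = model.predict(X_test)
```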
6. Evaluating Model
- Using the RMSE and R² metrics, we will compare the predictions against the actual values. RMSE and R² are two of the most popular metrics used for evaluating regression tasks.
- The lower the value of RMSE, the better the fit; and the closer the value of R² is to 1, the better the model fits.
- For our model, we got 5.66 for RMSE and 0.67 for the R² score, as computed in the sketch below.
- You can consider trying more features to improve the accuracy of the model.
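A sketch of the evaluation step; your exact numbers will depend on the split and library version (the article reports roughly 5.66 and 0.67):

```python
# Root mean squared error and R^2 score against the held-out labels
rmse = np.sqrt(mean_squared_error(y_test, y_pred))
r2 = r2_score(y_test, y_pred)

print(f"RMSE: {rmse:.2f}")  # article reports ~5.66
print(f"R^2:  {r2:.2f}")    # article reports ~0.67
```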
Conclusion
In this article, we covered multiple linear regression, along with its math and an actual implementation in Python. Since you have completed this article, I recommend working on a different dataset and trying to increase the accuracy.