Multiple Linear Regression
Multiple Linear Regression, or simply Multiple Regression, models the relationship between two or more independent variables and a single dependent variable.
Y = b0 + b1X1 + b2X2 + … + bpXp
Y is the predicted value of the dependent variable and X1 to Xp are the independent variables. b0 is the intercept and b1, …, bp are the estimated coefficients, one per independent variable.
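To make the equation concrete, here is a minimal sketch that evaluates it for a single observation. The intercept, coefficients and feature values are made-up numbers chosen for illustration, and predict is a hypothetical helper, not part of any library.

```python
def predict(b0, coefs, features):
    """Return b0 + b1*X1 + ... + bp*Xp for one observation."""
    if len(coefs) != len(features):
        raise ValueError("need exactly one coefficient per feature")
    return b0 + sum(b * x for b, x in zip(coefs, features))


# Hypothetical values: intercept b0, two coefficients, two feature values
b0 = 50.0
coefs = [10.0, 5.0]
features = [2.0, 4.0]

y = predict(b0, coefs, features)
print(y)  # 50 + 10*2 + 5*4 = 90.0
```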
Using Multiple Linear Regression, let’s predict the CO2 emissions of engines.
Let’s import the required Python libraries.
import matplotlib.pyplot as plt
import pandas as pd
import pylab as pl
import numpy as np
%matplotlib inline
Let’s read and view the data
df = pd.read_csv("../input/FuelConsumptionCo2.csv")
# take a look at the dataset
df.head()
The dataset above has many columns that are not relevant to our prediction, so we will create a new data set that contains only the columns we need.
Let’s select the features that we want to use for regression, together with the target column CO2EMISSIONS.
cdf = df[['ENGINESIZE','CYLINDERS','FUELCONSUMPTION_CITY','FUELCONSUMPTION_HWY','FUELCONSUMPTION_COMB','CO2EMISSIONS']]
cdf.head(9)
Let’s check the data type of each column.
cdf.dtypes
# You can also count the missing values in each column with the syntax below
cdf.isnull().sum()
Let’s get the graph of each of the columns in cdf
cdf.hist(bins=50,figsize=(15,12))
plt.show()
Time to split the data into training and test sets. This is the most popular approach in machine learning: we first train the model on one part of the data (the training set) and then evaluate its predictions on the remaining part (the test set), which the model has not seen during training. Testing on unseen data gives a more realistic estimate of how the model will perform on real-world problems.
In reality, there are multiple factors that can affect CO2 emission. Predicting one variable from two or more independent variables is called multiple regression.
Let’s train our model.
from sklearn import linear_model
msk = np.random.rand(len(df)) < 0.8
train = cdf[msk]
test = cdf[~msk]
regr = linear_model.LinearRegression()
x = np.asanyarray(train[['ENGINESIZE','CYLINDERS','FUELCONSUMPTION_COMB']])
y = np.asanyarray(train[['CO2EMISSIONS']])
regr.fit(x, y)
# The coefficients
print('Coefficients: ', regr.coef_)
Out:
Coefficients: [[10.49781812 7.75062008 9.58528905]]
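np.random.rand(len(df)) returns an array of length len(df) filled with random floats drawn uniformly from the range [0, 1). The < 0.8 comparison is applied element-wise and returns a new boolean array, which we store in msk: values below 0.8 become True and values of 0.8 or more become False. About 80% of the rows therefore go to the training set (cdf[msk]) and the remaining 20% or so to the test set (cdf[~msk]).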
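The masking step can be tried in isolation. The sketch below is only an illustration: it uses a plain NumPy array as a stand-in for the DataFrame, and the row count and the seed are arbitrary assumptions (the original code sets no seed, so its split changes on every run).

```python
import numpy as np

np.random.seed(0)  # fixed seed (an arbitrary choice) so the split is reproducible

n_rows = 1000  # stand-in for len(df)
msk = np.random.rand(n_rows) < 0.8  # boolean array: True for roughly 80% of rows

data = np.arange(n_rows)  # stand-in for the rows of cdf
train = data[msk]         # rows where the mask is True
test = data[~msk]         # ~ inverts the mask, selecting the remaining rows

print(msk.dtype)               # bool
print(len(train) + len(test))  # 1000: every row lands in exactly one set
print(round(msk.mean(), 2))    # fraction of rows in the training set, close to 0.8
```

Because each row is assigned independently, the split is only approximately 80/20.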
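The coefficients and the intercept are the parameters of the fitted model. With a single feature they describe a line; with several features, as here, they describe a hyperplane. The intercept is available as regr.intercept_.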
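To illustrate what fitting estimates, the sketch below recovers an intercept and two coefficients from noise-free synthetic data with NumPy's least-squares solver. This is an illustration under stated assumptions, not the method used above: the data and the "true" parameters are invented, and np.linalg.lstsq stands in for scikit-learn's LinearRegression, which solves the same ordinary least squares problem.

```python
import numpy as np

rng = np.random.default_rng(42)

# Invented "true" parameters for the synthetic data
true_intercept = 5.0
true_coefs = np.array([2.0, -3.0])

X = rng.uniform(0, 10, size=(50, 2))  # 50 observations, 2 features
y = true_intercept + X @ true_coefs   # noise-free targets

# Prepend a column of ones so the solver also estimates the intercept b0
X_design = np.column_stack([np.ones(len(X)), X])
params, *_ = np.linalg.lstsq(X_design, y, rcond=None)

intercept, coefs = params[0], params[1:]
print(np.round(intercept, 3), np.round(coefs, 3))
```

With real, noisy data the estimates would only approximate the underlying relationship rather than match it exactly.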
Prediction
Let’s test our model on the data
x = np.asanyarray(test[['ENGINESIZE','CYLINDERS','FUELCONSUMPTION_COMB']])
y = np.asanyarray(test[['CO2EMISSIONS']])
y_hat = regr.predict(x)
# np.mean gives the mean of the squared residuals; np.sum would give the RSS itself
print("Residual sum of squares: %.2f"
      % np.mean((y_hat - y) ** 2))
# regr.score returns R^2 (coefficient of determination): 1 is perfect prediction
print('Variance score: %.2f' % regr.score(x, y))
Out:
Residual sum of squares: 545.00
Variance score: 0.88
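In the above code, we evaluate the model on the test set, using the same three features that were used for training. The first metric is related to the residual sum of squares (RSS), also known as the sum of squared residuals (SSR) or the sum of squared estimate of errors (SSE).
RSS is the sum of the squared differences between the actual Y and the predicted Y. The smaller the RSS, the better the fit; a perfect fit has an RSS of 0. Note that the code uses np.mean rather than np.sum, so the value it prints is the RSS divided by the number of test rows, that is, the mean squared error (MSE). RSS is also used to compute R2: R2 = 1 − RSS/TSS, where TSS is the total sum of squares of Y around its mean.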
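The relationship between RSS, its mean (the mean squared error) and R2 can be checked by hand. The following is a minimal sketch on made-up numbers, written in plain Python so that each step is visible; TSS denotes the total sum of squares of the actual values around their mean.

```python
# Made-up actual and predicted values
y = [3.0, 5.0, 7.0, 9.0]
y_hat = [2.5, 5.5, 7.0, 9.5]

residuals = [a - p for a, p in zip(y, y_hat)]

rss = sum(r ** 2 for r in residuals)     # residual sum of squares
mse = rss / len(y)                       # mean squared error = RSS / n

y_mean = sum(y) / len(y)
tss = sum((a - y_mean) ** 2 for a in y)  # total sum of squares
r2 = 1 - rss / tss                       # coefficient of determination

print(rss, mse, tss)  # 0.75 0.1875 20.0
print(round(r2, 4))   # 0.9625
```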
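Variance score: if y_hat is the estimated target output, y the corresponding actual target output, and Var is the variance (the square of the standard deviation), then the explained variance is estimated as follows:
explained_variance(y, y_hat) = 1 − Var(y − y_hat) / Var(y)
The best possible score is 1. Note that regr.score actually returns the coefficient of determination R2, which coincides with the explained variance when the mean of the residuals is zero. In our case, the score is 0.88, which can be considered a good value.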
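The explained variance formula ignores any constant offset in the predictions, whereas the R2 score (1 − RSS/TSS) penalizes it. A minimal sketch with made-up numbers, using only the Python standard library, shows the difference:

```python
from statistics import mean, pvariance

# Made-up values: every prediction is too high by exactly 1
y = [1.0, 2.0, 3.0, 4.0]
y_hat = [2.0, 3.0, 4.0, 5.0]

residuals = [a - p for a, p in zip(y, y_hat)]

# Explained variance: 1 - Var(y - y_hat) / Var(y)
explained_variance = 1 - pvariance(residuals) / pvariance(y)

# R2: 1 - RSS / TSS
rss = sum(r ** 2 for r in residuals)
tss = sum((a - mean(y)) ** 2 for a in y)
r2 = 1 - rss / tss

print(explained_variance)  # 1.0: the residuals are constant, so their variance is 0
print(round(r2, 2))        # 0.2: the constant offset still counts as error
```

When the residuals average to zero, the two scores give the same value.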
Let’s see how well our model fits by plotting the distributions of the actual and the fitted values.
import seaborn as sns
# kdeplot replaces the deprecated distplot(..., hist=False)
ax2 = sns.kdeplot(y.ravel(), color="r", label="Actual Value")
sns.kdeplot(y_hat.ravel(), color="b", label="Fitted Values", ax=ax2)
plt.title('Actual vs Fitted Values for Emission')
plt.xlabel('CO2 Emission')
plt.ylabel('Proportion of Engines')
plt.legend()
plt.show()
plt.close()
The code and data can be obtained from Kaggle: https://www.kaggle.com/adityakumar529/multiple-regression/ and GitHub: https://github.com/adityakumar529/Coursera_Capstone/blob/master/Multiple%20linear%20regression.ipynb