Logistic Regression using Python
Logistic regression to predict diabetes.
Regression: Regression is a modelling technique that predicts an outcome from the relationship between independent variables and a dependent (target) variable.
Three common types of regression are:
- Linear regression.
- Logistic regression.
- Polynomial regression.
In logistic regression, the outcome of the model is discrete (True/False, yes/no, 0/1).
Logistic Regression Curve/Sigmoid(‘S’) curve.
A sigmoid curve maps any real value to a number between 0 and 1, which can be read as a probability. A threshold then converts that number into a binary label: if the value is more than 0.5 (the threshold value) it is considered 1, otherwise 0.
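As a minimal sketch of this idea, the snippet below implements the standard sigmoid function, 1 / (1 + e^(-z)), and applies the 0.5 threshold. The helper names sigmoid and to_label are illustrative and not part of any library.

```python
import math

def sigmoid(z):
    """Map a real number z to a value between 0 and 1.

    Note: math.exp can overflow for very large negative z; this is
    acceptable for a sketch but not for production code.
    """
    return 1.0 / (1.0 + math.exp(-z))

def to_label(probability, threshold=0.5):
    """Convert a probability into a binary label using the threshold."""
    return 1 if probability > threshold else 0

for z in (-4, 0, 4):
    p = sigmoid(z)
    print(z, round(p, 3), to_label(p))
```

Large negative inputs land near 0, large positive inputs land near 1, and an input of 0 gives exactly 0.5, which this sketch labels as 0 because it is not above the threshold.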
Linear regression VS Logistic regression.
Values: In linear regression, the output is continuous and may range from negative infinity to positive infinity, while in logistic regression it is restricted to discrete values.
Prediction and classification: Linear regression is used for regression problems. For example, it can predict the price of a car, which can be any value from 0 upwards. Logistic regression is used for classification problems, such as whether someone will buy a car or not.
Curve: Linear regression fits a straight line, while logistic regression fits a sigmoid curve.
Let’s use logistic regression to predict whether someone has diabetes or not.
Import the libraries
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
The libraries are pandas, NumPy, seaborn, and matplotlib, imported with the aliases pd, np, sns, and plt.
Let's import our data.
data = pd.read_csv("../input/diabetes1/diabetes.csv")
data.head()
In this exercise, we will find out if a patient has Diabetes or not using Logistic regression.
Let's get the info about the data.
data.info()
We read the CSV file with pd.read_csv, and head() shows the first 5 rows. Some columns contain values that cannot be zero. For example, the glucose value cannot be 0 for a human. Similarly, blood pressure, skin thickness, insulin, and BMI cannot be zero, so zeros in these columns are treated as missing values and replaced with the column mean.
non_zero = ['Glucose', 'BloodPressure', 'SkinThickness', 'Insulin', 'BMI']
for column in non_zero:
    # Treat zeros as missing values, then fill them with the (truncated) column mean
    data[column] = data[column].replace(0, np.nan)
    mean = int(data[column].mean(skipna=True))
    data[column] = data[column].replace(np.nan, mean)
    print(data[column])
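Before replacing the zeros, it can help to see how many there are. The sketch below counts zeros per column on a small made-up DataFrame; the values are invented for illustration, and on the real data the same expression would be applied to data[non_zero].

```python
import pandas as pd

# Toy data, invented for illustration only
toy = pd.DataFrame({
    "Glucose": [148, 0, 183, 89],
    "BMI": [33.6, 26.6, 0.0, 0.0],
})

# Comparing with 0 gives a boolean frame; summing counts the True values per column
zero_counts = (toy == 0).sum()
print(zero_counts)
```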
Let’s analyze our data and look at how the outcome and some of the independent variables are distributed.
Let's find out how many of the people are diabetic and how many are not.
sns.countplot(x="Outcome",data = data)
The count plot shows 1 for people who are diabetic and 0 for those who are non-diabetic. We have approximately 500 non-diabetic people and approximately 250 diabetic people.
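For exact numbers rather than values read off a plot, value_counts can be used. This is a sketch on a made-up Series; on the real data the equivalent call would be data["Outcome"].value_counts().

```python
import pandas as pd

# Toy target column, invented for illustration only
toy_outcome = pd.Series([0, 1, 0, 0, 1, 0], name="Outcome")

counts = toy_outcome.value_counts()                # absolute count per class
shares = toy_outcome.value_counts(normalize=True)  # fraction per class
print(counts)
print(shares)
```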
Let’s try to find the details of the age group.
data["Age"].plot.hist()
The histogram shows how many people in our dataset fall into each age group.
data["BMI"].plot.hist()
Now that the dataset is ready for analysis, let’s check whether it has any missing (null) values.
sns.heatmap(data.isnull(),yticklabels=False, cbar=False)
As the heatmap is completely black, the dataset does not have any missing values.
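A numeric check can complement the heatmap. As a sketch on a made-up DataFrame, isnull().sum() reports the number of missing values per column; on the real data the equivalent call would be data.isnull().sum().

```python
import numpy as np
import pandas as pd

# Toy data with one missing value, invented for illustration only
toy = pd.DataFrame({
    "Glucose": [148.0, np.nan, 183.0],
    "BMI": [33.6, 26.6, 23.3],
})

missing_per_column = toy.isnull().sum()
total_missing = int(missing_per_column.sum())
print(missing_per_column)
print(total_missing)
```

A total of 0 on the real data would confirm what the uniformly colored heatmap suggests.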
X = data.iloc[:, 0:8]
y = data.iloc[:, 8]
For X we take all the rows of the columns at positions 0 to 7 (the features). Similarly, for y we take all the rows of the column at position 8, which is the ninth column and holds the target.
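Selecting by position works only as long as the column order never changes. A hedged alternative is to select the target by name, sketched here on a made-up DataFrame. It assumes the target column is called Outcome, as in the count plot earlier.

```python
import pandas as pd

# Toy data, invented for illustration only
toy = pd.DataFrame({
    "Glucose": [148, 85, 183],
    "BMI": [33.6, 26.6, 23.3],
    "Outcome": [1, 0, 1],
})

# Selection by name
X_by_name = toy.drop(columns="Outcome")
y_by_name = toy["Outcome"]

# Selection by position, mirroring the iloc approach
X_by_pos = toy.iloc[:, 0:2]
y_by_pos = toy.iloc[:, 2]

print(X_by_name.equals(X_by_pos), y_by_name.equals(y_by_pos))
```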
from sklearn.model_selection import train_test_split
X_train,X_test,y_train,y_test = train_test_split(X,y,test_size=0.3,random_state=42)
Here we split the data into training and test sets, holding out 30% of the data for testing.
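To see what the split produces, the sketch below runs train_test_split on synthetic data with the same test_size and random_state. The stratify argument is an addition that is not in the original code; it keeps the class ratio similar in both sets, which can matter when the classes are imbalanced.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic data: 100 rows, 2 features, 70 negatives and 30 positives
X_toy = np.arange(200).reshape(100, 2)
y_toy = np.array([0] * 70 + [1] * 30)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.3, random_state=42, stratify=y_toy
)

print(X_tr.shape, X_te.shape)   # 70 training rows, 30 test rows
print(y_tr.mean(), y_te.mean())  # share of positives in each set
```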
from sklearn.linear_model import LogisticRegression
logmodel = LogisticRegression()  # creating an instance of logistic regression
logmodel.fit(X_train,y_train)
Out:
/opt/conda/lib/python3.7/site-packages/sklearn/linear_model/_logistic.py:764: ConvergenceWarning: lbfgs failed to converge (status=1):
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.
Increase the number of iterations (max_iter) or scale the data as shown in:
https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
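extra_warning_msg=_LOGISTIC_SOLVER_CONVERGENCE_MSG)
LogisticRegression()
The output shows a ConvergenceWarning followed by the fitted model, LogisticRegression(). The warning means the lbfgs solver reached its iteration limit before converging; as the message suggests, increasing max_iter or scaling the data are the usual remedies.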
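The warning above names two remedies: scaling the data or raising max_iter. The sketch below shows both on synthetic data. The data, the variable names, and the choice of StandardScaler are assumptions for illustration; whether either option silences the warning on the diabetes data is not verified here.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Synthetic features on very different scales, invented for illustration
X_toy = np.column_stack([rng.normal(120, 30, 300), rng.normal(0.5, 0.3, 300)])
y_toy = (X_toy[:, 0] + 100 * X_toy[:, 1] > 170).astype(int)

# Option 1: standardize the features before fitting
scaled_model = make_pipeline(StandardScaler(), LogisticRegression())
scaled_model.fit(X_toy, y_toy)

# Option 2: give the solver more iterations (the default is 100)
patient_model = LogisticRegression(max_iter=1000)
patient_model.fit(X_toy, y_toy)

print(scaled_model.score(X_toy, y_toy), patient_model.score(X_toy, y_toy))
```

Wrapping the scaler and the model in one pipeline means the scaling learned from the training data is reused automatically when predicting on the test data.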
Let’s predict our value.
pred = logmodel.predict(X_test)
pred
We have an array of predictions, but we need to evaluate our model to check its accuracy. Let's start with a confusion matrix.
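predict returns hard 0/1 labels. To connect this with the 0.5 threshold described earlier, the sketch below uses predict_proba on synthetic data (invented for illustration) and checks that thresholding the probability of class 1 at 0.5 reproduces the labels from predict.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
# Synthetic binary classification data, invented for illustration
X_toy = rng.normal(size=(200, 3))
y_toy = (X_toy[:, 0] + 0.5 * rng.normal(size=200) > 0).astype(int)

model = LogisticRegression().fit(X_toy, y_toy)

proba_positive = model.predict_proba(X_toy)[:, 1]  # P(class 1) for each row
labels_from_threshold = (proba_positive > 0.5).astype(int)

print(np.array_equal(labels_from_threshold, model.predict(X_toy)))
```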
from sklearn.metrics import accuracy_score,confusion_matrix
confusion_matrix(y_test,pred)
Out:
array([[125, 26],
[ 32, 48]])
In the confusion matrix, the diagonal values 125 and 48 are the correct predictions (true negatives and true positives), while 26 and 32 are the misclassifications (false positives and false negatives).
We will check the accuracy score.
accuracy_score(y_test,pred)
Out:
0.7489177489177489
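As a cross-check, the same accuracy follows from the counts in the confusion matrix reported above. The sketch below also derives precision and recall from those counts; these two metrics are not computed in the original walkthrough and are included as an illustration.

```python
# Counts taken from the confusion matrix shown earlier:
# rows are actual classes, columns are predicted classes
tn, fp = 125, 26
fn, tp = 32, 48

total = tn + fp + fn + tp
accuracy = (tn + tp) / total
precision = tp / (tp + fp)  # of those predicted diabetic, the share that really are
recall = tp / (tp + fn)     # of those actually diabetic, the share that was found

print(round(accuracy, 4), round(precision, 4), round(recall, 4))
```

A recall of 0.6 means that 32 of the 80 diabetic patients in the test set were missed, which accuracy alone does not reveal.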
We have an accuracy score of about 0.75. Let’s compare the actual and the predicted values.
df=pd.DataFrame({'Actual':y_test, 'Predicted':pred})
df
plt.figure(figsize=(5, 7))
# Compare the test labels with the predictions made for the same rows.
# kdeplot replaces the deprecated distplot(hist=False).
ax = sns.kdeplot(y_test, color="r", label="Actual Value")
sns.kdeplot(pred, color="b", label="Fitted Values", ax=ax)
plt.title('Actual vs Fitted Values for Outcome')
plt.legend()
plt.show()
plt.close()
For the working code please check the below link.