Logistic Regression using Python

Aditya Kumar
5 min read · Jul 26, 2020

Logistic regression to predict diabetes.

Regression: Regression is a modeling technique that predicts an outcome from the relationship between the independent variables and a dependent (target) variable.

Three common types of regression are:

  1. Linear regression.
  2. Logistic regression.
  3. Polynomial regression.

In logistic regression, the outcome of the model is discrete (True/False, yes/no, 0/1).

Logistic Regression Curve/Sigmoid(‘S’) curve.

A sigmoid curve maps any real value to a value between 0 and 1, which can be read as a probability. A threshold then converts that value into 0 or 1: if it is more than 0.5 (the threshold value), it is treated as 1, otherwise as 0.

Sigmoid graph by Author
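To make the curve concrete, here is a minimal sketch of the sigmoid function and the thresholding step, using only the standard library. The names sigmoid and to_label are chosen for illustration and do not come from any library.

```python
import math

def sigmoid(z: float) -> float:
    """Map any real number to a value between 0 and 1."""
    # Two branches keep math.exp from overflowing for large negative z.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def to_label(probability: float, threshold: float = 0.5) -> int:
    """Return 1 if the probability is above the threshold, else 0."""
    return 1 if probability > threshold else 0

for z in (-4, 0, 4):
    p = sigmoid(z)
    print(z, round(p, 3), to_label(p))
```

Large negative inputs land near 0, large positive inputs land near 1, and an input of 0 gives exactly 0.5, which this sketch labels as 0 because it is not above the threshold.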

Linear regression VS Logistic regression.

Values: In linear regression, the output is continuous and can range from negative infinity to positive infinity, while in logistic regression it is restricted to discrete values.

Prediction and classification: Linear regression is used for regression problems. For example, we may need to predict the price of a car, which can be any value from 0 upward. Logistic regression is used for classification problems, such as predicting whether someone will buy a car or not.

Curve: Linear regression fits a straight line, while logistic regression fits a sigmoid curve.

Let’s use logistic regression to predict whether someone has diabetes or not.

Import the libraries

import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

The libraries are pandas, NumPy, seaborn, and matplotlib, imported under the aliases pd, np, sns, and plt.

Let's import our data.

data = pd.read_csv("../input/diabetes1/diabetes.csv")
data.head()
Output image by Author

In this exercise, we will predict whether a patient has diabetes or not using logistic regression.

Let's get the info about the data.

data.info()
Output Image by Author

We read the CSV file with pd.read_csv, and head() shows the first 5 rows. For some features, a value of zero is not physically possible and really means the measurement is missing. For example, the glucose value cannot be 0 for a human. Similarly, blood pressure, skin thickness, insulin, and BMI cannot be zero. We replace those zeros with the mean of the valid values in each column.

non_zero = ['Glucose','BloodPressure','SkinThickness','Insulin','BMI']
for column in non_zero:
    # Treat 0 as a missing value so it does not distort the mean
    data[column] = data[column].replace(0, np.nan)
    # Mean of the valid (non-missing) values, truncated to an integer
    mean = int(data[column].mean(skipna=True))
    data[column] = data[column].replace(np.nan, mean)
    print(data[column])
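The effect of this replacement is easier to see on a tiny example. The sketch below uses a made-up four-row frame (the values are illustrative, not taken from the dataset) and swaps the second replace call for fillna, which does the same job here.

```python
import numpy as np
import pandas as pd

# Toy data (made up for illustration): 0 stands in for a missing measurement.
toy = pd.DataFrame({"Glucose": [148, 0, 183, 89]})

# Treat zeros as missing so they do not drag the mean down.
glucose = toy["Glucose"].replace(0, np.nan)

# Mean of the valid readings only: (148 + 183 + 89) / 3 = 140.
fill_value = int(glucose.mean())

# fillna replaces every missing entry with the computed mean.
toy["Glucose"] = glucose.fillna(fill_value)
print(toy["Glucose"].tolist())
```

Had the zero been left in place, the mean would have been 105 instead of 140, which is why the zeros are converted to NaN before the mean is computed.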

Let’s analyze our data and try to find out the relationship between the independent variables.

Let's find out how many of the people are diabetic and how many are not.

sns.countplot(x="Outcome",data = data)
Output Image by Author

The count plot shows 1 for people who are diabetic and 0 for people who are non-diabetic. We have approximately 500 people who are non-diabetic and approximately 250 who are diabetic.

Let’s try to find the details of the age group.

data["Age"].plot.hist()
Output Image by Author

The histogram shows how the people in our dataset are distributed across age groups. We can plot BMI the same way.

data["BMI"].plot.hist()
Output Image by Author

Now that the dataset is ready for analysis, let’s check whether it still has any missing (null) values.

sns.heatmap(data.isnull(),yticklabels=False, cbar=False)
Output Image by Author

As the heatmap is completely black, the dataset has no missing values left.
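A heatmap is a quick visual check, but counts are easier to verify. The sketch below is one possible way to count nulls and zeros per column; count_bad_values is a hypothetical helper written for this example (not a pandas function), and the toy frame is made up.

```python
import pandas as pd

def count_bad_values(df: pd.DataFrame, columns: list) -> pd.DataFrame:
    """Count nulls and zeros per column (hypothetical helper, not a pandas API)."""
    return pd.DataFrame({
        "nulls": df[columns].isnull().sum(),
        "zeros": (df[columns] == 0).sum(),
    })

# Toy data (made up): one zero in Glucose, one missing value in BMI.
toy = pd.DataFrame({"Glucose": [148, 0, 183], "BMI": [33.6, 26.6, None]})
report = count_bad_values(toy, ["Glucose", "BMI"])
print(report)
```

On the real data, the same call with the non_zero column list should report zero in both columns once the replacement step has run.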

X =data.iloc[:,0:8]
y =data.iloc[:,8]

For X we take all the rows of the columns at positions 0 to 7 (the eight features); the end of the slice 0:8 is excluded. Similarly, for y we take all the rows of the column at position 8, which is the Outcome column.
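Positional slicing is compact but easy to get wrong by one. As a sketch, the toy frame below (placeholder column names f0 to f7 and made-up values, with Outcome last as in the dataset) shows that the positional selection matches a name-based one.

```python
import pandas as pd

# Toy frame: eight placeholder feature columns plus Outcome (values made up).
columns = [f"f{i}" for i in range(8)] + ["Outcome"]
toy = pd.DataFrame([[6, 148, 72, 35, 0, 33.6, 0.627, 50, 1],
                    [1, 85, 66, 29, 0, 26.6, 0.351, 31, 0]], columns=columns)

# Positional selection, as in the article.
X_pos = toy.iloc[:, 0:8]   # columns 0..7, end index 8 is excluded
y_pos = toy.iloc[:, 8]     # column 8 only

# Name-based selection that yields the same result.
X_name = toy.drop(columns="Outcome")
y_name = toy["Outcome"]

print(X_pos.shape, y_pos.name)
```

The name-based form keeps working if the column order ever changes, which can make it the safer choice.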

from sklearn.model_selection import train_test_split
X_train,X_test,y_train,y_test = train_test_split(X,y,test_size=0.3,random_state=42)

Here we split the data into training and test sets, holding out 30% of the rows for testing. Setting random_state=42 makes the split reproducible.
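To see what a 30% split produces, here is a small sketch on synthetic arrays. The row count of 768 is an assumption, chosen so that the test set has 231 rows, the same total as the confusion matrix later in this article.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in data: 768 rows (assumed), 8 features, alternating 0/1 labels.
X = np.arange(768 * 8).reshape(768, 8)
y = np.arange(768) % 2

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)
print(X_train.shape, X_test.shape)
```

If the classes are imbalanced, passing stratify=y to train_test_split keeps the class ratio roughly the same in both sets.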

from sklearn.linear_model import LogisticRegression
logmodel = LogisticRegression()  # creating an instance of logistic regression
logmodel.fit(X_train,y_train)

Out:

/opt/conda/lib/python3.7/site-packages/sklearn/linear_model/_logistic.py:764: ConvergenceWarning: lbfgs failed to converge (status=1):
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
extra_warning_msg=_LOGISTIC_SOLVER_CONVERGENCE_MSG)
LogisticRegression()

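The last line is the representation of the fitted model. Before it, scikit-learn prints a ConvergenceWarning: the lbfgs solver reached its iteration limit (max_iter, 100 by default) before converging. As the warning suggests, increasing max_iter or scaling the features addresses this.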
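One way to act on the warning is to scale the features and raise max_iter. The sketch below is not the code used in this article: it runs on synthetic data from make_classification, and max_iter=1000 is an arbitrary choice.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the diabetes data (assumed shape: 768 rows, 8 features).
X, y = make_classification(n_samples=768, n_features=8, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

# Scale the features first, then fit logistic regression with a higher limit.
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
model.fit(X_train, y_train)

n_iter = int(model[-1].n_iter_[0])      # iterations the solver actually used
test_accuracy = model.score(X_test, y_test)
print("iterations used:", n_iter)
print("test accuracy:", test_accuracy)
```

Putting the scaler inside the pipeline means it is fitted on the training data only and then applied unchanged to the test data.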

Let’s predict our value.

pred = logmodel.predict(X_test)
pred
Output Image by Author
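The predicted labels tie back to the sigmoid threshold described earlier. As a sketch on synthetic data (not the diabetes data), thresholding the class-1 probability from predict_proba at 0.5 should reproduce what predict returns for a binary model.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic binary classification data, used only for illustration.
X, y = make_classification(n_samples=200, n_features=8, random_state=0)
clf = LogisticRegression(max_iter=1000).fit(X, y)

# Probability of class 1 for each row (second column of predict_proba).
proba_class_1 = clf.predict_proba(X)[:, 1]

# Apply the 0.5 threshold by hand and compare with predict().
manual_pred = (proba_class_1 > 0.5).astype(int)
same = np.array_equal(manual_pred, clf.predict(X))
print(same)
```

Working with the probabilities directly also allows a different threshold when, for example, missing a diabetic patient is more costly than a false alarm.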

predict returns an array of predicted labels (0 or 1), but we need to evaluate the model to check its accuracy. Let's start with a confusion matrix.

from sklearn.metrics import accuracy_score,confusion_matrix
confusion_matrix(y_test,pred)

Out:

array([[125,  26],
       [ 32,  48]])

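In the confusion matrix, rows are the actual classes and columns are the predicted classes. The diagonal entries, 125 and 48, are the correct predictions (true negatives and true positives), while 26 and 32 are the errors: 26 non-diabetic patients predicted as diabetic (false positives) and 32 diabetic patients predicted as non-diabetic (false negatives).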
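The four counts in the matrix are enough to derive the usual metrics by hand. The sketch below computes accuracy, precision, and recall from them in plain Python; the variable names are chosen for illustration.

```python
# Counts taken from the confusion matrix above: [[TN, FP], [FN, TP]].
tn, fp, fn, tp = 125, 26, 32, 48

total = tn + fp + fn + tp        # 231 test samples
accuracy = (tp + tn) / total     # share of all predictions that are correct
precision = tp / (tp + fp)       # of predicted diabetics, how many really are
recall = tp / (tp + fn)          # of actual diabetics, how many were found

print(round(accuracy, 4), round(precision, 4), round(recall, 4))
```

A recall of 0.6 means the model misses 4 in 10 diabetic patients, something the accuracy figure alone does not reveal.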

We will check the accuracy score.

accuracy_score(y_test,pred)

Out:

0.7489177489177489

We have an accuracy score of about 0.75, meaning roughly three out of four test predictions are correct. Let’s check the difference between the actual and the predicted values.

df=pd.DataFrame({'Actual':y_test, 'Predicted':pred})
df
Output Image by Author
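# Compare the distribution of actual test labels with the predicted labels.
# Note: distplot is deprecated since seaborn 0.11; kdeplot is its replacement for hist=False.
plt.figure(figsize=(5, 7))
ax = sns.distplot(y_test, hist=False, color="r", label="Actual Value")
sns.distplot(pred, hist=False, color="b", label="Fitted Values", ax=ax)
plt.title('Actual vs Fitted Values for Outcome')
plt.legend()  # show the labels passed above
plt.show()
plt.close()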
Output Image by Author

For the working code, please check the link below.

Written by Aditya Kumar

Data Scientist with 6 years of experience. To find out more connect with me on https://www.linkedin.com/in/adityakumar529/