

Machine Learning Workflow on Diabetes Data

This article shows how data related to diabetes can be leveraged to predict whether a person has diabetes. More specifically, it focuses on how machine learning can be utilized to predict diseases such as diabetes. By the end of this article you will understand concepts like data exploration, data cleansing, feature selection, model selection and model evaluation, and be able to apply them in a practical way.

What is Diabetes?

Diabetes is a disease that occurs when the blood glucose level becomes too high, which ultimately leads to other health problems such as heart disease and kidney disease. Diabetes is caused mainly by the consumption of highly processed food, poor eating habits and other lifestyle factors. According to the WHO, the number of people with diabetes has increased over the years.

Prerequisites

  • Python 3.x
  • Anaconda (Scikit-learn, NumPy, Pandas, Matplotlib, Seaborn)
  • Jupyter Notebook
  • Basic understanding of supervised machine learning methods, specifically classification

Phase 0 — Data Preparation

In this tutorial we aren’t going to create our own data set; instead we will use an existing data set called the “Pima Indians Diabetes Database”, provided by the UCI Machine Learning Repository.

Phase 1 — Data Exploration

When we encounter a data set, we should first analyse it and “get to know” it. This step is necessary to become familiar with the data, to gain some understanding of the potential features, and to see whether data cleaning is needed.

First we import the necessary libraries and load the data set into the Jupyter notebook, then list its columns.

In [4]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
# Use a raw string (r'...') so the backslashes in the Windows path are not treated as escape sequences
diabetes = pd.read_csv(r'C:\Users\KEVIN\PycharmProjects\diabetes\diabetes.csv')
diabetes.columns 
Out[4]:
Index(['Pregnancies', 'Glucose', 'BloodPressure', 'SkinThickness',
       'Insulin', 'BMI', 'DiabetesPedigreeFunction', 'Age', 'Outcome'],
      dtype='object')

We can examine the data set using pandas’ head() method.

In [5]:
diabetes.head()
Out[5]:
Pregnancies Glucose BloodPressure SkinThickness Insulin BMI DiabetesPedigreeFunction Age Outcome
0 6 148 72 35 0 33.6 0.627 50 1
1 1 85 66 29 0 26.6 0.351 31 0
2 8 183 64 0 0 23.3 0.672 32 1
3 1 89 66 23 94 28.1 0.167 21 0
4 0 137 40 35 168 43.1 2.288 33 1

We can find the dimensions of the data set using the pandas DataFrame’s ‘shape’ attribute.

In [6]:
print("Diabetes data set dimensions : {}".format(diabetes.shape))
Diabetes data set dimensions : (768, 9)

We can observe that the data set contains 768 rows and 9 columns. ‘Outcome’ is the column we are going to predict, which says whether the patient is diabetic or not: 1 means the person is diabetic and 0 means the person is not. Out of the 768 persons, 500 are labeled as 0 (non-diabetic) and 268 as 1 (diabetic).
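As a quick check of this class balance, a minimal sketch using pandas’ value_counts (the counts above come from exactly this kind of call):

# Count how many rows belong to each Outcome class (0 -> 500, 1 -> 268 for this data set)
print(diabetes['Outcome'].value_counts())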

In [104]:
# Display how many 0 values each feature has
for field in diabetes.columns[:8]:
    print('Number of 0-entries for "{field_name}" feature: {amount}'.format(
        field_name=field,
        amount=np.count_nonzero(diabetes[field] == 0)
    ))
Number of 0-entries for "Pregnancies" feature: 111
Number of 0-entries for "Glucose" feature: 5
Number of 0-entries for "BloodPressure" feature: 35
Number of 0-entries for "SkinThickness" feature: 227
Number of 0-entries for "Insulin" feature: 374
Number of 0-entries for "BMI" feature: 11
Number of 0-entries for "DiabetesPedigreeFunction" feature: 0
Number of 0-entries for "Age" feature: 0

Visualization is an important aspect of data science. It helps us understand the data and also explain it to another person. Python has several useful visualization libraries such as Matplotlib and Seaborn.

In this tutorial we will use pandas’ plotting, which is built on top of Matplotlib, to look at the distribution of each feature.

In [97]:
diabetes.groupby('Outcome').hist(figsize=(9, 9))
Out[97]:
[Histograms of each feature, drawn separately for Outcome = 0 and Outcome = 1]

Correlation Matrix

A matrix of correlations provides useful insight into relationships between pairs of variables.

In [47]:
sns.heatmap(
    data=diabetes.corr(),  # pairwise correlations between all numeric columns
    annot=True,
    fmt='.2f',
    cmap='RdYlGn'
)

fig = plt.gcf()
fig.set_size_inches(10, 8)

plt.show()

Phase 2 — Data Cleaning

The next phase of the machine learning workflow is data cleaning. It is considered one of the crucial steps of the workflow because it can make or break the model. There is a saying in machine learning, “Better data beats fancier algorithms”, which suggests that better data gives you better resulting models.

There are several factors to consider in the data cleaning process.

  1. Duplicate or irrelevant observations.
  2. Bad labeling of data, same category occurring multiple times.
  3. Missing or null data points.
  4. Unexpected outliers.

Missing or Null Data Points

We can find any missing or null data points in the data set (if there are any) using the following pandas functions.

In [9]:
diabetes.isnull().sum()
diabetes.isna().sum()  # isnull() and isna() are aliases; either call gives the same result
Out[9]:
Pregnancies                 0
Glucose                     0
BloodPressure               0
SkinThickness               0
Insulin                     0
BMI                         0
DiabetesPedigreeFunction    0
Age                         0
Outcome                     0
dtype: int64

We can observe that there are no missing data points in the data set. If there were any, we would have to deal with them accordingly, for example as in the sketch below.
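A minimal sketch of two standard ways such missing values could be handled with pandas, assuming they had shown up; the column used for imputation here is only an example:

# Option 1: drop every row that contains a missing value
cleaned = diabetes.dropna()

# Option 2: fill missing values of a column with that column's median
diabetes['BMI'] = diabetes['BMI'].fillna(diabetes['BMI'].median())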

Unexpected Outliers

When analyzing the histograms we can identify that there are some outliers in some columns. We will further analyse those outliers and determine what we can do about them.

Blood pressure: By observing the data we can see that there are 0 values for blood pressure. These readings are clearly wrong, because a living person cannot have a diastolic blood pressure of zero. There are 35 rows where the value is 0.

In [11]:
print("Total : ", diabetes[diabetes.BloodPressure == 0].shape[0])
print(diabetes[diabetes.BloodPressure == 0].groupby('Outcome')['Age'].count())
Total :  35
Outcome
0    19
1    16
Name: Age, dtype: int64

Plasma glucose levels: Even after fasting, the glucose level would not be as low as zero. Therefore zero is an invalid reading. There are 5 rows where the value is 0.

In [13]:
print("Total : ", diabetes[diabetes.Glucose == 0].shape[0])
print(diabetes[diabetes.Glucose == 0].groupby('Outcome')['Age'].count())
Total :  5
Outcome
0    3
1    2
Name: Age, dtype: int64

Skin fold thickness: For normal people the skin fold thickness can’t be less than 10 mm, let alone zero. There are 227 rows where the value is 0.

In [14]:
print("Total : ", diabetes[diabetes.SkinThickness == 0].shape[0])
print(diabetes[diabetes.SkinThickness == 0].groupby('Outcome')['Age'].count())
Total :  227
Outcome
0    139
1     88
Name: Age, dtype: int64

BMI: Should not be 0 or close to zero unless the person is severely underweight, which could be life-threatening. There are 11 rows where the value is 0.

In [15]:
print("Total : ", diabetes[diabetes.BMI == 0].shape[0])
print(diabetes[diabetes.BMI == 0].groupby('Outcome')['Age'].count())
Total :  11
Outcome
0    9
1    2
Name: Age, dtype: int64

Insulin: In rare situations a person can have zero insulin, but in this data set there are 374 rows where the value is 0.

In [16]:
print("Total : ", diabetes[diabetes.Insulin == 0].shape[0])
print(diabetes[diabetes.Insulin == 0].groupby('Outcome')['Age'].count())
Total :  374
Outcome
0    236
1    138
Name: Age, dtype: int64

Here are several ways to handle invalid data values :

  • Ignore/remove these cases : This is not actually possible in most cases because it would mean losing valuable information, and here the “skin thickness” and “insulin” columns have a lot of invalid points. It might work for the “BMI”, “glucose” and “blood pressure” columns.
  • Put average/mean values : This might work for some data sets, but in our case putting a mean value into the blood pressure column would send a wrong signal to the model (a sketch of this option follows after the list).
  • Avoid using features : It is possible to leave out the features with a lot of invalid values. This might work for “skin thickness”, but it is hard to predict the effect on the model.
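A minimal sketch of the mean/median-imputation option, assuming the zeros are first marked as missing; note this is not the approach taken below, where the affected rows are simply removed:

# Treat zeros in these columns as missing, then fill each column with its own median
cols_with_invalid_zeros = ['Glucose', 'BloodPressure', 'SkinThickness', 'Insulin', 'BMI']
imputed = diabetes.copy()
imputed[cols_with_invalid_zeros] = imputed[cols_with_invalid_zeros].replace(0, np.nan)
imputed[cols_with_invalid_zeros] = imputed[cols_with_invalid_zeros].fillna(
    imputed[cols_with_invalid_zeros].median()
)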

By the end of the data cleaning process we have come to the conclusion that the given data set is incomplete. Since this is a demonstration of the machine learning workflow, we will proceed with the given data after some minor adjustments.

We will remove the rows in which “BloodPressure”, “BMI” or “Glucose” is zero.

In [17]:
diabetes_mod = diabetes[(diabetes.BloodPressure != 0) & (diabetes.BMI != 0) & (diabetes.Glucose != 0)]
print(diabetes_mod.shape)
(724, 9)

Phase 3 — Feature Engineering

Feature engineering is the process of transforming the gathered data into features that better represent the problem we are trying to solve to the model, in order to improve its performance and accuracy.

Feature engineering creates more input features from the existing features and also combines several features to produce more intuitive features to feed to the model.

“Feature engineering enables us to highlight the important features and to bring domain expertise on the problem to the table. It also helps to avoid overfitting the model despite providing many input features.”
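As a purely hypothetical illustration of the idea (this derived feature is not part of the data set and is not used later in the tutorial), one could build a new column from an existing one:

# Hypothetical derived feature: bucket Age into coarse groups
diabetes_example = diabetes.copy()
diabetes_example['AgeGroup'] = pd.cut(diabetes_example['Age'],
                                      bins=[20, 30, 40, 50, 100],
                                      labels=['20s', '30s', '40s', '50+'])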

The domain of the problem we are trying to tackle requires lots of related features. Since the data set is already provided, and fairly small, we can’t create or dismiss features with much confidence at this point. The data set contains the features listed below.

From a crude observation we might say that ‘SkinThickness’ is not an indicator of diabetes, but we can’t conclude that it is unusable at this point.

Therefore we will use all the features available. We separate the data set into features and the response that we are going to predict. We will assign the features to the X variable and the response to the y variable.

In [18]:
feature_names = ['Pregnancies', 'Glucose', 'BloodPressure', 'SkinThickness', 'Insulin', 'BMI', 'DiabetesPedigreeFunction', 'Age']
X = diabetes_mod[feature_names]
y = diabetes_mod.Outcome

Generally feature engineering is performed before selecting the model. However, for this tutorial we follow a different approach: initially we feed all the features in the data set to the model, and we will revisit feature selection later to discuss feature importance for the selected model.

Phase 4 — Model Selection

The model selection or algorithm selection phase is the most exciting part, and the heart, of machine learning. It is the phase in which we select the model that performs best for the data set at hand.

First we will calculate the “Classification Accuracy (Testing Accuracy)” of a set of classification models with their default parameters, to determine which model performs best on the diabetes data set.

We will import the necessary libraries into the notebook. We import 7 classifiers, namely K-Nearest Neighbors, Support Vector Classifier, Logistic Regression, Decision Tree, Gaussian Naive Bayes, Random Forest and Gradient Boosting, as contenders for the best classifier.

In [55]:
from sklearn.model_selection import KFold
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import GradientBoostingClassifier

from sklearn.model_selection import train_test_split, StratifiedKFold, cross_val_score
from sklearn.metrics import accuracy_score, confusion_matrix, roc_curve, roc_auc_score, classification_report
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.feature_selection import RFECV

We will initialize the classifier models with their default parameters and add them to a model list.

In [56]:
models = []
models.append(('KNN', KNeighborsClassifier()))
models.append(('SVC', SVC()))
models.append(('LR', LogisticRegression()))
models.append(('DT', DecisionTreeClassifier()))
models.append(('GNB', GaussianNB()))
models.append(('RF', RandomForestClassifier()))
models.append(('GB', GradientBoostingClassifier()))  

Evaluation Methods

It is general practice to avoid training and testing on the same data. The reasons are that the goal of the model is to predict out-of-sample data, and that a model evaluated on the data it was trained on can look deceptively good while actually overfitting. To avoid these problems there are two common precautions: a train/test split and K-fold cross-validation.

Train/Test Split with Scikit-learn :

Next we split the features and the response into train and test portions. We stratify the samples, meaning each response class is represented in the same proportion in both the train and the test portion.

In [57]:
from sklearn.model_selection import train_test_split
from sklearn.model_selection import cross_val_score
from sklearn.metrics import accuracy_score
In [58]:
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify = diabetes_mod.Outcome, random_state=0)

Then we fit each model in a loop and calculate the accuracy of the respective model using the “accuracy_score”.

In [94]:
names = []
scores = []

for name, model in models:
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    scores.append(accuracy_score(y_test, y_pred))
    names.append(name)
tr_split = pd.DataFrame({'Name': names, 'Score': scores})
print(tr_split.sort_values(by=['Score'], ascending=False))
  Name     Score
6   GB  0.773481
2   LR  0.767956
3   DT  0.745856
4  GNB  0.734807
0  KNN  0.729282
5   RF  0.723757
1  SVC  0.657459

K-Fold Cross-Validation :

Next we evaluate the same models with 10-fold cross-validation: each model is trained and scored on 10 different folds of the data and the accuracy scores are averaged, which gives a more stable estimate than a single train/test split.

In [95]:
names = []
scores = []
for name, model in models:
    
    # Note: recent scikit-learn versions require shuffle=True whenever random_state is set
    kfold = KFold(n_splits=10, random_state=10)
    score = cross_val_score(model, X, y, cv=kfold, scoring='accuracy').mean()
    
    names.append(name)
    scores.append(score)
    
kf_cross_val = pd.DataFrame({'Name': names, 'Score': scores})
print(kf_cross_val.sort_values(by=['Score'], ascending=False))
  Name     Score
6   GB  0.775076
2   LR  0.766781
4  GNB  0.757021
5   RF  0.746176
0  KNN  0.719787
3   DT  0.710046
1  SVC  0.656279

We can plot the cross-validation accuracy scores using Seaborn.

In [61]:
axis = sns.barplot(x = 'Name', y = 'Score', data = kf_cross_val)
axis.set(xlabel='Classifier', ylabel='Accuracy')
for p in axis.patches:
    height = p.get_height()
    axis.text(p.get_x() + p.get_width()/2, height + 0.005, '{:1.4f}'.format(height), ha="center") 
    
plt.show()

We can see that Gradient Boosting, Logistic Regression, Gaussian Naive Bayes and Random Forest have performed better than the rest. Among these, Logistic Regression stands out as a strong yet simple and fast baseline.

At the baseline Logistic Regression manages a classification accuracy of about 77% (77.64% with the stratified cross-validation used in the following sections). It will be selected as the prime candidate for the next phases.

Logistic Regression — Feature Selection

We use Recursive Feature Elimination with Cross-Validation (RFECV), which repeatedly fits the model, drops the weakest feature and cross-validates, to find the subset of features that yields the best accuracy. First we define a stratified 10-fold splitter.

In [64]:
# Note: recent scikit-learn versions require shuffle=True whenever random_state is set
strat_k_fold = StratifiedKFold(
    n_splits=10,
    random_state=42
)
In [65]:
from sklearn.feature_selection import RFECV
logreg_model = LogisticRegression()
rfecv = RFECV(estimator=logreg_model, step=1, cv=strat_k_fold, scoring='accuracy')
rfecv.fit(X, y)
Out[65]:
RFECV(cv=StratifiedKFold(n_splits=10, random_state=42, shuffle=False),
   estimator=LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='warn',
          n_jobs=None, penalty='l2', random_state=None, solver='warn',
          tol=0.0001, verbose=0, warm_start=False),
   min_features_to_select=1, n_jobs=None, scoring='accuracy', step=1,
   verbose=0)

After fitting, RFECV exposes the attribute grid_scores_, which holds the cross-validated accuracy score for each number of selected features. We can plot it to see the number of features that gives the maximum accuracy for the given model.

In [66]:
plt.figure()
plt.title('Logistic Regression CV score vs No of Features')
plt.xlabel("Number of features selected")
plt.ylabel("Cross validation score (nb of correct classifications)")
plt.plot(range(1, len(rfecv.grid_scores_) + 1), rfecv.grid_scores_)
plt.show()

Looking at the plot we can see that 4 input features give the best accuracy score. RFECV also exposes the attribute support_, a boolean mask marking the features that were selected. We can use the following code to list them.

In [67]:
# Keep only the features whose support_ flag is True
new_features = [name for name, selected in zip(feature_names, rfecv.support_) if selected]
print(new_features)
['Pregnancies', 'Glucose', 'BMI', 'DiabetesPedigreeFunction']

These are the features deemed most suitable for predicting the response class. We can compare a model trained on the original features with one trained on the RFECV-selected features, to see whether the accuracy improves.

As the scores below show, there is a slight increase in accuracy after feeding only the selected features to the model.

In [68]:
# Calculate accuracy scores 
X_new = diabetes_mod[new_features]
initial_score = cross_val_score(logreg_model, X, y, cv=strat_k_fold, scoring='accuracy').mean()
print("Initial accuracy : {} ".format(initial_score))
fe_score = cross_val_score(logreg_model, X_new, y, cv=strat_k_fold, scoring='accuracy').mean()
print("Accuracy after Feature Selection : {} ".format(fe_score))
Initial accuracy : 0.776440071173 
Accuracy after Feature Selection : 0.780587711964 

Gradient Boosting — Feature Selection

We can also analyze the other top-performing model, the Gradient Boosting classifier, to see whether feature selection increases its accuracy and whether it would be better than Logistic Regression after the process.

We follow the same procedure as we did for Logistic Regression.

In [69]:
gb_model = GradientBoostingClassifier()
gb_rfecv = RFECV(estimator=gb_model, step=1, cv=strat_k_fold, scoring='accuracy')
gb_rfecv.fit(X, y)
plt.figure()
plt.title('Gradient Boost CV score vs No of Features')
plt.xlabel("Number of features selected")
plt.ylabel("Cross validation score (nb of correct classifications)")
plt.plot(range(1, len(gb_rfecv.grid_scores_) + 1), gb_rfecv.grid_scores_)
plt.show()

We can see that here 6 input features generate the maximum accuracy.

In [70]:
# Keep only the features whose support_ flag is True
new_features = [name for name, selected in zip(feature_names, gb_rfecv.support_) if selected]
print(new_features)
['Pregnancies', 'Glucose', 'Insulin', 'BMI', 'DiabetesPedigreeFunction', 'Age']

The above 6 features are the most suitable for this model. We can compare the accuracy before and after feature selection.

In [71]:
X_new_gb = diabetes_mod[new_features]
initial_score = cross_val_score(gb_model, X, y, cv=strat_k_fold, scoring='accuracy').mean()
print("Initial accuracy : {} ".format(initial_score))
fe_score = cross_val_score(gb_model, X_new_gb, y, cv=strat_k_fold, scoring='accuracy').mean()
print("Accuracy after Feature Selection : {} ".format(fe_score))
Initial accuracy : 0.762702317405 
Accuracy after Feature Selection : 0.783345927927 

We can see that there is an increase in accuracy after feature selection.

After feature selection Gradient Boosting edges slightly ahead of Logistic Regression, but Logistic Regression remains very close while being simpler and much faster to train, so we will use Logistic Regression for the parameter tuning phase.

Phase 5 — Model Parameter Tuning

Scikit-learn provides sensible default parameters for its models, which give decent accuracy scores. It also gives the user the option to tweak the parameters to further increase the accuracy.

Out of the classifiers we will select Logistic Regression for fine tuning, where we change the model parameters in order to possibly increase the accuracy of the model for the particular data set.

Instead of having to manually search for optimum parameters, we can easily perform an exhaustive search using the GridSearchCV, which does an “exhaustive search over specified parameter values for an estimator”.

It is clearly a very handy tool, but it comes at a computational cost when the number of parameter combinations to search is large.

Important : When using GridSearchCV, some models have parameters that don’t work with each other. Since GridSearchCV tries every combination of the parameters given, two incompatible parameters would prevent it from running. If that happens, a list of parameter grids can be provided to overcome the issue. It is recommended that you read the documentation of the class you are trying to fine-tune to learn how its parameters work with each other. First we import GridSearchCV.

The Logistic Regression model has some hyperparameters that don’t work with each other (for example, the ‘l1’ penalty is only supported by certain solvers). Therefore we provide a list of grids with compatible parameters to fine-tune the model. The following compatible combinations were found through trial and error.

In [72]:
from sklearn.model_selection import GridSearchCV
In [74]:
# Specify parameters
c_values = list(np.arange(1, 10))
param_grid = [
    {'C': c_values, 'penalty': ['l1'], 'solver' : ['liblinear'], 'multi_class' : ['ovr']},
    {'C': c_values, 'penalty': ['l2'], 'solver' : ['liblinear', 'newton-cg', 'lbfgs'], 'multi_class' : ['ovr']}
]

Then we fit GridSearchCV on the data; it performs K-fold cross-validation for every combination of the given parameters. This might take a little while to finish.

In [75]:
grid = GridSearchCV(LogisticRegression(), param_grid, cv=strat_k_fold, scoring='accuracy')
grid.fit(X_new, y)
Out[75]:
GridSearchCV(cv=StratifiedKFold(n_splits=10, random_state=42, shuffle=False),
       error_score='raise-deprecating',
       estimator=LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='warn',
          n_jobs=None, penalty='l2', random_state=None, solver='warn',
          tol=0.0001, verbose=0, warm_start=False),
       fit_params=None, iid='warn', n_jobs=None,
       param_grid=[{'penalty': ['l1'], 'multi_class': ['ovr'], 'C': [1, 2, 3, 4, 5, 6, 7, 8, 9], 'solver': ['liblinear']}, {'penalty': ['l2'], 'multi_class': ['ovr'], 'C': [1, 2, 3, 4, 5, 6, 7, 8, 9], 'solver': ['liblinear', 'newton-cg', 'lbfgs']}],
       pre_dispatch='2*n_jobs', refit=True, return_train_score='warn',
       scoring='accuracy', verbose=0)

After the training and scoring are done, GridSearchCV provides useful attributes for finding the best parameters and the best estimator.

In [76]:
print(grid.best_params_)
print(grid.best_estimator_)
{'penalty': 'l2', 'multi_class': 'ovr', 'C': 1, 'solver': 'liblinear'}
LogisticRegression(C=1, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr',
          n_jobs=None, penalty='l2', random_state=None, solver='liblinear',
          tol=0.0001, verbose=0, warm_start=False)

We can observe that the best hyper parameters are as follows.

{'C': 1, 'multi_class': 'ovr', 'penalty': 'l2', 'solver': 'liblinear'}

We can feed the best parameters to the Logistic Regression model and observe whether its accuracy has increased.

We can conclude that the hyperparameter tuning didn’t increase the accuracy. Maybe the hyperparameters we chose weren’t the influential ones. However, you are welcome to try adding more parameter combinations, for example along the lines of the sketch below.
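A minimal sketch of how the search could be widened, reusing the same GridSearchCV setup as above; the extra values (a wider range for C, class_weight) are only suggestions, not parameters verified to improve this particular model:

# A wider, still-compatible grid for LogisticRegression (liblinear supports both 'l1' and 'l2')
wider_c_values = [0.01, 0.1, 1, 10, 100]
extended_param_grid = [
    {'C': wider_c_values, 'penalty': ['l1', 'l2'], 'solver': ['liblinear'],
     'class_weight': [None, 'balanced'], 'multi_class': ['ovr']},
]
extended_grid = GridSearchCV(LogisticRegression(), extended_param_grid,
                             cv=strat_k_fold, scoring='accuracy')
extended_grid.fit(X_new, y)
print(extended_grid.best_params_, extended_grid.best_score_)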

The most important aspect is the procedure of hyperparameter tuning, not the result itself; in most cases hyperparameter tuning does increase the accuracy. We manage to achieve a classification accuracy of 78.05%, which we can say is quite good.

In [106]:
logreg_new = LogisticRegression(C=1, multi_class='ovr', penalty='l2', solver='liblinear')
initial_score = cross_val_score(logreg_new, X_new, y, cv=strat_k_fold, scoring='accuracy').mean()
print("Final accuracy : {} ".format(initial_score))
Final accuracy : 0.780587711964 
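As a final optional check, the metrics imported earlier (confusion_matrix, classification_report) can be used to look at per-class performance on the held-out test split; a minimal sketch, reusing the train/test split created above and restricting it to the RFECV-selected features:

# Fit the tuned model on the training portion and inspect per-class metrics on the test portion
selected = X_new.columns
logreg_new.fit(X_train[selected], y_train)
y_pred = logreg_new.predict(X_test[selected])
print(confusion_matrix(y_test, y_pred))
print(classification_report(y_test, y_pred))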