Supervised Machine Learning With Scikit-learn and Diabetes dataset - Part 2¶

Exploratory Data Analysis (EDA) and Wrangling, Classification (k-nearest neighbor), Model Fitting, Hyperparameter Tuning, and Performance Evaluation¶

Dataset:¶

Diabetes is a chronic health condition affecting millions worldwide. Early prediction of diabetes can help in timely management and prevention of complications. In this article, we will walk through a Python-based machine learning project for predicting diabetes using a Diabetes Dataset from Kaggle.
We will use Python libraries such as numpy, pandas, and scikit-learn, and the k-nearest neighbours (k-NN) classification algorithm. We will also learn the importance of handling missing data and of tuning the hyperparameters of each model to achieve the best performance and generalizability.

diabetes_prediction_dataset.csv: https://www.kaggle.com/code/mahmoudbahnasy29/diabetes?select=diabetes_prediction_dataset.csv
This file contains medical and demographic data of patients along with their diabetes status, whether positive or negative. It consists of various features such as age, gender, body mass index (BMI), hypertension, heart disease, smoking history, HbA1c level, and blood glucose level. The Dataset can be utilized to construct machine learning models that can predict the likelihood of diabetes in patients based on their medical history and demographic details.

Step 1: Loading the necessary packages¶

In [130]:
## Please uncomment the following line and run pip install to install scikit-plot for visualization on the first run of the notebook.
# Once it is installed, you can comment it out again for subsequent clean runs of the notebook.

# %pip install scikit-plot
In [191]:
# Utility Libraries
import os

# data handling
import pandas as pd
import numpy as np

# Data preprocessing for ML
from sklearn.model_selection import train_test_split
from sklearn.model_selection import KFold, cross_val_score

# model training and testing facilitators
from sklearn.preprocessing import StandardScaler

# Overfitting/underfitting guide
from sklearn.model_selection import GridSearchCV

# ML models to be explored
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier

# Performance measurements
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.metrics import classification_report
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, roc_auc_score, confusion_matrix, ConfusionMatrixDisplay
from sklearn.metrics import roc_curve, auc

# Visualization
import scikitplot as skplt
import matplotlib.pyplot as plt
import plotly.express as px
import seaborn as sns
sns.set()
%matplotlib inline

# ignore all future warnings
from warnings import simplefilter
simplefilter(action='ignore', category= FutureWarning)

print("Imports Done.")
Imports Done.

Step 2 : Data collection¶

Locate correct directory¶

In [192]:
# go to the directory where your files are. The data is located in the csv file -> 'inputs/diabetes_prediction_dataset.csv'
data_dir = "<path/to/your/code/directory/where/this/jupyter/notebook/is/located/>"
#print("Current Working Directory:",os.getcwd())
#print("Current working directory contains the following files:\n",os.listdir("./"))

Read data¶

In [133]:
# load data in a pandas dataframe
df = pd.read_csv(data_dir+"inputs/diabetes_prediction_dataset.csv")
df.head(5) # read first 5 lines of the data
Out[133]:
   gender   age  hypertension  heart_disease  smoking_history    bmi  HbA1c_level  blood_glucose_level  diabetes
0  Female  80.0             0              1            never  25.19          6.6                  140         0
1  Female  54.0             0              0          No Info  27.32          6.6                   80         0
2    Male  28.0             0              0            never  27.32          5.7                  158         0
3  Female  36.0             0              0          current  23.45          5.0                  155         0
4    Male  76.0             1              1          current  20.14          4.8                  155         0
In [134]:
original_dataframe_shape = df.shape
original_dataframe_shape
Out[134]:
(100000, 9)
  • The raw dataset contains 100000 rows, one for each patient, and 9 columns: 8 features plus the target.
  • Supervised machine learning algorithms require labelled data. The last column, diabetes, is the label; 0/1 -> non-diabetic/diabetic patient.

Describe data¶

DataFrame.info() gives information about the data types, columns, null value counts, memory usage, etc.

In [135]:
df.info(verbose=True)
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100000 entries, 0 to 99999
Data columns (total 9 columns):
 #   Column               Non-Null Count   Dtype  
---  ------               --------------   -----  
 0   gender               100000 non-null  object 
 1   age                  100000 non-null  float64
 2   hypertension         100000 non-null  int64  
 3   heart_disease        100000 non-null  int64  
 4   smoking_history      100000 non-null  object 
 5   bmi                  100000 non-null  float64
 6   HbA1c_level          100000 non-null  float64
 7   blood_glucose_level  100000 non-null  int64  
 8   diabetes             100000 non-null  int64  
dtypes: float64(3), int64(4), object(2)
memory usage: 6.9+ MB

The DataFrame.describe() method generates descriptive statistics that summarize the central tendency, dispersion, and shape of a dataset's distribution, excluding NaN values. Note that by default describe() covers only numeric columns: categorical (object) columns are ignored unless the parameter include="all" is passed.

In [136]:
df.describe().T
Out[136]:
count mean std min 25% 50% 75% max
age 100000.0 41.885856 22.516840 0.08 24.00 43.00 60.00 80.00
hypertension 100000.0 0.074850 0.263150 0.00 0.00 0.00 0.00 1.00
heart_disease 100000.0 0.039420 0.194593 0.00 0.00 0.00 0.00 1.00
bmi 100000.0 27.320767 6.636783 10.01 23.63 27.32 29.58 95.69
HbA1c_level 100000.0 5.527507 1.070672 3.50 4.80 5.80 6.20 9.00
blood_glucose_level 100000.0 138.058060 40.708136 80.00 100.00 140.00 159.00 300.00
diabetes 100000.0 0.085000 0.278883 0.00 0.00 0.00 0.00 1.00
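
To illustrate the include="all" behaviour described above, here is a minimal sketch on a made-up three-row frame (the values are hypothetical and not taken from the Kaggle file). It only shows which columns each call reports.

```python
import pandas as pd

# Toy frame (hypothetical values) with one categorical and one numeric column
toy = pd.DataFrame({
    "gender": ["Female", "Male", "Female"],
    "bmi": [25.19, 27.32, 23.45],
})

numeric_only = toy.describe()             # default: numeric columns only
everything = toy.describe(include="all")  # adds count/unique/top/freq for object columns

print(numeric_only.columns.tolist())
print(everything.columns.tolist())
print(everything.loc["top", "gender"], everything.loc["freq", "gender"])
```

For categorical columns the extra rows unique, top, and freq give the number of distinct values, the most frequent value, and its count.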

Preprocessing¶

a. check for missing/ null values¶

In [137]:
print(df.isnull().sum())
gender                 0
age                    0
hypertension           0
heart_disease          0
smoking_history        0
bmi                    0
HbA1c_level            0
blood_glucose_level    0
diabetes               0
dtype: int64

No missing values found.

b. remove duplicates¶

In [138]:
#Detect & Handle Duplicates
df.duplicated().sum()
Out[138]:
3854
In [139]:
df.drop_duplicates(inplace=True, ignore_index=True)
In [140]:
df.shape
Out[140]:
(96146, 9)
In [141]:
df.describe().T
Out[141]:
count mean std min 25% 50% 75% max
age 96146.0 41.794326 22.462948 0.08 24.0 43.00 59.00 80.00
hypertension 96146.0 0.077601 0.267544 0.00 0.0 0.00 0.00 1.00
heart_disease 96146.0 0.040803 0.197833 0.00 0.0 0.00 0.00 1.00
bmi 96146.0 27.321461 6.767716 10.01 23.4 27.32 29.86 95.69
HbA1c_level 96146.0 5.532609 1.073232 3.50 4.8 5.80 6.20 9.00
blood_glucose_level 96146.0 138.218231 40.909771 80.00 100.0 140.00 159.00 300.00
diabetes 96146.0 0.088220 0.283616 0.00 0.0 0.00 0.00 1.00

Exploratory Data Analysis (EDA)¶

Univariate Analysis¶

In [142]:
# Columns treated as categorical: object dtype plus the 0/1 indicator columns
categorical_cols = df.select_dtypes('O').columns.to_list() + ['hypertension', 'heart_disease', 'diabetes']

for col in df.columns:
    fig, axes = plt.subplots(1, 2)
    if col in categorical_cols:
        # Left: count plot; right: pie chart of the same category shares
        sns.countplot(x=df[col], ax=axes[0])
        if col == 'smoking_history':
            axes[0].tick_params(axis='x', rotation=90)
        counts = df[col].value_counts()
        axes[1].pie(x=counts.values, labels=counts.index, autopct='%.2f%%')
    else:
        # Left: histogram with KDE; right: box plot
        sns.histplot(x=df[col], kde=True, ax=axes[0])
        sns.boxplot(x=df[col], ax=axes[1])
    plt.show()
[Figures: one pair of plots per column. Categorical columns (gender, smoking_history, hypertension, heart_disease, diabetes): count plot and pie chart. Numeric columns (age, bmi, HbA1c_level, blood_glucose_level): histogram with KDE and box plot.]

Bivariate Analysis¶

In [143]:
sns.boxplot(x=df['diabetes'],y=df['bmi'])
Out[143]:
<Axes: xlabel='diabetes', ylabel='bmi'>
[Figure: box plot of bmi by diabetes status]
In [144]:
sns.boxplot(x=df['diabetes'],y=df['age'])
Out[144]:
<Axes: xlabel='diabetes', ylabel='age'>
[Figure: box plot of age by diabetes status]

Handling categorical variables¶

We will later use the k-nearest neighbours classification algorithm as our model, which is a distance-based algorithm.
Therefore, we need to encode the two categorical variables gender and smoking_history as numerical values.

In [145]:
# Define your mapping dictionary
mapping_gender = {'Male':1,'Female':2,'Other':0}
mapping_smoking_history = {'No Info':0,'never':1,'ever':2,'former':3,'not current':4,'current':5}

# Apply the mapping
df['gender'] = df['gender'].map(mapping_gender)
df['smoking_history'] = df['smoking_history'].map(mapping_smoking_history)

df.head(5)
Out[145]:
gender age hypertension heart_disease smoking_history bmi HbA1c_level blood_glucose_level diabetes
0 2 80.0 0 1 1 25.19 6.6 140 0
1 2 54.0 0 0 0 27.32 6.6 80 0
2 1 28.0 0 0 1 27.32 5.7 158 0
3 2 36.0 0 0 5 23.45 5.0 155 0
4 1 76.0 1 1 5 20.14 4.8 155 0
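
The integer mapping above gives the categories an order and a spacing (for example, 'current' ends up four units away from 'never'), which a distance-based model will treat as meaningful. One common alternative, not used in this notebook, is one-hot encoding. The sketch below uses pandas.get_dummies on a small made-up frame; the values are hypothetical.

```python
import pandas as pd

# Toy frame (hypothetical values) with one nominal column
toy = pd.DataFrame({
    "smoking_history": ["never", "current", "former", "never"],
    "bmi": [25.19, 23.45, 27.32, 20.14],
})

# One 0/1 indicator column per category; dtype=int keeps the output numeric
encoded = pd.get_dummies(toy, columns=["smoking_history"], dtype=int)
print(encoded)
```

With indicator columns, every pair of distinct categories is the same distance apart, at the cost of a wider feature matrix.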

Data Wrangling (handling missing values)¶

In general, there are three approaches a data scientist may take to handle missing values.

  1. Delete all observations with missing values. This can result in a substantial loss of data❗ and is therefore generally NOT recommended. However, for a large dataset where deleting them looks safe, it is the simplest way to handle missing data.
  2. Substitute missing values with the mean, median, or mode. This keeps all rows at the cost of pulling the variable toward its central value, which is often a good trade-off. However, it is not the best solution when more than half of the values in a variable would have to be substituted.
  3. Give up; No! ❌
  • Our data do not have any NaN. However, after the mapping above, the value 0 in the gender and smoking_history columns stands for a missing record (0 remains a valid value in the other columns).

Missing data:¶

  • smoking_history has some rows with No Info.
  • gender has some rows with Other.

It is better to replace these zeros with NaN: missing entries are then easy to count and can be dropped or replaced with suitable values.

In [146]:
# Work on a deep copy so the original dataframe stays unchanged
df_copy = df.copy(deep=True)
df_copy.shape
Out[146]:
(96146, 9)
In [147]:
# replace zeros with NaNs for columns 'smoking_history' and 'gender'
# (np.nan is used because the np.NaN alias was removed in NumPy 2.0)
df_copy[['smoking_history']] = df_copy[['smoking_history']].replace(0, np.nan)
df_copy[['gender']] = df_copy[['gender']].replace(0, np.nan)
In [148]:
# count total rows with NaNs
df_copy.isna().any(axis=1).sum()
Out[148]:
32899

Drop NaN values¶

In [149]:
# Drop rows where column 'smoking_history' or 'gender' has NaN values
df_cleaned = df_copy.dropna(subset=['smoking_history', 'gender'])
df_cleaned.shape
Out[149]:
(63247, 9)
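
Dropping rows removed roughly a third of the de-duplicated data (96146 -> 63247 rows). As an alternative, approach 2 from the list above (substitution) could look like the following sketch. It uses a made-up column with hypothetical values and fills the gaps with the mode; whether this is appropriate for smoking_history is a modelling decision that is not evaluated here.

```python
import numpy as np
import pandas as pd

# Toy column (hypothetical encoded values) with two missing entries
toy = pd.DataFrame({"smoking_history": [1, np.nan, 5, 1, np.nan, 3]})

mode_value = toy["smoking_history"].mode()[0]  # most frequent non-missing value
toy["smoking_history_imputed"] = toy["smoking_history"].fillna(mode_value)
print(toy)
```

Mode imputation keeps every row but inflates the most common category, so it is worth comparing model performance with both strategies.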
In [150]:
hist = df.hist(figsize = [15, 15])
[Figure: histograms of all columns]
In [151]:
# pair plot on cleaned data
p = sns.pairplot(df_cleaned, hue = 'diabetes', aspect=1.5) # height=2.5
[Figure: pair plot of the cleaned data, colored by diabetes]

The variables exhibit various distribution patterns.
Note that HbA1c_level and blood_glucose_level are the two strongest predictors.
This is expected, since both measurements are directly related to diabetes.

Also, the diabetes graph shows that the data is imbalanced toward outcome 0 (no diabetes). About 9% of patients are diabetic after removing duplicates and about 11% after dropping rows with missing values, so non-diabetics outnumber diabetics by roughly 8 to 10 times.

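Normalization (min-max scaling):¶

This method rescales each feature to a fixed range, typically between 0 and 1. (Standardization, by contrast, rescales each feature to zero mean and unit variance.)
$x' = \frac{x - x_{min}}{x_{max} - x_{min}}$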

In [152]:
df_min_max_scaled = (df_cleaned - df_cleaned.min()) / (df_cleaned.max() - df_cleaned.min())
#df_min_max_scaled.head(5)
In [153]:
df_min_max_scaled.describe().T
Out[153]:
count mean std min 25% 50% 75% max
gender 63247.0 0.603855 0.489099 0.0 0.000000 1.000000 1.000000 1.0
age 63247.0 0.581186 0.244647 0.0 0.386273 0.586673 0.762024 1.0
hypertension 63247.0 0.099135 0.298846 0.0 0.000000 0.000000 0.000000 1.0
heart_disease 63247.0 0.047686 0.213103 0.0 0.000000 0.000000 0.000000 1.0
smoking_history 63247.0 0.310133 0.382747 0.0 0.000000 0.000000 0.500000 1.0
bmi 63247.0 0.224624 0.080276 0.0 0.176658 0.210913 0.258380 1.0
HbA1c_level 63247.0 0.375776 0.199365 0.0 0.236364 0.418182 0.490909 1.0
blood_glucose_level 63247.0 0.271327 0.191983 0.0 0.090909 0.272727 0.359091 1.0
diabetes 63247.0 0.111262 0.314459 0.0 0.000000 0.000000 0.000000 1.0
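
The scaling cell above computes each minimum and maximum over all rows, before any train/test split, so the test rows influence the scaling. A stricter variant learns the minimum and maximum from the training rows only and reuses them on the test rows. The sketch below shows that pattern with scikit-learn's MinMaxScaler on synthetic data; X_demo and y_demo are hypothetical stand-ins, not the diabetes features.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

rng = np.random.default_rng(0)
X_demo = rng.normal(loc=50, scale=10, size=(200, 3))  # synthetic stand-in features
y_demo = rng.integers(0, 2, size=200)                 # synthetic 0/1 labels

X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=1/3, random_state=42)

scaler = MinMaxScaler()
X_tr_scaled = scaler.fit_transform(X_tr)  # min and max learned from the training rows only
X_te_scaled = scaler.transform(X_te)      # same min/max reused; values may fall slightly outside [0, 1]

print(X_tr_scaled.min(axis=0), X_tr_scaled.max(axis=0))
```

Only the feature matrix is scaled here; the 0/1 target does not need scaling.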

Feature Importance:¶

A quick, visual way to investigate the importance of the features is to fit a RandomForestClassifier and plot its feature importances with skplt.estimators.plot_feature_importances. The plot shows how much each variable, relative to the other features, contributes to the classifier's prediction of diabetes.
Let's plot the features.

In [154]:
df2 = df_min_max_scaled
feature_names = df2.columns[:-1]

randfor = RandomForestClassifier()
randfor.fit(df2.drop(columns = "diabetes", axis=1),df2["diabetes"])

sp = skplt.estimators.plot_feature_importances(randfor, feature_names=feature_names, figsize=(10, 5), x_tick_rotation=90)
plt.show()
[Figure: bar chart of random forest feature importances]
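
If scikit-plot is not available, the same information can be read directly from the fitted forest's feature_importances_ attribute. The sketch below does this on synthetic data generated with make_classification; the feature names are hypothetical placeholders, and the numbers it prints say nothing about the diabetes data.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in data; names are placeholders, not the diabetes columns
X_demo, y_demo = make_classification(
    n_samples=500, n_features=5, n_informative=2, n_redundant=0, random_state=0
)
names = [f"feature_{i}" for i in range(X_demo.shape[1])]

forest = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_demo, y_demo)

# Impurity-based importances; they sum to 1 across features
importances = pd.Series(forest.feature_importances_, index=names).sort_values(ascending=False)
print(importances)
```

Calling importances.plot.bar() on the resulting Series would give a chart similar to the one above.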

The result supports the assumption that blood_glucose_level and HbA1c_level are very strong predictors for the diagnosis of diabetes.

Feature Correlation:¶

Lastly, one should examine the correlations between variables, which measure the strength of association between two quantities. The correlation coefficient ranges from -1 to +1: values near +1 or -1 indicate a strong positive or negative association, and 0 means no correlation.
The Pearson, Kendall rank, and Spearman's rank correlation coefficients are currently computed using pairwise complete observations.

A heat map is a two-dimensional representation of information with the help of colors. Heat maps can help the user visualize simple or complex information.
A heatmap with annotations is a nice way of visualizing a correlation matrix.

In [155]:
# Spearman’s rank correlation coefficient
correlation_matrix = df2.corr(method='spearman') # kendall, pearson
plt.figure(figsize=(8, 5.5)) # Adjust figure size as needed
sns.heatmap(correlation_matrix, annot=True, fmt=".2f", square=False, cmap='Blues',) #RdYlGn/jet/coolwarm (vibrant), viridis/cividis (for colorblind), Blues (sequential)
plt.title("Spearman's rank Correlation Matrix")
plt.show()
[Figure: heatmap of the Spearman's rank correlation matrix]

Observations:

  1. All variables except gender show a positive correlation with the target variable diabetes.
  2. Most variables exhibit low correlations with each other and with diabetes. However, blood_glucose_level and HbA1c_level are more strongly correlated with the target diabetes than the other variables. age is also moderately positively correlated with diabetes.
  3. The features are mostly uncorrelated with one another.

The observations seem reasonable.

Step 3: Pre-setting and Modeling Strategy¶

Now comes the Machine Learning model part.

Data splitting¶

First, separate the data into a 2D feature array and a label vector, then split both into training and testing sets, as is the standard approach in ML.

The train_test_split function from the sklearn.model_selection module is commonly used to divide a dataset into training and testing sets, so that the model is tested on data points it has not seen rather than on the points it was trained with. This is a crucial step in machine learning to evaluate a model's performance on unseen data.

Cross Validation: When the data is split into training and testing sets, a specific type of data point may end up entirely in one portion, which makes the measured performance unreliable. Cross-validation repeats the split several times so that every observation is used for both training and validation, which helps detect over-fitting and under-fitting.

X and y: These are the input features and the target variable of your dataset, respectively. Your model will never see the test-set labels during training; they are hidden from the model and used only to evaluate its performance on unseen data.

test_size: The proportion of the dataset to include in the test split (e.g., 0.2 for 20%).

random_state: Controls the shuffling applied to the data for reproducibility.

In [156]:
X = df_min_max_scaled.drop(columns = "diabetes", axis=1) # data with feature columns
y = df_min_max_scaled["diabetes"] # target labels 

#split data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 1/3, random_state = 42) 

X_train -> input features used to train the model
y_train -> labels of the data in the train set
X_test -> test set input features (never seen by the model during training; used only for prediction)
y_test -> test set target labels (you will evaluate your predictions by comparing them with this set)

In [157]:
[X_train.shape, X_test.shape, y_train.shape, y_test.shape] 
Out[157]:
[(42164, 8), (21083, 8), (42164,), (21083,)]
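
Because the classes are imbalanced, a purely random split can leave the train and test sets with slightly different shares of diabetic patients. train_test_split accepts a stratify argument that keeps the class proportions the same in both parts. The split above does not use it; the sketch below shows the option on synthetic labels with about 11% positives (X_demo and y_demo are hypothetical).

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic labels with roughly the imbalance seen here (11% positives)
y_demo = np.array([1] * 110 + [0] * 890)
X_demo = np.arange(1000).reshape(-1, 1)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=1/3, random_state=42, stratify=y_demo
)

# Share of positives in each part; both stay close to 0.11
print(y_tr.mean(), y_te.mean())
```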

Step 4: ML Approach¶

The following steps summarize our approach to fit a particular model. We will repeat it for all the models:

  1. Decide on the training parameters and create and train the model

    • Create the model by using an appropriate classifier from scikit-learn. E.g. model = KNeighborsClassifier(n_neighbors=k)
    • call model.fit(X_train, y_train) to train a model
  2. predict the label pred_model by calling model.predict(X_test)

  3. Evaluate the model performance on test data by:

    • accuracy_score(y_test, pred_model)
    • classification_report(y_test, pred_model)
    • confusion_matrix(y_test, pred_model)

The strategy is identical for all the models I will attempt to fit: i) k-nearest neighbor (k-NN), ii) logistic regression, iii) decision tree, and iv) random forest. Therefore, I will discuss the key issues only for k-nearest neighbor and intentionally avoid commenting on the other models, because the logic remains the same. However, the parameter grid (parameters_model) should be unique to each classifier because the parameters differ from model to model. I will comment on the performance and draw conclusions in the last part of the article.
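
Since the three steps are the same for every model, they can be wrapped in one helper. The function below is a sketch of that idea, not code from this notebook: fit_and_evaluate is a hypothetical name, and the demonstration runs on synthetic data from make_classification.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier


def fit_and_evaluate(model, X_train, y_train, X_test, y_test):
    """Train `model`, predict on the test split, and return the three summaries."""
    model.fit(X_train, y_train)    # step 1: train
    pred = model.predict(X_test)   # step 2: predict
    return {                       # step 3: evaluate
        "accuracy": accuracy_score(y_test, pred),
        "report": classification_report(y_test, pred),
        "confusion_matrix": confusion_matrix(y_test, pred),
    }


# Synthetic stand-in data, only to show the call
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=1/3, random_state=42)

results = fit_and_evaluate(KNeighborsClassifier(n_neighbors=5), X_tr, y_tr, X_te, y_te)
print(results["accuracy"])
print(results["confusion_matrix"])
```

Any scikit-learn classifier can be passed in place of KNeighborsClassifier, which keeps the comparison between models consistent.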

Model 1: k - Nearest Neighbour¶

How to choose the value of k?
Choosing the optimal value of 'k' in a K-Nearest Neighbors (KNN) classifier in Python is a crucial step for achieving good model performance. While there's no single "best" method, several approaches are commonly employed.

In [158]:
# # Elbow method
# error_rates = [] # Initialize an empty list to store error rates
# k_values = range(1, 51) # # Iterate through a range of K values
# for k in k_values:
#     knn = KNeighborsClassifier(n_neighbors=k) # Initialize KNN classifier with the current K
#     knn.fit(X_train, y_train) # Fit the model to the training data
#     predictions = knn.predict(X_test) # Make predictions on the test data
#     error_rate = np.mean(predictions != y_test) # Calculate the error rate (misclassification rate)
#     error_rates.append(error_rate) # Append the error rate to the list

# # Plot the error rate vs. K value
# plt.figure(figsize=(8, 6))
# plt.plot(k_values, error_rates, color='blue', linestyle='dashed', marker='o',markerfacecolor='red', markersize=8)
# plt.title('Error Rate vs. K Value (Elbow Method for KNN)')
# plt.xlabel('K Value')
# plt.ylabel('Error Rate')
# plt.grid(True)
# plt.show()
In [159]:
# # Square root of N method
# int(np.sqrt(df_cleaned.shape[0]))

Hyperparameter Tuning¶

GridSearchCV in scikit-learn for determining the value of k¶

5-fold cross-validation on train set¶

Here is an example using GridSearchCV in scikit-learn. GridSearchCV splits X_train and y_train into 5 folds. For each candidate value, the model is trained on 4 folds and validated on the remaining one, rotating so that every fold serves once as the validation set. The candidate with the best mean validation score among those listed in the parameter grid (parameters_model in the text above, param_grid in the code) is selected.

In [160]:
knn = KNeighborsClassifier()
param_grid = {'n_neighbors': range(1, 51, 2)} # Test odd k values from 1 to 49
grid_search = GridSearchCV(knn, param_grid, cv=5, scoring='accuracy')
grid_search.fit(X_train, y_train)

best_k = grid_search.best_params_['n_neighbors']
best_score = grid_search.best_score_
print(f"Best k: {best_k}, Best accuracy: {best_score}")
Best k: 13, Best accuracy: 0.9492220378242069
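
Besides best_params_ and best_score_, a fitted GridSearchCV keeps the per-candidate results in cv_results_, which helps to see how close the runner-up values of k are. The sketch below shows how to read it; it uses synthetic data and a shorter grid, so its numbers are unrelated to the result above.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in data and a short grid, for illustration only
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=0)

search = GridSearchCV(
    KNeighborsClassifier(), {"n_neighbors": [1, 3, 5, 7]}, cv=5, scoring="accuracy"
)
search.fit(X_demo, y_demo)

# One row per candidate: mean and spread of the 5 validation scores
summary = pd.DataFrame(search.cv_results_)[
    ["param_n_neighbors", "mean_test_score", "std_test_score"]
]
print(summary)
print(search.best_params_)
```

If several candidates score within one standard deviation of the best, the choice between them is largely arbitrary.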

Model Training¶

In [161]:
# Step 1. Initialize and train the KNN classifier
k = best_k
model_knn = KNeighborsClassifier(n_neighbors=k)
model_knn.fit(X_train, y_train)
Out[161]:
KNeighborsClassifier(n_neighbors=13)

Cross Validation¶

In [162]:
# Define the cross-validation strategy 
kf = KFold(n_splits=5, shuffle=True, random_state=42)

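# Perform 5-fold cross-validation
# 'cv=kf' uses the shuffled 5-fold splitter defined above
# 'scoring="accuracy"' specifies the evaluation metric
# Note: the fold scores are rounded to 2 decimals before the mean and standard
# deviation are computed, so the reported spread is coarser than the true one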
scores = np.round(cross_val_score(model_knn, X_train, y_train, cv=kf, scoring='accuracy'), 2)

print("Cross-validation scores for each fold on train data:", scores)
print("Mean accuracy:", scores.mean())
print("Standard deviation of accuracy:", scores.std())
Cross-validation scores for each fold on train data: [0.95 0.95 0.95 0.95 0.95]
Mean accuracy: 0.95
Standard deviation of accuracy: 0.0

Prediction Accuracy on Test data (unseen by model)¶

In [163]:
# Make predictions on the test set using the model
pred_knn = model_knn.predict(X_test)
[y_test.shape, pred_knn.shape]
Out[163]:
[(21083,), (21083,)]

Evaluation (model Performance Analysis)¶

1. Confusion Matrix¶

A confusion matrix is a table that summarizes the performance of a classification model. For a binary classifier it holds the counts of true positive, true negative, false positive, and false negative predictions, providing a detailed view of how well the model distinguishes between the classes.

2. classification_report:¶

A method to see the greater picture of the model's performance, which includes precision, recall, and F1-score.

Precision Score:¶

  • Precision (aka positive predictive value (PPV)) - Accuracy of positive predictions.
  • $PPV = \frac{TP}{TP+FP}$
  • It is similar to accuracy, but focuses only on data the model predicted to be positive, i.e. diabetes = 1. Referring to a confusion matrix, precision of 1 means there were no false positives.

Recall Score:¶

  • Recall (aka sensitivity or true positive rate (TPR)) - Fraction of actual positives that were correctly identified.
  • It answers the question of how complete the results are, i.e. did the model miss any positive cases, and to what extent?
  • $TPR = \frac{TP}{TP+FN}$
  • A recall above 0.5 means the model finds more than half of the actual positives; how much recall is enough depends on the cost of a missed case.
  • In our case, low recall would mean the model incorrectly classified a lot of individuals with diabetes as healthy ones.

F1 Score:¶

  • F1 Score (aka F-Score or F-Measure) – A helpful metric for comparing two classifiers.
  • F1 Score takes into account precision and the recall.
  • It is computed as the harmonic mean of precision and recall.
  • $F1 = \frac{2 * precision * recall}{precision + recall}$

One should ask, which of the two metrics is superior? In fact, it really depends!

For example, imagine cancer diagnostics. Would you rather classify a few more patients as false positives and, after a more precise examination, conclude they had no cancer, or would you rather let patients with cancer go undetected as healthy individuals? In this particular case, the model should minimize $FN$ in the confusion matrix. Consequently, recall, i.e. TPR, should be closer to 1. Lastly, there is always a trade-off between precision and recall: raising one usually lowers the other.
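
One practical way to move along this trade-off is to change the decision threshold applied to the predicted probabilities instead of using the default of 0.5. The sketch below illustrates the effect on synthetic, imbalanced data; LogisticRegression is just a convenient stand-in for any classifier with predict_proba, and the thresholds are arbitrary examples.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_score, recall_score
from sklearn.model_selection import train_test_split

# Synthetic data with about 10% positives (a stand-in, not the diabetes data)
X_demo, y_demo = make_classification(n_samples=2000, weights=[0.9, 0.1], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=1/3, random_state=42, stratify=y_demo
)

clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
proba = clf.predict_proba(X_te)[:, 1]  # estimated probability of the positive class

recalls = []
for threshold in (0.5, 0.3, 0.1):
    pred = (proba >= threshold).astype(int)  # lower threshold -> more predicted positives
    recalls.append(recall_score(y_te, pred))
    print(threshold, precision_score(y_te, pred, zero_division=0), recalls[-1])
```

Lowering the threshold can only add predicted positives, so recall never decreases, while precision typically falls.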

1. Confusion Matrix¶

In [164]:
pd.crosstab(y_test, pred_knn, rownames=['True'], colnames=['Predicted'], margins = True)
Out[164]:
Predicted    0.0   1.0    All
True
0.0        18780    64  18844
1.0          934  1305   2239
All        19714  1369  21083
In [165]:
from sklearn.metrics import confusion_matrix
# Store the result under a new name so the confusion_matrix function is not overwritten
cm = confusion_matrix(y_test, pred_knn)
p = sns.heatmap(pd.DataFrame(cm), annot=True, cmap="YlGnBu" ,fmt='g')
plt.title('Confusion matrix', y=1.1)
plt.ylabel('Actual label')
plt.xlabel('Predicted label')
Out[165]:
Text(0.5, 20.049999999999997, 'Predicted label')
[Figure: heatmap of the k-NN confusion matrix, actual vs. predicted label]

2. Classification Report¶

In [166]:
from sklearn.metrics import classification_report
print(classification_report(y_test,pred_knn))
              precision    recall  f1-score   support

         0.0       0.95      1.00      0.97     18844
         1.0       0.95      0.58      0.72      2239

    accuracy                           0.95     21083
   macro avg       0.95      0.79      0.85     21083
weighted avg       0.95      0.95      0.95     21083
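
As a check on the formulas, the class 1.0 row of the report can be reproduced by hand from the four counts in the crosstab above (18780 true negatives, 64 false positives, 934 false negatives, 1305 true positives). The snippet is plain arithmetic and needs no libraries.

```python
# Counts for the positive class (diabetes = 1), read from the crosstab above
tn, fp, fn, tp = 18780, 64, 934, 1305

precision = tp / (tp + fp)                           # 1305 / 1369
recall = tp / (tp + fn)                              # 1305 / 2239
f1 = 2 * precision * recall / (precision + recall)   # harmonic mean of the two
accuracy = (tp + tn) / (tp + tn + fp + fn)           # 20085 / 21083

print(round(precision, 2), round(recall, 2), round(f1, 2), round(accuracy, 2))
# prints: 0.95 0.58 0.72 0.95
```

The high accuracy is driven by the large negative class; the recall of 0.58 shows that 934 of the 2239 diabetic patients in the test set were missed.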

3. Accuracy and Error¶

In [167]:
print("KNN Accuracy: {}".format(np.round(accuracy_score(y_test, pred_knn), 2)))
print("Root Mean Squared Error: {}".format(np.round(np.sqrt(mean_squared_error(y_test, pred_knn)), 2)))
KNN Accuracy: 0.95
Root Mean Squared Error: 0.22

4. ROC - AUC:¶

The ROC (Receiver Operating Characteristic) curve shows how well the model can distinguish between two classes (e.g. whether a patient has a disease or not) across all decision thresholds. The area under the curve (AUC) summarizes this: better models have an AUC close to 1, whereas a model that cannot distinguish between the classes scores around 0.5.

In [168]:
from sklearn.metrics import roc_curve, auc
y_pred_proba = model_knn.predict_proba(X_test)[:,1]
fpr, tpr, thresholds = roc_curve(y_test, y_pred_proba)
In [169]:
roc_auc = np.round(auc(fpr, tpr), 2)
roc_auc
Out[169]:
0.94
In [170]:
[y_test.shape, y_pred_proba.shape]
Out[170]:
[(21083,), (21083,)]
In [171]:
# plt.plot([0,1],[0,1],'k--')
# plt.plot(fpr,tpr, label='Knn')
# plt.xlabel('FPR')
# plt.ylabel('TPR')
# plt.title('ROC curve')
# plt.show()
In [172]:
plt.figure()
lw = 2
plt.plot(
    fpr,
    tpr,
    color="darkorange",
    lw=lw,
    label="ROC curve (area = %0.2f)" % roc_auc,
)
plt.plot([0, 1], [0, 1], color="navy", lw=lw, linestyle="--")
plt.xlabel("False Positive Rate")
plt.ylabel("True Positive Rate")
plt.title("ROC curve")
plt.legend(loc="lower right")
plt.show()
[Figure: ROC curve for the k-NN model, true positive rate vs. false positive rate]
In [173]:
# import scikitplot as skplt

# skplt.metrics.plot_roc_curve(y_test, y_pred_proba)
# plt.show()
In [174]:
#Area under ROC curve
from sklearn.metrics import roc_auc_score
roc_auc_score(y_test,y_pred_proba)
Out[174]:
0.9431043169706584
In [175]:
# Evaluate the model using scikit-learn metrics
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
from sklearn.metrics import mean_squared_error, r2_score, roc_auc_score
from sklearn.metrics import confusion_matrix, classification_report

pred_knn = model_knn.predict(X_test)

accuracy = accuracy_score(y_test, pred_knn)
precision = precision_score(y_test, pred_knn)
recall = recall_score(y_test, pred_knn)
# Named auc_knn so the imported auc() function is not overwritten
auc_knn = roc_auc_score(y_test, pred_knn)
cr = classification_report(y_test, pred_knn)
cm = confusion_matrix(y_test, pred_knn)

---------------------------------------------------------- PART 2 --------------------------------------------------------------------------¶

Model 2. Logistic Regression¶

Logistic regression is a machine learning algorithm used for classification. It predicts the probability of a binary outcome (like Yes/No, 0/1, Spam/Not Spam) from the input variables by fitting an S-shaped curve (the sigmoid, or logistic, function) that maps the inputs to a probability between 0 and 1. A threshold (often 0.5) is then applied to classify the result. It is similar to linear regression but avoids the problem of a linear model predicting values outside the 0-1 range for binary outcomes, which makes it well suited to predicting the likelihood of an event.
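
The mapping from a linear score to a probability and then to a class label can be sketched in a few lines of NumPy. The scores in z are made-up numbers standing in for the model's linear combination of the inputs; this is an illustration of the sigmoid, not scikit-learn's internal code.

```python
import numpy as np


def sigmoid(z):
    """Logistic function: maps any real number to the interval (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))


z = np.array([-4.0, 0.0, 4.0])               # hypothetical linear scores
probabilities = sigmoid(z)                   # about 0.018, 0.5, 0.982
labels = (probabilities >= 0.5).astype(int)  # apply the 0.5 threshold

print(probabilities.round(3), labels)
```

A score of 0 corresponds exactly to probability 0.5, which is why the default decision boundary is where the linear score changes sign.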

Model Training¶

  1. Import the LogisticRegression class from sklearn.linear_model.
  2. Create an instance of the LogisticRegression model.
  3. Fit the model to your training data using the fit() method, providing the features (X_train) and the target variable (y_train).
In [176]:
# Initialize and train the Logistic Regression model
model_lr = LogisticRegression()
model_lr.fit(X_train, y_train)
Out[176]:
LogisticRegression()

Prediction¶

In [177]:
# Make predictions
pred_lr = model_lr.predict(X_test)

Evaluation¶

In [178]:
# Step 3. Evaluate the model
print("Model: Logistic Regression")
print("Accuracy: {}".format(np.round(accuracy_score(y_test, pred_lr), 2)))
print("Root Mean Squared Error: {}".format(np.round(np.sqrt(mean_squared_error(y_test, pred_lr)), 2)))
print("R-squared: {}".format(np.round(r2_score(y_test, pred_lr), 2)))
print("Precision: {}".format(np.round(precision_score(y_test, pred_lr), 2)))
print("Recall: {}".format(np.round(recall_score(y_test, pred_lr), 2)))
print("AUC: {}".format(np.round(roc_auc_score(y_test, pred_lr), 2)))
print("Classification Report:\n", classification_report(y_test, pred_lr))
print("Confusion Matrix:")
disp = ConfusionMatrixDisplay(confusion_matrix=confusion_matrix(y_test, pred_lr))
disp.plot(cmap='viridis')
plt.grid(False)
plt.show()
Model: Logistic Regression
Accuracy: 0.95
Root Mean Squared Error: 0.22
R-squared: 0.48
Precision: 0.86
Recall: 0.64
AUC: 0.81
Classification Report:
               precision    recall  f1-score   support

         0.0       0.96      0.99      0.97     18844
         1.0       0.86      0.64      0.73      2239

    accuracy                           0.95     21083
   macro avg       0.91      0.81      0.85     21083
weighted avg       0.95      0.95      0.95     21083

Confusion Matrix:
[Figure: confusion matrix for the logistic regression model]
In [179]:
# ROC plot
from sklearn.metrics import roc_curve, auc
y_pred_proba = model_lr.predict_proba(X_test)[:,1]
fpr, tpr, thresholds = roc_curve(y_test, y_pred_proba)
roc_auc = np.round(auc(fpr, tpr), 2)

plt.figure()
lw = 2
plt.plot(
    fpr,
    tpr,
    color="darkorange",
    lw=lw,
    label="ROC curve (area = %0.2f)" % roc_auc,
)
plt.plot([0, 1], [0, 1], color="navy", lw=lw, linestyle="--")
plt.xlabel("False Positive Rate")
plt.ylabel("True Positive Rate")
plt.title("ROC curve")
plt.legend(loc="lower right")
plt.show()
[Figure: ROC curve for the logistic regression model, true positive rate vs. false positive rate]
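
Note that the "AUC" printed in the evaluation cell is computed from hard 0/1 predictions, whereas the ROC plot uses predicted probabilities. With hard labels the ROC curve has a single operating point, and the area reduces to the balanced accuracy, i.e. the mean of the recall on the two classes. This is why the printed 0.81 equals the macro-average recall in the report. The sketch below demonstrates the difference on synthetic data (all names and values are illustrative).

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import balanced_accuracy_score, roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic imbalanced data, a stand-in for the diabetes features
X_demo, y_demo = make_classification(n_samples=2000, weights=[0.9, 0.1], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=1/3, random_state=42, stratify=y_demo
)
clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)

hard_labels = clf.predict(X_te)
auc_from_labels = roc_auc_score(y_te, hard_labels)                    # one operating point only
auc_from_scores = roc_auc_score(y_te, clf.predict_proba(X_te)[:, 1])  # uses the full ranking
balanced_acc = balanced_accuracy_score(y_te, hard_labels)

print(auc_from_labels, balanced_acc, auc_from_scores)
```

To compare models by ROC AUC in the usual sense, pass the positive-class probabilities from predict_proba, as the ROC plotting cells do.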

Model 3. Decision Tree¶

A decision tree in Python is a supervised machine learning algorithm used for both classification and regression tasks. It operates by building a model in the form of a tree structure, where each internal node represents a test on an attribute, each branch represents the outcome of the test, and each leaf node represents a class label (in classification) or a numerical value (in regression).

Key Concepts:

Root Node: The topmost node in the tree, representing the initial decision or feature test.

Internal Nodes: Nodes that represent a test on a feature and have branches leading to other nodes.

Leaf Nodes: Terminal nodes that represent the final outcome or prediction.

Splitting: The process of dividing data into subsets based on feature values at each node.

Information Gain/Gini Impurity: Metrics used to determine the "best" split at each node, aiming to create purer child nodes.
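
To make the notion of a "purer" child node concrete, the Gini impurity of a node is $1 - \sum_k p_k^2$, where $p_k$ is the share of class $k$ in the node. The sketch below computes it for a made-up parent node and a perfect split; gini_impurity is a hypothetical helper, not a scikit-learn function.

```python
import numpy as np


def gini_impurity(labels):
    """Gini impurity 1 - sum(p_k^2) of a 1-D array of class labels."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)


parent = np.array([0, 0, 0, 0, 1, 1, 1, 1])  # evenly mixed node
left, right = parent[:4], parent[4:]         # a perfect split on some hypothetical feature

# Impurity after the split: child impurities weighted by child size
weighted_children = (
    len(left) * gini_impurity(left) + len(right) * gini_impurity(right)
) / len(parent)

print(gini_impurity(parent), weighted_children)
# prints: 0.5 0.0
```

The tree-growing algorithm picks, at each node, the split that reduces this weighted impurity the most.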

Model Training¶

In [180]:
# Instantiate the Decision Tree Classifier
model_dt = DecisionTreeClassifier(max_depth=3, random_state=42)

* How to determine max_depth?¶

  1. Cross-Validation with Grid Search or Randomized Search (using scikit-learn and GridSearchCV)
  2. Heuristic Rules (Initial Guidance)
  3. Monitoring Training and Validation Accuracy
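
As a rough illustration of the first option, the sketch below tunes `max_depth` with `GridSearchCV`. It runs on synthetic data from `make_classification` so that it is self-contained; the variable names (`X_demo`, `y_demo`, `param_grid`, `search`), the candidate depths, and the choice of recall as the scoring metric are assumptions made for this example. In the notebook you would fit the search on `X_train` and `y_train` instead.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic, imbalanced stand-in for the diabetes data (about 10% positives)
X_demo, y_demo = make_classification(
    n_samples=1000, n_features=8, weights=[0.9, 0.1], random_state=42
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, stratify=y_demo, random_state=42
)

# Candidate depths to compare (an arbitrary illustrative grid)
param_grid = {"max_depth": [2, 3, 4, 5, 6, 8, 10]}

# 5-fold cross-validation, scored by recall of the positive class
search = GridSearchCV(
    DecisionTreeClassifier(random_state=42),
    param_grid,
    scoring="recall",
    cv=5,
)
search.fit(X_tr, y_tr)

print("Best max_depth:", search.best_params_["max_depth"])
print("Best cross-validated recall:", round(search.best_score_, 2))
```

Comparing `search.cv_results_["mean_test_score"]` across depths also covers the third option: a depth beyond which the validation score stops improving is a sign of overfitting.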
In [181]:
# Train the model
model_dt.fit(X_train, y_train)
Out[181]:
DecisionTreeClassifier(max_depth=3, random_state=42)
In [182]:
# Make predictions
pred_dt = model_dt.predict(X_test)

Evaluation¶

In [183]:
# Step 3. Evaluate the model
print("Model: Decision Tree")
print("Accuracy: {}".format(np.round(accuracy_score(y_test, pred_dt), 2)))
# RMSE and R-squared are regression metrics; on 0/1 labels the MSE is simply the misclassification rate
print("Root Mean Squared Error: {}".format(np.round(np.sqrt(mean_squared_error(y_test, pred_dt)), 2)))
print("R-squared: {}".format(np.round(r2_score(y_test, pred_dt), 2)))
print("Precision: {}".format(np.round(precision_score(y_test, pred_dt), 2)))
print("Recall: {}".format(np.round(recall_score(y_test, pred_dt), 2)))
# This AUC is computed from hard 0/1 predictions, so it equals the mean of the two class recalls;
# the ROC plot below uses predicted probabilities and may report a different area
print("AUC: {}".format(np.round(roc_auc_score(y_test, pred_dt), 2)))
print("Classification Report:\n", classification_report(y_test, pred_dt))
# The confusion matrix itself is drawn as a plot
print("Confusion Matrix:")
disp = ConfusionMatrixDisplay(confusion_matrix=confusion_matrix(y_test, pred_dt))
disp.plot(cmap='viridis')
plt.grid(False)
plt.show()
Model: Decision Tree
Accuracy: 0.96
Root Mean Squared Error: 0.19
R-squared: 0.62
Precision: 1.0
Recall: 0.66
AUC: 0.83
Classification Report:
               precision    recall  f1-score   support

         0.0       0.96      1.00      0.98     18844
         1.0       1.00      0.66      0.79      2239

    accuracy                           0.96     21083
   macro avg       0.98      0.83      0.89     21083
weighted avg       0.96      0.96      0.96     21083

Confusion Matrix:
[Figure: confusion matrix plot for the decision tree model]
In [184]:
# ROC plot
from sklearn.metrics import roc_curve, auc
y_pred_proba = model_dt.predict_proba(X_test)[:,1]
fpr, tpr, thresholds = roc_curve(y_test, y_pred_proba)
roc_auc = np.round(auc(fpr, tpr), 2)

plt.figure()
lw = 2
plt.plot(
    fpr,
    tpr,
    color="darkorange",
    lw=lw,
    label="ROC curve (area = %0.2f)" % roc_auc,
)
plt.plot([0, 1], [0, 1], color="navy", lw=lw, linestyle="--")
plt.xlabel("False Positive Rate")
plt.ylabel("True Positive Rate")
plt.title("ROC curve")
plt.legend(loc="lower right")
plt.show()
[Figure: ROC curve for the decision tree model]

Model 4. Random Forest¶

A random forest classification model is an ensemble machine learning method that builds multiple decision trees during training and combines their predictions through majority voting to determine the final class. This approach improves predictive accuracy and reduces the risk of overfitting compared to a single decision tree model.
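
The voting idea can be checked on a small example. One detail worth knowing: scikit-learn's `RandomForestClassifier` combines its trees by averaging their predicted class probabilities (soft voting) rather than counting one hard vote per tree. The sketch below reproduces the forest's prediction by hand from its individual trees; it uses synthetic data, and the names `X_demo`, `y_demo`, `forest`, and `manual_pred` are assumptions for this illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic binary classification data (stand-in for the diabetes features)
X_demo, y_demo = make_classification(n_samples=500, n_features=8, random_state=0)

forest = RandomForestClassifier(n_estimators=25, random_state=0)
forest.fit(X_demo, y_demo)

# Collect the class probabilities predicted by every individual tree
tree_probas = np.stack([tree.predict_proba(X_demo) for tree in forest.estimators_])

# Average over the trees, then pick the class with the highest mean probability
mean_proba = tree_probas.mean(axis=0)
manual_pred = forest.classes_[np.argmax(mean_proba, axis=1)]

print("Number of trees:", len(forest.estimators_))
print("Matches forest.predict_proba:", np.allclose(mean_proba, forest.predict_proba(X_demo)))
print("Matches forest.predict:", np.array_equal(manual_pred, forest.predict(X_demo)))
```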

In [ ]:
# Import the module from scikit-learn
from sklearn.ensemble import RandomForestClassifier

# Instantiate the Random Forest Classifier
model_rf = RandomForestClassifier(n_estimators=100, random_state=42)  # n_estimators is the number of trees

# Train the model
model_rf.fit(X_train, y_train)

# Prediction
pred_rf = model_rf.predict(X_test)

# Evaluate the model
print(accuracy_score(y_test, pred_rf))
print(classification_report(y_test, pred_rf))
In [185]:
# Instantiate the Random Forest Classifier
model_rf = RandomForestClassifier(n_estimators=100, random_state=42) # n_estimators is number of trees
In [186]:
# Train the model 
model_rf.fit(X_train, y_train)
Out[186]:
RandomForestClassifier(random_state=42)
In [187]:
#Prediction
pred_rf = model_rf.predict(X_test)

Evaluation¶

In [188]:
# Step 3. Evaluate the model
print("Model: Random Forest")
print("Accuracy: {}".format(np.round(accuracy_score(y_test, pred_rf), 2)))
# RMSE and R-squared are regression metrics; on 0/1 labels the MSE is simply the misclassification rate
print("Root Mean Squared Error: {}".format(np.round(np.sqrt(mean_squared_error(y_test, pred_rf)), 2)))
print("R-squared: {}".format(np.round(r2_score(y_test, pred_rf), 2)))
print("Precision: {}".format(np.round(precision_score(y_test, pred_rf), 2)))
print("Recall: {}".format(np.round(recall_score(y_test, pred_rf), 2)))
# This AUC is computed from hard 0/1 predictions, so it equals the mean of the two class recalls;
# the ROC plot below uses predicted probabilities and may report a different area
print("AUC: {}".format(np.round(roc_auc_score(y_test, pred_rf), 2)))
print("Classification Report:\n", classification_report(y_test, pred_rf))
# The confusion matrix itself is drawn as a plot
print("Confusion Matrix:")
disp = ConfusionMatrixDisplay(confusion_matrix=confusion_matrix(y_test, pred_rf))
disp.plot(cmap='viridis')
plt.grid(False)
plt.show()
Model: Random Forest
Accuracy: 0.96
Root Mean Squared Error: 0.2
R-squared: 0.6
Precision: 0.94
Recall: 0.68
AUC: 0.84
Classification Report:
               precision    recall  f1-score   support

         0.0       0.96      1.00      0.98     18844
         1.0       0.94      0.68      0.79      2239

    accuracy                           0.96     21083
   macro avg       0.95      0.84      0.88     21083
weighted avg       0.96      0.96      0.96     21083

Confusion Matrix:
[Figure: confusion matrix plot for the random forest model]
In [189]:
# ROC plot
from sklearn.metrics import roc_curve, auc
y_pred_proba = model_rf.predict_proba(X_test)[:,1]
fpr, tpr, thresholds = roc_curve(y_test, y_pred_proba)
roc_auc = np.round(auc(fpr, tpr), 2)

plt.figure()
lw = 2
plt.plot(
    fpr,
    tpr,
    color="darkorange",
    lw=lw,
    label="ROC curve (area = %0.2f)" % roc_auc,
)
plt.plot([0, 1], [0, 1], color="navy", lw=lw, linestyle="--")
plt.xlabel("False Positive Rate")
plt.ylabel("True Positive Rate")
plt.title("ROC curve")
plt.legend(loc="lower right")
plt.show()
[Figure: ROC curve for the random forest model]

Best Model Selection¶

I have already mentioned precision and recall but have not yet discussed accuracy. In fact, accuracy may not be the best metric for choosing the right model. Consider test data with 100 individuals, of whom 99 are healthy and only 1 has diabetes. Also, assume the model correctly classified the 99 healthy people but completely failed to identify the one individual with diabetes.

Given that $\text{accuracy} = \frac{TP + TN}{TP + TN + FP + FN}$, the accuracy would be 99%. However, the model missed 100% of the individuals in the positive class, so its recall, $\frac{TP}{TP + FN}$, is 0.
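
The 99-to-1 scenario can be reproduced in a few lines. This is a small sketch with hand-made labels (`y_true` and `y_pred` are hypothetical arrays, not the notebook's test set), using the same `accuracy_score` and `recall_score` functions as the evaluation cells above.

```python
from sklearn.metrics import accuracy_score, recall_score

# 99 healthy individuals (0) and one individual with diabetes (1)
y_true = [0] * 99 + [1]

# A model that labels everyone as healthy
y_pred = [0] * 100

acc = accuracy_score(y_true, y_pred)  # (TP + TN) / all = (0 + 99) / 100
rec = recall_score(y_true, y_pred)    # TP / (TP + FN) = 0 / (0 + 1)

print("Accuracy:", acc)  # 0.99
print("Recall:", rec)    # 0.0
```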

The following code prints the classification report for every classifier I fitted. Each report lists the overall accuracy together with precision and recall for each class, so you can evaluate each model effectively.

In [190]:
print("K-nearest neighbor:")
print(classification_report(y_test, pred_knn))

print("Logistic regression:")
print(classification_report(y_test, pred_lr))

print("Decision Tree")
print(classification_report(y_test, pred_dt))

print("Random Forest:")
print(classification_report(y_test, pred_rf))
K-nearest neighbor:
              precision    recall  f1-score   support

         0.0       0.95      1.00      0.97     18844
         1.0       0.95      0.58      0.72      2239

    accuracy                           0.95     21083
   macro avg       0.95      0.79      0.85     21083
weighted avg       0.95      0.95      0.95     21083

Logistic regression:
              precision    recall  f1-score   support

         0.0       0.96      0.99      0.97     18844
         1.0       0.86      0.64      0.73      2239

    accuracy                           0.95     21083
   macro avg       0.91      0.81      0.85     21083
weighted avg       0.95      0.95      0.95     21083

Decision Tree
              precision    recall  f1-score   support

         0.0       0.96      1.00      0.98     18844
         1.0       1.00      0.66      0.79      2239

    accuracy                           0.96     21083
   macro avg       0.98      0.83      0.89     21083
weighted avg       0.96      0.96      0.96     21083

Random Forest:
              precision    recall  f1-score   support

         0.0       0.96      1.00      0.98     18844
         1.0       0.94      0.68      0.79      2239

    accuracy                           0.96     21083
   macro avg       0.95      0.84      0.88     21083
weighted avg       0.96      0.96      0.96     21083

All the models predict with high accuracy. The Decision Tree and Random Forest models share the highest accuracy of 0.96, meaning each classified 96% of the test cases correctly. However, accuracy is neither the best nor the only metric for model selection.
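
To compare the models on the class that matters here, the positive-class (1.0) rows of the four reports above can be collected and ranked by recall. The numbers below are copied from those reports; the dictionary layout and the names `reports` and `ranked` are just one possible way to organize them.

```python
# Positive-class (1.0) metrics and overall accuracy, copied from the reports above
reports = {
    "K-nearest neighbor": {"precision": 0.95, "recall": 0.58, "f1": 0.72, "accuracy": 0.95},
    "Logistic regression": {"precision": 0.86, "recall": 0.64, "f1": 0.73, "accuracy": 0.95},
    "Decision Tree": {"precision": 1.00, "recall": 0.66, "f1": 0.79, "accuracy": 0.96},
    "Random Forest": {"precision": 0.94, "recall": 0.68, "f1": 0.79, "accuracy": 0.96},
}

# Rank the models by recall for the diabetes class, highest first
ranked = sorted(reports.items(), key=lambda item: item[1]["recall"], reverse=True)

print(f"{'model':<22}{'recall':>8}{'precision':>11}{'f1':>6}{'accuracy':>10}")
for name, m in ranked:
    print(f"{name:<22}{m['recall']:>8.2f}{m['precision']:>11.2f}{m['f1']:>6.2f}{m['accuracy']:>10.2f}")
```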

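You do not want to send home any patient with diabetes as if she were healthy. From this perspective, you should choose the model whose recall for the diabetes class (1.0) is closest to 1. Note that the recall of 1.00 in the reports above belongs to the healthy class (0.0); for the diabetes class, recall ranges from 0.58 (k-nearest neighbor) to 0.68 (Random Forest), with the Decision Tree at 0.66. Random Forest and Decision Tree also share the highest accuracy (0.96) and F1-score (0.79). With recall as the first priority, Random Forest is the model to choose; the Decision Tree is a close second and has the highest precision (1.00).

With this model, you would send home the fewest patients with diabetes among the four classifiers, at the cost of re-examining a few more healthy individuals than with the Decision Tree. Even so, a recall of 0.68 means that about a third of the patients with diabetes in the test set are still missed.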
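
The trade-off between missed patients and re-examined healthy individuals is not fixed by the choice of model alone. One common way to shift it, not used in this notebook, is to lower the probability threshold at which a case is labelled positive. The sketch below shows the effect on synthetic data; the names `X_demo`, `y_demo`, `clf`, and `results`, as well as the threshold of 0.3, are assumptions chosen for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import precision_score, recall_score
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced stand-in for the diabetes data (about 10% positives)
X_demo, y_demo = make_classification(
    n_samples=2000, n_features=8, weights=[0.9, 0.1], random_state=42
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.25, stratify=y_demo, random_state=42
)

clf = RandomForestClassifier(n_estimators=100, random_state=42).fit(X_tr, y_tr)
proba = clf.predict_proba(X_te)[:, 1]  # predicted probability of the positive class

results = {}
for threshold in (0.5, 0.3):
    # Label a case positive when its predicted probability reaches the threshold
    pred = (proba >= threshold).astype(int)
    results[threshold] = {
        "recall": recall_score(y_te, pred),
        "precision": precision_score(y_te, pred, zero_division=0),
    }
    print(f"threshold={threshold}: recall={results[threshold]['recall']:.2f}, "
          f"precision={results[threshold]['precision']:.2f}")
```

Lowering the threshold can only keep recall the same or raise it, because every case flagged at 0.5 is also flagged at 0.3; the price is usually lower precision, that is, more healthy individuals sent for re-examination.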

Selecting the best model depends on your data and purpose. For smaller datasets, a Decision Tree may produce fast and accurate decisions; for larger datasets, Random Forest often works better.

Beyond scikit-learn, deep learning and neural network models are also used. For example, an image-based model can detect cancer when it is trained on pictures of human cells.

Conclusion¶

  • Keep in mind how important it is to explore your data before modeling. In particular, missing values can be recorded in more than one way. To get a fuller picture of your data, ask yourself common-sense questions, e.g. "can someone have 0 blood pressure?". Once you identify missing values in your data, you should decide how to deal with them. Would you delete every record with a missing value, would you substitute it with a mean, or would you try to be efficient and keep as much data as possible?

  • It is recommended to examine feature importance in order to make the most accurate and realistic predictions.

  • Evaluating the models on recall along with accuracy is strongly recommended, because recall close to 1 minimizes the number of cases in which a patient with diabetes is classified as healthy.
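
As a starting point for the feature-importance recommendation, a fitted random forest exposes impurity-based importances through its `feature_importances_` attribute. The sketch below uses synthetic data and placeholder feature names (`feature_0` ... `feature_5`), which are assumptions for this example; with the notebook's model you would read `model_rf.feature_importances_` and pair the values with the actual column names.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic data: 6 features, of which 3 are informative (hypothetical stand-in)
feature_names = [f"feature_{i}" for i in range(6)]
X_demo, y_demo = make_classification(
    n_samples=1000, n_features=6, n_informative=3, n_redundant=0, random_state=42
)

forest = RandomForestClassifier(n_estimators=100, random_state=42).fit(X_demo, y_demo)

# Impurity-based importances: one non-negative value per feature, summing to 1
importances = forest.feature_importances_

# Print the features from most to least important
for idx in np.argsort(importances)[::-1]:
    print(f"{feature_names[idx]:<12}{importances[idx]:.3f}")
```

Impurity-based importances describe how the fitted trees used each feature on the training data; they are a useful first look, not proof that a feature causes the outcome.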
