Supervised Machine Learning With Scikit-learn and Diabetes dataset - Part 2¶
Exploratory Data Analysis (EDA) and Wrangling, Classification (k-nearest neighbor), Model Fitting, Hyperparameter Tuning, and Performance Evaluation¶
Dataset:¶
Diabetes is a chronic health condition affecting millions worldwide. Early prediction of diabetes can help in timely management and
prevention of complications. In this article, we will walk through a Python-based machine learning project for predicting diabetes using a Diabetes Dataset from Kaggle.
We will use Python libraries such as numpy, pandas, and scikit-learn, and the k-nearest-neighbours (kNN) classification algorithm. We will also learn the importance of handling missing data and of hyperparameter tuning for each model to achieve the best performance and generalizability.
diabetes_prediction_dataset.csv: https://www.kaggle.com/code/mahmoudbahnasy29/diabetes?select=diabetes_prediction_dataset.csv
This file contains medical and demographic data of patients along with their diabetes status, whether positive or negative. It consists of various features such as age, gender, body mass index (BMI), hypertension, heart disease, smoking history, HbA1c level, and blood glucose level. The Dataset can be utilized to construct machine learning models that can predict the likelihood of diabetes in patients based on their medical history and demographic details.
Step 1: Loading the necessary packages¶
## Please uncomment the following line and run pip install to install scikit-plot for visualization on the first run of the notebook.
# Once it is installed, you can comment it out again for subsequent clean runs of the notebook.
# %pip install scikit-plot
# Utility Libraries
import os
# data handling
import pandas as pd
import numpy as np
# Data preprocessing for ML
from sklearn.model_selection import train_test_split
from sklearn.model_selection import KFold, cross_val_score
# model training and testing facilitators
from sklearn.preprocessing import StandardScaler
# Overfitting/underfitting guide
from sklearn.model_selection import GridSearchCV
# ML models to be explored
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
# Performance measurements
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.metrics import classification_report
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, roc_auc_score, confusion_matrix, ConfusionMatrixDisplay, roc_curve
from sklearn.metrics import roc_curve, auc
# Visualization
import scikitplot as skplt
import matplotlib.pyplot as plt
import plotly.express as px
import seaborn as sns
sns.set()
%matplotlib inline
# ignore all future warnings
from warnings import simplefilter
simplefilter(action='ignore', category= FutureWarning)
print("Imports Done.")
Imports Done.
Step 2 : Data collection¶
Locate correct directory¶
# go to the directory where your files are. The data is located in the csv file -> '/inputs/diabetes_prediction_dataset.csv'
data_dir = "<path/to/your/code/directory/where/this/jupyter/notebook/is/located/>"
#print("Current Working Directory:",os.getcwd())
#print("Current working directory contains the following files:\n",os.listdir("./"))
Read data¶
# load data in a pandas dataframe
df = pd.read_csv(data_dir+"inputs/diabetes_prediction_dataset.csv")
df.head(5) # read first 5 lines of the data
| | gender | age | hypertension | heart_disease | smoking_history | bmi | HbA1c_level | blood_glucose_level | diabetes |
|---|---|---|---|---|---|---|---|---|---|
| 0 | Female | 80.0 | 0 | 1 | never | 25.19 | 6.6 | 140 | 0 |
| 1 | Female | 54.0 | 0 | 0 | No Info | 27.32 | 6.6 | 80 | 0 |
| 2 | Male | 28.0 | 0 | 0 | never | 27.32 | 5.7 | 158 | 0 |
| 3 | Female | 36.0 | 0 | 0 | current | 23.45 | 5.0 | 155 | 0 |
| 4 | Male | 76.0 | 1 | 1 | current | 20.14 | 4.8 | 155 | 0 |
original_dataframe_shape = df.shape
original_dataframe_shape
(100000, 9)
- The raw dataset contains 100000 rows, one per patient, and 9 columns, each representing one feature.
- Supervised machine learning algorithms require labelled data. The last column diabetes is the label; 0/1 -> non-diabetic/diabetic patient.
Describe data¶
DataFrame.info() gives information about the data types, columns, null value counts, memory usage, etc.
df.info(verbose=True)
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100000 entries, 0 to 99999
Data columns (total 9 columns):
 #   Column               Non-Null Count   Dtype
---  ------               --------------   -----
 0   gender               100000 non-null  object
 1   age                  100000 non-null  float64
 2   hypertension         100000 non-null  int64
 3   heart_disease        100000 non-null  int64
 4   smoking_history      100000 non-null  object
 5   bmi                  100000 non-null  float64
 6   HbA1c_level          100000 non-null  float64
 7   blood_glucose_level  100000 non-null  int64
 8   diabetes             100000 non-null  int64
dtypes: float64(3), int64(4), object(2)
memory usage: 6.9+ MB
DataFrame.describe() generates descriptive statistics that summarize the central tendency, dispersion and shape of a dataset's distribution, excluding NaN values. This method tells us a lot about a dataset. One important thing is that, by default, describe() deals only with numeric values; if there are categorical columns, it ignores them and displays a summary for the other columns unless the parameter include="all" is passed.
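For instance, passing include="all" also summarizes the two categorical columns (count, number of unique values, top category and its frequency):
# Summarize the categorical columns (gender, smoking_history) together with the numeric ones
df.describe(include="all").T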
df.describe().T
| | count | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|
| age | 100000.0 | 41.885856 | 22.516840 | 0.08 | 24.00 | 43.00 | 60.00 | 80.00 |
| hypertension | 100000.0 | 0.074850 | 0.263150 | 0.00 | 0.00 | 0.00 | 0.00 | 1.00 |
| heart_disease | 100000.0 | 0.039420 | 0.194593 | 0.00 | 0.00 | 0.00 | 0.00 | 1.00 |
| bmi | 100000.0 | 27.320767 | 6.636783 | 10.01 | 23.63 | 27.32 | 29.58 | 95.69 |
| HbA1c_level | 100000.0 | 5.527507 | 1.070672 | 3.50 | 4.80 | 5.80 | 6.20 | 9.00 |
| blood_glucose_level | 100000.0 | 138.058060 | 40.708136 | 80.00 | 100.00 | 140.00 | 159.00 | 300.00 |
| diabetes | 100000.0 | 0.085000 | 0.278883 | 0.00 | 0.00 | 0.00 | 0.00 | 1.00 |
print(df.isnull().sum())
gender 0 age 0 hypertension 0 heart_disease 0 smoking_history 0 bmi 0 HbA1c_level 0 blood_glucose_level 0 diabetes 0 dtype: int64
No missing values found.
Remove duplicates¶
#Detect & Handle Duplicates
df.duplicated().sum()
3854
df.drop_duplicates(inplace=True, ignore_index=True)
df.shape
(96146, 9)
df.describe().T
| | count | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|
| age | 96146.0 | 41.794326 | 22.462948 | 0.08 | 24.0 | 43.00 | 59.00 | 80.00 |
| hypertension | 96146.0 | 0.077601 | 0.267544 | 0.00 | 0.0 | 0.00 | 0.00 | 1.00 |
| heart_disease | 96146.0 | 0.040803 | 0.197833 | 0.00 | 0.0 | 0.00 | 0.00 | 1.00 |
| bmi | 96146.0 | 27.321461 | 6.767716 | 10.01 | 23.4 | 27.32 | 29.86 | 95.69 |
| HbA1c_level | 96146.0 | 5.532609 | 1.073232 | 3.50 | 4.8 | 5.80 | 6.20 | 9.00 |
| blood_glucose_level | 96146.0 | 138.218231 | 40.909771 | 80.00 | 100.0 | 140.00 | 159.00 | 300.00 |
| diabetes | 96146.0 | 0.088220 | 0.283616 | 0.00 | 0.0 | 0.00 | 0.00 | 1.00 |
for col in df.columns:
if col in df.select_dtypes('O').columns.to_list() + ['hypertension','heart_disease','diabetes']:
fig,axes = plt.subplots(1,2)
axc = sns.countplot(x=df[col],ax=axes[0])
if col == 'smoking_history':
axc.tick_params(axis='x', rotation=90)
plt.pie(x=df[col].value_counts().values,labels=df[col].value_counts().index,autopct='%.2f%%')
else:
fig,axes = plt.subplots(1,2)
ax = sns.histplot(x=df[col],kde=True,ax=axes[0])
sns.boxplot(x=df[col],ax=axes[1])
plt.show()
Bivariate Analysis¶
sns.boxplot(x=df['diabetes'],y=df['bmi'])
<Axes: xlabel='diabetes', ylabel='bmi'>
sns.boxplot(x=df['diabetes'],y=df['age'])
<Axes: xlabel='diabetes', ylabel='age'>
Handling categorical variables¶
We will later use the k-nearest-neighbours classification algorithm as our model, which is a distance-based algorithm.
Therefore, we need to encode the two categorical variables gender and smoking_history as numerical values.
# Define your mapping dictionary
mapping_gender = {'Male':1,'Female':2,'Other':0}
mapping_smoking_history = {'No Info':0,'never':1,'ever':2,'former':3,'not current':4,'current':5}
# Apply the mapping
df['gender'] = df['gender'].map(mapping_gender)
df['smoking_history'] = df['smoking_history'].map(mapping_smoking_history)
df.head(5)
| | gender | age | hypertension | heart_disease | smoking_history | bmi | HbA1c_level | blood_glucose_level | diabetes |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 2 | 80.0 | 0 | 1 | 1 | 25.19 | 6.6 | 140 | 0 |
| 1 | 2 | 54.0 | 0 | 0 | 0 | 27.32 | 6.6 | 80 | 0 |
| 2 | 1 | 28.0 | 0 | 0 | 1 | 27.32 | 5.7 | 158 | 0 |
| 3 | 2 | 36.0 | 0 | 0 | 5 | 23.45 | 5.0 | 155 | 0 |
| 4 | 1 | 76.0 | 1 | 1 | 5 | 20.14 | 4.8 | 155 | 0 |
Data Wrangling (handling missing values)¶
In general, there are three approaches a data scientist may take to handle missing values.
- Delete all observations with missing values, which can result in a substantial loss of data❗ and is therefore generally NOT recommended. But for a large dataset, if deleting them all looks safe, it is the simplest way to handle missing data.
- Substitute missing values with the mean, median, or mode, which can be a good trade-off between regressing to the mean and keeping the data (see the sketch after this list). However, it is not the best solution when more than half of the values in a variable would have to be substituted.
- Give up; No! ❌
- Our data do not have any NaN values. Instead, 0 values (the mapped No Info and Other categories) represent missing records in some columns, but not all.
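As an illustration of the second option, here is a minimal, hedged sketch of imputation on a small toy frame (toy values only; the notebook itself drops the incomplete rows below):
# Hedged sketch of option 2: fill missing values instead of dropping rows.
# Toy values for illustration only -- not taken from the diabetes data.
toy = pd.DataFrame({
    'smoking_history': [1, np.nan, 3, np.nan, 5],  # NaN marks a former 'No Info' record
    'bmi': [25.1, 27.3, np.nan, 31.0, 22.8],
})
toy['smoking_history'] = toy['smoking_history'].fillna(toy['smoking_history'].mode()[0])  # mode for a categorical code
toy['bmi'] = toy['bmi'].fillna(toy['bmi'].median())                                       # median for a numeric column
print(toy)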
Missing data:¶
smoking_history has some rows with No Info. gender has some rows with Other.
It is better to replace these zeros with NaN, since counting them is then easier and they can later be replaced with suitable values.
# Copy original dataframe to a copy and manipulate this later
df_copy = df.copy(deep=True)
df_copy.shape
(96146, 9)
# replace zeros with NaNs for columns 'smoking_history' and 'gender'
df_copy[['smoking_history']] = df_copy[['smoking_history']].replace(0, np.nan)
df_copy[['gender']] = df_copy[['gender']].replace(0, np.nan)
# count total rows with NaNs
df_copy.isna().any(axis=1).sum()
32899
Drop NaN values¶
# Drop rows where column 'smoking_history' has NaN values
df_cleaned = df_copy.dropna(subset=['smoking_history', 'gender'])
df_cleaned.shape
(63247, 9)
hist = df.hist(figsize = [15, 15])
# pair plot on cleaned data
p = sns.pairplot(df_cleaned, hue = 'diabetes', aspect=1.5) # height=2.5
The variables exhibit various distribution patterns.
Note that HbA1c_level and blood_glucose_level are the two strongest predictors.
Even with common sense, we can tell that these two features are directly related to diabetes.
Also, the diabetes plot shows that the data is biased towards datapoints with outcome value 0, i.e. cases where diabetes was not present. The number of non-diabetic records is roughly ten times the number of diabetic patients, indicating class imbalance.
Min-max Normalization (Scaling):¶
This method rescales features to a specific range, typically between 0 and 1.
df_min_max_scaled = (df_cleaned - df_cleaned.min()) / (df_cleaned.max() - df_cleaned.min())
#df_min_max_scaled.head(5)
df_min_max_scaled.describe().T
| | count | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|
| gender | 63247.0 | 0.603855 | 0.489099 | 0.0 | 0.000000 | 1.000000 | 1.000000 | 1.0 |
| age | 63247.0 | 0.581186 | 0.244647 | 0.0 | 0.386273 | 0.586673 | 0.762024 | 1.0 |
| hypertension | 63247.0 | 0.099135 | 0.298846 | 0.0 | 0.000000 | 0.000000 | 0.000000 | 1.0 |
| heart_disease | 63247.0 | 0.047686 | 0.213103 | 0.0 | 0.000000 | 0.000000 | 0.000000 | 1.0 |
| smoking_history | 63247.0 | 0.310133 | 0.382747 | 0.0 | 0.000000 | 0.000000 | 0.500000 | 1.0 |
| bmi | 63247.0 | 0.224624 | 0.080276 | 0.0 | 0.176658 | 0.210913 | 0.258380 | 1.0 |
| HbA1c_level | 63247.0 | 0.375776 | 0.199365 | 0.0 | 0.236364 | 0.418182 | 0.490909 | 1.0 |
| blood_glucose_level | 63247.0 | 0.271327 | 0.191983 | 0.0 | 0.090909 | 0.272727 | 0.359091 | 1.0 |
| diabetes | 63247.0 | 0.111262 | 0.314459 | 0.0 | 0.000000 | 0.000000 | 0.000000 | 1.0 |
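If you prefer scikit-learn's API, the same rescaling can be done with MinMaxScaler; a minimal sketch (it assumes df_cleaned from above and should match df_min_max_scaled up to floating-point noise):
# Equivalent min-max scaling via scikit-learn (assumes df_cleaned from above)
from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler()
df_scaled_sk = pd.DataFrame(scaler.fit_transform(df_cleaned),
                            columns=df_cleaned.columns,
                            index=df_cleaned.index)
df_scaled_sk.describe().T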
Feature Importance:¶
A quick, visual way to investigate the importance of the features is to fit a RandomForestClassifier and plot its feature importances with skplt.estimators.plot_feature_importances. You can then visually inspect how much each variable, relative to the other features, contributes to predicting the occurrence of diabetes.
Let's plot the features.
df2 = df_min_max_scaled
feature_names = df2.columns[:-1]
randfor = RandomForestClassifier()
randfor.fit(df2.drop(columns = "diabetes", axis=1),df2["diabetes"])
sp = skplt.estimators.plot_feature_importances(randfor, feature_names=feature_names, figsize=(10, 5), x_tick_rotation=90)
plt.show()
The result supports the assumption that blood_glucose_level and HbA1c_level are very strong predictors for the diagnosis of diabetes.
Feature Correlation:¶
Lastly, one should examine correlations between the variables, which helps to find the relationship between two quantities. The correlation coefficient measures the strength of association between two variables and can take values between -1 and +1; values near ±1 mean the variables are highly correlated and 0 means no correlation.
Pearson, Kendall rank, and Spearman's rank correlation coefficients are computed here using pairwise complete observations.
A heat map is a two-dimensional representation of information with the help of colors. Heat maps can help the user visualize simple or complex information.
A heatmap with annotations is a nice way of visualizing a correlation matrix.
# Spearman’s rank correlation coefficient
correlation_matrix = df2.corr(method='spearman') # kendall, pearson
plt.figure(figsize=(8, 5.5)) # Adjust figure size as needed
sns.heatmap(correlation_matrix, annot=True, fmt=".2f", square=False, cmap='Blues',) #RdYlGn/jet/coolwarm (vibrant), viridis/cividis (for colorblind), Blues (sequential)
plt.title("Spearman's rank Correlation Matrix")
plt.show()
Observations:
- All variables except gender show a positive correlation with the target variable diabetes.
- Most of the time, the variables exhibit low correlations between themselves and diabetes. However, blood_glucose_level and HbA1c_level are more strongly correlated with the target diabetes than the other variables; age is also moderately positively correlated with diabetes.
- The features are mostly uncorrelated with each other.
These observations seem reasonably correct.
Step 3: Pre-setting and Modeling Strategy¶
Now comes the Machine Learning model part.
Data splitting¶
First, split the data into feature arrays and label arrays for training and testing, as is the standard approach in ML.
The train_test_split function from the sklearn.model_selection module is commonly used to divide a dataset into training and testing sets, so that the model is tested on datapoints it has not seen rather than on the same points it was trained with. This is a crucial step in machine learning for evaluating a model's performance on unseen data.
Cross Validation: When the data is split into a single training and testing portion, it is possible that a specific type of datapoint ends up entirely in either the training or the testing portion, which would lead the model to perform poorly. Over-fitting and under-fitting problems can be better avoided with cross-validation techniques.
X and y: These are the input features and the target variable of your dataset, respectively. Your model will never see the y test-set data while training; it is hidden from the model and used to evaluate the performance of your model on unseen data.
test_size: The proportion of the dataset to include in the test split (e.g., 0.2 for 20%).
random_state: Controls the shuffling applied to the data, for reproducibility.
X = df_min_max_scaled.drop(columns = "diabetes", axis=1) # data with feature columns
y = df_min_max_scaled["diabetes"] # target labels
#split data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 1/3, random_state = 42)
X_train -> the data used to train the model
y_train -> labels of the data in the training set
X_test -> test set input features (never seen by the model during training)
y_test -> test set target labels (you will evaluate your predictions by comparing against this set)
[X_train.shape, X_test.shape, y_train.shape, y_test.shape]
[(42164, 8), (21083, 8), (42164,), (21083,)]
Step 4: ML Approach¶
The following steps summarize our approach to fit a particular model. We will repeat it for all the models:
Decide on the training parameters, then create and train the model:
- Create the model by using an appropriate classifier from scikit-learn, e.g. model = KNeighborsClassifier(n_neighbors=k)
- Call model.fit(X_train, y_train) to train the model
Predict the labels pred_model by calling model.predict(X_test). Then evaluate the model performance on the test data with:
- accuracy_score(y_test, pred_model)
- classification_report(y_test, pred_model)
- confusion_matrix(y_test, pred_model)
The strategy is identical for all the models I will attempt to fit, i.e.: i.) k-nearest neighbour (k-NN), ii.) logistic regression, iii.) decision tree and iv.) random forest.
Therefore, I will discuss the key issues only for Model 1: k-Nearest Neighbour. I intentionally avoid commenting on the other models because the logic remains the same. However, the parameter grid (param_grid) must be unique to each classifier because the hyperparameters differ from model to model.
I will comment on the performance and draw conclusions in the last part of the article.
Model 1: k - Nearest Neighbour¶
How to choose the value of k?
Choosing the optimal value of 'k' in a K-Nearest Neighbors (KNN) classifier in Python is a crucial step for achieving good model performance. While there's no single "best" method, several approaches are commonly employed.
# # Elbow method
# error_rates = [] # Initialize an empty list to store error rates
# k_values = range(1, 51) # # Iterate through a range of K values
# for k in k_values:
# knn = KNeighborsClassifier(n_neighbors=k) # Initialize KNN classifier with the current K
# knn.fit(X_train, y_train) # Fit the model to the training data
# predictions = knn.predict(X_test) # Make predictions on the test data
# error_rate = np.mean(predictions != y_test) # Calculate the error rate (misclassification rate)
# error_rates.append(error_rate) # Append the error rate to the list
# # Plot the error rate vs. K value
# plt.figure(figsize=(8, 6))
# plt.plot(k_values, error_rates, color='blue', linestyle='dashed', marker='o',markerfacecolor='red', markersize=8)
# plt.title('Error Rate vs. K Value (Elbow Method for KNN)')
# plt.xlabel('K Value')
# plt.ylabel('Error Rate')
# plt.grid(True)
# plt.show()
# # Square root of N method
# int(np.sqrt(df_cleaned.shape[0]))
Hyperparameter Tuning¶
GridSearchCV in scikit-learn for determining the value of k¶
5-fold cross-validation on train set¶
Here is an example using GridSearchCV in scikit-learn:
GridSearchCV splits X_train and y_train into 5 folds and cross-validates: each fold serves once as the validation set while the model is trained on the remaining four. While cross-validating, GridSearchCV searches for the best parameter values specified in param_grid.
knn = KNeighborsClassifier()
param_grid = {'n_neighbors': range(1, 51, 2)} # Test odd k values from 1 to 49
grid_search = GridSearchCV(knn, param_grid, cv=5, scoring='accuracy')
grid_search.fit(X_train, y_train)
best_k = grid_search.best_params_['n_neighbors']
best_score = grid_search.best_score_
print(f"Best k: {best_k}, Best accuracy: {best_score}")
Best k: 13, Best accuracy: 0.9492220378242069
Model Training¶
# Step 1. Initialize and train the KNN classifier
k = best_k
model_knn = KNeighborsClassifier(n_neighbors=k)
model_knn.fit(X_train, y_train)
KNeighborsClassifier(n_neighbors=13)
Cross Validation¶
# Define the cross-validation strategy
kf = KFold(n_splits=5, shuffle=True, random_state=42)
# Perform 5-fold cross-validation
# 'cv=5' specifies 5-fold cross-validation
# 'scoring="accuracy"' specifies the evaluation metric
scores = np.round(cross_val_score(model_knn, X_train, y_train, cv=kf, scoring='accuracy'), 2)
print("Cross-validation scores for each fold on train data:", scores)
print("Mean accuracy:", scores.mean())
print("Standard deviation of accuracy:", scores.std())
Cross-validation scores for each fold on train data: [0.95 0.95 0.95 0.95 0.95] Mean accuracy: 0.95 Standard deviation of accuracy: 0.0
Prediction Accuracy on Test data (unseen by model)¶
# Make predictions on the test set using the model
pred_knn = model_knn.predict(X_test)
[y_test.shape, pred_knn.shape]
[(21083,), (21083,)]
Evaluation (model Performance Analysis)¶
1. Confusion Matrix¶
The confusion matrix is a technique for summarizing the performance of a classification algorithm, i.e. one with discrete (here binary) outputs.
A confusion matrix is a table that describes the performance of a classification model. It summarizes the counts of true positive, true negative, false positive, and false negative predictions, providing a detailed view of how well the model is distinguishing between different classes.
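For a binary problem, scikit-learn's confusion_matrix returns a 2x2 array laid out as [[TN, FP], [FN, TP]]; a tiny sketch with made-up labels shows how to unpack the four counts:
# Toy labels, for illustration only
from sklearn.metrics import confusion_matrix
y_true_toy = [0, 0, 0, 1, 1, 1, 1, 0]
y_pred_toy = [0, 0, 1, 1, 0, 1, 1, 0]
tn, fp, fn, tp = confusion_matrix(y_true_toy, y_pred_toy).ravel()
print(f"TN={tn}, FP={fp}, FN={fn}, TP={tp}")  # TN=3, FP=1, FN=1, TP=3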
2. classification_report:¶
Method to see the bigger picture of the model's performance, which includes precision, recall and F1-score.
Precision Score:¶
- Precision (aka positive predictive value, PPV): accuracy of the positive predictions.
- $PPV = \frac{TP}{TP+FP}$
- It is similar to accuracy, but focuses only on the data the model predicted to be positive, i.e. diabetes = 1. Referring to a confusion matrix, a precision of 1 means there were no false positives.
Recall Score:¶
- Recall (sensitivity or true positive rate): the fraction of positives that were correctly identified.
- Also called sensitivity or True Positive Rate (TPR), it answers the question of how complete the results are, i.e. did the model miss any of the positive class, and to what extent?
- $TPR = \frac{TP}{TP+FN}$
- A recall greater than 0.5 is generally considered good.
- In our case, a low recall would mean the model incorrectly classified a lot of individuals with diabetes as healthy ones.
F1 Score:¶
- F1 Score (aka F-Score or F-Measure): a helpful metric for comparing two classifiers.
- The F1 Score takes into account both precision and recall.
- It is the harmonic mean of precision and recall.
- $F1 = \frac{2 * precision * recall}{precision + recall}$
One should ask: which of the two is the superior metric? In fact, it really depends!
For example, imagine cancer diagnostics. Would you rather classify a few more patients as false positives and, after a more precise examination, conclude they had no cancer, or would you rather let patients with cancer slip through as healthy individuals? In this particular case, the model should minimize $FN$ in the confusion matrix. Consequently, recall, i.e. TPR, should be closer to 1. Lastly, there is always a trade-off between these two negatively correlated metrics.
1. Confusion Matrix¶
pd.crosstab(y_test, pred_knn, rownames=['True'], colnames=['Predicted'], margins = True)
| Predicted | 0.0 | 1.0 | All |
|---|---|---|---|
| True | |||
| 0.0 | 18780 | 64 | 18844 |
| 1.0 | 934 | 1305 | 2239 |
| All | 19714 | 1369 | 21083 |
from sklearn.metrics import confusion_matrix
# store the matrix under a different name so the confusion_matrix function is not shadowed
cm_knn = confusion_matrix(y_test, pred_knn)
p = sns.heatmap(pd.DataFrame(cm_knn), annot=True, cmap="YlGnBu", fmt='g')
plt.title('Confusion matrix', y=1.1)
plt.ylabel('Actual label')
plt.xlabel('Predicted label')
Text(0.5, 20.049999999999997, 'Predicted label')
2. Classification Report¶
from sklearn.metrics import classification_report
print(classification_report(y_test,pred_knn))
precision recall f1-score support
0.0 0.95 1.00 0.97 18844
1.0 0.95 0.58 0.72 2239
accuracy 0.95 21083
macro avg 0.95 0.79 0.85 21083
weighted avg 0.95 0.95 0.95 21083
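As a quick sanity check, the class-1 row of the report can be recomputed by hand from the crosstab above (TP=1305, FP=64, FN=934):
# Recompute the diabetic-class metrics from the confusion-matrix counts above
TP, FP, FN = 1305, 64, 934
precision_1 = TP / (TP + FP)                                   # ~0.95
recall_1 = TP / (TP + FN)                                      # ~0.58
f1_1 = 2 * precision_1 * recall_1 / (precision_1 + recall_1)   # ~0.72
print(round(precision_1, 2), round(recall_1, 2), round(f1_1, 2))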
3. Accuracy and Error¶
print("KNN Accuracy: {}".format(np.round(accuracy_score(y_test, pred_knn), 2)))
print("Root Mean Squared Error: {}".format(np.round(np.sqrt(mean_squared_error(y_test, pred_knn)), 2)))
KNN Accuracy: 0.95 Root Mean Squared Error: 0.22
4. ROC - AUC:¶
The ROC (Receiver Operating Characteristic) curve tells us how well the model can distinguish between two things (e.g. whether a patient has a disease or not). Better models can accurately distinguish between the two, whereas a poor model will have difficulty doing so.
from sklearn.metrics import roc_curve, auc
y_pred_proba = model_knn.predict_proba(X_test)[:,1]
fpr, tpr, thresholds = roc_curve(y_test, y_pred_proba)
roc_auc = np.round(auc(fpr, tpr), 2)
roc_auc
0.94
[y_test.shape, y_pred_proba.shape]
[(21083,), (21083,)]
# plt.plot([0,1],[0,1],'k--')
# plt.plot(fpr,tpr, label='Knn')
# plt.xlabel('FPR')
# plt.ylabel('TPR')
# plt.title('ROC curve')
# plt.show()
plt.figure()
lw = 2
plt.plot(
fpr,
tpr,
color="darkorange",
lw=lw,
label="ROC curve (area = %0.2f)" % roc_auc,
)
plt.plot([0, 1], [0, 1], color="navy", lw=lw, linestyle="--")
plt.xlabel("False Positive Rate")
plt.ylabel("True Positive Rate")
plt.title("ROC curve")
plt.legend(loc="lower right")
plt.show()
# import scikitplot as skplt
# skplt.metrics.plot_roc_curve(y_test, y_pred_proba)
# plt.show()
#Area under ROC curve
from sklearn.metrics import roc_auc_score
roc_auc_score(y_test,y_pred_proba)
0.9431043169706584
# Evaluate the model using scikit-learn metrices
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
from sklearn.metrics import mean_squared_error, r2_score, roc_auc_score
from sklearn.metrics import confusion_matrix, classification_report
pred_knn = model_knn.predict(X_test)
accuracy = accuracy_score(y_test, pred_knn)
precision = precision_score(y_test, pred_knn)
recall = recall_score(y_test, pred_knn)
f1 = f1_score(y_test, pred_knn)
auc_knn = roc_auc_score(y_test, pred_knn)  # renamed so the auc() function imported above is not shadowed
cr = classification_report(y_test, pred_knn)
cm = confusion_matrix(y_test, pred_knn)
---------------------------------------------------------- PART 2 --------------------------------------------------------------------------¶
Model 2. Logistic Regression¶
Logistic regression is a machine learning algorithm used for classification. It predicts the probability of a binary outcome (like Yes/No, 0/1, Spam/Not Spam) from the input variables by fitting an S-shaped (sigmoid or logistic) curve that maps inputs to a probability between 0 and 1, and then applying a threshold (often 0.5) to classify the result. It is similar to linear regression, but solves the issue of linear models predicting values outside the 0-1 range for binary outcomes, making it well suited to predicting the likelihood of an event.
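To make the sigmoid-plus-threshold idea concrete, here is a tiny sketch with made-up coefficients (purely illustrative, not the weights of the model fitted below):
# Hypothetical weights and one hypothetical (scaled) sample -- illustration only
w = np.array([0.8, 1.5])          # feature weights
b = -2.0                          # intercept
x = np.array([0.4, 0.9])          # one example with two features
z = np.dot(w, x) + b              # linear score
p = 1.0 / (1.0 + np.exp(-z))      # sigmoid squashes the score into (0, 1)
label = int(p >= 0.5)             # default 0.5 threshold turns the probability into a class
print(round(p, 3), label)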
Model Training¶
- Import the LogisticRegression class from sklearn.linear_model.
- Create an instance of the LogisticRegression model.
- Fit the model to your training data using the fit() method, providing the features (X_train) and the target variable (y_train).
# Initialize and train the Logistic Regression model
model_lr = LogisticRegression()
model_lr.fit(X_train, y_train)
LogisticRegression()
Prediction¶
# Make predictions
pred_lr = model_lr.predict(X_test)
Evaluation¶
# Step 3. Evaluate the model
print("Model: Logistic Regression")
print("Accuracy: {}".format(np.round(accuracy_score(y_test, pred_lr), 2)))
print("Root Mean Squared Error: {}".format(np.round(np.sqrt(mean_squared_error(y_test, pred_lr)), 2)))
print("R-squared: {}".format(np.round(r2_score(y_test, pred_lr), 2)))
print("Precision: {}".format(np.round(precision_score(y_test, pred_lr), 2)))
print("Recall: {}".format(np.round(recall_score(y_test, pred_lr), 2)))
print("AUC: {}".format(np.round(roc_auc_score(y_test, pred_lr), 2)))
print("Classification Report:\n", classification_report(y_test, pred_lr))
print("Confusion Matrix:\n {}")
disp = ConfusionMatrixDisplay(confusion_matrix=confusion_matrix(y_test, pred_lr))
disp.plot(cmap='viridis')
plt.grid(False)
plt.show()
Model: Logistic Regression
Accuracy: 0.95
Root Mean Squared Error: 0.22
R-squared: 0.48
Precision: 0.86
Recall: 0.64
AUC: 0.81
Classification Report:
precision recall f1-score support
0.0 0.96 0.99 0.97 18844
1.0 0.86 0.64 0.73 2239
accuracy 0.95 21083
macro avg 0.91 0.81 0.85 21083
weighted avg 0.95 0.95 0.95 21083
Confusion Matrix:
# ROC plot
from sklearn.metrics import roc_curve, auc
y_pred_proba = model_lr.predict_proba(X_test)[:,1]
fpr, tpr, thresholds = roc_curve(y_test, y_pred_proba)
roc_auc = np.round(auc(fpr, tpr), 2)
plt.figure()
lw = 2
plt.plot(
fpr,
tpr,
color="darkorange",
lw=lw,
label="ROC curve (area = %0.2f)" % roc_auc,
)
plt.plot([0, 1], [0, 1], color="navy", lw=lw, linestyle="--")
plt.xlabel("False Positive Rate")
plt.ylabel("True Positive Rate")
plt.title("ROC curve")
plt.legend(loc="lower right")
plt.show()
Model 3. Decision Tree¶
A decision tree in Python is a supervised machine learning algorithm used for both classification and regression tasks. It operates by building a model in the form of a tree structure, where each internal node represents a test on an attribute, each branch represents the outcome of the test, and each leaf node represents a class label (in classification) or a numerical value (in regression).
Key Concepts:
Root Node: The topmost node in the tree, representing the initial decision or feature test.
Internal Nodes: Nodes that represent a test on a feature and have branches leading to other nodes.
Leaf Nodes: Terminal nodes that represent the final outcome or prediction.
Splitting: The process of dividing data into subsets based on feature values at each node.
Information Gain/Gini Impurity: Metrics used to determine the "best" split at each node, aiming to create purer child nodes.
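The Gini criterion mentioned above can be illustrated in a few lines; the split that most reduces the weighted impurity of the children is chosen (toy counts, not taken from the diabetes data):
# Toy illustration of scoring a split with Gini impurity
def gini(pos, neg):
    p = pos / (pos + neg)
    return 1.0 - p**2 - (1.0 - p)**2

parent = gini(40, 60)                          # parent node: 40 positives, 60 negatives
left, right = gini(30, 10), gini(10, 50)       # the two children a candidate split produces
weighted_children = 0.4 * left + 0.6 * right   # weight by the fraction of samples in each child
print(round(parent, 3), round(weighted_children, 3), round(parent - weighted_children, 3))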
Model Training¶
# Instantiate the Decision Tree Classifier
model_dt = DecisionTreeClassifier(max_depth=3, random_state=42)
How to determine max_depth?¶
- Cross-validation with grid search or randomized search (using scikit-learn's GridSearchCV), as in the sketch below
- Heuristic rules (initial guidance)
- Monitoring training and validation accuracy
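A hedged sketch of the first option, mirroring the GridSearchCV call used for k earlier (it reuses X_train and y_train from the split above; max_depth=3 below is simply the value adopted for this article):
# Possible way to tune max_depth with 5-fold cross-validation (sketch)
param_grid_dt = {'max_depth': range(1, 11)}
grid_dt = GridSearchCV(DecisionTreeClassifier(random_state=42),
                       param_grid_dt, cv=5, scoring='accuracy')
grid_dt.fit(X_train, y_train)
print(grid_dt.best_params_, round(grid_dt.best_score_, 3))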
# Train the model
model_dt.fit(X_train, y_train)
DecisionTreeClassifier(max_depth=3, random_state=42)
# Make predictions
pred_dt = model_dt.predict(X_test)
Evaluation¶
# Step 3. Evaluate the model
print("Model: Decision Tree")
print("Accuracy: {}".format(np.round(accuracy_score(y_test, pred_dt), 2)))
print("Root Mean Squared Error: {}".format(np.round(np.sqrt(mean_squared_error(y_test, pred_dt)), 2)))
print("R-squared: {}".format(np.round(r2_score(y_test, pred_dt), 2)))
print("Precision: {}".format(np.round(precision_score(y_test, pred_dt), 2)))
print("Recall: {}".format(np.round(recall_score(y_test, pred_dt), 2)))
print("AUC: {}".format(np.round(roc_auc_score(y_test, pred_dt), 2)))
print("Classification Report:\n", classification_report(y_test, pred_dt))
print("Confusion Matrix:\n {}")
disp = ConfusionMatrixDisplay(confusion_matrix=confusion_matrix(y_test, pred_dt))
disp.plot(cmap='viridis')
plt.grid(False)
plt.show()
Model: Decision Tree
Accuracy: 0.96
Root Mean Squared Error: 0.19
R-squared: 0.62
Precision: 1.0
Recall: 0.66
AUC: 0.83
Classification Report:
precision recall f1-score support
0.0 0.96 1.00 0.98 18844
1.0 1.00 0.66 0.79 2239
accuracy 0.96 21083
macro avg 0.98 0.83 0.89 21083
weighted avg 0.96 0.96 0.96 21083
Confusion Matrix:
# ROC plot
from sklearn.metrics import roc_curve, auc
y_pred_proba = model_dt.predict_proba(X_test)[:,1]
fpr, tpr, thresholds = roc_curve(y_test, y_pred_proba)
roc_auc = np.round(auc(fpr, tpr), 2)
plt.figure()
lw = 2
plt.plot(
fpr,
tpr,
color="darkorange",
lw=lw,
label="ROC curve (area = %0.2f)" % roc_auc,
)
plt.plot([0, 1], [0, 1], color="navy", lw=lw, linestyle="--")
plt.xlabel("False Positive Rate")
plt.ylabel("True Positive Rate")
plt.title("ROC curve")
plt.legend(loc="lower right")
plt.show()
Model 4. Random Forest¶
A random forest classification model is an ensemble machine learning method that builds multiple decision trees during training and combines their predictions through majority voting to determine the final class. This approach improves predictive accuracy and reduces the risk of overfitting compared to a single decision tree model.
# import module from scikit-learn
from sklearn.ensemble import RandomForestClassifier
# Instantiate the Random Forest Classifier
model_rf = RandomForestClassifier(n_estimators=100, random_state=42) # n_estimators is number of trees
# Train the model
model_rf.fit(X_train, y_train)
#Prediction
pred_rf = model_rf.predict(X_test)
# Evaluate the model
print(accuracy_score(y_test, pred_rf))
print(classification_report(y_test, pred_rf))
# Instantiate the Random Forest Classifier
model_rf = RandomForestClassifier(n_estimators=100, random_state=42) # n_estimators is number of trees
# Train the model
model_rf.fit(X_train, y_train)
RandomForestClassifier(random_state=42)
#Prediction
pred_rf = model_rf.predict(X_test)
Evaluation¶
# Step 3. Evaluate the model
print("Model: Random Forest")
print("Accuracy: {}".format(np.round(accuracy_score(y_test, pred_rf), 2)))
print("Root Mean Squared Error: {}".format(np.round(np.sqrt(mean_squared_error(y_test, pred_rf)), 2)))
print("R-squared: {}".format(np.round(r2_score(y_test, pred_rf), 2)))
print("Precision: {}".format(np.round(precision_score(y_test, pred_rf), 2)))
print("Recall: {}".format(np.round(recall_score(y_test, pred_rf), 2)))
print("AUC: {}".format(np.round(roc_auc_score(y_test, pred_rf), 2)))
print("Classification Report:\n", classification_report(y_test, pred_rf))
print("Confusion Matrix:\n {}")
disp = ConfusionMatrixDisplay(confusion_matrix=confusion_matrix(y_test, pred_rf))
disp.plot(cmap='viridis')
plt.grid(False)
plt.show()
Model: Random Forest
Accuracy: 0.96
Root Mean Squared Error: 0.2
R-squared: 0.6
Precision: 0.94
Recall: 0.68
AUC: 0.84
Classification Report:
precision recall f1-score support
0.0 0.96 1.00 0.98 18844
1.0 0.94 0.68 0.79 2239
accuracy 0.96 21083
macro avg 0.95 0.84 0.88 21083
weighted avg 0.96 0.96 0.96 21083
Confusion Matrix:
# ROC plot
from sklearn.metrics import roc_curve, auc
y_pred_proba = model_rf.predict_proba(X_test)[:,1]
fpr, tpr, thresholds = roc_curve(y_test, y_pred_proba)
roc_auc = np.round(auc(fpr, tpr), 2)
plt.figure()
lw = 2
plt.plot(
fpr,
tpr,
color="darkorange",
lw=lw,
label="ROC curve (area = %0.2f)" % roc_auc,
)
plt.plot([0, 1], [0, 1], color="navy", lw=lw, linestyle="--")
plt.xlabel("False Positive Rate")
plt.ylabel("True Positive Rate")
plt.title("ROC curve")
plt.legend(loc="lower right")
plt.show()
Best Model Selection¶
I already mentioned precision and recall but have not touched on accuracy. In fact, accuracy may not be the best criterion for choosing the right model. Consider test data with 100 individuals, of which 99 are healthy and only 1 has diabetes. Also assume the model successfully classified the 99 healthy people but completely failed to classify the one individual with diabetes.
Given that $accuracy = \frac{TP + TN}{TP + TN + FP + FN}$, the accuracy would be 99%. However, the algorithm missed 100% of the individuals in the positive class.
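The toy scenario can be reproduced in a couple of lines:
# 99 healthy subjects (0) and 1 diabetic subject (1), all predicted healthy
y_true_toy = np.array([0] * 99 + [1])
y_pred_toy = np.zeros(100, dtype=int)
print(accuracy_score(y_true_toy, y_pred_toy))  # 0.99 -- looks excellent
print(recall_score(y_true_toy, y_pred_toy))    # 0.0  -- every diabetic case was missed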
The following code displays the classification report for every classifier I fitted. classification_report shows precision and recall per class, so you can effectively compare the models.
print("K-nearest neighbor:")
print(classification_report(y_test, pred_knn))
print("Logistic regression:")
print(classification_report(y_test, pred_lr))
print("Decision Tree")
print(classification_report(y_test, pred_dt))
print("Random Forest:")
print(classification_report(y_test, pred_rf))
K-nearest neighbor:
precision recall f1-score support
0.0 0.95 1.00 0.97 18844
1.0 0.95 0.58 0.72 2239
accuracy 0.95 21083
macro avg 0.95 0.79 0.85 21083
weighted avg 0.95 0.95 0.95 21083
Logistic regression:
precision recall f1-score support
0.0 0.96 0.99 0.97 18844
1.0 0.86 0.64 0.73 2239
accuracy 0.95 21083
macro avg 0.91 0.81 0.85 21083
weighted avg 0.95 0.95 0.95 21083
Decision Tree
precision recall f1-score support
0.0 0.96 1.00 0.98 18844
1.0 1.00 0.66 0.79 2239
accuracy 0.96 21083
macro avg 0.98 0.83 0.89 21083
weighted avg 0.96 0.96 0.96 21083
Random Forest:
precision recall f1-score support
0.0 0.96 1.00 0.98 18844
1.0 0.94 0.68 0.79 2239
accuracy 0.96 21083
macro avg 0.95 0.84 0.88 21083
weighted avg 0.96 0.96 0.96 21083
All the models make predictions with high accuracy. Here, both the Decision Tree and the Random Forest achieve the highest accuracy of 0.96, meaning they predicted 96% of cases correctly. But accuracy is not the best, nor the only, criterion for model selection.
You do not want to send home a patient with diabetes as if she were healthy. From this perspective, you should choose the model whose recall on the diabetic class is closest to 1. Among the classifiers considered, the Random Forest has the highest recall (0.68), just ahead of the Decision Tree (0.66), so recall should be the first priority when deciding between them; the Decision Tree remains attractive because of its perfect precision (1.00) on the positive class.
Using the model with the highest recall, you would send home the smallest number of patients with diabetes among the models considered, at the cost of re-examining a greater number of healthy individuals.
Selecting the best model depends on your data and purpose. For smaller datasets, a decision tree may generate faster and accurate decisions; for larger datasets, a random forest often works better.
Beyond scikit-learn, deep learning and neural network models are used. For example, a visual algorithm can detect cancer when it is trained on pictures of human cells.
Conclusion¶
Keep in mind how important it is to explore your data before modeling. In particular, there are several ways missing values can be recorded. To see the bigger picture of your data, it is important to ask yourself common-sense questions, e.g. "can someone have 0 blood pressure?". Once you identify missing values in your data, you should decide how to deal with them. Would you delete every record with a missing value, would you substitute it with a mean, or would you try to be efficient and keep as much data as possible?
It is recommended to assess feature importance in order to make the most accurate and realistic predictions.
Evaluating the models based on recall along with accuracy is strongly recommended, because a recall close to 1 minimizes the number of cases in which a patient with diabetes would be classified incorrectly.