Supervised Machine Learning With Scikit-learn and the Diabetes Dataset¶
Exploratory Data Analysis (EDA) and Wrangling, Classification (k-Nearest Neighbors), Model Fitting, Hyperparameter Tuning, and Performance Evaluation¶
Dataset:¶
Diabetes is a chronic health condition affecting millions worldwide. Early prediction of diabetes can help in timely management and
prevention of complications. In this article, we will walk through a Python-based machine learning project for predicting diabetes using a Diabetes Dataset from Kaggle.
We will use Python libraries such as numpy, pandas, and scikit-learn, and the k-nearest neighbors (kNN) classification algorithm. We will also see why handling missing data and tuning the hyperparameters of each model matter for achieving the best performance and generalizability.
diabetes_prediction_dataset.csv: https://www.kaggle.com/code/mahmoudbahnasy29/diabetes?select=diabetes_prediction_dataset.csv
This file contains medical and demographic data of patients along with their diabetes status, whether positive or negative. It consists of features such as age, gender, body mass index (BMI), hypertension, heart disease, smoking history, HbA1c level, and blood glucose level. The dataset can be used to build machine learning models that predict the likelihood of diabetes in patients based on their medical history and demographic details.
Step 1: Loading the necessary packages¶
## Please uncomment the following line and run it to install scikit-plot (used for visualization) on the first run of the notebook.
# Once it is installed, you can comment it out again for subsequent clean runs of the notebook.
# %pip install scikit-plot
# Utility Libraries
import os
# data handling
import pandas as pd
import numpy as np
# Data preprocessing for ML
from sklearn.model_selection import train_test_split
from sklearn.model_selection import KFold, cross_val_score
# model training and testing facilitators
from sklearn.preprocessing import StandardScaler
# Overfitting/underfitting guide
from sklearn.model_selection import GridSearchCV
# ML models to be explored
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
# Performance measurements
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.metrics import classification_report
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, roc_auc_score, confusion_matrix, ConfusionMatrixDisplay, roc_curve
# Visualization
import scikitplot as skplt
import matplotlib.pyplot as plt
import plotly.express as px
import seaborn as sns
sns.set()
%matplotlib inline
# ignore all future warnings
from warnings import simplefilter
simplefilter(action='ignore', category= FutureWarning)
print("Imports Done.")
Imports Done.
Step 2: Data collection¶
Locate correct directory¶
# go to the directory where your files are. The data is located in a csv file -> 'inputs/diabetes_prediction_dataset.csv'
data_dir = "<path/to/your/code/directory/where/this/jupyter/notebook/is/located>/"
#os.chdir(data_dir)
#print("Current Working Directory:",os.getcwd())
#print("Current working directory contains the following files:\n",os.listdir("./"))
Read data¶
# load data in a pandas dataframe
df = pd.read_csv(data_dir+"inputs/diabetes_prediction_dataset.csv")
df.head(5) # read first 5 lines of the data
| gender | age | hypertension | heart_disease | smoking_history | bmi | HbA1c_level | blood_glucose_level | diabetes | |
|---|---|---|---|---|---|---|---|---|---|
| 0 | Female | 80.0 | 0 | 1 | never | 25.19 | 6.6 | 140 | 0 |
| 1 | Female | 54.0 | 0 | 0 | No Info | 27.32 | 6.6 | 80 | 0 |
| 2 | Male | 28.0 | 0 | 0 | never | 27.32 | 5.7 | 158 | 0 |
| 3 | Female | 36.0 | 0 | 0 | current | 23.45 | 5.0 | 155 | 0 |
| 4 | Male | 76.0 | 1 | 1 | current | 20.14 | 4.8 | 155 | 0 |
original_dataframe_shape = df.shape
original_dataframe_shape
(100000, 9)
- The raw dataset contains 100000 rows, one for each patient, and 9 columns, each representing one feature.
- Supervised machine learning algorithms require labelled data. The last column diabetes is the label; 0/1 -> non-diabetic/diabetic patient.
Describe data¶
DataFrame.info() gives information about the data types, columns, non-null value counts, memory usage, etc.
df.info(verbose=True)
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100000 entries, 0 to 99999
Data columns (total 9 columns):
 #   Column               Non-Null Count   Dtype  
---  ------               --------------   -----  
 0   gender               100000 non-null  object 
 1   age                  100000 non-null  float64
 2   hypertension         100000 non-null  int64  
 3   heart_disease        100000 non-null  int64  
 4   smoking_history      100000 non-null  object 
 5   bmi                  100000 non-null  float64
 6   HbA1c_level          100000 non-null  float64
 7   blood_glucose_level  100000 non-null  int64  
 8   diabetes             100000 non-null  int64  
dtypes: float64(3), int64(4), object(2)
memory usage: 6.9+ MB
DataFrame.describe() generates descriptive statistics that summarize the central tendency, dispersion, and shape of a dataset's distribution, excluding NaN values. This method tells us a lot about a dataset. One important thing to note is that describe() deals only with numeric values by default: if there are categorical columns, it will ignore them and display a summary for the other columns, unless the parameter include="all" is passed.
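For instance, passing include="all" pulls the two categorical columns (gender and smoking_history) into the summary as well; a quick check one could run (not executed in this notebook):
# summarize numeric and categorical columns together
df.describe(include='all').T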
df.describe().T
| count | mean | std | min | 25% | 50% | 75% | max | |
|---|---|---|---|---|---|---|---|---|
| age | 100000.0 | 41.885856 | 22.516840 | 0.08 | 24.00 | 43.00 | 60.00 | 80.00 |
| hypertension | 100000.0 | 0.074850 | 0.263150 | 0.00 | 0.00 | 0.00 | 0.00 | 1.00 |
| heart_disease | 100000.0 | 0.039420 | 0.194593 | 0.00 | 0.00 | 0.00 | 0.00 | 1.00 |
| bmi | 100000.0 | 27.320767 | 6.636783 | 10.01 | 23.63 | 27.32 | 29.58 | 95.69 |
| HbA1c_level | 100000.0 | 5.527507 | 1.070672 | 3.50 | 4.80 | 5.80 | 6.20 | 9.00 |
| blood_glucose_level | 100000.0 | 138.058060 | 40.708136 | 80.00 | 100.00 | 140.00 | 159.00 | 300.00 |
| diabetes | 100000.0 | 0.085000 | 0.278883 | 0.00 | 0.00 | 0.00 | 0.00 | 1.00 |
print(df.isnull().sum())
gender                 0
age                    0
hypertension           0
heart_disease          0
smoking_history        0
bmi                    0
HbA1c_level            0
blood_glucose_level    0
diabetes               0
dtype: int64
No missing values found.
Remove duplicates¶
#Detect & Handle Duplicates
df.duplicated().sum()
3854
df.drop_duplicates(inplace=True, ignore_index=True)
df.shape
(96146, 9)
df.describe().T
| count | mean | std | min | 25% | 50% | 75% | max | |
|---|---|---|---|---|---|---|---|---|
| age | 96146.0 | 41.794326 | 22.462948 | 0.08 | 24.0 | 43.00 | 59.00 | 80.00 |
| hypertension | 96146.0 | 0.077601 | 0.267544 | 0.00 | 0.0 | 0.00 | 0.00 | 1.00 |
| heart_disease | 96146.0 | 0.040803 | 0.197833 | 0.00 | 0.0 | 0.00 | 0.00 | 1.00 |
| bmi | 96146.0 | 27.321461 | 6.767716 | 10.01 | 23.4 | 27.32 | 29.86 | 95.69 |
| HbA1c_level | 96146.0 | 5.532609 | 1.073232 | 3.50 | 4.8 | 5.80 | 6.20 | 9.00 |
| blood_glucose_level | 96146.0 | 138.218231 | 40.909771 | 80.00 | 100.0 | 140.00 | 159.00 | 300.00 |
| diabetes | 96146.0 | 0.088220 | 0.283616 | 0.00 | 0.0 | 0.00 | 0.00 | 1.00 |
# Univariate analysis: count/pie plots for categorical and binary columns, hist/box plots for numeric ones
for col in df.columns:
    if col in df.select_dtypes('O').columns.to_list() + ['hypertension', 'heart_disease', 'diabetes']:
        fig, axes = plt.subplots(1, 2)
        axc = sns.countplot(x=df[col], ax=axes[0])
        if col == 'smoking_history':
            axc.tick_params(axis='x', rotation=90)
        axes[1].pie(x=df[col].value_counts().values, labels=df[col].value_counts().index, autopct='%.2f%%')
    else:
        fig, axes = plt.subplots(1, 2)
        ax = sns.histplot(x=df[col], kde=True, ax=axes[0])
        sns.boxplot(x=df[col], ax=axes[1])
    plt.show()
Bivariate Analysis¶
sns.boxplot(x=df['diabetes'],y=df['bmi'])
<Axes: xlabel='diabetes', ylabel='bmi'>
sns.boxplot(x=df['diabetes'],y=df['age'])
<Axes: xlabel='diabetes', ylabel='age'>
Handling categorical variables¶
We will later use the k-nearest neighbors classification algorithm as our model, which is a distance-based algorithm.
Therefore, we need to map the two categorical variables gender and smoking_history to numerical values.
# Define your mapping dictionary
mapping_gender = {'Male':1,'Female':2,'Other':0}
mapping_smoking_history = {'No Info':0,'never':1,'ever':2,'former':3,'not current':4,'current':5}
# Apply the mapping
df['gender'] = df['gender'].map(mapping_gender)
df['smoking_history'] = df['smoking_history'].map(mapping_smoking_history)
df.head(5)
| gender | age | hypertension | heart_disease | smoking_history | bmi | HbA1c_level | blood_glucose_level | diabetes | |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 2 | 80.0 | 0 | 1 | 1 | 25.19 | 6.6 | 140 | 0 |
| 1 | 2 | 54.0 | 0 | 0 | 0 | 27.32 | 6.6 | 80 | 0 |
| 2 | 1 | 28.0 | 0 | 0 | 1 | 27.32 | 5.7 | 158 | 0 |
| 3 | 2 | 36.0 | 0 | 0 | 5 | 23.45 | 5.0 | 155 | 0 |
| 4 | 1 | 76.0 | 1 | 1 | 5 | 20.14 | 4.8 | 155 | 0 |
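As a side note, the integer codes above impose an arbitrary ordering on the categories, which a distance-based model like kNN treats as meaningful. A common alternative is one-hot encoding; a minimal sketch, assuming df_raw is a hypothetical copy of the dataframe as loaded in Step 2, before the .map() calls (not used in the rest of this notebook):
# Hypothetical alternative: one-hot encode the categorical columns instead of ordinal mapping
# df_raw is assumed to be a copy of the dataframe before the .map() calls above
df_onehot = pd.get_dummies(df_raw, columns=['gender', 'smoking_history'], drop_first=True)
df_onehot.head()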
Data Wrangling (handling missing values)¶
In general, there are three approaches a data scientist may take to handle missing values:
- Delete all observations with missing values, which would result in substantial loss of data ❗ and is therefore NOT recommended in general. But with a large dataset, if deleting them all looks safe, this is the simplest way to handle missing data.
- Substitute missing values with the mean, median, or mode, which can be a good trade-off between regression to the mean and keeping the data (see the sketch after this list). However, it is not the best solution when more than half of the values in a variable would need substituting.
- Give up; No! ❌

Our data do not have any NaN. Instead, 0 values in this data represent missing records in some columns, not all.
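A minimal sketch of option 2 with pandas, kept commented out like the other optional snippets in this notebook ('some_column' is a hypothetical placeholder; below we take the deletion route instead):
# Illustrative sketch (not used below): substitute NaNs with the column median
# df['some_column'] = df['some_column'].fillna(df['some_column'].median())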
Missing data:¶
- smoking_history has some rows with No Info (mapped to 0 above).
- gender has some rows with Other (mapped to 0 above).

It is better to replace these zeros with NaN: counting them then becomes easier, and the zeros need to be replaced with suitable values (or dropped) anyway.
# Copy original dataframe to a copy and manipulate this later
df_copy = df.copy(deep=True)
df_copy.shape
(96146, 9)
# replace zeros with NaNs for columns 'smoking_history' and 'gender'
df_copy[['smoking_history']] = df_copy[['smoking_history']].replace(0, np.nan)
df_copy[['gender']] = df_copy[['gender']].replace(0, np.nan)
# count total rows with NaNs
df_copy.isna().any(axis=1).sum()
32899
Drop NaN values¶
# Drop rows where column 'smoking_history' has NaN values
df_cleaned = df_copy.dropna(subset=['smoking_history', 'gender'])
df_cleaned.shape
(63247, 9)
hist = df.hist(figsize = [15, 15])
# pair plot on cleaned data
p = sns.pairplot(df_cleaned, hue = 'diabetes', aspect=1.5) # height=2.5
The variables exhibit various distribution patterns.
Note that HbA1c_level and blood_glucose_level are the two strongest predictors.
Even with common sense, we can tell that these two features are directly related to diabetes.
Also, the diabetes plot shows that the data is skewed towards datapoints with outcome value 0, i.e. diabetes not present. In the cleaned data, the number of non-diabetic patients is roughly eight times the number of diabetic patients, indicating class imbalance.
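The imbalance can be quantified directly with a quick pandas check:
# fraction of each class in the cleaned data
df_cleaned['diabetes'].value_counts(normalize=True)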
Min-max Normalization (Scaling):¶
This method rescales each feature to a specific range, typically between 0 and 1. (Strictly speaking, standardization refers to z-score scaling, e.g. with StandardScaler; what we apply here is min-max scaling.)
df_min_max_scaled = (df_cleaned - df_cleaned.min()) / (df_cleaned.max() - df_cleaned.min())
#df_min_max_scaled.head(5)
df_min_max_scaled.describe().T
| count | mean | std | min | 25% | 50% | 75% | max | |
|---|---|---|---|---|---|---|---|---|
| gender | 63247.0 | 0.603855 | 0.489099 | 0.0 | 0.000000 | 1.000000 | 1.000000 | 1.0 |
| age | 63247.0 | 0.581186 | 0.244647 | 0.0 | 0.386273 | 0.586673 | 0.762024 | 1.0 |
| hypertension | 63247.0 | 0.099135 | 0.298846 | 0.0 | 0.000000 | 0.000000 | 0.000000 | 1.0 |
| heart_disease | 63247.0 | 0.047686 | 0.213103 | 0.0 | 0.000000 | 0.000000 | 0.000000 | 1.0 |
| smoking_history | 63247.0 | 0.310133 | 0.382747 | 0.0 | 0.000000 | 0.000000 | 0.500000 | 1.0 |
| bmi | 63247.0 | 0.224624 | 0.080276 | 0.0 | 0.176658 | 0.210913 | 0.258380 | 1.0 |
| HbA1c_level | 63247.0 | 0.375776 | 0.199365 | 0.0 | 0.236364 | 0.418182 | 0.490909 | 1.0 |
| blood_glucose_level | 63247.0 | 0.271327 | 0.191983 | 0.0 | 0.090909 | 0.272727 | 0.359091 | 1.0 |
| diabetes | 63247.0 | 0.111262 | 0.314459 | 0.0 | 0.000000 | 0.000000 | 0.000000 | 1.0 |
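Equivalently, scikit-learn's MinMaxScaler could be used instead of the manual formula; a minimal sketch (not run here, shown for reference):
# equivalent min-max scaling via scikit-learn
from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler()  # rescales each column to [0, 1]
df_min_max_scaled_alt = pd.DataFrame(scaler.fit_transform(df_cleaned), columns=df_cleaned.columns)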
Feature Importance:¶
A quick way to visually investigate the importance of the features is to fit a RandomForestClassifier and plot its feature importances with skplt.estimators.plot_feature_importances. You can then visually inspect how much each variable, relative to the other features, contributes to predicting the occurrence of diabetes.
Let's plot the features.
df2 = df_min_max_scaled
feature_names = df2.columns[:-1]
randfor = RandomForestClassifier()
randfor.fit(df2.drop(columns = "diabetes", axis=1),df2["diabetes"])
sp = skplt.estimators.plot_feature_importances(randfor, feature_names=feature_names, figsize=(10, 5), x_tick_rotation=90)
plt.show()
The result supports the assumption that blood_glucose_level and HbA1c_level are very strong predictors for the diagnosis of diabetes.
Feature Correlation:¶
Lastly, one should examine correlations between the variables, which helps to find the relationship between two quantities and gives a measure of the strength of their association. The correlation coefficient can take values between -1 and +1: +1 means perfectly positively correlated, -1 perfectly negatively correlated, and 0 not correlated at all.
Pearson, Kendall rank, and Spearman's rank correlation coefficients are computed using pairwise complete observations.
A heat map is a two-dimensional representation of information with the help of colors and can help the user visualize simple or complex information.
A heatmap with annotations is a nice way of visualizing a correlation matrix.
# Spearman’s rank correlation coefficient
correlation_matrix = df2.corr(method='spearman') # kendall, pearson
plt.figure(figsize=(8, 5.5)) # Adjust figure size as needed
sns.heatmap(correlation_matrix, annot=True, fmt=".2f", square=False, cmap='Blues',) #RdYlGn/jet/coolwarm (vibrant), viridis/cividis (for colorblind), Blues (sequential)
plt.title("Spearman's rank Correlation Matrix")
plt.show()
Observations:
- All variables except gender show a positive correlation with the target variable diabetes.
- Most of the time, the variables exhibit low correlations with diabetes. However, blood_glucose_level and HbA1c_level are more strongly correlated with the target diabetes than the other variables; age is also moderately positively correlated with diabetes.
- The features are mostly uncorrelated with each other.

The observations seem reasonably correct.
Step 3: Pre-setting and Modeling Strategy¶
Now comes the Machine Learning model part.
Data splitting¶
First, split the data into feature arrays and label arrays for training and testing, as is the standard approach in ML.
The train_test_split function from the sklearn.model_selection module is commonly used to divide a dataset into training and testing sets, so that the model is tested on datapoints it has never seen rather than on the same points it was trained with. This is a crucial step in machine learning to evaluate a model's performance on unseen data.
Cross Validation: When the data is split once into training and testing sets, it is possible that a specific type of datapoint ends up entirely in either the training or the testing portion, which would lead the model to perform poorly. Overfitting and underfitting problems can be better diagnosed and avoided with cross-validation techniques.
X and y: These are the input features and the target variable of your dataset, respectively. Your model will never see the y test-set data while training; it is hidden from the model and used to evaluate the performance of your model on unseen data.
test_size: The proportion of the dataset to include in the test split (e.g., 0.2 for 20%).
random_state: Controls the shuffling applied to the data, for reproducibility.
X = df_min_max_scaled.drop(columns = "diabetes", axis=1) # data with feature columns
y = df_min_max_scaled["diabetes"] # target labels
#split data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 1/3, random_state = 42)
X_train -> the data used to train the model
y_train -> labels of the data in the training set
X_test -> test-set input features (used only for the final predictions)
y_test -> test-set target labels (you will evaluate your predictions by comparing against this set)
[X_train.shape, X_test.shape, y_train.shape, y_test.shape]
[(42164, 8), (21083, 8), (42164,), (21083,)]
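Given the class imbalance noted earlier, a stratified split, which preserves the 0/1 proportions in both partitions, is often preferable; a sketch of this optional variation (the notebook itself uses the plain random split above):
# Optional variation: stratify on the label to keep the class ratio equal in both splits
X_train_s, X_test_s, y_train_s, y_test_s = train_test_split(X, y, test_size=1/3, random_state=42, stratify=y)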
Step 4: ML Approach¶
The following steps summarize our approach to fitting a particular model. We will repeat them for all the models:
- Decide on the training parameters, then create and train the model:
  - Create the model by using an appropriate classifier from scikit-learn, e.g. model = KNeighborsClassifier(n_neighbors=k)
  - Call model.fit(X_train, y_train) to train the model
- Predict the labels pred_model by calling model.predict(X_test)
- Evaluate the model performance on test data with:
  - accuracy_score(y_test, pred_model)
  - classification_report(y_test, pred_model)
  - confusion_matrix(y_test, pred_model)

The strategy is identical for all the models I will attempt to fit, i.e.: i) k-nearest neighbor (k-NN), ii) logistic regression, iii) decision tree, and iv) random forest.
Therefore, I will discuss the key issues only for i) k-nearest neighbor. I am intentionally not commenting on the other models because the logic remains the same. However, the parameter grid (param_grid below) should be unique to each classifier, because the hyperparameters differ from model to model.
I will comment on the performance and draw conclusions in the last part of the article.
Model: k-Nearest Neighbour¶
How to choose the value of k?
Choosing the optimal value of 'k' in a K-Nearest Neighbors (KNN) classifier in Python is a crucial step for achieving good model performance. While there's no single "best" method, several approaches are commonly employed.
# # Elbow method
# error_rates = [] # Initialize an empty list to store error rates
# k_values = range(1, 51) # # Iterate through a range of K values
# for k in k_values:
# knn = KNeighborsClassifier(n_neighbors=k) # Initialize KNN classifier with the current K
# knn.fit(X_train, y_train) # Fit the model to the training data
# predictions = knn.predict(X_test) # Make predictions on the test data
# error_rate = np.mean(predictions != y_test) # Calculate the error rate (misclassification rate)
# error_rates.append(error_rate) # Append the error rate to the list
# # Plot the error rate vs. K value
# plt.figure(figsize=(8, 6))
# plt.plot(k_values, error_rates, color='blue', linestyle='dashed', marker='o',markerfacecolor='red', markersize=8)
# plt.title('Error Rate vs. K Value (Elbow Method for KNN)')
# plt.xlabel('K Value')
# plt.ylabel('Error Rate')
# plt.grid(True)
# plt.show()
# # Square root of N method
# int(np.sqrt(df_cleaned.shape[0]))
Hyperparameter Tuning¶
GridSearchCV in scikit-learn for determining the value of k¶
5-fold cross-validation on train set¶
Here is an example using GridSearchCV in scikit-learn:
GridSearchCV performs 5-fold cross-validation on X_train and y_train: the training data is split into 5 folds, and each fold serves once as the validation set while the model is trained on the remaining four. While cross-validating, GridSearchCV searches for the best parameters among those specified in param_grid.
knn = KNeighborsClassifier()
param_grid = {'n_neighbors': range(1, 51, 2)} # Test odd k values from 1 to 49
grid_search = GridSearchCV(knn, param_grid, cv=5, scoring='accuracy')
grid_search.fit(X_train, y_train)
best_k = grid_search.best_params_['n_neighbors']
best_score = grid_search.best_score_
print(f"Best k: {best_k}, Best accuracy: {best_score}")
Best k: 13, Best accuracy: 0.9492220378242069
Model Training¶
# Step 1. Initialize and train the KNN classifier
k = best_k
model_knn = KNeighborsClassifier(n_neighbors=k)
model_knn.fit(X_train, y_train)
KNeighborsClassifier(n_neighbors=13)
Cross Validation¶
# Define the cross-validation strategy
kf = KFold(n_splits=5, shuffle=True, random_state=42)
# Perform 5-fold cross-validation using the KFold strategy defined above ('cv=kf')
# 'scoring="accuracy"' specifies the evaluation metric
# note: rounding the fold scores before aggregating masks small fold-to-fold variation
scores = np.round(cross_val_score(model_knn, X_train, y_train, cv=kf, scoring='accuracy'), 2)
print("Cross-validation scores for each fold on train data:", scores)
print("Mean accuracy:", scores.mean())
print("Standard deviation of accuracy:", scores.std())
Cross-validation scores for each fold on train data: [0.95 0.95 0.95 0.95 0.95]
Mean accuracy: 0.95
Standard deviation of accuracy: 0.0
Prediction Accuracy on Test data (unseen by model)¶
# Make predictions on the test set using the model
pred_knn = model_knn.predict(X_test)
[y_test.shape, pred_knn.shape]
[(21083,), (21083,)]
Evaluation (Model Performance Analysis)¶
1. Confusion Matrix¶
The confusion matrix is a technique for summarizing the performance of a classification algorithm, i.e. one with categorical (here, binary) outputs.
A confusion matrix is a table that describes the performance of a classification model. It summarizes the counts of true positive, true negative, false positive, and false negative predictions, providing a detailed view of how well the model distinguishes between the classes.
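ConfusionMatrixDisplay, already imported above, offers a one-line way to render it from predictions (assuming scikit-learn >= 1.0; shown as a sketch, the notebook uses a seaborn heatmap below):
# sketch: plot the confusion matrix directly from predictions
ConfusionMatrixDisplay.from_predictions(y_test, pred_knn)
plt.show()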
2. classification_report:¶
Method that gives a fuller picture of the model's performance, including precision, recall, and F1-score.
Precision Score:¶
- Precision (aka positive predictive value (PPV)) - accuracy of the positive predictions.
- $PPV = \frac{TP}{TP+FP}$
- It is similar to accuracy, but focuses only on the data the model predicted to be positive, i.e. diabetes = 1. Referring to a confusion matrix, a precision of 1 means there were no false positives.
Recall Score:¶
- Recall (sensitivity or true positive rate): fraction of positives that were correctly identified.
- Also called sensitivity or True Positive Rate (TPR), it answers the question of how complete the results are, i.e. did the model miss any of the positive class, and to what extent?
- $TPR = \frac{TP}{TP+FN}$
- As a rough rule of thumb, a recall greater than 0.5 is considered good.
- In our case, a low recall would mean the model incorrectly classified a lot of individuals with diabetes as healthy ones.
F1 Score:¶
- F1 Score (aka F-Score or F-Measure) - a helpful metric for comparing two classifiers.
- The F1 Score takes into account both precision and recall.
- It is the harmonic mean of precision and recall.
- $F1 = \frac{2 \cdot precision \cdot recall}{precision + recall}$

One might ask: which of the two metrics is superior? In fact, it really depends!
For example, imagine cancer diagnostics. Would you rather classify a few more patients as false positives and, after more precise examination, conclude they had no cancer, or would you rather let the ones with cancer slip through as healthy individuals? In this particular case, the model should minimize $FN$ in the confusion matrix; consequently, recall, i.e. TPR, should be close to 1. Lastly, there is always a trade-off between the two negatively correlated metrics.
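One way to act on this trade-off is to move the decision threshold on the predicted probabilities instead of using the implicit default of 0.5; a sketch, where the threshold value 0.3 is an arbitrary illustration rather than a tuned choice:
# Lower the decision threshold to favor recall over precision
threshold = 0.3  # illustrative value; the default implicit threshold is 0.5
proba_pos = model_knn.predict_proba(X_test)[:, 1]
pred_low_thresh = (proba_pos >= threshold).astype(int)
print(classification_report(y_test, pred_low_thresh))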
1. Confusion Matrix¶
pd.crosstab(y_test, pred_knn, rownames=['True'], colnames=['Predicted'], margins = True)
| Predicted | 0.0 | 1.0 | All |
|---|---|---|---|
| True | |||
| 0.0 | 18780 | 64 | 18844 |
| 1.0 | 934 | 1305 | 2239 |
| All | 19714 | 1369 | 21083 |
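Reading the positive-class counts off the crosstab above (TP = 1305, FP = 64, FN = 934), we can verify by hand the precision and recall that the classification report below will show:
# verify precision/recall/F1 from the confusion-matrix counts
TP, FP, FN = 1305, 64, 934
precision_check = TP / (TP + FP)  # ~0.95
recall_check = TP / (TP + FN)  # ~0.58
f1_check = 2 * precision_check * recall_check / (precision_check + recall_check)  # ~0.72
print(f"precision={precision_check:.2f}, recall={recall_check:.2f}, f1={f1_check:.2f}")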
from sklearn.metrics import confusion_matrix
cm_knn = confusion_matrix(y_test, pred_knn)  # avoid shadowing the imported function name
p = sns.heatmap(pd.DataFrame(cm_knn), annot=True, cmap="YlGnBu", fmt='g')
plt.title('Confusion matrix', y=1.1)
plt.ylabel('Actual label')
plt.xlabel('Predicted label')
Text(0.5, 20.049999999999997, 'Predicted label')
2. Classification Report¶
from sklearn.metrics import classification_report
print(classification_report(y_test,pred_knn))
precision recall f1-score support
0.0 0.95 1.00 0.97 18844
1.0 0.95 0.58 0.72 2239
accuracy 0.95 21083
macro avg 0.95 0.79 0.85 21083
weighted avg 0.95 0.95 0.95 21083
3. Accuracy and Error¶
print("KNN Accuracy: {}".format(np.round(accuracy_score(y_test, pred_knn), 2)))
print("Root Mean Squared Error: {}".format(np.round(np.sqrt(mean_squared_error(y_test, pred_knn)), 2)))
KNN Accuracy: 0.95
Root Mean Squared Error: 0.22
4. ROC - AUC:¶
The ROC (Receiver Operating Characteristic) curve tells us how well the model can distinguish between two classes (e.g., whether a patient has a disease or not). Better models can accurately distinguish between the two, whereas a poor model will have difficulty doing so.
from sklearn.metrics import roc_curve, auc
y_pred_proba = model_knn.predict_proba(X_test)[:,1]
fpr, tpr, thresholds = roc_curve(y_test, y_pred_proba)
roc_auc = np.round(auc(fpr, tpr), 2)
roc_auc
0.94
[y_test.shape, y_pred_proba.shape]
[(21083,), (21083,)]
plt.plot([0,1],[0,1],'k--')
plt.plot(fpr,tpr, label='Knn')
plt.xlabel('FPR')
plt.ylabel('TPR')
plt.title('ROC curve')
plt.show()
plt.figure()
lw = 2
plt.plot(
fpr,
tpr,
color="darkorange",
lw=lw,
label="ROC curve (area = %0.2f)" % roc_auc,
)
plt.plot([0, 1], [0, 1], color="navy", lw=lw, linestyle="--")
plt.xlabel("False Positive Rate")
plt.ylabel("True Positive Rate")
plt.title("ROC curve")
plt.legend(loc="lower right")
plt.show()
# import scikitplot as skplt
# skplt.metrics.plot_roc_curve(y_test, y_pred_proba)
# plt.show()
#Area under ROC curve
from sklearn.metrics import roc_auc_score
roc_auc_score(y_test,y_pred_proba)
0.9430934380578406
# Evaluate the model using scikit-learn metrices
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
from sklearn.metrics import mean_squared_error, r2_score, roc_auc_score
from sklearn.metrics import confusion_matrix, classification_report
pred_knn = model_knn.predict(X_test)
accuracy = accuracy_score(y_test, pred_knn)
precision = precision_score(y_test, pred_knn)
recall = recall_score(y_test, pred_knn)
auc_knn = roc_auc_score(y_test, pred_knn)  # renamed to avoid shadowing the auc function; hard labels give a coarser AUC than probabilities
cr = classification_report(y_test, pred_knn)
cm = confusion_matrix(y_test, pred_knn)
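For completeness, the computed metrics can be printed; the accuracy, precision, and recall values match the classification report above:
print(f"Accuracy : {accuracy:.2f}")  # 0.95
print(f"Precision: {precision:.2f}")  # 0.95
print(f"Recall   : {recall:.2f}")  # 0.58
print(f"ROC AUC  : {auc_knn:.2f}")
print(cr)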
Conclusion¶
As you can see, the KNN model reaches an accuracy of 0.95, meaning it predicted 95% of the cases correctly. Given the class imbalance, however, accuracy alone paints an optimistic picture: the recall for the diabetic class is only 0.58, so a substantial share of diabetic patients is still misclassified as healthy.