Supervised Machine Learning With Scikit-learn and the Diabetes Dataset¶
Exploratory Data Analysis (EDA) and Wrangling, Classification (k-Nearest Neighbors), Model Fitting, Hyperparameter Tuning, and Performance Evaluation¶
Dataset:¶
Diabetes is a chronic health condition affecting millions worldwide. Early prediction of diabetes can help in timely management and
prevention of complications. In this article, we will walk through a Python-based machine learning project for predicting diabetes using a Diabetes Dataset from Kaggle.
We will use Python libraries such as numpy, pandas, and scikit-learn, and the k-nearest neighbors (kNN) classification algorithm. We will also see why handling missing data and tuning the hyperparameters of each model matter for achieving the best performance and generalizability.
diabetes_prediction_dataset.csv: https://www.kaggle.com/code/mahmoudbahnasy29/diabetes?select=diabetes_prediction_dataset.csv
This file contains medical and demographic data of patients along with their diabetes status, whether positive or negative. It consists of features such as age, gender, body mass index (BMI), hypertension, heart disease, smoking history, HbA1c level, and blood glucose level. The dataset can be used to build machine learning models that predict the likelihood of diabetes in patients based on their medical history and demographic details.
Step 1: Loading the necessary packages¶
## Please uncomment the following line and run it to install scikit-plot (used for visualization) on the first run of the notebook.
# Once it is installed, you can comment it out again for subsequent clean runs of the notebook.
# %pip install scikit-plot
# Utility Libraries
import os
# data handling
import pandas as pd
import numpy as np
# Data preprocessing for ML
from sklearn.model_selection import train_test_split
from sklearn.model_selection import KFold, cross_val_score
# model training and testing facilitators
from sklearn.preprocessing import StandardScaler
# Overfitting/underfitting guide
from sklearn.model_selection import GridSearchCV
# ML models to be explored
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
# Performance measurements
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.metrics import classification_report
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, roc_auc_score, confusion_matrix, ConfusionMatrixDisplay, roc_curve
# Visualization
import scikitplot as skplt
import matplotlib.pyplot as plt
import plotly.express as px
import seaborn as sns
sns.set()
%matplotlib inline
# ignore all future warnings
from warnings import simplefilter
simplefilter(action='ignore', category= FutureWarning)
print("Imports Done.")
Imports Done.
Step 2: Data collection¶
Locate correct directory¶
# go to the directory where your files are. The data is located in a csv file -> 'inputs/diabetes_prediction_dataset.csv'
data_dir = "<path/to/your/code/directory/where/this/jupyter/notebook/is/located>/"
#os.chdir(data_dir)
#print("Current Working Directory:",os.getcwd())
#print("Current working directory contains the following files:\n",os.listdir("./"))
Read data¶
# load data in a pandas dataframe
df = pd.read_csv(data_dir+"inputs/diabetes_prediction_dataset.csv")
df.head(5) # read first 5 lines of the data
| gender | age | hypertension | heart_disease | smoking_history | bmi | HbA1c_level | blood_glucose_level | diabetes | |
|---|---|---|---|---|---|---|---|---|---|
| 0 | Female | 80.0 | 0 | 1 | never | 25.19 | 6.6 | 140 | 0 |
| 1 | Female | 54.0 | 0 | 0 | No Info | 27.32 | 6.6 | 80 | 0 |
| 2 | Male | 28.0 | 0 | 0 | never | 27.32 | 5.7 | 158 | 0 |
| 3 | Female | 36.0 | 0 | 0 | current | 23.45 | 5.0 | 155 | 0 |
| 4 | Male | 76.0 | 1 | 1 | current | 20.14 | 4.8 | 155 | 0 |
original_dataframe_shape = df.shape
original_dataframe_shape
(100000, 9)
- The raw dataset contains 100000 rows, one for each patient, and 9 columns, each representing one feature.
- Supervised machine learning algorithms require labelled data. The last column diabetes is the label; 0/1 -> non-diabetic/diabetic patient.
Describe data¶
DataFrame.info() gives information about the data types, columns, non-null value counts, memory usage, etc.
df.info(verbose=True)
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100000 entries, 0 to 99999
Data columns (total 9 columns):
 #   Column               Non-Null Count   Dtype  
---  ------               --------------   -----  
 0   gender               100000 non-null  object 
 1   age                  100000 non-null  float64
 2   hypertension         100000 non-null  int64  
 3   heart_disease        100000 non-null  int64  
 4   smoking_history      100000 non-null  object 
 5   bmi                  100000 non-null  float64
 6   HbA1c_level          100000 non-null  float64
 7   blood_glucose_level  100000 non-null  int64  
 8   diabetes             100000 non-null  int64  
dtypes: float64(3), int64(4), object(2)
memory usage: 6.9+ MB
DataFrame.describe() generates descriptive statistics that summarize the central tendency, dispersion, and shape of a dataset's distribution, excluding NaN values. This method tells us a lot about a dataset. One important thing to note is that describe() deals only with numeric values by default: if there are categorical columns, it will ignore them and display a summary for the other columns, unless the parameter include="all" is passed.
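For instance, passing include="all" pulls the two categorical columns (gender and smoking_history) into the summary as well; a quick check one could run (not executed in this notebook):
# summarize numeric and categorical columns together
df.describe(include='all').T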
df.describe().T
| count | mean | std | min | 25% | 50% | 75% | max | |
|---|---|---|---|---|---|---|---|---|
| age | 100000.0 | 41.885856 | 22.516840 | 0.08 | 24.00 | 43.00 | 60.00 | 80.00 |
| hypertension | 100000.0 | 0.074850 | 0.263150 | 0.00 | 0.00 | 0.00 | 0.00 | 1.00 |
| heart_disease | 100000.0 | 0.039420 | 0.194593 | 0.00 | 0.00 | 0.00 | 0.00 | 1.00 |
| bmi | 100000.0 | 27.320767 | 6.636783 | 10.01 | 23.63 | 27.32 | 29.58 | 95.69 |
| HbA1c_level | 100000.0 | 5.527507 | 1.070672 | 3.50 | 4.80 | 5.80 | 6.20 | 9.00 |
| blood_glucose_level | 100000.0 | 138.058060 | 40.708136 | 80.00 | 100.00 | 140.00 | 159.00 | 300.00 |
| diabetes | 100000.0 | 0.085000 | 0.278883 | 0.00 | 0.00 | 0.00 | 0.00 | 1.00 |
print(df.isnull().sum())
gender                 0
age                    0
hypertension           0
heart_disease          0
smoking_history        0
bmi                    0
HbA1c_level            0
blood_glucose_level    0
diabetes               0
dtype: int64
No missing values found.
Remove duplicates¶
#Detect & Handle Duplicates
df.duplicated().sum()
3854
df.drop_duplicates(inplace=True, ignore_index=True)
df.shape
(96146, 9)
df.describe().T
| count | mean | std | min | 25% | 50% | 75% | max | |
|---|---|---|---|---|---|---|---|---|
| age | 96146.0 | 41.794326 | 22.462948 | 0.08 | 24.0 | 43.00 | 59.00 | 80.00 |
| hypertension | 96146.0 | 0.077601 | 0.267544 | 0.00 | 0.0 | 0.00 | 0.00 | 1.00 |
| heart_disease | 96146.0 | 0.040803 | 0.197833 | 0.00 | 0.0 | 0.00 | 0.00 | 1.00 |
| bmi | 96146.0 | 27.321461 | 6.767716 | 10.01 | 23.4 | 27.32 | 29.86 | 95.69 |
| HbA1c_level | 96146.0 | 5.532609 | 1.073232 | 3.50 | 4.8 | 5.80 | 6.20 | 9.00 |
| blood_glucose_level | 96146.0 | 138.218231 | 40.909771 | 80.00 | 100.0 | 140.00 | 159.00 | 300.00 |
| diabetes | 96146.0 | 0.088220 | 0.283616 | 0.00 | 0.0 | 0.00 | 0.00 | 1.00 |
# Univariate analysis: count/pie plots for categorical and binary columns, hist/box plots for numeric ones
for col in df.columns:
    if col in df.select_dtypes('O').columns.to_list() + ['hypertension', 'heart_disease', 'diabetes']:
        fig, axes = plt.subplots(1, 2)
        axc = sns.countplot(x=df[col], ax=axes[0])
        if col == 'smoking_history':
            axc.tick_params(axis='x', rotation=90)
        axes[1].pie(x=df[col].value_counts().values, labels=df[col].value_counts().index, autopct='%.2f%%')
    else:
        fig, axes = plt.subplots(1, 2)
        ax = sns.histplot(x=df[col], kde=True, ax=axes[0])
        sns.boxplot(x=df[col], ax=axes[1])
    plt.show()
Bivariate Analysis¶
sns.boxplot(x=df['diabetes'],y=df['bmi'])
<Axes: xlabel='diabetes', ylabel='bmi'>
sns.boxplot(x=df['diabetes'],y=df['age'])
<Axes: xlabel='diabetes', ylabel='age'>
Handling categorical variables¶
We will later use the k-nearest neighbors classification algorithm as our model, which is a distance-based algorithm.
Therefore, we need to map the two categorical variables gender and smoking_history to numerical values.
# Define your mapping dictionary
mapping_gender = {'Male':1,'Female':2,'Other':0}
mapping_smoking_history = {'No Info':0,'never':1,'ever':2,'former':3,'not current':4,'current':5}
# Apply the mapping
df['gender'] = df['gender'].map(mapping_gender)
df['smoking_history'] = df['smoking_history'].map(mapping_smoking_history)
df.head(5)
| gender | age | hypertension | heart_disease | smoking_history | bmi | HbA1c_level | blood_glucose_level | diabetes | |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 2 | 80.0 | 0 | 1 | 1 | 25.19 | 6.6 | 140 | 0 |
| 1 | 2 | 54.0 | 0 | 0 | 0 | 27.32 | 6.6 | 80 | 0 |
| 2 | 1 | 28.0 | 0 | 0 | 1 | 27.32 | 5.7 | 158 | 0 |
| 3 | 2 | 36.0 | 0 | 0 | 5 | 23.45 | 5.0 | 155 | 0 |
| 4 | 1 | 76.0 | 1 | 1 | 5 | 20.14 | 4.8 | 155 | 0 |
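As a side note, the integer codes above impose an arbitrary ordering on the categories, which a distance-based model like kNN treats as meaningful. A common alternative is one-hot encoding; a minimal sketch, assuming df_raw is a hypothetical copy of the dataframe as loaded in Step 2, before the .map() calls (not used in the rest of this notebook):
# Hypothetical alternative: one-hot encode the categorical columns instead of ordinal mapping
# df_raw is assumed to be a copy of the dataframe before the .map() calls above
df_onehot = pd.get_dummies(df_raw, columns=['gender', 'smoking_history'], drop_first=True)
df_onehot.head()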
Data Wrangling (handling missing values)¶
In general, there are three approaches a data scientist may take to handle missing values:
- Delete all observations with missing values, which would result in substantial loss of data ❗ and is therefore NOT recommended in general. But with a large dataset, if deleting them all looks safe, this is the simplest way to handle missing data.
- Substitute missing values with the mean, median, or mode, which can be a good trade-off between regression to the mean and keeping the data (see the sketch after this list). However, it is not the best solution when more than half of the values in a variable would need substituting.
- Give up; No! ❌

Our data do not have any NaN. Instead, 0 values in this data represent missing records in some columns, not all.
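A minimal sketch of option 2 with pandas, kept commented out like the other optional snippets in this notebook ('some_column' is a hypothetical placeholder; below we take the deletion route instead):
# Illustrative sketch (not used below): substitute NaNs with the column median
# df['some_column'] = df['some_column'].fillna(df['some_column'].median())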
Missing data:¶
- smoking_history has some rows with No Info (mapped to 0 above).
- gender has some rows with Other (mapped to 0 above).

It is better to replace these zeros with NaN: counting them then becomes easier, and the zeros need to be replaced with suitable values (or dropped) anyway.
# Copy original dataframe to a copy and manipulate this later
df_copy = df.copy(deep=True)
df_copy.shape
(96146, 9)
# replace zeros with NaNs for columns 'smoking_history' and 'gender'
df_copy[['smoking_history']] = df_copy[['smoking_history']].replace(0, np.nan)
df_copy[['gender']] = df_copy[['gender']].replace(0, np.nan)
# count total rows with NaNs
df_copy.isna().any(axis=1).sum()
32899
Drop NaN values¶
# Drop rows where column 'smoking_history' has NaN values
df_cleaned = df_copy.dropna(subset=['smoking_history', 'gender'])
df_cleaned.shape
(63247, 9)
hist = df.hist(figsize = [15, 15])
# pair plot on cleaned data
p = sns.pairplot(df_cleaned, hue = 'diabetes', aspect=1.5) # height=2.5
The variables exhibit various distribution patterns.
Note that HbA1c_level and blood_glucose_level are the two strongest predictors.
Even with common sense, we can tell that these two features are directly related to diabetes.
Also, the diabetes plot shows that the data is skewed towards datapoints with outcome value 0, i.e. diabetes not present. In the cleaned data, the number of non-diabetic patients is roughly eight times the number of diabetic patients, indicating class imbalance.
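The imbalance can be quantified directly with a quick pandas check:
# fraction of each class in the cleaned data
df_cleaned['diabetes'].value_counts(normalize=True)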
Min-max Normalization (Scaling):¶
This method rescales each feature to a specific range, typically between 0 and 1. (Strictly speaking, standardization refers to z-score scaling, e.g. with StandardScaler; what we apply here is min-max scaling.)
df_min_max_scaled = (df_cleaned - df_cleaned.min()) / (df_cleaned.max() - df_cleaned.min())
#df_min_max_scaled.head(5)
df_min_max_scaled.describe().T
| count | mean | std | min | 25% | 50% | 75% | max | |
|---|---|---|---|---|---|---|---|---|
| gender | 63247.0 | 0.603855 | 0.489099 | 0.0 | 0.000000 | 1.000000 | 1.000000 | 1.0 |
| age | 63247.0 | 0.581186 | 0.244647 | 0.0 | 0.386273 | 0.586673 | 0.762024 | 1.0 |
| hypertension | 63247.0 | 0.099135 | 0.298846 | 0.0 | 0.000000 | 0.000000 | 0.000000 | 1.0 |
| heart_disease | 63247.0 | 0.047686 | 0.213103 | 0.0 | 0.000000 | 0.000000 | 0.000000 | 1.0 |
| smoking_history | 63247.0 | 0.310133 | 0.382747 | 0.0 | 0.000000 | 0.000000 | 0.500000 | 1.0 |
| bmi | 63247.0 | 0.224624 | 0.080276 | 0.0 | 0.176658 | 0.210913 | 0.258380 | 1.0 |
| HbA1c_level | 63247.0 | 0.375776 | 0.199365 | 0.0 | 0.236364 | 0.418182 | 0.490909 | 1.0 |
| blood_glucose_level | 63247.0 | 0.271327 | 0.191983 | 0.0 | 0.090909 | 0.272727 | 0.359091 | 1.0 |
| diabetes | 63247.0 | 0.111262 | 0.314459 | 0.0 | 0.000000 | 0.000000 | 0.000000 | 1.0 |
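Equivalently, scikit-learn's MinMaxScaler could be used instead of the manual formula; a minimal sketch (not run here, shown for reference):
# equivalent min-max scaling via scikit-learn
from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler()  # rescales each column to [0, 1]
df_min_max_scaled_alt = pd.DataFrame(scaler.fit_transform(df_cleaned), columns=df_cleaned.columns)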
Feature Importance:¶
A quick way to visually investigate the importance of the features is to fit a RandomForestClassifier and plot its feature importances with skplt.estimators.plot_feature_importances. You can then visually inspect how much each variable, relative to the other features, contributes to predicting the occurrence of diabetes.
Let's plot the features.
df2 = df_min_max_scaled
feature_names = df2.columns[:-1]
randfor = RandomForestClassifier()
randfor.fit(df2.drop(columns = "diabetes", axis=1),df2["diabetes"])
sp = skplt.estimators.plot_feature_importances(randfor, feature_names=feature_names, figsize=(10, 5), x_tick_rotation=90)
plt.show()
The result supports the assumption that blood_glucose_level and HbA1c_level are very strong predictors for the diagnosis of diabetes.
Feature Correlation:¶
Lastly, one should examine correlations between the variables, which helps to find the relationship between two quantities and gives a measure of the strength of their association. The correlation coefficient can take values between -1 and +1: +1 means perfectly positively correlated, -1 perfectly negatively correlated, and 0 not correlated at all.
Pearson, Kendall rank, and Spearman's rank correlation coefficients are computed using pairwise complete observations.
A heat map is a two-dimensional representation of information with the help of colors and can help the user visualize simple or complex information.
A heatmap with annotations is a nice way of visualizing a correlation matrix.
# Spearman’s rank correlation coefficient
correlation_matrix = df2.corr(method='spearman') # kendall, pearson
plt.figure(figsize=(8, 5.5)) # Adjust figure size as needed
sns.heatmap(correlation_matrix, annot=True, fmt=".2f", square=False, cmap='Blues',) #RdYlGn/jet/coolwarm (vibrant), viridis/cividis (for colorblind), Blues (sequential)
plt.title("Spearman's rank Correlation Matrix")
plt.show()
Observations:
- All variables except gender show a positive correlation with the target variable diabetes.
- Most of the time, the variables exhibit low correlations with diabetes. However, blood_glucose_level and HbA1c_level are more strongly correlated with the target diabetes than the other variables; age is also moderately positively correlated with diabetes.
- The features are mostly uncorrelated with each other.

The observations seem reasonably correct.
Step 3: Pre-setting and Modeling Strategy¶
Now comes the Machine Learning model part.
Data splitting¶
First, split the data into feature arrays and label arrays for training and testing, as is the standard approach in ML.
The train_test_split function from the sklearn.model_selection module is commonly used to divide a dataset into training and testing sets, so that the model is tested on datapoints it has never seen rather than on the same points it was trained with. This is a crucial step in machine learning to evaluate a model's performance on unseen data.
Cross Validation: When the data is split once into training and testing sets, it is possible that a specific type of datapoint ends up entirely in either the training or the testing portion, which would lead the model to perform poorly. Overfitting and underfitting problems can be better diagnosed and avoided with cross-validation techniques.
X and y: These are the input features and the target variable of your dataset, respectively. Your model will never see the y test-set data while training; it is hidden from the model and used to evaluate the performance of your model on unseen data.
test_size: The proportion of the dataset to include in the test split (e.g., 0.2 for 20%).
random_state: Controls the shuffling applied to the data, for reproducibility.
X = df_min_max_scaled.drop(columns = "diabetes", axis=1) # data with feature columns
y = df_min_max_scaled["diabetes"] # target labels
#split data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 1/3, random_state = 42)
X_train -> the data used to train the model
y_train -> labels of the data in the training set
X_test -> test-set input features (used only for the final predictions)
y_test -> test-set target labels (you will evaluate your predictions by comparing against this set)
[X_train.shape, X_test.shape, y_train.shape, y_test.shape]
[(42164, 8), (21083, 8), (42164,), (21083,)]
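Given the class imbalance noted earlier, a stratified split, which preserves the 0/1 proportions in both partitions, is often preferable; a sketch of this optional variation (the notebook itself uses the plain random split above):
# Optional variation: stratify on the label to keep the class ratio equal in both splits
X_train_s, X_test_s, y_train_s, y_test_s = train_test_split(X, y, test_size=1/3, random_state=42, stratify=y)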
Step 4: ML Approach¶
The following steps summarize our approach to fitting a particular model. We will repeat them for all the models:
- Decide on the training parameters, then create and train the model:
  - Create the model by using an appropriate classifier from scikit-learn, e.g. model = KNeighborsClassifier(n_neighbors=k)
  - Call model.fit(X_train, y_train) to train the model
- Predict the labels pred_model by calling model.predict(X_test)
- Evaluate the model performance on test data with:
  - accuracy_score(y_test, pred_model)
  - classification_report(y_test, pred_model)
  - confusion_matrix(y_test, pred_model)

The strategy is identical for all the models I will attempt to fit, i.e.: i) k-nearest neighbor (k-NN), ii) logistic regression, iii) decision tree, and iv) random forest.
Therefore, I will discuss the key issues only for i) k-nearest neighbor. I am intentionally not commenting on the other models because the logic remains the same. However, the parameter grid (param_grid below) should be unique to each classifier, because the hyperparameters differ from model to model.
I will comment on the performance and draw conclusions in the last part of the article.
Model: k-Nearest Neighbour¶
How to choose the value of k?
Choosing the optimal value of 'k' in a K-Nearest Neighbors (KNN) classifier in Python is a crucial step for achieving good model performance. While there's no single "best" method, several approaches are commonly employed.
# # Elbow method
# error_rates = [] # Initialize an empty list to store error rates
# k_values = range(1, 51) # # Iterate through a range of K values
# for k in k_values:
# knn = KNeighborsClassifier(n_neighbors=k) # Initialize KNN classifier with the current K
# knn.fit(X_train, y_train) # Fit the model to the training data
# predictions = knn.predict(X_test) # Make predictions on the test data
# error_rate = np.mean(predictions != y_test) # Calculate the error rate (misclassification rate)
# error_rates.append(error_rate) # Append the error rate to the list
# # Plot the error rate vs. K value
# plt.figure(figsize=(8, 6))
# plt.plot(k_values, error_rates, color='blue', linestyle='dashed', marker='o',markerfacecolor='red', markersize=8)
# plt.title('Error Rate vs. K Value (Elbow Method for KNN)')
# plt.xlabel('K Value')
# plt.ylabel('Error Rate')
# plt.grid(True)
# plt.show()
# # Square root of N method
# int(np.sqrt(df_cleaned.shape[0]))
Hyperparameter Tuning¶
GridSearchCV in scikit-learn for determining the value of k¶
5-fold cross-validation on train set¶
Here is an example using GridSearchCV in scikit-learn:
GridSearchCV performs 5-fold cross-validation on X_train and y_train: the training data is split into 5 folds, and each fold serves once as the validation set while the model is trained on the remaining four. While cross-validating, GridSearchCV searches for the best parameters among those specified in param_grid.
knn = KNeighborsClassifier()
param_grid = {'n_neighbors': range(1, 51, 2)} # Test odd k values from 1 to 49
grid_search = GridSearchCV(knn, param_grid, cv=5, scoring='accuracy')
grid_search.fit(X_train, y_train)
best_k = grid_search.best_params_['n_neighbors']
best_score = grid_search.best_score_
print(f"Best k: {best_k}, Best accuracy: {best_score}")
Best k: 13, Best accuracy: 0.9492220378242069
Model Training¶
# Step 1. Initialize and train the KNN classifier
k = best_k
model_knn = KNeighborsClassifier(n_neighbors=k)
model_knn.fit(X_train, y_train)
KNeighborsClassifier(n_neighbors=13)
Cross Validation¶
# Define the cross-validation strategy
kf = KFold(n_splits=5, shuffle=True, random_state=42)
# Perform 5-fold cross-validation using the KFold strategy defined above ('cv=kf')
# 'scoring="accuracy"' specifies the evaluation metric
# note: rounding the fold scores before aggregating masks small fold-to-fold variation
scores = np.round(cross_val_score(model_knn, X_train, y_train, cv=kf, scoring='accuracy'), 2)
print("Cross-validation scores for each fold on train data:", scores)
print("Mean accuracy:", scores.mean())
print("Standard deviation of accuracy:", scores.std())
Cross-validation scores for each fold on train data: [0.95 0.95 0.95 0.95 0.95]
Mean accuracy: 0.95
Standard deviation of accuracy: 0.0
Prediction Accuracy on Test data (unseen by model)¶
# Make predictions on the test set using the model
pred_knn = model_knn.predict(X_test)
[y_test.shape, pred_knn.shape]
[(21083,), (21083,)]
Evaluation (Model Performance Analysis)¶
1. Confusion Matrix¶
The confusion matrix is a technique for summarizing the performance of a classification algorithm, i.e. one with categorical (here, binary) outputs.
A confusion matrix is a table that describes the performance of a classification model. It summarizes the counts of true positive, true negative, false positive, and false negative predictions, providing a detailed view of how well the model distinguishes between the classes.
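ConfusionMatrixDisplay, already imported above, offers a one-line way to render it from predictions (assuming scikit-learn >= 1.0; shown as a sketch, the notebook uses a seaborn heatmap below):
# sketch: plot the confusion matrix directly from predictions
ConfusionMatrixDisplay.from_predictions(y_test, pred_knn)
plt.show()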
2. classification_report:¶
Method that gives a fuller picture of the model's performance, including precision, recall, and F1-score.
Precision Score:¶
- Precision (aka positive predictive value (PPV)) - accuracy of the positive predictions.
- $PPV = \frac{TP}{TP+FP}$
- It is similar to accuracy, but focuses only on the data the model predicted to be positive, i.e. diabetes = 1. Referring to a confusion matrix, a precision of 1 means there were no false positives.
Recall Score:¶
- Recall (sensitivity or true positive rate): fraction of positives that were correctly identified.
- Also called sensitivity or True Positive Rate (TPR), it answers the question of how complete the results are, i.e. did the model miss any of the positive class, and to what extent?
- $TPR = \frac{TP}{TP+FN}$
- As a rough rule of thumb, a recall greater than 0.5 is considered good.
- In our case, a low recall would mean the model incorrectly classified a lot of individuals with diabetes as healthy ones.
F1 Score:¶
- F1 Score (aka F-Score or F-Measure) - a helpful metric for comparing two classifiers.
- The F1 Score takes into account both precision and recall.
- It is the harmonic mean of precision and recall.
- $F1 = \frac{2 \cdot precision \cdot recall}{precision + recall}$

One might ask: which of the two metrics is superior? In fact, it really depends!
For example, imagine cancer diagnostics. Would you rather classify a few more patients as false positives and, after more precise examination, conclude they had no cancer, or would you rather let the ones with cancer slip through as healthy individuals? In this particular case, the model should minimize $FN$ in the confusion matrix; consequently, recall, i.e. TPR, should be close to 1. Lastly, there is always a trade-off between the two negatively correlated metrics.
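One way to act on this trade-off is to move the decision threshold on the predicted probabilities instead of using the implicit default of 0.5; a sketch, where the threshold value 0.3 is an arbitrary illustration rather than a tuned choice:
# Lower the decision threshold to favor recall over precision
threshold = 0.3  # illustrative value; the default implicit threshold is 0.5
proba_pos = model_knn.predict_proba(X_test)[:, 1]
pred_low_thresh = (proba_pos >= threshold).astype(int)
print(classification_report(y_test, pred_low_thresh))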
1. Confusion Matrix¶
pd.crosstab(y_test, pred_knn, rownames=['True'], colnames=['Predicted'], margins = True)
| Predicted | 0.0 | 1.0 | All |
|---|---|---|---|
| True | |||
| 0.0 | 18780 | 64 | 18844 |
| 1.0 | 934 | 1305 | 2239 |
| All | 19714 | 1369 | 21083 |
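Reading the positive-class counts off the crosstab above (TP = 1305, FP = 64, FN = 934), we can verify by hand the precision and recall that the classification report below will show:
# verify precision/recall/F1 from the confusion-matrix counts
TP, FP, FN = 1305, 64, 934
precision_check = TP / (TP + FP)  # ~0.95
recall_check = TP / (TP + FN)  # ~0.58
f1_check = 2 * precision_check * recall_check / (precision_check + recall_check)  # ~0.72
print(f"precision={precision_check:.2f}, recall={recall_check:.2f}, f1={f1_check:.2f}")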
from sklearn.metrics import confusion_matrix
cm_knn = confusion_matrix(y_test, pred_knn)  # avoid shadowing the imported function name
p = sns.heatmap(pd.DataFrame(cm_knn), annot=True, cmap="YlGnBu", fmt='g')
plt.title('Confusion matrix', y=1.1)
plt.ylabel('Actual label')
plt.xlabel('Predicted label')
Text(0.5, 20.049999999999997, 'Predicted label')
2. Classification Report¶
from sklearn.metrics import classification_report
print(classification_report(y_test,pred_knn))
precision recall f1-score support
0.0 0.95 1.00 0.97 18844
1.0 0.95 0.58 0.72 2239
accuracy 0.95 21083
macro avg 0.95 0.79 0.85 21083
weighted avg 0.95 0.95 0.95 21083
3. Accuracy and Error¶
print("KNN Accuracy: {}".format(np.round(accuracy_score(y_test, pred_knn), 2)))
print("Root Mean Squared Error: {}".format(np.round(np.sqrt(mean_squared_error(y_test, pred_knn)), 2)))
KNN Accuracy: 0.95
Root Mean Squared Error: 0.22
4. ROC - AUC:¶
The ROC (Receiver Operating Characteristic) curve tells us how well the model can distinguish between two classes (e.g., whether a patient has a disease or not). Better models can accurately distinguish between the two, whereas a poor model will have difficulty doing so.
from sklearn.metrics import roc_curve, auc
y_pred_proba = model_knn.predict_proba(X_test)[:,1]
fpr, tpr, thresholds = roc_curve(y_test, y_pred_proba)
roc_auc = np.round(auc(fpr, tpr), 2)
roc_auc
0.94
[y_test.shape, y_pred_proba.shape]
[(21083,), (21083,)]
plt.plot([0,1],[0,1],'k--')
plt.plot(fpr,tpr, label='Knn')
plt.xlabel('FPR')
plt.ylabel('TPR')
plt.title('ROC curve')
plt.show()
plt.figure()
lw = 2
plt.plot(
fpr,
tpr,
color="darkorange",
lw=lw,
label="ROC curve (area = %0.2f)" % roc_auc,
)
plt.plot([0, 1], [0, 1], color="navy", lw=lw, linestyle="--")
plt.xlabel("False Positive Rate")
plt.ylabel("True Positive Rate")
plt.title("ROC curve")
plt.legend(loc="lower right")
plt.show()
# import scikitplot as skplt
# skplt.metrics.plot_roc_curve(y_test, y_pred_proba)
# plt.show()
#Area under ROC curve
from sklearn.metrics import roc_auc_score
roc_auc_score(y_test,y_pred_proba)
0.9430934380578406
# Evaluate the model using scikit-learn metrices
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
from sklearn.metrics import mean_squared_error, r2_score, roc_auc_score
from sklearn.metrics import confusion_matrix, classification_report
pred_knn = model_knn.predict(X_test)
accuracy = accuracy_score(y_test, pred_knn)
precision = precision_score(y_test, pred_knn)
recall = recall_score(y_test, pred_knn)
auc_knn = roc_auc_score(y_test, pred_knn)  # renamed to avoid shadowing the auc function; hard labels give a coarser AUC than probabilities
cr = classification_report(y_test, pred_knn)
cm = confusion_matrix(y_test, pred_knn)
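For completeness, the computed metrics can be printed; the accuracy, precision, and recall values match the classification report above:
print(f"Accuracy : {accuracy:.2f}")  # 0.95
print(f"Precision: {precision:.2f}")  # 0.95
print(f"Recall   : {recall:.2f}")  # 0.58
print(f"ROC AUC  : {auc_knn:.2f}")
print(cr)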
Conclusion¶
As you can see, the KNN model reaches an accuracy of 0.95, meaning it predicted 95% of the cases correctly. Given the class imbalance, however, accuracy alone paints an optimistic picture: the recall for the diabetic class is only 0.58, so a substantial share of diabetic patients is still misclassified as healthy.