Clustering with R and RStudio

In this session of the BTEP Coding Club, Brian Luke, PhD, Senior Principal Computational Scientist with the Advanced Biomedical Computational Science (ABCS) group, showed us the basic techniques involved in clustering, using R and RStudio with the swiss data set. These techniques included building a distance/dissimilarity matrix, agglomerative and divisive hierarchical clustering with their associated dendrograms, and K-means clustering, with the resulting clusters displayed in both a principal component projection and a nonlinear projection. He also demonstrated how to compare different clusterings using silhouette width.
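The steps listed above can be sketched in a few lines of R. This is a minimal illustration and not the script used in the session: the choice to standardize the variables, the Euclidean distance, average linkage, `k = 3`, and Sammon mapping as the nonlinear projection are all assumptions made for the example. It relies on the `cluster` and `MASS` packages, which ship with standard R installations.

```r
library(cluster)  # agnes(), diana(), silhouette()
library(MASS)     # sammon(), used here as one possible nonlinear projection

data(swiss)            # built-in data set: 47 Swiss provinces, 6 variables
x <- scale(swiss)      # assumption: standardize so no variable dominates

# 1. Distance/dissimilarity matrix (assumption: Euclidean)
d <- dist(x, method = "euclidean")

# 2. Hierarchical clustering: agglomerative (bottom-up) and divisive (top-down)
hc_agg <- agnes(d, method = "average")  # assumption: average linkage
hc_div <- diana(d)

# Dendrograms for both hierarchies
plot(as.hclust(hc_agg), main = "Agglomerative (agnes)", cex = 0.6)
plot(as.hclust(hc_div), main = "Divisive (diana)", cex = 0.6)

# Cut each tree into k groups (assumption: k = 3, chosen only for illustration)
k <- 3
cl_agg <- cutree(as.hclust(hc_agg), k = k)
cl_div <- cutree(as.hclust(hc_div), k = k)

# 3. K-means with several random starts; the seed makes the run repeatable
set.seed(1)
km <- kmeans(x, centers = k, nstart = 25)

# 4. Project the K-means clusters onto the first two principal components
pca <- prcomp(x)
plot(pca$x[, 1:2], col = km$cluster, pch = 19, main = "K-means on PC1/PC2")

# 5. Nonlinear 2-D projection of the same clusters (assumption: Sammon mapping)
sam <- sammon(d, k = 2, trace = FALSE)
plot(sam$points, col = km$cluster, pch = 19,
     xlab = "Dim 1", ylab = "Dim 2", main = "K-means on Sammon map")

# 6. Compare the clusterings by average silhouette width (higher is better)
avg_sil <- c(
  agnes  = mean(silhouette(cl_agg, d)[, "sil_width"]),
  diana  = mean(silhouette(cl_div, d)[, "sil_width"]),
  kmeans = mean(silhouette(km$cluster, d)[, "sil_width"])
)
print(round(avg_sil, 3))
```

Repeating the last step for several values of `k` is one common way to pick the number of clusters, since the average silhouette width tends to peak when clusters are compact and well separated.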

Why should you care about clustering? Clustering is one of the fundamental families of unsupervised machine learning techniques. It is often used to group quantitative proteomic or RNA-seq expression data to suggest subtypes of a particular cancer.

R Script and Data files

R script

Access the R script used in this tutorial here.

For a more detailed theoretical background on clustering, see the related presentation "Introduction to Clustering", also by Brian Luke, PhD.