Lesson5: Visualizing clusters with heatmap and dendrogram
The following questions will help you gain more confidence in exploring data through heatmap. We will work with a subset of the Human Brain Reference (HBR) and Universal Human Reference (UHR) RNA sequencing dataset and use the heatmap to
- Visualize gene expression
- Determine whether subsets of genes can help us differentiate between the HBR and UHR samples
Load necessary packages
library(pheatmap)
library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
## ✔ ggplot2 3.3.6 ✔ purrr 0.3.4
## ✔ tibble 3.1.8 ✔ dplyr 1.0.9
## ✔ tidyr 1.2.0 ✔ stringr 1.4.0
## ✔ readr 2.1.2 ✔ forcats 0.5.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
Question 1
Could you import the hbr_uhr_normalized_counts.csv file into your workspace?
Solution
hbr_uhr_normalized_counts <- read.csv("./data/hbr_uhr_normalized_counts.csv",row.names=1)
Question 2
Explore this gene expression dataset a bit. How many samples (columns) and genes (row names) does this dataset have?
Solution
This dataset contains 6 samples (
- HBR_1.bam
- HBR_2.bam
- HBR_3.bam
- UHR_1.bam
- UHR_2.bam
- UHR_3.bam
The samples with names starting with HBR are from the Human Brain Reference (HBR) and those with names starting with UHR are from the Universal Human Reference (UHR). Remember this for a later questions.
hbr_uhr_normalized_counts
## HBR_1.bam HBR_2.bam HBR_3.bam UHR_1.bam UHR_2.bam UHR_3.bam
## SULT4A1 375.0 343.6 339.4 3.5 6.9 2.6
## MPPED1 157.8 158.4 162.6 0.7 3.0 2.6
## PRAME 0.0 0.0 0.0 568.9 467.3 519.2
## IGLC2 0.0 0.0 0.0 488.6 498.0 457.5
## IGLC3 0.0 0.0 0.0 809.7 313.8 688.0
## CDC45 2.6 1.0 0.0 155.0 152.5 149.9
## CLDN5 77.6 88.5 67.2 1.4 2.0 0.0
## PCAT14 0.0 0.0 1.2 139.8 154.4 155.1
## RP5-1119A7.17 53.0 57.6 51.9 0.0 0.0 0.0
## MYO18B 0.0 0.0 0.0 59.5 84.2 56.5
## RP3-323A16.1 0.0 0.0 1.2 51.9 76.2 53.1
## CACNG2 42.7 35.0 56.6 0.0 1.0 0.0
Question 3
Create a heatmap for to visualize gene expression for this dataset.
Solution
pheatmap(hbr_uhr_normalized_counts,scale="row")
Question 4
Create a data frame called annotation_df
that contains the sample and treatment group information that we will add to the legend for this heatmap.
Solution
annotation_df <- data.frame(sample=c("HBR_1.bam","HBR_2.bam","HBR_3.bam","UHR_1.bam","UHR_2.bam","UHR_3.bam"),treatment=c(rep("HBR",3),rep("UHR",3))) %>% column_to_rownames("sample")
annotation_df
## treatment
## HBR_1.bam HBR
## HBR_2.bam HBR
## HBR_3.bam HBR
## UHR_1.bam UHR
## UHR_2.bam UHR
## UHR_3.bam UHR
Question 5
Add the annotations for the legend and color the HBR samples orangered
and the UHR samples blue
. Also, add a title to the heatmap.
Solution
pheatmap(hbr_uhr_normalized_counts,scale="row", annotation_col =annotation_df,
annotation_colors =list(treatment=c(HBR="orangered",UHR="blue")),
main="Expression and clustering of top 12 DE genes")