Lesson 14: Differential Expression Analysis for Bulk RNA Sequencing: The Actual Analysis
Lesson 13 Review
In the previous class, participants filtered gene expression results for the hcc1395 data to remove those genes whose expression is 0 across all samples and also those do not have enough replicates in a biological condition with greater than 0 expression. Subsequent QC of the filtered gene expression showed that the normal samples clustered together while tumor samples clustered with each other (PCA and distance plot), which indicates that biology is driving the underlying differences between the samples in this study.
Learning Objectives
After this lesson, participants will be able to:
- Describe on a high level how RNA sequencing data is modeled.
- Understand the importance of normalization in differential expression analysis.
- Interpret differential expression analysis results.
Sign onto Biowulf
Before getting started using the participant's assigned Biowulf student account ID.
Next, change into the /data/user/hcc1395_b4b
folder.
Request an interactive session with 12 gb of RAM, and 10 gb of local temporary storage.
Load R
Run deg.R
The R script for differential expression analysis, deg.R
is stored in the b4b_scripts
folder. It will generate differential expression results (using DESeq2), normalized expression, as well as visualizations.
A Bit of Background on the Negative Binomial Distribution
DESeq2 and another popular differential expression analysis package edgeR model expression data generated from RNA sequencing using the negative binomial distribtution.
"We assume that the number of reads in sample j that are assigned to gene i can be modeled by a negative binomial (NB) distribution" -- DESeq2 paper "We assume the data can be summarized into a table of counts, with rows corresponding to genes (or tags or exons or transcripts) and columns to samples. For RNA-seq experiments, these may be counts at the exon, transcript or gene-level. We model the data as negative binomial (NB) distributed" -- edgeR paper
The rationale for modeling RNA sequencing expression data using negative binomial is that the assumption that mean equals variance under the Poisson distribution is not valid as variance is greater than the mean (see mean expression versus variance plot for the hcc1395 data below).
Note
The negative binomial distribution accommodates the unequal mean-variance relationship by modeling the expression data such that a both a mean and a variance parameter has to be estimated. It can be thought of that differential expression packages such as DESeq2 creates a model of and performs statistical testing for expected true value of gene expression. Statistical testing is not done directly on the input data.
Normalization
Normalization in RNA sequencing is important as this should place expression data for all samples under the same distribution. This process also removes technical variations due to sequencing depth (ie. more sequences generated from one sample as compared to another), RNA composition (ie. normal and tumor sample may differ in their composition of RNA), gene length, etc. For comparison between conditions, sequencing depth and RNA composition should be accounted for. See here for a summary of normalization methods and when it is appropriate to use them. DESeq2 uses median ratio normalization (see DESeq2 paper) which normalizes for sequencing depth and RNA composition.
To learn about the math behind median ratio normalization view the Stat Quest on DESeq2 normalization.
Note
Underneath the hood, given the raw expression table, DESeq2 will model the mean and variance of expression for each gene as well as perform normalization and calculate differential expression.
Perform Differential Expression Analysis
The script deg.R
in b4b_script
will be used perform differential expression analysis on the hcc1395 data. This script will use DESeq2 and takes the following as input.
- Filtered gene expression table, which is located in the folder
hcc1395_deg
and stored in the filehcc1395_gene_expression_filtered.csv
so the path constructed from the current working directory ofhcc1395_b4b
ishcc1395_deg/hcc1395_gene_expression_filtered.csv
. - The phenotype information, which is located in the current working directory of
hcc1395_b4b
and stored in filehcc1395_phenotypes.csv
. - Experimental factor of interest for the differential expression analysis. Here, it is Treatment and the goal is to find whether there are genes whose expressions differ between normal and tumor samples.
- Study name: all output will have this prepended (in this case, the study name is
hcc1395
). - Output directory for storing the results (ie.
hcc1395_deg
).
Rscript b4b_scripts/deg.R hcc1395_deg/hcc1395_gene_expression_filtered.csv hcc1395_phenotypes.csv Treatment hcc1395 hcc1395_deg
As DESeq2 is running, the following messages will appear and informs the user that it is performing the normalization, estimating gene expression variance, fitting the expression data to a generalized linear model and performing statistical testing.
estimating size factors
estimating dispersions
gene-wise dispersion estimates
mean-dispersion relationship
final dispersion estimates
fitting model and testing
Interpreting Differential Expression Analysis Results
Change into hcc1395_deg
from hcc1395_b4b
.
List the contents. Use the -1
option to view directory contents 1 item per line.
The files generated from deg.R
are:
hcc1395_deg.csv
: contains the differential expression analysis results for each gene. This table is used to generate the following plots:hcc1395_volcano.png
hcc1395_filter_deg.csv
: contains the results for genes whose- log2 Fold Change is either greater than or equal to 4.5 with adjusted p-value less than 0.01.
- log2 Fold Change is either less than or equal to -4.5 with adjusted p-value less than 0.01.
- Normalized gene expression are stored in
hcc1395_normalized_counts.csv
and is used to generate quality control plots post-normalization. The plots include:hcc1395_normalized_pca.png
hcc1395_normalized_distance.png
hcc1395_normalized_box.png
hcc1395_normalized_density.png
- Normalized expression for genes appearing in the filtered differential analysis results (ie.
hcc1395_filtered_normalized_expression.csv
), which is used to generate the expression heatmap (ie.hcc1395_expression_heatmap.png
).
hcc1395_box.png
hcc1395_deg.csv
hcc1395_density.png
hcc1395_distance.png
hcc1395_expression_heatmap.png
hcc1395_filter_deg.csv
hcc1395_filtered_normalized_expression.csv
hcc1395_gene_expression_filtered.csv
hcc1395_mean_variance.png
hcc1395_normalized_box.png
hcc1395_normalized_counts.csv
hcc1395_normalized_density.png
hcc1395_normalized_distance.png
hcc1395_normalized_pca.png
hcc1395_pca.png
hcc1395_volcano.png
Take a look at differential expression analysis results stored in hcc1395_deg.csv
using column
, where:
-t
tells thecolumn
command create a table.-s
prompts for the field separator (comma in this case).|
is used to send the output ofcolumn
tohead
and to view the first 20 lines include the-20
option inhead
.
Resize Terminal or Command Prompt window if content from the differential expression analysis table does not fit the entire screen.s
The columns in hcc1395_deg.csv
can be explained as follows.
- Geneid: gene identifier.
- baseMean: mean expression across all samples for the corresponding gene.
- log2FoldChange: this is a metric for gene expression change between conditions. In the results below, this is calculated by taking the ratio between mean expression of the gene from the tumor samples and that for the normal samples and then taking the log2 of this ratio. A log2 fold change of 1 indicates an expression ratio between tumor and normal of 2, while a log2 fold change of -1 indicates an expression ratio between tumor and normal of 1/2.
- lfcSE: the standard error of the log2 fold change.
- pvalue: the uncorrected p-value of the likelihood of observing the effect of the size expression change (or larger) by chance alone. This p-value is not corrected for multiple comparisons.
- padj: Multiple comparison adjusted p-value.
- normalMean: average expression for a gene in the normal samples.
- tumorMean: average expression for a gene in the tumor samples.
Differential Expression Results
Geneid baseMean log2FoldChange lfcSE stat pvalue padj normalMean tumorMean
LA16c-3G11.7 1.405 -1.878 1.6 -1.174 0.241 NA 2.21 0.6
DUXAP8 556.167 8.763 0.548 15.985 1.63e-57 1.1e-56 2.55 1109.783
LL22NC03-N64E9.1 65.516 6.891 0.849 8.118 4.74e-16 1.5e-15 1.097 129.94
KCNMB3P1 6.339 -1.173 0.723 -1.623 0.105 0.15 8.783 3.89
XKR3 2.034 -0.778 1.26 -0.618 0.537 0.636 2.573 1.493
AC006548.28 2.586 0.453 1.047 0.432 0.666 0.746 2.183 2.99
CECR7 85.621 0.976 0.193 5.046 4.5e-07 1.05e-06 57.773 113.467
AC006946.16 28.073 1.322 0.335 3.945 7.99e-05 0.00016 16.03 40.113
AC006946.17 2.14 -0.062 1.128 -0.055 0.956 0.967 2.183 2.093
IL17RA 281.737 -0.694 0.103 -6.728 1.72e-11 4.79e-11 348.23 215.243
CECR6 2.253 0.569 1.155 0.493 0.622 0.711 1.81 2.7
AC006946.15 1.415 1.518 1.534 0.989 0.322 NA 0.737 2.093
CECR5 401.063 -1.79 0.09 -19.786 3.9e-87 4.09e-86 622.193 179.933
CECR2 3.109 0.173 0.935 0.185 0.854 0.89 2.923 3.293
SLC25A18 4.281 -0.062 0.806 -0.077 0.939 0.958 4.37 4.19
ATP6V1E1 1012.665 -0.723 0.054 -13.426 4.26e-41 2.3e-40 1261.083 764.247
BCL2L13 1133.492 -0.493 0.05 -9.784 1.32e-22 5.16e-22 1325.453 941.527
BID 713.738 -2.16 0.072 -29.844 1.04e-195 2.84e-194 1166.463 261.013
LINC00528 38.191 -5.972 0.757 -7.887 3.09e-15 9.58e-15 75.183 1.197
QC on Normalized Expression
It is a good idea to run QC on the normalized expression results to ensure that this step did not negatively alter the clustering of the samples and distribution of the gene expression.
Normalization of the expression data brought the tumor samples closer together on the PCA plot.
Normal samples were still closer together distance-wise as compared to the tumor samples according to the distance matrix.
Normalizing the filtered expression made the extremes (low and high expression) of the density plot overlap.
The 25th to 75th percentiles as well as median of gene expression of all samples are now more similar to each other after normalization.
Visualizations
Expression Heatmap
A heatmap depicts numerical data on a color scale and is often used to visualize gene expression. The plot below is constructed using the normalized expression table that contains a subset of genes that were determined to be significantly changed between the normal and tumor samples according to the criteria below:
- log2 Fold Change is either greater than or equal to 2 with adjusted p-value less than 0.01
- log2 Fold Change is either less than or equal to 2 with adjusted p-value less than 0.01
Note
The log2 fold change and p-value cutoff for statistically significant gene expression change is determined by the researcher.
Volcano Plot
A volcano plot helps scientists visualize change and statistical confidence of change. In RNA sequencing, this is log2 Fold Change plotted on the horizontal axis and -1og10 of the p-value on the vertical axis (adjusted p-value was used here).