Skip to content

Lesson 14: Differential Expression Analysis for Bulk RNA Sequencing: The Actual Analysis

Lesson 13 Review

In the previous class, participants filtered gene expression results for the hcc1395 data to remove those genes whose expression is 0 across all samples and also those do not have enough replicates in a biological condition with greater than 0 expression. Subsequent QC of the filtered gene expression showed that the normal samples clustered together while tumor samples clustered with each other (PCA and distance plot), which indicates that biology is driving the underlying differences between the samples in this study.

Learning Objectives

After this lesson, participants will be able to:

  • Describe on a high level how RNA sequencing data is modeled.
  • Understand the importance of normalization in differential expression analysis.
  • Interpret differential expression analysis results.

Sign onto Biowulf

Before getting started using the participant's assigned Biowulf student account ID.

ssh user@biowulf.nih.gov

Next, change into the /data/user/hcc1395_b4b folder.

cd /data/user/hcc1395_b4b

Request an interactive session with 12 gb of RAM, and 10 gb of local temporary storage.

sinteractive --mem=12gb --gres=lscratch:10 

Load R

module load R

Run deg.R

The R script for differential expression analysis, deg.R is stored in the b4b_scripts folder. It will generate differential expression results (using DESeq2), normalized expression, as well as visualizations.

A Bit of Background on the Negative Binomial Distribution

DESeq2 and another popular differential expression analysis package edgeR model expression data generated from RNA sequencing using the negative binomial distribtution.

"We assume that the number of reads in sample j that are assigned to gene i can be modeled by a negative binomial (NB) distribution" -- DESeq2 paper "We assume the data can be summarized into a table of counts, with rows corresponding to genes (or tags or exons or transcripts) and columns to samples. For RNA-seq experiments, these may be counts at the exon, transcript or gene-level. We model the data as negative binomial (NB) distributed" -- edgeR paper

The rationale for modeling RNA sequencing expression data using negative binomial is that the assumption that mean equals variance under the Poisson distribution is not valid as variance is greater than the mean (see mean expression versus variance plot for the hcc1395 data below).

Note

The negative binomial distribution accommodates the unequal mean-variance relationship by modeling the expression data such that a both a mean and a variance parameter has to be estimated. It can be thought of that differential expression packages such as DESeq2 creates a model of and performs statistical testing for expected true value of gene expression. Statistical testing is not done directly on the input data.

Normalization

Normalization in RNA sequencing is important as this should place expression data for all samples under the same distribution. This process also removes technical variations due to sequencing depth (ie. more sequences generated from one sample as compared to another), RNA composition (ie. normal and tumor sample may differ in their composition of RNA), gene length, etc. For comparison between conditions, sequencing depth and RNA composition should be accounted for. See here for a summary of normalization methods and when it is appropriate to use them. DESeq2 uses median ratio normalization (see DESeq2 paper) which normalizes for sequencing depth and RNA composition.

To learn about the math behind median ratio normalization view the Stat Quest on DESeq2 normalization.

Note

Underneath the hood, given the raw expression table, DESeq2 will model the mean and variance of expression for each gene as well as perform normalization and calculate differential expression.

Perform Differential Expression Analysis

The script deg.R in b4b_script will be used perform differential expression analysis on the hcc1395 data. This script will use DESeq2 and takes the following as input.

  • Filtered gene expression table, which is located in the folder hcc1395_deg and stored in the file hcc1395_gene_expression_filtered.csv so the path constructed from the current working directory of hcc1395_b4b is hcc1395_deg/hcc1395_gene_expression_filtered.csv.
  • The phenotype information, which is located in the current working directory of hcc1395_b4b and stored in file hcc1395_phenotypes.csv.
  • Experimental factor of interest for the differential expression analysis. Here, it is Treatment and the goal is to find whether there are genes whose expressions differ between normal and tumor samples.
  • Study name: all output will have this prepended (in this case, the study name is hcc1395).
  • Output directory for storing the results (ie. hcc1395_deg).
Rscript b4b_scripts/deg.R hcc1395_deg/hcc1395_gene_expression_filtered.csv hcc1395_phenotypes.csv Treatment hcc1395 hcc1395_deg

As DESeq2 is running, the following messages will appear and informs the user that it is performing the normalization, estimating gene expression variance, fitting the expression data to a generalized linear model and performing statistical testing.

estimating size factors
estimating dispersions
gene-wise dispersion estimates
mean-dispersion relationship
final dispersion estimates
fitting model and testing

Interpreting Differential Expression Analysis Results

Change into hcc1395_deg from hcc1395_b4b.

cd hcc1395_deg

List the contents. Use the -1 option to view directory contents 1 item per line.

ls -1

The files generated from deg.R are:

  • hcc1395_deg.csv: contains the differential expression analysis results for each gene. This table is used to generate the following plots:
    • hcc1395_volcano.png
  • hcc1395_filter_deg.csv: contains the results for genes whose
    • log2 Fold Change is either greater than or equal to 4.5 with adjusted p-value less than 0.01.
    • log2 Fold Change is either less than or equal to -4.5 with adjusted p-value less than 0.01.
  • Normalized gene expression are stored in hcc1395_normalized_counts.csv and is used to generate quality control plots post-normalization. The plots include:
    • hcc1395_normalized_pca.png
    • hcc1395_normalized_distance.png
    • hcc1395_normalized_box.png
    • hcc1395_normalized_density.png
  • Normalized expression for genes appearing in the filtered differential analysis results (ie. hcc1395_filtered_normalized_expression.csv), which is used to generate the expression heatmap (ie. hcc1395_expression_heatmap.png).
hcc1395_box.png
hcc1395_deg.csv
hcc1395_density.png
hcc1395_distance.png
hcc1395_expression_heatmap.png
hcc1395_filter_deg.csv
hcc1395_filtered_normalized_expression.csv
hcc1395_gene_expression_filtered.csv
hcc1395_mean_variance.png
hcc1395_normalized_box.png
hcc1395_normalized_counts.csv
hcc1395_normalized_density.png
hcc1395_normalized_distance.png
hcc1395_normalized_pca.png
hcc1395_pca.png
hcc1395_volcano.png

Take a look at differential expression analysis results stored in hcc1395_deg.csv using column, where:

  • -t tells the column command create a table.
  • -s prompts for the field separator (comma in this case).
  • | is used to send the output of column to head and to view the first 20 lines include the -20 option in head.
column -t -s ',' hcc1395_deg.csv | head -20

Resize Terminal or Command Prompt window if content from the differential expression analysis table does not fit the entire screen.s

The columns in hcc1395_deg.csv can be explained as follows.

  • Geneid: gene identifier.
  • baseMean: mean expression across all samples for the corresponding gene.
  • log2FoldChange: this is a metric for gene expression change between conditions. In the results below, this is calculated by taking the ratio between mean expression of the gene from the tumor samples and that for the normal samples and then taking the log2 of this ratio. A log2 fold change of 1 indicates an expression ratio between tumor and normal of 2, while a log2 fold change of -1 indicates an expression ratio between tumor and normal of 1/2.
  • lfcSE: the standard error of the log2 fold change.
  • pvalue: the uncorrected p-value of the likelihood of observing the effect of the size expression change (or larger) by chance alone. This p-value is not corrected for multiple comparisons.
  • padj: Multiple comparison adjusted p-value.
  • normalMean: average expression for a gene in the normal samples.
  • tumorMean: average expression for a gene in the tumor samples.
Differential Expression Results
Geneid            baseMean   log2FoldChange  lfcSE  stat     pvalue     padj       normalMean  tumorMean
LA16c-3G11.7      1.405      -1.878          1.6    -1.174   0.241      NA         2.21        0.6
DUXAP8            556.167    8.763           0.548  15.985   1.63e-57   1.1e-56    2.55        1109.783
LL22NC03-N64E9.1  65.516     6.891           0.849  8.118    4.74e-16   1.5e-15    1.097       129.94
KCNMB3P1          6.339      -1.173          0.723  -1.623   0.105      0.15       8.783       3.89
XKR3              2.034      -0.778          1.26   -0.618   0.537      0.636      2.573       1.493
AC006548.28       2.586      0.453           1.047  0.432    0.666      0.746      2.183       2.99
CECR7             85.621     0.976           0.193  5.046    4.5e-07    1.05e-06   57.773      113.467
AC006946.16       28.073     1.322           0.335  3.945    7.99e-05   0.00016    16.03       40.113
AC006946.17       2.14       -0.062          1.128  -0.055   0.956      0.967      2.183       2.093
IL17RA            281.737    -0.694          0.103  -6.728   1.72e-11   4.79e-11   348.23      215.243
CECR6             2.253      0.569           1.155  0.493    0.622      0.711      1.81        2.7
AC006946.15       1.415      1.518           1.534  0.989    0.322      NA         0.737       2.093
CECR5             401.063    -1.79           0.09   -19.786  3.9e-87    4.09e-86   622.193     179.933
CECR2             3.109      0.173           0.935  0.185    0.854      0.89       2.923       3.293
SLC25A18          4.281      -0.062          0.806  -0.077   0.939      0.958      4.37        4.19
ATP6V1E1          1012.665   -0.723          0.054  -13.426  4.26e-41   2.3e-40    1261.083    764.247
BCL2L13           1133.492   -0.493          0.05   -9.784   1.32e-22   5.16e-22   1325.453    941.527
BID               713.738    -2.16           0.072  -29.844  1.04e-195  2.84e-194  1166.463    261.013
LINC00528         38.191     -5.972          0.757  -7.887   3.09e-15   9.58e-15   75.183      1.197

QC on Normalized Expression

It is a good idea to run QC on the normalized expression results to ensure that this step did not negatively alter the clustering of the samples and distribution of the gene expression.

Normalization of the expression data brought the tumor samples closer together on the PCA plot.

Normal samples were still closer together distance-wise as compared to the tumor samples according to the distance matrix.

Normalizing the filtered expression made the extremes (low and high expression) of the density plot overlap.

The 25th to 75th percentiles as well as median of gene expression of all samples are now more similar to each other after normalization.

Visualizations

Expression Heatmap

A heatmap depicts numerical data on a color scale and is often used to visualize gene expression. The plot below is constructed using the normalized expression table that contains a subset of genes that were determined to be significantly changed between the normal and tumor samples according to the criteria below:

  • log2 Fold Change is either greater than or equal to 2 with adjusted p-value less than 0.01
  • log2 Fold Change is either less than or equal to 2 with adjusted p-value less than 0.01

Note

The log2 fold change and p-value cutoff for statistically significant gene expression change is determined by the researcher.

Volcano Plot

A volcano plot helps scientists visualize change and statistical confidence of change. In RNA sequencing, this is log2 Fold Change plotted on the horizontal axis and -1og10 of the p-value on the vertical axis (adjusted p-value was used here).