Lesson 13: Differential Expression Analysis for Bulk RNA Sequencing: QC

Quick Review

Previously, in this Introduction to Bulk RNA Sequencing Analysis course series, participants learned to align data to reference to genome and estimate gene expression. The next 2 classes will address differential expression (DE) analysis where scientists will determine genes whose expression are statistically significantly changed between biological conditions.

Learning Objectives

After this class, participants will be able to:

Describe rationale for filtering low expression genes and filtering these out from the gene expression estimates table.
Perform quality control measures such as Principal Components Analysis on expression estimates and provide rationale for each QC measure.

This class will not introduce the R programming language but participants will run R scripts from the command line.

Sign onto Biowulf

Before getting started using the participant's assigned Biowulf student account ID.

ssh user@biowulf.nih.gov

Next, change into the /data/user/hcc1395_b4b folder.

cd /data/user/hcc1395_b4b

Request an interactive session with 12 gb of RAM, and 10 gb of local temporary storage.

sinteractive --mem=12gb --gres=lscratch:10

Load R

module load R

Background on R Scripts

The R scripts used for this class are in the folder b4b_scripts. Stay in /data/user/hcc1395_b4b and list the contents of b4b_scripts to learn what is available. These scripts were generated by BTEP and inspired by the Biostars Handbook's.

ls b4b_scripts

There are 3 R scripts.

deg.R will calculate differential gene expression using DESeq2.
filter_expression.R will filter low expressing genes from the expression estimates.
quality_check.R enables QC of raw expression estimates.

deg.R  filter_expression.R  quality_check.R

Warning

These R scripts are for class and demonstration purposes. Do not take these and blindly use to analyze data as modification will be needed to meet the challenges of each study.

This session will use filter_expression.R and quality_check.R. To run R scripts the terminal, users will start with the RScript command follow by the script name and then arguments.

Stay in /data/user/hcc1395_b4b and create a folder called hcc1395_deg to store the differential expression analysis results.

mkdir hcc1395_deg

Note

This course will not teach participants how to script in R.

Filter Low Expressing Genes

It's common practice to filter low expression genes as these may represent noise. In these exercises, the criteria below will be used to determine which genes will be removed from the expression estimates stored in the folder hcc1395_featurecounts as file hcc1395_gene_expression.csv.

If a gene has 0 expression across all samples, then it is removed.
If a 2 out of 3 samples in a treatment group has a gene expression of 0, then it is removed (ie. if a gene for a particular condition has 2 sample with 0 expression and 1 sample with expression of 10, then it is removed).

Filtering criteria is up to the researcher. Other methods include filtering based on the variance of expression of a gene across samples.

Issuing one of the following will pull the arguments needed for filter_expression.R.

Rscript filter_expression.R

or

Rscript filter_expression.R -h

or

Rscript filter_expression.R --help

Regardless of method, users will see the following tips on how to run the R scripts, in this case filter_expression.R.

The first line is a description of the script followed by usage. The required arguments are listed as well.

This R script filters low expressing genes from RNA sequencing data.
Usage: Rscript filter_expression.R arguments

Use Rscript filter_expression.R -h or Rscript filter_expression.R --help to get help.

The following arguments are required.

Argument 1: File name of expression results in CSV format.
    *The first column should contain gene identifiers followed by columns containing expression estimates for each sample.
Argument 2: The integer number of samples in each group that needs to have expression greater than 0.
Argument 3: File name of a phenotypes table in CSV format. This file must contain exactly these column names:
    *Column - this contains the sample column headings from the expression table provided as Argument 1.
    *Sample - these are the shortened sample names and will used to make results and visualizations look better.
    *Treatment - these are the biological conditions to which the samples belong.
Argument 4: Study name! This will be prepended to the outputs.
Argument 5: Output directory! Outputs will be written here.

Stay in the /data/user/hcc1395_b4b folder for the exercises below.

To run filter_expression.R from the /data/user/hcc1395_b4b folder, use the following command construct where:

b4b_scripts/filter_expression.R is the path for the filtered_expression.R script since it is being run from /data/user/hcc1395_b4b.
hcc1395_featurecounts/hcc1395_gene_expression.csv is the path for the gene expression table as referenced from /data/user/hcc1395_b4b. The expression table is stored in the folder hcc1395_featurecounts and file hcc1395_gene_expression.csv
hcc1395_phenotypes.csv is the path to the phenotypes table as referenced from /data/user/hcc1395_b4b.
The study name will be hcc1395 and this will be prepended to all output files.
The output folder is hcc1395_deg as referenced from /data/user/hcc1395_b4b.

Rscript b4b_scripts/filter_expression.R hcc1395_featurecounts/hcc1395_gene_expression.csv 2 hcc1395_phenotypes.csv hcc1395 hcc1395_deg

List the contents of hcc1395_deg from /data/user/hcc1395_b4b to check that a filterd expression CSV file is present.

ls hcc1395_deg

hcc1395_gene_expression_filtered.csv

Use wc -l to compare number of lines in the unfiltered expression table versus the filtered.

wc -l hcc1395_featurecounts/hcc1395_gene_expression.csv

wc -l hcc1395_deg/hcc1395_gene_expression_filtered.csv

The unfiltered expression table has 1336 lines while the filtered table has 593 lines. Thus, 743 genes were removed as a result of filtering based on the criteria set. The command expr can be used to perform arithmetic operations in Unix command line and the construct below subtracts 593 (number of lines in the filtered expression table) from the number lines in the unfiltered expression table (1336).

expr 1336 - 593

Quality Check

Below are the quality checks that will be performed on the filtered gene expression table.

Principal Components Analysis (PCA): This transforms high dimensional data such as those derived from RNA sequencing so that researchers can see how study variables cluster together. The result of PCA is that the original data is projected onto a set of perpendicular axes where each axis accounts for a percentage of variance in the data. To learn the math behind PCA, see https://www.iro.umontreal.ca/~pift6080/H09/documents/papers/pca_tutorial.pdf. Given that this example dataset has normal and tumor samples, it is expected the samples under each condition would separate along the principal component axis that accounts for the most variance in the dataset. Thus, indicating it is the biology that differentiates normal and tumor samples.
Box and density plots enable scientists to visualize the distribution of expression data and check for outliers. Ideally, the expression distribution for all samples should be roughly the same for differential expression analysis (see h3abionet). These two visuals are also useful to determine whether a specific normalization method worked well.
Distance plot shows the distance between samples. It is expected that samples within condition will be closer to each other (distance wise) than those samples from other conditions.

To run quality_check.R, use the construct below. Remember, to run a R script from the command line, start with the command Rscript, followed by:

Script name, which is quality_check.R. However, because quality_check.R is in the folder b4b_script, it is referenced using b4b_script/quality_check.R from the current folder of /data/user/hcc1395_b4b.
The name of expression table. Here, the filtered expression table hcc1395_gene_expression_filtered.csv in hcc1395_deg will be used. As referenced from /data/user/hcc1395_b4b, the path would be hcc1395_deg/hcc1395_gene_expression_filtered.csv.
The study name (ie. hcc1395), which will be prepended to all output files.
Directory in which to write the output (ie. hcc1395_deg).

Rscript b4b_scripts/quality_check.R hcc1395_deg/hcc1395_gene_expression_filtered.csv hcc1395 hcc1395_deg

ls hcc1395_deg

hcc1395_box.png      hcc1395_distance.png                  hcc1395_mean_variance.png
hcc1395_density.png  hcc1395_gene_expression_filtered.csv  hcc1395_pca.png

Principal Components Results

The Principal Components plot below graphs the sample along the PC1 and PC2 axes, which account for the highest and second highest variances in the data, respectively. PC1 explains 67.5% of variance while PC2 explains 8.7%. It is clear that the normal and tumor samples separate along PC1, which indicates that it is the biology that is differentiating these samples between these groups. Within group samples separation along PC2 could be due to differences between samples of same group or maybe batch effect (although batch information is not available).

Distance Plot Results

The distance plot is also a good tool for determining if the samples within groups are clustered to together. The legend scale indicates the distance and in this plot (the smaller the number the smaller the distance). It is expected that samples within the same group be closer to each other as compared to samples from a different biological condition. This QC step passes for the filtered gene expression table.

Expression Distribution

Even without normalization, the filtered expression distribution are very similar between the samples with the median and 25th to 75th percentiles higher in the tumor samples.