Lesson 13: Differential Expression Analysis for Bulk RNA Sequencing: QC
Quick Review
Previously, in this Introduction to Bulk RNA Sequencing Analysis course series, participants learned to align data to reference to genome and estimate gene expression. The next 2 classes will address differential expression (DE) analysis where scientists will determine genes whose expression are statistically significantly changed between biological conditions.
Learning Objectives
After this class, participants will be able to:
- Describe rationale for filtering low expression genes and filtering these out from the gene expression estimates table.
- Perform quality control measures such as Principal Components Analysis on expression estimates and provide rationale for each QC measure.
This class will not introduce the R programming language but participants will run R scripts from the command line.
Sign onto Biowulf
Before getting started using the participant's assigned Biowulf student account ID.
Next, change into the /data/user/hcc1395_b4b
folder.
Request an interactive session with 12 gb of RAM, and 10 gb of local temporary storage.
Load R
Background on R Scripts
The R scripts used for this class are in the folder b4b_scripts
. Stay in /data/user/hcc1395_b4b
and list the contents of b4b_scripts
to learn what is available. These scripts were generated by BTEP and inspired by the Biostars Handbook's.
There are 3 R scripts.
deg.R
will calculate differential gene expression using DESeq2.filter_expression.R
will filter low expressing genes from the expression estimates.quality_check.R
enables QC of raw expression estimates.
Warning
These R scripts are for class and demonstration purposes. Do not take these and blindly use to analyze data as modification will be needed to meet the challenges of each study.
This session will use filter_expression.R
and quality_check.R
. To run R scripts the terminal, users will start with the RScript
command follow by the script name and then arguments.
Stay in /data/user/hcc1395_b4b
and create a folder called hcc1395_deg
to store the differential expression analysis results.
Note
This course will not teach participants how to script in R.
Filter Low Expressing Genes
It's common practice to filter low expression genes as these may represent noise. In these exercises, the criteria below will be used to determine which genes will be removed from the expression estimates stored in the folder hcc1395_featurecounts
as file hcc1395_gene_expression.csv
.
- If a gene has 0 expression across all samples, then it is removed.
- If a 2 out of 3 samples in a treatment group has a gene expression of 0, then it is removed (ie. if a gene for a particular condition has 2 sample with 0 expression and 1 sample with expression of 10, then it is removed).
Filtering criteria is up to the researcher. Other methods include filtering based on the variance of expression of a gene across samples.
Issuing one of the following will pull the arguments needed for filter_expression.R
.
or
or
Regardless of method, users will see the following tips on how to run the R scripts, in this case filter_expression.R
.
The first line is a description of the script followed by usage. The required arguments are listed as well.
This R script filters low expressing genes from RNA sequencing data.
Usage: Rscript filter_expression.R arguments
Use Rscript filter_expression.R -h or Rscript filter_expression.R --help to get help.
The following arguments are required.
Argument 1: File name of expression results in CSV format.
*The first column should contain gene identifiers followed by columns containing expression estimates for each sample.
Argument 2: The integer number of samples in each group that needs to have expression greater than 0.
Argument 3: File name of a phenotypes table in CSV format. This file must contain exactly these column names:
*Column - this contains the sample column headings from the expression table provided as Argument 1.
*Sample - these are the shortened sample names and will used to make results and visualizations look better.
*Treatment - these are the biological conditions to which the samples belong.
Argument 4: Study name! This will be prepended to the outputs.
Argument 5: Output directory! Outputs will be written here.
Stay in the /data/user/hcc1395_b4b
folder for the exercises below.
To run filter_expression.R
from the /data/user/hcc1395_b4b
folder, use the following command construct where:
b4b_scripts/filter_expression.R
is the path for thefiltered_expression.R
script since it is being run from/data/user/hcc1395_b4b
.hcc1395_featurecounts/hcc1395_gene_expression.csv
is the path for the gene expression table as referenced from/data/user/hcc1395_b4b
. The expression table is stored in the folderhcc1395_featurecounts
and filehcc1395_gene_expression.csv
hcc1395_phenotypes.csv
is the path to the phenotypes table as referenced from/data/user/hcc1395_b4b
.- The study name will be
hcc1395
and this will be prepended to all output files. - The output folder is
hcc1395_deg
as referenced from/data/user/hcc1395_b4b
.
Rscript b4b_scripts/filter_expression.R hcc1395_featurecounts/hcc1395_gene_expression.csv 2 hcc1395_phenotypes.csv hcc1395 hcc1395_deg
List the contents of hcc1395_deg
from /data/user/hcc1395_b4b
to check that a filterd expression CSV file is present.
Use wc -l
to compare number of lines in the unfiltered expression table versus the filtered.
The unfiltered expression table has 1336 lines while the filtered table has 593 lines. Thus, 743 genes were removed as a result of filtering based on the criteria set. The command expr
can be used to perform arithmetic operations in Unix command line and the construct below subtracts 593 (number of lines in the filtered expression table) from the number lines in the unfiltered expression table (1336).
Quality Check
Below are the quality checks that will be performed on the filtered gene expression table.
- Principal Components Analysis (PCA): This transforms high dimensional data such as those derived from RNA sequencing so that researchers can see how study variables cluster together. The result of PCA is that the original data is projected onto a set of perpendicular axes where each axis accounts for a percentage of variance in the data. To learn the math behind PCA, see https://www.iro.umontreal.ca/~pift6080/H09/documents/papers/pca_tutorial.pdf. Given that this example dataset has normal and tumor samples, it is expected the samples under each condition would separate along the principal component axis that accounts for the most variance in the dataset. Thus, indicating it is the biology that differentiates normal and tumor samples.
- Box and density plots enable scientists to visualize the distribution of expression data and check for outliers. Ideally, the expression distribution for all samples should be roughly the same for differential expression analysis (see h3abionet). These two visuals are also useful to determine whether a specific normalization method worked well.
- Distance plot shows the distance between samples. It is expected that samples within condition will be closer to each other (distance wise) than those samples from other conditions.
To run quality_check.R
, use the construct below. Remember, to run a R script from the command line, start with the command Rscript
, followed by:
- Script name, which is
quality_check.R
. However, becausequality_check.R
is in the folderb4b_script
, it is referenced usingb4b_script/quality_check.R
from the current folder of/data/user/hcc1395_b4b
. - The name of expression table. Here, the filtered expression table
hcc1395_gene_expression_filtered.csv
inhcc1395_deg
will be used. As referenced from/data/user/hcc1395_b4b
, the path would behcc1395_deg/hcc1395_gene_expression_filtered.csv
. - The study name (ie.
hcc1395
), which will be prepended to all output files. - Directory in which to write the output (ie.
hcc1395_deg
).
Rscript b4b_scripts/quality_check.R hcc1395_deg/hcc1395_gene_expression_filtered.csv hcc1395 hcc1395_deg
hcc1395_box.png hcc1395_distance.png hcc1395_mean_variance.png
hcc1395_density.png hcc1395_gene_expression_filtered.csv hcc1395_pca.png
Principal Components Results
The Principal Components plot below graphs the sample along the PC1 and PC2 axes, which account for the highest and second highest variances in the data, respectively. PC1 explains 67.5% of variance while PC2 explains 8.7%. It is clear that the normal and tumor samples separate along PC1, which indicates that it is the biology that is differentiating these samples between these groups. Within group samples separation along PC2 could be due to differences between samples of same group or maybe batch effect (although batch information is not available).
Distance Plot Results
The distance plot is also a good tool for determining if the samples within groups are clustered to together. The legend scale indicates the distance and in this plot (the smaller the number the smaller the distance). It is expected that samples within the same group be closer to each other as compared to samples from a different biological condition. This QC step passes for the filtered gene expression table.
Expression Distribution
Even without normalization, the filtered expression distribution are very similar between the samples with the median and 25th to 75th percentiles higher in the tumor samples.