Lesson 17: RNA sequencing review 2
Learning objectives
This lesson will serve as a comprehensive review of Module 2. We will spend
- roughly the first hour reviewing the Module 2 material
- the second hour answering specific questions from the poll in Lesson 16 and other questions you may have about the course or your own data
Review of RNA sequencing concepts
- Purpose of RNA sequencing and what biological questions can RNA sequencing answer
- Experimental considerations
- Sample preparation
- Replicates
- Technical noise
- Read depth
- More depth for low expression genes
- More depth for low fold differences
- Read length
- Longer reads have higher chance of mapping uniquely
- Duplicate genes may require longer reads to distinguish
- Paired or unpaired
- Paired help mapping across splice junctions
- Replicates
- RNA quality
- Sample preparation
RNA sequencing analysis considerations
For this portion of the review, let's create a directory in the ~/biostar_class folder called review1.
Check your present working directory (PWD)
pwd
If you are not in the ~/biostar_class folder, change into it
cd ~/biostar_class
Create the review1 folder
mkdir review1
Change into the review1 folder
cd review1
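The steps above can also be collapsed into two commands. This is a minimal sketch, assuming you are happy for ~/biostar_class to be created if it is missing; the -p option makes mkdir succeed even when the folder already exists, so it is safe to rerun.

```shell
# Create review1 (and ~/biostar_class if it is missing), then move into it.
# -p makes mkdir succeed even when the folder already exists.
mkdir -p ~/biostar_class/review1
cd ~/biostar_class/review1

# Confirm where we ended up.
pwd
```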
What files do we need for RNA sequencing analysis?
- Reference genome or transcriptome
- Annotation files (gff or gtf) that tell us where the genomic features (e.g. genes, transcripts) are located
- Raw sequencing files in FASTQ (or fq) format
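As a reminder of what the raw data look like, each read in a FASTQ file takes four lines: an identifier starting with @, the sequence, a + separator, and the quality string. The sketch below writes a tiny FASTQ file and counts its reads by dividing the number of lines by four. The file name toy.fastq and the two reads are invented for illustration and are not part of the course data.

```shell
# Write a two-read toy FASTQ file (invented data, for illustration only).
printf '%s\n' \
  '@read1' 'ACGTACGT' '+' 'IIIIIIII' \
  '@read2' 'TTGCAAGC' '+' 'IIIIHHHH' > toy.fastq

# Each record is 4 lines, so number of reads = number of lines / 4.
n_reads=$(awk 'END { print NR / 4 }' toy.fastq)
echo "toy.fastq contains $n_reads reads"
```

For a compressed file, the same count could be obtained by decompressing to the pipe first, for example with gunzip -c in front of the awk command.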
Where is my data?
Where is the RNA sequencing data that we will be using for this course? Remember we will be demonstrating the steps of RNA sequencing in class using the Human Brain Reference (HBR) and Universal Human Reference (UHR). For practice, we will be using the Golden Snidget dataset from the Biostars handbook.
Answer
Check the data folder under root (/data), so
cd /data
HBR and UHR raw sequencing data are in the RNASEQ folder
ls RNASEQ
HBR and UHR reference genome and annotation files are in the refs folder
ls refs
Golden Snidget data are in the folder golden
tree golden
Where and what are the tools that we are using for RNA sequencing analysis?
Answer
Tools that we will be using for RNA sequencing analysis in this course series include command line applications for raw data quality assessment, data cleanup, trimming, alignment, etc. We will also be running several R scripts on the command line to obtain differential expression and to visualize gene expression in our datasets. Note that once you have completed this course, you will actually need to learn R and use the differential expression packages in R rather than relying on the Biostar helper scripts.
If we list the contents of the /usr/miniconda3/bin folder, we can see some of the applications that we will be running, such as FASTQC.
ls /miniconda3/bin
The R helper scripts are located in /usr/local/code. Because this folder has been exported for us as an environment variable (CODE), we can also reference it using $CODE.
ls /usr/local/code
ls $CODE
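Another way to find out where a tool lives is to ask the shell which executable it would run. The sketch below loops over several of the tools named in this lesson and reports the path of each one, or a note if it is not on the PATH. The executable names are the usual ones for these tools, which is an assumption; your installation may name them differently.

```shell
# Executable names are assumed to be the usual ones; adjust if yours differ.
checked=0
for tool in fastqc multiqc hisat2 salmon samtools featureCounts; do
    # command -v prints the full path of the executable if it is on the PATH.
    if tool_path=$(command -v "$tool"); then
        echo "$tool -> $tool_path"
    else
        echo "$tool -> not found on PATH"
    fi
    checked=$((checked + 1))
done
echo "Checked $checked tools"
```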
RNA sequencing data quality control
- What tool(s) are available to assess the quality of our raw sequencing data?
Answer
FASTQC will generate quality assessment reports for each FASTQ file separately (i.e. if you have 12 files, 12 separate FASTQC reports will be generated).
fastqc /data/RNASEQ/*.fastq.gz
MultiQC will merge multiple FASTQC reports into one
multiqc .
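By default FASTQC writes its reports next to the input files, which can be a problem if the data folder is not writable. The sketch below sends the reports to a separate folder and then merges them with MultiQC. It assumes the FASTQ files sit directly in /data/RNASEQ, and the folder name qc is a hypothetical choice. The commands are only run if the tools and input files are actually present.

```shell
# qc is a hypothetical folder name for the reports.
# FASTQC requires the output folder to exist before it runs.
mkdir -p qc

# Only run when both tools and the input files are available.
if command -v fastqc >/dev/null 2>&1 \
   && command -v multiqc >/dev/null 2>&1 \
   && ls /data/RNASEQ/*.fastq.gz >/dev/null 2>&1; then
    # -o sets the output folder; one report is written per FASTQ file.
    fastqc -o qc /data/RNASEQ/*.fastq.gz
    # Scan the qc folder for FASTQC reports and write the merged report there too.
    multiqc -o qc qc
fi
```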
RNA sequencing - cleanup of FASTQ files
What tools are available for quality and adapter trimming?
Answer
Trimmomatic and BBDuk
Visualizing genomic data
What tool do we use to visualize genomic data?
Answer
Integrative Genomics Viewer (IGV)
Aligning raw sequencing data to genome or transcriptome
After assessing the quality of our raw sequencing data and performing cleanup if necessary, the next step is to align the raw sequencing data to a genome or transcriptome. What tools can we use?
Answer
For alignment of RNA sequencing data to a genome, we need to use a splice-aware aligner such as HISAT2. STAR is another popular genome-based aligner. Salmon can be used to align the raw sequencing data to a reference transcriptome.
What do we need to do to the reference genome or transcriptome before we can use it for alignment, and why?
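As a reminder of what the alignment calls look like, here is a minimal sketch for unpaired reads. It assumes an index has already been built, and every file and folder name in it (refs/genome.fa as the index prefix, refs/transcripts_idx, reads/sample.fq, and the outputs) is a hypothetical placeholder rather than a path from the course data. The commands only run if the tool and the input file exist.

```shell
# All names below are hypothetical placeholders.
IDX=refs/genome.fa               # HISAT2 index prefix
SALMON_IDX=refs/transcripts_idx  # salmon index folder
READS=reads/sample.fq            # unpaired reads

if command -v hisat2 >/dev/null 2>&1 && [ -f "$READS" ]; then
    # -x index prefix, -U unpaired reads, -S SAM output file.
    # For paired reads, -1 and -2 replace -U.
    hisat2 -x "$IDX" -U "$READS" -S sample.sam
fi

if command -v salmon >/dev/null 2>&1 && [ -f "$READS" ]; then
    # -i index folder, -l A lets salmon detect the library type,
    # -r unpaired reads, -o output folder.
    salmon quant -i "$SALMON_IDX" -l A -r "$READS" -o salmon_sample
fi
```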
Answer
We need to index the genome or transcriptome. Indexing is like creating a table of contents for a book, which helps to make searching more efficient (i.e. searching a part of the book rather than the entire book).
To index a reference genome using HISAT2, we can use
hisat2-build
To index a reference transcriptome using salmon
salmon index
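The two indexing commands above need a few arguments to run. A minimal sketch follows, in which the FASTA file names and the salmon index folder are hypothetical placeholders. Each command is skipped unless the tool is installed and the input file exists.

```shell
# All file and folder names here are hypothetical placeholders.
GENOME=refs/genome.fa            # reference genome FASTA
TRANSCRIPTS=refs/transcripts.fa  # reference transcriptome FASTA

# HISAT2 usage: hisat2-build <reference FASTA> <index prefix>
if command -v hisat2-build >/dev/null 2>&1 && [ -f "$GENOME" ]; then
    # Reusing the FASTA name as the prefix keeps the index files next to the genome.
    hisat2-build "$GENOME" "$GENOME"
fi

# salmon: -t is the transcript FASTA, -i is the folder to write the index to.
if command -v salmon >/dev/null 2>&1 && [ -f "$TRANSCRIPTS" ]; then
    salmon index -t "$TRANSCRIPTS" -i refs/transcripts_idx
fi
```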
After alignment, what file format are our alignment results stored in?
Answer
Our alignment results are stored in SAM (Sequence Alignment/Map) format. This format is tab delimited, which means that the columns in the file are separated by tabs. This alignment output is human readable, and we can view it in Excel. It provides various information about our alignment including
- Mapping position
- Matches, mismatches, insertions, deletions (see CIGAR string)
- Mapping quality
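Because SAM is tab delimited, the fields listed above can be pulled out with standard command line tools. The sketch below writes a single invented SAM record to a hypothetical file called toy.sam and prints the reference name and mapping position (columns 3 and 4), the mapping quality (column 5), and the CIGAR string (column 6).

```shell
# One invented SAM record; fields are separated by tabs (\t).
printf 'read1\t0\tchr22\t1000\t60\t5M1I4M\t*\t0\t0\tACGTAACGTA\tIIIIIIIIII\n' > toy.sam

# Column 3 = reference name, 4 = mapping position, 5 = mapping quality, 6 = CIGAR.
# Lines starting with @ are header lines and are skipped.
summary=$(awk -F'\t' '!/^@/ { print $3 ":" $4, "MAPQ=" $5, "CIGAR=" $6 }' toy.sam)
echo "$summary"
```

Here the CIGAR string 5M1I4M reads as 5 aligned bases, 1 inserted base, then 4 aligned bases, which adds up to the 10 bases of the read.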
We know that the aligners write the output into a human readable SAM file. But the applications that we will use for downstream analysis require a file that is machine readable. What do we do in this case?
Answer
We can use samtools to convert the SAM file to a machine readable format known as BAM, the binary version of SAM. When converting a SAM file to BAM, we usually want to sort it by genomic coordinate, so we use samtools sort. After the BAM file has been created, we need to use samtools index to index it.
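The sort and index steps can be sketched as follows. The file names sample.sam and sample.bam are hypothetical placeholders, and the commands are skipped unless samtools is installed and the SAM file exists.

```shell
# Hypothetical file names.
SAM=sample.sam   # aligner output
BAM=sample.bam   # sorted BAM file to create

if command -v samtools >/dev/null 2>&1 && [ -f "$SAM" ]; then
    # Sort by genomic coordinate (the default) and write the result as BAM.
    samtools sort -o "$BAM" "$SAM"
    # Create the index file (sample.bam.bai) next to the BAM file.
    samtools index "$BAM"
fi
```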
Visualizing alignment results
What tool can we use to visualize the alignment results? What files will we need to use?
Answer
We can use the Integrative Genomics Viewer (IGV). We will need our reference genome, an indexed BAM file, and optionally an annotation file (gtf or gff).
Obtaining expression counts
Next, what application can we use to obtain an expression counts table?
Answer
We can use featureCounts if we aligned the reads to the genome. We will need our annotation file and either SAM or BAM files.
If we aligned to a transcriptome using salmon then we can use R packages such as tximport. In this course series, we used this package through the Biostars R helper script combine_transcripts.r.
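For the genome-based route, a featureCounts call and a quick look at its output can be sketched as below. The annotation and BAM file names are hypothetical placeholders. The second half builds a toy table with invented numbers, laid out like featureCounts output (a comment line, then the columns Geneid, Chr, Start, End, Strand, Length followed by one count column per input file), and trims it to gene IDs and counts.

```shell
# Hypothetical file names.
GTF=refs/annotation.gtf
if command -v featureCounts >/dev/null 2>&1 && [ -f "$GTF" ]; then
    # -a annotation file, -o output table, followed by one or more BAM files.
    featureCounts -a "$GTF" -o counts.txt sample1.bam sample2.bam
fi

# Toy table with invented values, laid out like featureCounts output.
printf '# Program:featureCounts (toy example)\n' > toy_counts.txt
printf 'Geneid\tChr\tStart\tEnd\tStrand\tLength\tsample1.bam\tsample2.bam\n' >> toy_counts.txt
printf 'geneA\tchr22\t100\t900\t+\t801\t15\t42\n' >> toy_counts.txt

# Drop the comment line, then keep the gene ID (column 1) and the counts (column 7 onward).
simple=$(grep -v '^#' toy_counts.txt | cut -f 1,7-)
echo "$simple"
```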
Differential expression
After obtaining the expression counts, we can proceed to differential expression analysis. What tool did we use for this in this course and what are the alternatives?
Answer
We used DESeq2 to obtain differential expression in this class through the R helper script deseq2.r. Alternatives for differential expression analysis include edgeR and limma-voom.
Visualizing gene expression
What is one way we could use to visualize gene expression?
Answer
We can use a heatmap, which plots gene expression values on a color scale and allows us to discern gene expression clusters that are specific to sample or treatment groups.