Lesson 17: RNA sequencing review 2

Learning objectives

This lesson will serve as a comprehensive review of Module 2. We will spend

roughly the first hour reviewing the Module 2 material
the second hour answering specific questions from the poll in Lesson 16 and other questions you may have about the course or your own data

Review of RNA sequencing concepts

Purpose of RNA sequencing and what biological questions can RNA sequencing answer
Experimental considerations
- Sample preparation
  - Replicates
  - Technical noise
- Read depth
  - More depth for low expression genes
  - More depth for low fold differences
- Read length
  - Longer reads have higher chance of mapping uniquely
  - Duplicate genes may require longer reads to distinguish
- Paired or unpaired
  - Paired help mapping across splice junctions
- Replicates
- RNA quality

RNA sequencing analysis considerations

For this portion of the review, let's create a directory in the ~/biostar_class folder called review1.

Check your present working directory (PWD)

pwd

If you are not in the ~/biostar_class folder, change into it

cd ~/biostar_class

Create the review1 folder

mkdir review1

Change into the review1 folder

cd review1
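The steps above can also be written so that they are safe to repeat. This is a minimal sketch, assuming the class folder lives at ~/biostar_class; mkdir -p creates any missing parent folders and does not complain if review1 already exists.

```shell
# Create review1 inside the class folder (no error if it is already there),
# then move into it and confirm the location.
mkdir -p ~/biostar_class/review1
cd ~/biostar_class/review1
pwd
```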

What files do we need for RNA sequencing analysis?

Reference genome or transcriptome
Annotation files (gff or gtf) that tell us where the genomic features are (e.g., genes, transcripts)
Raw sequencing files in FASTQ (or fq) format
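As a reminder of what a FASTQ file looks like, each read takes up four lines (header, sequence, separator, quality string), so the number of reads is the line count divided by four. The sketch below uses a tiny made-up file named toy.fastq; the file name and its contents are invented for illustration only.

```shell
# Write a tiny, made-up FASTQ file with two reads
# (4 lines per read: header, sequence, separator, quality string).
cat > toy.fastq <<'EOF'
@read1
ACGTACGT
+
IIIIIIII
@read2
TTGGCCAA
+
IIIIIIII
EOF

# Count reads: total number of lines divided by 4.
n_lines=$(wc -l < toy.fastq)
n_reads=$((n_lines / 4))
echo "reads: $n_reads"
```

For a gzip-compressed file, the same line count can be taken from the output of gzip -dc piped into wc -l.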

Where is my data?

Where is the RNA sequencing data that we will be using for this course? Remember we will be demonstrating the steps of RNA sequencing in class using the Human Brain Reference (HBR) and Universal Human Reference (UHR). For practice, we will be using the Golden Snidget dataset from the Biostars handbook.

Answer

The data are in the data folder directly under the root directory, so

cd /data

HBR and UHR raw sequencing data are in the RNASEQ folder

ls RNASEQ

HBR and UHR reference genome and annotation files are in the refs folder

ls refs

Golden Snidget data are in the folder golden

tree golden

Where and what are the tools that we are using for RNA sequencing analysis?

Answer

Tools that we will be using for RNA sequencing analysis in this course series include command line applications for raw data quality assessment, data cleanup, trimming, alignment, etc. We will also be running several R scripts on the command line to obtain differential expression and to visualize gene expression in our datasets. Note that once you have completed this course, you will actually need to learn R and use the differential expression packages in R rather than relying on the Biostar helper scripts.

If we list the contents of the /usr/miniconda3/bin folder, we can see some of the applications that we will be running, such as FASTQC.

ls /usr/miniconda3/bin

The R helper scripts are located in /usr/local/code. This folder has also been exported for us as an environment variable (CODE), so we can reference it using $CODE.

ls /usr/local/code

ls $CODE
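Rather than scanning a long folder listing, we can also ask the shell directly whether a tool is available. The following is a rough sketch; the function name check_tools and the list of tool names are assumptions, so edit the list to match what is installed on your own system.

```shell
# Report whether each tool named in this lesson is on the PATH.
# The list below is an assumption; edit it to match your own setup.
check_tools() {
    for tool in fastqc multiqc hisat2 salmon samtools featureCounts; do
        if command -v "$tool" > /dev/null 2>&1; then
            echo "found:   $tool ($(command -v "$tool"))"
        else
            echo "missing: $tool"
        fi
    done
}
check_tools

# Show where the helper-script variable points, if it is set.
echo "CODE is set to: ${CODE:-<not set>}"
```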

RNA sequencing data quality control

What tool(s) are available to assess the quality of our raw sequencing data?

Answer

FASTQC will generate quality assessment reports for each FASTQ file separately (i.e., if you have 12 files, 12 separate FASTQC reports will be generated).

fastqc /data/RNASEQ/*.fastq.gz

MultiQC will merge multiple FASTQC reports into one

multiqc .
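By default FASTQC writes its reports next to the input files, which may not be possible if the data folder is read-only. One way around this is to send the reports to a folder of our own with the -o option and then point MultiQC at that folder. This is a sketch under assumptions: the FASTQ files are taken to sit directly under /data/RNASEQ with a .fastq.gz ending, and fastqc_reports is a hypothetical folder name. The commands only run if the tools and files are found.

```shell
reads_dir=/data/RNASEQ      # assumed location of the FASTQ files
out_dir=fastqc_reports      # hypothetical output folder name

mkdir -p "$out_dir"
if command -v fastqc > /dev/null 2>&1 && ls "$reads_dir"/*.fastq.gz > /dev/null 2>&1; then
    # One report per FASTQ file, all written into out_dir.
    fastqc -o "$out_dir" "$reads_dir"/*.fastq.gz
    # Merge the individual reports into one summary, if MultiQC is installed.
    command -v multiqc > /dev/null 2>&1 && multiqc -o "$out_dir" "$out_dir"
else
    echo "fastqc or FASTQ files not found; no reports were made"
fi
```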

RNA sequencing - cleanup of FASTQ files

What tools are available for quality and adapter trimming?

Answer

Trimmomatic and BBDuk

Visualizing genomic data

What tool do we use to visualize genomic data?

Answer

Integrative Genomics Viewer (IGV)

Aligning raw sequencing data to genome or transcriptome

After assessing the quality of our raw sequencing data and performing cleanup if necessary, the next step is to align the raw sequencing data to a genome or transcriptome. What tools can we use?

Answer

For alignment of RNA sequencing data to a genome, we need to use a splice-aware aligner such as HISAT2. STAR is another popular genome-based aligner. To align the raw sequencing data to a reference transcriptome, we can use salmon.

What do we need to do to the reference genome or transcriptome before we can use it for alignment, and why?

Answer

We need to index the genome or transcriptome. Indexing is like creating a table of contents for a book, which helps to make searching more efficient (i.e., searching only part of the book rather than the entire book).

To index a reference genome using HISAT2, we can use

hisat2-build

To index a reference transcriptome using salmon

salmon index
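To make the table of contents idea concrete, here is a toy illustration in awk. It is not how HISAT2 or salmon build their indexes; it only shows the principle of recording where each short word occurs once, so that later lookups do not have to scan the whole sequence. The sequence, the word length k, and the query are all made up.

```shell
genome="ACGTACGGTACC"   # made-up toy sequence
query="GTAC"            # made-up word to look up

hits=$(echo "$genome" | awk -v k=4 -v query="$query" '
{
    # Build the table of contents: each k-letter word and where it starts.
    for (i = 1; i <= length($0) - k + 1; i++) {
        word = substr($0, i, k)
        if (word in where) { where[word] = where[word] "," i }
        else { where[word] = i }
    }
    # Look the query up directly instead of rescanning the sequence.
    print query " starts at position(s): " where[query]
}')
echo "$hits"
# prints: GTAC starts at position(s): 3,8
```

The query occurs twice in this toy sequence, which is also a small reminder of why short reads can map to more than one place.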

After alignment, in what file format are our alignment results stored?

Answer

Our alignment results are stored in SAM (Sequence Alignment/Map) format. This format is tab delimited, which means that the columns in the file are separated by tabs. This alignment output is human readable, and we can view it in Excel. It provides various information about our alignment, including

Mapping position
Matches, mismatches, insertions, deletions (see CIGAR string)
Mapping quality
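Because SAM is tab delimited, these fields can be pulled out with standard command line tools. In a SAM record, column 3 is the reference name, column 4 the mapping position, column 5 the mapping quality, and column 6 the CIGAR string. The sketch below uses a single made-up record written to a hypothetical file named toy.sam.

```shell
# One made-up alignment record; fields are separated by tabs.
printf 'read1\t0\tchr22\t1500\t60\t8M\t*\t0\t0\tACGTACGT\tIIIIIIII\n' > toy.sam

# Skip header lines (they start with @), then print the read name,
# reference name, mapping position, mapping quality and CIGAR string.
summary=$(awk -F '\t' '!/^@/ { print $1, $3, $4, $5, $6 }' toy.sam)
echo "$summary"
# prints: read1 chr22 1500 60 8M
```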

We know that the aligners write the output into a human readable SAM file. But the applications that we will use for downstream analysis require a file that is machine readable. What do we do in this case?

Answer

We can use samtools to convert the SAM file to a machine readable format known as BAM. When converting a SAM file to BAM, we usually want to sort it by genomic coordinate, so we would use samtools sort. After the BAM file has been created, we will need to use samtools index to index the BAM file.
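The two samtools steps can be sketched as follows. The file names aligned.sam and aligned.sorted.bam are hypothetical placeholders, and the commands only run if samtools and the input file are actually present.

```shell
sam=aligned.sam          # hypothetical input file name
bam=aligned.sorted.bam   # hypothetical output file name

if command -v samtools > /dev/null 2>&1 && [ -f "$sam" ]; then
    samtools sort -o "$bam" "$sam"   # convert to BAM, sorted by coordinate
    samtools index "$bam"            # writes the index next to the BAM file
    status="done"
else
    status="skipped"
    echo "samtools or $sam not found; nothing was converted"
fi
```

The index file (ending in .bai) needs to stay in the same folder as the BAM file so that viewers and downstream tools can find it.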

Visualizing alignment results

What tool can we use to visualize the alignment results? What files will we need to use?

Answer

We can use the Integrative Genomics Viewer (IGV). We will need our reference genome, an indexed BAM file, and optionally an annotation file (gtf or gff).

Obtaining expression counts

Next, what application can we use to obtain an expression counts table?

Answer

We can use featureCounts if we aligned the reads to the genome. We will need our annotation file and either SAM or BAM files.

If we aligned to a transcriptome using salmon then we can use R packages such as tximport. In this course series, we used this package through the Biostars R helper script combine_transcripts.r.
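For the genome-based route, a featureCounts call generally takes the annotation with -a, the output file with -o, and then the alignment files. Its output table starts with a comment line, followed by a header and one row per gene, with the per-sample counts from column 7 onward. The sketch below builds a small made-up table in that layout (the file name toy_counts.txt, gene names, and numbers are invented) and trims it down to gene identifiers and counts.

```shell
# A made-up table in the layout featureCounts writes: a comment line,
# a header, then one row per gene (gene names and counts are invented).
printf '# Program:featureCounts (toy example)\n' > toy_counts.txt
printf 'Geneid\tChr\tStart\tEnd\tStrand\tLength\tHBR_1.bam\tUHR_1.bam\n' >> toy_counts.txt
printf 'geneA\tchr22\t100\t900\t+\t801\t12\t340\n' >> toy_counts.txt
printf 'geneB\tchr22\t2000\t2500\t-\t501\t0\t7\n' >> toy_counts.txt

# Drop the comment line, then keep the gene identifier (column 1)
# and the per-sample counts (column 7 onward).
counts=$(grep -v '^#' toy_counts.txt | cut -f 1,7-)
echo "$counts"
```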

Differential expression

After obtaining the expression counts, we can proceed to differential expression analysis. What tool did we use for this in this course, and what are the alternatives?

Answer

We used DESeq2 to obtain differential expression in this class through the R helper script deseq2.r. Alternatives for differential expression analysis include edgeR and limma-voom.

Visualizing gene expression

What is one way to visualize gene expression?

Answer

We can use a heatmap, which plots gene expression values on a color scale and allows us to discern gene expression clusters that are specific to sample or treatment groups.