Alignment

RNASeq Mapping Challenges

Most eukaryotic mRNA is produced by splicing together discontinuous exons, which creates specific challenges when aligning RNASeq reads to a genome: a read that spans an exon-exon junction has no contiguous match in the genomic sequence.

Mapping Challenges

Reads are not perfect (sequencing errors cause mismatches with the reference)
Duplicate molecules (PCR artifacts skew quantitation)
Multimapped reads - reads from repeated sequences align equally well to several locations, so some regions of the genome are classified as unmappable
Aligners try very hard to align every read, so the fewest artifacts occur when all possible genomic locations are provided (align to the genome rather than the transcriptome)
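Two of the challenges above, duplicates and multimapped reads, can be seen directly in aligner output. The following is a minimal sketch, not a replacement for a real QC tool. It assumes plain-text SAM records, that duplicates have already been flagged by a duplicate-marking tool (FLAG bit 0x400), and that the aligner writes the optional NH tag (number of reported alignments for the read). The function name `summarize_sam` and the example records are made up for illustration.

```python
# SAM FLAG bits used below (defined in the SAM format specification).
FLAG_UNMAPPED = 0x4
FLAG_SECONDARY = 0x100
FLAG_DUPLICATE = 0x400
FLAG_SUPPLEMENTARY = 0x800


def summarize_sam(lines):
    """Count unmapped, duplicate and multimapped reads in SAM text lines."""
    counts = {"total": 0, "unmapped": 0, "duplicate": 0, "multimapped": 0}
    for line in lines:
        if line.startswith("@"):
            continue  # header line
        fields = line.rstrip("\n").split("\t")
        flag = int(fields[1])
        # Count each read once: skip secondary and supplementary records.
        if flag & (FLAG_SECONDARY | FLAG_SUPPLEMENTARY):
            continue
        counts["total"] += 1
        if flag & FLAG_UNMAPPED:
            counts["unmapped"] += 1
            continue
        if flag & FLAG_DUPLICATE:
            counts["duplicate"] += 1
        # Optional tags start at column 12; NH is assumed to be 1 if absent.
        nh = next((int(f[5:]) for f in fields[11:] if f.startswith("NH:i:")), 1)
        if nh > 1:
            counts["multimapped"] += 1
    return counts


# Hypothetical records, for illustration only.
example_sam = [
    "@HD\tVN:1.6\tSO:coordinate",
    "read1\t0\tchr1\t100\t255\t50M\t*\t0\t0\t*\t*\tNH:i:1",
    "read2\t1024\tchr1\t100\t255\t50M\t*\t0\t0\t*\t*\tNH:i:1",
    "read3\t0\tchr2\t500\t3\t50M\t*\t0\t0\t*\t*\tNH:i:2",
    "read3\t256\tchr5\t900\t3\t50M\t*\t0\t0\t*\t*\tNH:i:2",
    "read4\t4\t*\t0\t0\t*\t*\t0\t0\t*\t*",
]
print(summarize_sam(example_sam))
```

In practice the same numbers are reported by the QC programs listed later in this section; the sketch only shows where they come from.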

RNASeq Mapping Solutions

There are a number of specific solutions that have been devised to address the issues created by attempting to map mRNA to DNA genomes. Each of these has its advantages and disadvantages.

Align against the transcriptome
- Many/All transcriptomes are incomplete
- Can only measure known genes
- Won’t detect non-coding RNAs
- Can’t look at splicing variants
- Can’t detect fusion genes or structural variants
De novo assembly of RNASeq reads
- Largely used for uncharacterized genomes
Align against the genome using a splice-aware aligner
- Most versatile solution
Pseudo-aligners / quasi-mappers (Salmon and Kallisto)
- New class of programs - blazingly fast
- Map to the transcriptome (not the genome) and perform quantitation
- Surprisingly accurate except for very low abundance signals
- With bootstrapping, can give confidence values

The complexity of accurately mapping millions of reads against large genomes can be appreciated by looking at a timeline of the development of different mapping programs.

Common Aligners

Most alignment algorithms rely on the construction of auxiliary data structures, called indices, which are built for the sequence reads, the reference genome sequence, or both. Mapping algorithms can largely be grouped into two categories based on the properties of their indices: algorithms based on hash tables, and algorithms based on the Burrows-Wheeler transform.
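The two index families can be illustrated with a toy sketch. It is for intuition only: real aligners use compressed, memory-efficient implementations (for example an FM-index built on the Burrows-Wheeler transform), not the naive code below. The function names and the example sequences are assumptions made for illustration.

```python
from collections import defaultdict


def kmer_hash_index(reference, k):
    """Hash-table index: map every k-mer to the positions where it occurs."""
    index = defaultdict(list)
    for i in range(len(reference) - k + 1):
        index[reference[i:i + k]].append(i)
    return index


def bwt(text):
    """Naive Burrows-Wheeler transform via sorted rotations.

    A '$' terminator (sorting before A, C, G, T) marks the end of the text.
    Sorting all rotations is far too slow and memory-hungry for a genome;
    it is used here only because it is easy to follow.
    """
    text = text + "$"
    rotations = sorted(text[i:] + text[:i] for i in range(len(text)))
    return "".join(rotation[-1] for rotation in rotations)


# Hash-table lookup: where could a read starting with "ACG" be placed?
index = kmer_hash_index("ACGTACGA", k=3)
print(index["ACG"])

# The transform groups identical characters, which helps compression.
print(bwt("GATTACA"))
```

A hash index answers "where does this k-mer occur?" with a single lookup but is large; a Burrows-Wheeler based index is much more compact, which is one reason it is used for large genomes.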

Bowtie2
BWA/BWA-MEM
STAR
HISAT
HISAT2
TopHat
TopHat2

Fonseca NA, Rung J, Brazma A, Marioni JC. Tools for mapping high-throughput sequencing data. Bioinformatics, Volume 28, Issue 24, 1 December 2012, Pages 3169–3177, https://doi.org/10.1093/bioinformatics/bts605

To Align or not to Align

Aligners typically align against the entire genome and provide an output where the results can be visually inspected (BAM file via IGV). They must be used for detecting novel genes/transcripts. Quantitation of aligned reads to specific genes is typically done by a separate program.
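The separate quantitation step can be sketched as counting the aligned reads that overlap each annotated gene. The following is a simplified illustration, assuming half-open coordinates, ignoring strand and spliced alignments, and using made-up gene names and positions. Dedicated counting programs handle these details and should be used for real data.

```python
def count_reads_per_gene(genes, reads):
    """Assign each read to the single gene it overlaps.

    genes: dict of gene name -> (chrom, start, end), half-open coordinates
    reads: list of (chrom, start, end) aligned read intervals
    Reads overlapping no gene or more than one gene are tallied separately.
    """
    counts = {name: 0 for name in genes}
    counts["__no_feature"] = 0
    counts["__ambiguous"] = 0
    for chrom, start, end in reads:
        hits = [
            name
            for name, (g_chrom, g_start, g_end) in genes.items()
            if g_chrom == chrom and start < g_end and end > g_start
        ]
        if len(hits) == 1:
            counts[hits[0]] += 1
        elif not hits:
            counts["__no_feature"] += 1
        else:
            counts["__ambiguous"] += 1
    return counts


# Hypothetical annotation and alignments, for illustration only.
genes = {
    "geneA": ("chr1", 100, 500),
    "geneB": ("chr1", 450, 900),
    "geneC": ("chr2", 100, 300),
}
reads = [
    ("chr1", 120, 170),   # inside geneA only
    ("chr1", 460, 510),   # overlaps geneA and geneB
    ("chr1", 600, 650),   # inside geneB only
    ("chr2", 1000, 1050), # intergenic
]
print(count_reads_per_gene(genes, reads))
```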

Pseudo-aligners assign reads to the most appropriate transcript, so they can’t find novel genes/transcripts or other anomalies. They are generally much faster than aligners and are likely more accurate for quantitation (recent improvements in Salmon have increased its accuracy, at the expense of being somewhat slower than the original).
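The idea of assigning a read to compatible transcripts without a base-by-base alignment can be sketched with k-mers. This is a toy illustration of the general principle only, not the actual algorithm used by Salmon or Kallisto. The transcript names, sequences and the choice of k are assumptions made for the example.

```python
from collections import defaultdict


def build_kmer_to_transcripts(transcripts, k):
    """Index: k-mer -> set of transcript names that contain it."""
    index = defaultdict(set)
    for name, seq in transcripts.items():
        for i in range(len(seq) - k + 1):
            index[seq[i:i + k]].add(name)
    return index


def compatible_transcripts(read, index, k):
    """Intersect the transcript sets of all read k-mers found in the index.

    k-mers absent from the index (for example due to a sequencing error)
    are skipped. An empty result means the read matches no known transcript,
    which is why novel transcripts cannot be found this way.
    """
    compatible = None
    for i in range(len(read) - k + 1):
        hits = index.get(read[i:i + k])
        if not hits:
            continue
        compatible = set(hits) if compatible is None else compatible & hits
    return compatible or set()


# Two hypothetical transcripts sharing their first 7 bases.
transcripts = {"tx_a": "ACGTACGTTAGC", "tx_b": "ACGTACGGGCTA"}
index = build_kmer_to_transcripts(transcripts, k=5)

print(compatible_transcripts("ACGTACG", index, k=5))   # shared region
print(compatible_transcripts("TACGTTAG", index, k=5))  # unique to tx_a
print(compatible_transcripts("TTTTTTT", index, k=5))   # no known transcript
```

Reads compatible with several transcripts are what the quantitation step has to resolve statistically; no genomic coordinates are produced, so there is nothing to inspect in a genome browser.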

Typical Questions about alignment

What is the best aligner to use?
What Genome version should I use?
What Genome annotation should I use?

Answers

STAR (or Salmon or Kallisto) - subjective
Depends! Most recent or best annotated
GENCODE, with caveats - know what is being annotated and what is not, and how it affects your results

Questions not asked

What parameters should I use?

Answers

Most programs have lots of optional parameters that can tweak the results, but most are set to defaults that should work in most common situations. (Don’t touch what you don’t understand - especially if it gets you your favorite answer.)

Special Consideration for Alternative Splicing Events

Adding to the mRNA mapping problem is the existence of alternative splicing events. Attempting to identify alternative splicing in RNASeq data is not something for the novice to attempt! Get professional help.

Post Alignment QC Programs

The RSeQC package provides a number of useful modules that can comprehensively evaluate high-throughput sequence data, especially RNA-seq data. “Basic modules” quickly inspect sequence quality, nucleotide composition bias, PCR bias and GC bias, while “RNA-seq specific modules” investigate sequencing saturation status of both splicing junction detection and expression estimation, mapped reads clipping profile, mapped reads distribution, coverage uniformity over gene body, reproducibility, strand specificity and splice junction annotation.

MultiQC is a modular tool to aggregate results from bioinformatics analyses across many samples into a single report.

Picard Tools - CollectRnaSeqMetrics is a module that produces metrics about the alignment of RNA-seq reads within a SAM file to genes.

Figure: examples of RSeQC plot types

Post Alignment Cleanup Programs

Picard is a set of command line tools for manipulating high-throughput sequencing (HTS) data and formats such as SAM/BAM/CRAM and VCF (used here to mark PCR duplicates).

Samtools provides various utilities for manipulating alignments in the SAM/BAM format, including sorting, merging, indexing and generating alignments in a per-position format.

BamTools is a command-line toolkit for reading, writing, and manipulating BAM (genome alignment) files.