Data Analysis Overview

RNA-Seq - Data Analysis Workflow

Mostly computationally intensive tasks requiring significant computer hardware.

Quality Control
- Sample quality and consistency
- Is Trimming appropriate - quality/adaptors
Alignment/Mapping
- Reference Target (Sequence and annotation)
- Alignment Program
- Alignment Parameters
- Mark Duplicates
- Post-Alignment Quality Assurance
Quantification
- Counting Method and Parameters
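
To make the "is trimming appropriate" question more concrete, here is a minimal sketch of how base qualities can be read from a FASTQ quality line. It assumes the common Phred+33 encoding; the function names, the example read, and the Q20 threshold are illustrative assumptions, not part of any particular QC tool.

```python
def phred_scores(quality_line, offset=33):
    """Decode a FASTQ quality string into integer Phred scores.

    Assumes Phred+33 encoding (ASCII code minus 33).
    """
    return [ord(ch) - offset for ch in quality_line]


def mean_quality(quality_line, offset=33):
    """Average Phred score across one read (0.0 for an empty read)."""
    scores = phred_scores(quality_line, offset)
    return sum(scores) / len(scores) if scores else 0.0


def trailing_low_quality(quality_line, threshold=20, offset=33):
    """Count consecutive bases at the 3' end with quality below threshold."""
    count = 0
    for score in reversed(phred_scores(quality_line, offset)):
        if score >= threshold:
            break
        count += 1
    return count


# Hypothetical read: ten high-quality bases followed by four poor ones.
example_quality = "IIIIIIIIII####"  # 'I' decodes to Q40, '#' to Q2

print(mean_quality(example_quality))
print(trailing_low_quality(example_quality))
```

A read with a long low-quality tail like this one is the kind of case where quality trimming may be worth considering; dedicated QC software reports the same information across all reads at once.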

Generally less computationally intensive tasks, doable on a personal computer.

Quantification
- Differential Expression - statistics
Visualization
- Visual inspection - IGV
- Data representation - scatter, violin plots, heat-maps
Biological Meaning
- Gene Set Enrichment
- Pathway Analysis
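
As a rough illustration of what the differential expression step starts from, the sketch below normalizes raw gene counts to counts per million and computes a log2 fold change between two conditions. The gene names, count values, and pseudocount are made up for illustration. This is only the descriptive part: a real analysis adds a statistical model of count variability across replicates, which is usually left to a dedicated package.

```python
import math


def counts_per_million(counts):
    """Scale one sample's raw gene counts by its library size."""
    total = sum(counts.values())
    return {gene: count * 1e6 / total for gene, count in counts.items()}


def mean_cpm(samples):
    """Average CPM per gene across the replicates of one condition."""
    cpms = [counts_per_million(sample) for sample in samples]
    return {gene: sum(c[gene] for c in cpms) / len(cpms) for gene in cpms[0]}


def log2_fold_change(control, treated, pseudocount=1.0):
    """log2(treated / control) per gene; the pseudocount avoids log of zero."""
    ctrl = mean_cpm(control)
    trt = mean_cpm(treated)
    return {
        gene: math.log2((trt[gene] + pseudocount) / (ctrl[gene] + pseudocount))
        for gene in ctrl
    }


# Hypothetical counts: two conditions, three replicates each.
control = [{"geneA": 100, "geneB": 900} for _ in range(3)]
treated = [{"geneA": 400, "geneB": 600} for _ in range(3)]

for gene, lfc in log2_fold_change(control, treated).items():
    print(gene, round(lfc, 2))
```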

Computational Considerations - The Good News

For the most part, the computational aspects have been taken care of for you (no need to develop new algorithms or code).

There are pre-built workflows that can automate many of the processes involved, and facilitate reproducibility.

Computational Considerations - The Bad News

Like most NGS data analysis, the complexity of RNA-Seq data analysis revolves around data and information management and dealing with “unexpected” issues.

Consider the simplest experiment (two conditions, three replicates):

6-12 fastq starting files

6-12 quality control files

6-12 fastq files post trimming of adaptors

6 bam files, and 6 bam index files

6 gene count files

36-54 files minimum (big files)
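
The bookkeeping above can be sketched as a small manifest generator: one FASTQ, one QC report, and one trimmed FASTQ per input read file, plus one BAM, one BAM index, and one count file per sample. The sample names and file suffixes below are hypothetical; a real workflow will use whatever naming its tools produce.

```python
def expected_files(samples, paired_end=False):
    """List the files a basic RNA-Seq run is expected to produce.

    File name suffixes are illustrative assumptions, not a standard.
    """
    mates = ["_R1", "_R2"] if paired_end else [""]
    files = []
    for sample in samples:
        for mate in mates:
            files.append(f"{sample}{mate}.fastq.gz")          # raw reads
            files.append(f"{sample}{mate}_qc.html")           # QC report
            files.append(f"{sample}{mate}.trimmed.fastq.gz")  # after trimming
        files.append(f"{sample}.bam")         # alignment
        files.append(f"{sample}.bam.bai")     # alignment index
        files.append(f"{sample}.counts.txt")  # gene counts
    return files


# Two conditions, three replicates each.
samples = [
    f"{condition}_rep{replicate}"
    for condition in ("control", "treated")
    for replicate in (1, 2, 3)
]

single = expected_files(samples)
paired = expected_files(samples, paired_end=True)
print(len(single), "files for single-end data")
print(len(paired), "files for paired-end data")
```

Generating the expected list up front also gives a simple way to check, after a run, which outputs are missing.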

Computational Considerations - The Challenges

There is no single best method for RNA-Seq data analysis. It depends on your definition of best, and even then it varies over time and with the particular goals and specifics of a given experiment.

It’s for this reason that you should learn enough about the process to make “sensible choices” and to know when the results are reasonable and correct.  

Treating an RNA-Seq (or any NGS) analysis as a black box is a “recipe for disaster” (or at least bad science). That’s not to say that you need to know the particulars of every algorithm involved in a workflow, but you should know the steps involved and what assumptions and/or limitations are built into the whole workflow.

Computational Prerequisites

These are considered appropriate if you are planning on doing all the data analysis yourself.

High-performance Linux computer (multi-core, high memory, and plenty of storage)
Familiarity with the “command line” and at least one programming/scripting language
Basic knowledge of how to install software
Basic knowledge of R and/or statistical programming
Basic knowledge of statistics and model building