23. RNA Seq Example copy

This page contains content taken directly from the Biostar Handbook (Istvan Albert).

Always remember to activate the class bioinformatics environment.

conda activate bioinfo

For this data analysis, we will be using:

Two commercially available RNA samples.

Universal Human Reference (UHR) is total RNA isolated from a diverse set of 10 cancer cell lines. Human Brain Reference (HBR) is total RNA isolated from the brains of 23 Caucasians, male and female, of varying age but mostly 60-80 years old. The data was produced in three replicates for each condition.

In addition to the biological samples the also contains two so called spike-in mixes: ERCC Mix 1 and ERCC Mix2. The spike-in consists of 92 transcripts that are present in known concentrations across a wide abundance range (from very few copies to many copies).

So to summarize, this data consists of the following:

UHR + ERCC Mix1, Replicate 1, UHR_1
UHR + ERCC Mix1, Replicate 2, UHR_2
UHR + ERCC Mix1, Replicate 3, UHR_3
HBR + ERCC Mix2, Replicate 1, HBR_1
HBR + ERCC Mix2, Replicate 2, HBR_2
HBR + ERCC Mix2, Replicate 3, HBR_3

Download and decompress the example data (all in one step):

mkdir rnaseq
cd rnaseq
curl -s http://data.biostarhandbook.com/rnaseq/projects/griffith/griffith-data.tar.gz | tar zxv

Directories: * "reads" contains the sequencing reads * "refs" contains genome and annotation information

using the "-1" or "minus one" option with "ls" displays one entry per line.

ls -1 reads

gets this,

HBR_1_R1.fq
HBR_1_R2.fq
HBR_2_R1.fq
HBR_2_R2.fq
HBR_3_R1.fq
HBR_3_R2.fq
UHR_1_R1.fq
UHR_1_R2.fq
UHR_2_R1.fq
UHR_2_R2.fq
UHR_3_R1.fq
UHR_3_R2.fq

and

ls -1 refs

shows this.

22.fa
22.gtf
ERCC92.fa
ERCC92.gtf

where "22.fa" is chr22 in fasta format, "22.gtf" is the genome annotation file for chr22, "ERCC92.fa" is the sequence information for the ERCC (spike-in) mixture, and "ERCC92.gtf" is the genome annotation file for the ERCC mixture.

References: Informatics for RNA-seq: A web resource for analysis on the cloud. 11(8):e1004393. PLoS Computational Biology (2015) by Malachi Griffith, Jason R. Walker, Nicholas C. Spies, Benjamin J. Ainscough, Obi L. Griffith. An alternative tutorial is available online at https://github.com/griffithlab/rnaseq_tutorial/wiki, this has been updated at rnabio.org

Alignment based RNA-Seq of control samples

We will analyze these data by doing: 1. Alignment (hisat2) 2. Quantification (featureCounts) 3. Differential expression (DESeq)

Use of a spike-in control

Using a spike-in allows us to determine how well we can measure and reproduce data with known properties. We are using ERCC ExFold RNA Spike-In Control Mix

Download and view the expected differential expressions.

curl -O http://data.biostarhandbook.com/rnaseq/ERCC/ERCC-datasheet.csv

You can open this csv (comma-separated values) file with MS Excel. Screen Shot 2020-08-27 at 2 26 45 PM

1. Alignment - Align an RNA-seq sample

Let's align an RNA-Seq sample using the "splice aware" aligner hisat2. First we will need to create the indices. Use this format: hisat2-build REFERENCE_GENOME INDEX_PREFIX

Like this:

hisat2-build refs/ERCC92.fa refs/ERCC92.fa

where "refs/ERCC92.fa" is the reference genome and also the prefix we will use for the index that is created. Check to see all the index files created.

ls
22.fa           ERCC92.fa.1.ht2 ERCC92.fa.4.ht2 ERCC92.fa.7.ht2
22.gtf          ERCC92.fa.2.ht2 ERCC92.fa.5.ht2 ERCC92.fa.8.ht2
ERCC92.fa       ERCC92.fa.3.ht2 ERCC92.fa.6.ht2 ERCC92.gtf

Now we will run the splice aligner and create BAM files.

mkdir -p bam
hisat2 refs/ERCC92.fa -1 reads/UHR_1_R1.fq -2 reads/UHR_1_R2.fq | samtools sort > bam/UHR_1.bam 
227392 reads; of these:
  227392 (100.00%) were paired; of these:
    109242 (48.04%) aligned concordantly 0 times
    118150 (51.96%) aligned concordantly exactly 1 time
    0 (0.00%) aligned concordantly >1 times
    ----
    109242 pairs aligned concordantly 0 times; of these:
      174 (0.16%) aligned discordantly 1 time
    ----
    109068 pairs aligned 0 times concordantly or discordantly; of these:
      218136 mates make up the pairs; of these:
        218124 (99.99%) aligned 0 times
        12 (0.01%) aligned exactly 1 time
        0 (0.00%) aligned >1 times
52.04% overall alignment rate

Instead of doing each sample by itself, we can automate the process using the "parallel" command.

Use the "nano" editor and create a file (names.txt) of all the sample names like this.

UHR_1
UHR_2
UHR_3
HBR_1
HBR_2
HBR_3

and run all the samples in parallel

cat names.txt | parallel "hisat2 refs/ERCC92.fa -1 reads/{}_R1.fq -2 reads/{}_R2.fq | samtools sort > bam/{}.bam"

Next, we will estimate the abundance for a single sample. "featureCounts" is a program that counts reads to genomic features such as genes, exons, promoters and genomic bins.

-a (annotations file)

-o (output file)

-g (specify attribute type, gene_id by default, here we change it to gene_name)

where "bam/HBR_1.bam" is an input file, a list of BAM format files.

featureCounts -a refs/ERCC92.gtf -g gene_name -o single_counts.txt bam/HBR_1.bam

Let's look at the results. The first 6 columns contain feature specific information. The rest of the columns hold the read counts that overlap with that feature.

less single_counts.txt
# Program:featureCounts v2.0.0; Command:"featureCounts" "-a" "refs/ERCC92.gtf" "-g" "gene_name" "-o" "single_counts.txt" "bam/HBR_1.bam" 
Geneid  Chr     Start   End     Strand  Length  bam/HBR_1.bam
ERCC-00002      ERCC-00002      1       1061    +       1061    37892
ERCC-00003      ERCC-00003      1       1023    +       1023    2904
ERCC-00004      ERCC-00004      1       523     +       523     910
ERCC-00009      ERCC-00009      1       984     +       984     638
ERCC-00012      ERCC-00012      1       994     +       994     0
ERCC-00013      ERCC-00013      1       808     +       808     0
ERCC-00014      ERCC-00014      1       1957    +       1957    26
ERCC-00016      ERCC-00016      1       844     +       844     0
ERCC-00017      ERCC-00017      1       1136    +       1136    0
ERCC-00019      ERCC-00019      1       644     +       644     6
ERCC-00022      ERCC-00022      1       751     +       751     298
ERCC-00024      ERCC-00024      1       536     +       536     0
ERCC-00025      ERCC-00025      1       1994    +       1994    160
ERCC-00028      ERCC-00028      1       1130    +       1130    4
ERCC-00031      ERCC-00031      1       1138    +       1138    0
ERCC-00033      ERCC-00033      1       2022    +       2022    0
ERCC-00034      ERCC-00034      1       1019    +       1019    16
ERCC-00035      ERCC-00035      1       1130    +       1130    94

To find the sequences with most hits, we can sort by column 7.

cat single_counts.txt | sort -rn -k 7 | head

where...

-r is to sort in reverse order (highest to lowest)
-n is a numeric sort
-k specifies which field (7th column)

ERCC-00002  ERCC-00002  1   1061    +   1061    37892
ERCC-00074  ERCC-00074  1   522 +   522 20982
ERCC-00096  ERCC-00096  1   1107    +   1107    20708
ERCC-00130  ERCC-00130  1   1059    +   1059    8088
ERCC-00113  ERCC-00113  1   840 +   840 7096
ERCC-00046  ERCC-00046  1   522 +   522 3154
ERCC-00003  ERCC-00003  1   1023    +   1023    2904
ERCC-00171  ERCC-00171  1   505 +   505 1836
ERCC-00043  ERCC-00043  1   1023    +   1023    1714
ERCC-00111  ERCC-00111  1   994 +   994 1046

2. Quantification - Estimate abundance for all samples

By using the metacharacter asterisk "*" we can run feature counts on all the HBR and UHR samples in one command line.

featureCounts -a refs/ERCC92.gtf -g gene_name -o counts.txt bam/HBR*.bam bam/UHR*.bam

and check the results file. Now we can see the read counts for each of the samples.

less counts.txt
# Program:featureCounts v2.0.0; Command:"featureCounts" "-a" "refs/ERCC92.gtf" "-g" "gene_name" "-o" "counts.txt" "bam
/HBR_1.bam" "bam/HBR_2.bam" "bam/HBR_3.bam" "bam/UHR_1.bam" "bam/UHR_2.bam" "bam/UHR_3.bam" 
Geneid  Chr     Start   End     Strand  Length  bam/HBR_1.bam   bam/HBR_2.bam   bam/HBR_3.bam   bam/UHR_1.bam   bam/UH
R_2.bam   bam/UHR_3.bam
ERCC-00002      ERCC-00002      1       1061    +       1061    37892   47258   42234   39986   25978   33998
ERCC-00003      ERCC-00003      1       1023    +       1023    2904    3170    3038    3488    2202    2680
ERCC-00004      ERCC-00004      1       523     +       523     910     1078    996     9200    6678    7396
ERCC-00009      ERCC-00009      1       984     +       984     638     778     708     1384    954     1108
ERCC-00012      ERCC-00012      1       994     +       994     0       0       0       2       0       0
ERCC-00013      ERCC-00013      1       808     +       808     0       0       0       4       4       0
ERCC-00014      ERCC-00014      1       1957    +       1957    26      20      8       20      4       16
ERCC-00016      ERCC-00016      1       844     +       844     0       0       0       0       0       0
ERCC-00017      ERCC-00017      1       1136    +       1136    0       0       0       0       0       2
ERCC-00019      ERCC-00019      1       644     +       644     6       10      4       26      22      24
ERCC-00022      ERCC-00022      1       751     +       751     298     328     326     402     198     306

To remove the intermediate columns and just have sample counts (keep columns 1, and 7-12) and output to "simple_counts.txt".

cat counts.txt | cut -f 1,7-12 > simple_counts.txt

looks like this:

Geneid  bam/HBR_1.bam   bam/HBR_2.bam   bam/HBR_3.bam   bam/UHR_1.bam   bam/UHR_2.bam   bam/UHR_3.bam
ERCC-00002      37892   47258   42234   39986   25978   33998
ERCC-00003      2904    3170    3038    3488    2202    2680
ERCC-00004      910     1078    996     9200    6678    7396
ERCC-00009      638     778     708     1384    954     1108

3. Differential Expression - Now we can find the differential expression

Retrieve R "helper" scripts developed for Biostars environment.

curl -O http://data.biostarhandbook.com/rnaseq/code/deseq1.r
curl -O http://data.biostarhandbook.com/rnaseq/code/deseq2.r
curl -O http://data.biostarhandbook.com/rnaseq/code/edger.r

Each of these scripts can be run at the command line. Additional information: * DESeq * DESeq2 * edgeR

These scripts are used like this.

cat mydata.txt | Rscript do-something.r > output.txt

where "do-something.r" is deseq1.r, deseq2.r or edgeR.r

like this:

cat simple_counts.txt | Rscript deseq1.r 3x3 > results_deseq1.txt
cat simple_counts.txt | Rscript deseq2.r 3x3 > results_deseq2.txt
cat simple_counts.txt | Rscript edger.r 3x3 > results_edger.txt

A "3x3" design is where the first 3 columns correspond to the first condition, and the second 3 columns correspond to the second condition.

The results will look something like this:

id             baseMean   baseMeanA     baseMeanB   foldChange  log2FoldChange    pval       padj
ERCC-00130      29681        10455        48907        4.67        2.22         1.16e-88    9.10e-87
ERCC-00108        808          264         1352        5.10        2.35         2.40e-62    9.39e-61
ERCC-00136       1898          615         3180        5.16        2.36         2.80e-58    7.30e-57

Typically you get a column (though the naming will depend on the tool you use) for each of the following:

id: Gene or transcript name that the differential expression is computed for,
baseMean: The average normalized value across all samples,
baseMeanA, baseMeanB: The average normalized gene expression for each condition,
foldChange: The ratio baseMeanB/baseMeanA,
log2FoldChange: log2 transform of foldChange. When we apply a 2-based logarithm the values become symmetrical around 0. A log2 fold change of 1 means a doubling of the expression level, a log2 fold change of -1 shows show a halving of the expression level.
pval: The probability that this effect is observed by chance,
padj: The adjusted probability that this effect is observed by chance.

You may use pval if you already selected your target gene before evaluating these results.

You have to use padj in all other cases as this adjusted value corrects for the so-called multiple testing error - it accounts for the many alternatives and their chances of influencing the results that we see.

In order to determine if our results are close to the expected outcomes, we need to compare the values we generated with the data in the spreadsheet of expected values (that we already downloaded).

We need to do some cutting and pasting to join these two data sets together for easier comparison. To use the "cut" command:

cut -f (field, or column number)

cut -d (sets field separator or delimiter, in this case a comma ",")

cat ERCC-datasheet.csv | grep ERCC-00 | cut -f 1,5,6 -d , | sort > table1

then use sed to remove the line breaks in table1.

sed 's/\n/ /' table1 > table1.txt

repeat to create table 2, and then paste

cat results_edger.txt | grep ERCC- | sort | cut -f 1,5,6 > table2
paste table1.txt table2 > compare.txt

What is sed?

The SED command in UNIX stands for stream editor, it is used to perform functions on files like searching, search and replace, insertion or deletion. In the example shown here, it is substituting a space for a line return "\n".

How about paste?

The "paste" command is used to join files horizontally, tab-delimited, to standard output.

You can open this table in MS Excel. It will look something like this. Left columns are expected data, right columns are observed.

Column 2 is expected fold change, column 5 is observed fold change. Columns 3 and 4 are log2FoldChange values.

ERCC-00130  4   2   ERCC-00130  4.677514604 2.225742158
ERCC-00131  4   2   ERCC-00131  3.816822907 1.93237225
ERCC-00134  4   2   ERCC-00134  0   #NAME?
ERCC-00136  4   2   ERCC-00136  5.165989328 2.369044663
ERCC-00137  0.5 -1  ERCC-00137  Inf Inf
ERCC-00138  1   0   ERCC-00138  0.83061159  -0.267754093
ERCC-00142  1   0   ERCC-00142  Inf Inf
ERCC-00143  0.67    -0.58   ERCC-00143  2.691503567 1.428412338
ERCC-00144  4   2   ERCC-00144  2.36278659  1.240489329
ERCC-00145  0.67    -0.58   ERCC-00145  0.811214009 -0.301845528
ERCC-00147  4   2   ERCC-00147  NA  NA

The results are summarized in an Excel table that you too can download ERCC-results.csv. You can open this file with MS Excel.

Screen Shot 2020-09-03 at 12 06 38 PM

In summary, for a fold change of 4 or higher, transcripts expressing over as much as 1000x coverage differences were reliably detected. Note how transcripts with 3000 copies were detected as reliably as transcripts with seven reads, though clearly, the accuracy of fold change detection varies.

Note how for data with insufficient measures we get infinite or NaN (Not a Number) values are indicating a division by zero or other numeric overflows.