23. RNA Seq Example copy
This page contains content taken directly from the Biostar Handbook (Istvan Albert).
Always remember to activate the class bioinformatics environment.
conda activate bioinfo
For this data analysis, we will be using:
Two commercially available RNA samples.
Universal Human Reference (UHR) is total RNA isolated from a diverse set of 10 cancer cell lines. Human Brain Reference (HBR) is total RNA isolated from the brains of 23 Caucasians, male and female, of varying age but mostly 60-80 years old. The data was produced in three replicates for each condition.
In addition to the biological samples the also contains two so called spike-in mixes: ERCC Mix 1 and ERCC Mix2. The spike-in consists of 92 transcripts that are present in known concentrations across a wide abundance range (from very few copies to many copies).
So to summarize, this data consists of the following:
- UHR + ERCC Mix1, Replicate 1, UHR_1
- UHR + ERCC Mix1, Replicate 2, UHR_2
- UHR + ERCC Mix1, Replicate 3, UHR_3
- HBR + ERCC Mix2, Replicate 1, HBR_1
- HBR + ERCC Mix2, Replicate 2, HBR_2
- HBR + ERCC Mix2, Replicate 3, HBR_3
Download and decompress the example data (all in one step):
mkdir rnaseq
cd rnaseq
curl -s http://data.biostarhandbook.com/rnaseq/projects/griffith/griffith-data.tar.gz | tar zxv
Directories: * "reads" contains the sequencing reads * "refs" contains genome and annotation information
using the "-1" or "minus one" option with "ls" displays one entry per line.
ls -1 reads
HBR_1_R1.fq
HBR_1_R2.fq
HBR_2_R1.fq
HBR_2_R2.fq
HBR_3_R1.fq
HBR_3_R2.fq
UHR_1_R1.fq
UHR_1_R2.fq
UHR_2_R1.fq
UHR_2_R2.fq
UHR_3_R1.fq
UHR_3_R2.fq
and
ls -1 refs
22.fa
22.gtf
ERCC92.fa
ERCC92.gtf
where "22.fa" is chr22 in fasta format, "22.gtf" is the genome annotation file for chr22, "ERCC92.fa" is the sequence information for the ERCC (spike-in) mixture, and "ERCC92.gtf" is the genome annotation file for the ERCC mixture.
References: Informatics for RNA-seq: A web resource for analysis on the cloud. 11(8):e1004393. PLoS Computational Biology (2015) by Malachi Griffith, Jason R. Walker, Nicholas C. Spies, Benjamin J. Ainscough, Obi L. Griffith. An alternative tutorial is available online at https://github.com/griffithlab/rnaseq_tutorial/wiki, this has been updated at rnabio.org
Alignment based RNA-Seq of control samples
We will analyze these data by doing: 1. Alignment (hisat2) 2. Quantification (featureCounts) 3. Differential expression (DESeq)
Use of a spike-in control
Using a spike-in allows us to determine how well we can measure and reproduce data with known properties. We are using ERCC ExFold RNA Spike-In Control Mix
Download and view the expected differential expressions.
curl -O http://data.biostarhandbook.com/rnaseq/ERCC/ERCC-datasheet.csv

1. Alignment - Align an RNA-seq sample
Let's align an RNA-Seq sample using the "splice aware" aligner hisat2. First we will need to create the indices. Use this format: hisat2-build REFERENCE_GENOME INDEX_PREFIX
Like this:
hisat2-build refs/ERCC92.fa refs/ERCC92.fa
ls
22.fa ERCC92.fa.1.ht2 ERCC92.fa.4.ht2 ERCC92.fa.7.ht2
22.gtf ERCC92.fa.2.ht2 ERCC92.fa.5.ht2 ERCC92.fa.8.ht2
ERCC92.fa ERCC92.fa.3.ht2 ERCC92.fa.6.ht2 ERCC92.gtf
mkdir -p bam
hisat2 refs/ERCC92.fa -1 reads/UHR_1_R1.fq -2 reads/UHR_1_R2.fq | samtools sort > bam/UHR_1.bam
227392 reads; of these:
227392 (100.00%) were paired; of these:
109242 (48.04%) aligned concordantly 0 times
118150 (51.96%) aligned concordantly exactly 1 time
0 (0.00%) aligned concordantly >1 times
----
109242 pairs aligned concordantly 0 times; of these:
174 (0.16%) aligned discordantly 1 time
----
109068 pairs aligned 0 times concordantly or discordantly; of these:
218136 mates make up the pairs; of these:
218124 (99.99%) aligned 0 times
12 (0.01%) aligned exactly 1 time
0 (0.00%) aligned >1 times
52.04% overall alignment rate
Instead of doing each sample by itself, we can automate the process using the "parallel" command.
Use the "nano" editor and create a file (names.txt) of all the sample names like this.
UHR_1
UHR_2
UHR_3
HBR_1
HBR_2
HBR_3
cat names.txt | parallel "hisat2 refs/ERCC92.fa -1 reads/{}_R1.fq -2 reads/{}_R2.fq | samtools sort > bam/{}.bam"
-a (annotations file)
-o (output file)
-g (specify attribute type, gene_id by default, here we change it to gene_name)
where "bam/HBR_1.bam" is an input file, a list of BAM format files.
featureCounts -a refs/ERCC92.gtf -g gene_name -o single_counts.txt bam/HBR_1.bam
less single_counts.txt
# Program:featureCounts v2.0.0; Command:"featureCounts" "-a" "refs/ERCC92.gtf" "-g" "gene_name" "-o" "single_counts.txt" "bam/HBR_1.bam"
Geneid Chr Start End Strand Length bam/HBR_1.bam
ERCC-00002 ERCC-00002 1 1061 + 1061 37892
ERCC-00003 ERCC-00003 1 1023 + 1023 2904
ERCC-00004 ERCC-00004 1 523 + 523 910
ERCC-00009 ERCC-00009 1 984 + 984 638
ERCC-00012 ERCC-00012 1 994 + 994 0
ERCC-00013 ERCC-00013 1 808 + 808 0
ERCC-00014 ERCC-00014 1 1957 + 1957 26
ERCC-00016 ERCC-00016 1 844 + 844 0
ERCC-00017 ERCC-00017 1 1136 + 1136 0
ERCC-00019 ERCC-00019 1 644 + 644 6
ERCC-00022 ERCC-00022 1 751 + 751 298
ERCC-00024 ERCC-00024 1 536 + 536 0
ERCC-00025 ERCC-00025 1 1994 + 1994 160
ERCC-00028 ERCC-00028 1 1130 + 1130 4
ERCC-00031 ERCC-00031 1 1138 + 1138 0
ERCC-00033 ERCC-00033 1 2022 + 2022 0
ERCC-00034 ERCC-00034 1 1019 + 1019 16
ERCC-00035 ERCC-00035 1 1130 + 1130 94
cat single_counts.txt | sort -rn -k 7 | head
- -r is to sort in reverse order (highest to lowest)
- -n is a numeric sort
- -k specifies which field (7th column)
ERCC-00002 ERCC-00002 1 1061 + 1061 37892
ERCC-00074 ERCC-00074 1 522 + 522 20982
ERCC-00096 ERCC-00096 1 1107 + 1107 20708
ERCC-00130 ERCC-00130 1 1059 + 1059 8088
ERCC-00113 ERCC-00113 1 840 + 840 7096
ERCC-00046 ERCC-00046 1 522 + 522 3154
ERCC-00003 ERCC-00003 1 1023 + 1023 2904
ERCC-00171 ERCC-00171 1 505 + 505 1836
ERCC-00043 ERCC-00043 1 1023 + 1023 1714
ERCC-00111 ERCC-00111 1 994 + 994 1046
2. Quantification - Estimate abundance for all samples
By using the metacharacter asterisk "*" we can run feature counts on all the HBR and UHR samples in one command line.
featureCounts -a refs/ERCC92.gtf -g gene_name -o counts.txt bam/HBR*.bam bam/UHR*.bam
less counts.txt
# Program:featureCounts v2.0.0; Command:"featureCounts" "-a" "refs/ERCC92.gtf" "-g" "gene_name" "-o" "counts.txt" "bam
/HBR_1.bam" "bam/HBR_2.bam" "bam/HBR_3.bam" "bam/UHR_1.bam" "bam/UHR_2.bam" "bam/UHR_3.bam"
Geneid Chr Start End Strand Length bam/HBR_1.bam bam/HBR_2.bam bam/HBR_3.bam bam/UHR_1.bam bam/UH
R_2.bam bam/UHR_3.bam
ERCC-00002 ERCC-00002 1 1061 + 1061 37892 47258 42234 39986 25978 33998
ERCC-00003 ERCC-00003 1 1023 + 1023 2904 3170 3038 3488 2202 2680
ERCC-00004 ERCC-00004 1 523 + 523 910 1078 996 9200 6678 7396
ERCC-00009 ERCC-00009 1 984 + 984 638 778 708 1384 954 1108
ERCC-00012 ERCC-00012 1 994 + 994 0 0 0 2 0 0
ERCC-00013 ERCC-00013 1 808 + 808 0 0 0 4 4 0
ERCC-00014 ERCC-00014 1 1957 + 1957 26 20 8 20 4 16
ERCC-00016 ERCC-00016 1 844 + 844 0 0 0 0 0 0
ERCC-00017 ERCC-00017 1 1136 + 1136 0 0 0 0 0 2
ERCC-00019 ERCC-00019 1 644 + 644 6 10 4 26 22 24
ERCC-00022 ERCC-00022 1 751 + 751 298 328 326 402 198 306
To remove the intermediate columns and just have sample counts (keep columns 1, and 7-12) and output to "simple_counts.txt".
cat counts.txt | cut -f 1,7-12 > simple_counts.txt
Geneid bam/HBR_1.bam bam/HBR_2.bam bam/HBR_3.bam bam/UHR_1.bam bam/UHR_2.bam bam/UHR_3.bam
ERCC-00002 37892 47258 42234 39986 25978 33998
ERCC-00003 2904 3170 3038 3488 2202 2680
ERCC-00004 910 1078 996 9200 6678 7396
ERCC-00009 638 778 708 1384 954 1108
3. Differential Expression - Now we can find the differential expression
Retrieve R "helper" scripts developed for Biostars environment.
curl -O http://data.biostarhandbook.com/rnaseq/code/deseq1.r
curl -O http://data.biostarhandbook.com/rnaseq/code/deseq2.r
curl -O http://data.biostarhandbook.com/rnaseq/code/edger.r
These scripts are used like this.
cat mydata.txt | Rscript do-something.r > output.txt
like this:
cat simple_counts.txt | Rscript deseq1.r 3x3 > results_deseq1.txt
cat simple_counts.txt | Rscript deseq2.r 3x3 > results_deseq2.txt
cat simple_counts.txt | Rscript edger.r 3x3 > results_edger.txt
A "3x3" design is where the first 3 columns correspond to the first condition, and the second 3 columns correspond to the second condition.
The results will look something like this:
id baseMean baseMeanA baseMeanB foldChange log2FoldChange pval padj
ERCC-00130 29681 10455 48907 4.67 2.22 1.16e-88 9.10e-87
ERCC-00108 808 264 1352 5.10 2.35 2.40e-62 9.39e-61
ERCC-00136 1898 615 3180 5.16 2.36 2.80e-58 7.30e-57
- id: Gene or transcript name that the differential expression is computed for,
- baseMean: The average normalized value across all samples,
- baseMeanA, baseMeanB: The average normalized gene expression for each condition,
- foldChange: The ratio baseMeanB/baseMeanA,
- log2FoldChange: log2 transform of foldChange. When we apply a 2-based logarithm the values become symmetrical around 0. A log2 fold change of 1 means a doubling of the expression level, a log2 fold change of -1 shows show a halving of the expression level.
- pval: The probability that this effect is observed by chance,
- padj: The adjusted probability that this effect is observed by chance.
You may use pval if you already selected your target gene before evaluating these results.
You have to use padj in all other cases as this adjusted value corrects for the so-called multiple testing error - it accounts for the many alternatives and their chances of influencing the results that we see.
In order to determine if our results are close to the expected outcomes, we need to compare the values we generated with the data in the spreadsheet of expected values (that we already downloaded).
We need to do some cutting and pasting to join these two data sets together for easier comparison. To use the "cut" command:
cut -f (field, or column number)
cut -d (sets field separator or delimiter, in this case a comma ",")
cat ERCC-datasheet.csv | grep ERCC-00 | cut -f 1,5,6 -d , | sort > table1
then use sed to remove the line breaks in table1.
sed 's/\n/ /' table1 > table1.txt
cat results_edger.txt | grep ERCC- | sort | cut -f 1,5,6 > table2
paste table1.txt table2 > compare.txt
What is sed?
The SED command in UNIX stands for stream editor, it is used to perform functions on files like searching, search and replace, insertion or deletion. In the example shown here, it is substituting a space for a line return "\n".
How about paste?
The "paste" command is used to join files horizontally, tab-delimited, to standard output.
You can open this table in MS Excel. It will look something like this. Left columns are expected data, right columns are observed.
Column 2 is expected fold change, column 5 is observed fold change. Columns 3 and 4 are log2FoldChange values.
ERCC-00130 4 2 ERCC-00130 4.677514604 2.225742158
ERCC-00131 4 2 ERCC-00131 3.816822907 1.93237225
ERCC-00134 4 2 ERCC-00134 0 #NAME?
ERCC-00136 4 2 ERCC-00136 5.165989328 2.369044663
ERCC-00137 0.5 -1 ERCC-00137 Inf Inf
ERCC-00138 1 0 ERCC-00138 0.83061159 -0.267754093
ERCC-00142 1 0 ERCC-00142 Inf Inf
ERCC-00143 0.67 -0.58 ERCC-00143 2.691503567 1.428412338
ERCC-00144 4 2 ERCC-00144 2.36278659 1.240489329
ERCC-00145 0.67 -0.58 ERCC-00145 0.811214009 -0.301845528
ERCC-00147 4 2 ERCC-00147 NA NA
In summary, for a fold change of 4 or higher, transcripts expressing over as much as 1000x coverage differences were reliably detected. Note how transcripts with 3000 copies were detected as reliably as transcripts with seven reads, though clearly, the accuracy of fold change detection varies.
Note how for data with insufficient measures we get infinite or NaN (Not a Number) values are indicating a division by zero or other numeric overflows.