B4b 2022 rnaseq jw
This page uses content directly from the Biostar Handbook by Istvan Albert.
Obtain RNA-seq test data.
The test data consists of two commercially available RNA samples: Universal Human Reference (UHR) and Human Brain Reference (HBR). The UHR is total RNA isolated from a diverse set of 10 cancer cell lines. The HBR is total RNA isolated from the brains of 23 Caucasians, male and female, of varying age but mostly 60-80 years old.
In addition, a spike-in control was used. Specifically we added an aliquot of the ERCC ExFold RNA Spike-In Control Mixes to each sample. The spike-in consists of 92 transcripts that are present in known concentrations across a wide abundance range (from very few copies to many copies). This range allows us to test the degree to which the RNA-seq assay (including all laboratory and analysis steps) accurately reflects the relative abundance of transcript species within a sample. There are two 'mixes' of these transcripts to allow an assessment of differential expression output between samples if you put one mix in each of your two comparisons. In our case, Mix1 was added to the UHR sample, and Mix2 was added to the HBR sample. We also have 3 complete experimental replicates for each sample. This allows us to assess the technical variability of our overall process of producing RNA-seq data in the lab.
For all libraries we prepared low-throughput (Set A) TruSeq Stranded Total RNA Sample Prep Kit libraries with Ribo-Zero Gold to remove both cytoplasmic and mitochondrial rRNA. Triplicate, indexed libraries were made starting with 100ng Agilent/Strategene Universal Human Reference total RNA and 100ng Ambion Human Brain Reference total RNA. The Universal Human Reference replicates received 2 ul of 1:1000 ERCC Mix 1. The Human Brain Reference replicates received 1:1000 ERCC Mix 2. The libraries were quantified with KAPA Library Quantification qPCR and adjusted to the appropriate concentration for sequencing. The triplicate, indexed libraries were then pooled prior to sequencing. Each pool of three replicate libraries were sequenced across 2 lanes of a HiSeq 2000 using paired-end sequence chemistry with 100bp read lengths.
So to summarize we have: - UHR + ERCC Spike-In Mix1, Replicate 1 - UHR + ERCC Spike-In Mix1, Replicate 2 - UHR + ERCC Spike-In Mix1, Replicate 3 - HBR + ERCC Spike-In Mix2, Replicate 1 - HBR + ERCC Spike-In Mix2, Replicate 2 - HBR + ERCC Spike-In Mix2, Replicate 3
Each data set has a corresponding pair of FastQ files (read 1 and read 2 of paired end reads). The reads are paired-end 101-mers generated on an Illumina HiSeq instrument. The test data has been pre-filtered for reads that appear to map to chromosome 22.This page contains content taken directly from the Biostar Handbook (Istvan Albert).
Always remember to activate the class bioinformatics environment.
conda activate bioinfo
For this data analysis, we will be using:
Two commercially available RNA samples.
Universal Human Reference (UHR) is total RNA isolated from a diverse set of 10 cancer cell lines. Human Brain Reference (HBR) is total RNA isolated from the brains of 23 Caucasians, male and female, of varying age but mostly 60-80 years old. The data was produced in three replicates for each condition.
In addition to the biological samples the also contains two so called spike-in mixes: ERCC Mix 1 and ERCC Mix2. The spike-in consists of 92 transcripts that are present in known concentrations across a wide abundance range (from very few copies to many copies).
So to summarize, this data consists of the following:
- UHR + ERCC Mix1, Replicate 1, UHR_1
- UHR + ERCC Mix1, Replicate 2, UHR_2
- UHR + ERCC Mix1, Replicate 3, UHR_3
- HBR + ERCC Mix2, Replicate 1, HBR_1
- HBR + ERCC Mix2, Replicate 2, HBR_2
- HBR + ERCC Mix2, Replicate 3, HBR_3
Download and decompress the example data (all in one step):
mkdir rnaseq
cd rnaseq
curl -s http://data.biostarhandbook.com/rnaseq/projects/griffith/griffith-data.tar.gz | tar zxv
Directories: * "reads" contains the sequencing reads * "refs" contains genome and annotation information
using the "-1" or "minus one" option with "ls" displays one entry per line.
ls -1 reads
HBR_1_R1.fq
HBR_1_R2.fq
HBR_2_R1.fq
HBR_2_R2.fq
HBR_3_R1.fq
HBR_3_R2.fq
UHR_1_R1.fq
UHR_1_R2.fq
UHR_2_R1.fq
UHR_2_R2.fq
UHR_3_R1.fq
UHR_3_R2.fq
and
ls -1 refs
22.fa
22.gtf
ERCC92.fa
ERCC92.gtf
where "22.fa" is chr22 in fasta format, "22.gtf" is the genome annotation file for chr22, "ERCC92.fa" is the sequence information for the ERCC (spike-in) mixture, and "ERCC92.gtf" is the genome annotation file for the ERCC mixture.
References: Informatics for RNA-seq: A web resource for analysis on the cloud. 11(8):e1004393. PLoS Computational Biology (2015) by Malachi Griffith, Jason R. Walker, Nicholas C. Spies, Benjamin J. Ainscough, Obi L. Griffith. An alternative tutorial is available online at https://github.com/griffithlab/rnaseq_tutorial/wiki, this has been updated at rnabio.org
Alignment based RNA-Seq of control samples
We will analyze these data by doing: 1. Alignment (hisat2) 2. Quantification (featureCounts) 3. Differential expression (DESeq)
Use of a spike-in control
Using a spike-in allows us to determine how well we can measure and reproduce data with known properties. We are using ERCC ExFold RNA Spike-In Control Mix
Download and view the expected differential expressions.
curl -O http://data.biostarhandbook.com/rnaseq/ERCC/ERCC-datasheet.csv

1. Alignment - Align an RNA-seq sample
Let's align an RNA-Seq sample using the "splice aware" aligner hisat2. First we will need to create the indices. Use this format: hisat2-build REFERENCE_GENOME INDEX_PREFIX
Like this:
hisat2-build refs/ERCC92.fa refs/ERCC92.fa
ls
22.fa ERCC92.fa.1.ht2 ERCC92.fa.4.ht2 ERCC92.fa.7.ht2
22.gtf ERCC92.fa.2.ht2 ERCC92.fa.5.ht2 ERCC92.fa.8.ht2
ERCC92.fa ERCC92.fa.3.ht2 ERCC92.fa.6.ht2 ERCC92.gtf
mkdir -p bam
hisat2 refs/ERCC92.fa -1 reads/UHR_1_R1.fq -2 reads/UHR_1_R2.fq | samtools sort > bam/UHR_1.bam
227392 reads; of these:
227392 (100.00%) were paired; of these:
109242 (48.04%) aligned concordantly 0 times
118150 (51.96%) aligned concordantly exactly 1 time
0 (0.00%) aligned concordantly >1 times
----
109242 pairs aligned concordantly 0 times; of these:
174 (0.16%) aligned discordantly 1 time
----
109068 pairs aligned 0 times concordantly or discordantly; of these:
218136 mates make up the pairs; of these:
218124 (99.99%) aligned 0 times
12 (0.01%) aligned exactly 1 time
0 (0.00%) aligned >1 times
52.04% overall alignment rate
Instead of doing each sample by itself, we can automate the process using the "parallel" command.
Use the "nano" editor and create a file (names.txt) of all the sample names like this.
UHR_1
UHR_2
UHR_3
HBR_1
HBR_2
HBR_3
cat names.txt | parallel "hisat2 refs/ERCC92.fa -1 reads/{}_R1.fq -2 reads/{}_R2.fq | samtools sort > bam/{}.bam"
-a (annotations file)
-o (output file)
-g (specify attribute type, gene_id by default, here we change it to gene_name)
where "bam/HBR_1.bam" is an input file, a list of BAM format files.
featureCounts -a refs/ERCC92.gtf -g gene_name -o single_counts.txt bam/HBR_1.bam
less single_counts.txt
# Program:featureCounts v2.0.0; Command:"featureCounts" "-a" "refs/ERCC92.gtf" "-g" "gene_name" "-o" "single_counts.txt" "bam/HBR_1.bam"
Geneid Chr Start End Strand Length bam/HBR_1.bam
ERCC-00002 ERCC-00002 1 1061 + 1061 37892
ERCC-00003 ERCC-00003 1 1023 + 1023 2904
ERCC-00004 ERCC-00004 1 523 + 523 910
ERCC-00009 ERCC-00009 1 984 + 984 638
ERCC-00012 ERCC-00012 1 994 + 994 0
ERCC-00013 ERCC-00013 1 808 + 808 0
ERCC-00014 ERCC-00014 1 1957 + 1957 26
ERCC-00016 ERCC-00016 1 844 + 844 0
ERCC-00017 ERCC-00017 1 1136 + 1136 0
ERCC-00019 ERCC-00019 1 644 + 644 6
ERCC-00022 ERCC-00022 1 751 + 751 298
ERCC-00024 ERCC-00024 1 536 + 536 0
ERCC-00025 ERCC-00025 1 1994 + 1994 160
ERCC-00028 ERCC-00028 1 1130 + 1130 4
ERCC-00031 ERCC-00031 1 1138 + 1138 0
ERCC-00033 ERCC-00033 1 2022 + 2022 0
ERCC-00034 ERCC-00034 1 1019 + 1019 16
ERCC-00035 ERCC-00035 1 1130 + 1130 94
cat single_counts.txt | sort -rn -k 7 | head
- -r is to sort in reverse order (highest to lowest)
- -n is a numeric sort
- -k specifies which field (7th column)
ERCC-00002 ERCC-00002 1 1061 + 1061 37892
ERCC-00074 ERCC-00074 1 522 + 522 20982
ERCC-00096 ERCC-00096 1 1107 + 1107 20708
ERCC-00130 ERCC-00130 1 1059 + 1059 8088
ERCC-00113 ERCC-00113 1 840 + 840 7096
ERCC-00046 ERCC-00046 1 522 + 522 3154
ERCC-00003 ERCC-00003 1 1023 + 1023 2904
ERCC-00171 ERCC-00171 1 505 + 505 1836
ERCC-00043 ERCC-00043 1 1023 + 1023 1714
ERCC-00111 ERCC-00111 1 994 + 994 1046
2. Quantification - Estimate abundance for all samples
By using the metacharacter asterisk "*" we can run feature counts on all the HBR and UHR samples in one command line.
featureCounts -a refs/ERCC92.gtf -g gene_name -o counts.txt bam/HBR*.bam bam/UHR*.bam
less counts.txt
# Program:featureCounts v2.0.0; Command:"featureCounts" "-a" "refs/ERCC92.gtf" "-g" "gene_name" "-o" "counts.txt" "bam
/HBR_1.bam" "bam/HBR_2.bam" "bam/HBR_3.bam" "bam/UHR_1.bam" "bam/UHR_2.bam" "bam/UHR_3.bam"
Geneid Chr Start End Strand Length bam/HBR_1.bam bam/HBR_2.bam bam/HBR_3.bam bam/UHR_1.bam bam/UH
R_2.bam bam/UHR_3.bam
ERCC-00002 ERCC-00002 1 1061 + 1061 37892 47258 42234 39986 25978 33998
ERCC-00003 ERCC-00003 1 1023 + 1023 2904 3170 3038 3488 2202 2680
ERCC-00004 ERCC-00004 1 523 + 523 910 1078 996 9200 6678 7396
ERCC-00009 ERCC-00009 1 984 + 984 638 778 708 1384 954 1108
ERCC-00012 ERCC-00012 1 994 + 994 0 0 0 2 0 0
ERCC-00013 ERCC-00013 1 808 + 808 0 0 0 4 4 0
ERCC-00014 ERCC-00014 1 1957 + 1957 26 20 8 20 4 16
ERCC-00016 ERCC-00016 1 844 + 844 0 0 0 0 0 0
ERCC-00017 ERCC-00017 1 1136 + 1136 0 0 0 0 0 2
ERCC-00019 ERCC-00019 1 644 + 644 6 10 4 26 22 24
ERCC-00022 ERCC-00022 1 751 + 751 298 328 326 402 198 306
To remove the intermediate columns and just have sample counts (keep columns 1, and 7-12) and output to "simple_counts.txt".
cat counts.txt | cut -f 1,7-12 > simple_counts.txt
Geneid bam/HBR_1.bam bam/HBR_2.bam bam/HBR_3.bam bam/UHR_1.bam bam/UHR_2.bam bam/UHR_3.bam
ERCC-00002 37892 47258 42234 39986 25978 33998
ERCC-00003 2904 3170 3038 3488 2202 2680
ERCC-00004 910 1078 996 9200 6678 7396
ERCC-00009 638 778 708 1384 954 1108
3. Differential Expression - Now we can find the differential expression
Retrieve R "helper" scripts developed for Biostars environment.
curl -O http://data.biostarhandbook.com/rnaseq/code/deseq1.r
curl -O http://data.biostarhandbook.com/rnaseq/code/deseq2.r
curl -O http://data.biostarhandbook.com/rnaseq/code/edger.r
These scripts are used like this.
cat mydata.txt | Rscript do-something.r > output.txt
like this:
cat simple_counts.txt | Rscript deseq1.r 3x3 > results_deseq1.txt
cat simple_counts.txt | Rscript deseq2.r 3x3 > results_deseq2.txt
cat simple_counts.txt | Rscript edger.r 3x3 > results_edger.txt
A "3x3" design is where the first 3 columns correspond to the first condition, and the second 3 columns correspond to the second condition.
The results will look something like this:
id baseMean baseMeanA baseMeanB foldChange log2FoldChange pval padj
ERCC-00130 29681 10455 48907 4.67 2.22 1.16e-88 9.10e-87
ERCC-00108 808 264 1352 5.10 2.35 2.40e-62 9.39e-61
ERCC-00136 1898 615 3180 5.16 2.36 2.80e-58 7.30e-57
- id: Gene or transcript name that the differential expression is computed for,
- baseMean: The average normalized value across all samples,
- baseMeanA, baseMeanB: The average normalized gene expression for each condition,
- foldChange: The ratio baseMeanB/baseMeanA,
- log2FoldChange: log2 transform of foldChange. When we apply a 2-based logarithm the values become symmetrical around 0. A log2 fold change of 1 means a doubling of the expression level, a log2 fold change of -1 shows show a halving of the expression level.
- pval: The probability that this effect is observed by chance,
- padj: The adjusted probability that this effect is observed by chance.
You may use pval if you already selected your target gene before evaluating these results.
You have to use padj in all other cases as this adjusted value corrects for the so-called multiple testing error - it accounts for the many alternatives and their chances of influencing the results that we see.
In order to determine if our results are close to the expected outcomes, we need to compare the values we generated with the data in the spreadsheet of expected values (that we already downloaded).
We need to do some cutting and pasting to join these two data sets together for easier comparison. To use the "cut" command:
cut -f (field, or column number)
cut -d (sets field separator or delimiter, in this case a comma ",")
cat ERCC-datasheet.csv | grep ERCC-00 | cut -f 1,5,6 -d , | sort > table1
then use sed to remove the line breaks in table1.
sed 's/\n/ /' table1 > table1.txt
cat results_edger.txt | grep ERCC- | sort | cut -f 1,5,6 > table2
paste table1.txt table2 > compare.txt
What is sed?
The SED command in UNIX stands for stream editor, it is used to perform functions on files like searching, search and replace, insertion or deletion. In the example shown here, it is substituting a space for a line return "\n".
How about paste?
The "paste" command is used to join files horizontally, tab-delimited, to standard output.
You can open this table in MS Excel. It will look something like this. Left columns are expected data, right columns are observed.
Column 2 is expected fold change, column 5 is observed fold change. Columns 3 and 4 are log2FoldChange values.
ERCC-00130 4 2 ERCC-00130 4.677514604 2.225742158
ERCC-00131 4 2 ERCC-00131 3.816822907 1.93237225
ERCC-00134 4 2 ERCC-00134 0 #NAME?
ERCC-00136 4 2 ERCC-00136 5.165989328 2.369044663
ERCC-00137 0.5 -1 ERCC-00137 Inf Inf
ERCC-00138 1 0 ERCC-00138 0.83061159 -0.267754093
ERCC-00142 1 0 ERCC-00142 Inf Inf
ERCC-00143 0.67 -0.58 ERCC-00143 2.691503567 1.428412338
ERCC-00144 4 2 ERCC-00144 2.36278659 1.240489329
ERCC-00145 0.67 -0.58 ERCC-00145 0.811214009 -0.301845528
ERCC-00147 4 2 ERCC-00147 NA NA
In summary, for a fold change of 4 or higher, transcripts expressing over as much as 1000x coverage differences were reliably detected. Note how transcripts with 3000 copies were detected as reliably as transcripts with seven reads, though clearly, the accuracy of fold change detection varies.
Note how for data with insufficient measures we get infinite or NaN (Not a Number) values are indicating a division by zero or other numeric overflows.This page contains content taken directly from the Biostar Handbook by Istvan Albert.
Always remember to start the bioinformatics environment.
conda activate bioinfo
We will be analyzing differential expression of genes on Chr22 from the UHR/HBR dataset that we've worked with before.
Steps: 1. Alignment 2. Quantification 3. Differential Expression 4. Gene Set Enrichment Analysis (GO)
Remember our Chr22 data is in:
rnaseq/reads
rnaseq/refs
1. Let's do the alignment
First we need to create the indices for Chr22.
hisat2-build refs/22.fa refs/22.fa
hisat2-build reference_genome prefix_for_indices
Now we will run the splice aligner and create BAM files. We will align all reads at once using the "parallel" command.
cat names.txt | parallel "hisat2 refs/22.fa -1 reads/{}_R1.fq -2 reads/{}_R2.fq | samtools sort > bam/{}.bam"
parallel samtools index ::: *.bam

2. Next we will estimate the abundance for all samples.
featureCounts -a refs/22.gtf -g gene_name -o 22counts.txt bam/HBR*.bam bam/UHR*.bam
less 22counts.txt
cat 22counts.txt | cut -f 1,7-12 > 22simple_counts.txt
less 22simple_counts.txt
3. Differential Expression
We will use the R helper scripts that we used before.
cat 22simple_counts.txt | Rscript deseq2.r 3x3 > 22results_deseq2.txt
cat 22results_deseq2.txt | awk ' $8 < 0.5 { print $0 } ' > diffgenes2.txt
cat diffgenes2.txt | wc -l
What is AWK? (source https://www.geeksforgeeks.org/awk-command-unixlinux-examples/)
A scripting language that can be used for manipulating data and generating reports.
Awk is a utility that enables a programmer to write tiny but effective programs in the form of statements that define text patterns that are to be searched for in each line of a document and the action that is to be taken when a match is found within a line. Awk is mostly used for pattern scanning and processing. It searches one or more files to see if they contain lines that matches with the specified patterns and then performs the associated actions.
awk ' $8 < 0.5 { print $0 } '
$0 represents the whole line
0.5 means selecting genes that have changed at 0.5 level
We can also open this file in MS Excel.
4. Gene Set Enrichment Analysis
“Gene set enrichment analysis” refers to the process of discovering the common characteristics potentially present in a list of genes. When these characteristics are GO terms, the process is called “functional enrichment.”
Warning Overall GO enrichment is surprisingly subjective and different methods will produce different results.
Gene set enrichment analysis with the GOrilla web site
Open results.txt (22results_deseq2.txt) in MS Excel.
Order the gene symbols from lowest to highest p-value (adjusted).
Copy just the gene symbols and paste into GOrilla web site.
GOrilla results:
Choose to visualize the output in REViGO. If the website does not produce an image (Scatterplot & Table) you can "Make R script for plotting" and produce a pdf.
Other choices for GSEA * DAVID * ermineJ * ToppFun
This page contains content directly from The Biostar Handbook.
Always remember to start the bioinformatics environment.
conda activate bioinfo
Kallisto and Salmon are software packages that can quantify transcript abundances. We must know the transcriptome of the organism to perform classification based RNA-Seq.
Classification-based alignment with kallisto
First we need to install kallisto into our bioinformatics environment.
conda install kallisto
To build an index for the ERCC control sequences...
kallisto index -i refs/ERCC92.idx refs/ERCC92.fa
[build] loading fasta file refs/ERCC92.fa
[build] k-mer length: 31
[build] warning: clipped off poly-A tail (longer than 10)
from 91 target sequences
[build] warning: replaced 1 non-ACGUT characters in the input sequence
with pseudorandom nucleotides
[build] counting k-mers ... done.
[build] building target de Bruijn graph ... done
[build] creating equivalence classes ... done
[build] target de Bruijn graph has 92 contigs and contains 77828 k-mers
kallisto quant -i refs/ERCC92.idx -o results reads/HBR_1_R1.fq reads/HBR_1_R2.fq
[quant] fragment length distribution will be estimated from the data
[index] k-mer length: 31
[index] number of targets: 92
[index] number of k-mers: 77,828
[index] number of equivalence classes: 92
[quant] running in paired-end mode
[quant] will process pair 1: reads/HBR_1_R1.fq
reads/HBR_1_R2.fq
[quant] finding pseudoalignments for the reads ... done
[quant] processed 118,571 reads, 55,972 reads pseudoaligned
[quant] estimated average fragment length: 170.53
[ em] quantifying the abundances ... done
[ em] the Expectation-Maximization algorithm ran for 52 rounds
- abundance.tsv
- abundance.h5
- run_info.json
The abundance.tsv is a tab delimited file that, for each transcript lists the target, the length, the corrected effective length, the counts and the TPM value.
target_id length eff_length est_counts tpm
ERCC-00002 1061 891.47 18946 243187
ERCC-00003 1023 853.47 1452 19467.4
ERCC-00004 523 353.47 455 14729.5
ERCC-00009 984 814.47 319 4481.72
ERCC-00012 994 824.47 0 0
ERCC-00013 808 638.47 0 0
ERCC-00014 1957 1787.47 13 83.2212
ERCC-00016 844 674.47 0 0
ERCC-00017 1136 966.47 0 0
ERCC-00019 644 474.47 3 72.3505
ERCC-00022 751 581.47 149 2932.16
ERCC-00024 536 366.47 0 0
ERCC-00025 1994 1824.47 80 501.745
ERCC-00028 1130 960.47 2 23.8273
ERCC-00031 1138 968.47 0 0
ERCC-00033 2022 1852.47 0 0
ERCC-00034 1019 849.47 8 107.763
ERCC-00035 1130 960.47 47 559.943
ERCC-00039 740 570.47 1 20.0584
We can use the parallel command to quantify all samples at once.
# Make directory for BAM files
mkdir -p results
#Create root names for each file
parallel -j 1 echo {1}_{2} ::: UHR HBR ::: 1 2 3 > names.txt
#Run all samples in parallel
cat names.txt | parallel "kallisto quant -i refs/ERCC92.idx -o results/{} reads/{}_R1.fq reads/{}_R2.fq"
Copy the abundance files into directories named for each sample
cat names.txt | parallel cp results/{}/abundance.tsv results/{}.abundance.tsv
ls -1 results/*.tsv
results/HBR_1.abundance.tsv
results/HBR_2.abundance.tsv
results/HBR_3.abundance.tsv
results/UHR_1.abundance.tsv
results/UHR_2.abundance.tsv
results/UHR_3.abundance.tsv
paste results/H*.tsv results/U*.tsv | cut -f 1,4,9,14,19,24,29 > counts.txt
cat counts.txt | Rscript deseq1.r 3x3 > results.txt
The End.This page contains content taken directly from the Biostars Handbook by Istvan Albert.
Remember to activate the bioinformatics environment.
conda activate bioinfo
curl http://data.biostarhandbook.com/books/rnaseq/code/install-conda.txt | xargs conda install -y
chmod +x ~/bin/*.r
conda install -y libopenblas==0.3.7
To make these scripts universally accessible and simpler to use, we recommend saving them into the ~/bin folder and making them executable. You can achieve that from the command line with:
mkdir -p ~/bin
# Download the scripts into the ~/bin folder.
curl http://data.biostarhandbook.com/books/rnaseq/code/deseq1.r > ~/bin/deseq1.r
curl http://data.biostarhandbook.com/books/rnaseq/code/deseq2.r > ~/bin/deseq2.r
curl http://data.biostarhandbook.com/books/rnaseq/code/edger.r > ~/bin/edger.r
curl http://data.biostarhandbook.com/books/rnaseq/code/heatmap.r > ~/bin/heatmap.r
# Make them executable
chmod +x ~/bin/*.r
deseq1.r
Error: The experimental design must be specified as NxM
Execution halted
cat counts.txt | Rscript ~/bin/edger.r 3x3 > output.txt
cat norm-matrix-edgeR.txt | Rscript draw-heatmap.r > output.pdf
It should look something like this.
Double-click IGV icon on desktop.
Genomes -> Load Genome from Server (Human hg18)
File -> Load from Server -> Available Datasets -> Body Map 2.0 (Illumina HiSeq) -> Merged 50 bp and 75 bp -> Reads -> heart, liver