B4b 2022 rnaseq jw

This page uses content directly from the Biostar Handbook by Istvan Albert.

Obtain RNA-seq test data.

The test data consists of two commercially available RNA samples: Universal Human Reference (UHR) and Human Brain Reference (HBR). The UHR is total RNA isolated from a diverse set of 10 cancer cell lines. The HBR is total RNA isolated from the brains of 23 Caucasians, male and female, of varying age but mostly 60-80 years old.

In addition, a spike-in control was used. Specifically we added an aliquot of the ERCC ExFold RNA Spike-In Control Mixes to each sample. The spike-in consists of 92 transcripts that are present in known concentrations across a wide abundance range (from very few copies to many copies). This range allows us to test the degree to which the RNA-seq assay (including all laboratory and analysis steps) accurately reflects the relative abundance of transcript species within a sample. There are two 'mixes' of these transcripts to allow an assessment of differential expression output between samples if you put one mix in each of your two comparisons. In our case, Mix1 was added to the UHR sample, and Mix2 was added to the HBR sample. We also have 3 complete experimental replicates for each sample. This allows us to assess the technical variability of our overall process of producing RNA-seq data in the lab.

For all libraries we prepared low-throughput (Set A) TruSeq Stranded Total RNA Sample Prep Kit libraries with Ribo-Zero Gold to remove both cytoplasmic and mitochondrial rRNA. Triplicate, indexed libraries were made starting with 100ng Agilent/Strategene Universal Human Reference total RNA and 100ng Ambion Human Brain Reference total RNA. The Universal Human Reference replicates received 2 ul of 1:1000 ERCC Mix 1. The Human Brain Reference replicates received 1:1000 ERCC Mix 2. The libraries were quantified with KAPA Library Quantification qPCR and adjusted to the appropriate concentration for sequencing. The triplicate, indexed libraries were then pooled prior to sequencing. Each pool of three replicate libraries were sequenced across 2 lanes of a HiSeq 2000 using paired-end sequence chemistry with 100bp read lengths.

So to summarize we have: - UHR + ERCC Spike-In Mix1, Replicate 1 - UHR + ERCC Spike-In Mix1, Replicate 2 - UHR + ERCC Spike-In Mix1, Replicate 3 - HBR + ERCC Spike-In Mix2, Replicate 1 - HBR + ERCC Spike-In Mix2, Replicate 2 - HBR + ERCC Spike-In Mix2, Replicate 3

Each data set has a corresponding pair of FastQ files (read 1 and read 2 of paired end reads). The reads are paired-end 101-mers generated on an Illumina HiSeq instrument. The test data has been pre-filtered for reads that appear to map to chromosome 22.This page contains content taken directly from the Biostar Handbook (Istvan Albert).

Always remember to activate the class bioinformatics environment.

conda activate bioinfo

For this data analysis, we will be using:

Two commercially available RNA samples.

Universal Human Reference (UHR) is total RNA isolated from a diverse set of 10 cancer cell lines. Human Brain Reference (HBR) is total RNA isolated from the brains of 23 Caucasians, male and female, of varying age but mostly 60-80 years old. The data was produced in three replicates for each condition.

In addition to the biological samples the also contains two so called spike-in mixes: ERCC Mix 1 and ERCC Mix2. The spike-in consists of 92 transcripts that are present in known concentrations across a wide abundance range (from very few copies to many copies).

So to summarize, this data consists of the following:

UHR + ERCC Mix1, Replicate 1, UHR_1
UHR + ERCC Mix1, Replicate 2, UHR_2
UHR + ERCC Mix1, Replicate 3, UHR_3
HBR + ERCC Mix2, Replicate 1, HBR_1
HBR + ERCC Mix2, Replicate 2, HBR_2
HBR + ERCC Mix2, Replicate 3, HBR_3

Download and decompress the example data (all in one step):

mkdir rnaseq
cd rnaseq
curl -s http://data.biostarhandbook.com/rnaseq/projects/griffith/griffith-data.tar.gz | tar zxv

Directories: * "reads" contains the sequencing reads * "refs" contains genome and annotation information

using the "-1" or "minus one" option with "ls" displays one entry per line.

ls -1 reads

gets this,

HBR_1_R1.fq
HBR_1_R2.fq
HBR_2_R1.fq
HBR_2_R2.fq
HBR_3_R1.fq
HBR_3_R2.fq
UHR_1_R1.fq
UHR_1_R2.fq
UHR_2_R1.fq
UHR_2_R2.fq
UHR_3_R1.fq
UHR_3_R2.fq

and

ls -1 refs

shows this.

22.fa
22.gtf
ERCC92.fa
ERCC92.gtf

where "22.fa" is chr22 in fasta format, "22.gtf" is the genome annotation file for chr22, "ERCC92.fa" is the sequence information for the ERCC (spike-in) mixture, and "ERCC92.gtf" is the genome annotation file for the ERCC mixture.

References: Informatics for RNA-seq: A web resource for analysis on the cloud. 11(8):e1004393. PLoS Computational Biology (2015) by Malachi Griffith, Jason R. Walker, Nicholas C. Spies, Benjamin J. Ainscough, Obi L. Griffith. An alternative tutorial is available online at https://github.com/griffithlab/rnaseq_tutorial/wiki, this has been updated at rnabio.org

Alignment based RNA-Seq of control samples

We will analyze these data by doing: 1. Alignment (hisat2) 2. Quantification (featureCounts) 3. Differential expression (DESeq)

Use of a spike-in control

Using a spike-in allows us to determine how well we can measure and reproduce data with known properties. We are using ERCC ExFold RNA Spike-In Control Mix

Download and view the expected differential expressions.

curl -O http://data.biostarhandbook.com/rnaseq/ERCC/ERCC-datasheet.csv

You can open this csv (comma-separated values) file with MS Excel. Screen Shot 2020-08-27 at 2 26 45 PM

1. Alignment - Align an RNA-seq sample

Let's align an RNA-Seq sample using the "splice aware" aligner hisat2. First we will need to create the indices. Use this format: hisat2-build REFERENCE_GENOME INDEX_PREFIX

Like this:

hisat2-build refs/ERCC92.fa refs/ERCC92.fa

where "refs/ERCC92.fa" is the reference genome and also the prefix we will use for the index that is created. Check to see all the index files created.

ls
22.fa           ERCC92.fa.1.ht2 ERCC92.fa.4.ht2 ERCC92.fa.7.ht2
22.gtf          ERCC92.fa.2.ht2 ERCC92.fa.5.ht2 ERCC92.fa.8.ht2
ERCC92.fa       ERCC92.fa.3.ht2 ERCC92.fa.6.ht2 ERCC92.gtf

Now we will run the splice aligner and create BAM files.

mkdir -p bam
hisat2 refs/ERCC92.fa -1 reads/UHR_1_R1.fq -2 reads/UHR_1_R2.fq | samtools sort > bam/UHR_1.bam 
227392 reads; of these:
  227392 (100.00%) were paired; of these:
    109242 (48.04%) aligned concordantly 0 times
    118150 (51.96%) aligned concordantly exactly 1 time
    0 (0.00%) aligned concordantly >1 times
    ----
    109242 pairs aligned concordantly 0 times; of these:
      174 (0.16%) aligned discordantly 1 time
    ----
    109068 pairs aligned 0 times concordantly or discordantly; of these:
      218136 mates make up the pairs; of these:
        218124 (99.99%) aligned 0 times
        12 (0.01%) aligned exactly 1 time
        0 (0.00%) aligned >1 times
52.04% overall alignment rate

Instead of doing each sample by itself, we can automate the process using the "parallel" command.

Use the "nano" editor and create a file (names.txt) of all the sample names like this.

UHR_1
UHR_2
UHR_3
HBR_1
HBR_2
HBR_3

and run all the samples in parallel

cat names.txt | parallel "hisat2 refs/ERCC92.fa -1 reads/{}_R1.fq -2 reads/{}_R2.fq | samtools sort > bam/{}.bam"

Next, we will estimate the abundance for a single sample. "featureCounts" is a program that counts reads to genomic features such as genes, exons, promoters and genomic bins.

-a (annotations file)

-o (output file)

-g (specify attribute type, gene_id by default, here we change it to gene_name)

where "bam/HBR_1.bam" is an input file, a list of BAM format files.

featureCounts -a refs/ERCC92.gtf -g gene_name -o single_counts.txt bam/HBR_1.bam

Let's look at the results. The first 6 columns contain feature specific information. The rest of the columns hold the read counts that overlap with that feature.

less single_counts.txt
# Program:featureCounts v2.0.0; Command:"featureCounts" "-a" "refs/ERCC92.gtf" "-g" "gene_name" "-o" "single_counts.txt" "bam/HBR_1.bam" 
Geneid  Chr     Start   End     Strand  Length  bam/HBR_1.bam
ERCC-00002      ERCC-00002      1       1061    +       1061    37892
ERCC-00003      ERCC-00003      1       1023    +       1023    2904
ERCC-00004      ERCC-00004      1       523     +       523     910
ERCC-00009      ERCC-00009      1       984     +       984     638
ERCC-00012      ERCC-00012      1       994     +       994     0
ERCC-00013      ERCC-00013      1       808     +       808     0
ERCC-00014      ERCC-00014      1       1957    +       1957    26
ERCC-00016      ERCC-00016      1       844     +       844     0
ERCC-00017      ERCC-00017      1       1136    +       1136    0
ERCC-00019      ERCC-00019      1       644     +       644     6
ERCC-00022      ERCC-00022      1       751     +       751     298
ERCC-00024      ERCC-00024      1       536     +       536     0
ERCC-00025      ERCC-00025      1       1994    +       1994    160
ERCC-00028      ERCC-00028      1       1130    +       1130    4
ERCC-00031      ERCC-00031      1       1138    +       1138    0
ERCC-00033      ERCC-00033      1       2022    +       2022    0
ERCC-00034      ERCC-00034      1       1019    +       1019    16
ERCC-00035      ERCC-00035      1       1130    +       1130    94

To find the sequences with most hits, we can sort by column 7.

cat single_counts.txt | sort -rn -k 7 | head

where...

-r is to sort in reverse order (highest to lowest)
-n is a numeric sort
-k specifies which field (7th column)

ERCC-00002  ERCC-00002  1   1061    +   1061    37892
ERCC-00074  ERCC-00074  1   522 +   522 20982
ERCC-00096  ERCC-00096  1   1107    +   1107    20708
ERCC-00130  ERCC-00130  1   1059    +   1059    8088
ERCC-00113  ERCC-00113  1   840 +   840 7096
ERCC-00046  ERCC-00046  1   522 +   522 3154
ERCC-00003  ERCC-00003  1   1023    +   1023    2904
ERCC-00171  ERCC-00171  1   505 +   505 1836
ERCC-00043  ERCC-00043  1   1023    +   1023    1714
ERCC-00111  ERCC-00111  1   994 +   994 1046

2. Quantification - Estimate abundance for all samples

By using the metacharacter asterisk "*" we can run feature counts on all the HBR and UHR samples in one command line.

featureCounts -a refs/ERCC92.gtf -g gene_name -o counts.txt bam/HBR*.bam bam/UHR*.bam

and check the results file. Now we can see the read counts for each of the samples.

less counts.txt
# Program:featureCounts v2.0.0; Command:"featureCounts" "-a" "refs/ERCC92.gtf" "-g" "gene_name" "-o" "counts.txt" "bam
/HBR_1.bam" "bam/HBR_2.bam" "bam/HBR_3.bam" "bam/UHR_1.bam" "bam/UHR_2.bam" "bam/UHR_3.bam" 
Geneid  Chr     Start   End     Strand  Length  bam/HBR_1.bam   bam/HBR_2.bam   bam/HBR_3.bam   bam/UHR_1.bam   bam/UH
R_2.bam   bam/UHR_3.bam
ERCC-00002      ERCC-00002      1       1061    +       1061    37892   47258   42234   39986   25978   33998
ERCC-00003      ERCC-00003      1       1023    +       1023    2904    3170    3038    3488    2202    2680
ERCC-00004      ERCC-00004      1       523     +       523     910     1078    996     9200    6678    7396
ERCC-00009      ERCC-00009      1       984     +       984     638     778     708     1384    954     1108
ERCC-00012      ERCC-00012      1       994     +       994     0       0       0       2       0       0
ERCC-00013      ERCC-00013      1       808     +       808     0       0       0       4       4       0
ERCC-00014      ERCC-00014      1       1957    +       1957    26      20      8       20      4       16
ERCC-00016      ERCC-00016      1       844     +       844     0       0       0       0       0       0
ERCC-00017      ERCC-00017      1       1136    +       1136    0       0       0       0       0       2
ERCC-00019      ERCC-00019      1       644     +       644     6       10      4       26      22      24
ERCC-00022      ERCC-00022      1       751     +       751     298     328     326     402     198     306

To remove the intermediate columns and just have sample counts (keep columns 1, and 7-12) and output to "simple_counts.txt".

cat counts.txt | cut -f 1,7-12 > simple_counts.txt

looks like this:

Geneid  bam/HBR_1.bam   bam/HBR_2.bam   bam/HBR_3.bam   bam/UHR_1.bam   bam/UHR_2.bam   bam/UHR_3.bam
ERCC-00002      37892   47258   42234   39986   25978   33998
ERCC-00003      2904    3170    3038    3488    2202    2680
ERCC-00004      910     1078    996     9200    6678    7396
ERCC-00009      638     778     708     1384    954     1108

3. Differential Expression - Now we can find the differential expression

Retrieve R "helper" scripts developed for Biostars environment.

curl -O http://data.biostarhandbook.com/rnaseq/code/deseq1.r
curl -O http://data.biostarhandbook.com/rnaseq/code/deseq2.r
curl -O http://data.biostarhandbook.com/rnaseq/code/edger.r

Each of these scripts can be run at the command line. Additional information: * DESeq * DESeq2 * edgeR

These scripts are used like this.

cat mydata.txt | Rscript do-something.r > output.txt

where "do-something.r" is deseq1.r, deseq2.r or edgeR.r

like this:

cat simple_counts.txt | Rscript deseq1.r 3x3 > results_deseq1.txt
cat simple_counts.txt | Rscript deseq2.r 3x3 > results_deseq2.txt
cat simple_counts.txt | Rscript edger.r 3x3 > results_edger.txt

A "3x3" design is where the first 3 columns correspond to the first condition, and the second 3 columns correspond to the second condition.

The results will look something like this:

id             baseMean   baseMeanA     baseMeanB   foldChange  log2FoldChange    pval       padj
ERCC-00130      29681        10455        48907        4.67        2.22         1.16e-88    9.10e-87
ERCC-00108        808          264         1352        5.10        2.35         2.40e-62    9.39e-61
ERCC-00136       1898          615         3180        5.16        2.36         2.80e-58    7.30e-57

Typically you get a column (though the naming will depend on the tool you use) for each of the following:

id: Gene or transcript name that the differential expression is computed for,
baseMean: The average normalized value across all samples,
baseMeanA, baseMeanB: The average normalized gene expression for each condition,
foldChange: The ratio baseMeanB/baseMeanA,
log2FoldChange: log2 transform of foldChange. When we apply a 2-based logarithm the values become symmetrical around 0. A log2 fold change of 1 means a doubling of the expression level, a log2 fold change of -1 shows show a halving of the expression level.
pval: The probability that this effect is observed by chance,
padj: The adjusted probability that this effect is observed by chance.

You may use pval if you already selected your target gene before evaluating these results.

You have to use padj in all other cases as this adjusted value corrects for the so-called multiple testing error - it accounts for the many alternatives and their chances of influencing the results that we see.

In order to determine if our results are close to the expected outcomes, we need to compare the values we generated with the data in the spreadsheet of expected values (that we already downloaded).

We need to do some cutting and pasting to join these two data sets together for easier comparison. To use the "cut" command:

cut -f (field, or column number)

cut -d (sets field separator or delimiter, in this case a comma ",")

cat ERCC-datasheet.csv | grep ERCC-00 | cut -f 1,5,6 -d , | sort > table1

then use sed to remove the line breaks in table1.

sed 's/\n/ /' table1 > table1.txt

repeat to create table 2, and then paste

cat results_edger.txt | grep ERCC- | sort | cut -f 1,5,6 > table2
paste table1.txt table2 > compare.txt

What is sed?

The SED command in UNIX stands for stream editor, it is used to perform functions on files like searching, search and replace, insertion or deletion. In the example shown here, it is substituting a space for a line return "\n".

How about paste?

The "paste" command is used to join files horizontally, tab-delimited, to standard output.

You can open this table in MS Excel. It will look something like this. Left columns are expected data, right columns are observed.

Column 2 is expected fold change, column 5 is observed fold change. Columns 3 and 4 are log2FoldChange values.

ERCC-00130  4   2   ERCC-00130  4.677514604 2.225742158
ERCC-00131  4   2   ERCC-00131  3.816822907 1.93237225
ERCC-00134  4   2   ERCC-00134  0   #NAME?
ERCC-00136  4   2   ERCC-00136  5.165989328 2.369044663
ERCC-00137  0.5 -1  ERCC-00137  Inf Inf
ERCC-00138  1   0   ERCC-00138  0.83061159  -0.267754093
ERCC-00142  1   0   ERCC-00142  Inf Inf
ERCC-00143  0.67    -0.58   ERCC-00143  2.691503567 1.428412338
ERCC-00144  4   2   ERCC-00144  2.36278659  1.240489329
ERCC-00145  0.67    -0.58   ERCC-00145  0.811214009 -0.301845528
ERCC-00147  4   2   ERCC-00147  NA  NA

The results are summarized in an Excel table that you too can download ERCC-results.csv. You can open this file with MS Excel.

Screen Shot 2020-09-03 at 12 06 38 PM

In summary, for a fold change of 4 or higher, transcripts expressing over as much as 1000x coverage differences were reliably detected. Note how transcripts with 3000 copies were detected as reliably as transcripts with seven reads, though clearly, the accuracy of fold change detection varies.

Note how for data with insufficient measures we get infinite or NaN (Not a Number) values are indicating a division by zero or other numeric overflows.This page contains content taken directly from the Biostar Handbook by Istvan Albert.

Always remember to start the bioinformatics environment.

conda activate bioinfo

We will be analyzing differential expression of genes on Chr22 from the UHR/HBR dataset that we've worked with before.

Steps: 1. Alignment 2. Quantification 3. Differential Expression 4. Gene Set Enrichment Analysis (GO)

Remember our Chr22 data is in:

rnaseq/reads
rnaseq/refs

1. Let's do the alignment

First we need to create the indices for Chr22.

hisat2-build refs/22.fa refs/22.fa

where "refs/22.fa" is the reference genome, and also the prefix we will use for the index created.

hisat2-build reference_genome prefix_for_indices

Now we will run the splice aligner and create BAM files. We will align all reads at once using the "parallel" command.

cat names.txt | parallel "hisat2 refs/22.fa -1 reads/{}_R1.fq -2 reads/{}_R2.fq | samtools sort > bam/{}.bam"

If we want to look at these bam files on IGV, we also need to create the .bai (indices) files

parallel samtools index ::: *.bam

2. Next we will estimate the abundance for all samples.

featureCounts -a refs/22.gtf -g gene_name -o 22counts.txt bam/HBR*.bam bam/UHR*.bam

Let's check the results file.

less 22counts.txt

We can remove some of the columns so we just have the sample counts.

cat 22counts.txt | cut -f 1,7-12 > 22simple_counts.txt

and look at the results file.

less 22simple_counts.txt

3. Differential Expression

We will use the R helper scripts that we used before.

cat 22simple_counts.txt | Rscript deseq2.r 3x3 > 22results_deseq2.txt

The file 22results_deseq2.txt contains the genes sorted by their adjusted p-values (last column). Imposing a filter to this column, for example selecting genes that have changed at 0.5 level:

cat 22results_deseq2.txt | awk ' $8 < 0.5 { print $0 } ' > diffgenes2.txt
cat diffgenes2.txt | wc -l

What is AWK? (source https://www.geeksforgeeks.org/awk-command-unixlinux-examples/)

A scripting language that can be used for manipulating data and generating reports.

Awk is a utility that enables a programmer to write tiny but effective programs in the form of statements that define text patterns that are to be searched for in each line of a document and the action that is to be taken when a match is found within a line. Awk is mostly used for pattern scanning and processing. It searches one or more files to see if they contain lines that matches with the specified patterns and then performs the associated actions.

awk ' $8 < 0.5 { print $0 } '

$8 represents the eighth column of the data

$0 represents the whole line

0.5 means selecting genes that have changed at 0.5 level

We can also open this file in MS Excel.

4. Gene Set Enrichment Analysis

“Gene set enrichment analysis” refers to the process of discovering the common characteristics potentially present in a list of genes. When these characteristics are GO terms, the process is called “functional enrichment.”

Warning Overall GO enrichment is surprisingly subjective and different methods will produce different results.

Gene set enrichment analysis with the GOrilla web site

Open results.txt (22results_deseq2.txt) in MS Excel.

Order the gene symbols from lowest to highest p-value (adjusted).

Copy just the gene symbols and paste into GOrilla web site.

GOrilla results:

Screen Shot 2020-09-08 at 12 26 06 PM

Screen Shot 2020-09-08 at 12 26 22 PM Choose to visualize the output in REViGO. If the website does not produce an image (Scatterplot & Table) you can "Make R script for plotting" and produce a pdf.

Screen Shot 2020-09-08 at 11 03 45 AM

Other choices for GSEA * DAVID * ermineJ * ToppFun

This page contains content directly from The Biostar Handbook.

Always remember to start the bioinformatics environment.

conda activate bioinfo

Pseudoalignment-based methods identify locations in the genome using patterns rather than via alignment type algorithms. It operates much faster than alignments so you can analyze massive datasets with less resources.

Kallisto and Salmon are software packages that can quantify transcript abundances. We must know the transcriptome of the organism to perform classification based RNA-Seq.

Classification-based alignment with kallisto

First we need to install kallisto into our bioinformatics environment.

conda install kallisto

To build an index for the ERCC control sequences...

kallisto index -i refs/ERCC92.idx refs/ERCC92.fa

[build] loading fasta file refs/ERCC92.fa
[build] k-mer length: 31
[build] warning: clipped off poly-A tail (longer than 10)
        from 91 target sequences
[build] warning: replaced 1 non-ACGUT characters in the input sequence
        with pseudorandom nucleotides
[build] counting k-mers ... done.
[build] building target de Bruijn graph ...  done 
[build] creating equivalence classes ...  done
[build] target de Bruijn graph has 92 contigs and contains 77828 k-mers

Then we can quantify the transcripts.

kallisto quant -i refs/ERCC92.idx -o results reads/HBR_1_R1.fq reads/HBR_1_R2.fq 

[quant] fragment length distribution will be estimated from the data
[index] k-mer length: 31
[index] number of targets: 92
[index] number of k-mers: 77,828
[index] number of equivalence classes: 92
[quant] running in paired-end mode
[quant] will process pair 1: reads/HBR_1_R1.fq
                             reads/HBR_1_R2.fq
[quant] finding pseudoalignments for the reads ... done
[quant] processed 118,571 reads, 55,972 reads pseudoaligned
[quant] estimated average fragment length: 170.53
[   em] quantifying the abundances ... done
[   em] the Expectation-Maximization algorithm ran for 52 rounds

A kallisto run produces three files:

abundance.tsv
abundance.h5
run_info.json

The abundance.tsv is a tab delimited file that, for each transcript lists the target, the length, the corrected effective length, the counts and the TPM value.

target_id       length  eff_length      est_counts      tpm
ERCC-00002      1061    891.47  18946   243187
ERCC-00003      1023    853.47  1452    19467.4
ERCC-00004      523     353.47  455     14729.5
ERCC-00009      984     814.47  319     4481.72
ERCC-00012      994     824.47  0       0
ERCC-00013      808     638.47  0       0
ERCC-00014      1957    1787.47 13      83.2212
ERCC-00016      844     674.47  0       0
ERCC-00017      1136    966.47  0       0
ERCC-00019      644     474.47  3       72.3505
ERCC-00022      751     581.47  149     2932.16
ERCC-00024      536     366.47  0       0
ERCC-00025      1994    1824.47 80      501.745
ERCC-00028      1130    960.47  2       23.8273
ERCC-00031      1138    968.47  0       0
ERCC-00033      2022    1852.47 0       0
ERCC-00034      1019    849.47  8       107.763
ERCC-00035      1130    960.47  47      559.943
ERCC-00039      740     570.47  1       20.0584

We can use the parallel command to quantify all samples at once.

# Make directory for BAM files
mkdir -p results

#Create root names for each file
parallel -j 1 echo {1}_{2} ::: UHR HBR ::: 1 2 3 > names.txt

#Run all samples in parallel
cat names.txt | parallel "kallisto quant -i refs/ERCC92.idx -o results/{} reads/{}_R1.fq reads/{}_R2.fq"

Copy the abundance files into directories named for each sample

cat names.txt | parallel cp results/{}/abundance.tsv results/{}.abundance.tsv

and check to see file names

ls -1 results/*.tsv

which prints

results/HBR_1.abundance.tsv
results/HBR_2.abundance.tsv
results/HBR_3.abundance.tsv
results/UHR_1.abundance.tsv
results/UHR_2.abundance.tsv
results/UHR_3.abundance.tsv

Merge all the counts (abundance) into a single count file, and use it as input to downstream differential expression analysis (deseq, edger).

paste results/H*.tsv results/U*.tsv | cut -f 1,4,9,14,19,24,29 > counts.txt

and run the differential expression

cat counts.txt | Rscript deseq1.r 3x3 > results.txt

You can use this file for downstream analysis as done before

The End.This page contains content taken directly from the Biostars Handbook by Istvan Albert.

Remember to activate the bioinformatics environment.

conda activate bioinfo

Install the statistical packages we will need for the analysis,

curl http://data.biostarhandbook.com/books/rnaseq/code/install-conda.txt | xargs conda install -y

and modify the permissions.

chmod +x ~/bin/*.r

Temporary note for MacOS only. Due to an installation error by conda the wrong version of a package is installed. On MacOS a temporary fix is to run:

conda install -y libopenblas==0.3.7

We have written scripts that make use of several different R based methods. We have devoted a substantial effort to standardizing both the usage and the results produced by these different methods.

To make these scripts universally accessible and simpler to use, we recommend saving them into the ~/bin folder and making them executable. You can achieve that from the command line with:

mkdir -p ~/bin

# Download the scripts into the ~/bin folder.
curl http://data.biostarhandbook.com/books/rnaseq/code/deseq1.r > ~/bin/deseq1.r
curl http://data.biostarhandbook.com/books/rnaseq/code/deseq2.r > ~/bin/deseq2.r
curl http://data.biostarhandbook.com/books/rnaseq/code/edger.r > ~/bin/edger.r
curl http://data.biostarhandbook.com/books/rnaseq/code/heatmap.r > ~/bin/heatmap.r

# Make them executable
chmod +x ~/bin/*.r

Test that the code works by typing:

deseq1.r

it should print something similar:

Error: The experimental design must be specified as NxM
Execution halted

Run the edger.r script

cat counts.txt | Rscript ~/bin/edger.r 3x3 > output.txt

both deseq1.r and deseq2.r give error messages.

cat norm-matrix-edgeR.txt | Rscript draw-heatmap.r > output.pdf

Open "output.pdf" with a pdf reader such as Acrobat.

It should look something like this.

Screen Shot 2020-07-16 at 10 42 24 AM Double-click IGV icon on desktop.

Genomes -> Load Genome from Server (Human hg18)

File -> Load from Server -> Available Datasets -> Body Map 2.0 (Illumina HiSeq) -> Merged 50 bp and 75 bp -> Reads -> heart, liver

Screen Shot 2020-05-05 at 4 14 57 PM

Screen Shot 2020-05-05 at 4 48 41 PM

Screen Shot 2020-05-05 at 5 01 25 PM