24. RNA Seq and Gene Enrichment Analysis copy

This page contains content taken directly from the Biostar Handbook by Istvan Albert.

Always remember to start the bioinformatics environment.

conda activate bioinfo

We will be analyzing differential expression of genes on Chr22 from the UHR/HBR dataset that we've worked with before.

Steps: 1. Alignment 2. Quantification 3. Differential Expression 4. Gene Set Enrichment Analysis (GO)

Remember our Chr22 data is in:

rnaseq/reads
rnaseq/refs

1. Let's do the alignment

First we need to create the indices for Chr22.

hisat2-build refs/22.fa refs/22.fa

where "refs/22.fa" is the reference genome, and also the prefix we will use for the index created.

hisat2-build reference_genome prefix_for_indices

Now we will run the splice aligner and create BAM files. We will align all reads at once using the "parallel" command.

cat names.txt | parallel "hisat2 refs/22.fa -1 reads/{}_R1.fq -2 reads/{}_R2.fq | samtools sort > bam/{}.bam"

If we want to look at these bam files on IGV, we also need to create the .bai (indices) files

parallel samtools index ::: *.bam

2. Next we will estimate the abundance for all samples.

featureCounts -a refs/22.gtf -g gene_name -o 22counts.txt bam/HBR*.bam bam/UHR*.bam

Let's check the results file.

less 22counts.txt

We can remove some of the columns so we just have the sample counts.

cat 22counts.txt | cut -f 1,7-12 > 22simple_counts.txt

and look at the results file.

less 22simple_counts.txt

3. Differential Expression

We will use the R helper scripts that we used before.

cat 22simple_counts.txt | Rscript deseq2.r 3x3 > 22results_deseq2.txt

The file 22results_deseq2.txt contains the genes sorted by their adjusted p-values (last column). Imposing a filter to this column, for example selecting genes that have changed at 0.5 level:

cat 22results_deseq2.txt | awk ' $8 < 0.5 { print $0 } ' > diffgenes2.txt
cat diffgenes2.txt | wc -l

What is AWK? (source https://www.geeksforgeeks.org/awk-command-unixlinux-examples/)

A scripting language that can be used for manipulating data and generating reports.

Awk is a utility that enables a programmer to write tiny but effective programs in the form of statements that define text patterns that are to be searched for in each line of a document and the action that is to be taken when a match is found within a line. Awk is mostly used for pattern scanning and processing. It searches one or more files to see if they contain lines that matches with the specified patterns and then performs the associated actions.

awk ' $8 < 0.5 { print $0 } '

$8 represents the eighth column of the data

$0 represents the whole line

0.5 means selecting genes that have changed at 0.5 level

We can also open this file in MS Excel.

4. Gene Set Enrichment Analysis

“Gene set enrichment analysis” refers to the process of discovering the common characteristics potentially present in a list of genes. When these characteristics are GO terms, the process is called “functional enrichment.”

Warning Overall GO enrichment is surprisingly subjective and different methods will produce different results.

Gene set enrichment analysis with the GOrilla web site

Open results.txt (22results_deseq2.txt) in MS Excel.

Order the gene symbols from lowest to highest p-value (adjusted).

Copy just the gene symbols and paste into GOrilla web site.

GOrilla results:

Screen Shot 2020-09-08 at 12 26 06 PM

Screen Shot 2020-09-08 at 12 26 22 PM Choose to visualize the output in REViGO. If the website does not produce an image (Scatterplot & Table) you can "Make R script for plotting" and produce a pdf.

Screen Shot 2020-09-08 at 11 03 45 AM

Other choices for GSEA * DAVID * ermineJ * ToppFun