24. RNA Seq and Gene Enrichment Analysis copy
This page contains content taken directly from the Biostar Handbook by Istvan Albert.
Always remember to start the bioinformatics environment.
conda activate bioinfo
We will be analyzing differential expression of genes on Chr22 from the UHR/HBR dataset that we've worked with before.
Steps: 1. Alignment 2. Quantification 3. Differential Expression 4. Gene Set Enrichment Analysis (GO)
Remember our Chr22 data is in:
rnaseq/reads
rnaseq/refs
1. Let's do the alignment
First we need to create the indices for Chr22.
hisat2-build refs/22.fa refs/22.fa
hisat2-build reference_genome prefix_for_indices
Now we will run the splice aligner and create BAM files. We will align all reads at once using the "parallel" command.
cat names.txt | parallel "hisat2 refs/22.fa -1 reads/{}_R1.fq -2 reads/{}_R2.fq | samtools sort > bam/{}.bam"
parallel samtools index ::: *.bam

2. Next we will estimate the abundance for all samples.
featureCounts -a refs/22.gtf -g gene_name -o 22counts.txt bam/HBR*.bam bam/UHR*.bam
less 22counts.txt
cat 22counts.txt | cut -f 1,7-12 > 22simple_counts.txt
less 22simple_counts.txt
3. Differential Expression
We will use the R helper scripts that we used before.
cat 22simple_counts.txt | Rscript deseq2.r 3x3 > 22results_deseq2.txt
cat 22results_deseq2.txt | awk ' $8 < 0.5 { print $0 } ' > diffgenes2.txt
cat diffgenes2.txt | wc -l
What is AWK? (source https://www.geeksforgeeks.org/awk-command-unixlinux-examples/)
A scripting language that can be used for manipulating data and generating reports.
Awk is a utility that enables a programmer to write tiny but effective programs in the form of statements that define text patterns that are to be searched for in each line of a document and the action that is to be taken when a match is found within a line. Awk is mostly used for pattern scanning and processing. It searches one or more files to see if they contain lines that matches with the specified patterns and then performs the associated actions.
awk ' $8 < 0.5 { print $0 } '
$0 represents the whole line
0.5 means selecting genes that have changed at 0.5 level
We can also open this file in MS Excel.
4. Gene Set Enrichment Analysis
“Gene set enrichment analysis” refers to the process of discovering the common characteristics potentially present in a list of genes. When these characteristics are GO terms, the process is called “functional enrichment.”
Warning Overall GO enrichment is surprisingly subjective and different methods will produce different results.
Gene set enrichment analysis with the GOrilla web site
Open results.txt (22results_deseq2.txt) in MS Excel.
Order the gene symbols from lowest to highest p-value (adjusted).
Copy just the gene symbols and paste into GOrilla web site.
GOrilla results:
Choose to visualize the output in REViGO. If the website does not produce an image (Scatterplot & Table) you can "Make R script for plotting" and produce a pdf.