25. Classification-based RNA-Seq of control samples

This page contains content directly from The Biostar Handbook.

Always remember to start the bioinformatics environment.

conda activate bioinfo
Pseudoalignment-based methods identify the transcripts a read is compatible with by matching k-mer patterns rather than by computing full alignments. They run much faster than alignment, so you can analyze massive datasets with fewer resources.
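As a toy illustration of the k-mer idea, the snippet below breaks a read into overlapping k-mers, the units a pseudoaligner matches against its transcriptome index. This is a minimal sketch: the read, the awk code, and the small k value are illustrative only and not part of kallisto (which uses k=31 by default).

```shell
# Break a toy read into overlapping k-mers (made-up example data).
K=5             # kallisto defaults to k=31; 5 keeps the demo readable
READ=ACGTACGTAC

# Prints the 6 overlapping 5-mers of the read, one per line.
echo "$READ" | awk -v k="$K" '{
    for (i = 1; i <= length($0) - k + 1; i++)
        print substr($0, i, k)
}'
```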

Kallisto and Salmon are software packages that quantify transcript abundances. We must know the transcriptome of the organism to perform classification-based RNA-Seq.

Classification-based alignment with kallisto

First we need to install kallisto into our bioinformatics environment.

conda install kallisto

To build an index for the ERCC control sequences, run:

kallisto index -i refs/ERCC92.idx refs/ERCC92.fa

[build] loading fasta file refs/ERCC92.fa
[build] k-mer length: 31
[build] warning: clipped off poly-A tail (longer than 10)
        from 91 target sequences
[build] warning: replaced 1 non-ACGUT characters in the input sequence
        with pseudorandom nucleotides
[build] counting k-mers ... done.
[build] building target de Bruijn graph ...  done 
[build] creating equivalence classes ...  done
[build] target de Bruijn graph has 92 contigs and contains 77828 k-mers 

Then we can quantify the transcripts.

kallisto quant -i refs/ERCC92.idx -o results reads/HBR_1_R1.fq reads/HBR_1_R2.fq

[quant] fragment length distribution will be estimated from the data
[index] k-mer length: 31
[index] number of targets: 92
[index] number of k-mers: 77,828
[index] number of equivalence classes: 92
[quant] running in paired-end mode
[quant] will process pair 1: reads/HBR_1_R1.fq
                             reads/HBR_1_R2.fq
[quant] finding pseudoalignments for the reads ... done
[quant] processed 118,571 reads, 55,972 reads pseudoaligned
[quant] estimated average fragment length: 170.53
[   em] quantifying the abundances ... done
[   em] the Expectation-Maximization algorithm ran for 52 rounds

A kallisto run produces three files:

  1. abundance.tsv
  2. abundance.h5
  3. run_info.json
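The run_info.json file records summary statistics for the run. A quick way to pull a field out of it with standard tools is sketched below; the file written here is a mocked-up example using field names kallisto emits (such as n_processed), so check your own run_info.json for the exact contents.

```shell
# Mock run_info.json for illustration; a real one is written by
# kallisto quant into the output directory.
cat > run_info_demo.json <<'EOF'
{
    "n_targets": 92,
    "n_processed": 118571,
    "n_pseudoaligned": 55972
}
EOF

# Extract the number of processed reads: grab the matching line,
# then strip everything that is not a digit.
grep '"n_processed"' run_info_demo.json | tr -dc '0-9'
```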

The abundance.tsv file is tab-delimited; for each transcript it lists the target id, the length, the corrected effective length, the estimated counts, and the TPM value.

target_id       length  eff_length      est_counts      tpm
ERCC-00002      1061    891.47  18946   243187
ERCC-00003      1023    853.47  1452    19467.4
ERCC-00004      523     353.47  455     14729.5
ERCC-00009      984     814.47  319     4481.72
ERCC-00012      994     824.47  0       0
ERCC-00013      808     638.47  0       0
ERCC-00014      1957    1787.47 13      83.2212
ERCC-00016      844     674.47  0       0
ERCC-00017      1136    966.47  0       0
ERCC-00019      644     474.47  3       72.3505
ERCC-00022      751     581.47  149     2932.16
ERCC-00024      536     366.47  0       0
ERCC-00025      1994    1824.47 80      501.745
ERCC-00028      1130    960.47  2       23.8273
ERCC-00031      1138    968.47  0       0
ERCC-00033      2022    1852.47 0       0
ERCC-00034      1019    849.47  8       107.763
ERCC-00035      1130    960.47  47      559.943
ERCC-00039      740     570.47  1       20.0584

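TPM values are normalized within a sample so that they sum to (approximately) one million, and a quick awk pass over an abundance file confirms this. The mini abundance table below is a made-up example in the same five-column format, with TPMs chosen to sum exactly; real files will sum to roughly 1e6 up to rounding.

```shell
# Minimal abundance.tsv in kallisto's format (made-up numbers).
printf 'target_id\tlength\teff_length\test_counts\ttpm\n'  > abundance_demo.tsv
printf 'ERCC-00002\t1061\t891.47\t18946\t600000\n'        >> abundance_demo.tsv
printf 'ERCC-00003\t1023\t853.47\t1452\t300000\n'         >> abundance_demo.tsv
printf 'ERCC-00004\t523\t353.47\t455\t100000\n'           >> abundance_demo.tsv

# Sum the TPM column (column 5), skipping the header line.
awk -F '\t' 'NR > 1 { s += $5 } END { print s }' abundance_demo.tsv
```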
We can use the parallel command to quantify all samples at once.

# Make a directory for the kallisto output
mkdir -p results

# Create root names for each sample
parallel -j 1 echo {1}_{2} ::: UHR HBR ::: 1 2 3 > names.txt

# Run all samples in parallel
cat names.txt | parallel "kallisto quant -i refs/ERCC92.idx -o results/{} reads/{}_R1.fq reads/{}_R2.fq"

Copy each abundance file to a file named after its sample:

cat names.txt | parallel cp results/{}/abundance.tsv results/{}.abundance.tsv
and check the file names:

ls -1 results/*.tsv

which prints:
results/HBR_1.abundance.tsv
results/HBR_2.abundance.tsv
results/HBR_3.abundance.tsv
results/UHR_1.abundance.tsv
results/UHR_2.abundance.tsv
results/UHR_3.abundance.tsv
Merge all the counts (the est_counts column of each abundance file) into a single count file, and use it as input to downstream differential expression analysis (DESeq, edgeR).

paste results/H*.tsv results/U*.tsv | cut -f 1,4,9,14,19,24,29 > counts.txt

and run the differential expression:

cat counts.txt | Rscript deseq1.r 3x3 > results.txt

You can use this file for downstream analysis as done before.
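The field list 1,4,9,14,19,24,29 works because paste lays the five columns of each file side by side, so the est_counts column of sample i lands at field 4 + 5(i-1). A small mock-up with two samples instead of six (so the fields are 1, 4, and 9) shows the pattern; the file names and numbers here are invented for the demo.

```shell
# Two mock abundance files with kallisto's five columns (made-up numbers).
printf 'target_id\tlength\teff_length\test_counts\ttpm\n'  > s1.tsv
printf 'ERCC-00002\t1061\t891.47\t18946\t600000\n'        >> s1.tsv
printf 'target_id\tlength\teff_length\test_counts\ttpm\n'  > s2.tsv
printf 'ERCC-00002\t1061\t891.47\t20110\t640000\n'        >> s2.tsv

# After paste, sample 1's est_counts is field 4 and sample 2's is
# field 4 + 5 = 9; field 1 keeps the transcript id.
paste s1.tsv s2.tsv | cut -f 1,4,9
```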

The End.