25. Classification based RNA Seq of control samples copy
This page contains content directly from The Biostar Handbook.
Always remember to start the bioinformatics environment.
conda activate bioinfo
Kallisto and Salmon are software packages that can quantify transcript abundances. We must know the transcriptome of the organism to perform classification based RNA-Seq.
Classification-based alignment with kallisto
First we need to install kallisto into our bioinformatics environment.
conda install kallisto
To build an index for the ERCC control sequences...
kallisto index -i refs/ERCC92.idx refs/ERCC92.fa
[build] loading fasta file refs/ERCC92.fa
[build] k-mer length: 31
[build] warning: clipped off poly-A tail (longer than 10)
from 91 target sequences
[build] warning: replaced 1 non-ACGUT characters in the input sequence
with pseudorandom nucleotides
[build] counting k-mers ... done.
[build] building target de Bruijn graph ... done
[build] creating equivalence classes ... done
[build] target de Bruijn graph has 92 contigs and contains 77828 k-mers
kallisto quant -i refs/ERCC92.idx -o results reads/HBR_1_R1.fq reads/HBR_1_R2.fq
[quant] fragment length distribution will be estimated from the data
[index] k-mer length: 31
[index] number of targets: 92
[index] number of k-mers: 77,828
[index] number of equivalence classes: 92
[quant] running in paired-end mode
[quant] will process pair 1: reads/HBR_1_R1.fq
reads/HBR_1_R2.fq
[quant] finding pseudoalignments for the reads ... done
[quant] processed 118,571 reads, 55,972 reads pseudoaligned
[quant] estimated average fragment length: 170.53
[ em] quantifying the abundances ... done
[ em] the Expectation-Maximization algorithm ran for 52 rounds
- abundance.tsv
- abundance.h5
- run_info.json
The abundance.tsv is a tab delimited file that, for each transcript lists the target, the length, the corrected effective length, the counts and the TPM value.
target_id length eff_length est_counts tpm
ERCC-00002 1061 891.47 18946 243187
ERCC-00003 1023 853.47 1452 19467.4
ERCC-00004 523 353.47 455 14729.5
ERCC-00009 984 814.47 319 4481.72
ERCC-00012 994 824.47 0 0
ERCC-00013 808 638.47 0 0
ERCC-00014 1957 1787.47 13 83.2212
ERCC-00016 844 674.47 0 0
ERCC-00017 1136 966.47 0 0
ERCC-00019 644 474.47 3 72.3505
ERCC-00022 751 581.47 149 2932.16
ERCC-00024 536 366.47 0 0
ERCC-00025 1994 1824.47 80 501.745
ERCC-00028 1130 960.47 2 23.8273
ERCC-00031 1138 968.47 0 0
ERCC-00033 2022 1852.47 0 0
ERCC-00034 1019 849.47 8 107.763
ERCC-00035 1130 960.47 47 559.943
ERCC-00039 740 570.47 1 20.0584
We can use the parallel command to quantify all samples at once.
# Make directory for BAM files
mkdir -p results
#Create root names for each file
parallel -j 1 echo {1}_{2} ::: UHR HBR ::: 1 2 3 > names.txt
#Run all samples in parallel
cat names.txt | parallel "kallisto quant -i refs/ERCC92.idx -o results/{} reads/{}_R1.fq reads/{}_R2.fq"
Copy the abundance files into directories named for each sample
cat names.txt | parallel cp results/{}/abundance.tsv results/{}.abundance.tsv
ls -1 results/*.tsv
results/HBR_1.abundance.tsv
results/HBR_2.abundance.tsv
results/HBR_3.abundance.tsv
results/UHR_1.abundance.tsv
results/UHR_2.abundance.tsv
results/UHR_3.abundance.tsv
paste results/H*.tsv results/U*.tsv | cut -f 1,4,9,14,19,24,29 > counts.txt
cat counts.txt | Rscript deseq1.r 3x3 > results.txt
The End.