SRA downloads to Biowulf
How to download data from the Sequence Read Archive (NCBI/SRA) to your account on NIH HPC Biowulf
You will need:
- an active, unlocked Biowulf account (hpc.nih.gov)
- an active Globus account for transferring files, OR a file transfer program such as FileZilla (Mac) or WinSCP (Windows PC)
- a program to establish a connection to Biowulf: on Mac, use the "Terminal" program; Windows PC users must download and use the PuTTY tool
- a VPN connection, established before connecting to Biowulf, if you are not on the NIH campus
Okay, let's get started. Open your terminal and connect to Biowulf:
ssh username@biowulf.nih.gov
Move to your data directory:
cd /data/username
Start an interactive session and load the Entrez Direct (edirect) module:
sinteractive
module load edirect
Search the SRA database for the BioProject of interest (here PRJNA578488):
esearch -db sra -query PRJNA578488
<ENTREZ_DIRECT>
<Db>sra</Db>
<WebEnv>NCID_1_14468081_130.14.22.76_9001_1593523761_1281219475_0MetA0_S_MegaStore</WebEnv>
<QueryKey>1</QueryKey>
<Count>67</Count>
<Step>1</Step>
</ENTREZ_DIRECT>
Pipe the search to efetch to save the run information as a CSV file:
esearch -db sra -query PRJNA578488 | efetch -format runinfo > runinfo.csv
To get just the SRR run accessions from this file, you can use the cut command.
cut -f 1 -d',' runinfo.csv | grep SRR > runids.txt
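As a quick sanity check, the number of lines in runids.txt should match the <Count> reported by esearch above (67 for this project):
wc -l runids.txt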
Running SRA toolkit on Biowulf
There are instructions for running SRA-Toolkit on Biowulf at https://hpc.nih.gov/apps/sratoolkit.html.
We will start another interactive node using the "sinteractive" command. You can do this in a second terminal window (on Mac), or exit your first interactive session and start this one in its place. This time we request local scratch space (for temporary files) and CPUs: 20 GB of lscratch and 6 CPUs.
sinteractive --gres=lscratch:20 --cpus-per-task=6
Then load the module and extract FASTQ files for a single run, writing temporary files to the allocated local scratch:
module load sratoolkit
fasterq-dump -p -t /lscratch/$SLURM_JOB_ID --split-files SRR10314042
spots read : 34,743,967
reads read : 69,487,934
reads written : 34,743,967
reads 0-length : 34,743,967
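Optionally, you can spot-check the output: each FASTQ record is four lines, so dividing the line count by four should roughly match the "reads written" figure above (the file name here is illustrative; adjust for your run and for _1/_2 files from paired runs):
echo $(( $(wc -l < SRR10314042.fastq) / 4 ))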
To retrieve all the files at once, you can create a swarm file (nano set.swarm) with one fasterq-dump command per run; a sketch for generating this file automatically follows the swarm command below.
fasterq-dump --split-files SRR10314107
fasterq-dump --split-files SRR10314097
fasterq-dump --split-files SRR10314049
fasterq-dump --split-files SRR10314073
fasterq-dump --split-files SRR10314056
fasterq-dump --split-files SRR10314055
fasterq-dump --split-files SRR10314054
fasterq-dump --split-files SRR11612729
fasterq-dump --split-files SRR10314053
fasterq-dump --split-files SRR10314052
fasterq-dump --split-files SRR10314051
fasterq-dump --split-files SRR10314045
fasterq-dump --split-files SRR10314044
fasterq-dump --split-files SRR10314043
fasterq-dump --split-files SRR10314042
swarm -f set.swarm --module sratoolkit --gres=lscratch:100 -g 4 -t 6
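Rather than typing set.swarm by hand, you can also generate it from the runids.txt file created earlier. A minimal sketch, assuming one accession per line (edit runids.txt down to just the runs you want first); the -t option points temporary files at the local scratch allocated by --gres=lscratch:
while read -r run; do
    echo "fasterq-dump --split-files -t /lscratch/\$SLURM_JOB_ID $run"
done < runids.txt > set.swarm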
Quality control with fastqc/multiqc
Create a fastqc.swarm file with a text editor (nano), and create the output directory first (mkdir output), since FastQC will not make it for you. A sketch for generating this swarm file automatically follows the swarm command below.
fastqc -o output SRR10314042.fastq
fastqc -o output SRR10314049.fastq
fastqc -o output SRR10314054.fastq
fastqc -o output SRR10314097.fastq
fastqc -o output SRR10314043.fastq
fastqc -o output SRR10314051.fastq
fastqc -o output SRR10314055.fastq
fastqc -o output SRR10314107.fastq
fastqc -o output SRR10314044.fastq
fastqc -o output SRR10314052.fastq
fastqc -o output SRR10314056.fastq
fastqc -o output SRR11612729_1.fastq
fastqc -o output SRR10314045.fastq
fastqc -o output SRR10314053.fastq
fastqc -o output SRR10314073.fastq
fastqc -o output SRR11612729_2.fastq
swarm -f fastqc.swarm -g 10 --module fastqc
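The fastqc.swarm file can likewise be generated instead of typed. A minimal sketch, assuming the FASTQ files sit in the current directory (it also creates the output directory, which FastQC will not make for you):
mkdir -p output
for f in *.fastq; do
    echo "fastqc -o output $f"
done > fastqc.swarm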
To view an individual FastQC report, copy it to your local machine with scp (run this from a terminal on your local machine, not on Biowulf):
scp username@biowulf.nih.gov:/pathtofile/file.html .
Run MultiQC as a batch job. Create a file named "multiqc.sh" with a text editor (nano):
#!/bin/bash
module load multiqc || exit 1
multiqc --title 'Dow data multiqc' /data/stonelakeak/projects/C_Stuelten/set_SRR/full_set/two
sbatch --cpus-per-task=2 --mem=4g multiqc.sh
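While the batch job runs, you can check on it with the usual Slurm command:
squeue -u $USER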
To see the MultiQC results, bring the report to your local machine. From a new terminal window on your local machine, use scp:
scp username@biowulf.nih.gov:/pathtofile/multiqc_report.html .
Align with a splice-aware aligner such as STAR.
Get an interactive node:
sinteractive --cpus-per-task=12 --mem=30g --gres=lscratch:20
Load STAR, create an output directory, and point to one of the pre-built genome indices on Biowulf (here mm10 with gene annotation):
module load STAR
mkdir -p bam/rnaseq_STAR
GENOME=/fdb/STAR_current/UCSC/mm10/genes-100
STAR --runThreadN 12 --genomeDir $GENOME --sjdbOverhang 100 --readFilesIn filename.fastq --readFilesCommand cat --outSAMtype BAM SortedByCoordinate --outFileNamePrefix bam/rnaseq_STAR
Jul 16 12:11:02 ..... started STAR run
Jul 16 12:11:02 ..... loading genome
Jul 16 12:11:36 ..... started mapping
Jul 16 12:14:24 ..... finished mapping
Jul 16 12:14:26 ..... started sorting BAM
Jul 16 12:14:26 ..... finished successfully
Things to consider (a combined example command follows this list):
- STAR 2-pass mode (see below)
- --sjdbGTFfile is the path to a file with annotated transcripts in standard GTF format. STAR extracts splice junctions from this file, which improves mapping accuracy. Using annotations is highly recommended whenever they are available.
- --genomeFastaFiles are FASTA files with the genome reference sequences. Why is this not specified in the example? Because it is only needed when generating a genome index (--runMode genomeGenerate); when mapping against a pre-built index under --genomeDir, the sequence is already part of the index.
- --genomeDir points to the genome indices (pre-built on Biowulf under /fdb/STAR_current).
- --sjdbOverhang specifies the length of genomic sequence around the annotated junctions to be used in constructing the splice junctions database. It should be equal to ReadLength-1. In most cases the default value of 100 works as well as the ideal value.
- Chromosome names in the annotations GTF file have to match the chromosome names in the FASTA genome sequence files.
- STAR can also use annotations formatted as a list of splice junction coordinates in a text file: --sjdbFileChrStartEnd /path/to/sjdbFile.txt
- Multiple samples can be mapped in one run with a single output.
- STAR can output alignments directly in binary BAM format, saving the time needed to convert SAM files to BAM. It can also sort BAM files by coordinate.
- The --quantMode TranscriptomeSAM option outputs alignments translated into transcript coordinates in the Aligned.toTranscriptome.out.bam file. These can be used with transcript quantification software that requires reads mapped to the transcriptome, such as RSEM or eXpress.
- With the --quantMode GeneCounts option, STAR counts the number of reads per gene while mapping. This option requires annotations (GTF/GFF via --sjdbGTFfile) supplied at the genome generation step or at the mapping step. STAR writes the read counts per gene to the ReadsPerGene.out.tab file.
- 2-pass mapping: for the most sensitive novel junction discovery, it is recommended to run STAR in 2-pass mode (--twopassMode Basic). This allows more spliced reads mapping to novel junctions to be detected.
- The --varVCFfile option is used to input a VCF file with personal variants. Only SNVs are supported at the moment.
- Splice junctions: SJ.out.tab contains high-confidence collapsed splice junctions in tab-delimited format; output can be filtered with the --outSJfilter options.
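A minimal sketch pulling several of these options together (it reuses the $GENOME variable set above; the GTF path and sample file names are placeholders, not actual Biowulf paths):
STAR --runThreadN 12 \
    --genomeDir $GENOME \
    --sjdbGTFfile /path/to/annotation.gtf \
    --sjdbOverhang 100 \
    --twopassMode Basic \
    --quantMode GeneCounts \
    --readFilesIn sample_1.fastq sample_2.fastq \
    --outSAMtype BAM SortedByCoordinate \
    --outFileNamePrefix bam/sample_
This writes a coordinate-sorted BAM, gene-level counts in bam/sample_ReadsPerGene.out.tab, and splice junctions in bam/sample_SJ.out.tab.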
Check the log files for any error messages:
- Log.out: main log with detailed information about the run
- Log.progress.out: job progress statistics, updated as the run proceeds
- Log.final.out: summary mapping statistics, useful for quality control
The results are sorted by coordinate. Create an index (.bai) with samtools index (load the module first):
module load samtools
samtools index rnaseq_STARAligned.sortedByCoord.out.bam
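A couple of quick optional checks on the sorted, indexed BAM (standard samtools subcommands):
samtools flagstat rnaseq_STARAligned.sortedByCoord.out.bam
samtools idxstats rnaseq_STARAligned.sortedByCoord.out.bam | head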
View with IGV on Biowulf (using NoMachine, or X11 forwarding as shown here).
ssh -Y username@biowulf.nih.gov
cd results_dir
sinteractive --mem=8g --gres=lscratch:5
module load IGV IGVTools
igv -mem 8g
In IGV: File > Load from File..., then select "rnaseq_STARAligned.sortedByCoord.out.bam".
Next...(?)
Condense all fastq into one fastq.gz file and run all with STAR.
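A minimal sketch of one way to do that, assuming single-end reads (the paired run SRR11612729 would need its _1/_2 files handled separately) and reusing the $GENOME variable from above; the output prefix is illustrative:
cat *.fastq | gzip > all.fastq.gz
STAR --runThreadN 12 --genomeDir $GENOME --sjdbOverhang 100 \
    --readFilesIn all.fastq.gz \
    --readFilesCommand zcat \
    --outSAMtype BAM SortedByCoordinate \
    --outFileNamePrefix bam/all_samples_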