
SRA downloads to Biowulf

How to download data from the Sequence Read Archive (NCBI/SRA) to your account on NIH HPC Biowulf

You will need:

  1. An active, unlocked Biowulf account (hpc.nih.gov).
  2. An active Globus account for transferring files, OR a file transfer program such as FileZilla (Mac) or WinSCP (Windows PC).
  3. A program to establish a connection to Biowulf. On Mac, use the built-in "Terminal" program. Windows PC users should download and use the PuTTY tool.
  4. A VPN connection, established before connecting to Biowulf, if you are not on the NIH campus.

Okay, let's get started.

ssh username@biowulf.nih.gov
You'll automatically be placed in your home directory (/home/username), but you need to move to your data directory (/data/username), which is much larger and is used for downloading and analyzing data.

move to your data directory

cd /data/username
and start an interactive node (we can't run work on the Biowulf login node).
sinteractive
Let's get some information about the data you want to download. We'll need to load the NCBI edirect module first.
module load edirect
and then we can query for info on this project.
esearch -db sra -query PRJNA578488
How many sequencing runs are there?
<ENTREZ_DIRECT>
  <Db>sra</Db>
  <WebEnv>NCID_1_14468081_130.14.22.76_9001_1593523761_1281219475_0MetA0_S_MegaStore</WebEnv>
  <QueryKey>1</QueryKey>
  <Count>67</Count>
  <Step>1</Step>
</ENTREZ_DIRECT>
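
The <Count> element shows there are 67 runs in this project. To pull that number out programmatically, a minimal sketch using edirect's xtract tool (installed with the same module):

# print just the run count from the ENTREZ_DIRECT summary
esearch -db sra -query PRJNA578488 | xtract -pattern ENTREZ_DIRECT -element Count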
Let's download the run information for this project into a .csv (comma-separated values) file.

esearch -db sra -query PRJNA578488 | efetch -format runinfo > runinfo.csv

To get just the SRR numbers from this file, you can use the cut command. (Note that head keeps only the first 10 accessions; drop it from the pipeline to capture every run.)

cat runinfo.csv | cut -f 1 -d',' | grep SRR | head > runids.txt
If needed, use Globus/WinSCP/Filezilla/scp to move the file to your local machine and open it with MS Excel.
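
It can also help to know which runs are paired-end before deciding how much --split-files matters later. A minimal sketch, assuming runinfo.csv uses the standard SRA column names Run and LibraryLayout and that no fields contain embedded commas:

# look up the Run and LibraryLayout columns by header name instead of position
awk -F',' 'NR==1 {for (i=1; i<=NF; i++) {if ($i=="Run") r=i; if ($i=="LibraryLayout") l=i}; next}
           $r ~ /^SRR/ {print $r, $l}' runinfo.csv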


Running SRA toolkit on Biowulf

There are instructions for running SRA-Toolkit on Biowulf here: https://hpc.nih.gov/apps/sratoolkit.html

To start, we will bring up an interactive node using the sinteractive command. You can do this in another terminal window (on Mac), or you can relinquish your first interactive node and start this one instead. This time we are requesting local scratch space and CPUs for the node.

We will request 20 GB of local scratch space (for temporary files) and 6 CPUs.

sinteractive --gres=lscratch:20 --cpus-per-task=6

then load the module

module load sratoolkit
and run the job, pointing the temporary directory at the local scratch we just allocated
fasterq-dump -p -t /lscratch/$SLURM_JOB_ID --split-files SRR10314042
results are:
spots read      : 34,743,967
reads read      : 69,487,934
reads written   : 34,743,967
reads 0-length  : 34,743,967
Remember to use --split-files to get forward and reverse reads in separate files (for downstream analysis).

To retrieve all of the files at once, you can create a swarm file (nano set.swarm) with one fasterq-dump command per line.

fasterq-dump --split-files SRR10314107
fasterq-dump --split-files SRR10314097
fasterq-dump --split-files SRR10314049
fasterq-dump --split-files SRR10314073
fasterq-dump --split-files SRR10314056
fasterq-dump --split-files SRR10314055
fasterq-dump --split-files SRR10314054
fasterq-dump --split-files SRR11612729
fasterq-dump --split-files SRR10314053
fasterq-dump --split-files SRR10314052
fasterq-dump --split-files SRR10314051
fasterq-dump --split-files SRR10314045
fasterq-dump --split-files SRR10314044
fasterq-dump --split-files SRR10314043
fasterq-dump --split-files SRR10314042
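Rather than typing each line by hand, a short loop can generate the same file. This is only a sketch, assuming runids.txt (from the cut command earlier) holds one SRR accession per line:

# write one fasterq-dump command per accession into set.swarm
while read -r srr; do
    echo "fasterq-dump --split-files $srr"
done < runids.txt > set.swarm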
And run the swarm file like this.
swarm -f set.swarm --module sratoolkit --gres=lscratch:100 -g 4 -t 6

Quality control with fastqc/multiqc

Create a fastqc.swarm file with a text editor (nano).

fastqc -o output SRR10314042.fastq
fastqc -o output SRR10314049.fastq
fastqc -o output SRR10314054.fastq
fastqc -o output SRR10314097.fastq
fastqc -o output SRR10314043.fastq
fastqc -o output SRR10314051.fastq
fastqc -o output SRR10314055.fastq
fastqc -o output SRR10314107.fastq
fastqc -o output SRR10314044.fastq
fastqc -o output SRR10314052.fastq
fastqc -o output SRR10314056.fastq
fastqc -o output SRR11612729_1.fastq
fastqc -o output SRR10314045.fastq
fastqc -o output SRR10314053.fastq
fastqc -o output SRR10314073.fastq
fastqc -o output SRR11612729_2.fastq
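
As with set.swarm, this file can be generated with a loop. A minimal sketch, assuming the FASTQ files sit in the current directory (FastQC will not create the output directory for you):

# create the output directory and write one fastqc command per FASTQ file
mkdir -p output
for f in *.fastq; do
    echo "fastqc -o output $f"
done > fastqc.swarm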
Run at the command line.
swarm -f fastqc.swarm -g 10 --module fastqc
To see the results, bring the FastQC reports to your local machine. Establish an scp connection from a new terminal window on your local computer.
scp username@biowulf.nih.gov:/pathtofile/file.html .

Run MultiQC as a batch job. Create a file named "multiqc.sh" with a text editor (nano).

#!/bin/bash
module load multiqc || exit 1
multiqc --title 'Dow data multiqc' /data/stonelakeak/projects/C_Stuelten/set_SRR/full_set/two
Submit the batch job using the Slurm sbatch command.
sbatch --cpus-per-task=2 --mem=4g multiqc.sh
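
While the job is queued or running, you can check on it with the standard Slurm tools (a small sketch; sbatch prints the job ID at submission):

# list your jobs and their current states
squeue -u $USER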

To see the MultiQC results, bring the report to your local machine. Again, establish an scp connection from a new terminal window on your local computer.

scp username@biowulf.nih.gov:/pathtofile/multiqc_report.html .

Align with a splice-aware aligner such as STAR.

get an interactive node

sinteractive --cpus-per-task=12 --mem=30g --gres=lscratch:20
module load STAR
mkdir -p bam/rnaseq_STAR
GENOME=/fdb/STAR_current/UCSC/mm10/genes-100
and run STAR.
STAR --runThreadN 12 --genomeDir $GENOME --sjdbOverhang 100 --readFilesIn filename.fastq --readFilesCommand cat --outSAMtype BAM SortedByCoordinate --outFileNamePrefix bam/rnaseq_STAR
This will print progress to the screen.
Jul 16 12:11:02 ..... started STAR run
Jul 16 12:11:02 ..... loading genome
Jul 16 12:11:36 ..... started mapping
Jul 16 12:14:24 ..... finished mapping
Jul 16 12:14:26 ..... started sorting BAM
Jul 16 12:14:26 ..... finished successfully
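
If a run was downloaded with --split-files, STAR takes both mates through --readFilesIn. A sketch using the paired run from the fastqc list above (the output prefix is just an example):

# paired-end alignment: pass the forward and reverse FASTQ files together
STAR --runThreadN 12 --genomeDir $GENOME --sjdbOverhang 100 \
     --readFilesIn SRR11612729_1.fastq SRR11612729_2.fastq \
     --readFilesCommand cat --outSAMtype BAM SortedByCoordinate \
     --outFileNamePrefix bam/rnaseq_STAR_SRR11612729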

Things to consider:

STAR 2-pass mode

--sjdbGTFfile is the path to the file with annotated transcripts in standard GTF format. STAR extracts splice junctions from this file, which improves mapping accuracy. Using annotations is highly recommended whenever they are available.

--genomeFastaFiles gives the FASTA files with the genome reference sequences. Why is this not specified in the example? Because the example points --genomeDir at a pre-built index; the FASTA files are only needed when the index is generated.

--genomeDir is the path to the genome indices (pre-built on Biowulf under /fdb).

--sjdbOverhang specifies the length of genomic sequence around the annotated junction to be used in constructing the splice junctions database. It should be equal to ReadLength-1. In most cases, the default value of 100 will work as well as the ideal value.

Chromosome names in the annotations GTF file have to match chromosome names in the FASTA genome sequence files.

STAR can also utilize annotations formatted as a list of splice junctions coordinates in a text file: --sjdbFileChrStartEnd /path/to/sjdbFile.txt

Multiple samples can be mapped in one run with a single output.

STAR can output alignments directly in binary BAM format, thus saving time on converting SAM files to BAM. It can also sort BAM files by coordinates.

The --quantMode TranscriptomeSAM option will output alignments translated into transcript coordinates in the Aligned.toTranscriptome.out.bam file. These can be used with transcript quantification software that requires reads mapped to the transcriptome, such as RSEM or eXpress.

With the --quantMode GeneCounts option, STAR will count the number of reads per gene while mapping. This option requires annotations (GTF/GFF via the --sjdbGTFfile option) supplied at the genome generation step or at the mapping step. STAR outputs the read counts per gene into the ReadsPerGene.out.tab file.

2-pass mapping. For the most sensitive novel junction discovery, it is recommended to run STAR in 2-pass mode, which detects more spliced reads mapping to novel junctions.

The --varVCFfile option is used to input a VCF file with personal variants. Only SNVs are supported at the moment.

Splice junctions. SJ.out.tab contains high-confidence collapsed splice junctions in tab-delimited format; this output can be filtered with the --outSJfilter* options.
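
To see how several of these options fit together, here is a sketch of the earlier mapping command extended with annotations, per-gene counts, and 2-pass mode. The GTF path is hypothetical; substitute an annotation file that matches the mm10 index:

# GTF path below is a placeholder -- point it at the annotation matching your index
GTF=/path/to/mm10_genes.gtf
STAR --runThreadN 12 --genomeDir $GENOME --sjdbGTFfile $GTF --sjdbOverhang 100 \
     --readFilesIn filename.fastq --readFilesCommand cat \
     --quantMode GeneCounts --twopassMode Basic \
     --outSAMtype BAM SortedByCoordinate --outFileNamePrefix bam/rnaseq_STAR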

Check log files for any error messages.

Log.out
Log.progress.out
Log.final.out

The results are sorted by coordinate, so create a BAM index (.bai) with samtools index.
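samtools is provided as a module on Biowulf, so load it before indexing:

module load samtools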

samtools index rnaseq_STARAligned.sortedByCoord.out.bam

View the alignments with IGV (on Biowulf, using NoMachine).

ssh -Y username@biowulf.nih.gov
cd results_dir
sinteractive --mem=8g --gres=lscratch:5
Open a connection to biowulf.nih.gov with NoMachine, then minimize this window and work at the command line.
module load IGV IGVTools
igv -mem 8g
In the IGV GUI, choose Genome "Mouse (mm10)".

In IGV: File > Load from File > "rnaseq_STARAligned.sortedByCoord.out.bam"

Next...(?)

Condense all of the fastq files into one fastq.gz file and run them all with STAR.
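
A sketch of that step, assuming single-end FASTQ files in the current directory and a hypothetical combined file name; STAR can decompress gzipped input on the fly with --readFilesCommand zcat:

# concatenate and compress into a single gzipped FASTQ (hypothetical name)
cat *.fastq | gzip > all_reads.fastq.gz
# align the combined file, letting zcat stream the decompressed reads into STAR
STAR --runThreadN 12 --genomeDir $GENOME --sjdbOverhang 100 \
     --readFilesIn all_reads.fastq.gz --readFilesCommand zcat \
     --outSAMtype BAM SortedByCoordinate --outFileNamePrefix bam/rnaseq_STAR_all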