Lesson 5: Downloading data from the SRA

Lesson 4 Review:

The majority of computational tasks on Biowulf should be submitted as jobs: sbatch or swarm
the SRA-toolkit can be used to retrieve data from the Sequence Read Archive

Learning Objectives:

Download data from the SRA with fastq-dump
- split files into forward and reverse reads
- download part, not all, of the data
Compare fastq-dump to fasterq-dump
Introduce prefetch
Grab SRA run info and run accession information
Learn to automate data retrieval with the parallel command
Learn about alternative download options

Get Started

Connect to Helix

Helix is the dedicated data management and file transfer node.

ssh username@helix.nih.gov

username = NIH/Biowulf login username. Remember to use the student account username here.

Type in your password at the prompt. The cursor will not move as you type your password!

Congrats! You are now connected to Helix. Your current working directory will be your home directory.

Navigate to your Module_1 directory

cd /data/$USER/Module_1

If you do not have a Module_1 directory, use the following:

cp -R /data/classes/BTEP/B4B_2025/Module_1 /data/$USER
cd /data/$USER/Module_1

Create a lesson directory

Here, we will download data from the NCBI/SRA. Let's create a directory named sra_data to store the data we are using today.

mkdir sra_data
cd sra_data

A brief word about sequence formats

FASTA – commonly used text format for downstream analysis such as sequence similarity searches

head ../sample.fasta

>NC_012920.1 Homo sapiens mitochondrion, complete genome
GATCACAGGTCTATCACCCTATTAACCACTCACGGGAGCTCTCCATGCATTTGGTATTTTCGTCTGGGGG
GTATGCACGCGATAGCATTGCGAGACGCTGGAGCCGGAGCACCCTATGTCGCAGTATCTGTCTTTGATTC
CTGCCTCATCCTATTATTTATCGCACCTACGTTCAATATTACAGGCGAACATACTTACTAAAGTGTGTTA
ATTAATTAATGCTTGTAGGACATAATAATAACAATTGAATGTCTGCACAGCCACTTTCCACACAGACATC
ATAACAAAAAATTTCCACCAAACCCCCCCTCCCCCGCTTCTGGCCACAGCACTTAAACACATCTCTGCCA
AACCCCAAAAACAAAGAACCCTAACACCAGCCTAACCAGATTTCAAATTTTATCTTTTGGCGGTATGCAC
TTTTAACAGTCACCCCCCAACTAACACATTATTTTCCCCTCCCACTCCCATACTACTAATCTCATCAATA
CAACCCCCGCCCATCCTACCCAGCACACACACACCGCTGCTAACCCCATACCCCGAACCAACCAAACCCC
AAAGACACCCCCCACAGTTTATGTAGCTTACCTCCTCAAAGCAATACACTGAAAATGTTTAGACGGGCTC

FASTQ – output format from NGS technologies with quality scores

head ../sample.fastq

@SRR1553607.1 1 length=202
GTTAGCGTTGTTGATCGCGACGCAACAACTGGTAAAGAATCTGGAAGAAGGATATCAGTTCAAACGCTCA
AGCGAGATGATGGATATTTTTGAACGACTCATCATTCGTCACATCCCGTTCAACCTGATTGAGTCGTTCA
AAAATATCCATCATCTCGCTTGAGCGTTTGAACTGATATCCTTCTTCCAGATTCTTTACCAG
+SRR1553607.1 1 length=202
BB@FFFFFHHHHHJJJJJJJJJJJJJJJJJJJGHIJJJJJJJJJHHHHHFFFFFEEEEEEEEEDDDDDDD
DDDDDDDDDDEDDDDEDEEEDDDDDDDDDDDCCCFFFFFHHHHHJJJJJJJJJJJJJJJJJJJJJJJJJJ
JJJJJJJJJJJJJJJIJJJIJJHIJJHHHFFFFFEDDEFEEEDDDDDDDDDDDDDEDDDDDD
@SRR1553607.2 2 length=202
GGTGTAAGCACAGTACTCGGCCCACATCGCCTTTGTGTTAATGAAGTTTGGGTATCAACTTTCATCCCCA
ATCTTCCGTGGAAGGAGTATGTTCCGTCAATCCGTACATGTTAACCCTGGGCCTGTTACTTGGGCCTTCA
CGGGTTCTAATACAAATTGACGGAACATACTCCTTCCACGGAAGATTGGGGATGAAAGTTGA
+SRR1553607.2 2 length=202
?@;BDDDDFHFHFFFGIIGHIIJJGJIGIJIIIIGDGGGHEIGJIIIGIIHJ5@FGHJJIEGGEEHHFFF
FFFEEDEDCBB?CCDDD CDDDDECDCB@>C@@CFFFDFHFHHHJJJJJJIJJJJIJJJGJJIJIJJJJJ
JJJJGHIJJIIIIEIJIJJIJIJGHHFFFFFFEEEDEEDDDDDDBDDDDDDDDDDDDDDDDD
@SRR1553607.3 3 length=202
GGTCATCGGCGGTCTCTGGGTCGCCGGGACCGTTGGAGAAGAAGACTCCGTCCACGCCGAGTGCCTGGAT
CTCCTCGAAGGTGACCGTCGCGGGCAGCACGACCCTGCACCTGCGCGACAAGGGCGCGATGCGGATGGGC
ATCTTCTCCGGACCCGCCGCTGAGGCGAGCGAGGCCGAGCTGCTCGAAACGGTCCGCTCCGC

What is the SRA?

The SRA (Sequence Read Archive) at NCBI is a large, public database of DNA sequencing data. The repository holds "short reads" generated by high-throughput next-generation sequencing, usually less than 1,000 bp.

We will download data from the SRA using the command line package sratoolkit. We will also explore how you can download these data from NCBI via a web browser.

Helpful documentation for the SRA:
1. SRA handbook at NCBI
2. SRA Toolkit Documentation (NCBI)

SRA terms to know

BioProject (Prefix: PRJNA, example: PRJNA257197): provides a description of the research project associated with the sequencing study. It includes information such as the study’s abstract, links to publications (if available), and links to project data, including experiments, PubMed, and GEO datasets.
BioSample (Prefix: SAMN): contains sample-specific details. It could include details such as disease information, sex, response to drug, race, and other relevant attributes.
Sample Accession (Prefix: SRS): A Sample is an object imported from BioSample, which contains metadata describing the physical sample upon which a sequencing experiment was performed.
Experiment (Prefix: SRX): an object that contains metadata describing the library preparation, the sequencing technique, instrument used, and other experiment-specific details.
Run (Prefix: SRR): A Run is an object that contains the actual sequencing data for a particular sequencing experiment. Experiments may have multiple Runs. -- Decoding SRA Accessions: A Primer on Navigating NCBI's Sequence Read Archive

Using `fastq-dump`

fastq-dump and fasterq-dump are modules in the SRA Toolkit and can be used to download FASTQ-formatted data. Both download the data in SRA format and convert it to FASTQ format.

If working on a high performance computing system such as Biowulf, either submit a script to cluster, use an interactive session, or connect to helix. Make sure to do module load sratoolkit prior to issuing either the fastq-dump or fasterq-dump commands. Alternatively, install the SRA Tool kit on local computer (see https://github.com/ncbi/sra-tools/wiki/01.-Downloading-SRA-Toolkit).

module load sratoolkit

fastq-dump SRR1553607

creates the file:

SRR1553607.fastq

Check the file to make sure it is in fastq format. How would you do this? What is FASTQ format?

What are "spots"?

Spots are a legacy term referring to locations on the flow cell for Illumina sequencers. All of the bases for a single location constitute the spot including technical reads (e.g., adapters, primers, barcodes, etc.) and biological reads (forward, reverse). In general, you can think of a spot as you do a read. For more information on spots, see the linked discussion on Biostars. When downloading "spots", always split the spots into the original files using:

--split-files

For paired-end reads, we need to separate the data into two different files like this:

fastq-dump --split-files SRR1553607

which creates

SRR1553607_1.fastq
SRR1553607_2.fastq

Split 3

There is an additional option --split-3 that will split the reads into forward and reverse files and a third file with unmatched reads. Since many bioinformatic programs require matching paired end reads, **this is the preferred option. **

fastq-dump first downloads the data in SRA format, then converts it to FASTQ. If we want to work with a subset of the data, for example the first 10,000 reads, we can use -X:

fastq-dump --split-3 -X 10000 SRR1553607 --outdir SRR1553607_subset

We have additionally included --outdir to save our sub-setted data to a different directory, so that we do not overwrite our previous downloads.

To see other fastq-dump options, use

fastq-dump --help

Other options of interest:
--gzip or bzip2 allows you to compress the reads
--skip-technical allows you to only download biological reads

To generate an XML report on the data that shows us the "spot" or read count, and the count of bases "base_count", including the size of the data file:

which yields this data in XML format:

where we can see the spot count, the base count, and the number of bases corresponding to quality scores.

Side bar: (XML) is Extensible Markup Language, a document format that is both human and machine-readable. id="__codelineno-16-1" name="__codelineno-16-1">sra-stat --xml --quick SRR1553607 id="__codelineno-17-1" name="__codelineno-17-1"><Run accession="SRR1553607" spot_count="203445" base_count="41095890" base_count_bio="41095890" cmp_base_count="41095890"> <Member member_name="A" spot_count="203445" base_count="41095890" base_count_bio="41095890" cmp_base_count="41095890" library="NM042.2_r1" sample="NM042.2_r1"/> <Size value="25452196" units="bytes"/> <AlignInfo> </AlignInfo> <QualityCount> <Quality value="2" count="3523771"/> <Quality value="5" count="40006"/> <Quality value="6" count="30160"/> <Quality value="7" count="79961"/> <Quality value="8" count="80987"/> <Quality value="9" count="21904"/> <Quality value="10" count="63700"/> <Quality value="11" count="23695"/> <Quality value="12" count="42479"/> <Quality value="13" count="24910"/> <Quality value="14" count="22036"/> <Quality value="15" count="135747"/> <Quality value="16" count="56174"/> <Quality value="17" count="81514"/> <Quality value="18" count="123393"/> <Quality value="19" count="79256"/> <Quality value="20" count="269807"/> <Quality value="21" count="68295"/> <Quality value="22" count="125196"/> <Quality value="23" count="267733"/> <Quality value="24" count="287688"/> <Quality value="25" count="296721"/> <Quality value="26" count="338062"/> <Quality value="27" count="507890"/> <Quality value="28" count="226423"/> <Quality value="29" count="569612"/> <Quality value="30" count="813014"/> <Quality value="31" count="1309505"/> <Quality value="32" count="812951"/> <Quality value="33" count="2398417"/> <Quality value="34" count="2324453"/> <Quality value="35" count="10399039"/> <Quality value="36" count="1250313"/> <Quality value="37" count="2681177"/> <Quality value="38" count="1210457"/> <Quality value="39" count="2744683"/> <Quality value="40" count="2268539"/> <Quality value="41" count="5496222"/> </QualityCount> <Databases> <Database> <Table name="SEQUENCE"> <Statistics source="meta"> <Rows count="203445"/> <Elements count="41095890"/> </Statistics> </Table> </Database> </Databases> </Run> id="__codelineno-18-1" name="__codelineno-18-1">spot_count="203445" base_count="41095890" XML aims for: "simplicity, generality and usability across the Internet".

Using seqkit stat

This tool cannot be used on Helix.

We can get similar information from our downloaded files using a program called seqkit. If you get a chance, grab an interactive session on Biowulf and check out seqkit.

module load seqkit

First, let's see what options are available?

seqkit --help

We can see that seqkit stats provides "simple statistics of FASTA/Q files". Let's try it out.

seqkit stat SRR1553607_1.fastq

`fasterq-dump`

We have already seen fasterq-dump in action (See Lesson 4). fasterq-dump is faster than fastq-dump because it uses multi-threading (default --threads 6). --split-3 and --skip-technical are defaults with fasterq-dump, so you do not need to worry about specifying how you want the files to be split.

However, you can not grab a subset of the file like you can with fastq-dump nor can you compress the file during download, so fasterq-dump is not necessarily a replacement for fastq-dump.

Using `prefetch`

Both fastq-dump and fasterq-dump are faster when following prefetch, and fasterq-dump paired with prefetch is the fastest way to pull the files from the SRA. There are alternative options, which we will mention later.

prefetch will first download all of the necessary files to your working directory. Runs are downloaded in the SRA format (compressed).

mkdir prefetch
cd prefetch
prefetch SRR1553607 
ls -l

fasterq-dump will then convert the files to the fastq format.

fasterq-dump  -p   -t /scratch/$USER  SRR1553607

What does -p do? How can we find out?

Note

We are not on a computational node, so we will use scratch instead of lscratch.

Navigating NCBI SRA

Where can we get an accession list or run information for a particular project?

Navigating the NCBI website can be challenging. From a user perspective, nothing is as straight forward as you would expect. For the sequence read archive (SRA), there are fortunately some options. There are the convuluted e-utilites, which can grab run information without navigating to the NCBI website, or there is the SRA Run Selector. The Run Selector is nice because you can filter the results for a given BioProject and obtain only the accessions that interest you in a semi user-friendly format. We can search directly from the SRA Run Selector, or simply go to the main NCBI website and begin our search from there. Let's see the latter example, as this is likely the primary way you will attempt to find data in the future.

Step 1: Start from the NCBI homepage. Type the BioProject ID (PRJNA257197) into the search field.

Step 1

Step 2: In the search results, select the entries next to the SRA.

Step 2

Step 3: From the SRA results, select "Send results to Run Selector".

Step 3

Step 4: From there, you can download the metadata or accession list to your local computer. You could also download the fastq files directly to local.

Copy and paste the accession list into a file using nano. Save to runaccessions.txt. Now use head to grab the first few results.

First, let's create a new directory to work in.

mkdir run_selector
cd run_selector
nano runaccessions.txt

head -n 3 runaccessions.txt > runids.txt

E-utilities

Using e-utilities to programmatically grab the run info

esearch and efetch, can be used to query and pull information from Entrez. They aren't the easiest to understand or use, but you can use them to get the same run info that we grabbed using the Run Selector with the following one liner:

module load edirect
esearch -db sra -query PRJNA257197 | efetch -format runinfo > runinfo.csv

Here we provide the BioProject ID to esearch, which queries the SRA. That info can be piped to efetch, which will pull down the run info in comma separated format, which we can save to file using >.

Then we can use a combo of cat, cut, grep, and head to grab only the accession info we are interested in:

cat runinfo.csv | cut -f 1 -d ',' | grep SRR | head > runids.txt

Here, we opened the file with cat, piped it to the cut command, piped it to grep for only the accession IDs (skipping the column header), and get only the first few results.

So what is cut? It is a unix command, that is used to cut out selected portions of the file (man cut for more info). The -f option specifies which field to cut (the first field), and -d tells the program which delimiter to use. We know this is a comma-separate value file, so we do -d ',' to specify the comma.

Using the "parallel" tool for automating downloads

Now that we have accession numbers to work with, let's use parallel and fastq-dump to download the data.

GNU parallel is a shell tool for executing jobs in parallel using one or more computers. It can serve as a nice replacement of the for loop; though, its behavior is different. Check out this post from the Biostars forum, Tool: Gnu Parallel - Parallelize Serial Command Line Programs Without Changing Them, and this from OMGenomics Lab, Using GNU parallel painlessly.

Let's see how this works.

mkdir parallel
cd parallel
cat ../runids.txt | parallel -j 1 fastq-dump -X 10000 --split-files {}

This gives us six files, two files for each run and 3 runs. It takes the place of the multiple fastq-dump commands we would need to use.

What's up with the curly braces {}?

The curly braces are default behavior for parallel. This is the default replacement string; when empty, it appends inputs. So the following gets the same result.

cat ../runids.txt | parallel fastq-dump -X 10000 --split-files

Using `swarm`

We can also take advantage of swarm for running jobs in parallel, as we did in the previous lesson.

Note

This should be run from Biowulf, not Helix.

cd ..
mkdir swarm
cd swarm

We can generate a script using:

cat ../runids.txt | while read file; do
    echo prefetch $file >> download.sh
    echo 'fasterq-dump -t /lscratch/$SLURM_JOB_ID' $file >> download.sh
done

Edit to include #SBATCH directives.

nano download.sh

At the top of the file add the following:

#SWARM --threads-per-process 6
#SWARM --gb-per-process 4  
#SWARM --gres=lscratch:10 
#SWARM --module sratoolkit 
#SWARM --partition=student

Submit the script.

swarm -f download.sh

European Nucleotide Archive (ENA)

Everything in the SRA is also in the ENA (See PRJNA257197 on ENA). Files in the ENA share the same naming convention as the SRA but are stored directly as gzipped fastq files and bam files. You can easily use wget or curl to pull files directly from the ENA without using the sratoolkit. Check out the ENA training modules for more information on downloading data from the ENA, including examples.

Using curl and wget

The curl command is used to retrieve data from web sites. A similar command is wget. The Unix system you are working with may have either curl or wget installed. To see which is active on your system, just type the command at the command line.

wget
curl

Let's see how we can use wget to grab a couple of files from the ENA.

We use wget followed by the path to the file:

wget -nc ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR155/006/SRR1553426/SRR1553426_2.fastq.gz  
wget -nc ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR155/006/SRR1553426/SRR1553426_1.fastq.gz

The -nc option refers to "no clobber", which will not overwrite existing files.

To see additional options, use

man wget

These files are compressed. To decompress a file use gunzip.

gunzip SRR1553426_2.fastq.gz

In the case of .zip you can use unzip, and with tar.gz, you can use tar -xvf.
When in doubt, use google.

Note

Many programs accept compressed fastq files, so rarely will you need to decompress a fastq file.

There is also a fantastic web service called SRA Explorer that can be used to easily generate code for file downloads from the SRA, if you are looking to download a maximum of 500 samples. Use caution, the SRA-explorer is not associated with the ENA or NCBI, and program support / updates cannot be guaranteed.

Help Session

Practice the skills learned in this lesson here.

Lesson 5: Downloading data from the SRA

Lesson 4 Review:

Learning Objectives:

Get Started

Connect to Helix

Navigate to your Module_1 directory

Create a lesson directory

A brief word about sequence formats

What is the SRA?

SRA terms to know

Using fastq-dump

fasterq-dump

Using prefetch

Navigating NCBI SRA

Where can we get an accession list or run information for a particular project?

E-utilities

Using the "parallel" tool for automating downloads

Using swarm

European Nucleotide Archive (ENA)

Help Session

Using `fastq-dump`

`fasterq-dump`

Using `prefetch`

Using `swarm`