Lesson 6: sra-tools, e-utilities, and parallel
This page uses some content directly from the Biostar Handbook by Istvan Albert.
Lesson 5 Review:
- The majority of computational tasks on Biowulf should be submitted as jobs using `sbatch` or `swarm`.
- The SRA Toolkit can be used to retrieve data from the Sequence Read Archive.
Learning Objectives:
- Download data from the SRA with `fastq-dump`
    - split files into forward and reverse reads
    - download part, not all, of the data
- Compare `fastq-dump` to `fasterq-dump`
- Introduce `prefetch`
- Look at XML-formatted data with `sra-stat`
- Use `e-utilities` (`esearch` and `efetch`) to get "runinfo" csv-formatted data
- Work with csv-formatted data using the `cut` command to isolate columns
- Learn to automate data retrieval with the `parallel` command
- Learn about alternative download options
For the remainder of this course, we will be working on the GOLD teaching environment in DNAnexus.
Create a project directory
Today, we will download data from the NCBI/SRA. Let's create a project directory named `biostar_class`, and a new folder within `biostar_class` named `sra_data`, to store the data we are using today. We will use `biostar_class` for the remaining lessons in this course.
mkdir biostar_class
cd biostar_class
mkdir sra_data
cd sra_data
A brief word about sequence formats
- FASTA – commonly used text format for downstream analysis such as sequence similarity searches
- FASTQ – output format from NGS technologies with quality scores
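To make the difference concrete, here is a minimal made-up record in each format, plus a common way to count reads in a FASTQ file (the sequences below are invented for illustration):

```shell
# Minimal made-up records (not real data).
cat > example.fa <<'EOF'
>seq1 an example FASTA record
ACGTACGTACGT
EOF

cat > example.fq <<'EOF'
@seq1
ACGTACGTACGT
+
IIIIIIIIIIII
EOF

# A FASTQ record is 4 lines (header, sequence, separator, qualities),
# so read count = line count / 4:
echo $(( $(wc -l < example.fq) / 4 ))
```

Note the `+` separator line and the quality string, which is what FASTQ adds on top of FASTA.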
What is the SRA?
The SRA (Sequence Read Archive) at NCBI is a large, public database of DNA sequencing data. The repository holds "short reads" generated by high-throughput next-generation sequencing, usually less than 1,000 bp.
We will download data from the SRA using the command-line package `sratoolkit`. Note that you can also download files directly from NCBI via a web browser.
Helpful documentation for the SRA:
1. SRA handbook at NCBI
2. SRA Toolkit Documentation (NCBI)
SRA terms to know
NCBI BioProject: PRJN#### (example: PRJNA257197) contains the overall description of a single research initiative; a project will typically relate to multiple samples and datasets.
NCBI BioSample: SAMN#### or SRS#### (example: SAMN03254300) describe biological source material; each physically unique specimen is registered as a single BioSample with a unique set of attributes.
SRA Experiment: SRX#### a unique sequencing library for a specific sample
SRA Run: SRR#### or ERR#### (example: SRR1553610) is a manifest of data file(s) linked to a given sequencing library (experiment). --- Biostar handbook (XIII SEQUENCING DATA)
Using fastq-dump
`fastq-dump` and `fasterq-dump` can be used to download FASTQ-formatted data. Both download the data in SRA format and convert it to FASTQ format.
fastq-dump SRR1553607
which creates
SRR1553607.fastq
What are "spots"?
Spots are a legacy term referring to locations on the flow cell for Illumina sequencers. All of the bases for a single location constitute the spot including technical reads (e.g., adapters, primers, barcodes, etc.) and biological reads (forward, reverse). In general, you can think of a spot as you do a read. For more information on spots, see the linked discussion on Biostars. When downloading "spots", always split the spots into the original files using:
--split-files
For paired-end reads, we need to separate the data into two different files like this:
fastq-dump --split-files SRR1553607
which creates
SRR1553607_1.fastq
SRR1553607_2.fastq
NOTE: There is an additional option `--split-3` that will split the reads into forward and reverse files, plus a third file with unmatched reads. Since many bioinformatics programs require matching paired-end reads, this is the preferred option.
`fastq-dump` first downloads the data in SRA format, then converts it to FASTQ. If we want to work with a subset of the data, for example the first 10,000 reads, we can use `-X`:
fastq-dump --split-files -X 10000 SRR1553607 --outdir SRR1553607_subset
We have additionally included `--outdir` to save our subset to a different directory, so that we do not overwrite our previous downloads.
To see other `fastq-dump` options, use
fastq-dump --help
Other options of interest:
- `--gzip` or `--bzip2` compresses the reads during download
- `--skip-technical` downloads only the biological reads
To generate an XML report on the data that shows us the "spot" or read count ("spot_count"), the base count ("base_count"), and the size of the data file:
sra-stat --xml --quick SRR1553607
<Run accession="SRR1553607" spot_count="203445" base_count="41095890" base_count_bio="41095890" cmp_base_count="41095890">
<Member member_name="A" spot_count="203445" base_count="41095890" base_count_bio="41095890" cmp_base_count="41095890" library="NM042.2_r1" sample="NM042.2_r1"/>
<Size value="25452196" units="bytes"/>
<AlignInfo>
</AlignInfo>
<QualityCount>
<Quality value="2" count="3523771"/>
<Quality value="5" count="40006"/>
<Quality value="6" count="30160"/>
<Quality value="7" count="79961"/>
<Quality value="8" count="80987"/>
<Quality value="9" count="21904"/>
<Quality value="10" count="63700"/>
<Quality value="11" count="23695"/>
<Quality value="12" count="42479"/>
<Quality value="13" count="24910"/>
<Quality value="14" count="22036"/>
<Quality value="15" count="135747"/>
<Quality value="16" count="56174"/>
<Quality value="17" count="81514"/>
<Quality value="18" count="123393"/>
<Quality value="19" count="79256"/>
<Quality value="20" count="269807"/>
<Quality value="21" count="68295"/>
<Quality value="22" count="125196"/>
<Quality value="23" count="267733"/>
<Quality value="24" count="287688"/>
<Quality value="25" count="296721"/>
<Quality value="26" count="338062"/>
<Quality value="27" count="507890"/>
<Quality value="28" count="226423"/>
<Quality value="29" count="569612"/>
<Quality value="30" count="813014"/>
<Quality value="31" count="1309505"/>
<Quality value="32" count="812951"/>
<Quality value="33" count="2398417"/>
<Quality value="34" count="2324453"/>
<Quality value="35" count="10399039"/>
<Quality value="36" count="1250313"/>
<Quality value="37" count="2681177"/>
<Quality value="38" count="1210457"/>
<Quality value="39" count="2744683"/>
<Quality value="40" count="2268539"/>
<Quality value="41" count="5496222"/>
</QualityCount>
<Databases>
<Database>
<Table name="SEQUENCE">
<Statistics source="meta">
<Rows count="203445"/>
<Elements count="41095890"/>
</Statistics>
</Table>
</Database>
</Databases>
</Run>
The key values here are spot_count="203445" and base_count="41095890".
Using seqkit stat
We can get similar information from our downloaded files using a program called `seqkit`.
First, let's see what options are available:
seqkit --help
We can see that `seqkit stats` provides "simple statistics of FASTA/Q files". Let's try it out.
seqkit stat SRR1553607_1.fastq
fasterq-dump
We have already seen `fasterq-dump` in action (see Lesson 5). `fasterq-dump` is faster than `fastq-dump` because it uses multi-threading (default `--threads 6`). `--split-3` and `--skip-technical` are defaults with `fasterq-dump`, so you do not need to worry about specifying how you want the files to be split.
However, you cannot grab a subset of the file like you can with `fastq-dump`, nor can you compress the file during download, so `fasterq-dump` is not necessarily a replacement for `fastq-dump`.
Using prefetch
Both `fastq-dump` and `fasterq-dump` are faster when run after `prefetch`, and `fasterq-dump` paired with `prefetch` is the fastest way to pull files from the SRA. There are alternative options, which we will mention later.
`prefetch` will first download all of the necessary files to your working directory. Runs are downloaded in the SRA format (compressed).
mkdir prefetch
cd prefetch
prefetch SRR1553607
ls -l
`fasterq-dump` will then convert the files to the `fastq` format.
fasterq-dump SRR1553607
Automating multiple SRA runs downloads
How can we better automate downloads from the SRA?
For example, what if we want the sequence files from the publication Genomic surveillance elucidates Ebola virus origin and transmission during the 2014 outbreak?
First we need to get the BioProject (PRJNA) accession from the publication. In this case, it is PRJNA257197.
Next, we will query the SRA database using `esearch` from the `e-utilities` package. `esearch` is used to search across Entrez databases.
esearch -db sra -query PRJNA257197
Here, we provided the database (`-db`) and the BioProject ID as the query (`-query`). Here is a list of unique identifiers and databases that work with e-utilities commands.
Our `esearch` query returned the following:
<ENTREZ_DIRECT>
<Db>sra</Db>
<WebEnv>MCID_63091b9b7023a815e47da810</WebEnv>
<QueryKey>1</QueryKey>
<Count>891</Count>
<Step>1</Step>
</ENTREZ_DIRECT>
We then pipe the results to `efetch`, another e-utilities tool, which returns full records (run information) in a specified output format.
esearch -db sra -query PRJNA257197 | efetch -format runinfo > runinfo.csv
In this case we want the `runinfo` format, which returns the information as comma-separated values.
This file can be opened and explored in Excel.
We can modify the .csv file to pull out some info for our next query using the command line.
To extract the first column of the file, we will use `cut`:
cat runinfo.csv | cut -f 1 -d ',' | head
There is a bit happening here. First, we opened the file with `cat`, piped it to the `cut` command, and then piped the result to `head`. So what is `cut`? It is a Unix command, so we can use `man` to get its manual page:
man cut
`cut` is used to cut out selected portions of the file. The `-f` option specifies which field to cut (here, the first field), and `-d` tells the program which delimiter to use. Since this is a comma-separated value file, we use `-d ','` to specify the comma.
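You can experiment with `-f` and `-d` on any small comma-separated file; the accessions and numbers below are made up for illustration:

```shell
# Build a tiny mock CSV (header plus two made-up rows).
cat > mock.csv <<'EOF'
Run,spots,bases
SRR0000001,10000,1000000
SRR0000002,20000,2000000
EOF

cut -f 1 -d ',' mock.csv     # first field only
cut -f 1,3 -d ',' mock.csv   # -f also accepts a comma-separated list of fields
```
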
Running the code above, we get:
Run
SRR1972917
SRR1972918
SRR1972919
SRR1972920
SRR1972921
SRR1972922
SRR1972923
SRR1972924
SRR1972925
Now we can filter for lines that match `SRR` and store the first ten run IDs in a file:
cat runinfo.csv | cut -f 1 -d',' | grep SRR | head > runids.txt
and get
SRR1972917
SRR1972918
SRR1972919
SRR1972920
SRR1972921
SRR1972922
SRR1972923
SRR1972924
SRR1972925
SRR1972926
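The same pipeline can be sanity-checked offline on a small mock `runinfo.csv` before running it on a real download; the accessions below are invented:

```shell
# A mock runinfo.csv: header line plus two made-up runs.
cat > mock_runinfo.csv <<'EOF'
Run,ReleaseDate,spots
SRR0000001,2014-01-01,10000
SRR0000002,2014-01-02,20000
EOF

# grep SRR drops the header line, leaving only run accessions.
cat mock_runinfo.csv | cut -f 1 -d',' | grep SRR | head > mock_runids.txt
cat mock_runids.txt
```
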
If we want to retrieve files for each SRR accession, we can use the `fastq-dump` command like we did previously.
As an example (DO NOT RUN THIS):
fastq-dump -X 10000 --split-files SRR1972917
fastq-dump -X 10000 --split-files SRR1972918
fastq-dump -X 10000 --split-files SRR1972919
Instead, we can use the `parallel` tool to automate downloads.
Using the "parallel" tool for automating downloads
GNU `parallel` executes commands in "parallel", one for each CPU core on your system. It can serve as a nice replacement for a `for` loop.
See Tool: Gnu Parallel - Parallelize Serial Command Line Programs Without Changing Them
cat runids.txt | parallel fastq-dump -X 10000 --split-files {}
This gives us twenty files: two for each of the ten runs. It takes the place of the multiple `fastq-dump` commands we would otherwise need to type.
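For comparison, the serial equivalent is a shell loop over the run IDs. In this sketch, `echo` stands in for the real `fastq-dump` call (and the accessions are made up) so nothing is actually downloaded:

```shell
# Made-up run IDs standing in for runids.txt.
printf 'SRR0000001\nSRR0000002\n' > demo_runids.txt

# Serial version of the parallel command; remove echo to run for real.
while read -r id; do
  echo fastq-dump -X 10000 --split-files "$id"
done < demo_runids.txt
```

Unlike this loop, `parallel` runs several downloads at once, one per available core.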
What's up with the curly braces `{}`?
The curly braces are the default replacement string for `parallel`: each input line is substituted wherever `{}` appears. If no replacement string is given, `parallel` appends the input to the end of the command, so the following gets the same result.
cat runids.txt | parallel fastq-dump -X 10000 --split-files
The real power of `parallel` comes when using variants of the replacement string such as `{.}`, `{/}`, and `{//}`, like this...
nano filelist
/dir/subdir/file.txt
/other/list.csv
/raw/seqs.fastq
Save the file with Ctrl+O, confirm with Enter, then exit with Ctrl+X.
cat filelist | parallel echo {} "without extension" {.}
/dir/subdir/file.txt without extension /dir/subdir/file
/other/list.csv without extension /other/list
/raw/seqs.fastq without extension /raw/seqs
This removes the extension.
cat filelist | parallel echo {} "without path" {/}
/dir/subdir/file.txt without path file.txt
/other/list.csv without path list.csv
/raw/seqs.fastq without path seqs.fastq
This removes the path.
cat filelist | parallel echo {} "with root" {//}
/dir/subdir/file.txt with root /dir/subdir
/other/list.csv with root /other
/raw/seqs.fastq with root /raw
This keeps only the path.
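If `parallel` is not available, the same transformations exist as ordinary shell operations; this sketch uses a made-up path:

```shell
f=/dir/subdir/file.txt

echo "${f%.*}"   # like {.}  -> strips the extension: /dir/subdir/file
basename "$f"    # like {/}  -> strips the path: file.txt
dirname "$f"     # like {//} -> keeps only the path: /dir/subdir
```
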
European Nucleotide Archive (ENA)
Everything in the SRA is also in the ENA. Files in the ENA share the same naming convention as the SRA but are stored directly as gzipped FASTQ files and BAM files. You can easily use `wget` or `curl` to pull files directly from the ENA without using the `sratoolkit`, which in my experience is faster. More on `wget` and `curl` in the next lesson.
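One reason ENA downloads are convenient is that FASTQ files live under a predictable path derived from the run accession, so a URL can be built with plain string operations. The layout sketched below (first six characters of the accession, then a `00#` subfolder from the last digit for 10-character accessions) is an assumption to verify against the ENA documentation before use:

```shell
# Hypothetical sketch: building an ENA FASTQ URL for a 10-character accession.
# The directory pattern is an assumption; check ENA's docs before relying on it.
acc=SRR1553607
prefix=${acc:0:6}     # first six characters: SRR155
sub=00${acc: -1}      # last digit, zero-padded: 007
echo "https://ftp.sra.ebi.ac.uk/vol1/fastq/${prefix}/${sub}/${acc}/${acc}_1.fastq.gz"
```

The resulting URL could then be passed to `wget` or `curl`.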
There is also a fantastic web service called SRA Explorer that can be used to easily generate code for file downloads from the SRA. However, SRA Explorer is not associated with the ENA or NCBI, and program support / updates cannot be guaranteed.
Help Session
Let's search for and download some data.